[PATCH RFC 00/10] KFENCE: A low-overhead sampling-based memory safety error detector

Marco Elver

Sep 7, 2020, 9:41:11 AM

to el...@google.com, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors. This
series enables KFENCE for the x86 and arm64 architectures, and adds
KFENCE hooks to the SLAB and SLUB allocators.

KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades precision
for performance. The main motivation behind KFENCE's design is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is to deploy the tool across a large fleet of
machines.

KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error.
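
As an illustration of the layout described above, here is a minimal
userspace sketch of the pool arithmetic. It follows the index calculation
in addr_to_metadata() and the placement logic in kfence_guarded_alloc()
from patch 01, but works on byte offsets into the pool rather than kernel
addresses. All sketch_* names, the 4 KiB page size and the object count of
255 are assumptions made for the example; this is not the code from the
series.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096UL /* assumed page size */
#define SKETCH_NUM_OBJECTS 255UL  /* assumed CONFIG_KFENCE_NUM_OBJECTS */

/* Pool size in bytes: two pages per object plus two leading pages. */
static unsigned long sketch_pool_size(void)
{
	return (SKETCH_NUM_OBJECTS + 1) * 2 * SKETCH_PAGE_SIZE;
}

/* Byte offset of the object page that belongs to metadata slot @index. */
static unsigned long sketch_object_page_offset(unsigned long index)
{
	assert(index < SKETCH_NUM_OBJECTS);
	return (index + 1) * 2 * SKETCH_PAGE_SIZE;
}

/* Map a pool offset to its metadata index, or -1 if there is none. */
static long sketch_offset_to_index(unsigned long offset)
{
	long index;

	if (offset >= sketch_pool_size())
		return -1;

	index = (long)(offset / (SKETCH_PAGE_SIZE * 2)) - 1;
	if (index < 0 || index >= (long)SKETCH_NUM_OBJECTS)
		return -1;

	return index;
}

/* Page 0 and every odd page are guard pages; other even pages hold objects. */
static int sketch_is_guard_page(unsigned long offset)
{
	unsigned long page = offset / SKETCH_PAGE_SIZE;

	return page == 0 || (page % 2) == 1;
}

/*
 * Offset of an object of @size bytes in slot @index, placed at the left
 * or right page boundary and aligned down to @align (a power of two).
 */
static unsigned long sketch_object_offset(unsigned long index, size_t size,
					  size_t align, int right)
{
	unsigned long off = sketch_object_page_offset(index);

	assert(size > 0 && size <= SKETCH_PAGE_SIZE);
	assert(align > 0 && (align & (align - 1)) == 0);

	if (right)
		off += SKETCH_PAGE_SIZE - size;

	return off & ~((unsigned long)align - 1);
}
```

Under these assumptions the pool spans 512 pages (2 MiB). A 100-byte
object placed on the right of slot 0 with 8-byte alignment starts 104
bytes before the following guard page, so the last 4 bytes of the page
are not covered by the guard page itself.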

Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval,
the next allocation through the main allocator (SLAB or SLUB) returns a
guarded allocation from the KFENCE object pool. At this point, the timer
is reset, and the next allocation is set up after the expiration of the
interval.
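
The "one guarded allocation per interval" behaviour can be modelled in
userspace with a small atomic gate. This is only a simplified sketch: the
series declares an allocation_gate counter in mm/kfence/core.c, but the
functions below, their names and the exact protocol are assumptions made
for illustration, and the periodic timer is replaced by an explicit call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* 1 = gate closed (no sample pending), 0 = gate open for one allocation. */
static atomic_int sketch_gate = 1;

/* Hypothetical timer callback: the sample interval has expired. */
static void sketch_interval_expired(void)
{
	atomic_store(&sketch_gate, 0);
}

/*
 * Returns true if the calling allocation should be served from the
 * guarded pool. Only the first caller after the gate opens succeeds;
 * concurrent callers see a non-zero old value and fall through.
 */
static bool sketch_should_sample(void)
{
	if (atomic_load(&sketch_gate) != 0)
		return false;

	return atomic_fetch_add(&sketch_gate, 1) == 0;
}
```

In this model, every allocation between two expirations except the first
one takes the regular allocator path.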

To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE.
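
To show how the hooks fit into an allocator, here is a hedged userspace
model. A plain bool stands in for the static key (real static keys patch
the branch instruction, which cannot be reproduced here), a one-slot
buffer stands in for the KFENCE pool, and malloc()/free() stand in for
SLAB/SLUB. All sketch_* names are hypothetical; only the calling pattern
mirrors kfence_alloc() and kfence_free() from include/linux/kfence.h in
patch 01.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static bool sketch_key;       /* stand-in for the static key */
static char sketch_pool[64];  /* stand-in for the guarded pool: one slot */
static bool sketch_slot_used;

/* Counterpart of is_kfence_address(): does @p point into the pool? */
static bool sketch_is_pool_address(const void *p)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t start = (uintptr_t)sketch_pool;

	return a >= start && a < start + sizeof(sketch_pool);
}

/* Returns a pool object when the key is set, NULL otherwise. */
static void *sketch_guarded_alloc(size_t size)
{
	if (!sketch_key)
		return NULL; /* common case: proceed as usual */

	sketch_key = false; /* at most one sample per interval */
	if (sketch_slot_used || size > sizeof(sketch_pool))
		return NULL;

	sketch_slot_used = true;
	return sketch_pool;
}

/* Returns true if @p was a pool object and has been released. */
static bool sketch_guarded_free(void *p)
{
	if (!sketch_is_pool_address(p))
		return false;

	sketch_slot_used = false;
	return true;
}

/* Allocator entry points with the hooks inserted. */
static void *sketch_alloc(size_t size)
{
	void *p = sketch_guarded_alloc(size);

	return p ? p : malloc(size);
}

static void sketch_free(void *p)
{
	if (sketch_guarded_free(p))
		return;

	free(p);
}
```

The design point this illustrates is that the regular allocation path
only pays for one predictable check, while the free path must test the
address first so that pool objects never reach the regular allocator.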

We have verified by running synthetic benchmarks (sysbench I/O,
hackbench) that a kernel with KFENCE is performance-neutral compared to
a non-KFENCE baseline kernel.

KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
properties. The name "KFENCE" is a homage to the Electric Fence Malloc
Debugger [2].

For more details, see Documentation/dev-tools/kfence.rst added in the
series -- also viewable here:

https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst

[1] http://llvm.org/docs/GwpAsan.html
[2] https://linux.die.net/man/3/efence

Alexander Potapenko (6):
mm: add Kernel Electric-Fence infrastructure
x86, kfence: enable KFENCE for x86
mm, kfence: insert KFENCE hooks for SLAB
mm, kfence: insert KFENCE hooks for SLUB
kfence, kasan: make KFENCE compatible with KASAN
kfence, kmemleak: make KFENCE compatible with KMEMLEAK

Marco Elver (4):
arm64, kfence: enable KFENCE for ARM64
kfence, lockdep: make KFENCE compatible with lockdep
kfence, Documentation: add KFENCE documentation
kfence: add test suite

Documentation/dev-tools/index.rst | 1 +
Documentation/dev-tools/kfence.rst | 285 +++++++++++
MAINTAINERS | 11 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/kfence.h | 39 ++
arch/arm64/mm/fault.c | 4 +
arch/x86/Kconfig | 2 +
arch/x86/include/asm/kfence.h | 60 +++
arch/x86/mm/fault.c | 4 +
include/linux/kfence.h | 174 +++++++
init/main.c | 2 +
kernel/locking/lockdep.c | 8 +
lib/Kconfig.debug | 1 +
lib/Kconfig.kfence | 70 +++
mm/Makefile | 1 +
mm/kasan/common.c | 7 +
mm/kfence/Makefile | 6 +
mm/kfence/core.c | 730 +++++++++++++++++++++++++++
mm/kfence/kfence-test.c | 777 +++++++++++++++++++++++++++++
mm/kfence/kfence.h | 104 ++++
mm/kfence/report.c | 201 ++++++++
mm/kmemleak.c | 11 +
mm/slab.c | 46 +-
mm/slab_common.c | 6 +-
mm/slub.c | 72 ++-
25 files changed, 2591 insertions(+), 32 deletions(-)
create mode 100644 Documentation/dev-tools/kfence.rst
create mode 100644 arch/arm64/include/asm/kfence.h
create mode 100644 arch/x86/include/asm/kfence.h
create mode 100644 include/linux/kfence.h
create mode 100644 lib/Kconfig.kfence
create mode 100644 mm/kfence/Makefile
create mode 100644 mm/kfence/core.c
create mode 100644 mm/kfence/kfence-test.c
create mode 100644 mm/kfence/kfence.h
create mode 100644 mm/kfence/report.c

--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:13 AM

From: Alexander Potapenko <gli...@google.com>

This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors.

KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades precision
for performance. The main motivation behind KFENCE's design is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is to deploy the tool across a large fleet of
machines.

KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error.

Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval,
the next allocation through the main allocator (SLAB or SLUB) returns a
guarded allocation from the KFENCE object pool. At this point, the timer
is reset, and the next allocation is set up after the expiration of the
interval.

To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. To date, we have verified by running synthetic
benchmarks (sysbench I/O workloads) that a kernel compiled with KFENCE
is performance-neutral compared to the non-KFENCE baseline.

For more details, see Documentation/dev-tools/kfence.rst (added later in
the series).

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---
MAINTAINERS | 11 +
include/linux/kfence.h | 174 ++++++++++
init/main.c | 2 +
lib/Kconfig.debug | 1 +
lib/Kconfig.kfence | 58 ++++
mm/Makefile | 1 +
mm/kfence/Makefile | 3 +
mm/kfence/core.c | 730 +++++++++++++++++++++++++++++++++++++++++
mm/kfence/kfence.h | 104 ++++++
mm/kfence/report.c | 201 ++++++++++++
10 files changed, 1285 insertions(+)
create mode 100644 include/linux/kfence.h
create mode 100644 lib/Kconfig.kfence
create mode 100644 mm/kfence/Makefile
create mode 100644 mm/kfence/core.c
create mode 100644 mm/kfence/kfence.h
create mode 100644 mm/kfence/report.c

diff --git a/MAINTAINERS b/MAINTAINERS
index b5cfab015bd6..863899ed9a29 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9673,6 +9673,17 @@ F: include/linux/keyctl.h
F: include/uapi/linux/keyctl.h
F: security/keys/

+KFENCE
+M: Alexander Potapenko <gli...@google.com>
+M: Marco Elver <el...@google.com>
+R: Dmitry Vyukov <dvy...@google.com>
+L: kasa...@googlegroups.com
+S: Maintained
+F: Documentation/dev-tools/kfence.rst
+F: include/linux/kfence.h
+F: lib/Kconfig.kfence
+F: mm/kfence/
+
KFIFO
M: Stefani Seibold <ste...@seibold.net>
S: Maintained
diff --git a/include/linux/kfence.h b/include/linux/kfence.h
new file mode 100644
index 000000000000..8128ba7b5e90
--- /dev/null
+++ b/include/linux/kfence.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_KFENCE_H
+#define _LINUX_KFENCE_H
+
+#include <linux/mm.h>
+#include <linux/percpu.h>
+#include <linux/static_key.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_KFENCE
+
+/*
+ * We allocate an even number of pages, as it simplifies calculations to map
+ * address to metadata indices; effectively, the very first page serves as an
+ * extended guard page, but otherwise has no special purpose.
+ */
+#define KFENCE_POOL_SIZE ((CONFIG_KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE)
+#ifdef CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL
+extern char __kfence_pool[KFENCE_POOL_SIZE];
+#else
+extern char *__kfence_pool;
+#endif
+
+extern struct static_key_false kfence_allocation_key;
+
+/**
+ * is_kfence_address() - check if an address belongs to KFENCE pool
+ * @addr: address to check
+ *
+ * Return: true or false depending on whether the address is within the KFENCE
+ * object range.
+ *
+ * KFENCE objects live in a separate page range and are not to be intermixed
+ * with regular heap objects (e.g. KFENCE objects must never be added to the
+ * allocator freelists). Failing to do so may and will result in heap
+ * corruptions, therefore is_kfence_address() must be used to check whether
+ * an object requires specific handling.
+ */
+static __always_inline bool is_kfence_address(const void *addr)
+{
+ return unlikely((char *)addr >= __kfence_pool &&
+ (char *)addr < __kfence_pool + KFENCE_POOL_SIZE);
+}
+
+/**
+ * kfence_init() - perform KFENCE initialization at boot time
+ */
+void kfence_init(void);
+
+/**
+ * kfence_shutdown_cache() - handle shutdown_cache() for KFENCE objects
+ * @s: cache being shut down
+ *
+ * Return: true on success, false if any leftover objects persist.
+ *
+ * Before shutting down a cache, one must ensure there are no remaining objects
+ * allocated from it. KFENCE objects are not referenced from the cache, so
+ * kfence_shutdown_cache() takes care of them.
+ */
+bool __must_check kfence_shutdown_cache(struct kmem_cache *s);
+
+/*
+ * Allocate a KFENCE object. Allocators must not call this function directly,
+ * use kfence_alloc() instead.
+ */
+void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags);
+
+/**
+ * kfence_alloc() - allocate a KFENCE object with a low probability
+ * @s: struct kmem_cache with object requirements
+ * @size: exact size of the object to allocate (can be less than @s->size
+ * e.g. for kmalloc caches)
+ * @flags: GFP flags
+ *
+ * Return:
+ * * NULL - must proceed with allocating as usual,
+ * * non-NULL - pointer to a KFENCE object.
+ *
+ * kfence_alloc() should be inserted into the heap allocation fast path,
+ * allowing it to transparently return KFENCE-allocated objects with a low
+ * probability using a static branch (the probability is controlled by the
+ * kfence.sample_interval boot parameter).
+ */
+static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
+{
+ return static_branch_unlikely(&kfence_allocation_key) ? __kfence_alloc(s, size, flags) :
+ NULL;
+}
+
+/**
+ * kfence_ksize() - get actual amount of memory allocated for a KFENCE object
+ * @addr: pointer to a heap object
+ *
+ * Return:
+ * * 0 - not a KFENCE object, must call __ksize() instead,
+ * * non-0 - this many bytes can be accessed without causing a memory error.
+ *
+ * kfence_ksize() returns the number of bytes requested for a KFENCE object at
+ * allocation time. This number may be less than the object size of the
+ * corresponding struct kmem_cache.
+ */
+size_t kfence_ksize(const void *addr);
+
+/**
+ * kfence_object_start() - find the beginning of a KFENCE object
+ * @addr: address within a KFENCE-allocated object
+ *
+ * Return: address of the beginning of the object.
+ *
+ * SL[AU]B-allocated objects are laid out within a page one by one, so it is
+ * easy to calculate the beginning of an object given a pointer inside it and
+ * the object size. The same is not true for KFENCE, which places a single
+ * object at either end of the page. This helper function is used to find the
+ * beginning of a KFENCE-allocated object.
+ */
+void *kfence_object_start(const void *addr);
+
+/*
+ * Release a KFENCE-allocated object to KFENCE pool. Allocators must not call
+ * this function directly, use kfence_free() instead.
+ */
+void __kfence_free(void *addr);
+
+/**
+ * kfence_free() - try to release an arbitrary heap object to KFENCE pool
+ * @addr: object to be freed
+ *
+ * Return:
+ * * false - object doesn't belong to KFENCE pool and was ignored,
+ * * true - object was released to KFENCE pool.
+ *
+ * Release a KFENCE object and mark it as freed. May be called on any object,
+ * even non-KFENCE objects, to simplify integration of the hooks into the
+ * allocator's free codepath. The allocator must check the return value to
+ * determine if it was a KFENCE object or not.
+ */
+static __always_inline __must_check bool kfence_free(void *addr)
+{
+ if (!is_kfence_address(addr))
+ return false;
+ __kfence_free(addr);
+ return true;
+}
+
+/**
+ * kfence_handle_page_fault() - perform page fault handling for KFENCE pages
+ * @addr: faulting address
+ *
+ * Return:
+ * * false - address outside KFENCE pool,
+ * * true - page fault handled by KFENCE, no additional handling required.
+ *
+ * A page fault inside KFENCE pool indicates a memory error, such as an
+ * out-of-bounds access, a use-after-free or an invalid memory access. In these
+ * cases KFENCE prints an error message and marks the offending page as
+ * present, so that the kernel can proceed.
+ */
+bool __must_check kfence_handle_page_fault(unsigned long addr);
+
+#else /* CONFIG_KFENCE */
+
+static inline bool is_kfence_address(const void *addr) { return false; }
+static inline void kfence_init(void) { }
+static inline bool __must_check kfence_shutdown_cache(struct kmem_cache *s) { return true; }
+static inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) { return NULL; }
+static inline size_t kfence_ksize(const void *addr) { return 0; }
+static inline void *kfence_object_start(const void *addr) { return NULL; }
+static inline bool __must_check kfence_free(void *addr) { return false; }
+static inline bool __must_check kfence_handle_page_fault(unsigned long addr) { return false; }
+
+#endif
+
+#endif /* _LINUX_KFENCE_H */
diff --git a/init/main.c b/init/main.c
index ae78fb68d231..ec7de9dc1ed8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -39,6 +39,7 @@
#include <linux/security.h>
#include <linux/smp.h>
#include <linux/profile.h>
+#include <linux/kfence.h>
#include <linux/rcupdate.h>
#include <linux/moduleparam.h>
#include <linux/kallsyms.h>
@@ -942,6 +943,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
hrtimers_init();
softirq_init();
timekeeping_init();
+ kfence_init();

/*
* For best initial stack canary entropy, prepare it after:
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e068c3c7189a..d09c6a306532 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -880,6 +880,7 @@ config DEBUG_STACKOVERFLOW
If in doubt, say "N".

source "lib/Kconfig.kasan"
+source "lib/Kconfig.kfence"

endmenu # "Memory Debugging"

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
new file mode 100644
index 000000000000..7ac91162edb0
--- /dev/null
+++ b/lib/Kconfig.kfence
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config HAVE_ARCH_KFENCE
+ bool
+
+config HAVE_ARCH_KFENCE_STATIC_POOL
+ bool
+ help
+ If the architecture supports using the static pool.
+
+menuconfig KFENCE
+ bool "KFENCE: low-overhead sampling-based memory safety error detector"
+ depends on HAVE_ARCH_KFENCE && !KASAN && (SLAB || SLUB)
+ depends on JUMP_LABEL # To ensure performance, require jump labels
+ select STACKTRACE
+ help
+ KFENCE is a low-overhead sampling-based detector for heap out-of-bounds
+ access, use-after-free, and invalid-free errors. KFENCE is designed
+ to have negligible cost to permit enabling it in production
+ environments.
+
+ See <file:Documentation/dev-tools/kfence.rst> for more details.
+
+ Note that KFENCE is not a substitute for explicit testing with tools
+ such as KASAN. KFENCE can detect a subset of bugs that KASAN can
+ detect (therefore enabling KFENCE together with KASAN does not make
+ sense), albeit at very different performance profiles.
+
+if KFENCE
+
+config KFENCE_SAMPLE_INTERVAL
+ int "Default sample interval in milliseconds"
+ default 100
+ help
+ The KFENCE sample interval determines the frequency with which heap
+ allocations will be guarded by KFENCE. May be overridden via boot
+ parameter "kfence.sample_interval".
+
+config KFENCE_NUM_OBJECTS
+ int "Number of guarded objects available"
+ default 255
+ range 1 65535
+ help
+ The number of guarded objects available. Each KFENCE object requires
+ 2 pages: one containing the object and one guard page. Guard pages are
+ shared by neighbors, so every object page has a guard on both sides.
+
+config KFENCE_FAULT_INJECTION
+ int "Fault injection for stress testing"
+ default 0
+ depends on EXPERT
+ help
+ The inverse probability with which to randomly protect KFENCE object
+ pages, resulting in spurious use-after-frees. The main purpose of
+ this option is to stress-test KFENCE with concurrent error reports
+ and allocations/frees. A value of 0 disables fault injection.
+
+endif # KFENCE
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..afdf1ae0900b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -81,6 +81,7 @@ obj-$(CONFIG_PAGE_POISONING) += page_poison.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KASAN) += kasan/
+obj-$(CONFIG_KFENCE) += kfence/
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_MEMTEST) += memtest.o
diff --git a/mm/kfence/Makefile b/mm/kfence/Makefile
new file mode 100644
index 000000000000..d991e9a349f0
--- /dev/null
+++ b/mm/kfence/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_KFENCE) := core.o report.o
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
new file mode 100644
index 000000000000..e638d1f64a32
--- /dev/null
+++ b/mm/kfence/core.c
@@ -0,0 +1,730 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "kfence: " fmt
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/debugfs.h>
+#include <linux/kcsan-checks.h>
+#include <linux/kfence.h>
+#include <linux/list.h>
+#include <linux/lockdep.h>
+#include <linux/moduleparam.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/string.h>
+
+#include <asm/kfence.h>
+
+#include "kfence.h"
+
+/* Disables KFENCE on the first warning assuming an irrecoverable error. */
+#define KFENCE_WARN_ON(cond) \
+ ({ \
+ const bool __cond = WARN_ON(cond); \
+ if (unlikely(__cond)) \
+ WRITE_ONCE(kfence_enabled, false); \
+ __cond; \
+ })
+
+#ifndef CONFIG_KFENCE_FAULT_INJECTION /* Only defined with CONFIG_EXPERT. */
+#define CONFIG_KFENCE_FAULT_INJECTION 0
+#endif
+
+/* === Data ================================================================= */
+
+static unsigned long kfence_sample_interval __read_mostly = CONFIG_KFENCE_SAMPLE_INTERVAL;
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+#define MODULE_PARAM_PREFIX "kfence."
+module_param_named(sample_interval, kfence_sample_interval, ulong,
+ IS_ENABLED(CONFIG_DEBUG_KERNEL) ? 0600 : 0400);
+
+static bool kfence_enabled __read_mostly;
+
+/*
+ * The pool of pages used for guard pages and objects. If supported, allocated
+ * statically, so that is_kfence_address() avoids a pointer load, and simply
+ * compares against a constant address. Assume that if KFENCE is compiled into
+ * the kernel, it is usually enabled, and the space is to be allocated one way
+ * or another.
+ */
+#ifdef CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL
+char __kfence_pool[KFENCE_POOL_SIZE] __aligned(KFENCE_POOL_ALIGNMENT);
+#else
+char *__kfence_pool __read_mostly;
+#endif
+EXPORT_SYMBOL(__kfence_pool); /* Export for test modules. */
+
+/*
+ * Per-object metadata, with one-to-one mapping of object metadata to
+ * backing pages (in __kfence_pool).
+ */
+static_assert(CONFIG_KFENCE_NUM_OBJECTS > 0);
+struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
+
+/* Freelist with available objects. */
+static struct list_head kfence_freelist = LIST_HEAD_INIT(kfence_freelist);
+static DEFINE_RAW_SPINLOCK(kfence_freelist_lock); /* Lock protecting freelist. */
+
+/* The static key to set up a KFENCE allocation. */
+DEFINE_STATIC_KEY_FALSE(kfence_allocation_key);
+
+/* Gates the allocation, ensuring only one succeeds in a given period. */
+static atomic_t allocation_gate = ATOMIC_INIT(1);
+
+/* Wait queue to wake up allocation-gate timer task. */
+static DECLARE_WAIT_QUEUE_HEAD(allocation_wait);
+
+/* Statistics counters for debugfs. */
+enum kfence_counter_id {
+ KFENCE_COUNTER_ALLOCATED,
+ KFENCE_COUNTER_ALLOCS,
+ KFENCE_COUNTER_FREES,
+ KFENCE_COUNTER_BUGS,
+ KFENCE_COUNTER_COUNT,
+};
+static atomic_long_t counters[KFENCE_COUNTER_COUNT];
+static const char *const counter_names[] = {
+ [KFENCE_COUNTER_ALLOCATED] = "currently allocated",
+ [KFENCE_COUNTER_ALLOCS] = "total allocations",
+ [KFENCE_COUNTER_FREES] = "total frees",
+ [KFENCE_COUNTER_BUGS] = "total bugs",
+};
+static_assert(ARRAY_SIZE(counter_names) == KFENCE_COUNTER_COUNT);
+
+/* === Internals ============================================================ */
+
+static bool kfence_protect(unsigned long addr)
+{
+ return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), true));
+}
+
+static bool kfence_unprotect(unsigned long addr)
+{
+ return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), false));
+}
+
+static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
+{
+ long index;
+
+ /* The checks do not affect performance; only called from slow-paths. */
+
+ if (!is_kfence_address((void *)addr))
+ return NULL;
+
+ /*
+ * May be an invalid index if called with an address at the edge of
+ * __kfence_pool, in which case we would report an "invalid access"
+ * error.
+ */
+ index = ((addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2)) - 1;
+ if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
+ return NULL;
+
+ return &kfence_metadata[index];
+}
+
+static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *meta)
+{
+ unsigned long offset = ((meta - kfence_metadata) + 1) * PAGE_SIZE * 2;
+ unsigned long pageaddr = (unsigned long)&__kfence_pool[offset];
+
+ /* The checks do not affect performance; only called from slow-paths. */
+
+ /* Only call with a pointer into kfence_metadata. */
+ if (KFENCE_WARN_ON(meta < kfence_metadata ||
+ meta >= kfence_metadata + ARRAY_SIZE(kfence_metadata)))
+ return 0;
+
+ /*
+ * This metadata object only ever maps to 1 page; verify that the stored
+ * address still lies within that page, i.e. that it was not corrupted.
+ */
+ if (KFENCE_WARN_ON(ALIGN_DOWN(meta->addr, PAGE_SIZE) != pageaddr))
+ return 0;
+
+ return pageaddr;
+}
+
+/*
+ * Update the object's metadata state, including updating the alloc/free stacks
+ * depending on the state transition.
+ */
+static noinline void metadata_update_state(struct kfence_metadata *meta,
+ enum kfence_object_state next)
+{
+ unsigned long *entries = next == KFENCE_OBJECT_FREED ? meta->free_stack : meta->alloc_stack;
+ /*
+ * Skip over 1 (this) function; noinline ensures we do not accidentally
+ * skip over the caller by never inlining.
+ */
+ const int nentries = stack_trace_save(entries, KFENCE_STACK_DEPTH, 1);
+
+ lockdep_assert_held(&meta->lock);
+
+ if (next == KFENCE_OBJECT_FREED)
+ meta->num_free_stack = nentries;
+ else
+ meta->num_alloc_stack = nentries;
+
+ /*
+ * Pairs with READ_ONCE() in
+ * kfence_shutdown_cache(),
+ * kfence_handle_page_fault().
+ */
+ WRITE_ONCE(meta->state, next);
+}
+
+/* Write canary byte to @addr. */
+static inline bool set_canary_byte(u8 *addr)
+{
+ *addr = KFENCE_CANARY_PATTERN(addr);
+ return true;
+}
+
+/* Check canary byte at @addr. */
+static inline bool check_canary_byte(u8 *addr)
+{
+ if (*addr == KFENCE_CANARY_PATTERN(addr))
+ return true;
+
+ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
+ kfence_report_error((unsigned long)addr, addr_to_metadata((unsigned long)addr),
+ KFENCE_ERROR_CORRUPTION);
+ return false;
+}
+
+static inline void for_each_canary(const struct kfence_metadata *meta, bool (*fn)(u8 *))
+{
+ const int size = abs(meta->size);
+ unsigned long addr;
+
+ lockdep_assert_held(&meta->lock);
+
+ for (addr = ALIGN_DOWN(meta->addr, PAGE_SIZE); addr < meta->addr; addr++) {
+ if (!fn((u8 *)addr))
+ break;
+ }
+
+ for (addr = meta->addr + size; addr < PAGE_ALIGN(meta->addr); addr++) {
+ if (!fn((u8 *)addr))
+ break;
+ }
+}
+
+static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp)
+{
+ /*
+ * Note: for allocations made before RNG initialization, prandom_u32_max()
+ * will always return zero. We still benefit from enabling KFENCE as early
+ * as possible, even when the RNG is not yet available, as this will allow
+ * KFENCE to detect bugs due to earlier allocations. The only downside
+ * is that the out-of-bounds accesses detected are deterministic for
+ * such allocations.
+ */
+ const bool right = prandom_u32_max(2);
+ unsigned long flags;
+ struct kfence_metadata *meta = NULL;
+ void *addr = NULL;
+
+ /* Try to obtain a free object. */
+ raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+ if (!list_empty(&kfence_freelist)) {
+ meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
+ list_del_init(&meta->list);
+ }
+ raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+ if (!meta)
+ return NULL;
+
+ if (unlikely(!raw_spin_trylock_irqsave(&meta->lock, flags))) {
+ /*
+ * This is extremely unlikely -- we are reporting on a
+ * use-after-free, which locked meta->lock, and the reporting
+ * code via printk calls kmalloc() which ends up in
+ * kfence_alloc() and tries to grab the same object that we're
+ * reporting on. While it has never been observed, lockdep does
+ * report that there is a possibility of deadlock. Fix it by
+ * using trylock and bailing out gracefully.
+ */
+ raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+ /* Put the object back on the freelist. */
+ list_add_tail(&meta->list, &kfence_freelist);
+ raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+
+ return NULL;
+ }
+
+ meta->addr = metadata_to_pageaddr(meta);
+ /* Unprotect if we're reusing this page. */
+ if (meta->state == KFENCE_OBJECT_FREED)
+ kfence_unprotect(meta->addr);
+
+ /* Calculate address for this allocation. */
+ if (right)
+ meta->addr += PAGE_SIZE - size;
+ meta->addr = ALIGN_DOWN(meta->addr, cache->align);
+
+ /* Update remaining metadata. */
+ metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED);
+ /* Pairs with READ_ONCE() in kfence_shutdown_cache(). */
+ WRITE_ONCE(meta->cache, cache);
+ meta->size = right ? -size : size;
+ for_each_canary(meta, set_canary_byte);
+ virt_to_page(meta->addr)->slab_cache = cache;
+
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+ /* Memory initialization. */
+
+ /*
+ * We check slab_want_init_on_alloc() ourselves, rather than letting
+ * SL*B do the initialization, as otherwise we might overwrite KFENCE's
+ * redzone.
+ */
+ addr = (void *)meta->addr;
+ if (unlikely(slab_want_init_on_alloc(gfp, cache)))
+ memzero_explicit(addr, size);
+ if (cache->ctor)
+ cache->ctor(addr);
+
+ if (CONFIG_KFENCE_FAULT_INJECTION && !prandom_u32_max(CONFIG_KFENCE_FAULT_INJECTION))
+ kfence_protect(meta->addr); /* Random "faults" by protecting the object. */
+
+ atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCATED]);
+ atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]);
+
+ return addr;
+}
+
+static void kfence_guarded_free(void *addr, struct kfence_metadata *meta)
+{
+ struct kcsan_scoped_access assert_page_exclusive;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+
+ if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
+ /* Invalid or double-free, bail out. */
+ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
+ kfence_report_error((unsigned long)addr, meta, KFENCE_ERROR_INVALID_FREE);
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+ return;
+ }
+
+ /* Detect racy use-after-free, or incorrect reallocation of this page by KFENCE. */
+ kcsan_begin_scoped_access((void *)ALIGN_DOWN((unsigned long)addr, PAGE_SIZE), PAGE_SIZE,
+ KCSAN_ACCESS_SCOPED | KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT,
+ &assert_page_exclusive);
+
+ if (CONFIG_KFENCE_FAULT_INJECTION)
+ kfence_unprotect((unsigned long)addr); /* To check canary bytes. */
+
+ /* Restore page protection if there was an OOB access. */
+ if (meta->unprotected_page) {
+ kfence_protect(meta->unprotected_page);
+ meta->unprotected_page = 0;
+ }
+
+ /* Check canary bytes for memory corruption. */
+ for_each_canary(meta, check_canary_byte);
+
+ /*
+ * Clear memory if init-on-free is set. While we protect the page, the
+ * data is still there, and after a use-after-free is detected, we
+ * unprotect the page, so the data is still accessible.
+ */
+ if (unlikely(slab_want_init_on_free(meta->cache)))
+ memzero_explicit(addr, abs(meta->size));
+
+ /* Mark the object as freed. */
+ metadata_update_state(meta, KFENCE_OBJECT_FREED);
+
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+ /* Protect to detect use-after-frees. */
+ kfence_protect((unsigned long)addr);
+
+ /* Add it to the tail of the freelist for reuse. */
+ raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+ KFENCE_WARN_ON(!list_empty(&meta->list));
+ list_add_tail(&meta->list, &kfence_freelist);
+ kcsan_end_scoped_access(&assert_page_exclusive);
+ raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+
+ atomic_long_dec(&counters[KFENCE_COUNTER_ALLOCATED]);
+ atomic_long_inc(&counters[KFENCE_COUNTER_FREES]);
+}
+
+static void rcu_guarded_free(struct rcu_head *h)
+{
+ struct kfence_metadata *meta = container_of(h, struct kfence_metadata, rcu_head);
+
+ kfence_guarded_free((void *)meta->addr, meta);
+}
+
+static bool __init kfence_initialize_pool(void)
+{
+ unsigned long addr;
+ struct page *pages;
+ int i;
+
+ if (!arch_kfence_initialize_pool())
+ return false;
+
+ addr = (unsigned long)__kfence_pool;
+ pages = virt_to_page(addr);
+
+ /*
+ * Set up non-redzone pages: they must have PG_slab set, to avoid
+ * freeing these as real pages.
+ *
+ * We also want to avoid inserting kfence_free() in the kfree()
+ * fast-path in SLUB, and therefore need to ensure kfree() correctly
+ * enters __slab_free() slow-path.
+ */
+ for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
+ if (!i || (i % 2))
+ continue;
+
+ __SetPageSlab(&pages[i]);
+ }
+
+ /*
+ * Protect the first 2 pages. The first page is mostly unnecessary, and
+ * merely serves as an extended guard page. However, adding one
+ * additional page in the beginning gives us an even number of pages,
+ * which simplifies the mapping of address to metadata index.
+ */
+ for (i = 0; i < 2; i++) {
+ if (unlikely(!kfence_protect(addr)))
+ return false;
+
+ addr += PAGE_SIZE;
+ }
+
+ for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
+ struct kfence_metadata *meta = &kfence_metadata[i];
+
+ /* Initialize metadata. */
+ INIT_LIST_HEAD(&meta->list);
+ raw_spin_lock_init(&meta->lock);
+ meta->state = KFENCE_OBJECT_UNUSED;
+ meta->addr = addr; /* Initialize for validation in metadata_to_pageaddr(). */
+ list_add_tail(&meta->list, &kfence_freelist);
+
+ /* Protect the right redzone. */
+ if (unlikely(!kfence_protect(addr + PAGE_SIZE)))
+ return false;
+
+ addr += 2 * PAGE_SIZE;
+ }
+
+ return true;
+}
+
+/* === DebugFS Interface ==================================================== */
+
+static int stats_show(struct seq_file *seq, void *v)
+{
+ int i;
+
+ seq_printf(seq, "enabled: %i\n", READ_ONCE(kfence_enabled));
+ for (i = 0; i < KFENCE_COUNTER_COUNT; i++)
+ seq_printf(seq, "%s: %ld\n", counter_names[i], atomic_long_read(&counters[i]));
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(stats);
+
+/*
+ * debugfs seq_file operations for /sys/kernel/debug/kfence/objects.
+ * start_object() and next_object() return the object index + 1, because NULL is used
+ * to stop iteration.
+ */
+static void *start_object(struct seq_file *seq, loff_t *pos)
+{
+ if (*pos < CONFIG_KFENCE_NUM_OBJECTS)
+ return (void *)((long)*pos + 1);
+ return NULL;
+}
+
+static void stop_object(struct seq_file *seq, void *v)
+{
+}
+
+static void *next_object(struct seq_file *seq, void *v, loff_t *pos)
+{
+ ++*pos;
+ if (*pos < CONFIG_KFENCE_NUM_OBJECTS)
+ return (void *)((long)*pos + 1);
+ return NULL;
+}
+
+static int show_object(struct seq_file *seq, void *v)
+{
+ struct kfence_metadata *meta = &kfence_metadata[(long)v - 1];
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+ kfence_print_object(seq, meta);
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+ seq_puts(seq, "---------------------------------\n");
+
+ return 0;
+}
+
+static const struct seq_operations object_seqops = {
+ .start = start_object,
+ .next = next_object,
+ .stop = stop_object,
+ .show = show_object,
+};
+
+static int open_objects(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &object_seqops);
+}
+
+static const struct file_operations objects_fops = {
+ .open = open_objects,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release, /* Pairs with seq_open() in open_objects(). */
+};
+
+static int __init kfence_debugfs_init(void)
+{
+ struct dentry *kfence_dir = debugfs_create_dir("kfence", NULL);
+
+ debugfs_create_file("stats", 0400, kfence_dir, NULL, &stats_fops);
+ debugfs_create_file("objects", 0400, kfence_dir, NULL, &objects_fops);
+ return 0;
+}
+
+late_initcall(kfence_debugfs_init);
+
+/* === Allocation Gate Timer ================================================ */
+
+/*
+ * Set up delayed work, which will enable and disable the static key. We need to
+ * use a work queue (rather than a simple timer), since enabling and disabling a
+ * static key cannot be done from an interrupt.
+ */
+static struct delayed_work kfence_timer;
+static void toggle_allocation_gate(struct work_struct *work)
+{
+ if (!READ_ONCE(kfence_enabled))
+ return;
+
+ /* Enable static key, and await allocation to happen. */
+ atomic_set(&allocation_gate, 0);
+ static_branch_enable(&kfence_allocation_key);
+ wait_event(allocation_wait, atomic_read(&allocation_gate) != 0);
+
+ /* Disable static key and reset timer. */
+ static_branch_disable(&kfence_allocation_key);
+ schedule_delayed_work(&kfence_timer, msecs_to_jiffies(kfence_sample_interval));
+}
+static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);
+
+/* === Public interface ===================================================== */
+
+void __init kfence_init(void)
+{
+ /* Setting kfence_sample_interval to 0 on boot disables KFENCE. */
+ if (!kfence_sample_interval)
+ return;
+
+ if (!kfence_initialize_pool()) {
+ pr_err("%s failed\n", __func__);
+ return;
+ }
+
+ schedule_delayed_work(&kfence_timer, 0);
+ WRITE_ONCE(kfence_enabled, true);
+ pr_info("initialized - using %zu bytes for %d objects", KFENCE_POOL_SIZE,
+ CONFIG_KFENCE_NUM_OBJECTS);
+ if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
+ pr_cont(" at 0x%px-0x%px\n", (void *)__kfence_pool,
+ (void *)(__kfence_pool + KFENCE_POOL_SIZE));
+ else
+ pr_cont("\n");
+}
+
+bool kfence_shutdown_cache(struct kmem_cache *s)
+{
+ unsigned long flags;
+ struct kfence_metadata *meta;
+ int i;
+
+ for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
+ bool in_use;
+
+ meta = &kfence_metadata[i];
+
+ /*
+ * If we observe some inconsistent cache and state pair where we
+ * should have returned false here, cache destruction is racing
+ * with either kmem_cache_alloc() or kmem_cache_free(). Taking
+ * the lock will not help, as different critical section
+ * serialization will have the same outcome.
+ */
+ if (READ_ONCE(meta->cache) != s ||
+ READ_ONCE(meta->state) != KFENCE_OBJECT_ALLOCATED)
+ continue;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+ in_use = meta->cache == s && meta->state == KFENCE_OBJECT_ALLOCATED;
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+ if (in_use)
+ return false;
+ }
+
+ for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
+ meta = &kfence_metadata[i];
+
+ /* See above. */
+ if (READ_ONCE(meta->cache) != s || READ_ONCE(meta->state) != KFENCE_OBJECT_FREED)
+ continue;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+ if (meta->cache == s && meta->state == KFENCE_OBJECT_FREED)
+ meta->cache = NULL;
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+ }
+
+ return true;
+}
+
+void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
+{
+ /*
+ * allocation_gate only needs to become non-zero, so it doesn't make
+ * sense to continue writing to it and pay the associated contention
+ * cost, in case we have a large number of concurrent allocations.
+ */
+ if (atomic_read(&allocation_gate) || atomic_inc_return(&allocation_gate) > 1)
+ return NULL;
+ wake_up(&allocation_wait);
+
+ if (!READ_ONCE(kfence_enabled))
+ return NULL;
+
+ if (size > PAGE_SIZE)
+ return NULL;
+
+ return kfence_guarded_alloc(s, size, flags);
+}
+
+size_t kfence_ksize(const void *addr)
+{
+ const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ /*
+ * Read locklessly -- if there is a race with __kfence_alloc(), this
+ * most certainly is either a use-after-free, or invalid access.
+ */
+ return meta ? abs(meta->size) : 0;
+}
+
+void *kfence_object_start(const void *addr)
+{
+ const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ /*
+ * Read locklessly -- if there is a race with __kfence_alloc(), this
+ * most certainly is either a use-after-free, or invalid access.
+ */
+ return meta ? (void *)meta->addr : NULL;
+}
+
+void __kfence_free(void *addr)
+{
+ struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ if (unlikely(meta->cache->flags & SLAB_TYPESAFE_BY_RCU))
+ call_rcu(&meta->rcu_head, rcu_guarded_free);
+ else
+ kfence_guarded_free(addr, meta);
+}
+
+bool kfence_handle_page_fault(unsigned long addr)
+{
+ const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE;
+ struct kfence_metadata *to_report = NULL;
+ enum kfence_error_type error_type;
+ unsigned long flags;
+
+ if (!is_kfence_address((void *)addr))
+ return false;
+
+ if (!READ_ONCE(kfence_enabled)) /* If disabled at runtime ... */
+ return kfence_unprotect(addr); /* ... unprotect and proceed. */
+
+ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
+
+ if (page_index % 2) {
+ /* This is a redzone, report a buffer overflow. */
+ struct kfence_metadata *meta = NULL;
+ int distance = 0;
+
+ meta = addr_to_metadata(addr - PAGE_SIZE);
+ if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
+ to_report = meta;
+ /* Data race ok; distance calculation approximate. */
+ distance = addr - data_race(meta->addr + abs(meta->size));
+ }
+
+ meta = addr_to_metadata(addr + PAGE_SIZE);
+ if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
+ /* Data race ok; distance calculation approximate. */
+ if (!to_report || distance > data_race(meta->addr) - addr)
+ to_report = meta;
+ }
+
+ if (!to_report)
+ goto out;
+
+ raw_spin_lock_irqsave(&to_report->lock, flags);
+ to_report->unprotected_page = addr;
+ error_type = KFENCE_ERROR_OOB;
+
+ /*
+ * If the object was freed before we took the lock we can still
+ * report this as an OOB -- the report will simply show the
+ * stacktrace of the free as well.
+ */
+ } else {
+ to_report = addr_to_metadata(addr);
+ if (!to_report)
+ goto out;
+
+ raw_spin_lock_irqsave(&to_report->lock, flags);
+ error_type = KFENCE_ERROR_UAF;
+ /*
+ * We may race with __kfence_alloc(), and it is possible that a
+ * freed object may be reallocated. We simply report this as a
+ * use-after-free, with the stack trace showing the place where
+ * the object was re-allocated.
+ */
+ }
+
+out:
+ if (to_report) {
+ kfence_report_error(addr, to_report, error_type);
+ raw_spin_unlock_irqrestore(&to_report->lock, flags);
+ } else {
+ /* This may be a UAF or OOB access, but we can't be sure. */
+ kfence_report_error(addr, NULL, KFENCE_ERROR_INVALID);
+ }
+
+ return kfence_unprotect(addr); /* Unprotect and let access proceed. */
+}
diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
new file mode 100644
index 000000000000..25ce2c0dc092
--- /dev/null
+++ b/mm/kfence/kfence.h
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef MM_KFENCE_KFENCE_H
+#define MM_KFENCE_KFENCE_H
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+
+#include "../slab.h" /* for struct kmem_cache */
+
+/* For non-debug builds, avoid leaking kernel pointers into dmesg. */
+#ifdef CONFIG_DEBUG_KERNEL
+#define PTR_FMT "%px"
+#else
+#define PTR_FMT "%p"
+#endif
+
+/*
+ * Get the canary byte pattern for @addr. Use a pattern that varies based on the
+ * lower 3 bits of the address, to detect memory corruptions with higher
+ * probability, where similar constants are used.
+ */
+#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))
+
+/* Maximum stack depth for reports. */
+#define KFENCE_STACK_DEPTH 64
+
+/* KFENCE object states. */
+enum kfence_object_state {
+ KFENCE_OBJECT_UNUSED, /* Object is unused. */
+ KFENCE_OBJECT_ALLOCATED, /* Object is currently allocated. */
+ KFENCE_OBJECT_FREED, /* Object was allocated, and then freed. */
+};
+
+/* KFENCE metadata per guarded allocation. */
+struct kfence_metadata {
+ struct list_head list; /* Freelist node; access under kfence_freelist_lock. */
+ struct rcu_head rcu_head; /* For delayed freeing. */
+
+ /*
+ * Lock protecting below data; to ensure consistency of the below data,
+ * since the following may execute concurrently: __kfence_alloc(),
+ * __kfence_free(), kfence_handle_page_fault(). However, note that we
+ * cannot grab the same metadata off the freelist twice, and multiple
+ * __kfence_alloc() cannot run concurrently on the same metadata.
+ */
+ raw_spinlock_t lock;
+
+ /* The current state of the object; see above. */
+ enum kfence_object_state state;
+
+ /*
+ * Allocated object address; cannot be calculated from size, because of
+ * alignment requirements.
+ *
+ * Invariant: ALIGN_DOWN(addr, PAGE_SIZE) is constant.
+ */
+ unsigned long addr;
+
+ /*
+ * The size of the original allocation:
+ * size > 0: left page alignment
+ * size < 0: right page alignment
+ */
+ int size;
+
+ /*
+ * The kmem_cache cache of the last allocation; NULL if never allocated
+ * or the cache has already been destroyed.
+ */
+ struct kmem_cache *cache;
+
+ /*
+ * In case of an invalid access, the page that was unprotected; we
+ * optimistically only store address.
+ */
+ unsigned long unprotected_page;
+
+ /* Allocation and free stack information. */
+ int num_alloc_stack;
+ int num_free_stack;
+ unsigned long alloc_stack[KFENCE_STACK_DEPTH];
+ unsigned long free_stack[KFENCE_STACK_DEPTH];
+};
+
+extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
+
+/* KFENCE error types for report generation. */
+enum kfence_error_type {
+ KFENCE_ERROR_OOB, /* Detected an out-of-bounds access. */
+ KFENCE_ERROR_UAF, /* Detected a use-after-free access. */
+ KFENCE_ERROR_CORRUPTION, /* Detected a memory corruption on free. */
+ KFENCE_ERROR_INVALID, /* Invalid access of unknown type. */
+ KFENCE_ERROR_INVALID_FREE, /* Invalid free. */
+};
+
+void kfence_report_error(unsigned long address, const struct kfence_metadata *meta,
+ enum kfence_error_type type);
+
+void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta);
+
+#endif /* MM_KFENCE_KFENCE_H */
diff --git a/mm/kfence/report.c b/mm/kfence/report.c
new file mode 100644
index 000000000000..8c28200e7433
--- /dev/null
+++ b/mm/kfence/report.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdarg.h>
+
+#include <linux/kernel.h>
+#include <linux/lockdep.h>
+#include <linux/printk.h>
+#include <linux/seq_file.h>
+#include <linux/stacktrace.h>
+#include <linux/string.h>
+
+#include <asm/kfence.h>
+
+#include "kfence.h"
+
+/* Helper function to either print to a seq_file or to console. */
+static void seq_con_printf(struct seq_file *seq, const char *fmt, ...)
+{
+ va_list args;
+
+ va_start(args, fmt);
+ if (seq)
+ seq_vprintf(seq, fmt, args);
+ else
+ vprintk(fmt, args);
+ va_end(args);
+}
+
+/* Get the number of stack entries to skip to get out of MM internals. */
+static int get_stack_skipnr(const unsigned long stack_entries[], int num_entries,
+ enum kfence_error_type type)
+{
+ char buf[64];
+ int skipnr, fallback = 0;
+
+ for (skipnr = 0; skipnr < num_entries; skipnr++) {
+ int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]);
+
+ /* Depending on error type, find different stack entries. */
+ switch (type) {
+ case KFENCE_ERROR_UAF:
+ case KFENCE_ERROR_OOB:
+ case KFENCE_ERROR_INVALID:
+ if (!strncmp(buf, KFENCE_SKIP_ARCH_FAULT_HANDLER, len))
+ goto found;
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ case KFENCE_ERROR_INVALID_FREE:
+ if (str_has_prefix(buf, "kfence_") || str_has_prefix(buf, "__kfence_"))
+ fallback = skipnr + 1; /* In case kfree tail calls into kfence. */
+
+ /* Also match the *_bulk() variants by only checking prefixes. */
+ if (str_has_prefix(buf, "kfree") || str_has_prefix(buf, "kmem_cache_free"))
+ goto found;
+ break;
+ }
+ }
+ if (fallback < num_entries)
+ return fallback;
+found:
+ skipnr++;
+ return skipnr < num_entries ? skipnr : 0;
+}
+
+static void kfence_print_stack(struct seq_file *seq, const struct kfence_metadata *meta,
+ bool show_alloc)
+{
+ const unsigned long *entries = show_alloc ? meta->alloc_stack : meta->free_stack;
+ const int nentries = show_alloc ? meta->num_alloc_stack : meta->num_free_stack;
+
+ if (nentries) {
+ int i;
+
+ /* stack_trace_seq_print() does not exist; open code our own. */
+ for (i = 0; i < nentries; i++)
+ seq_con_printf(seq, " %pS\n", (void *)entries[i]);
+ } else {
+ seq_con_printf(seq, " no %s stack\n", show_alloc ? "allocation" : "deallocation");
+ }
+}
+
+void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta)
+{
+ const int size = abs(meta->size);
+ const unsigned long start = meta->addr;
+ const struct kmem_cache *const cache = meta->cache;
+
+ lockdep_assert_held(&meta->lock);
+
+ if (meta->state == KFENCE_OBJECT_UNUSED) {
+ seq_con_printf(seq, "kfence-#%zd unused\n", meta - kfence_metadata);
+ return;
+ }
+
+ seq_con_printf(seq,
+ "kfence-#%zd [0x" PTR_FMT "-0x" PTR_FMT
+ ", size=%d, cache=%s] allocated in:\n",
+ meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size,
+ (cache && cache->name) ? cache->name : "<destroyed>");
+ kfence_print_stack(seq, meta, true);
+
+ if (meta->state == KFENCE_OBJECT_FREED) {
+ seq_con_printf(seq, "freed in:\n");
+ kfence_print_stack(seq, meta, false);
+ }
+}
+
+/*
+ * Show bytes at @addr that are different from the expected canary values, up to
+ * @max_bytes.
+ */
+static void print_diff_canary(const u8 *addr, size_t max_bytes)
+{
+ const u8 *max_addr = min((const u8 *)PAGE_ALIGN((unsigned long)addr), addr + max_bytes);
+
+ pr_cont("[");
+ for (; addr < max_addr; addr++) {
+ if (*addr == KFENCE_CANARY_PATTERN(addr))
+ pr_cont(" .");
+ else if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
+ pr_cont(" 0x%02x", *addr);
+ else /* Do not leak kernel memory in non-debug builds. */
+ pr_cont(" !");
+ }
+ pr_cont(" ]");
+}
+
+void kfence_report_error(unsigned long address, const struct kfence_metadata *meta,
+ enum kfence_error_type type)
+{
+ unsigned long stack_entries[KFENCE_STACK_DEPTH] = { 0 };
+ int num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 1);
+ int skipnr = get_stack_skipnr(stack_entries, num_stack_entries, type);
+
+ /* KFENCE_ERROR_OOB requires non-NULL meta; for the rest it's optional. */
+ if (WARN_ON(type == KFENCE_ERROR_OOB && !meta))
+ return;
+
+ if (meta)
+ lockdep_assert_held(&meta->lock);
+ /*
+ * Because we may generate reports in printk-unfriendly parts of the
+ * kernel, such as scheduler code, the use of printk() could deadlock.
+ * Until such time that all printing code here is safe in all parts of
+ * the kernel, accept the risk, and just get our message out (given the
+ * system might already behave unpredictably due to the memory error).
+ * As such, also disable lockdep to hide warnings, and avoid disabling
+ * lockdep for the rest of the kernel.
+ */
+ lockdep_off();
+
+ pr_err("==================================================================\n");
+ /* Print report header. */
+ switch (type) {
+ case KFENCE_ERROR_OOB:
+ pr_err("BUG: KFENCE: out-of-bounds in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Out-of-bounds access at 0x" PTR_FMT " (%s of kfence-#%zd):\n",
+ (void *)address, address < meta->addr ? "left" : "right",
+ meta - kfence_metadata);
+ break;
+ case KFENCE_ERROR_UAF:
+ pr_err("BUG: KFENCE: use-after-free in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Use-after-free access at 0x" PTR_FMT ":\n", (void *)address);
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Detected corrupted memory at 0x" PTR_FMT " ", (void *)address);
+ print_diff_canary((u8 *)address, 16);
+ pr_cont(":\n");
+ break;
+ case KFENCE_ERROR_INVALID:
+ pr_err("BUG: KFENCE: invalid access in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Invalid access at 0x" PTR_FMT ":\n", (void *)address);
+ break;
+ case KFENCE_ERROR_INVALID_FREE:
+ pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Invalid free of 0x" PTR_FMT ":\n", (void *)address);
+ break;
+ }
+
+ /* Print stack trace and object info. */
+ stack_trace_print(stack_entries + skipnr, num_stack_entries - skipnr, 0);
+
+ if (meta) {
+ pr_err("\n");
+ kfence_print_object(NULL, meta);
+ }
+
+ /* Print report footer. */
+ pr_err("\n");
+ dump_stack_print_info(KERN_DEFAULT);
+ pr_err("==================================================================\n");
+
+ lockdep_on();
+
+ if (panic_on_warn)
+ panic("panic_on_warn set ...\n");
+
+ /* We encountered a memory unsafety error, taint the kernel! */
+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+}
--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:16 AM

From: Alexander Potapenko <gli...@google.com>

Add architecture-specific implementation details for KFENCE and enable
KFENCE for the x86 architecture. In particular, this implements the
required interface in <asm/kfence.h> for setting up the pool and
providing helper functions for protecting and unprotecting pages.

For x86, we need to ensure that the pool uses 4K pages, which is done
using the set_memory_4k() helper function.
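
As a rough user-space illustration of the protect/unprotect helper (not
the kernel code itself), the sketch below toggles a "present" bit in a
mock page-table entry. The names mock_pte_t, MOCK_PAGE_PRESENT,
mock_level and mock_protect_page() are hypothetical stand-ins chosen for
this example. The real helper also flushes the TLB entry for the page,
which a user-space model cannot show.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for pte_t and _PAGE_PRESENT; illustration only. */
typedef uint64_t mock_pte_t;
#define MOCK_PAGE_PRESENT ((mock_pte_t)1 << 0)

/* Mock mapping levels: only 4K mappings can be toggled page by page. */
enum mock_level { MOCK_LEVEL_4K, MOCK_LEVEL_2M };

/*
 * Clear the present bit to protect the page (any access would then fault),
 * or set it again to unprotect. Returns false if there is no entry or the
 * mapping is not a 4K page, mirroring the level check in the patch.
 */
static bool mock_protect_page(mock_pte_t *pte, enum mock_level level, bool protect)
{
	if (!pte || level != MOCK_LEVEL_4K)
		return false;

	if (protect)
		*pte &= ~MOCK_PAGE_PRESENT;
	else
		*pte |= MOCK_PAGE_PRESENT;

	return true;
}
```

The level check is why the pool is split into 4K pages up front: a guard
page can only be toggled on its own if it has its own page-table entry.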

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

arch/x86/Kconfig | 2 ++
arch/x86/include/asm/kfence.h | 60 +++++++++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 4 +++
3 files changed, 66 insertions(+)
create mode 100644 arch/x86/include/asm/kfence.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..e22dc722698c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -144,6 +144,8 @@ config X86
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if X86_64
select HAVE_ARCH_KASAN_VMALLOC if X86_64
+ select HAVE_ARCH_KFENCE
+ select HAVE_ARCH_KFENCE_STATIC_POOL
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
new file mode 100644
index 000000000000..cf09e377faf9
--- /dev/null
+++ b/arch/x86/include/asm/kfence.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _ASM_X86_KFENCE_H
+#define _ASM_X86_KFENCE_H
+
+#include <linux/bug.h>
+#include <linux/kfence.h>
+
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+#include <asm/set_memory.h>
+#include <asm/tlbflush.h>
+
+/* The alignment should be at least a 4K page. */
+#define KFENCE_POOL_ALIGNMENT PAGE_SIZE
+
+/*
+ * The page fault handler entry function, up to which the stack trace is
+ * truncated in reports.
+ */
+#define KFENCE_SKIP_ARCH_FAULT_HANDLER "asm_exc_page_fault"
+
+/* Force 4K pages for __kfence_pool. */
+static inline bool arch_kfence_initialize_pool(void)
+{
+ unsigned long addr;
+
+ for (addr = (unsigned long)__kfence_pool; is_kfence_address((void *)addr);
+ addr += PAGE_SIZE) {
+ unsigned int level;
+
+ if (!lookup_address(addr, &level))
+ return false;
+
+ if (level != PG_LEVEL_4K)
+ set_memory_4k(addr, 1);
+ }
+
+ return true;
+}
+
+/* Protect the given page and flush TLBs. */
+static inline bool kfence_protect_page(unsigned long addr, bool protect)
+{
+ unsigned int level;
+ pte_t *pte = lookup_address(addr, &level);
+
+ if (!pte || level != PG_LEVEL_4K)
+ return false;
+
+ if (protect)
+ set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
+ else
+ set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
+
+ flush_tlb_one_kernel(addr);
+ return true;
+}
+
+#endif /* _ASM_X86_KFENCE_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6e3e8a124903..423e15ad5eb6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -9,6 +9,7 @@
#include <linux/kdebug.h> /* oops_begin/end, ... */
#include <linux/extable.h> /* search_exception_tables */
#include <linux/memblock.h> /* max_low_pfn */
+#include <linux/kfence.h> /* kfence_handle_page_fault */
#include <linux/kprobes.h> /* NOKPROBE_SYMBOL, ... */
#include <linux/mmiotrace.h> /* kmmio_handler, ... */
#include <linux/perf_event.h> /* perf_sw_event */
@@ -701,6 +702,9 @@ no_context(struct pt_regs *regs, unsigned long error_code,
}
#endif

+ if (kfence_handle_page_fault(address))
+ return;
+
/*
* 32-bit:
*
--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:18 AM

Add architecture-specific implementation details for KFENCE and enable
KFENCE for the arm64 architecture. In particular, this implements the
required interface in <asm/kfence.h>. Currently, the arm64 version does
not yet use a statically allocated memory pool, at the cost of a pointer
load for each is_kfence_address().
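
To make the "pointer load" cost concrete, here is a small user-space
sketch under stated assumptions. It contrasts a pool whose address is a
link-time constant with one reached through a pointer that has to be
loaded on every check. All mock_* names and the MOCK_* sizes are
hypothetical; the pool layout (two leading guard pages, then one object
page plus one guard page per object) follows kfence_initialize_pool(),
but the range check is only one plausible shape for
is_kfence_address(), not a copy of it.

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE 4096UL
#define MOCK_NUM_OBJECTS 4UL
/* Two leading guard pages, then an object page and a guard page per object. */
#define MOCK_POOL_SIZE ((2 + 2 * MOCK_NUM_OBJECTS) * MOCK_PAGE_SIZE)

/* Static variant: the pool address is known at link time. */
static char mock_static_pool[MOCK_POOL_SIZE];

/* Dynamic variant: the pool address must be loaded from this pointer. */
static char *mock_dynamic_pool;

static bool mock_in_pool(uintptr_t base, const void *addr)
{
	/*
	 * Unsigned subtraction wraps around for addr < base, so a single
	 * comparison covers both the lower and the upper bound.
	 */
	return base && (uintptr_t)addr - base < MOCK_POOL_SIZE;
}

static bool mock_is_kfence_address_static(const void *addr)
{
	return mock_in_pool((uintptr_t)mock_static_pool, addr);
}

static bool mock_is_kfence_address_dynamic(const void *addr)
{
	/* Extra memory load of mock_dynamic_pool on every call. */
	return mock_in_pool((uintptr_t)mock_dynamic_pool, addr);
}
```

In the dynamic variant the check also has to reject every address while
the pool pointer is still NULL, which is why the base is tested first.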

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---
For ARM64, we would like to solicit feedback on what the best option is
to obtain a constant address for __kfence_pool. One option is to declare
a memory range in the memory layout to be dedicated to KFENCE (as is
done for KASAN); however, it is unclear whether this is the best
available option. We would like to avoid touching the memory layout.
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/kfence.h | 39 +++++++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 4 ++++
3 files changed, 44 insertions(+)
create mode 100644 arch/arm64/include/asm/kfence.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..1acc6b2877c3 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -132,6 +132,7 @@ config ARM64
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN
+ select HAVE_ARCH_KFENCE if (!ARM64_16K_PAGES && !ARM64_64K_PAGES)
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
new file mode 100644
index 000000000000..608dde80e5ca
--- /dev/null
+++ b/arch/arm64/include/asm/kfence.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_KFENCE_H
+#define __ASM_KFENCE_H
+
+#include <linux/kfence.h>
+#include <linux/log2.h>
+#include <linux/mm.h>
+
+#include <asm/cacheflush.h>
+
+#define KFENCE_SKIP_ARCH_FAULT_HANDLER "el1_sync"
+
+/*
+ * FIXME: Support HAVE_ARCH_KFENCE_STATIC_POOL: Use the statically allocated
+ * __kfence_pool, to avoid the extra pointer load for is_kfence_address(). By
+ * default, however, we do not have struct pages for static allocations.
+ */
+
+static inline bool arch_kfence_initialize_pool(void)
+{
+ const unsigned int num_pages = ilog2(roundup_pow_of_two(KFENCE_POOL_SIZE / PAGE_SIZE));
+ struct page *pages = alloc_pages(GFP_KERNEL, num_pages);
+
+ if (!pages)
+ return false;
+
+ __kfence_pool = page_address(pages);
+ return true;
+}
+
+static inline bool kfence_protect_page(unsigned long addr, bool protect)
+{
+ set_memory_valid(addr, 1, !protect);
+
+ return true;
+}
+
+#endif /* __ASM_KFENCE_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f07333e86c2f..d5b72ecbeeea 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -10,6 +10,7 @@
#include <linux/acpi.h>
#include <linux/bitfield.h>
#include <linux/extable.h>
+#include <linux/kfence.h>
#include <linux/signal.h>
#include <linux/mm.h>
#include <linux/hardirq.h>
@@ -310,6 +311,9 @@ static void __do_kernel_fault(unsigned long addr, unsigned int esr,
"Ignoring spurious kernel translation fault at virtual address %016lx\n", addr))
return;

+ if (kfence_handle_page_fault(addr))
+ return;
+
if (is_el1_permission_fault(addr, esr, regs)) {
if (esr & ESR_ELx_WNR)
msg = "write to read-only memory";
--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:20 AM

From: Alexander Potapenko <gli...@google.com>

Insert KFENCE hooks into the SLAB allocator.

We note the addition of the 'orig_size' argument to slab_alloc*()
functions, to be able to pass the originally requested size to KFENCE.
When KFENCE is disabled, there is no additional overhead, since these
functions are __always_inline.
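
The hook pattern can be sketched in user space as follows. This is only
a model under stated assumptions: the mock_* names, the single guarded
slot and the use of malloc()/free() as the "regular" allocator are all
hypothetical. It shows the two ideas from the diff: try the guarded
allocator first with the originally requested size and fall back to the
normal path when it returns NULL, and let the free path return early
when the object belongs to the guarded pool.

```c
#include <stddef.h>
#include <stdlib.h>

#define MOCK_PAGE_SIZE 4096

/* Hypothetical gate: non-zero means the next allocation may be sampled. */
static int mock_gate_open;
/* A single guarded slot stands in for the whole KFENCE pool. */
static char mock_guarded_slot[MOCK_PAGE_SIZE];

/* Return the guarded object for a sampled allocation, NULL otherwise. */
static void *mock_kfence_alloc(size_t orig_size)
{
	if (!mock_gate_open || orig_size > MOCK_PAGE_SIZE)
		return NULL;

	mock_gate_open = 0; /* One sample per gate opening. */
	return mock_guarded_slot;
}

/* Return non-zero if the object belonged to the guarded pool. */
static int mock_kfence_free(void *obj)
{
	return obj == (void *)mock_guarded_slot;
}

static void *mock_slab_alloc(size_t orig_size)
{
	/* Guarded allocator first; regular allocator as the fallback. */
	void *obj = mock_kfence_alloc(orig_size);

	return obj ? obj : malloc(orig_size);
}

static void mock_slab_free(void *obj)
{
	if (mock_kfence_free(obj))
		return; /* Guarded objects never reach the regular free path. */

	free(obj);
}
```

Passing the original size rather than the cache's object size matters
because the guarded allocator places the object at a page boundary, so
it needs to know how many bytes the caller actually asked for.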

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

mm/slab.c | 46 ++++++++++++++++++++++++++++++++++------------
mm/slab_common.c | 6 +++++-
2 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 3160dff6fd76..30aba06ae02b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -100,6 +100,7 @@
#include <linux/seq_file.h>
#include <linux/notifier.h>
#include <linux/kallsyms.h>
+#include <linux/kfence.h>
#include <linux/cpu.h>
#include <linux/sysctl.h>
#include <linux/module.h>
@@ -3206,7 +3207,7 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
}

static __always_inline void *
-slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_size,
unsigned long caller)
{
unsigned long save_flags;
@@ -3219,6 +3220,10 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
if (unlikely(!cachep))
return NULL;

+ ptr = kfence_alloc(cachep, orig_size, flags);
+ if (unlikely(ptr))
+ goto out_hooks;
+
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);

@@ -3251,6 +3256,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
memset(ptr, 0, cachep->object_size);

+out_hooks:
slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr);
return ptr;
}
@@ -3288,7 +3294,7 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
#endif /* CONFIG_NUMA */

static __always_inline void *
-slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
+slab_alloc(struct kmem_cache *cachep, gfp_t flags, size_t orig_size, unsigned long caller)
{
unsigned long save_flags;
void *objp;
@@ -3299,6 +3305,10 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
if (unlikely(!cachep))
return NULL;

+ objp = kfence_alloc(cachep, orig_size, flags);
+ if (unlikely(objp))
+ goto leave;
+
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);
objp = __do_cache_alloc(cachep, flags);
@@ -3309,6 +3319,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
memset(objp, 0, cachep->object_size);

+leave:
slab_post_alloc_hook(cachep, objcg, flags, 1, &objp);
return objp;
}
@@ -3414,6 +3425,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp,
unsigned long caller)
{
+ if (kfence_free(objp)) {
+ kmemleak_free_recursive(objp, cachep->flags);
+ return;
+ }
+
/* Put the object into the quarantine, don't touch it for now. */
if (kasan_slab_free(cachep, objp, _RET_IP_))
return;
@@ -3479,7 +3495,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
*/
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
- void *ret = slab_alloc(cachep, flags, _RET_IP_);
+ void *ret = slab_alloc(cachep, flags, cachep->object_size, _RET_IP_);

trace_kmem_cache_alloc(_RET_IP_, ret,
cachep->object_size, cachep->size, flags);
@@ -3512,7 +3528,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,

local_irq_disable();
for (i = 0; i < size; i++) {
- void *objp = __do_cache_alloc(s, flags);
+ void *objp = kfence_alloc(s, s->object_size, flags) ?: __do_cache_alloc(s, flags);

if (unlikely(!objp))
goto error;
@@ -3545,7 +3561,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
{
void *ret;

- ret = slab_alloc(cachep, flags, _RET_IP_);
+ ret = slab_alloc(cachep, flags, size, _RET_IP_);

ret = kasan_kmalloc(cachep, ret, size, flags);
trace_kmalloc(_RET_IP_, ret,
@@ -3571,7 +3587,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_trace);
*/
void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
- void *ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
+ void *ret = slab_alloc_node(cachep, flags, nodeid, cachep->object_size, _RET_IP_);

trace_kmem_cache_alloc_node(_RET_IP_, ret,
cachep->object_size, cachep->size,
@@ -3589,7 +3605,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep,
{
void *ret;

- ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
+ ret = slab_alloc_node(cachep, flags, nodeid, size, _RET_IP_);

ret = kasan_kmalloc(cachep, ret, size, flags);
trace_kmalloc_node(_RET_IP_, ret,
@@ -3650,7 +3666,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
cachep = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(cachep)))
return cachep;
- ret = slab_alloc(cachep, flags, caller);
+ ret = slab_alloc(cachep, flags, size, caller);

ret = kasan_kmalloc(cachep, ret, size, flags);
trace_kmalloc(caller, ret,
@@ -4138,18 +4154,24 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
bool to_user)
{
struct kmem_cache *cachep;
- unsigned int objnr;
+ unsigned int objnr = 0;
unsigned long offset;
+ bool is_kfence = is_kfence_address(ptr);

ptr = kasan_reset_tag(ptr);

/* Find and validate object. */
cachep = page->slab_cache;
- objnr = obj_to_index(cachep, page, (void *)ptr);
- BUG_ON(objnr >= cachep->num);
+ if (!is_kfence) {
+ objnr = obj_to_index(cachep, page, (void *)ptr);
+ BUG_ON(objnr >= cachep->num);
+ }

/* Find offset within object. */
- offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep);
+ if (is_kfence)
+ offset = ptr - kfence_object_start(ptr);
+ else
+ offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep);

/* Allow address range falling entirely within usercopy region. */
if (offset >= cachep->useroffset &&
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f9ccd5dc13f3..6e35e273681a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -12,6 +12,7 @@
#include <linux/memory.h>
#include <linux/cache.h>
#include <linux/compiler.h>
+#include <linux/kfence.h>
#include <linux/module.h>
#include <linux/cpu.h>
#include <linux/uaccess.h>
@@ -448,6 +449,9 @@ static int shutdown_cache(struct kmem_cache *s)
/* free asan quarantined objects */
kasan_cache_shutdown(s);

+ if (!kfence_shutdown_cache(s))
+ return -EBUSY;
+
if (__kmem_cache_shutdown(s) != 0)
return -EBUSY;

@@ -1171,7 +1175,7 @@ size_t ksize(const void *objp)
if (unlikely(ZERO_OR_NULL_PTR(objp)) || !__kasan_check_read(objp, 1))
return 0;

- size = __ksize(objp);
+ size = kfence_ksize(objp) ?: __ksize(objp);
/*
* We assume that ksize callers could use whole allocated area,
* so we need to unpoison this area.
--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:22 AM

From: Alexander Potapenko <gli...@google.com>

Inserts KFENCE hooks into the SLUB allocator.

We note the addition of the 'orig_size' argument to slab_alloc*()
functions, to be able to pass the originally requested size to KFENCE.
When KFENCE is disabled, there is no additional overhead, since these
functions are __always_inline.

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---
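The hook pattern described above (ask KFENCE first, fall back to the regular
allocator, and keep guarded objects away from the regular free path) can be
sketched in userspace roughly as follows. This is an illustration only: every
demo_* name is hypothetical, the "pool" is a single static slot, and malloc()
stands in for the SLUB fast path. It is not the KFENCE implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* All demo_* names are hypothetical stand-ins, not kernel symbols. */
static unsigned char demo_pool[64];	/* a single guarded slot */
static bool demo_sample_armed;		/* models the sampling gate */
static size_t demo_last_orig_size;	/* requested size recorded for the slot */

static bool demo_is_guarded(const void *p)
{
	/* Unsigned subtraction folds the lower and upper bound checks into one. */
	return (uintptr_t)p - (uintptr_t)demo_pool < sizeof(demo_pool);
}

/* Like kfence_alloc() in the patch: returns NULL unless a sample is due. */
static void *demo_guarded_alloc(size_t orig_size)
{
	if (!demo_sample_armed || orig_size > sizeof(demo_pool))
		return NULL;
	demo_sample_armed = false;		/* one guarded object per sample */
	demo_last_orig_size = orig_size;	/* originally requested size, not the cache size */
	return demo_pool;
}

/* Same shape as the hooked slab_alloc_node(): guarded pool first, then the regular path. */
static void *demo_alloc(size_t cache_object_size, size_t orig_size)
{
	void *object = demo_guarded_alloc(orig_size);

	return object ? object : malloc(cache_object_size);
}

/* Same shape as the hooked free path: guarded objects never reach the regular allocator. */
static void demo_free(void *p)
{
	if (demo_is_guarded(p))
		return;
	free(p);
}
```

Passing the requested size separately from the cache's object size is what lets
the guarded path record the exact bounds of the allocation (20 bytes rather
than 32 for a kmalloc-32 request, for instance).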

mm/slub.c | 72 ++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 53 insertions(+), 19 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d4177aecedf6..5c5a13a7857c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -27,6 +27,7 @@
#include <linux/ctype.h>
#include <linux/debugobjects.h>
#include <linux/kallsyms.h>
+#include <linux/kfence.h>
#include <linux/memory.h>
#include <linux/math64.h>
#include <linux/fault-inject.h>
@@ -1557,6 +1558,11 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
void *old_tail = *tail ? *tail : *head;
int rsize;

+ if (is_kfence_address(next)) {
+ slab_free_hook(s, next);
+ return true;
+ }
+
/* Head and tail of the reconstructed freelist */
*head = NULL;
*tail = NULL;
@@ -2660,7 +2666,8 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
* already disabled (which is the case for bulk allocation).
*/
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c)
+ unsigned long addr, struct kmem_cache_cpu *c,
+ size_t orig_size)
{
void *freelist;
struct page *page;
@@ -2763,7 +2770,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* cpu changes by refetching the per cpu area pointer.
*/
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c)
+ unsigned long addr, struct kmem_cache_cpu *c,
+ size_t orig_size)
{
void *p;
unsigned long flags;
@@ -2778,7 +2786,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c = this_cpu_ptr(s->cpu_slab);
#endif

- p = ___slab_alloc(s, gfpflags, node, addr, c);
+ p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
local_irq_restore(flags);
return p;
}
@@ -2805,7 +2813,7 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
* Otherwise we can simply pick the next object from the lockless free list.
*/
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
- gfp_t gfpflags, int node, unsigned long addr)
+ gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
{
void *object;
struct kmem_cache_cpu *c;
@@ -2816,6 +2824,11 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags);
if (!s)
return NULL;
+
+ object = kfence_alloc(s, orig_size, gfpflags);
+ if (unlikely(object))
+ goto out;
+
redo:
/*
* Must read kmem_cache cpu data via this cpu ptr. Preemption is
@@ -2853,7 +2866,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
object = c->freelist;
page = c->page;
if (unlikely(!object || !node_match(page, node))) {
- object = __slab_alloc(s, gfpflags, node, addr, c);
+ object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
stat(s, ALLOC_SLOWPATH);
} else {
void *next_object = get_freepointer_safe(s, object);
@@ -2889,20 +2902,21 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
memset(object, 0, s->object_size);

+out:
slab_post_alloc_hook(s, objcg, gfpflags, 1, &object);

return object;
}

static __always_inline void *slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, unsigned long addr)
+ gfp_t gfpflags, unsigned long addr, size_t orig_size)
{
- return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr);
+ return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr, orig_size);
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
- void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+ void *ret = slab_alloc(s, gfpflags, _RET_IP_, s->object_size);

trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size,
s->size, gfpflags);
@@ -2914,7 +2928,7 @@ EXPORT_SYMBOL(kmem_cache_alloc);
#ifdef CONFIG_TRACING
void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
- void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+ void *ret = slab_alloc(s, gfpflags, _RET_IP_, size);
trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags);
ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
@@ -2925,7 +2939,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_trace);
#ifdef CONFIG_NUMA
void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
{
- void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_);
+ void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_, s->object_size);

trace_kmem_cache_alloc_node(_RET_IP_, ret,
s->object_size, s->size, gfpflags, node);
@@ -2939,7 +2953,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
gfp_t gfpflags,
int node, size_t size)
{
- void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_);
+ void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_, size);

trace_kmalloc_node(_RET_IP_, ret,
size, s->size, gfpflags, node);
@@ -2973,6 +2987,9 @@ static void __slab_free(struct kmem_cache *s, struct page *page,

stat(s, FREE_SLOWPATH);

+ if (kfence_free(head))
+ return;
+
if (kmem_cache_debug(s) &&
!free_debug_processing(s, page, head, tail, cnt, addr))
return;
@@ -3216,6 +3233,13 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
df->s = cache_from_obj(s, object); /* Support for memcg */
}

+ if (is_kfence_address(object)) {
+ slab_free_hook(df->s, object);
+ WARN_ON(!kfence_free(object));
+ p[size] = NULL; /* mark object processed */
+ return size;
+ }
+
/* Start new detached freelist */
df->page = page;
set_freepointer(df->s, object, NULL);
@@ -3290,8 +3314,14 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);

for (i = 0; i < size; i++) {
- void *object = c->freelist;
+ void *object = kfence_alloc(s, s->object_size, flags);

+ if (unlikely(object)) {
+ p[i] = object;
+ continue;
+ }
+
+ object = c->freelist;
if (unlikely(!object)) {
/*
* We may have removed an object from c->freelist using
@@ -3307,7 +3337,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* of re-populating per CPU c->freelist
*/
p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
- _RET_IP_, c);
+ _RET_IP_, c, s->object_size);
if (unlikely(!p[i]))
goto error;

@@ -3962,7 +3992,7 @@ void *__kmalloc(size_t size, gfp_t flags)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc(s, flags, _RET_IP_);
+ ret = slab_alloc(s, flags, _RET_IP_, size);

trace_kmalloc(_RET_IP_, ret, size, s->size, flags);

@@ -4010,7 +4040,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc_node(s, flags, node, _RET_IP_);
+ ret = slab_alloc_node(s, flags, node, _RET_IP_, size);

trace_kmalloc_node(_RET_IP_, ret, size, s->size, flags, node);

@@ -4036,6 +4066,7 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
struct kmem_cache *s;
unsigned int offset;
size_t object_size;
+ bool is_kfence = is_kfence_address(ptr);

ptr = kasan_reset_tag(ptr);

@@ -4048,10 +4079,13 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
to_user, 0, n);

/* Find offset within object. */
- offset = (ptr - page_address(page)) % s->size;
+ if (is_kfence)
+ offset = ptr - kfence_object_start(ptr);
+ else
+ offset = (ptr - page_address(page)) % s->size;

/* Adjust for redzone and reject if within the redzone. */
- if (kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
+ if (!is_kfence && kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
if (offset < s->red_left_pad)
usercopy_abort("SLUB object in left red zone",
s->name, to_user, offset, n);
@@ -4460,7 +4494,7 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc(s, gfpflags, caller);
+ ret = slab_alloc(s, gfpflags, caller, size);

/* Honor the call site pointer we received. */
trace_kmalloc(caller, ret, size, s->size, gfpflags);
@@ -4491,7 +4525,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc_node(s, gfpflags, node, caller);
+ ret = slab_alloc_node(s, gfpflags, node, caller, size);

/* Honor the call site pointer we received. */
trace_kmalloc_node(caller, ret, size, s->size, gfpflags, node);
--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:26 AM

From: Alexander Potapenko <gli...@google.com>

We make KFENCE compatible with KASAN for testing KFENCE itself. In
particular, KASAN helps to catch any potential corruptions to KFENCE
state, or other corruptions that may be a result of freepointer
corruptions in the main allocators.

To indicate that the combination of the two is generally discouraged,
CONFIG_EXPERT=y should be set. It also gives us the nice property that
KFENCE will be build-tested by allyesconfig builds.

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

lib/Kconfig.kfence | 2 +-
mm/kasan/common.c | 7 +++++++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
index 7ac91162edb0..b080e49e15d4 100644
--- a/lib/Kconfig.kfence
+++ b/lib/Kconfig.kfence
@@ -10,7 +10,7 @@ config HAVE_ARCH_KFENCE_STATIC_POOL

menuconfig KFENCE
bool "KFENCE: low-overhead sampling-based memory safety error detector"
- depends on HAVE_ARCH_KFENCE && !KASAN && (SLAB || SLUB)
+ depends on HAVE_ARCH_KFENCE && (!KASAN || EXPERT) && (SLAB || SLUB)
depends on JUMP_LABEL # To ensure performance, require jump labels
select STACKTRACE
help
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 950fd372a07e..f5c49f0fdeff 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -18,6 +18,7 @@
#include <linux/init.h>
#include <linux/kasan.h>
#include <linux/kernel.h>
+#include <linux/kfence.h>
#include <linux/kmemleak.h>
#include <linux/linkage.h>
#include <linux/memblock.h>
@@ -396,6 +397,9 @@ static bool __kasan_slab_free(struct kmem_cache *cache, void *object,
tagged_object = object;
object = reset_tag(object);

+ if (is_kfence_address(object))
+ return false;
+
if (unlikely(nearest_obj(cache, virt_to_head_page(object), object) !=
object)) {
kasan_report_invalid_free(tagged_object, ip);
@@ -444,6 +448,9 @@ static void *__kasan_kmalloc(struct kmem_cache *cache, const void *object,
if (unlikely(object == NULL))
return NULL;

+ if (is_kfence_address(object))
+ return (void *)object;
+
redzone_start = round_up((unsigned long)(object + size),
KASAN_SHADOW_SCALE_SIZE);
redzone_end = round_up((unsigned long)object + cache->object_size,
--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:28 AM

From: Alexander Potapenko <gli...@google.com>

Add compatibility with KMEMLEAK, by making KMEMLEAK aware of the KFENCE
memory pool. This allows building debug kernels with both enabled, which
also helped in debugging KFENCE.

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---
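The diff below registers .bss with kmemleak as two ranges, one on each side of
the static KFENCE pool, instead of one range covering everything. The address
arithmetic can be sketched in isolation as follows. The struct and function
names are hypothetical, and the sketch assumes, as the patch does, that the
pool lies entirely inside the outer range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: a half-open address range [start, start + len). */
struct demo_range {
	uintptr_t start;
	size_t len;
};

/*
 * Split [outer_start, outer_end) around the hole
 * [hole_start, hole_start + hole_len).
 * Assumes the hole lies entirely inside the outer range; either output
 * range may have length zero if the hole touches an edge.
 */
static void demo_split_around(uintptr_t outer_start, uintptr_t outer_end,
			      uintptr_t hole_start, size_t hole_len,
			      struct demo_range out[2])
{
	uintptr_t hole_end = hole_start + hole_len;

	out[0].start = outer_start;		/* part before the hole */
	out[0].len = hole_start - outer_start;
	out[1].start = hole_end;		/* part after the hole */
	out[1].len = outer_end - hole_end;
}
```

In the patch, the outer range corresponds to __bss_start..__bss_stop and the
hole to __kfence_pool with length KFENCE_POOL_SIZE.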

mm/kmemleak.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5e252d91eb14..2809c25c0a88 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -97,6 +97,7 @@
#include <linux/atomic.h>

#include <linux/kasan.h>
+#include <linux/kfence.h>
#include <linux/kmemleak.h>
#include <linux/memory_hotplug.h>

@@ -1946,8 +1947,18 @@ void __init kmemleak_init(void)
/* register the data/bss sections */
create_object((unsigned long)_sdata, _edata - _sdata,
KMEMLEAK_GREY, GFP_ATOMIC);
+#if defined(CONFIG_KFENCE) && defined(CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL)
+ /* KFENCE objects are located in .bss, which may confuse kmemleak. Skip them. */
+ create_object((unsigned long)__bss_start, __kfence_pool - __bss_start,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+ create_object((unsigned long)__kfence_pool + KFENCE_POOL_SIZE,
+ __bss_stop - (__kfence_pool + KFENCE_POOL_SIZE),
+ KMEMLEAK_GREY, GFP_ATOMIC);
+#else
create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
KMEMLEAK_GREY, GFP_ATOMIC);
+#endif
+
/* only register .data..ro_after_init if not within .data */
if (&__start_ro_after_init < &_sdata || &__end_ro_after_init > &_edata)
create_object((unsigned long)__start_ro_after_init,
--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:30 AM

Lockdep checks that dynamic key registration is only performed on keys
that are not static objects. With KFENCE, it is possible that such a
dynamically allocated key is a KFENCE object which may, however, be
allocated from a static memory pool (if HAVE_ARCH_KFENCE_STATIC_POOL).

Therefore, ignore KFENCE-allocated objects in static_obj().

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---
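The effect of the early return added to static_obj() can be illustrated with a
userspace sketch: an address inside a static image counts as static unless it
falls inside the pool carved out of that image. All demo_* names and sizes are
hypothetical, and the single-comparison range check is only one plausible way
to test pool membership; this series does not show how is_kfence_address() is
implemented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_POOL_SIZE 4096u

/* Hypothetical static image; the pool sits in its middle third. */
static unsigned char demo_image[3 * DEMO_POOL_SIZE];
static unsigned char *const demo_pool = demo_image + DEMO_POOL_SIZE;

/* True if p lies in [start, start + len); unsigned wraparound rejects p < start. */
static bool demo_in_range(const void *p, const void *start, size_t len)
{
	return (uintptr_t)p - (uintptr_t)start < len;
}

static bool demo_is_pool_address(const void *p)
{
	return demo_in_range(p, demo_pool, DEMO_POOL_SIZE);
}

/* Mirrors the patched static_obj(): pool addresses are reported as not static. */
static bool demo_static_obj(const void *p)
{
	if (demo_is_pool_address(p))
		return false;
	return demo_in_range(p, demo_image, sizeof(demo_image));
}
```

Checking pool membership before the generic "is it static?" test keeps the
special case local to one function, which matches the shape of the change
below.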

kernel/locking/lockdep.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 54b74fabf40c..0cf5d5ecbd31 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -38,6 +38,7 @@
#include <linux/seq_file.h>
#include <linux/spinlock.h>
#include <linux/kallsyms.h>
+#include <linux/kfence.h>
#include <linux/interrupt.h>
#include <linux/stacktrace.h>
#include <linux/debug_locks.h>
@@ -755,6 +756,13 @@ static int static_obj(const void *obj)
if (arch_is_kernel_initmem_freed(addr))
return 0;

+ /*
+ * KFENCE objects may be allocated from a static memory pool, but are
+ * not actually static objects.
+ */
+ if (is_kfence_address(obj))
+ return 0;
+
/*
* static variable?
*/
--
2.28.0.526.ge36021eeef-goog

Marco Elver

Sep 7, 2020, 9:41:33 AM

Add KFENCE documentation in dev-tools/kfence.rst, and add to index.

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---
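Two pieces of arithmetic from the documentation added below can be checked
with a few lines of C: the pool size formula, and where a right-aligned object
lands on its page. The function names are hypothetical, and rounding the
object start down to a power-of-two alignment is an assumption made for this
sketch; the documentation only states that objects sit at the left or right
page boundary with a pattern redzone on the unguarded side.

```c
#include <stddef.h>

/* Pool size from the documentation formula: (#objects + 1) * 2 * PAGE_SIZE. */
static size_t demo_pool_bytes(size_t num_objects, size_t page_size)
{
	return (num_objects + 1) * 2 * page_size;
}

/*
 * Offset within the object page of a right-aligned object.
 * align must be a power of two; rounding down to it is an assumption.
 */
static size_t demo_right_offset(size_t size, size_t page_size, size_t align)
{
	return (page_size - size) & ~(align - 1);
}

/* Bytes between the end of the object and the end of the page (the redzone). */
static size_t demo_redzone_bytes(size_t size, size_t page_size, size_t align)
{
	return page_size - (demo_right_offset(size, page_size, align) + size);
}
```

With the default of 255 objects and 4 KiB pages, demo_pool_bytes() gives
2097152 bytes, i.e. the 2 MiB quoted in the documentation. With an assumed
alignment of 8, a 73-byte object lands at page offset 0xfb0 and leaves 7
redzone bytes, which is consistent with the kfence-#69 example report
(object at ...7fb0-...7ff8, corruption detected at ...7ff9).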

Documentation/dev-tools/index.rst | 1 +
Documentation/dev-tools/kfence.rst | 285 +++++++++++++++++++++++++++++
2 files changed, 286 insertions(+)
create mode 100644 Documentation/dev-tools/kfence.rst

diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index f7809c7b1ba9..1b1cf4f5c9d9 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -22,6 +22,7 @@ whole; patches welcome!
ubsan
kmemleak
kcsan
+ kfence
gdb-kernel-debugging
kgdb
kselftest
diff --git a/Documentation/dev-tools/kfence.rst b/Documentation/dev-tools/kfence.rst
new file mode 100644
index 000000000000..254f4f089104
--- /dev/null
+++ b/Documentation/dev-tools/kfence.rst
@@ -0,0 +1,285 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Kernel Electric-Fence (KFENCE)
+==============================
+
+Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety
+error detector. KFENCE detects heap out-of-bounds access, use-after-free, and
+invalid-free errors.
+
+KFENCE is designed to be enabled in production kernels, and has near zero
+performance overhead. Compared to KASAN, KFENCE trades precision for
+performance. The main motivation behind KFENCE's design is that with enough
+total uptime KFENCE will detect bugs in code paths not typically exercised by
+non-production test workloads. One way to quickly achieve a large enough total
+uptime is to deploy the tool across a large fleet of machines.
+
+Usage
+-----
+
+To enable KFENCE, configure the kernel with::
+
+ CONFIG_KFENCE=y
+
+KFENCE provides several other configuration options to customize behaviour (see
+the respective help text in ``lib/Kconfig.kfence`` for more info).
+
+Tuning performance
+~~~~~~~~~~~~~~~~~~
+
+The most important parameter is KFENCE's sample interval, which can be set via
+the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
+sample interval determines the frequency with which heap allocations will be
+guarded by KFENCE. The default is configurable via the Kconfig option
+``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
+disables KFENCE.
+
+With the Kconfig option ``CONFIG_KFENCE_NUM_OBJECTS`` (default 255), the number
+of available guarded objects can be controlled. Each object requires 2 pages,
+one for the object itself and the other one used as a guard page; object pages
+are interleaved with guard pages, and every object page is therefore surrounded
+by two guard pages.
+
+The total memory dedicated to the KFENCE memory pool can be computed as::
+
+ ( #objects + 1 ) * 2 * PAGE_SIZE
+
+Using the default config, and assuming a page size of 4 KiB, results in
+dedicating 2 MiB to the KFENCE memory pool.
+
+Error reports
+~~~~~~~~~~~~~
+
+A typical out-of-bounds access looks like this::
+
+ ==================================================================
+ BUG: KFENCE: out-of-bounds in test_out_of_bounds_read+0xa3/0x22b
+
+ Out-of-bounds access at 0xffffffffb672efff (left of kfence-#17):
+ test_out_of_bounds_read+0xa3/0x22b
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated in:
+ __kfence_alloc+0x42d/0x4c0
+ __kmalloc+0x133/0x200
+ test_alloc+0xf3/0x25b
+ test_out_of_bounds_read+0x98/0x22b
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+The header of the report provides a short summary of the function involved in
+the access. It is followed by more detailed information about the access and
+its origin.
+
+Use-after-free accesses are reported as::
+
+ ==================================================================
+ BUG: KFENCE: use-after-free in test_use_after_free_read+0xb3/0x143
+
+ Use-after-free access at 0xffffffffb673dfe0:
+ test_use_after_free_read+0xb3/0x143
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated in:
+ __kfence_alloc+0x277/0x4c0
+ __kmalloc+0x133/0x200
+ test_alloc+0xf3/0x25b
+ test_use_after_free_read+0x76/0x143
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+ freed in:
+ kfence_guarded_free+0x158/0x380
+ __kfence_free+0x38/0xc0
+ test_use_after_free_read+0xa8/0x143
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+KFENCE also reports on invalid frees, such as double-frees::
+
+ ==================================================================
+ BUG: KFENCE: invalid free in test_double_free+0xdc/0x171
+
+ Invalid free of 0xffffffffb6741000:
+ test_double_free+0xdc/0x171
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated in:
+ __kfence_alloc+0x42d/0x4c0
+ __kmalloc+0x133/0x200
+ test_alloc+0xf3/0x25b
+ test_double_free+0x76/0x171
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+ freed in:
+ kfence_guarded_free+0x158/0x380
+ __kfence_free+0x38/0xc0
+ test_double_free+0xa8/0x171
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+KFENCE also uses pattern-based redzones on the other side of an object's guard
+page, to detect out-of-bounds writes on the unprotected side of the object.
+These are reported on frees::
+
+ ==================================================================
+ BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184
+
+ Detected corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . ]:
+ test_kmalloc_aligned_oob_write+0xef/0x184
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated in:
+ __kfence_alloc+0x277/0x4c0
+ __kmalloc+0x133/0x200
+ test_alloc+0xf3/0x25b
+ test_kmalloc_aligned_oob_write+0x57/0x184
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+For such errors, the address where the corruption occurred as well as the
+corrupted bytes are shown.
+
+And finally, KFENCE may also report on invalid accesses to any protected page
+where it was not possible to determine an associated object, e.g. if adjacent
+object pages had not yet been allocated::
+
+ ==================================================================
+ BUG: KFENCE: invalid access in test_invalid_access+0x26/0xe0
+
+ Invalid access at 0xffffffffb670b00a:
+ test_invalid_access+0x26/0xe0
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+DebugFS interface
+~~~~~~~~~~~~~~~~~
+
+Some debugging information is exposed via debugfs:
+
+* The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics.
+
+* The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects
+ allocated via KFENCE, including those already freed but protected.
+
+Implementation Details
+----------------------
+
+Guarded allocations are set up based on the sample interval. After expiration
+of the sample interval, the next allocation through the main allocator (SLAB or
+SLUB) returns a guarded allocation from the KFENCE object pool. The timer is
+then reset, and the next allocation is set up after the interval expires again.
+To "gate" a KFENCE allocation through the main allocator's fast-path without
+overhead, KFENCE relies on static branches via the static keys infrastructure.
+The static branch is toggled to redirect the allocation to KFENCE.
+
+KFENCE objects each reside on a dedicated page, at either the left or right
+page boundary, selected at random. The pages to the left and right of the
+object page are "guard pages", whose attributes are changed to a protected
+state, and cause page faults on any attempted access. Such page faults are then
+intercepted by KFENCE, which handles the fault gracefully by reporting an
+out-of-bounds access. The side opposite of an object's guard page is used as a
+pattern-based redzone, to detect out-of-bounds writes on the unprotected side of
+the object on frees (for special alignment and size combinations, both sides of
+the object are redzoned).
+
+KFENCE also uses pattern-based redzones on the other side of an object's guard
+page, to detect out-of-bounds writes on the unprotected side of the object;
+these are reported on frees.
+
+The following figure illustrates the page layout::
+
+ ---+-----------+-----------+-----------+-----------+-----------+---
+ | xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx |
+ | xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx |
+ | x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x |
+ | xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx |
+ | xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx |
+ | xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx |
+ ---+-----------+-----------+-----------+-----------+-----------+---
+
+Upon deallocation of a KFENCE object, the object's page is again protected and
+the object is marked as freed. Any further access to the object causes a fault
+and KFENCE reports a use-after-free access. Freed objects are inserted at the
+tail of KFENCE's freelist, so that the least recently freed objects are reused
+first, and the chances of detecting use-after-frees of recently freed objects
+are increased.
+
+Interface
+---------
+
+The following describes the functions which are used by allocators as well as
+page handling code to set up and deal with KFENCE allocations.
+
+.. kernel-doc:: include/linux/kfence.h
+   :functions: is_kfence_address
+               kfence_shutdown_cache
+               kfence_alloc kfence_free
+               kfence_ksize kfence_object_start
+               kfence_handle_page_fault
+
+Related Tools
+-------------
+
+In userspace, a similar approach is taken by `GWP-ASan
+<http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and
+a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is
+directly influenced by GWP-ASan, and can be seen as its kernel sibling. Another
+similar but non-sampling approach, that also inspired the name "KFENCE", can be
+found in the userspace `Electric Fence Malloc Debugger
+<https://linux.die.net/man/3/efence>`_.
+
+In the kernel, several tools exist to debug memory access errors, and in
+particular KASAN can detect all bug classes that KFENCE can detect. While KASAN
+is more precise, relying on compiler instrumentation, this comes at a
+performance cost. We want to highlight that KASAN and KFENCE are complementary,
+with different target environments. For instance, KASAN is the better
+debugging-aid, where a simple reproducer exists: due to the lower chance to
+detect the error, it would require more effort using KFENCE to debug.
+Deployments at scale, however, would benefit from using KFENCE to discover bugs
+due to code paths not exercised by test cases or fuzzers.
--
2.28.0.526.ge36021eeef-goog
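
The page layout described in the documentation patch above (an object page with an inaccessible guard page on each side, and the object placed at the left or right page boundary) can be approximated in userspace. The following is only a sketch under stated assumptions: it uses mmap() and mprotect() in place of kernel page-attribute changes, and the names guarded_slot, guarded_slot_init(), guarded_slot_place(), and guarded_slot_destroy() are hypothetical and do not appear in the series.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical userspace stand-in for one slot: guard | object page | guard. */
struct guarded_slot {
	char *base;        /* Start of the three-page mapping. */
	size_t page_size;  /* Size of one page in bytes. */
};

/* Map three pages and make the outer two inaccessible. Returns 0 on success. */
static int guarded_slot_init(struct guarded_slot *slot)
{
	long ret = sysconf(_SC_PAGESIZE);
	size_t ps;
	char *base;

	if (ret <= 0)
		return -1;
	ps = (size_t)ret;

	base = mmap(NULL, 3 * ps, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* Any access to the first or third page now faults. */
	if (mprotect(base, ps, PROT_NONE) || mprotect(base + 2 * ps, ps, PROT_NONE)) {
		munmap(base, 3 * ps);
		return -1;
	}

	slot->base = base;
	slot->page_size = ps;
	return 0;
}

/* Place an object of @size bytes at the left or right edge of the middle page. */
static char *guarded_slot_place(const struct guarded_slot *slot, size_t size, int right)
{
	char *page;

	assert(slot && slot->base);
	if (size == 0 || size > slot->page_size)
		return NULL;

	page = slot->base + slot->page_size;
	return right ? page + slot->page_size - size : page;
}

/* Unmap the whole slot, guard pages included. */
static void guarded_slot_destroy(struct guarded_slot *slot)
{
	munmap(slot->base, 3 * slot->page_size);
	slot->base = NULL;
}
```

With a right-placed object, reading one byte past the end lands on the third page and faults immediately; with a left-placed object, the same holds for one byte before the start. The sketch ignores alignment requirements, which is why the documentation mentions redzones for the side that cannot touch a guard page.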

Marco Elver

Sep 7, 2020, 9:41:36 AM9/7/20

to el...@google.com, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Add KFENCE test suite, testing various error detection scenarios. Makes
use of KUnit for test organization. Since KFENCE's interface to obtain
error reports is via the console, the test verifies that KFENCE outputs
expected reports to the console.

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---

lib/Kconfig.kfence | 12 +
mm/kfence/Makefile | 3 +
mm/kfence/kfence-test.c | 777 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 792 insertions(+)
create mode 100644 mm/kfence/kfence-test.c

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
index b080e49e15d4..c5e3ef87aa67 100644
--- a/lib/Kconfig.kfence
+++ b/lib/Kconfig.kfence
@@ -55,4 +55,16 @@ config KFENCE_FAULT_INJECTION
this option is to stress-test KFENCE with concurrent error reports
and allocations/frees. A value of 0 disables fault injection.

+config KFENCE_TEST
+ tristate "KFENCE test suite"
+ depends on TRACEPOINTS && KUNIT
+ help
+ Test suite for KFENCE, testing various error detection scenarios with
+ various allocation types, and checking that reports are correctly
+ output to console.
+
+ Say Y here if you want the test to be built into the kernel and run
+ during boot; say M if you want the test to build as a module; say N
+ if you are unsure.
+
endif # KFENCE
diff --git a/mm/kfence/Makefile b/mm/kfence/Makefile
index d991e9a349f0..0ce5f772f9b3 100644
--- a/mm/kfence/Makefile
+++ b/mm/kfence/Makefile
@@ -1,3 +1,6 @@
# SPDX-License-Identifier: GPL-2.0

obj-$(CONFIG_KFENCE) := core.o report.o
+
+CFLAGS_kfence-test.o := -g -fno-omit-frame-pointer -fno-optimize-sibling-calls
+obj-$(CONFIG_KFENCE_TEST) += kfence-test.o
diff --git a/mm/kfence/kfence-test.c b/mm/kfence/kfence-test.c
new file mode 100644
index 000000000000..6c6b713638de
--- /dev/null
+++ b/mm/kfence/kfence-test.c
@@ -0,0 +1,777 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test cases for KFENCE memory safety error detector. Since the interface with
+ * which KFENCE's reports are obtained is via the console, this is the output we
+ * should verify. Each test case checks the presence (or absence) of generated
+ * reports. Relies on the 'console' tracepoint to capture reports as they appear
+ * in the kernel log.
+ *
+ * Copyright (C) 2020, Google LLC.
+ * Author: Alexander Potapenko <gli...@google.com>
+ * Marco Elver <el...@google.com>
+ */
+
+#include <kunit/test.h>
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/kfence.h>
+#include <linux/mm.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/tracepoint.h>
+#include <trace/events/printk.h>
+
+#include "kfence.h"
+
+/* Report as observed from console. */
+static struct {
+ spinlock_t lock;
+ int nlines;
+ char lines[2][512];
+} observed = {
+ .lock = __SPIN_LOCK_UNLOCKED(observed.lock),
+};
+
+/* Probe for console output: obtains observed lines of interest. */
+static void probe_console(void *ignore, const char *buf, size_t len)
+{
+ unsigned long flags;
+ int nlines;
+
+ spin_lock_irqsave(&observed.lock, flags);
+ nlines = observed.nlines;
+
+ if (strnstr(buf, "BUG: KFENCE: ", len) && strnstr(buf, "test_", len)) {
+ /*
+ * KFENCE report and related to the test.
+ *
+ * The provided @buf is not NUL-terminated; copy no more than
+ * @len bytes and let strscpy() add the missing NUL-terminator.
+ */
+ strscpy(observed.lines[0], buf, min(len + 1, sizeof(observed.lines[0])));
+ nlines = 1;
+ } else if (nlines == 1 && (strnstr(buf, "at 0x", len) || strnstr(buf, "of 0x", len))) {
+ strscpy(observed.lines[nlines++], buf, min(len + 1, sizeof(observed.lines[0])));
+ }
+
+ WRITE_ONCE(observed.nlines, nlines); /* Publish new nlines. */
+ spin_unlock_irqrestore(&observed.lock, flags);
+}
+
+/* Check if a report related to the test exists. */
+static bool report_available(void)
+{
+ return READ_ONCE(observed.nlines) == ARRAY_SIZE(observed.lines);
+}
+
+/* Information we expect in a report. */
+struct expect_report {
+ enum kfence_error_type type; /* The type of error. */
+ void *fn; /* Function pointer to expected function where access occurred. */
+ char *addr; /* Address at which the bad access occurred. */
+};
+
+/* Check observed report matches information in @r. */
+static bool report_matches(const struct expect_report *r)
+{
+ bool ret = false;
+ unsigned long flags;
+ typeof(observed.lines) expect;
+ const char *end;
+ char *cur;
+
+ /* Double-checked locking. */
+ if (!report_available())
+ return false;
+
+ /* Generate expected report contents. */
+
+ /* Title */
+ cur = expect[0];
+ end = &expect[0][sizeof(expect[0]) - 1];
+ switch (r->type) {
+ case KFENCE_ERROR_OOB:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: out-of-bounds");
+ break;
+ case KFENCE_ERROR_UAF:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: use-after-free");
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: memory corruption");
+ break;
+ case KFENCE_ERROR_INVALID:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid access");
+ break;
+ case KFENCE_ERROR_INVALID_FREE:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid free");
+ break;
+ }
+
+ scnprintf(cur, end - cur, " in %pS", r->fn);
+ /* The exact offset won't match, remove it; also strip module name. */
+ cur = strchr(expect[0], '+');
+ if (cur)
+ *cur = '\0';
+
+ /* Access information */
+ cur = expect[1];
+ end = &expect[1][sizeof(expect[1]) - 1];
+
+ switch (r->type) {
+ case KFENCE_ERROR_OOB:
+ cur += scnprintf(cur, end - cur, "Out-of-bounds access at");
+ break;
+ case KFENCE_ERROR_UAF:
+ cur += scnprintf(cur, end - cur, "Use-after-free access at");
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ cur += scnprintf(cur, end - cur, "Detected corrupted memory at");
+ break;
+ case KFENCE_ERROR_INVALID:
+ cur += scnprintf(cur, end - cur, "Invalid access at");
+ break;
+ case KFENCE_ERROR_INVALID_FREE:
+ cur += scnprintf(cur, end - cur, "Invalid free of");
+ break;
+ }
+
+ cur += scnprintf(cur, end - cur, " 0x" PTR_FMT, (void *)r->addr);
+
+ spin_lock_irqsave(&observed.lock, flags);
+ if (!report_available())
+ goto out; /* A new report is being captured. */
+
+ /* Finally match expected output to what we actually observed. */
+ ret = strstr(observed.lines[0], expect[0]) && strstr(observed.lines[1], expect[1]);
+out:
+ spin_unlock_irqrestore(&observed.lock, flags);
+ return ret;
+}
+
+/* ===== Test cases ===== */
+
+#define TEST_PRIV_WANT_MEMCACHE ((void *)1)
+
+/* Cache used by tests; if NULL, allocate from kmalloc instead. */
+static struct kmem_cache *test_cache;
+
+static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t flags,
+ void (*ctor)(void *))
+{
+ if (test->priv != TEST_PRIV_WANT_MEMCACHE)
+ return size;
+
+ kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor);
+
+ /*
+ * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any
+ * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to
+ * allocate via memcg, if enabled.
+ */
+ flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT;
+ test_cache = kmem_cache_create("test", size, 1, flags, ctor);
+ KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache");
+
+ return size;
+}
+
+static void test_cache_destroy(void)
+{
+ if (!test_cache)
+ return;
+
+ kmem_cache_destroy(test_cache);
+ test_cache = NULL;
+}
+
+static inline size_t kmalloc_cache_alignment(size_t size)
+{
+ return kmalloc_caches[kmalloc_type(GFP_KERNEL)][kmalloc_index(size)]->align;
+}
+
+/* Must always inline to match stack trace against caller. */
+static __always_inline void test_free(void *ptr)
+{
+ if (test_cache)
+ kmem_cache_free(test_cache, ptr);
+ else
+ kfree(ptr);
+}
+
+/*
+ * If this should be a KFENCE allocation, and on which side the allocation and
+ * the closest guard page should be.
+ */
+enum allocation_policy {
+ ALLOCATE_ANY, /* KFENCE, any side. */
+ ALLOCATE_LEFT, /* KFENCE, left side of page. */
+ ALLOCATE_RIGHT, /* KFENCE, right side of page. */
+ ALLOCATE_NONE, /* No KFENCE allocation. */
+};
+
+/*
+ * Try to get a guarded allocation from KFENCE. Uses either kmalloc() or the
+ * current test_cache if set up.
+ */
+static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocation_policy policy)
+{
+ void *alloc;
+ unsigned long timeout, resched_after;
+ const char *policy_name;
+
+ switch (policy) {
+ case ALLOCATE_ANY:
+ policy_name = "any";
+ break;
+ case ALLOCATE_LEFT:
+ policy_name = "left";
+ break;
+ case ALLOCATE_RIGHT:
+ policy_name = "right";
+ break;
+ case ALLOCATE_NONE:
+ policy_name = "none";
+ break;
+ }
+
+ kunit_info(test, "%s: size=%zu, gfp=%x, policy=%s, cache=%i\n", __func__, size, gfp,
+ policy_name, !!test_cache);
+
+ /*
+ * 100x the sample interval should be more than enough to ensure we get
+ * a KFENCE allocation eventually.
+ */
+ timeout = jiffies + msecs_to_jiffies(100 * CONFIG_KFENCE_SAMPLE_INTERVAL);
+ /*
+ * Especially for non-preemption kernels, ensure the allocation-gate
+ * timer has time to catch up.
+ */
+ resched_after = jiffies + msecs_to_jiffies(CONFIG_KFENCE_SAMPLE_INTERVAL);
+ do {
+ if (test_cache)
+ alloc = kmem_cache_alloc(test_cache, gfp);
+ else
+ alloc = kmalloc(size, gfp);
+
+ if (is_kfence_address(alloc)) {
+ if (policy == ALLOCATE_ANY)
+ return alloc;
+ if (policy == ALLOCATE_LEFT && IS_ALIGNED((unsigned long)alloc, PAGE_SIZE))
+ return alloc;
+ if (policy == ALLOCATE_RIGHT &&
+ !IS_ALIGNED((unsigned long)alloc, PAGE_SIZE))
+ return alloc;
+ } else if (policy == ALLOCATE_NONE)
+ return alloc;
+
+ test_free(alloc);
+
+ if (time_after(jiffies, resched_after))
+ cond_resched();
+ } while (time_before(jiffies, timeout));
+
+ KUNIT_ASSERT_TRUE_MSG(test, false, "failed to allocate from KFENCE");
+ return NULL; /* Unreachable. */
+}
+
+static void test_out_of_bounds_read(struct kunit *test)
+{
+ size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_OOB,
+ .fn = test_out_of_bounds_read,
+ };
+ char *buf;
+
+ setup_test_cache(test, size, 0, NULL);
+
+ /*
+ * If we don't have our own cache, adjust based on alignment, so that we
+ * actually access guard pages on either side.
+ */
+ if (!test_cache)
+ size = kmalloc_cache_alignment(size);
+
+ /* Test both sides. */
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT);
+ expect.addr = buf - 1;
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+ test_free(buf);
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
+ expect.addr = buf + size;
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+ test_free(buf);
+}
+
+static void test_use_after_free_read(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_UAF,
+ .fn = test_use_after_free_read,
+ };
+
+ setup_test_cache(test, size, 0, NULL);
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ test_free(expect.addr);
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+static void test_double_free(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_INVALID_FREE,
+ .fn = test_double_free,
+ };
+
+ setup_test_cache(test, size, 0, NULL);
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ test_free(expect.addr);
+ test_free(expect.addr); /* Double-free. */
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+static void test_invalid_addr_free(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_INVALID_FREE,
+ .fn = test_invalid_addr_free,
+ };
+ char *buf;
+
+ setup_test_cache(test, size, 0, NULL);
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ expect.addr = buf + 1; /* Free on invalid address. */
+ test_free(expect.addr); /* Invalid address free. */
+ test_free(buf); /* No error. */
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * KFENCE is unable to detect an OOB if the allocation's alignment requirements
+ * leave a gap between the object and the guard page. Specifically, an
+ * allocation of e.g. 73 bytes is aligned on 8 and 128 bytes for SLUB or SLAB
+ * respectively. Therefore it is impossible for the allocated object to adhere
+ * to either of the page boundaries.
+ *
+ * However, we test that an access to memory beyond the gap results in KFENCE
+ * detecting an OOB access.
+ */
+static void test_kmalloc_aligned_oob_read(struct kunit *test)
+{
+ const size_t size = 73;
+ const size_t align = kmalloc_cache_alignment(size);
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_OOB,
+ .fn = test_kmalloc_aligned_oob_read,
+ };
+ char *buf;
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
+
+ /*
+ * The object is offset to the right, so there won't be an OOB to the
+ * left of it.
+ */
+ READ_ONCE(*(buf - 1));
+ KUNIT_EXPECT_FALSE(test, report_available());
+
+ /*
+ * @buf must be aligned on @align, therefore buf + size belongs to the
+ * same page -> no OOB.
+ */
+ READ_ONCE(*(buf + size));
+ KUNIT_EXPECT_FALSE(test, report_available());
+
+ /* Overflowing by @align bytes will result in an OOB. */
+ expect.addr = buf + size + align;
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+
+ test_free(buf);
+}
+
+static void test_kmalloc_aligned_oob_write(struct kunit *test)
+{
+ const size_t size = 73;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_CORRUPTION,
+ .fn = test_kmalloc_aligned_oob_write,
+ };
+ char *buf;
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
+ /*
+ * The object is offset to the right, so we won't get a page
+ * fault immediately after it.
+ */
+ expect.addr = buf + size;
+ WRITE_ONCE(*expect.addr, READ_ONCE(*expect.addr) + 1);
+ KUNIT_EXPECT_FALSE(test, report_available());
+ test_free(buf);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test cache shrinking and destroying with KFENCE. */
+static void test_shrink_memcache(struct kunit *test)
+{
+ const size_t size = 32;
+ void *buf;
+
+ setup_test_cache(test, size, 0, NULL);
+ KUNIT_EXPECT_TRUE(test, test_cache);
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ kmem_cache_shrink(test_cache);
+ test_free(buf);
+
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+static void ctor_set_x(void *obj)
+{
+ /* Every object has at least 8 bytes. */
+ memset(obj, 'x', 8);
+}
+
+/* Ensure that SL*B does not modify KFENCE objects on bulk free. */
+static void test_free_bulk(struct kunit *test)
+{
+ int iter;
+
+ for (iter = 0; iter < 5; iter++) {
+ const size_t size = setup_test_cache(test, 8 + prandom_u32_max(300), 0,
+ (iter & 1) ? ctor_set_x : NULL);
+ void *objects[] = {
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT),
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE),
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT),
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE),
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE),
+ };
+
+ kmem_cache_free_bulk(test_cache, ARRAY_SIZE(objects), objects);
+ KUNIT_ASSERT_FALSE(test, report_available());
+ test_cache_destroy();
+ }
+}
+
+/* Test init-on-free works. */
+static void test_init_on_free(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_UAF,
+ .fn = test_init_on_free,
+ };
+ int i;
+
+ if (!IS_ENABLED(CONFIG_INIT_ON_FREE_DEFAULT_ON))
+ return;
+ /* Assume it hasn't been disabled on command line. */
+
+ setup_test_cache(test, size, 0, NULL);
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ for (i = 0; i < size; i++)
+ expect.addr[i] = i + 1;
+ test_free(expect.addr);
+
+ for (i = 0; i < size; i++) {
+ /*
+ * This may fail if the page was recycled by KFENCE and then
+ * written to again -- this however, is near impossible with a
+ * default config.
+ */
+ KUNIT_EXPECT_EQ(test, expect.addr[i], (char)0);
+
+ if (!i) /* Only check first access to not fail test if page is ever re-protected. */
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+ }
+}
+
+/* Ensure that constructors work properly. */
+static void test_memcache_ctor(struct kunit *test)
+{
+ const size_t size = 32;
+ char *buf;
+ int i;
+
+ setup_test_cache(test, size, 0, ctor_set_x);
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+
+ for (i = 0; i < 8; i++)
+ KUNIT_EXPECT_EQ(test, buf[i], (char)'x');
+
+ test_free(buf);
+
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+/* Test that memory is zeroed if requested. */
+static void test_gfpzero(struct kunit *test)
+{
+ const size_t size = PAGE_SIZE; /* PAGE_SIZE so we can use ALLOCATE_ANY. */
+ char *buf1, *buf2;
+ int i;
+
+ if (CONFIG_KFENCE_SAMPLE_INTERVAL > 100) {
+ kunit_warn(test, "skipping ... would take too long\n");
+ return;
+ }
+
+ setup_test_cache(test, size, 0, NULL);
+ buf1 = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ for (i = 0; i < size; i++)
+ buf1[i] = i + 1;
+ test_free(buf1);
+
+ /* Try to get same address again -- this can take a while. */
+ for (i = 0;; i++) {
+ buf2 = test_alloc(test, size, GFP_KERNEL | __GFP_ZERO, ALLOCATE_ANY);
+ if (buf1 == buf2)
+ break;
+ test_free(buf2);
+
+ if (i == CONFIG_KFENCE_NUM_OBJECTS) {
+ kunit_warn(test, "giving up ... cannot get same object back\n");
+ return;
+ }
+ }
+
+ for (i = 0; i < size; i++)
+ KUNIT_EXPECT_EQ(test, buf2[i], (char)0);
+
+ test_free(buf2);
+
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+static void test_invalid_access(struct kunit *test)
+{
+ const struct expect_report expect = {
+ .type = KFENCE_ERROR_INVALID,
+ .fn = test_invalid_access,
+ .addr = &__kfence_pool[10],
+ };
+
+ READ_ONCE(__kfence_pool[10]);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test SLAB_TYPESAFE_BY_RCU works. */
+static void test_memcache_typesafe_by_rcu(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_UAF,
+ .fn = test_memcache_typesafe_by_rcu,
+ };
+
+ setup_test_cache(test, size, SLAB_TYPESAFE_BY_RCU, NULL);
+ KUNIT_EXPECT_TRUE(test, test_cache); /* Want memcache. */
+
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ *expect.addr = 42;
+
+ rcu_read_lock();
+ test_free(expect.addr);
+ KUNIT_EXPECT_EQ(test, *expect.addr, (char)42);
+ rcu_read_unlock();
+
+ /* No reports yet, memory should not have been freed on access. */
+ KUNIT_EXPECT_FALSE(test, report_available());
+ rcu_barrier(); /* Wait for free to happen. */
+
+ /* Expect use-after-free. */
+ KUNIT_EXPECT_EQ(test, *expect.addr, (char)42);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test krealloc(). */
+static void test_krealloc(struct kunit *test)
+{
+ const size_t size = 32;
+ const struct expect_report expect = {
+ .type = KFENCE_ERROR_UAF,
+ .fn = test_krealloc,
+ .addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY),
+ };
+ char *buf = expect.addr;
+ int i;
+
+ KUNIT_EXPECT_FALSE(test, test_cache);
+ KUNIT_EXPECT_EQ(test, ksize(buf), size); /* Precise size match after KFENCE alloc. */
+ for (i = 0; i < size; i++)
+ buf[i] = i + 1;
+
+ /* Check that we successfully change the size. */
+ buf = krealloc(buf, size * 3, GFP_KERNEL); /* Grow. */
+ /* Note: Might no longer be a KFENCE alloc. */
+ KUNIT_EXPECT_GE(test, ksize(buf), size * 3);
+ for (i = 0; i < size; i++)
+ KUNIT_EXPECT_EQ(test, buf[i], (char)(i + 1));
+ for (; i < size * 3; i++) /* Fill to extra bytes. */
+ buf[i] = i + 1;
+
+ buf = krealloc(buf, size * 2, GFP_KERNEL); /* Shrink. */
+ KUNIT_EXPECT_GE(test, ksize(buf), size * 2);
+ for (i = 0; i < size * 2; i++)
+ KUNIT_EXPECT_EQ(test, buf[i], (char)(i + 1));
+
+ buf = krealloc(buf, 0, GFP_KERNEL); /* Free. */
+ KUNIT_EXPECT_EQ(test, (unsigned long)buf, (unsigned long)ZERO_SIZE_PTR);
+ KUNIT_ASSERT_FALSE(test, report_available()); /* No reports yet! */
+
+ READ_ONCE(*expect.addr); /* Ensure krealloc() actually freed earlier KFENCE object. */
+ KUNIT_ASSERT_TRUE(test, report_matches(&expect));
+}
+
+/* Test that some objects from a bulk allocation belong to KFENCE pool. */
+static void test_memcache_alloc_bulk(struct kunit *test)
+{
+ const size_t size = 32;
+ bool pass = false;
+ unsigned long timeout;
+
+ setup_test_cache(test, size, 0, NULL);
+ KUNIT_EXPECT_TRUE(test, test_cache); /* Want memcache. */
+ /*
+ * 100x the sample interval should be more than enough to ensure we get
+ * a KFENCE allocation eventually.
+ */
+ timeout = jiffies + msecs_to_jiffies(100 * CONFIG_KFENCE_SAMPLE_INTERVAL);
+ do {
+ void *objects[100];
+ int i, num = kmem_cache_alloc_bulk(test_cache, GFP_ATOMIC, ARRAY_SIZE(objects),
+ objects);
+ if (!num)
+ continue;
+ for (i = 0; i < ARRAY_SIZE(objects); i++) {
+ if (is_kfence_address(objects[i])) {
+ pass = true;
+ break;
+ }
+ }
+ kmem_cache_free_bulk(test_cache, num, objects);
+ /*
+ * kmem_cache_alloc_bulk() disables interrupts, and calling it
+ * in a tight loop may not give KFENCE a chance to switch the
+ * static branch. Call cond_resched() to let KFENCE chime in.
+ */
+ cond_resched();
+ } while (!pass && time_before(jiffies, timeout));
+
+ KUNIT_EXPECT_TRUE(test, pass);
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+/*
+ * KUnit does not provide a way to pass arguments to tests, so we encode
+ * additional info in the name. Set up 2 tests per test case, one using the
+ * default allocator, and another using a custom memcache (suffix '-memcache').
+ */
+#define KFENCE_KUNIT_CASE(test_name) \
+ { .run_case = test_name, .name = #test_name }, \
+ { .run_case = test_name, .name = #test_name "-memcache" }
+
+static struct kunit_case kfence_test_cases[] = {
+ KFENCE_KUNIT_CASE(test_out_of_bounds_read),
+ KFENCE_KUNIT_CASE(test_use_after_free_read),
+ KFENCE_KUNIT_CASE(test_double_free),
+ KFENCE_KUNIT_CASE(test_invalid_addr_free),
+ KFENCE_KUNIT_CASE(test_free_bulk),
+ KFENCE_KUNIT_CASE(test_init_on_free),
+ KUNIT_CASE(test_kmalloc_aligned_oob_read),
+ KUNIT_CASE(test_kmalloc_aligned_oob_write),
+ KUNIT_CASE(test_shrink_memcache),
+ KUNIT_CASE(test_memcache_ctor),
+ KUNIT_CASE(test_invalid_access),
+ KUNIT_CASE(test_gfpzero),
+ KUNIT_CASE(test_memcache_typesafe_by_rcu),
+ KUNIT_CASE(test_krealloc),
+ KUNIT_CASE(test_memcache_alloc_bulk),
+ {},
+};
+
+/* ===== End test cases ===== */
+
+static int test_init(struct kunit *test)
+{
+ unsigned long flags;
+ int i;
+
+ spin_lock_irqsave(&observed.lock, flags);
+ for (i = 0; i < ARRAY_SIZE(observed.lines); i++)
+ observed.lines[i][0] = '\0';
+ observed.nlines = 0;
+ spin_unlock_irqrestore(&observed.lock, flags);
+
+ /* Any test with 'memcache' in its name will want a memcache. */
+ if (strstr(test->name, "memcache"))
+ test->priv = TEST_PRIV_WANT_MEMCACHE;
+ else
+ test->priv = NULL;
+
+ return 0;
+}
+
+static void test_exit(struct kunit *test)
+{
+ test_cache_destroy();
+}
+
+static struct kunit_suite kfence_test_suite = {
+ .name = "kfence-test",
+ .test_cases = kfence_test_cases,
+ .init = test_init,
+ .exit = test_exit,
+};
+static struct kunit_suite *kfence_test_suites[] = { &kfence_test_suite, NULL };
+
+static void register_tracepoints(struct tracepoint *tp, void *ignore)
+{
+ check_trace_callback_type_console(probe_console);
+ if (!strcmp(tp->name, "console"))
+ WARN_ON(tracepoint_probe_register(tp, probe_console, NULL));
+}
+
+static void unregister_tracepoints(struct tracepoint *tp, void *ignore)
+{
+ if (!strcmp(tp->name, "console"))
+ tracepoint_probe_unregister(tp, probe_console, NULL);
+}
+
+/*
+ * We only want to do tracepoints setup and teardown once, therefore we have to
+ * customize the init and exit functions and cannot rely on kunit_test_suite().
+ */
+static int __init kfence_test_init(void)
+{
+ /*
+ * Because we want to be able to build the test as a module, we need to
+ * iterate through all known tracepoints, since the static registration
+ * won't work here.
+ */
+ for_each_kernel_tracepoint(register_tracepoints, NULL);
+ return __kunit_test_suites_init(kfence_test_suites);
+}
+
+static void kfence_test_exit(void)
+{
+ __kunit_test_suites_exit(kfence_test_suites);
+ for_each_kernel_tracepoint(unregister_tracepoints, NULL);
+ tracepoint_synchronize_unregister();
+}
+
+late_initcall(kfence_test_init);
+module_exit(kfence_test_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Alexander Potapenko <gli...@google.com>, Marco Elver <el...@google.com>");
--
2.28.0.526.ge36021eeef-goog
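
The report-capture logic in probe_console() above can be tried outside the kernel. The following is a userspace sketch only: contains() stands in for the kernel's strnstr(), copy_line() for strscpy(), and the spinlock is omitted because the sketch is single-threaded. The names capture_console_line(), report_captured(), and captured are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NLINES 2
#define LINE_MAX_LEN 512

/* Captured report: title line plus access-information line. */
static struct {
	int nlines;
	char lines[NLINES][LINE_MAX_LEN];
} captured;

/* Bounded substring search over a buffer that need not be NUL-terminated. */
static bool contains(const char *buf, size_t len, const char *needle)
{
	size_t n = strlen(needle);

	for (size_t i = 0; i + n <= len; i++) {
		if (memcmp(buf + i, needle, n) == 0)
			return true;
	}
	return false;
}

/* Copy at most @len bytes into @dst, truncating if needed, and NUL-terminate. */
static void copy_line(char *dst, const char *buf, size_t len)
{
	if (len >= LINE_MAX_LEN)
		len = LINE_MAX_LEN - 1;
	memcpy(dst, buf, len);
	dst[len] = '\0';
}

/* Feed one console line; keeps the title and the first matching access line. */
static void capture_console_line(const char *buf, size_t len)
{
	if (contains(buf, len, "BUG: KFENCE: ") && contains(buf, len, "test_")) {
		/* A new test-related report starts: restart the capture. */
		copy_line(captured.lines[0], buf, len);
		captured.nlines = 1;
	} else if (captured.nlines == 1 &&
		   (contains(buf, len, "at 0x") || contains(buf, len, "of 0x"))) {
		copy_line(captured.lines[1], buf, len);
		captured.nlines = 2;
	}
}

/* True once both lines of a report have been captured. */
static bool report_captured(void)
{
	return captured.nlines == NLINES;
}
```

An access line is accepted only after a title line has been seen, so unrelated console output that happens to contain "at 0x" is ignored until a KFENCE title from a test_ function arrives.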

Andrey Konovalov

Sep 7, 2020, 11:34:21 AM9/7/20

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E . McKenney, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, dave....@linux.intel.com, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

Does the user need to know that this is object #17? This doesn't seem
like something that can be useful for anything.

Same here.

Also, this says object #24, but the stack trace above doesn't mention
which object it is. Is it the same one?

It's not really clear what is 0xac here. Value of the corrupted byte?
What does '.' stand for?

Also, if this is to be used in production, printing kernel memory
bytes might lead to info-leaks.

Only for freed allocations, right?

> At this point, the timer is
> +reset, and the next allocation is set up after the expiration of the interval.
> +To "gate" a KFENCE allocation through the main allocator's fast-path without
> +overhead, KFENCE relies on static branches via the static keys infrastructure.
> +The static branch is toggled to redirect the allocation to KFENCE.
> +
> +KFENCE objects each reside on a dedicated page, at either the left or right
> +page boundaries selected at random. The pages to the left and right of the
> +object page are "guard pages", whose attributes are changed to a protected
> +state, and cause page faults on any attempted access. Such page faults are then
> +intercepted by KFENCE, which handles the fault gracefully by reporting an
> +out-of-bounds access.

I'd start a new paragraph here:

> The side opposite of an object's guard page is used as a

Not a native speaker, but "The side opposite _to_" sounds better. Or
"The opposite side of".

> +pattern-based redzone, to detect out-of-bounds writes on the unprotected sed of

"sed"?

> +the object on frees (for special alignment and size combinations, both sides of
> +the object are redzoned).
> +
> +KFENCE also uses pattern-based redzones on the other side of an object's guard
> +page, to detect out-of-bounds writes on the unprotected side of the object;
> +these are reported on frees.

Not really clear, what is "other side" and how it's different from the
"opposite side" mentioned above. The figure doesn't really help.

Seems really similar to KASAN's quarantine? Is the implementation much
different?
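
For reference, the reuse order described in the documentation patch (freed objects are appended at the tail of the freelist, so the least recently freed object is reused first) can be sketched as a small FIFO. This is an illustration under stated assumptions, not the KFENCE implementation, and it makes no claim about how KASAN's quarantine works. The ring buffer, NUM_SLOTS, and the freelist_*() names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 4

/* Hypothetical ring-buffer freelist of slot indices: take at head, free at tail. */
struct freelist {
	int slots[NUM_SLOTS];
	size_t head;   /* Index of the next slot to hand out. */
	size_t count;  /* Number of free slots currently queued. */
};

/* Start with every slot free, in index order. */
static void freelist_init(struct freelist *fl)
{
	for (int i = 0; i < NUM_SLOTS; i++)
		fl->slots[i] = i;
	fl->head = 0;
	fl->count = NUM_SLOTS;
}

/* Return the least recently freed slot, or -1 if none is free. */
static int freelist_take(struct freelist *fl)
{
	int slot;

	if (fl->count == 0)
		return -1;
	slot = fl->slots[fl->head];
	fl->head = (fl->head + 1) % NUM_SLOTS;
	fl->count--;
	return slot;
}

/* Append a freed slot at the tail, behind all slots freed earlier. */
static void freelist_put(struct freelist *fl, int slot)
{
	assert(fl->count < NUM_SLOTS);
	fl->slots[(fl->head + fl->count) % NUM_SLOTS] = slot;
	fl->count++;
}
```

A slot that was just freed is handed out again only after every slot freed before it, which keeps recently freed objects protected for as long as the pool size allows.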

Jonathan Cameron

Sep 7, 2020, 11:43:27 AM9/7/20

to Marco Elver, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Interesting bit of work. A few trivial things inline I spotted whilst having
a first read through.

Thanks,

Jonathan

> +
> +static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp)
> +{
> + /*
> + * Note: for allocations made before RNG initialization, will always
> + * return zero. We still benefit from enabling KFENCE as early as
> + * possible, even when the RNG is not yet available, as this will allow
> + * KFENCE to detect bugs due to earlier allocations. The only downside
> + * is that the out-of-bounds accesses detected are deterministic for
> + * such allocations.
> + */
> + const bool right = prandom_u32_max(2);
> + unsigned long flags;
> + struct kfence_metadata *meta = NULL;
> + void *addr = NULL;

I think this is set in all paths, so no need to initialize here.

...

> +
> +size_t kfence_ksize(const void *addr)
> +{
> + const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> +
> + /*
> + * Read locklessly -- if there is a race with __kfence_alloc(), this
> + * most certainly is either a use-after-free, or invalid access.
> + */
> + return meta ? abs(meta->size) : 0;
> +}
> +
> +void *kfence_object_start(const void *addr)
> +{
> + const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> +
> + /*
> + * Read locklessly -- if there is a race with __kfence_alloc(), this
> + * most certainly is either a use-after-free, or invalid access.

To my reading, using "most certainly" makes this statement less clear:

Read locklessly -- if there is a race with __kfence_alloc() this
is either a use-after-free or invalid access.

Same for other cases of that particular "most certainly".

No need to set to NULL here as assigned 3 lines down.

...

Marco Elver

Sep 7, 2020, 12:33:51 PM9/7/20

to Andrey Konovalov, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E . McKenney, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

Some arguments for keeping it:

- We need to write something like "left of <object>". And then we need
to say where <object> is allocated. Giving objects names makes it
easier to understand the link between "left of <object>" and the
stacktrace shown after "<object> allocated in". We could make <object>
just "object", but reading "left/right of object" and then "object
allocated in:" can be a little confusing.

- We can look up the object via its number in the debugfs objects list
(/sys/kernel/debug/kfence/objects). For example, if we see an OOB
access, we can then check the objects file and see if the object is
still allocated or not, or if it has been recycled.

I don't believe it's distracting anyone, and if there is a chance that
keeping this information can help debug a problem, we ought to keep
it.

Right, the above stacktrace should then say "kfence-#24". (But the
address also hints at this.)

We can probably explain that better below. The values are the corrupt
bytes, the '.' are untouched bytes.

> Also, if this is to be used in production, printing kernel memory
> bytes might lead to info-leaks.

We do not print them if !CONFIG_DEBUG_KERNEL, and instead show '!' for
changed bytes. Maybe we can add this somewhere here as well.

Which "freed allocation"? What this paragraph says is that after the
sample interval elapsed, we'll return a KFENCE allocation on kmalloc.
It doesn't yet talk about freeing.

> > At this point, the timer is
> > +reset, and the next allocation is set up after the expiration of the interval.
> > +To "gate" a KFENCE allocation through the main allocator's fast-path without
> > +overhead, KFENCE relies on static branches via the static keys infrastructure.
> > +The static branch is toggled to redirect the allocation to KFENCE.
> > +
> > +KFENCE objects each reside on a dedicated page, at either the left or right
> > +page boundaries selected at random. The pages to the left and right of the
> > +object page are "guard pages", whose attributes are changed to a protected
> > +state, and cause page faults on any attempted access. Such page faults are then
> > +intercepted by KFENCE, which handles the fault gracefully by reporting an
> > +out-of-bounds access.
>
> I'd start a new paragraph here:
>
> > The side opposite of an object's guard page is used as a
>
> Not a native speaker, but "The side opposite _to_" sounds better. Or
> "The opposite side of".

All are fine. Using "to" indicates direction, which in this case is valid too.

> > +pattern-based redzone, to detect out-of-bounds writes on the unprotected sed of
>
> "sed"?

side

> > +the object on frees (for special alignment and size combinations, both sides of
> > +the object are redzoned).
> > +
> > +KFENCE also uses pattern-based redzones on the other side of an object's guard
> > +page, to detect out-of-bounds writes on the unprotected side of the object;
> > +these are reported on frees.
>
> Not really clear, what is "other side" and how it's different from the
> "opposite side" mentioned above. The figure doesn't really help.

Redzone and guard page sandwich the object. Not sure how I can make it
clearer yet, but I'll try.

It's a list, and we just insert at the tail. Why does it matter?

Thanks for the comments. I'll try to fix in v2.

Thanks,
-- Marco

Marco Elver

Sep 7, 2020, 12:38:17 PM

to Jonathan Cameron, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

On Mon, 7 Sep 2020 at 17:43, Jonathan Cameron
<Jonathan...@huawei.com> wrote:
...

> Interesting bit of work. A few trivial things inline I spotted whilst having
> a first read through.
>
> Thanks,
>
> Jonathan

Thank you for having a look! We'll address these for v2.

Thanks,
-- Marco

Andrey Konovalov

Sep 7, 2020, 1:55:43 PM

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E . McKenney, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

It says that an allocation is returned to the main allocator, and this
is what is usually described with the word "freed". Do you mean
something else here?

If the implementation is similar, we can then reuse quarantine. But I
guess it's not.

Marco Elver

Sep 7, 2020, 2:16:26 PM

to Andrey Konovalov, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E . McKenney, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

On Mon, 7 Sep 2020 at 19:55, Andrey Konovalov <andre...@google.com> wrote:
> On Mon, Sep 7, 2020 at 6:33 PM Marco Elver <el...@google.com> wrote:

[...]

> > > > +Guarded allocations are set up based on the sample interval. After expiration
> > > > +of the sample interval, a guarded allocation from the KFENCE object pool is
> > > > +returned to the main allocator (SLAB or SLUB).
> > >
> > > Only for freed allocations, right?
> >
> > Which "freed allocation"? What this paragraph says is that after the
> > sample interval elapsed, we'll return a KFENCE allocation on kmalloc.
> > It doesn't yet talk about freeing.
>
> It says that an allocation is returned to the main allocator, and this
> is what is usually described with the word "freed". Do you mean
> something else here?

Ah, I see what's going on. So the "returned to the main allocator" is
ambiguous here. I meant to say "returned" as in kfence gives sl[au]b a
kfence object to return for the next kmalloc. I'll reword this as it
seems the phrase is overloaded in this context already.
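
The intended meaning (once per sample interval, the next allocation is served from the KFENCE pool instead of the regular slab path) can be illustrated with a minimal userspace sketch. Everything below is hypothetical: `sample_gate`, `sample_interval_expired()`, `guarded_alloc()` and `sketch_alloc()` are illustration-only names, and an atomic flag stands in for the static key. Checking a flag like this is a normal branch, which is the cost the static key is meant to avoid, so this shows only the control flow, not the performance characteristics.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the static key: set when the sample interval expires. */
static atomic_bool sample_gate;

/* Counts how many allocations were redirected to the guarded pool. */
static size_t guarded_allocs;

/* Would be called by a timer once per sample interval. */
static void sample_interval_expired(void)
{
	atomic_store(&sample_gate, true);
}

/* Stand-in for handing out an object from the guarded pool. */
static void *guarded_alloc(size_t size)
{
	guarded_allocs++;
	return malloc(size);
}

/* Allocation entry point: at most one allocation per interval is redirected. */
static void *sketch_alloc(size_t size)
{
	/* atomic_exchange() closes the gate and reports whether it was open. */
	if (atomic_exchange(&sample_gate, false))
		return guarded_alloc(size);
	return malloc(size);
}
```

Calling `sample_interval_expired()` and then `sketch_alloc()` twice redirects only the first of the two calls; the second takes the regular path until the gate is opened again.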

[...]

> > > > +Upon deallocation of a KFENCE object, the object's page is again protected and
> > > > +the object is marked as freed. Any further access to the object causes a fault
> > > > +and KFENCE reports a use-after-free access. Freed objects are inserted at the
> > > > +tail of KFENCE's freelist, so that the least recently freed objects are reused
> > > > +first, and the chances of detecting use-after-frees of recently freed objects
> > > > +is increased.
> > >
> > > Seems really similar to KASAN's quarantine? Is the implementation much
> > > different?
> >
> > It's a list, and we just insert at the tail. Why does it matter?
>
> If the implementation is similar, we can then reuse quarantine. But I
> guess it's not.

The concept is similar, but the implementations are very different.
Both use a list (although KASAN quarantine seems to reimplement its
own singly-linked list). We just rely on a standard doubly-linked
list, without any of the delayed freeing logic of the KASAN quarantine
as KFENCE objects just change state to "freed" until they're reused
(freed kfence objects are just inserted at the tail, and the next
object to be used for an allocation is at the head).
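
As a rough illustration of the tail-insert, head-reuse behaviour described above, here is a small self-contained sketch. It is not the KFENCE code: `struct node`, `fifo_free()` and `fifo_alloc()` are hypothetical names, and the list helpers are a simplified userspace imitation of the kernel's doubly-linked list.

```c
#include <stddef.h>

/* Minimal circular doubly-linked list, loosely modelled on the kernel's list_head. */
struct node {
	struct node *prev, *next;
	int id;     /* hypothetical object number */
	int freed;  /* 1 while the object sits on the freelist */
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

/* Insert n just before the sentinel, i.e. at the tail. */
static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Remove and return the first entry, or NULL if the list is empty. */
static struct node *list_pop_head(struct node *head)
{
	struct node *n = head->next;

	if (n == head)
		return NULL;
	head->next = n->next;
	n->next->prev = head;
	n->prev = n->next = NULL;
	return n;
}

/* "Free": only a state change, then queue behind all earlier frees. */
static void fifo_free(struct node *obj, struct node *freelist)
{
	obj->freed = 1;
	list_add_tail(obj, freelist);
}

/* "Allocate": reuse the least recently freed object. */
static struct node *fifo_alloc(struct node *freelist)
{
	struct node *obj = list_pop_head(freelist);

	if (obj)
		obj->freed = 0;
	return obj;
}
```

Because a just-freed object goes to the back of the queue, it stays in the "freed" state for as long as possible before being handed out again, which is what widens the window for catching a use-after-free.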

Thanks,
-- Marco

Vlastimil Babka

Sep 8, 2020, 7:48:30 AM

to Marco Elver, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On 9/7/20 3:40 PM, Marco Elver wrote:
> This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
> low-overhead sampling-based memory safety error detector of heap
> use-after-free, invalid-free, and out-of-bounds access errors. This
> series enables KFENCE for the x86 and arm64 architectures, and adds
> KFENCE hooks to the SLAB and SLUB allocators.
>
> KFENCE is designed to be enabled in production kernels, and has near
> zero performance overhead. Compared to KASAN, KFENCE trades performance
> for precision. The main motivation behind KFENCE's design, is that with
> enough total uptime KFENCE will detect bugs in code paths not typically
> exercised by non-production test workloads. One way to quickly achieve a
> large enough total uptime is when the tool is deployed across a large
> fleet of machines.

Looks nice!

> KFENCE objects each reside on a dedicated page, at either the left or
> right page boundaries. The pages to the left and right of the object
> page are "guard pages", whose attributes are changed to a protected
> state, and cause page faults on any attempted access to them. Such page
> faults are then intercepted by KFENCE, which handles the fault
> gracefully by reporting a memory access error.
>
> Guarded allocations are set up based on a sample interval (can be set
> via kfence.sample_interval). After expiration of the sample interval, a
> guarded allocation from the KFENCE object pool is returned to the main
> allocator (SLAB or SLUB). At this point, the timer is reset, and the
> next allocation is set up after the expiration of the interval.
>
> To enable/disable a KFENCE allocation through the main allocator's
> fast-path without overhead, KFENCE relies on static branches via the
> static keys infrastructure. The static branch is toggled to redirect the
> allocation to KFENCE.

Toggling a static branch is AFAIK quite disruptive (PeterZ will probably tell
you better), and with the default 100ms sample interval, I'd think it's not good
to toggle it so often? Did you measure what performance you would get if the
static key was only for long-term toggling the whole feature on and off (boot
time or even runtime), but the decisions "am I in a sample interval right now?"
would be normal tests behind this static key? Thanks.

Catalin Marinas

Sep 8, 2020, 7:53:25 AM

to Marco Elver, gli...@google.com, ak...@linux-foundation.org, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Could you instead do:

#if defined(CONFIG_KFENCE) && defined(CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL)
delete_object_part((unsigned long)__kfence_pool, KFENCE_POOL_SIZE);
#endif

--
Catalin

Alexander Potapenko

Sep 8, 2020, 8:16:22 AM

to Vlastimil Babka, Marco Elver, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, pau...@kernel.org, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, dave....@linux.intel.com, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, linu...@vger.kernel.org, LKML, kasan-dev, linux-ar...@lists.infradead.org, Linux Memory Management List

> Toggling a static branch is AFAIK quite disruptive (PeterZ will probably tell
> you better), and with the default 100ms sample interval, I'd think it's not good
> to toggle it so often? Did you measure what performance you would get if the
> static key was only for long-term toggling the whole feature on and off (boot
> time or even runtime), but the decisions "am I in a sample interval right now?"
> would be normal tests behind this static key? Thanks.

100ms is the default that we use for testing, but for production it
should be fine to pick a longer interval (e.g. 1 second or more).
We haven't noticed any performance impact with either 100ms or bigger values.

Regarding using normal branches, they are quite expensive.
E.g. at some point we used to have a branch in slab_free() to check
whether the freed object belonged to KFENCE pool.
When the pool address was taken from memory, this resulted in some
non-zero performance penalty.

As for enabling the whole feature at runtime, our intention is to let
the users have it enabled by default, otherwise someone will need to
tell every machine in the fleet when the feature is to be enabled.

--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

Alexander Potapenko

Sep 8, 2020, 8:30:11 AM

to Catalin Marinas, Marco Elver, Andrew Morton, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, pau...@kernel.org, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, dave....@linux.intel.com, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, linu...@vger.kernel.org, LKML, kasan-dev, linux-ar...@lists.infradead.org, Linux Memory Management List

> Could you instead do:
>
> #if defined(CONFIG_KFENCE) && defined(CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL)
> delete_object_part((unsigned long)__kfence_pool, KFENCE_POOL_SIZE);
> #endif

Thanks, we'll apply this to v2!

Vlastimil Babka

Sep 8, 2020, 10:40:03 AM

to Alexander Potapenko, Marco Elver, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, pau...@kernel.org, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, dave....@linux.intel.com, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, linu...@vger.kernel.org, LKML, kasan-dev, linux-ar...@lists.infradead.org, Linux Memory Management List

On 9/8/20 2:16 PM, Alexander Potapenko wrote:
>> Toggling a static branch is AFAIK quite disruptive (PeterZ will probably tell
>> you better), and with the default 100ms sample interval, I'd think it's not good
>> to toggle it so often? Did you measure what performance you would get if the
>> static key was only for long-term toggling the whole feature on and off (boot
>> time or even runtime), but the decisions "am I in a sample interval right now?"
>> would be normal tests behind this static key? Thanks.
>
> 100ms is the default that we use for testing, but for production it
> should be fine to pick a longer interval (e.g. 1 second or more).
> We haven't noticed any performance impact with either 100ms or bigger values.

Hmm, I see.

> Regarding using normal branches, they are quite expensive.
> E.g. at some point we used to have a branch in slab_free() to check
> whether the freed object belonged to KFENCE pool.
> When the pool address was taken from memory, this resulted in some
> non-zero performance penalty.

Well yeah, if the checks involve extra cache misses, that adds up. But AFAICS
you can't avoid that kind of checks with static key anyway (am I looking right
at is_kfence_address()?) because some kfence-allocated objects will exist even
after the sampling period ended, right?
So AFAICS kfence_alloc() is the only user of the static key and I wonder if it
really makes such difference there.

> As for enabling the whole feature at runtime, our intention is to let
> the users have it enabled by default, otherwise someone will need to
> tell every machine in the fleet when the feature is to be enabled.

Sure, but I guess there are tools that make it no difference in effort between 1
machine and fleet.

I'll try to explain my general purpose distro-kernel POV. What I like e.g. about
debug_pagealloc and page_owner (and contributed to that state of these features)
is that a distro kernel can be shipped with them compiled in, but they are
static-key disabled thus have no overhead, until a user enables them on boot,
without a need to replace the kernel with a debug one first. Users can enable
them for their own debugging, or when asked by somebody from the distro
assisting with the debugging.

I think KFENCE has similar potential and could work the same way - compiled in
always, but a static key would eliminate everything, even the
is_kfence_address() checks, until it became enabled (but then it would probably
be a one-way street for the rest of the kernel's uptime). Some distro users
would decide to enable it always, some not, but could be advised to when needed.
So the existing static key could be repurposed for this, or if it's really worth
having the current one to control just the sampling period, then there would be two?

Thanks.

Dave Hansen

Sep 8, 2020, 10:52:27 AM

to Marco Elver, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On 9/7/20 6:40 AM, Marco Elver wrote:
> KFENCE is designed to be enabled in production kernels, and has near
> zero performance overhead. Compared to KASAN, KFENCE trades performance
> for precision.

Could you talk a little bit about where you expect folks to continue to
use KASAN? How would a developer or a tester choose which one to use?

> KFENCE objects each reside on a dedicated page, at either the left or
> right page boundaries. The pages to the left and right of the object
> page are "guard pages", whose attributes are changed to a protected
> state, and cause page faults on any attempted access to them. Such page
> faults are then intercepted by KFENCE, which handles the fault
> gracefully by reporting a memory access error.

How much memory overhead does this end up having? I know it depends on
the object size and so forth. But, could you give some real-world
examples of memory consumption? Also, what's the worst case? Say I
have a ton of worst-case-sized (32b) slab objects. Will I notice?

Marco Elver

Sep 8, 2020, 11:21:36 AM

to Vlastimil Babka, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, pau...@kernel.org, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, dave....@linux.intel.com, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, linu...@vger.kernel.org, LKML, kasan-dev, linux-ar...@lists.infradead.org, Linux Memory Management List

On Tue, Sep 08, 2020 at 04:40PM +0200, Vlastimil Babka wrote:
> On 9/8/20 2:16 PM, Alexander Potapenko wrote:
> >> Toggling a static branch is AFAIK quite disruptive (PeterZ will probably tell
> >> you better), and with the default 100ms sample interval, I'd think it's not good
> >> to toggle it so often? Did you measure what performance you would get if the
> >> static key was only for long-term toggling the whole feature on and off (boot
> >> time or even runtime), but the decisions "am I in a sample interval right now?"
> >> would be normal tests behind this static key? Thanks.
> >
> > 100ms is the default that we use for testing, but for production it
> > should be fine to pick a longer interval (e.g. 1 second or more).
> > We haven't noticed any performance impact with either 100ms or bigger values.
>
> Hmm, I see.

To add to this, we initially also weren't sure what the results would be
toggling the static branches at varying intervals. In the end we were
pleasantly surprised, and our benchmarking results always proved there
is no noticeable slowdown above 100ms (somewhat noticeable in the range
of 1-10ms but it's tolerable if you wanted to go there).

I think we were initially, just like you might be, deceived about the
time scales here. 100ms is a really long time for a computer.

> > Regarding using normal branches, they are quite expensive.
> > E.g. at some point we used to have a branch in slab_free() to check
> > whether the freed object belonged to KFENCE pool.
> > When the pool address was taken from memory, this resulted in some
> > non-zero performance penalty.
>
> Well yeah, if the checks involve extra cache misses, that adds up. But AFAICS
> you can't avoid that kind of checks with static key anyway (am I looking right
> at is_kfence_address()?) because some kfence-allocated objects will exist even
> after the sampling period ended, right?
> So AFAICS kfence_alloc() is the only user of the static key and I wonder if it
> really makes such difference there.

The really important bit here is to differentiate between fast-paths and
slow-paths!

We insert kfence_alloc() into the allocator fast-paths, which is where
the majority of cost would be. On the other hand, the major user of
is_kfence_address(), kfence_free(), is only inserted into the slow-path.

As a result, is_kfence_address() usage has negligible cost (esp. if the
statically allocated pool is used) -- we benchmarked this quite
extensively.
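
To give a feel for why such a check can be cheap with a statically allocated pool, here is a hedged sketch of a pool membership test written as a single unsigned range comparison. It is an illustration under assumptions, not the patch's `is_kfence_address()`: `sketch_pool`, `sketch_is_pool_address()` and the `SKETCH_*` constants are hypothetical, with the pool size taken from the default of 255 objects and 4 KiB pages.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_NUM_OBJECTS 255u
#define SKETCH_POOL_SIZE   ((SKETCH_NUM_OBJECTS + 1u) * 2u * SKETCH_PAGE_SIZE)

/* Statically allocated pool: its address is known at link time. */
static unsigned char sketch_pool[SKETCH_POOL_SIZE];

static bool sketch_is_pool_address(const void *addr)
{
	/*
	 * Unsigned subtraction wraps around for addresses below the pool,
	 * so one comparison covers both the lower and the upper bound.
	 */
	return (uintptr_t)addr - (uintptr_t)sketch_pool < SKETCH_POOL_SIZE;
}
```

With a static pool, the base address does not have to be loaded from memory before the comparison, which matches the earlier remark in the thread that taking the pool address from memory carried a measurable penalty.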

> > As for enabling the whole feature at runtime, our intention is to let
> > the users have it enabled by default, otherwise someone will need to
> > tell every machine in the fleet when the feature is to be enabled.
>
> Sure, but I guess there are tools that make it no difference in effort between 1
> machine and fleet.
>
> I'll try to explain my general purpose distro-kernel POV. What I like e.g. about
> debug_pagealloc and page_owner (and contributed to that state of these features)
> is that a distro kernel can be shipped with them compiled in, but they are
> static-key disabled thus have no overhead, until a user enables them on boot,
> without a need to replace the kernel with a debug one first. Users can enable
> them for their own debugging, or when asked by somebody from the distro
> assisting with the debugging.
>
> I think KFENCE has similar potential and could work the same way - compiled in
> always, but a static key would eliminate everything, even the
> is_kfence_address() checks,

[ See my answer for the cost of is_kfence_address() above. In short,
until we add is_kfence_address() to fast-paths, introducing yet
another static branch would be premature optimization. ]

> until it became enabled (but then it would probably
> be a one-way street for the rest of the kernel's uptime). Some distro users
> would decide to enable it always, some not, but could be advised to when needed.
> So the existing static key could be repurposed for this, or if it's really worth
> having the current one to control just the sampling period, then there would be two?

You can already do this. Just set CONFIG_KFENCE_SAMPLE_INTERVAL=0. When
you decide to enable it, set kfence.sample_interval=<somenumber> as a
boot parameter.

I'll add something to that effect into Documentation/dev-tools/kfence.rst.

Thanks,
-- Marco

Marco Elver

Sep 8, 2020, 11:31:12 AM

to Dave Hansen, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On Tue, Sep 08, 2020 at 07:52AM -0700, Dave Hansen wrote:
> On 9/7/20 6:40 AM, Marco Elver wrote:
> > KFENCE is designed to be enabled in production kernels, and has near
> > zero performance overhead. Compared to KASAN, KFENCE trades performance
> > for precision.
>
> Could you talk a little bit about where you expect folks to continue to
> use KASAN? How would a developer or a tester choose which one to use?

We mention some of this in Documentation/dev-tools/kfence.rst:

  In the kernel, several tools exist to debug memory access errors, and in
  particular KASAN can detect all bug classes that KFENCE can detect. While KASAN
  is more precise, relying on compiler instrumentation, this comes at a
  performance cost. We want to highlight that KASAN and KFENCE are complementary,
  with different target environments. For instance, KASAN is the better
  debugging-aid, where a simple reproducer exists: due to the lower chance to
  detect the error, it would require more effort using KFENCE to debug.
  Deployments at scale, however, would benefit from using KFENCE to discover bugs
  due to code paths not exercised by test cases or fuzzers.

If you can afford to use KASAN, continue using KASAN. Usually this only
applies to test environments. If you have kernels for production use,
and cannot enable KASAN for the obvious cost reasons, you could consider
KFENCE.

I'll try to make this clearer, maybe summarizing what I said here in
Documentation as well.

> > KFENCE objects each reside on a dedicated page, at either the left or
> > right page boundaries. The pages to the left and right of the object
> > page are "guard pages", whose attributes are changed to a protected
> > state, and cause page faults on any attempted access to them. Such page
> > faults are then intercepted by KFENCE, which handles the fault
> > gracefully by reporting a memory access error.
>
> How much memory overhead does this end up having? I know it depends on
> the object size and so forth. But, could you give some real-world
> examples of memory consumption? Also, what's the worst case? Say I
> have a ton of worst-case-sized (32b) slab objects. Will I notice?

KFENCE objects are limited (default 255). If we exhaust KFENCE's memory
pool, no more KFENCE allocations will occur.
Documentation/dev-tools/kfence.rst gives a formula to calculate the
KFENCE pool size:

  The total memory dedicated to the KFENCE memory pool can be computed as::

      ( #objects + 1 ) * 2 * PAGE_SIZE

  Using the default config, and assuming a page size of 4 KiB, results in
  dedicating 2 MiB to the KFENCE memory pool.
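
Plugging the defaults into the formula gives (255 + 1) * 2 * 4096 = 2,097,152 bytes, which is exactly 2 MiB. The helper below only restates that arithmetic; `sketch_pool_bytes()` is a hypothetical name used for illustration and does not appear in the patch.

```c
#include <stddef.h>

/* Pool size in bytes, following the formula ( #objects + 1 ) * 2 * PAGE_SIZE. */
static size_t sketch_pool_bytes(size_t num_objects, size_t page_size)
{
	return (num_objects + 1) * 2 * page_size;
}
```

For example, `sketch_pool_bytes(255, 4096)` evaluates to 2097152. The size is fixed by the object count and page size alone, so it does not grow with the number or size of slab objects in the system.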

Does that clarify this point? Or anything else that could help clarify
this?

Thanks,
-- Marco

Vlastimil Babka

Sep 8, 2020, 11:36:47 AM

to Marco Elver, Dave Hansen, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On 9/8/20 5:31 PM, Marco Elver wrote:
>>
>> How much memory overhead does this end up having? I know it depends on
>> the object size and so forth. But, could you give some real-world
>> examples of memory consumption? Also, what's the worst case? Say I
>> have a ton of worst-case-sized (32b) slab objects. Will I notice?
>
> KFENCE objects are limited (default 255). If we exhaust KFENCE's memory
> pool, no more KFENCE allocations will occur.
> Documentation/dev-tools/kfence.rst gives a formula to calculate the
> KFENCE pool size:
>
> The total memory dedicated to the KFENCE memory pool can be computed as::
>
> ( #objects + 1 ) * 2 * PAGE_SIZE
>
> Using the default config, and assuming a page size of 4 KiB, results in
> dedicating 2 MiB to the KFENCE memory pool.
>
> Does that clarify this point? Or anything else that could help clarify
> this?

Hmm did you observe that with this limit, a long-running system would eventually
converge to KFENCE memory pool being filled with long-aged objects, so there
would be no space to sample new ones?

> Thanks,
> -- Marco
>

Dave Hansen

Sep 8, 2020, 11:37:09 AM

to Marco Elver, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On 9/8/20 8:31 AM, Marco Elver wrote:
...

> If you can afford to use KASAN, continue using KASAN. Usually this only
> applies to test environments. If you have kernels for production use,
> and cannot enable KASAN for the obvious cost reasons, you could consider
> KFENCE.

That's a really nice, succinct way to put it. You might even want to
consider putting this in the Kconfig help text.

>>> KFENCE objects each reside on a dedicated page, at either the left or
>>> right page boundaries. The pages to the left and right of the object
>>> page are "guard pages", whose attributes are changed to a protected
>>> state, and cause page faults on any attempted access to them. Such page
>>> faults are then intercepted by KFENCE, which handles the fault
>>> gracefully by reporting a memory access error.
>>
>> How much memory overhead does this end up having? I know it depends on
>> the object size and so forth. But, could you give some real-world
>> examples of memory consumption? Also, what's the worst case? Say I
>> have a ton of worst-case-sized (32b) slab objects. Will I notice?
>
> KFENCE objects are limited (default 255). If we exhaust KFENCE's memory
> pool, no more KFENCE allocations will occur.
> Documentation/dev-tools/kfence.rst gives a formula to calculate the
> KFENCE pool size:
>
> The total memory dedicated to the KFENCE memory pool can be computed as::
>
> ( #objects + 1 ) * 2 * PAGE_SIZE
>
> Using the default config, and assuming a page size of 4 KiB, results in
> dedicating 2 MiB to the KFENCE memory pool.
>
> Does that clarify this point? Or anything else that could help clarify
> this?

That clears it up, thanks!

I would suggest adding a tiny nugget about this in the cover letter,
just saying that the worst-case memory consumption on x86 is ~2M.
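
To make the arithmetic behind the ~2M figure explicit, here is a minimal user-space sketch of the formula quoted from Documentation/dev-tools/kfence.rst. The helper name kfence_pool_bytes() is hypothetical and exists only for this illustration; it is not part of the patch.

```c
#include <stddef.h>

/*
 * Pool size per the quoted formula: (#objects + 1) * 2 * PAGE_SIZE.
 * Hypothetical helper for illustration only; not a kernel function.
 */
static size_t kfence_pool_bytes(size_t num_objects, size_t page_size)
{
	return (num_objects + 1) * 2 * page_size;
}
```

With the default of 255 objects and 4 KiB pages, kfence_pool_bytes(255, 4096) evaluates to 256 * 8192 = 2097152 bytes, i.e. exactly 2 MiB.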

Dave Hansen

Sep 8, 2020, 11:54:42 AM9/8/20

to Marco Elver, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On 9/7/20 6:40 AM, Marco Elver wrote:

> +The most important parameter is KFENCE's sample interval, which can be set via
> +the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
> +sample interval determines the frequency with which heap allocations will be
> +guarded by KFENCE. The default is configurable via the Kconfig option
> +``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
> +disables KFENCE.
> +
> +With the Kconfig option ``CONFIG_KFENCE_NUM_OBJECTS`` (default 255), the number
> +of available guarded objects can be controlled. Each object requires 2 pages,
> +one for the object itself and the other one used as a guard page; object pages
> +are interleaved with guard pages, and every object page is therefore surrounded
> +by two guard pages.

Is it hard to make these both tunable at runtime?

It would be nice if I hit a KFENCE error on a system to bump up the
number of objects and turn up the frequency of guarded objects to try to
hit it again. That would be a really nice feature for development
environments.

It would also be nice to have a counter somewhere (/proc/vmstat?) to
explicitly say how many pages are currently being used.

I didn't mention it elsewhere, but this work looks really nice. It has
very little impact on the core kernel and looks like a very nice tool to
have in the toolbox. I don't see any major reasons we wouldn't want to
merge after our typical bikeshedding. :)

Marco Elver

Sep 8, 2020, 11:56:39 AM9/8/20

to Vlastimil Babka, Dave Hansen, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Sure, that's a possibility. But remember that we're not trying to
deterministically detect bugs on 1 system (if you wanted that, you
should use KASAN), but on a fleet of machines! The non-determinism of which
allocations end up in KFENCE ensures that we won't end up with a fleet of
machines with identical allocations. That's exactly what we're
after. Even if we eventually exhaust the pool, you'll still detect bugs
if there are any.

If you are overly worried, either the sample interval or number of
available objects needs to be tweaked to be larger. The default of 255
is quite conservative, and even using something larger on a modern
system is hardly noticeable. Choosing a sample interval & number of
objects should also factor in how many machines you plan to deploy this
on. Monitoring /sys/kernel/debug/kfence/stats can help you here.

Thanks,
-- Marco

Marco Elver

Sep 8, 2020, 12:14:35 PM9/8/20

to Dave Hansen, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, dave....@linux.intel.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, cor...@lwn.net, kees...@chromium.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On Tue, Sep 08, 2020 at 08:54AM -0700, Dave Hansen wrote:
> On 9/7/20 6:40 AM, Marco Elver wrote:
> > +The most important parameter is KFENCE's sample interval, which can be set via
> > +the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
> > +sample interval determines the frequency with which heap allocations will be
> > +guarded by KFENCE. The default is configurable via the Kconfig option
> > +``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
> > +disables KFENCE.
> > +
> > +With the Kconfig option ``CONFIG_KFENCE_NUM_OBJECTS`` (default 255), the number
> > +of available guarded objects can be controlled. Each object requires 2 pages,
> > +one for the object itself and the other one used as a guard page; object pages
> > +are interleaved with guard pages, and every object page is therefore surrounded
> > +by two guard pages.
>
> Is it hard to make these both tunable at runtime?

The number of objects is quite hard, because it really complicates
bookkeeping and might also have an impact on performance, which is why
we prefer the statically allocated pool (like on x86, and we're trying
to get it for arm64 as well).

The sample interval is already tunable, just write to
/sys/module/kfence/parameters/sample_interval. Although we have this
(see core.c):

module_param_named(sample_interval, kfence_sample_interval, ulong,
IS_ENABLED(CONFIG_DEBUG_KERNEL) ? 0600 : 0400);

I was wondering if it should also be tweakable on non-debug kernels, but
I fear it might be abused. Sure, you need to be root to change it, but
maybe I'm being overly conservative here? If you don't see huge problems
with it we could just make it 0600 for all builds.
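
For reference, writing the parameter from a program could look roughly like the sketch below. It assumes a kernel where the file is writable (mode 0600, so root only); the function name write_sample_interval() is hypothetical, and the path is passed in by the caller rather than hard-coded.

```c
#include <stdio.h>

/*
 * Write a new sample interval (in milliseconds) to the given parameter
 * file, e.g. /sys/module/kfence/parameters/sample_interval.
 * Hypothetical helper; returns 0 on success and -1 on any failure.
 */
static int write_sample_interval(const char *path, unsigned long interval_ms)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;	/* e.g. file is 0400 or caller is not root */

	ok = fprintf(f, "%lu\n", interval_ms) > 0;
	if (fclose(f) != 0)
		ok = 0;		/* the write error may only surface on close */

	return ok ? 0 : -1;
}
```

On a non-debug build with the 0400 mode from the quoted module_param_named() line, the fopen() call is expected to fail and the helper returns -1.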

> It would be nice if I hit a KFENCE error on a system to bump up the
> number of objects and turn up the frequency of guarded objects to try to
> hit it again. That would be a really nice feature for development
> environments.

Indeed, which is why we also found it might be useful to tweak
sample_interval at runtime for debug-kernels. Although I don't know how
much luck you'll have hitting it again.

My strategy at that point would be to take the stack traces, try to
construct test-cases for those code paths, and run them through KASAN
(if it isn't immediately obvious what the problem is).

> It would also be nice to have a counter somewhere (/proc/vmstat?) to
> explicitly say how many pages are currently being used.

You can check /sys/kernel/debug/kfence/stats. On a system I just booted:

[root@syzkaller][~]# cat /sys/kernel/debug/kfence/stats
enabled: 1
currently allocated: 18
total allocations: 105
total frees: 87
total bugs: 0

The "currently allocated" count is the currently used KFENCE objects (of
255 for the default config).
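
If you want to monitor this from a tool, the "name: value" lines are easy to pick apart. A minimal sketch, assuming the file format stays as shown above; kfence_stat_lookup() is a hypothetical helper name, and it works on a buffer the caller has already read from the stats file.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Find the line "<key>: <number>" in text and store the number in *value.
 * Hypothetical helper; returns 0 on success, -1 if the key is missing or
 * is not followed by a number.
 */
static int kfence_stat_lookup(const char *text, const char *key, long *value)
{
	size_t key_len = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* The key must start the line and be followed directly by ':'. */
		if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
			const char *num = line + key_len + 1;
			char *end;
			long v = strtol(num, &end, 10);

			if (end == num)
				return -1;	/* no digits after the colon */
			*value = v;
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}
```

For the output above, looking up "currently allocated" yields 18, which matches total allocations minus total frees (105 - 87).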

> I didn't mention it elsewhere, but this work looks really nice. It has
> very little impact on the core kernel and looks like a very nice tool to
> have in the toolbox. I don't see any major reasons we wouldn't want to
> merge after our typical bikeshedding. :)

Thank you!

-- Marco

Marco Elver

Sep 9, 2020, 11:13:58 AM9/9/20

to Catalin Marinas, Mark Rutland, Will Deacon, Linux ARM, Alexander Potapenko, Christoph Lameter, Andrew Morton, David Rientjes, Pekka Enberg, Joonsoo Kim, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux Memory Management List

Hello arm64 maintainers,

On Mon, 7 Sep 2020 at 15:41, Marco Elver <el...@google.com> wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the arm64 architecture. In particular, this implements the
> required interface in <asm/kfence.h>. Currently, the arm64 version does
> not yet use a statically allocated memory pool, at the cost of a pointer
> load for each is_kfence_address().

> For ARM64, we would like to solicit feedback on what the best option is
> to obtain a constant address for __kfence_pool. One option is to declare
> a memory range in the memory layout to be dedicated to KFENCE (like is
> done for KASAN), however, it is unclear if this is the best available
> option. We would like to avoid touching the memory layout.

We can't yet tell what the best option for this might be. So, any
suggestions on how to go about switching to a static pool would be
much appreciated.

Many thanks,
-- Marco

Dmitry Vyukov

Sep 10, 2020, 10:58:07 AM9/10/20

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Mon, Sep 7, 2020 at 3:41 PM Marco Elver <el...@google.com> wrote:
> +config KFENCE_NUM_OBJECTS
> + int "Number of guarded objects available"
> + default 255
> + range 1 65535
> + help
> + The number of guarded objects available. For each KFENCE object, 2
> + pages are required; with one containing the object and two adjacent
> + ones used as guard pages.

Hi Marco,

Wonder if you tested build/boot with KFENCE_NUM_OBJECTS=65535? Can a
compiler create such a large object?

> +config KFENCE_FAULT_INJECTION
> + int "Fault injection for stress testing"
> + default 0
> + depends on EXPERT
> + help
> + The inverse probability with which to randomly protect KFENCE object
> + pages, resulting in spurious use-after-frees. The main purpose of
> + this option is to stress-test KFENCE with concurrent error reports
> + and allocations/frees. A value of 0 disables fault injection.

I would name this differently. "FAULT_INJECTION" is already taken for
a different thing, so it's a bit confusing.
KFENCE_DEBUG_SOMETHING may be a better name.
It would also be good to make it very clear in the short description
that this is for testing of KFENCE itself. When I configure syzbot I
routinely can't figure out if various DEBUG configs detect user
errors, or enable additional unit tests, or something else.
Maybe it should depend on DEBUG_KERNEL as well?

> +/*
> + * Get the canary byte pattern for @addr. Use a pattern that varies based on the
> + * lower 3 bits of the address, to detect memory corruptions with higher
> + * probability, where similar constants are used.
> + */
> +#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)addr & 0x7))

(addr) in macro body

> + seq_con_printf(seq,
> + "kfence-#%zd [0x" PTR_FMT "-0x" PTR_FMT

PTR_FMT is only used in this file, should it be declared in report.c?

Please post example reports somewhere. It's hard to figure out all
details of the reporting/formatting.
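
To spell out the macro comment: a user-space sketch of the parenthesized form, with uint8_t and uintptr_t standing in for the kernel's u8 and unsigned long. CANARY_PATTERN and canary_of_next_word() are names made up for this illustration.

```c
#include <stdint.h>

/*
 * Parenthesized variant: the cast applies to the whole argument.
 * Without the parentheses, CANARY_PATTERN(p + 1) would expand to
 * (uintptr_t)p + 1 & 0x7, which casts p first and then adds one byte
 * instead of one element.
 */
#define CANARY_PATTERN(addr) \
	((uint8_t)0xaa ^ (uint8_t)((uintptr_t)(addr) & 0x7))

/* Canary pattern for the 32-bit word following p (p itself is not read). */
static uint8_t canary_of_next_word(const uint32_t *p)
{
	return CANARY_PATTERN(p + 1);
}
```

With an 8-byte-aligned p, the parenthesized macro sees an address whose low three bits are 4, whereas the unparenthesized expansion would see low bits of 1.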

Marco Elver

Sep 10, 2020, 11:06:39 AM9/10/20

to Dmitry Vyukov, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Thu, 10 Sep 2020 at 16:58, Dmitry Vyukov <dvy...@google.com> wrote:
>
> On Mon, Sep 7, 2020 at 3:41 PM Marco Elver <el...@google.com> wrote:
> > +config KFENCE_NUM_OBJECTS
> > + int "Number of guarded objects available"
> > + default 255
> > + range 1 65535
> > + help
> > + The number of guarded objects available. For each KFENCE object, 2
> > + pages are required; with one containing the object and two adjacent
> > + ones used as guard pages.
>
> Hi Marco,
>
> Wonder if you tested build/boot with KFENCE_NUM_OBJECTS=65535? Can a
> compiler create such a large object?

Indeed, I get a "ld: kernel image bigger than KERNEL_IMAGE_SIZE".
Let's lower it to something more reasonable.

The main reason to have the limit is to constrain random configs and
avoid the inevitable error reports.

> > +config KFENCE_FAULT_INJECTION
> > + int "Fault injection for stress testing"
> > + default 0
> > + depends on EXPERT
> > + help
> > + The inverse probability with which to randomly protect KFENCE object
> > + pages, resulting in spurious use-after-frees. The main purpose of
> > + this option is to stress-test KFENCE with concurrent error reports
> > + and allocations/frees. A value of 0 disables fault injection.
>
> I would name this differently. "FAULT_INJECTION" is already taken for
> a different thing, so it's a bit confusing.
> KFENCE_DEBUG_SOMETHING may be a better name.
> It would also be good to make it very clear in the short description
> that this is for testing of KFENCE itself. When I configure syzbot I
> routinely can't figure out if various DEBUG configs detect user
> errors, or enable additional unit tests, or something else.

Makes sense, we'll change the name.

> Maybe it should depend on DEBUG_KERNEL as well?

EXPERT selects DEBUG_KERNEL, so depending on DEBUG_KERNEL doesn't make sense.

> > +/*
> > + * Get the canary byte pattern for @addr. Use a pattern that varies based on the
> > + * lower 3 bits of the address, to detect memory corruptions with higher
> > + * probability, where similar constants are used.
> > + */
> > +#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)addr & 0x7))
>
> (addr) in macro body

Done for v2.

> > + seq_con_printf(seq,
> > + "kfence-#%zd [0x" PTR_FMT "-0x" PTR_FMT
>
> PTR_FMT is only used in this file, should it be declared in report.c?

It's also used by the test.

> Please post example reports somewhere. It's hard to figure out all
> details of the reporting/formatting.

They can be seen in Documentation added later in the series (also
viewable here: https://github.com/google/kasan/blob/kfence/Documentation/dev-tools/kfence.rst)

Thank you!

-- Marco

Dmitry Vyukov

Sep 10, 2020, 11:43:06 AM9/10/20

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Mon, Sep 7, 2020 at 3:41 PM Marco Elver <el...@google.com> wrote:

> + meta->addr = metadata_to_pageaddr(meta);
> + /* Unprotect if we're reusing this page. */
> + if (meta->state == KFENCE_OBJECT_FREED)
> + kfence_unprotect(meta->addr);
> +
> + /* Calculate address for this allocation. */
> + if (right)
> + meta->addr += PAGE_SIZE - size;
> + meta->addr = ALIGN_DOWN(meta->addr, cache->align);

I would move this ALIGN_DOWN under the (right) if.
Do I understand it correctly that it will work, but we expect it to do
nothing for !right? If cache align is >PAGE_SIZE, nothing good will
happen anyway, right?
The previous 2 lines look like part of the same calculation -- "figure
out the addr for the right case".
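
Roughly what I have in mind, as a user-space sketch rather than a patch: DEMO_PAGE_SIZE, DEMO_ALIGN_DOWN and object_addr() are stand-ins invented for this example, and the alignment is assumed to be a power of two no larger than the page size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL
/* Round x down to a multiple of a; a must be a power of two. */
#define DEMO_ALIGN_DOWN(x, a) ((x) & ~((uintptr_t)(a) - 1))

/*
 * Pick the object address within its page. Left placement keeps the
 * page-aligned address; right placement pushes the object against the
 * end of the page and then rounds down to the cache alignment.
 */
static uintptr_t object_addr(uintptr_t page_addr, size_t size, size_t align,
			     bool right)
{
	uintptr_t addr = page_addr;

	if (right) {
		addr += DEMO_PAGE_SIZE - size;
		addr = DEMO_ALIGN_DOWN(addr, align);
	}
	return addr;
}
```

For a 100-byte object with 8-byte alignment on a page at 0x10000, right placement gives 0x10f98: the raw address 0x10f9c is rounded down by 4 bytes. For left placement the address is already page-aligned, so rounding down would change nothing.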

> + atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]);
> + return addr;
> +}
> +
> +static void kfence_guarded_free(void *addr, struct kfence_metadata *meta)
> +{
> + struct kcsan_scoped_access assert_page_exclusive;

> + unsigned long flags;
> +

> + raw_spin_lock_irqsave(&meta->lock, flags);
> +
> + if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
> + /* Invalid or double-free, bail out. */
> + atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
> + kfence_report_error((unsigned long)addr, meta, KFENCE_ERROR_INVALID_FREE);
> + raw_spin_unlock_irqrestore(&meta->lock, flags);
> + return;
> + }
> +
> + /* Detect racy use-after-free, or incorrect reallocation of this page by KFENCE. */
> + kcsan_begin_scoped_access((void *)ALIGN_DOWN((unsigned long)addr, PAGE_SIZE), PAGE_SIZE,
> + KCSAN_ACCESS_SCOPED | KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT,
> + &assert_page_exclusive);
> +
> + if (CONFIG_KFENCE_FAULT_INJECTION)
> + kfence_unprotect((unsigned long)addr); /* To check canary bytes. */
> +
> + /* Restore page protection if there was an OOB access. */
> + if (meta->unprotected_page) {
> + kfence_protect(meta->unprotected_page);
> + meta->unprotected_page = 0;
> + }
> +
> + /* Check canary bytes for memory corruption. */
> + for_each_canary(meta, check_canary_byte);
> +
> + /*
> + * Clear memory if init-on-free is set. While we protect the page, the
> + * data is still there, and after a use-after-free is detected, we
> + * unprotect the page, so the data is still accessible.
> + */
> + if (unlikely(slab_want_init_on_free(meta->cache)))
> + memzero_explicit(addr, abs(meta->size));
> +
> + /* Mark the object as freed. */
> + metadata_update_state(meta, KFENCE_OBJECT_FREED);

> +
> + raw_spin_unlock_irqrestore(&meta->lock, flags);
> +

> + /* Protect to detect use-after-frees. */
> + kfence_protect((unsigned long)addr);
> +
> + /* Add it to the tail of the freelist for reuse. */
> + raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
> + KFENCE_WARN_ON(!list_empty(&meta->list));
> + list_add_tail(&meta->list, &kfence_freelist);
> + kcsan_end_scoped_access(&assert_page_exclusive);
> + raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
> +
> + atomic_long_dec(&counters[KFENCE_COUNTER_ALLOCATED]);
> + atomic_long_inc(&counters[KFENCE_COUNTER_FREES]);
> +}
> +
> +static void rcu_guarded_free(struct rcu_head *h)
> +{
> + struct kfence_metadata *meta = container_of(h, struct kfence_metadata, rcu_head);
> +
> + kfence_guarded_free((void *)meta->addr, meta);
> +}
> +
> +static bool __init kfence_initialize_pool(void)
> +{
> + unsigned long addr;
> + struct page *pages;
> + int i;
> +
> + if (!arch_kfence_initialize_pool())
> + return false;
> +
> + addr = (unsigned long)__kfence_pool;
> + pages = virt_to_page(addr);
> +
> + /*
> + * Set up non-redzone pages: they must have PG_slab set, to avoid
> + * freeing these as real pages.
> + *
> + * We also want to avoid inserting kfence_free() in the kfree()
> + * fast-path in SLUB, and therefore need to ensure kfree() correctly
> + * enters __slab_free() slow-path.
> + */
> + for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
> + if (!i || (i % 2))
> + continue;
> +
> + __SetPageSlab(&pages[i]);
> + }
> +
> + /*
> + * Protect the first 2 pages. The first page is mostly unnecessary, and
> + * merely serves as an extended guard page. However, adding one
> + * additional page in the beginning gives us an even number of pages,
> + * which simplifies the mapping of address to metadata index.
> + */
> + for (i = 0; i < 2; i++) {
> + if (unlikely(!kfence_protect(addr)))
> + return false;
> +
> + addr += PAGE_SIZE;
> + }
> +
> + for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
> + struct kfence_metadata *meta = &kfence_metadata[i];
> +
> + /* Initialize metadata. */
> + INIT_LIST_HEAD(&meta->list);
> + raw_spin_lock_init(&meta->lock);
> + meta->state = KFENCE_OBJECT_UNUSED;
> + meta->addr = addr; /* Initialize for validation in metadata_to_pageaddr(). */

> + list_add_tail(&meta->list, &kfence_freelist);
> +

> + /* Protect the right redzone. */
> + if (unlikely(!kfence_protect(addr + PAGE_SIZE)))
> + return false;
> +
> + addr += 2 * PAGE_SIZE;
> + }
> +
> + return true;
> +}
> +
> +/* === DebugFS Interface ==================================================== */
> +
> +static int stats_show(struct seq_file *seq, void *v)

> +{
> + int i;
> +

> + seq_printf(seq, "enabled: %i\n", READ_ONCE(kfence_enabled));
> + for (i = 0; i < KFENCE_COUNTER_COUNT; i++)
> + seq_printf(seq, "%s: %ld\n", counter_names[i], atomic_long_read(&counters[i]));

> +
> + return 0;
> +}

> +DEFINE_SHOW_ATTRIBUTE(stats);
> +
> +/*
> + * debugfs seq_file operations for /sys/kernel/debug/kfence/objects.
> + * start_object() and next_object() return the object index + 1, because NULL is used
> + * to stop iteration.
> + */
> +static void *start_object(struct seq_file *seq, loff_t *pos)
> +{
> + if (*pos < CONFIG_KFENCE_NUM_OBJECTS)
> + return (void *)((long)*pos + 1);

> + return NULL;
> +}
> +

> +static void stop_object(struct seq_file *seq, void *v)
> +{
> +}
> +
> +static void *next_object(struct seq_file *seq, void *v, loff_t *pos)
> +{
> + ++*pos;
> + if (*pos < CONFIG_KFENCE_NUM_OBJECTS)
> + return (void *)((long)*pos + 1);

> + return NULL;
> +}
> +

> +static int show_object(struct seq_file *seq, void *v)
> +{
> + struct kfence_metadata *meta = &kfence_metadata[(long)v - 1];

> + unsigned long flags;
> +

> + raw_spin_lock_irqsave(&meta->lock, flags);
> + kfence_print_object(seq, meta);
> + raw_spin_unlock_irqrestore(&meta->lock, flags);
> + seq_puts(seq, "---------------------------------\n");

> +
> + return 0;
> +}
> +

> +static const struct seq_operations object_seqops = {
> + .start = start_object,
> + .next = next_object,
> + .stop = stop_object,
> + .show = show_object,
> +};
> +
> +static int open_objects(struct inode *inode, struct file *file)
> +{
> + return seq_open(file, &object_seqops);
> +}
> +
> +static const struct file_operations objects_fops = {
> + .open = open_objects,
> + .read = seq_read,
> + .llseek = seq_lseek,
> +};
> +
> +static int __init kfence_debugfs_init(void)
> +{
> + struct dentry *kfence_dir = debugfs_create_dir("kfence", NULL);
> +
> + debugfs_create_file("stats", 0400, kfence_dir, NULL, &stats_fops);
> + debugfs_create_file("objects", 0400, kfence_dir, NULL, &objects_fops);

> + return 0;
> +}
> +

> +late_initcall(kfence_debugfs_init);
> +
> +/* === Allocation Gate Timer ================================================ */
> +
> +/*
> + * Set up delayed work, which will enable and disable the static key. We need to
> + * use a work queue (rather than a simple timer), since enabling and disabling a
> + * static key cannot be done from an interrupt.
> + */
> +static struct delayed_work kfence_timer;
> +static void toggle_allocation_gate(struct work_struct *work)
> +{
> + if (!READ_ONCE(kfence_enabled))
> + return;
> +
> + /* Enable static key, and await allocation to happen. */
> + atomic_set(&allocation_gate, 0);
> + static_branch_enable(&kfence_allocation_key);
> + wait_event(allocation_wait, atomic_read(&allocation_gate) != 0);
> +
> + /* Disable static key and reset timer. */
> + static_branch_disable(&kfence_allocation_key);
> + schedule_delayed_work(&kfence_timer, msecs_to_jiffies(kfence_sample_interval));
> +}
> +static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);
> +
> +/* === Public interface ===================================================== */
> +
> +void __init kfence_init(void)
> +{
> + /* Setting kfence_sample_interval to 0 on boot disables KFENCE. */
> + if (!kfence_sample_interval)
> + return;
> +
> + if (!kfence_initialize_pool()) {
> + pr_err("%s failed\n", __func__);
> + return;
> + }
> +
> + schedule_delayed_work(&kfence_timer, 0);
> + WRITE_ONCE(kfence_enabled, true);

Can toggle_allocation_gate run before we set kfence_enabled? If yes,
it can break. If not, it's still somewhat confusing.
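
For readers following along, the gate that toggle_allocation_gate() resets and __kfence_alloc() consumes can be modeled in user space. This is only a sketch, assuming C11 atomics in place of the kernel's atomic_t; gate_open() and gate_try_enter() are hypothetical names, and only the condition in gate_try_enter() mirrors the check in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int allocation_gate;	/* 0 means the gate is open */

/* Models the timer side: reset the gate so one allocation may pass. */
static void gate_open(void)
{
	atomic_store(&allocation_gate, 0);
}

/*
 * Models the allocation side. Returns true for exactly one caller per
 * gate_open(). The plain load comes first so that, once the gate is
 * closed, further callers do not keep writing to the shared counter.
 */
static bool gate_try_enter(void)
{
	if (atomic_load(&allocation_gate) != 0)
		return false;
	/* fetch_add returns the old value; only the first increment sees 0. */
	return atomic_fetch_add(&allocation_gate, 1) == 0;
}
```

After gate_open(), the first gate_try_enter() returns true and every later call returns false until the gate is opened again.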

> + pr_info("initialized - using %zu bytes for %d objects", KFENCE_POOL_SIZE,
> + CONFIG_KFENCE_NUM_OBJECTS);
> + if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
> + pr_cont(" at 0x%px-0x%px\n", (void *)__kfence_pool,
> + (void *)(__kfence_pool + KFENCE_POOL_SIZE));
> + else
> + pr_cont("\n");
> +}
> +
> +bool kfence_shutdown_cache(struct kmem_cache *s)
> +{
> + unsigned long flags;
> + struct kfence_metadata *meta;
> + int i;
> +
> + for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
> + bool in_use;
> +
> + meta = &kfence_metadata[i];
> +
> + /*
> + * If we observe some inconsistent cache and state pair where we
> + * should have returned false here, cache destruction is racing
> + * with either kmem_cache_alloc() or kmem_cache_free(). Taking
> + * the lock will not help, as different critical section
> + * serialization will have the same outcome.
> + */
> + if (READ_ONCE(meta->cache) != s ||
> + READ_ONCE(meta->state) != KFENCE_OBJECT_ALLOCATED)
> + continue;
> +
> + raw_spin_lock_irqsave(&meta->lock, flags);
> + in_use = meta->cache == s && meta->state == KFENCE_OBJECT_ALLOCATED;

> + raw_spin_unlock_irqrestore(&meta->lock, flags);
> +

> + if (in_use)

> + return false;
> + }
> +

> + for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
> + meta = &kfence_metadata[i];
> +
> + /* See above. */
> + if (READ_ONCE(meta->cache) != s || READ_ONCE(meta->state) != KFENCE_OBJECT_FREED)
> + continue;
> +
> + raw_spin_lock_irqsave(&meta->lock, flags);
> + if (meta->cache == s && meta->state == KFENCE_OBJECT_FREED)
> + meta->cache = NULL;

> + raw_spin_unlock_irqrestore(&meta->lock, flags);
> + }
> +

> + return true;
> +}
> +
> +void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
> +{
> + /*
> + * allocation_gate only needs to become non-zero, so it doesn't make
> + * sense to continue writing to it and pay the associated contention
> + * cost, in case we have a large number of concurrent allocations.
> + */
> + if (atomic_read(&allocation_gate) || atomic_inc_return(&allocation_gate) > 1)
> + return NULL;
> + wake_up(&allocation_wait);
> +
> + if (!READ_ONCE(kfence_enabled))
> + return NULL;
> +
> + if (size > PAGE_SIZE)
> + return NULL;
> +
> + return kfence_guarded_alloc(s, size, flags);
> +}

> +
> +size_t kfence_ksize(const void *addr)
> +{
> + const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> +
> + /*
> + * Read locklessly -- if there is a race with __kfence_alloc(), this
> + * most certainly is either a use-after-free, or invalid access.
> + */
> + return meta ? abs(meta->size) : 0;
> +}
> +
> +void *kfence_object_start(const void *addr)
> +{
> + const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> +
> + /*
> + * Read locklessly -- if there is a race with __kfence_alloc(), this
> + * most certainly is either a use-after-free, or invalid access.

> + */
> + return meta ? (void *)meta->addr : NULL;
> +}
> +
> +void __kfence_free(void *addr)
> +{
> + struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> +
> + if (unlikely(meta->cache->flags & SLAB_TYPESAFE_BY_RCU))

This may deserve a comment as to why we apply rcu on object level
whereas SLAB_TYPESAFE_BY_RCU means slab level only.

> + call_rcu(&meta->rcu_head, rcu_guarded_free);
> + else
> + kfence_guarded_free(addr, meta);
> +}
> +
> +bool kfence_handle_page_fault(unsigned long addr)
> +{
> + const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE;
> + struct kfence_metadata *to_report = NULL;
> + enum kfence_error_type error_type;
> + unsigned long flags;
> +
> + if (!is_kfence_address((void *)addr))
> + return false;
> +
> + if (!READ_ONCE(kfence_enabled)) /* If disabled at runtime ... */
> + return kfence_unprotect(addr); /* ... unprotect and proceed. */
> +
> + atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
> +
> + if (page_index % 2) {
> + /* This is a redzone, report a buffer overflow. */
> + struct kfence_metadata *meta = NULL;
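
As a reading aid for the page_index checks: a user-space sketch of the pool layout implied by kfence_initialize_pool() above, where pages 0 and 1 are guards and object i sits on page 2 + 2*i. The helper names and DEMO_PAGE_SIZE are invented for this example; the patch's own addr_to_metadata() is not quoted here, so treat the index formula as an inference from the init loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL

/* Page 0 and every odd page are guard pages; other even pages hold objects. */
static bool is_guard_page(unsigned long page_index)
{
	return page_index == 0 || page_index % 2 != 0;
}

/*
 * Map an address inside the pool to its object index, or -1 if the
 * address falls on a guard page. Assumes addr >= pool.
 */
static long object_index(uintptr_t pool, uintptr_t addr)
{
	unsigned long page_index = (addr - pool) / DEMO_PAGE_SIZE;

	if (is_guard_page(page_index))
		return -1;
	return (long)(page_index / 2) - 1;
}
```

With 255 objects the last object lands on page 510, so the pool spans 512 pages, matching (255 + 1) * 2.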

> diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
> new file mode 100644
> index 000000000000..25ce2c0dc092
> --- /dev/null
> +++ b/mm/kfence/kfence.h
> @@ -0,0 +1,104 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef MM_KFENCE_KFENCE_H
> +#define MM_KFENCE_KFENCE_H
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +
> +#include "../slab.h" /* for struct kmem_cache */
> +
> +/* For non-debug builds, avoid leaking kernel pointers into dmesg. */
> +#ifdef CONFIG_DEBUG_KERNEL
> +#define PTR_FMT "%px"
> +#else
> +#define PTR_FMT "%p"
> +#endif
> +

> +/*
> + * Get the canary byte pattern for @addr. Use a pattern that varies based on the
> + * lower 3 bits of the address, to detect memory corruptions with higher
> + * probability, where similar constants are used.
> + */
> +#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)addr & 0x7))

> +
> +/* Maximum stack depth for reports. */
> +#define KFENCE_STACK_DEPTH 64
> +
> +/* KFENCE object states. */
> +enum kfence_object_state {
> + KFENCE_OBJECT_UNUSED, /* Object is unused. */
> + KFENCE_OBJECT_ALLOCATED, /* Object is currently allocated. */
> + KFENCE_OBJECT_FREED, /* Object was allocated, and then freed. */
> +};
> +
> +/* KFENCE metadata per guarded allocation. */
> +struct kfence_metadata {
> + struct list_head list; /* Freelist node; access under kfence_freelist_lock. */
> + struct rcu_head rcu_head; /* For delayed freeing. */
> +
> + /*
> + * Lock protecting below data; to ensure consistency of the below data,
> + * since the following may execute concurrently: __kfence_alloc(),
> + * __kfence_free(), kfence_handle_page_fault(). However, note that we
> + * cannot grab the same metadata off the freelist twice, and multiple
> + * __kfence_alloc() cannot run concurrently on the same metadata.
> + */
> + raw_spinlock_t lock;
> +
> + /* The current state of the object; see above. */
> + enum kfence_object_state state;
> +
> + /*
> + * Allocated object address; cannot be calculated from size, because of
> + * alignment requirements.
> + *
> + * Invariant: ALIGN_DOWN(addr, PAGE_SIZE) is constant.
> + */
> + unsigned long addr;
> +
> + /*
> + * The size of the original allocation:
> + * size > 0: left page alignment
> + * size < 0: right page alignment
> + */
> + int size;
> +
> + /*
> + * The kmem_cache cache of the last allocation; NULL if never allocated
> + * or the cache has already been destroyed.
> + */
> + struct kmem_cache *cache;
> +
> + /*
> + * In case of an invalid access, the page that was unprotected; we
> + * optimistically only store address.
> + */
> + unsigned long unprotected_page;
> +
> + /* Allocation and free stack information. */
> + int num_alloc_stack;
> + int num_free_stack;
> + unsigned long alloc_stack[KFENCE_STACK_DEPTH];
> + unsigned long free_stack[KFENCE_STACK_DEPTH];
> +};
> +
> +extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
> +
> +/* KFENCE error types for report generation. */
> +enum kfence_error_type {
> + KFENCE_ERROR_OOB, /* Detected an out-of-bounds access. */
> + KFENCE_ERROR_UAF, /* Detected a use-after-free access. */
> + KFENCE_ERROR_CORRUPTION, /* Detected a memory corruption on free. */
> + KFENCE_ERROR_INVALID, /* Invalid access of unknown type. */
> + KFENCE_ERROR_INVALID_FREE, /* Invalid free. */
> +};
> +
> +void kfence_report_error(unsigned long address, const struct kfence_metadata *meta,
> + enum kfence_error_type type);
> +
> +void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta);
> +
> +#endif /* MM_KFENCE_KFENCE_H */
> diff --git a/mm/kfence/report.c b/mm/kfence/report.c
> new file mode 100644
> index 000000000000..8c28200e7433
> --- /dev/null
> +++ b/mm/kfence/report.c
> @@ -0,0 +1,201 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <stdarg.h>
> +
> +#include <linux/kernel.h>
> +#include <linux/lockdep.h>
> +#include <linux/printk.h>
> +#include <linux/seq_file.h>
> +#include <linux/stacktrace.h>
> +#include <linux/string.h>
> +
> +#include <asm/kfence.h>
> +
> +#include "kfence.h"
> +
> +/* Helper function to either print to a seq_file or to console. */
> +static void seq_con_printf(struct seq_file *seq, const char *fmt, ...)
> +{
> + va_list args;
> +
> + va_start(args, fmt);
> + if (seq)
> + seq_vprintf(seq, fmt, args);
> + else
> + vprintk(fmt, args);
> + va_end(args);
> +}
> +
> +/* Get the number of stack entries to skip to get out of MM internals. */
> +static int get_stack_skipnr(const unsigned long stack_entries[], int num_entries,
> + enum kfence_error_type type)
> +{
> + char buf[64];
> + int skipnr, fallback = 0;
> +
> + for (skipnr = 0; skipnr < num_entries; skipnr++) {
> + int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]);
> +
> + /* Depending on error type, find different stack entries. */
> + switch (type) {
> + case KFENCE_ERROR_UAF:
> + case KFENCE_ERROR_OOB:
> + case KFENCE_ERROR_INVALID:
> + if (!strncmp(buf, KFENCE_SKIP_ARCH_FAULT_HANDLER, len))
> + goto found;
> + break;
> + case KFENCE_ERROR_CORRUPTION:
> + case KFENCE_ERROR_INVALID_FREE:
> + if (str_has_prefix(buf, "kfence_") || str_has_prefix(buf, "__kfence_"))
> + fallback = skipnr + 1; /* In case kfree tail calls into kfence. */
> +
> + /* Also the *_bulk() variants by only checking prefixes. */
> + if (str_has_prefix(buf, "kfree") || str_has_prefix(buf, "kmem_cache_free"))
> + goto found;
> + break;
> + }
> + }
> + if (fallback < num_entries)
> + return fallback;
> +found:
> + skipnr++;
> + return skipnr < num_entries ? skipnr : 0;
> +}
> +
> +static void kfence_print_stack(struct seq_file *seq, const struct kfence_metadata *meta,
> + bool show_alloc)
> +{
> + const unsigned long *entries = show_alloc ? meta->alloc_stack : meta->free_stack;
> + const int nentries = show_alloc ? meta->num_alloc_stack : meta->num_free_stack;
> +
> + if (nentries) {
> + int i;
> +
> + /* stack_trace_seq_print() does not exist; open code our own. */
> + for (i = 0; i < nentries; i++)
> + seq_con_printf(seq, " %pS\n", entries[i]);
> + } else {
> + seq_con_printf(seq, " no %s stack\n", show_alloc ? "allocation" : "deallocation");
> + }
> +}
> +
> +void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta)
> +{
> + const int size = abs(meta->size);

This negative encoding is somewhat confusing. We do lots of abs, but
do we even look at the sign anywhere? I can't find any use that is not
abs.

> + const unsigned long start = meta->addr;
> + const struct kmem_cache *const cache = meta->cache;
> +
> + lockdep_assert_held(&meta->lock);
> +
> + if (meta->state == KFENCE_OBJECT_UNUSED) {
> + seq_con_printf(seq, "kfence-#%zd unused\n", meta - kfence_metadata);
> + return;
> + }
> +
> + seq_con_printf(seq,
> + "kfence-#%zd [0x" PTR_FMT "-0x" PTR_FMT
> + ", size=%d, cache=%s] allocated in:\n",
> + meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size,
> + (cache && cache->name) ? cache->name : "<destroyed>");
> + kfence_print_stack(seq, meta, true);
> +
> + if (meta->state == KFENCE_OBJECT_FREED) {
> + seq_con_printf(seq, "freed in:\n");
> + kfence_print_stack(seq, meta, false);
> + }
> +}
> +
> +/*
> + * Show bytes at @addr that are different from the expected canary values, up to
> + * @max_bytes.
> + */
> +static void print_diff_canary(const u8 *addr, size_t max_bytes)
> +{
> + const u8 *max_addr = min((const u8 *)PAGE_ALIGN((unsigned long)addr), addr + max_bytes);
> +
> + pr_cont("[");
> + for (; addr < max_addr; addr++) {
> + if (*addr == KFENCE_CANARY_PATTERN(addr))
> + pr_cont(" .");
> + else if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
> + pr_cont(" 0x%02x", *addr);
> + else /* Do not leak kernel memory in non-debug builds. */
> + pr_cont(" !");
> + }
> + pr_cont(" ]");
> +}
> +
> +void kfence_report_error(unsigned long address, const struct kfence_metadata *meta,
> + enum kfence_error_type type)
> +{
> + unsigned long stack_entries[KFENCE_STACK_DEPTH] = { 0 };
> + int num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 1);
> + int skipnr = get_stack_skipnr(stack_entries, num_stack_entries, type);
> +
> + /* KFENCE_ERROR_OOB requires non-NULL meta; for the rest it's optional. */
> + if (WARN_ON(type == KFENCE_ERROR_OOB && !meta))
> + return;
> +
> + if (meta)
> + lockdep_assert_held(&meta->lock);
> + /*
> + * Because we may generate reports in printk-unfriendly parts of the
> + * kernel, such as scheduler code, the use of printk() could deadlock.
> + * Until such time that all printing code here is safe in all parts of
> + * the kernel, accept the risk, and just get our message out (given the
> + * system might already behave unpredictably due to the memory error).
> + * As such, also disable lockdep to hide warnings, and avoid disabling
> + * lockdep for the rest of the kernel.
> + */
> + lockdep_off();
> +
> + pr_err("==================================================================\n");
> + /* Print report header. */
> + switch (type) {
> + case KFENCE_ERROR_OOB:
> + pr_err("BUG: KFENCE: out-of-bounds in %pS\n\n", (void *)stack_entries[skipnr]);
> + pr_err("Out-of-bounds access at 0x" PTR_FMT " (%s of kfence-#%zd):\n",
> + (void *)address, address < meta->addr ? "left" : "right",
> + meta - kfence_metadata);
> + break;
> + case KFENCE_ERROR_UAF:
> + pr_err("BUG: KFENCE: use-after-free in %pS\n\n", (void *)stack_entries[skipnr]);
> + pr_err("Use-after-free access at 0x" PTR_FMT ":\n", (void *)address);
> + break;
> + case KFENCE_ERROR_CORRUPTION:
> + pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]);
> + pr_err("Detected corrupted memory at 0x" PTR_FMT " ", (void *)address);
> + print_diff_canary((u8 *)address, 16);
> + pr_cont(":\n");
> + break;
> + case KFENCE_ERROR_INVALID:
> + pr_err("BUG: KFENCE: invalid access in %pS\n\n", (void *)stack_entries[skipnr]);
> + pr_err("Invalid access at 0x" PTR_FMT ":\n", (void *)address);
> + break;
> + case KFENCE_ERROR_INVALID_FREE:
> + pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]);
> + pr_err("Invalid free of 0x" PTR_FMT ":\n", (void *)address);
> + break;
> + }
> +
> + /* Print stack trace and object info. */
> + stack_trace_print(stack_entries + skipnr, num_stack_entries - skipnr, 0);
> +
> + if (meta) {
> + pr_err("\n");
> + kfence_print_object(NULL, meta);
> + }
> +
> + /* Print report footer. */
> + pr_err("\n");
> + dump_stack_print_info(KERN_DEFAULT);
> + pr_err("==================================================================\n");
> +
> + lockdep_on();
> +
> + if (panic_on_warn)
> + panic("panic_on_warn set ...\n");
> +
> + /* We encountered a memory unsafety error, taint the kernel! */
> + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> +}
> --
> 2.28.0.526.ge36021eeef-goog
>

Dmitry Vyukov

Sep 10, 2020, 11:48:46 AM9/10/20

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

Looking at the first report, I got the impression we are trying to skip
__kfence frames, but this includes it:

kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32,
cache=kmalloc-32] allocated in:

__kfence_alloc+0x42d/0x4c0
__kmalloc+0x133/0x200

Is it working as intended?

Alexander Potapenko

Sep 10, 2020, 12:19:24 PM9/10/20

to Dmitry Vyukov, Marco Elver, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Thu, Sep 10, 2020 at 5:43 PM Dmitry Vyukov <dvy...@google.com> wrote:

> > + /* Calculate address for this allocation. */
> > + if (right)
> > + meta->addr += PAGE_SIZE - size;
> > + meta->addr = ALIGN_DOWN(meta->addr, cache->align);
>
> I would move this ALIGN_DOWN under the (right) if.
> Do I understand it correctly that it will work, but we expect it to do
> nothing for !right? If cache align is >PAGE_SIZE, nothing good will
> happen anyway, right?
> The previous 2 lines look like part of the same calculation -- "figure
> out the addr for the right case".

Yes, makes sense.
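
For reference, a minimal userspace sketch of the address calculation with the ALIGN_DOWN moved under the `right` branch, as suggested above. The `sketch_*` names and the fixed 4 KiB page size are assumptions for illustration only; this is not the code from the patch.

```c
#include <stdbool.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/* Round x down to a multiple of a; a must be a power of two. */
static unsigned long sketch_align_down(unsigned long x, unsigned long a)
{
	return x & ~(a - 1);
}

/*
 * Hypothetical helper: compute where an object of `size` bytes is placed
 * within the page starting at `page_start`. Only the right-aligned case
 * needs an adjustment; a left-aligned object starts at the page boundary,
 * which is already aligned for any alignment up to the page size.
 */
static unsigned long sketch_object_addr(unsigned long page_start, unsigned long size,
					unsigned long align, bool right)
{
	unsigned long addr = page_start;

	if (right) {
		addr += SKETCH_PAGE_SIZE - size;
		addr = sketch_align_down(addr, align);
	}
	return addr;
}
```

With size 73 and an assumed alignment of 8, the right-aligned object starts at page offset 0xfb0 and leaves 7 bytes between the object and the guard page, which is the part that can only be covered by the pattern-based redzone.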

> > +
> > + schedule_delayed_work(&kfence_timer, 0);
> > + WRITE_ONCE(kfence_enabled, true);
>
> Can toggle_allocation_gate run before we set kfence_enabled? If yes,
> it can break. If not, it's still somewhat confusing.

Correct, it should go after we enable KFENCE. We'll fix that in v2.

> > +void __kfence_free(void *addr)
> > +{
> > + struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> > +
> > + if (unlikely(meta->cache->flags & SLAB_TYPESAFE_BY_RCU))
>
> This may deserve a comment as to why we apply rcu on object level
> whereas SLAB_TYPESAFE_BY_RCU means slab level only.

Sorry, what do you mean by "slab level"?
SLAB_TYPESAFE_BY_RCU means we have to wait for possible RCU accesses
in flight before freeing objects from that slab - that's basically
what we are doing here below:

> > + call_rcu(&meta->rcu_head, rcu_guarded_free);
> > + else
> > + kfence_guarded_free(addr, meta);
> > +}

> > +void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta)
> > +{
> > + const int size = abs(meta->size);
>
> This negative encoding is somewhat confusing. We do lots of abs, but
> do we even look at the sign anywhere? I can't find any use that is not
> abs.

I think initially there was a reason for this, but now we don't seem
to use it anywhere. Nice catch!

Alex

Marco Elver

Sep 10, 2020, 12:22:26 PM9/10/20

to Dmitry Vyukov, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

We're not skipping them for the allocation/free stacks. We can skip
the kfence+kmalloc frame as well.
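
One possible way to do that, sketched in userspace C over symbol names rather than return addresses. The prefix list and the `sketch_*` names are assumptions made for this illustration; they are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed set of prefixes treated as allocator/KFENCE internals. */
static const char *const sketch_internal_prefixes[] = {
	"kfence_", "__kfence_", "kmalloc", "__kmalloc", "kmem_cache_alloc",
};

/* Return true if the symbol name starts with one of the internal prefixes. */
static bool sketch_is_internal(const char *sym)
{
	size_t i;
	size_t n = sizeof(sketch_internal_prefixes) / sizeof(sketch_internal_prefixes[0]);

	for (i = 0; i < n; i++) {
		const char *prefix = sketch_internal_prefixes[i];

		if (strncmp(sym, prefix, strlen(prefix)) == 0)
			return true;
	}
	return false;
}

/*
 * Return the index of the first frame that is not an internal frame.
 * If every frame looks internal, return 0 so the full stack is shown.
 */
static int sketch_skip_alloc_frames(const char *const frames[], int num_frames)
{
	int i;

	for (i = 0; i < num_frames; i++) {
		if (!sketch_is_internal(frames[i]))
			return i;
	}
	return 0;
}
```

Applied to the allocation stack quoted above, this would skip `__kfence_alloc` and `__kmalloc`, so the stack would start at the next frame.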

Dmitry Vyukov

Sep 10, 2020, 1:11:54 PM9/10/20

to Alexander Potapenko, Marco Elver, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

Exactly! You see it is confusing :)
SLAB_TYPESAFE_BY_RCU does not mean that. rcu-freeing only applies to
whole pages, that's what I mean by "slab level" (whole slabs are freed
by rcu).

Marco Elver

Sep 10, 2020, 1:41:18 PM9/10/20

to Dmitry Vyukov, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

In the case here, we have to defer freeing the object, because unlike
real SLAB_TYPESAFE_BY_RCU slabs, our page here may get recycled for
other-typed objects. We can update the comment to be clearer.

Paul E. McKenney

Sep 10, 2020, 4:25:29 PM9/10/20

to Dmitry Vyukov, Alexander Potapenko, Marco Elver, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

Just confirming Dmitry's description of SLAB_TYPESAFE_BY_RCU semantics.

Thanx, Paul

Dmitry Vyukov

Sep 11, 2020, 3:05:06 AM9/11/20

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Mon, Sep 7, 2020 at 3:41 PM Marco Elver <el...@google.com> wrote:
>

> From: Alexander Potapenko <gli...@google.com>
>
> We make KFENCE compatible with KASAN for testing KFENCE itself. In
> particular, KASAN helps to catch any potential corruptions to KFENCE
> state, or other corruptions that may be a result of freepointer
> corruptions in the main allocators.
>
> To indicate that the combination of the two is generally discouraged,
> CONFIG_EXPERT=y should be set. It also gives us the nice property that
> KFENCE will be build-tested by allyesconfig builds.

>
> Co-developed-by: Marco Elver <el...@google.com>
> Signed-off-by: Marco Elver <el...@google.com>
> Signed-off-by: Alexander Potapenko <gli...@google.com>
> ---

> lib/Kconfig.kfence | 2 +-
> mm/kasan/common.c | 7 +++++++
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
> index 7ac91162edb0..b080e49e15d4 100644
> --- a/lib/Kconfig.kfence
> +++ b/lib/Kconfig.kfence
> @@ -10,7 +10,7 @@ config HAVE_ARCH_KFENCE_STATIC_POOL
>
> menuconfig KFENCE
> bool "KFENCE: low-overhead sampling-based memory safety error detector"
> - depends on HAVE_ARCH_KFENCE && !KASAN && (SLAB || SLUB)
> + depends on HAVE_ARCH_KFENCE && (!KASAN || EXPERT) && (SLAB || SLUB)
> depends on JUMP_LABEL # To ensure performance, require jump labels
> select STACKTRACE
> help
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 950fd372a07e..f5c49f0fdeff 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -18,6 +18,7 @@
> #include <linux/init.h>
> #include <linux/kasan.h>
> #include <linux/kernel.h>
> +#include <linux/kfence.h>
> #include <linux/kmemleak.h>
> #include <linux/linkage.h>
> #include <linux/memblock.h>
> @@ -396,6 +397,9 @@ static bool __kasan_slab_free(struct kmem_cache *cache, void *object,
> tagged_object = object;
> object = reset_tag(object);
>
> + if (is_kfence_address(object))
> + return false;

Is this needed?
At least in the slab patch I see that we do:

if (kfence_free(objp)) {
kmemleak_free_recursive(objp, cachep->flags);
return;
}

before:

/* Put the object into the quarantine, don't touch it for now. */
if (kasan_slab_free(cachep, objp, _RET_IP_))
return;

If it's not supposed to be triggered, it can make sense to replace
with BUG/WARN.

> if (unlikely(nearest_obj(cache, virt_to_head_page(object), object) !=
> object)) {
> kasan_report_invalid_free(tagged_object, ip);
> @@ -444,6 +448,9 @@ static void *__kasan_kmalloc(struct kmem_cache *cache, const void *object,
> if (unlikely(object == NULL))
> return NULL;
>
> + if (is_kfence_address(object))
> + return (void *)object;
> +
> redzone_start = round_up((unsigned long)(object + size),
> KASAN_SHADOW_SCALE_SIZE);
> redzone_end = round_up((unsigned long)object + cache->object_size,
> --
> 2.28.0.526.ge36021eeef-goog
>

Dmitry Vyukov

Sep 11, 2020, 3:14:24 AM9/11/20

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Mon, Sep 7, 2020 at 3:41 PM Marco Elver <el...@google.com> wrote:
>

> Add KFENCE documentation in dev-tools/kfence.rst, and add to index.
>
> Co-developed-by: Alexander Potapenko <gli...@google.com>
> Signed-off-by: Alexander Potapenko <gli...@google.com>
> Signed-off-by: Marco Elver <el...@google.com>
> ---
> Documentation/dev-tools/index.rst | 1 +
> Documentation/dev-tools/kfence.rst | 285 +++++++++++++++++++++++++++++
> 2 files changed, 286 insertions(+)
> create mode 100644 Documentation/dev-tools/kfence.rst
>
> diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
> index f7809c7b1ba9..1b1cf4f5c9d9 100644
> --- a/Documentation/dev-tools/index.rst
> +++ b/Documentation/dev-tools/index.rst
> @@ -22,6 +22,7 @@ whole; patches welcome!
> ubsan
> kmemleak
> kcsan
> + kfence
> gdb-kernel-debugging
> kgdb
> kselftest
> diff --git a/Documentation/dev-tools/kfence.rst b/Documentation/dev-tools/kfence.rst
> new file mode 100644
> index 000000000000..254f4f089104
> --- /dev/null
> +++ b/Documentation/dev-tools/kfence.rst
> @@ -0,0 +1,285 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Kernel Electric-Fence (KFENCE)
> +==============================
> +
> +Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety
> +error detector. KFENCE detects heap out-of-bounds access, use-after-free, and
> +invalid-free errors.
> +
> +KFENCE is designed to be enabled in production kernels, and has near zero
> +performance overhead. Compared to KASAN, KFENCE trades performance for
> +precision. The main motivation behind KFENCE's design is that with enough
> +total uptime KFENCE will detect bugs in code paths not typically exercised by
> +non-production test workloads. One way to quickly achieve a large enough total
> +uptime is when the tool is deployed across a large fleet of machines.
> +
> +Usage
> +-----
> +
> +To enable KFENCE, configure the kernel with::
> +
> + CONFIG_KFENCE=y
> +
> +KFENCE provides several other configuration options to customize behaviour (see
> +the respective help text in ``lib/Kconfig.kfence`` for more info).
> +
> +Tuning performance
> +~~~~~~~~~~~~~~~~~~
> +

> +The most important parameter is KFENCE's sample interval, which can be set via
> +the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
> +sample interval determines the frequency with which heap allocations will be
> +guarded by KFENCE. The default is configurable via the Kconfig option
> +``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
> +disables KFENCE.
> +
> +With the Kconfig option ``CONFIG_KFENCE_NUM_OBJECTS`` (default 255), the number
> +of available guarded objects can be controlled. Each object requires 2 pages,
> +one for the object itself and the other one used as a guard page; object pages
> +are interleaved with guard pages, and every object page is therefore surrounded
> +by two guard pages.

> +
> +The total memory dedicated to the KFENCE memory pool can be computed as::
> +
> + ( #objects + 1 ) * 2 * PAGE_SIZE
> +
> +Using the default config, and assuming a page size of 4 KiB, results in
> +dedicating 2 MiB to the KFENCE memory pool.
> +
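
As a quick check of the pool-size formula above, a minimal sketch in C. The function name is hypothetical; the inputs are the default of 255 objects and the 4 KiB page size assumed in the text.

```c
#include <stddef.h>

/*
 * Hypothetical helper mirroring the documented formula:
 *   (#objects + 1) * 2 * PAGE_SIZE
 */
static size_t sketch_kfence_pool_size(size_t num_objects, size_t page_size)
{
	return (num_objects + 1) * 2 * page_size;
}
```

For 255 objects and 4096-byte pages this gives 256 * 2 * 4096 = 2097152 bytes, i.e. the 2 MiB stated in the documentation.
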
> +Error reports
> +~~~~~~~~~~~~~
> +
> +A typical out-of-bounds access looks like this::
> +
> + ==================================================================
> + BUG: KFENCE: out-of-bounds in test_out_of_bounds_read+0xa3/0x22b
> +
> + Out-of-bounds access at 0xffffffffb672efff (left of kfence-#17):
> + test_out_of_bounds_read+0xa3/0x22b
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated in:
> + __kfence_alloc+0x42d/0x4c0
> + __kmalloc+0x133/0x200
> + test_alloc+0xf3/0x25b
> + test_out_of_bounds_read+0x98/0x22b
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7
> + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> + ==================================================================
> +
> +The header of the report provides a short summary of the function involved in
> +the access. It is followed by more detailed information about the access and
> +its origin.
> +
> +Use-after-free accesses are reported as::
> +
> + ==================================================================
> + BUG: KFENCE: use-after-free in test_use_after_free_read+0xb3/0x143
> +
> + Use-after-free access at 0xffffffffb673dfe0:
> + test_use_after_free_read+0xb3/0x143
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated in:
> + __kfence_alloc+0x277/0x4c0
> + __kmalloc+0x133/0x200
> + test_alloc+0xf3/0x25b
> + test_use_after_free_read+0x76/0x143
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30

Empty line between stacks for consistency and readability.

> + freed in:
> + kfence_guarded_free+0x158/0x380
> + __kfence_free+0x38/0xc0
> + test_use_after_free_read+0xa8/0x143
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
> + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> + ==================================================================
> +
> +KFENCE also reports on invalid frees, such as double-frees::
> +
> + ==================================================================
> + BUG: KFENCE: invalid free in test_double_free+0xdc/0x171
> +
> + Invalid free of 0xffffffffb6741000:
> + test_double_free+0xdc/0x171
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated in:
> + __kfence_alloc+0x42d/0x4c0
> + __kmalloc+0x133/0x200
> + test_alloc+0xf3/0x25b
> + test_double_free+0x76/0x171
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> + freed in:
> + kfence_guarded_free+0x158/0x380
> + __kfence_free+0x38/0xc0
> + test_double_free+0xa8/0x171
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
> + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> + ==================================================================
> +
> +KFENCE also uses pattern-based redzones on the other side of an object's guard
> +page, to detect out-of-bounds writes on the unprotected side of the object.
> +These are reported on frees::
> +
> + ==================================================================
> + BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184
> +
> + Detected corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . ]:
> + test_kmalloc_aligned_oob_write+0xef/0x184
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated in:
> + __kfence_alloc+0x277/0x4c0
> + __kmalloc+0x133/0x200
> + test_alloc+0xf3/0x25b
> + test_kmalloc_aligned_oob_write+0x57/0x184
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
> + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> + ==================================================================
> +
> +For such errors, the address where the corruption occurred as well as the
> +corrupt bytes are shown.
> +
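
To illustrate how the bracketed byte list in such a report comes about, here is a userspace sketch modelled on print_diff_canary() from the report.c patch. The canary function is an assumed stand-in for KFENCE_CANARY_PATTERN(), whose definition is not part of this mail, and the `sketch_*` names and the 4 KiB page size are hypothetical as well.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed page size for this sketch. */
#define SKETCH_PAGE_SIZE ((uintptr_t)4096)

/* Assumed stand-in for KFENCE_CANARY_PATTERN(); the real macro may differ. */
static uint8_t sketch_canary(uintptr_t addr)
{
	return (uint8_t)(0xaa ^ (addr & 0x7));
}

/*
 * Render the canary diff for up to max_bytes starting at address vaddr,
 * stopping at the next page boundary. mem[i] holds the byte at vaddr + i.
 * Matching bytes are shown as ".", mismatches as hex. Returns the number
 * of bytes inspected, or 0 if the output buffer is too small.
 */
static size_t sketch_diff_canary(char *out, size_t outlen, uintptr_t vaddr,
				 const uint8_t *mem, size_t max_bytes)
{
	/* Same rounding as PAGE_ALIGN(): round up to the next page boundary. */
	uintptr_t page_end = (vaddr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
	size_t n = (size_t)(page_end - vaddr) < max_bytes ? (size_t)(page_end - vaddr) : max_bytes;
	size_t pos = 0, i;

	/* Worst case: "[" + 5 chars per byte + " ]" + NUL terminator. */
	if (outlen < 5 * n + 4) {
		if (outlen > 0)
			out[0] = '\0';
		return 0;
	}

	pos += (size_t)snprintf(out + pos, outlen - pos, "[");
	for (i = 0; i < n; i++) {
		if (mem[i] == sketch_canary(vaddr + i))
			pos += (size_t)snprintf(out + pos, outlen - pos, " .");
		else
			pos += (size_t)snprintf(out + pos, outlen - pos, " 0x%02x", mem[i]);
	}
	snprintf(out + pos, outlen - pos, " ]");
	return n;
}
```

With a start address ending in 0xff9, only 7 bytes remain before the page boundary, so at most 7 entries are shown even though up to 16 are requested; that matches the shape of the example report above.
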
> +And finally, KFENCE may also report on invalid accesses to any protected page
> +where it was not possible to determine an associated object, e.g. if adjacent
> +object pages had not yet been allocated::
> +
> + ==================================================================
> + BUG: KFENCE: invalid access in test_invalid_access+0x26/0xe0
> +
> + Invalid access at 0xffffffffb670b00a:
> + test_invalid_access+0x26/0xe0
> + kunit_try_run_case+0x51/0x85
> + kunit_generic_run_threadfn_adapter+0x16/0x30
> + kthread+0x137/0x160
> + ret_from_fork+0x22/0x30
> +
> + CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
> + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> + ==================================================================
> +
> +DebugFS interface
> +~~~~~~~~~~~~~~~~~
> +
> +Some debugging information is exposed via debugfs:
> +
> +* The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics.
> +
> +* The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects
> + allocated via KFENCE, including those already freed but protected.
> +
> +Implementation Details
> +----------------------
> +

> +Guarded allocations are set up based on the sample interval. After expiration
> +of the sample interval, a guarded allocation from the KFENCE object pool is
> +returned to the main allocator (SLAB or SLUB). At this point, the timer is
> +reset, and the next allocation is set up after the expiration of the interval.
> +To "gate" a KFENCE allocation through the main allocator's fast-path without
> +overhead, KFENCE relies on static branches via the static keys infrastructure.
> +The static branch is toggled to redirect the allocation to KFENCE.
> +
> +KFENCE objects each reside on a dedicated page, at either the left or right

Do we mention anywhere explicitly that KFENCE currently only supports
allocations <=page_size?
May be worth mentioning. It kinda follows from implementation but
quite implicitly. One may also be confused assuming KFENCE handles
larger allocations, but then not being able to figure out.

> +page boundaries selected at random. The pages to the left and right of the
> +object page are "guard pages", whose attributes are changed to a protected
> +state, and cause page faults on any attempted access. Such page faults are then
> +intercepted by KFENCE, which handles the fault gracefully by reporting an
> +out-of-bounds access. The side opposite of an object's guard page is used as a
> +pattern-based redzone, to detect out-of-bounds writes on the unprotected side of
> +the object on frees (for special alignment and size combinations, both sides of
> +the object are redzoned).
> +
> +KFENCE also uses pattern-based redzones on the other side of an object's guard
> +page, to detect out-of-bounds writes on the unprotected side of the object;
> +these are reported on frees.
> +
> +The following figure illustrates the page layout::
> +
> + ---+-----------+-----------+-----------+-----------+-----------+---
> + | xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx |
> + | xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx |
> + | x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x |
> + | xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx |
> + | xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx |
> + | xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx |
> + ---+-----------+-----------+-----------+-----------+-----------+---
> +

> +Upon deallocation of a KFENCE object, the object's page is again protected and
> +the object is marked as freed. Any further access to the object causes a fault
> +and KFENCE reports a use-after-free access. Freed objects are inserted at the
> +tail of KFENCE's freelist, so that the least recently freed objects are reused
> +first, and the chances of detecting use-after-frees of recently freed objects
> +are increased.

> +
> +Interface
> +---------
> +
> +The following describes the functions which are used by allocators as well as
> +page handling code to set up and deal with KFENCE allocations.
> +
> +.. kernel-doc:: include/linux/kfence.h
> + :functions: is_kfence_address
> + kfence_shutdown_cache
> + kfence_alloc kfence_free
> + kfence_ksize kfence_object_start
> + kfence_handle_page_fault
> +
> +Related Tools
> +-------------
> +
> +In userspace, a similar approach is taken by `GWP-ASan
> +<http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and
> +a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is
> +directly influenced by GWP-ASan, and can be seen as its kernel sibling. Another
> +similar but non-sampling approach, that also inspired the name "KFENCE", can be
> +found in the userspace `Electric Fence Malloc Debugger
> +<https://linux.die.net/man/3/efence>`_.
> +
> +In the kernel, several tools exist to debug memory access errors, and in
> +particular KASAN can detect all bug classes that KFENCE can detect. While KASAN
> +is more precise, relying on compiler instrumentation, this comes at a
> +performance cost. We want to highlight that KASAN and KFENCE are complementary,
> +with different target environments. For instance, KASAN is the better
> +debugging-aid, where a simple reproducer exists: due to the lower chance to
> +detect the error, it would require more effort using KFENCE to debug.
> +Deployments at scale, however, would benefit from using KFENCE to discover bugs
> +due to code paths not exercised by test cases or fuzzers.
> --
> 2.28.0.526.ge36021eeef-goog
>

Dmitry Vyukov

Sep 11, 2020, 3:17:27 AM9/11/20

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Mon, Sep 7, 2020 at 3:41 PM Marco Elver <el...@google.com> wrote:
>

> From: Alexander Potapenko <gli...@google.com>
>
> Inserts KFENCE hooks into the SLAB allocator.
>
> We note the addition of the 'orig_size' argument to slab_alloc*()
> functions, to be able to pass the originally requested size to KFENCE.
> When KFENCE is disabled, there is no additional overhead, since these
> functions are __always_inline.
>
> Co-developed-by: Marco Elver <el...@google.com>
> Signed-off-by: Marco Elver <el...@google.com>
> Signed-off-by: Alexander Potapenko <gli...@google.com>
> ---
> mm/slab.c | 46 ++++++++++++++++++++++++++++++++++------------
> mm/slab_common.c | 6 +++++-
> 2 files changed, 39 insertions(+), 13 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 3160dff6fd76..30aba06ae02b 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -100,6 +100,7 @@
> #include <linux/seq_file.h>
> #include <linux/notifier.h>
> #include <linux/kallsyms.h>
> +#include <linux/kfence.h>
> #include <linux/cpu.h>
> #include <linux/sysctl.h>
> #include <linux/module.h>
> @@ -3206,7 +3207,7 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
> }
>
> static __always_inline void *
> -slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
> +slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_size,
> unsigned long caller)
> {
> unsigned long save_flags;
> @@ -3219,6 +3220,10 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
> if (unlikely(!cachep))
> return NULL;
>
> + ptr = kfence_alloc(cachep, orig_size, flags);
> + if (unlikely(ptr))
> + goto out_hooks;
> +
> cache_alloc_debugcheck_before(cachep, flags);
> local_irq_save(save_flags);
>
> @@ -3251,6 +3256,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
> if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
> memset(ptr, 0, cachep->object_size);
>
> +out_hooks:
> slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr);
> return ptr;
> }
> @@ -3288,7 +3294,7 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
> #endif /* CONFIG_NUMA */
>
> static __always_inline void *
> -slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
> +slab_alloc(struct kmem_cache *cachep, gfp_t flags, size_t orig_size, unsigned long caller)
> {
> unsigned long save_flags;
> void *objp;
> @@ -3299,6 +3305,10 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
> if (unlikely(!cachep))
> return NULL;
>
> + objp = kfence_alloc(cachep, orig_size, flags);
> + if (unlikely(objp))
> + goto leave;
> +
> cache_alloc_debugcheck_before(cachep, flags);
> local_irq_save(save_flags);
> objp = __do_cache_alloc(cachep, flags);
> @@ -3309,6 +3319,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
> if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
> memset(objp, 0, cachep->object_size);
>
> +leave:
> slab_post_alloc_hook(cachep, objcg, flags, 1, &objp);
> return objp;
> }
> @@ -3414,6 +3425,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp,
> unsigned long caller)
> {
> + if (kfence_free(objp)) {
> + kmemleak_free_recursive(objp, cachep->flags);
> + return;
> + }
> +
> /* Put the object into the quarantine, don't touch it for now. */
> if (kasan_slab_free(cachep, objp, _RET_IP_))
> return;
> @@ -3479,7 +3495,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
> */
> void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
> {
> - void *ret = slab_alloc(cachep, flags, _RET_IP_);
> + void *ret = slab_alloc(cachep, flags, cachep->object_size, _RET_IP_);

It's kinda minor, but since we are talking about the malloc fast path:
would passing 0 instead of cachep->object_size (here and everywhere
else), and then falling back to cachep->object_size on the slow path
when 0 is passed as the size, improve codegen?

> trace_kmem_cache_alloc(_RET_IP_, ret,
> cachep->object_size, cachep->size, flags);
> @@ -3512,7 +3528,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>
> local_irq_disable();
> for (i = 0; i < size; i++) {
> - void *objp = __do_cache_alloc(s, flags);
> + void *objp = kfence_alloc(s, s->object_size, flags) ?: __do_cache_alloc(s, flags);
>
> if (unlikely(!objp))
> goto error;
> @@ -3545,7 +3561,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
> {
> void *ret;
>
> - ret = slab_alloc(cachep, flags, _RET_IP_);
> + ret = slab_alloc(cachep, flags, size, _RET_IP_);
>
> ret = kasan_kmalloc(cachep, ret, size, flags);
> trace_kmalloc(_RET_IP_, ret,
> @@ -3571,7 +3587,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_trace);
> */
> void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
> {
> - void *ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
> + void *ret = slab_alloc_node(cachep, flags, nodeid, cachep->object_size, _RET_IP_);
>
> trace_kmem_cache_alloc_node(_RET_IP_, ret,
> cachep->object_size, cachep->size,
> @@ -3589,7 +3605,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep,
> {
> void *ret;
>
> - ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);
> + ret = slab_alloc_node(cachep, flags, nodeid, size, _RET_IP_);
>
> ret = kasan_kmalloc(cachep, ret, size, flags);
> trace_kmalloc_node(_RET_IP_, ret,
> @@ -3650,7 +3666,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
> cachep = kmalloc_slab(size, flags);
> if (unlikely(ZERO_OR_NULL_PTR(cachep)))
> return cachep;
> - ret = slab_alloc(cachep, flags, caller);
> + ret = slab_alloc(cachep, flags, size, caller);
>
> ret = kasan_kmalloc(cachep, ret, size, flags);
> trace_kmalloc(caller, ret,
> @@ -4138,18 +4154,24 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
> bool to_user)
> {
> struct kmem_cache *cachep;
> - unsigned int objnr;
> + unsigned int objnr = 0;
> unsigned long offset;
> + bool is_kfence = is_kfence_address(ptr);
>
> ptr = kasan_reset_tag(ptr);
>
> /* Find and validate object. */
> cachep = page->slab_cache;
> - objnr = obj_to_index(cachep, page, (void *)ptr);
> - BUG_ON(objnr >= cachep->num);
> + if (!is_kfence) {
> + objnr = obj_to_index(cachep, page, (void *)ptr);
> + BUG_ON(objnr >= cachep->num);
> + }
>
> /* Find offset within object. */
> - offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep);
> + if (is_kfence_address(ptr))
> + offset = ptr - kfence_object_start(ptr);
> + else
> + offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep);
>
> /* Allow address range falling entirely within usercopy region. */
> if (offset >= cachep->useroffset &&
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index f9ccd5dc13f3..6e35e273681a 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -12,6 +12,7 @@
> #include <linux/memory.h>
> #include <linux/cache.h>
> #include <linux/compiler.h>
> +#include <linux/kfence.h>
> #include <linux/module.h>
> #include <linux/cpu.h>
> #include <linux/uaccess.h>
> @@ -448,6 +449,9 @@ static int shutdown_cache(struct kmem_cache *s)
> /* free asan quarantined objects */
> kasan_cache_shutdown(s);
>
> + if (!kfence_shutdown_cache(s))
> + return -EBUSY;
> +
> if (__kmem_cache_shutdown(s) != 0)
> return -EBUSY;
>
> @@ -1171,7 +1175,7 @@ size_t ksize(const void *objp)
> if (unlikely(ZERO_OR_NULL_PTR(objp)) || !__kasan_check_read(objp, 1))
> return 0;
>
> - size = __ksize(objp);
> + size = kfence_ksize(objp) ?: __ksize(objp);
> /*
> * We assume that ksize callers could use whole allocated area,
> * so we need to unpoison this area.
> --
> 2.28.0.526.ge36021eeef-goog
>

Dmitry Vyukov

Sep 11, 2020, 3:36:10 AM

to Marco Elver, Vlastimil Babka, Dave Hansen, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

Hi Marco,

I reviewed patches and they look good to me (minus some local comments
that I've left).

The main question/concern I have is what Vlastimil mentioned re
long-aged objects.
Are the default sample interval values reasonable for typical
workloads? Do we have any guidelines on choosing the sample interval?
Should it depend on workload/use pattern?
By "reasonable" I mean if the pool will last long enough to still
sample something after hours/days? Have you tried any experiments with
some workload (both short-lived processes and long-lived
processes/namespaces) capturing state of the pool? It may make sense
to do so to better understand the dynamics. I suspect that the rate may need
to be orders of magnitude lower.

Also I am wondering about the boot process (both kernel and init).
It's both inherently almost the same for the whole population of
machines and inherently produces persistent objects. Should we lower
the rate for the first minute of uptime? Or maybe make it proportional
to uptime?

I feel it's quite an important aspect. We can have this awesome idea
and implementation, but radically lower its utility by using a bad
sampling value (which will have a silent "failure mode" -- no bugs
detected).

But to make it clear: all of this does not conflict with the merge of
the first version. Just having a tunable sampling interval is good
enough. We will get the ultimate understanding only when we start
using it widely anyway.

Marco Elver

Sep 11, 2020, 3:46:14 AM

to Dmitry Vyukov, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

Done for v2.

We can add a note that "allocation sizes up to PAGE_SIZE are supported".

Thanks,
-- Marco

Marco Elver

Sep 11, 2020, 8:03:17 AM

to Dmitry Vyukov, Vlastimil Babka, Dave Hansen, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov <dvy...@google.com> wrote:
> On Tue, Sep 8, 2020 at 5:56 PM Marco Elver <el...@google.com> wrote:
> > On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote:

[...]

> > > Hmm did you observe that with this limit, a long-running system would eventually
> > > converge to KFENCE memory pool being filled with long-aged objects, so there
> > > would be no space to sample new ones?
> >
> > Sure, that's a possibility. But remember that we're not trying to
> > deterministically detect bugs on 1 system (if you wanted that, you
> > should use KASAN), but a fleet of machines! The non-determinism of which
> > allocations will end up in KFENCE, will ensure we won't end up with a
> > fleet of machines of identical allocations. That's exactly what we're
> > after. Even if we eventually exhaust the pool, you'll still detect bugs
> > if there are any.
> >
> > If you are overly worried, either the sample interval or number of
> > available objects needs to be tweaked to be larger. The default of 255
> > is quite conservative, and even using something larger on a modern
> > system is hardly noticeable. Choosing a sample interval & number of
> > objects should also factor in how many machines you plan to deploy this
> > on. Monitoring /sys/kernel/debug/kfence/stats can help you here.
>
> Hi Marco,
>
> I reviewed patches and they look good to me (minus some local comments
> that I've left).

Thank you.

> The main question/concern I have is what Vlastimil mentioned re
> long-aged objects.
> Are the default sample interval values reasonable for typical
> workloads? Do we have any guidelines on choosing the sample interval?
> Should it depend on workload/use pattern?

As I hinted at before, the sample interval & number of objects need
to depend on:
- number of machines,
- workload,
- acceptable overhead (performance, memory).

However, workload can vary greatly, and something more dynamic may be
needed. We do have the option to monitor
/sys/kernel/debug/kfence/stats and even change the sample interval at
runtime, e.g. from a user space tool that checks the currently used
objects and, as the pool gets closer to exhaustion, starts increasing
/sys/module/kfence/parameters/sample_interval.

Of course, if we figure out the best dynamic policy, we can add this
policy into the kernel. But I don't think it makes sense to hard-code
such a policy right now.
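
To make the idea of such a user space tool more concrete, here is a
minimal sketch in C. It is only an illustration: the "currently
allocated:" label is an assumption about the stats format, and the 90%
threshold and doubling step are a hypothetical policy, not something
the series implements. Reading the stats file and writing
sample_interval are left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the value following "currently allocated:" from stats text.
 * The label is an assumption about the stats format. Returns -1 if the
 * label or its value cannot be found.
 */
static long example_parse_allocated(const char *stats)
{
	static const char label[] = "currently allocated:";
	const char *p;
	long value;

	assert(stats != NULL);
	p = strstr(stats, label);
	if (!p || sscanf(p + sizeof(label) - 1, "%ld", &value) != 1)
		return -1;
	return value;
}

/*
 * Hypothetical policy: double the interval (i.e. halve the sampling
 * rate) once more than 90% of the pool is in use, otherwise keep it.
 */
static unsigned long example_next_interval(long allocated, long num_objects,
					   unsigned long interval_ms)
{
	assert(num_objects > 0);
	if (allocated * 10 > num_objects * 9)
		return interval_ms * 2;
	return interval_ms;
}
```

A tool built on this would periodically read
/sys/kernel/debug/kfence/stats, pass the text to
example_parse_allocated(), and write the result of
example_next_interval() to
/sys/module/kfence/parameters/sample_interval.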

> By "reasonable" I mean if the pool will last long enough to still
> sample something after hours/days? Have you tried any experiments with
> some workload (both short-lived processes and long-lived
> processes/namespaces) capturing state of the pool? It may make sense
> to do so to better understand the dynamics. I suspect that the rate may need
> to be orders of magnitude lower.

Yes, the current default sample interval is a lower bound, and is also
a reasonable default for testing. I expect real deployments to use
much higher sample intervals (lower rate).

So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that the
number of allocated KFENCE objects isn't artificially capped):

-- With a mostly vanilla config + KFENCE (sample interval 100 ms),
after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects
(total allocations >600). Those aren't always the same objects, with
roughly ~2 allocations/frees per second.

-- Then running sysbench I/O benchmark, KFENCE objects allocated peak
at 82. During the benchmark, allocations/frees per second are closer
to 10-15. After the benchmark, the KFENCE objects allocated remain at
82, and allocations/frees per second fall back to ~2.

-- For the same system, changing the sample interval to 1 ms (echo 1 >
/sys/module/kfence/parameters/sample_interval), and re-running the
benchmark gives me: KFENCE objects allocated peak at exactly 500, with
~500 allocations/frees per second. After that, allocated KFENCE
objects dropped a little to 496, and allocations/frees per second fell
back to ~2.

-- The long-lived objects are due to caches, and just running 'echo 1
> /proc/sys/vm/drop_caches' reduced allocated KFENCE objects back to
45.

> Also I am wondering about the boot process (both kernel and init).
> It's both inherently almost the same for the whole population of
> machines and inherently produces persistent objects. Should we lower
> the rate for the first minute of uptime? Or maybe make it proportional
> to uptime?

It should depend on current usage, which is dependent on the workload.
I don't think uptime helps much, as seen above. If we imagine a user
space tool that tweaks this for us, we can initialize KFENCE with a
very large sample interval, and once booted, this user space
tool/script adjusts /sys/module/kfence/parameters/sample_interval.

At the very least, I think I'll just make
/sys/module/kfence/parameters/sample_interval root-writable
unconditionally, so that we can experiment with such a tool.

Lowering the rate for the first minute of uptime might also be an
option, although if we do that, we can also just move kfence_init() to
the end of start_kernel(). IMHO, it still makes sense to
sample normally during boot, because who knows how those allocations
are used with different workloads once the kernel is live. With a
sample interval of 1000 ms (which is closer to what we probably want
in production), I see no more than 20 KFENCE objects allocated after
boot. I think we can live with that.

> I feel it's quite an important aspect. We can have this awesome idea
> and implementation, but radically lower its utility by using a bad
> sampling value (which will have a silent "failure mode" -- no bugs
> detected).

As a first step, I think monitoring the entire fleet is key here
(collect /sys/kernel/debug/kfence/stats). Essentially, as long as
allocations/frees per second remain >0, we're probably fine, even if
we always run at the maximum number of allocated KFENCE objects.

An improvement over allocations/frees per second >0 would be
dynamically tweaking sample_interval based on how close we get to max
KFENCE objects allocated.
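
As a rough sketch of the "allocations/frees per second >0" check: the
rate can be derived from two readings of a cumulative counter taken a
known time apart. The helper names below are hypothetical, and the
counter is assumed to be something like a "total allocations" value
from the stats file.

```c
#include <assert.h>

/*
 * Allocations per second between two readings of a monotonically
 * increasing counter taken `seconds` apart.
 */
static double example_alloc_rate(unsigned long before, unsigned long after,
				 double seconds)
{
	assert(after >= before);
	assert(seconds > 0.0);
	return (double)(after - before) / seconds;
}

/* Hypothetical health check: the pool is still turning over objects. */
static int example_still_sampling(double rate_per_sec)
{
	return rate_per_sec > 0.0;
}
```

A machine for which example_still_sampling() stays false over a long
window has probably filled its pool with long-lived objects and is a
candidate for a different sample interval or pool size.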

Yet another option is to skip KFENCE allocations based on the memcache
name, e.g. for those caches dedicated to long-lived allocations.

> But to make it clear: all of this does not conflict with the merge of
> the first version. Just having a tunable sampling interval is good
> enough. We will get the ultimate understanding only when we start
> using it widely anyway.

Thanks,
-- Marco

Marco Elver

Sep 11, 2020, 8:24:21 AM

to Dmitry Vyukov, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

It doesn't save us much, maybe 1 instruction based on what I'm looking
at right now. The main worry I have is that the 'orig_size' argument
is now part of slab_alloc, and changing its semantics may cause
problems in future if it's no longer just passed to kfence_alloc().
Today, we can do the 'size = size ?: cache->object_size' trick inside
kfence_alloc(), but at the cost of breaking the intuitive semantics of
slab_alloc's orig_size argument for future users. Is it worth it?
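
For reference, the trick in question would look roughly like this. It
is a userspace sketch, not code from the series: struct example_cache
is a hypothetical stand-in for struct kmem_cache, and the "?:" form
with an omitted middle operand is a GNU C extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kmem_cache; only the field used here. */
struct example_cache {
	size_t object_size;
};

/*
 * Resolve the size KFENCE would see: a caller-supplied 0 means "use the
 * cache's object size". Relies on the GNU "?:" extension.
 */
static size_t example_effective_size(const struct example_cache *cache,
				     size_t size)
{
	assert(cache != NULL);
	return size ?: cache->object_size;
}
```

With this, kmem_cache_alloc()-style callers could pass 0 while
kmalloc()-style callers keep passing the requested size, which is
exactly the change in the meaning of orig_size that I'm wary of.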

Thanks,
-- Marco

Marco Elver

Sep 11, 2020, 9:01:07 AM

to Dmitry Vyukov, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

It is required for SLUB. For SLAB, it seems it might not be necessary.
Making the check in kasan/common.c conditional on the allocator seems
ugly, so I propose we keep it there.

Thanks,
-- Marco

Dmitry Vyukov

Sep 11, 2020, 9:03:59 AM

to Marco Elver, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

I don't have an answer to this question. I will leave this to others.
If nobody has strong support for changing semantics, let's leave it as
is. Maybe keep it in mind as potential ballast.
FWIW, misuse of a 0 size for other future purposes would most likely
manifest itself in a quite straightforward way.

Dmitry Vyukov

Sep 11, 2020, 9:10:00 AM

to Marco Elver, Vlastimil Babka, Dave Hansen, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

Interesting. What type of caches are these? If there is some type of
cache that caches particularly lots of sampled objects, we could
potentially change the cache to release sampled objects eagerly.

Marco Elver

Sep 11, 2020, 9:33:57 AM

to Dmitry Vyukov, Vlastimil Babka, Dave Hansen, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

On Fri, 11 Sep 2020 at 15:10, Dmitry Vyukov <dvy...@google.com> wrote:
> On Fri, Sep 11, 2020 at 2:03 PM Marco Elver <el...@google.com> wrote:
> > On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov <dvy...@google.com> wrote:

[...]

The 2 major users of KFENCE objects for that workload are
'buffer_head' and 'bio-0'.

If we want to deal with those, I guess there are 2 options:

1. More complex, but more precise: make the users of them check
is_kfence_address() and release their buffers earlier.

2. Simpler, generic solution: make KFENCE stop returning allocations for
non-kmalloc_caches memcaches after more than ~90% of the pool is
exhausted. This assumes that creators of long-lived objects usually
set up their own memcaches.

I'm currently inclined to go for (2).

Thanks,
-- Marco

Marco Elver

Sep 11, 2020, 12:34:05 PM

to Dmitry Vyukov, Vlastimil Babka, Dave Hansen, Alexander Potapenko, Andrew Morton, Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim, Mark Rutland, Pekka Enberg, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

Ok, after some offline chat, we determined that (2) would be premature
and we can't really say if kmalloc should have precedence if we reach
some usage threshold. So for now, let's just leave it as-is and start
with the recommendation to monitor and adjust based on usage, fleet
size, etc.

Thanks,
-- Marco

Marco Elver

Sep 15, 2020, 9:21:00 AM

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors. This
series enables KFENCE for the x86 and arm64 architectures, and adds
KFENCE hooks to the SLAB and SLUB allocators.

KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades precision
for performance. The main motivation behind KFENCE's design is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is to deploy the tool across a large
fleet of machines.

KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error.
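
The placement described above can be sketched with plain address
arithmetic. This is an illustrative userspace model, not the code in
mm/kfence/core.c: the pool layout (two leading pages, then alternating
object and guard pages), the 4 KiB page size and all helper names are
assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the example. */
#define EXAMPLE_PAGE_SIZE 4096UL

/*
 * Address of the object page for object `index`, assuming the layout
 * [2 leading pages][object 0][guard][object 1][guard]...
 */
static uintptr_t example_object_page(uintptr_t pool, unsigned long index)
{
	return pool + (index + 1) * 2 * EXAMPLE_PAGE_SIZE;
}

/*
 * Place an object at the left boundary of its page, or at the right
 * boundary rounded down to `align` (which must be a power of two).
 */
static uintptr_t example_object_addr(uintptr_t page, size_t size,
				     size_t align, bool right)
{
	assert(size > 0 && size <= EXAMPLE_PAGE_SIZE);
	assert(align > 0 && (align & (align - 1)) == 0);
	if (!right)
		return page;
	return (page + EXAMPLE_PAGE_SIZE - size) & ~((uintptr_t)align - 1);
}
```

A right-placed object ends at (or, after alignment, just before) the
following guard page, so an out-of-bounds access past its end faults
either immediately or after a few padding bytes. A left-placed object
catches accesses before its start in the same way.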

Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval,
the next allocation through the main allocator (SLAB or SLUB) returns a
guarded allocation from the KFENCE object pool. At this point, the timer
is reset, and the next allocation is set up after the expiration of the
interval.

To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE.
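
The interplay of the sample interval and the allocation gate can be
modelled in a few lines of userspace C. This is only a sketch of the
behaviour described above: a plain boolean stands in for the static
branch, and the names are hypothetical rather than taken from the
patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the static-branch "allocation gate". */
struct example_gate {
	bool open;
};

/* Called when the sample interval expires: arm the next guarded allocation. */
static void example_timer_expired(struct example_gate *gate)
{
	assert(gate != NULL);
	gate->open = true;
}

/*
 * Called on every allocation. Returns true if this allocation should be
 * served from the guarded pool. The gate closes again until the next
 * timer expiry, so at most one allocation per interval is sampled.
 */
static bool example_should_sample(struct example_gate *gate)
{
	assert(gate != NULL);
	if (!gate->open)
		return false;
	gate->open = false;
	return true;
}
```

In the kernel the check on the fast path is a patched branch rather
than a memory load, which is what keeps the common (gate closed) case
essentially free.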

The KFENCE memory pool is of fixed size, and if the pool is exhausted no
further KFENCE allocations occur. The default config is conservative
with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB
pages).
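
The 2 MiB figure follows from two pages per object (the object page
plus one guard page) and two additional leading pages. A small sketch
of the arithmetic, assuming 4 KiB pages and using a hypothetical helper
name:

```c
#include <assert.h>

/* Assumed page size for the example. */
#define EXAMPLE_PAGE_SIZE 4096UL

/*
 * Pool size in bytes: one object page and one guard page per object,
 * plus two leading pages so the total page count stays even.
 */
static unsigned long example_pool_bytes(unsigned long num_objects)
{
	assert(num_objects > 0);
	return (num_objects + 1) * 2 * EXAMPLE_PAGE_SIZE;
}
```

For 255 objects this gives 256 * 2 * 4096 = 2097152 bytes, i.e. 2 MiB.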

We have verified by running synthetic benchmarks (sysbench I/O,
hackbench) that a kernel with KFENCE is performance-neutral compared to
a non-KFENCE baseline kernel.

KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
properties. The name "KFENCE" is a homage to the Electric Fence Malloc
Debugger [2].

For more details, see Documentation/dev-tools/kfence.rst added in the
series -- also viewable here:

https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst

[1] http://llvm.org/docs/GwpAsan.html
[2] https://linux.die.net/man/3/efence

v2:
* Various comment/documentation changes (see details in patches).
* Various smaller fixes (see details in patches).
* Change all reports to reference the kfence object, "kfence-#nn".
* Skip allocation/free internals stack trace.
* Rework KMEMLEAK compatibility patch.

RFC/v1: https://lkml.kernel.org/r/20200907134055....@google.com

Alexander Potapenko (6):
mm: add Kernel Electric-Fence infrastructure
x86, kfence: enable KFENCE for x86
mm, kfence: insert KFENCE hooks for SLAB
mm, kfence: insert KFENCE hooks for SLUB
kfence, kasan: make KFENCE compatible with KASAN
kfence, kmemleak: make KFENCE compatible with KMEMLEAK

Marco Elver (4):
arm64, kfence: enable KFENCE for ARM64
kfence, lockdep: make KFENCE compatible with lockdep
kfence, Documentation: add KFENCE documentation
kfence: add test suite

Documentation/dev-tools/index.rst | 1 +
Documentation/dev-tools/kfence.rst | 291 +++++++++++
MAINTAINERS | 11 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/kfence.h | 39 ++
arch/arm64/mm/fault.c | 4 +
arch/x86/Kconfig | 2 +
arch/x86/include/asm/kfence.h | 60 +++
arch/x86/mm/fault.c | 4 +
include/linux/kfence.h | 174 +++++++
init/main.c | 2 +
kernel/locking/lockdep.c | 8 +
lib/Kconfig.debug | 1 +
lib/Kconfig.kfence | 78 +++
mm/Makefile | 1 +
mm/kasan/common.c | 7 +
mm/kfence/Makefile | 6 +
mm/kfence/core.c | 733 +++++++++++++++++++++++++++
mm/kfence/kfence.h | 102 ++++
mm/kfence/kfence_test.c | 777 +++++++++++++++++++++++++++++
mm/kfence/report.c | 219 ++++++++
mm/kmemleak.c | 6 +
mm/slab.c | 46 +-
mm/slab_common.c | 6 +-
mm/slub.c | 72 ++-
25 files changed, 2619 insertions(+), 32 deletions(-)
create mode 100644 Documentation/dev-tools/kfence.rst
create mode 100644 arch/arm64/include/asm/kfence.h
create mode 100644 arch/x86/include/asm/kfence.h
create mode 100644 include/linux/kfence.h
create mode 100644 lib/Kconfig.kfence
create mode 100644 mm/kfence/Makefile
create mode 100644 mm/kfence/core.c
create mode 100644 mm/kfence/kfence.h
create mode 100644 mm/kfence/kfence_test.c
create mode 100644 mm/kfence/report.c

--
2.28.0.618.gf4bc123cb7-goog

Marco Elver

Sep 15, 2020, 9:21:03 AM

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

From: Alexander Potapenko <gli...@google.com>

This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors.

KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades precision
for performance. The main motivation behind KFENCE's design is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is to deploy the tool across a large
fleet of machines.

KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error.

Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval, a
guarded allocation from the KFENCE object pool is returned to the main
allocator (SLAB or SLUB). At this point, the timer is reset, and the
next allocation is set up after the expiration of the interval.

To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. To date, we have verified by running synthetic
benchmarks (sysbench I/O workloads) that a kernel compiled with KFENCE
is performance-neutral compared to the non-KFENCE baseline.

For more details, see Documentation/dev-tools/kfence.rst (added later in
the series).

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

v2:
* Add missing __printf attribute to seq_con_printf, and fix new warning.
[reported by kernel test robot <l...@intel.com>]
* Fix up some comments [reported by Jonathan Cameron].
* Remove 2 cases of redundant stack variable initialization
[reported by Jonathan Cameron].
* Fix printf format [reported by kernel test robot <l...@intel.com>].
* Print (in kfence-#nn) after address, to more clearly establish link
between first and second stacktrace [reported by Andrey Konovalov].
* Make choice between KASAN and KFENCE clearer in Kconfig help text
[suggested by Dave Hansen].
* Document CONFIG_KFENCE_SAMPLE_INTERVAL=0.
* Shorten memory corruption report line length.
* Make /sys/module/kfence/parameters/sample_interval root-writable for
all builds (to enable debugging, automatic dynamic tweaking).
* Reports by Dmitry Vyukov:
* Do not store negative size for right-located objects
* Only cache-align addresses of right-located objects.
* Run toggle_allocation_gate() after KFENCE is enabled.
* Add empty line between allocation and free stacks.
* Add comment about SLAB_TYPESAFE_BY_RCU.
* Also skip internals for allocation/free stacks.
* s/KFENCE_FAULT_INJECTION/KFENCE_STRESS_TEST_FAULTS/ as FAULT_INJECTION
is already overloaded in different contexts.
* Parenthesis for macro variable.
* Lower max of KFENCE_NUM_OBJECTS config variable.
---
MAINTAINERS | 11 +
include/linux/kfence.h | 174 ++++++++++
init/main.c | 2 +
lib/Kconfig.debug | 1 +
lib/Kconfig.kfence | 65 ++++
mm/Makefile | 1 +
mm/kfence/Makefile | 3 +
mm/kfence/core.c | 733 +++++++++++++++++++++++++++++++++++++++++
mm/kfence/kfence.h | 102 ++++++
mm/kfence/report.c | 219 ++++++++++++
10 files changed, 1311 insertions(+)
create mode 100644 include/linux/kfence.h
create mode 100644 lib/Kconfig.kfence
create mode 100644 mm/kfence/Makefile
create mode 100644 mm/kfence/core.c
create mode 100644 mm/kfence/kfence.h
create mode 100644 mm/kfence/report.c

diff --git a/MAINTAINERS b/MAINTAINERS
index b5cfab015bd6..863899ed9a29 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9673,6 +9673,17 @@ F: include/linux/keyctl.h
F: include/uapi/linux/keyctl.h
F: security/keys/

+KFENCE
+M: Alexander Potapenko <gli...@google.com>
+M: Marco Elver <el...@google.com>
+R: Dmitry Vyukov <dvy...@google.com>
+L: kasa...@googlegroups.com
+S: Maintained
+F: Documentation/dev-tools/kfence.rst
+F: include/linux/kfence.h
+F: lib/Kconfig.kfence
+F: mm/kfence/
+
KFIFO
M: Stefani Seibold <ste...@seibold.net>
S: Maintained
diff --git a/include/linux/kfence.h b/include/linux/kfence.h
new file mode 100644
index 000000000000..8128ba7b5e90
--- /dev/null
+++ b/include/linux/kfence.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_KFENCE_H
+#define _LINUX_KFENCE_H
+
+#include <linux/mm.h>
+#include <linux/percpu.h>
+#include <linux/static_key.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_KFENCE
+
+/*
+ * We allocate an even number of pages, as it simplifies calculations to map
+ * address to metadata indices; effectively, the very first page serves as an
+ * extended guard page, but otherwise has no special purpose.
+ */
+#define KFENCE_POOL_SIZE ((CONFIG_KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE)
+#ifdef CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL
+extern char __kfence_pool[KFENCE_POOL_SIZE];
+#else
+extern char *__kfence_pool;
+#endif
+
+extern struct static_key_false kfence_allocation_key;
+
+/**
+ * is_kfence_address() - check if an address belongs to KFENCE pool
+ * @addr: address to check
+ *
+ * Return: true or false depending on whether the address is within the KFENCE
+ * object range.
+ *
+ * KFENCE objects live in a separate page range and are not to be intermixed
+ * with regular heap objects (e.g. KFENCE objects must never be added to the
+ * allocator freelists). Failing to do so may and will result in heap
+ * corruptions, therefore is_kfence_address() must be used to check whether
+ * an object requires specific handling.
+ */
+static __always_inline bool is_kfence_address(const void *addr)
+{
+ return unlikely((char *)addr >= __kfence_pool &&
+ (char *)addr < __kfence_pool + KFENCE_POOL_SIZE);
+}
+
+/**
+ * kfence_init() - perform KFENCE initialization at boot time
+ */
+void kfence_init(void);
+
+/**
+ * kfence_shutdown_cache() - handle shutdown_cache() for KFENCE objects
+ * @s: cache being shut down
+ *
+ * Return: true on success, false if any leftover objects persist.
+ *
+ * Before shutting down a cache, one must ensure there are no remaining objects
+ * allocated from it. KFENCE objects are not referenced from the cache, so
+ * kfence_shutdown_cache() takes care of them.
+ */
+bool __must_check kfence_shutdown_cache(struct kmem_cache *s);
+
+/*
+ * Allocate a KFENCE object. Allocators must not call this function directly,
+ * use kfence_alloc() instead.
+ */
+void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags);
+
+/**
+ * kfence_alloc() - allocate a KFENCE object with a low probability
+ * @s: struct kmem_cache with object requirements
+ * @size: exact size of the object to allocate (can be less than @s->size
+ * e.g. for kmalloc caches)
+ * @flags: GFP flags
+ *
+ * Return:
+ * * NULL - must proceed with allocating as usual,
+ * * non-NULL - pointer to a KFENCE object.
+ *
+ * kfence_alloc() should be inserted into the heap allocation fast path,
+ * allowing it to transparently return KFENCE-allocated objects with a low
+ * probability using a static branch (the probability is controlled by the
+ * kfence.sample_interval boot parameter).
+ */
+static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
+{
+ return static_branch_unlikely(&kfence_allocation_key) ? __kfence_alloc(s, size, flags) :
+ NULL;
+}
+
+/**
+ * kfence_ksize() - get actual amount of memory allocated for a KFENCE object
+ * @addr: pointer to a heap object
+ *
+ * Return:
+ * * 0 - not a KFENCE object, must call __ksize() instead,
+ * * non-0 - this many bytes can be accessed without causing a memory error.
+ *
+ * kfence_ksize() returns the number of bytes requested for a KFENCE object at
+ * allocation time. This number may be less than the object size of the
+ * corresponding struct kmem_cache.
+ */
+size_t kfence_ksize(const void *addr);
+
+/**
+ * kfence_object_start() - find the beginning of a KFENCE object
+ * @addr: address within a KFENCE-allocated object
+ *
+ * Return: address of the beginning of the object.
+ *
+ * SL[AU]B-allocated objects are laid out within a page one by one, so it is
+ * easy to calculate the beginning of an object given a pointer inside it and
+ * the object size. The same is not true for KFENCE, which places a single
+ * object at either end of the page. This helper function is used to find the
+ * beginning of a KFENCE-allocated object.
+ */
+void *kfence_object_start(const void *addr);
+
+/*
+ * Release a KFENCE-allocated object to KFENCE pool. Allocators must not call
+ * this function directly, use kfence_free() instead.
+ */
+void __kfence_free(void *addr);
+
+/**
+ * kfence_free() - try to release an arbitrary heap object to KFENCE pool
+ * @addr: object to be freed
+ *
+ * Return:
+ * * false - object doesn't belong to KFENCE pool and was ignored,
+ * * true - object was released to KFENCE pool.
+ *
+ * Release a KFENCE object and mark it as freed. May be called on any object,
+ * even non-KFENCE objects, to simplify integration of the hooks into the
+ * allocator's free codepath. The allocator must check the return value to
+ * determine if it was a KFENCE object or not.
+ */
+static __always_inline __must_check bool kfence_free(void *addr)
+{
+ if (!is_kfence_address(addr))
+ return false;
+ __kfence_free(addr);
+ return true;
+}
+
+/**
+ * kfence_handle_page_fault() - perform page fault handling for KFENCE pages
+ * @addr: faulting address
+ *
+ * Return:
+ * * false - address outside KFENCE pool,
+ * * true - page fault handled by KFENCE, no additional handling required.
+ *
+ * A page fault inside KFENCE pool indicates a memory error, such as an
+ * out-of-bounds access, a use-after-free or an invalid memory access. In these
+ * cases KFENCE prints an error message and marks the offending page as
+ * present, so that the kernel can proceed.
+ */
+bool __must_check kfence_handle_page_fault(unsigned long addr);
+
+#else /* CONFIG_KFENCE */
+
+static inline bool is_kfence_address(const void *addr) { return false; }
+static inline void kfence_init(void) { }
+static inline bool __must_check kfence_shutdown_cache(struct kmem_cache *s) { return true; }
+static inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) { return NULL; }
+static inline size_t kfence_ksize(const void *addr) { return 0; }
+static inline void *kfence_object_start(const void *addr) { return NULL; }
+static inline bool __must_check kfence_free(void *addr) { return false; }
+static inline bool __must_check kfence_handle_page_fault(unsigned long addr) { return false; }
+
+#endif
+
+#endif /* _LINUX_KFENCE_H */
diff --git a/init/main.c b/init/main.c
index ae78fb68d231..ec7de9dc1ed8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -39,6 +39,7 @@
#include <linux/security.h>
#include <linux/smp.h>
#include <linux/profile.h>
+#include <linux/kfence.h>
#include <linux/rcupdate.h>
#include <linux/moduleparam.h>
#include <linux/kallsyms.h>
@@ -942,6 +943,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
hrtimers_init();
softirq_init();
timekeeping_init();
+ kfence_init();

/*
* For best initial stack canary entropy, prepare it after:
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e068c3c7189a..d09c6a306532 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -880,6 +880,7 @@ config DEBUG_STACKOVERFLOW
If in doubt, say "N".

source "lib/Kconfig.kasan"
+source "lib/Kconfig.kfence"

endmenu # "Memory Debugging"

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
new file mode 100644
index 000000000000..6a90fef41832
--- /dev/null
+++ b/lib/Kconfig.kfence
@@ -0,0 +1,65 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config HAVE_ARCH_KFENCE
+ bool
+
+config HAVE_ARCH_KFENCE_STATIC_POOL
+ bool
+ help
+ If the architecture supports using the static pool.
+
+menuconfig KFENCE
+ bool "KFENCE: low-overhead sampling-based memory safety error detector"
+ depends on HAVE_ARCH_KFENCE && !KASAN && (SLAB || SLUB)
+ depends on JUMP_LABEL # To ensure performance, require jump labels
+ select STACKTRACE
+ help
+ KFENCE is a low-overhead sampling-based detector for heap out-of-bounds
+ access, use-after-free, and invalid-free errors. KFENCE is designed
+ to have negligible cost to permit enabling it in production
+ environments.
+
+ See <file:Documentation/dev-tools/kfence.rst> for more details.
+
+ Note that KFENCE is not a substitute for explicit testing with tools
+ such as KASAN. KFENCE can detect a subset of bugs that KASAN can
+ detect, albeit at very different performance profiles. If you can
+ afford to use KASAN, continue using KASAN, for example in test
+ environments. If your kernel targets production use, and cannot
+ enable KASAN due to its cost, consider using KFENCE.
+
+if KFENCE
+
+config KFENCE_SAMPLE_INTERVAL
+ int "Default sample interval in milliseconds"
+ default 100
+ help
+ The KFENCE sample interval determines the frequency with which heap
+ allocations will be guarded by KFENCE. May be overridden via boot
+ parameter "kfence.sample_interval".
+
+ Set this to 0 to disable KFENCE by default, in which case only
+ setting "kfence.sample_interval" to a non-zero value enables KFENCE.
+
+config KFENCE_NUM_OBJECTS
+ int "Number of guarded objects available"
+ default 255
+ range 1 16383
+ help
+ The number of guarded objects available. For each KFENCE object, 2
+ pages are required: one containing the object and one guard page, so
+ that each object page has a guard page on either side.
+
+config KFENCE_STRESS_TEST_FAULTS
+ int "Stress testing of fault handling and error reporting"
+ default 0
+ depends on EXPERT
+ help
+ The inverse probability with which to randomly protect KFENCE object
+ pages, resulting in spurious use-after-frees. The main purpose of
+ this option is to stress test KFENCE with concurrent error reports
+ and allocations/frees. A value of 0 disables stress testing logic.
+
+ The option is only to test KFENCE; set to 0 if you are unsure.
+
+endif # KFENCE
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..afdf1ae0900b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -81,6 +81,7 @@ obj-$(CONFIG_PAGE_POISONING) += page_poison.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KASAN) += kasan/
+obj-$(CONFIG_KFENCE) += kfence/
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_MEMTEST) += memtest.o
diff --git a/mm/kfence/Makefile b/mm/kfence/Makefile
new file mode 100644
index 000000000000..d991e9a349f0
--- /dev/null
+++ b/mm/kfence/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_KFENCE) := core.o report.o
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
new file mode 100644
index 000000000000..dabc43251577
--- /dev/null
+++ b/mm/kfence/core.c
@@ -0,0 +1,733 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "kfence: " fmt
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/debugfs.h>
+#include <linux/kcsan-checks.h>
+#include <linux/kfence.h>
+#include <linux/list.h>
+#include <linux/lockdep.h>
+#include <linux/moduleparam.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/string.h>
+
+#include <asm/kfence.h>
+
+#include "kfence.h"
+
+/* Disables KFENCE on the first warning assuming an irrecoverable error. */
+#define KFENCE_WARN_ON(cond) \
+ ({ \
+ const bool __cond = WARN_ON(cond); \
+ if (unlikely(__cond)) \
+ WRITE_ONCE(kfence_enabled, false); \
+ __cond; \
+ })
+
+#ifndef CONFIG_KFENCE_STRESS_TEST_FAULTS /* Only defined with CONFIG_EXPERT. */
+#define CONFIG_KFENCE_STRESS_TEST_FAULTS 0
+#endif
+
+/* === Data ================================================================= */
+
+static unsigned long kfence_sample_interval __read_mostly = CONFIG_KFENCE_SAMPLE_INTERVAL;
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+#define MODULE_PARAM_PREFIX "kfence."
+module_param_named(sample_interval, kfence_sample_interval, ulong, 0600);
+
+static bool kfence_enabled __read_mostly;
+
+/*
+ * The pool of pages used for guard pages and objects. If supported, allocated
+ * statically, so that is_kfence_address() avoids a pointer load, and simply
+ * compares against a constant address. Assume that if KFENCE is compiled into
+ * the kernel, it is usually enabled, and the space is to be allocated one way
+ * or another.
+ */
+#ifdef CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL
+char __kfence_pool[KFENCE_POOL_SIZE] __aligned(KFENCE_POOL_ALIGNMENT);
+#else
+char *__kfence_pool __read_mostly;
+#endif
+EXPORT_SYMBOL(__kfence_pool); /* Export for test modules. */
+
+/*
+ * Per-object metadata, with one-to-one mapping of object metadata to
+ * backing pages (in __kfence_pool).
+ */
+static_assert(CONFIG_KFENCE_NUM_OBJECTS > 0);
+struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
+
+/* Freelist with available objects. */
+static struct list_head kfence_freelist = LIST_HEAD_INIT(kfence_freelist);
+static DEFINE_RAW_SPINLOCK(kfence_freelist_lock); /* Lock protecting freelist. */
+
+/* The static key to set up a KFENCE allocation. */
+DEFINE_STATIC_KEY_FALSE(kfence_allocation_key);
+
+/* Gates the allocation, ensuring only one succeeds in a given period. */
+static atomic_t allocation_gate = ATOMIC_INIT(1);
+
+/* Wait queue to wake up allocation-gate timer task. */
+static DECLARE_WAIT_QUEUE_HEAD(allocation_wait);
+
+/* Statistics counters for debugfs. */
+enum kfence_counter_id {
+ KFENCE_COUNTER_ALLOCATED,
+ KFENCE_COUNTER_ALLOCS,
+ KFENCE_COUNTER_FREES,
+ KFENCE_COUNTER_BUGS,
+ KFENCE_COUNTER_COUNT,
+};
+static atomic_long_t counters[KFENCE_COUNTER_COUNT];
+static const char *const counter_names[] = {
+ [KFENCE_COUNTER_ALLOCATED] = "currently allocated",
+ [KFENCE_COUNTER_ALLOCS] = "total allocations",
+ [KFENCE_COUNTER_FREES] = "total frees",
+ [KFENCE_COUNTER_BUGS] = "total bugs",
+};
+static_assert(ARRAY_SIZE(counter_names) == KFENCE_COUNTER_COUNT);
+
+/* === Internals ============================================================ */
+
+static bool kfence_protect(unsigned long addr)
+{
+ return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), true));
+}
+
+static bool kfence_unprotect(unsigned long addr)
+{
+ return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), false));
+}
+
+static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
+{
+ long index;
+
+ /* The checks do not affect performance; only called from slow-paths. */
+
+ if (!is_kfence_address((void *)addr))
+ return NULL;
+
+ /*
+ * May be an invalid index if called with an address at the edge of
+ * __kfence_pool, in which case we would report an "invalid access"
+ * error.
+ */
+ index = ((addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2)) - 1;
+ if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
+ return NULL;
+
+ return &kfence_metadata[index];
+}
+
+static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *meta)
+{
+ unsigned long offset = ((meta - kfence_metadata) + 1) * PAGE_SIZE * 2;
+ unsigned long pageaddr = (unsigned long)&__kfence_pool[offset];
+
+ /* The checks do not affect performance; only called from slow-paths. */
+
+ /* Only call with a pointer into kfence_metadata. */
+ if (KFENCE_WARN_ON(meta < kfence_metadata ||
+ meta >= kfence_metadata + ARRAY_SIZE(kfence_metadata)))
+ return 0;
+
+ /*
+ * This metadata object only ever maps to 1 page; verify that the calculated
+ * page address matches the stored address, i.e. the latter was not corrupted.
+ */
+ if (KFENCE_WARN_ON(ALIGN_DOWN(meta->addr, PAGE_SIZE) != pageaddr))
+ return 0;
+
+ return pageaddr;
+}
+
+/*
+ * Update the object's metadata state, including updating the alloc/free stacks
+ * depending on the state transition.
+ */
+static noinline void metadata_update_state(struct kfence_metadata *meta,
+ enum kfence_object_state next)
+{
+ unsigned long *entries = next == KFENCE_OBJECT_FREED ? meta->free_stack : meta->alloc_stack;
+ /*
+ * Skip over 1 (this) function; noinline ensures we do not accidentally
+ * skip over the caller by never inlining.
+ */
+ const int nentries = stack_trace_save(entries, KFENCE_STACK_DEPTH, 1);
+
+ lockdep_assert_held(&meta->lock);
+
+ if (next == KFENCE_OBJECT_FREED)
+ meta->num_free_stack = nentries;
+ else
+ meta->num_alloc_stack = nentries;
+
+ /*
+ * Pairs with READ_ONCE() in
+ * kfence_shutdown_cache(),
+ * kfence_handle_page_fault().
+ */
+ WRITE_ONCE(meta->state, next);
+}
+
+/* Write canary byte to @addr. */
+static inline bool set_canary_byte(u8 *addr)
+{
+ *addr = KFENCE_CANARY_PATTERN(addr);
+ return true;
+}
+
+/* Check canary byte at @addr. */
+static inline bool check_canary_byte(u8 *addr)
+{
+ if (*addr == KFENCE_CANARY_PATTERN(addr))
+ return true;
+
+ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
+ kfence_report_error((unsigned long)addr, addr_to_metadata((unsigned long)addr),
+ KFENCE_ERROR_CORRUPTION);
+ return false;
+}
+
+static inline void for_each_canary(const struct kfence_metadata *meta, bool (*fn)(u8 *))
+{
+ unsigned long addr;
+
+ lockdep_assert_held(&meta->lock);
+
+ for (addr = ALIGN_DOWN(meta->addr, PAGE_SIZE); addr < meta->addr; addr++) {
+ if (!fn((u8 *)addr))
+ break;
+ }
+
+ for (addr = meta->addr + meta->size; addr < PAGE_ALIGN(meta->addr); addr++) {
+ if (!fn((u8 *)addr))
+ break;
+ }
+}
+
+static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp)
+{
+ struct kfence_metadata *meta = NULL;
+ unsigned long flags;
+ void *addr;
+
+ /* Try to obtain a free object. */
+ raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+ if (!list_empty(&kfence_freelist)) {
+ meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
+ list_del_init(&meta->list);
+ }
+ raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+ if (!meta)
+ return NULL;
+
+ if (unlikely(!raw_spin_trylock_irqsave(&meta->lock, flags))) {
+ /*
+ * This is extremely unlikely -- we are reporting on a
+ * use-after-free, which locked meta->lock, and the reporting
+ * code via printk calls kmalloc() which ends up in
+ * kfence_alloc() and tries to grab the same object that we're
+ * reporting on. While it has never been observed, lockdep does
+ * report that there is a possibility of deadlock. Fix it by
+ * using trylock and bailing out gracefully.
+ */
+ raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+ /* Put the object back on the freelist. */
+ list_add_tail(&meta->list, &kfence_freelist);
+ raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+
+ return NULL;
+ }
+
+ meta->addr = metadata_to_pageaddr(meta);
+ /* Unprotect if we're reusing this page. */
+ if (meta->state == KFENCE_OBJECT_FREED)
+ kfence_unprotect(meta->addr);
+
+ /*
+ * Note: for allocations made before RNG initialization, prandom_u32_max()
+ * will always return zero. We still benefit from enabling KFENCE as early
+ * as possible, even when the RNG is not yet available, as this will allow
+ * KFENCE to detect bugs due to earlier allocations. The only downside
+ * is that the out-of-bounds accesses detected are deterministic for
+ * such allocations.
+ */
+ if (prandom_u32_max(2)) {
+ /* Allocate on the "right" side, re-calculate address. */
+ meta->addr += PAGE_SIZE - size;
+ meta->addr = ALIGN_DOWN(meta->addr, cache->align);
+ }
+
+ /* Update remaining metadata. */
+ metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED);
+ /* Pairs with READ_ONCE() in kfence_shutdown_cache(). */
+ WRITE_ONCE(meta->cache, cache);
+ meta->size = size;
+ for_each_canary(meta, set_canary_byte);
+ virt_to_page(meta->addr)->slab_cache = cache;
+
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+ /* Memory initialization. */
+
+ /*
+ * We check slab_want_init_on_alloc() ourselves, rather than letting
+ * SL*B do the initialization, as otherwise we might overwrite KFENCE's
+ * redzone.
+ */
+ addr = (void *)meta->addr;
+ if (unlikely(slab_want_init_on_alloc(gfp, cache)))
+ memzero_explicit(addr, size);
+ if (cache->ctor)
+ cache->ctor(addr);
+
+ if (CONFIG_KFENCE_STRESS_TEST_FAULTS && !prandom_u32_max(CONFIG_KFENCE_STRESS_TEST_FAULTS))
+ kfence_protect(meta->addr); /* Random "faults" by protecting the object. */
+
+ atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCATED]);
+ atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]);
+
+ return addr;
+}
+
+static void kfence_guarded_free(void *addr, struct kfence_metadata *meta)
+{
+ struct kcsan_scoped_access assert_page_exclusive;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+
+ if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
+ /* Invalid or double-free, bail out. */
+ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
+ kfence_report_error((unsigned long)addr, meta, KFENCE_ERROR_INVALID_FREE);
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+ return;
+ }
+
+ /* Detect racy use-after-free, or incorrect reallocation of this page by KFENCE. */
+ kcsan_begin_scoped_access((void *)ALIGN_DOWN((unsigned long)addr, PAGE_SIZE), PAGE_SIZE,
+ KCSAN_ACCESS_SCOPED | KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT,
+ &assert_page_exclusive);
+
+ if (CONFIG_KFENCE_STRESS_TEST_FAULTS)
+ kfence_unprotect((unsigned long)addr); /* To check canary bytes. */
+
+ /* Restore page protection if there was an OOB access. */
+ if (meta->unprotected_page) {
+ kfence_protect(meta->unprotected_page);
+ meta->unprotected_page = 0;
+ }
+
+ /* Check canary bytes for memory corruption. */
+ for_each_canary(meta, check_canary_byte);
+
+ /*
+ * Clear memory if init-on-free is set. While we protect the page, the
+ * data is still there, and after a use-after-free is detected, we
+ * unprotect the page, so the data is still accessible.
+ */
+ if (unlikely(slab_want_init_on_free(meta->cache)))
+ memzero_explicit(addr, meta->size);
+
+ /* Mark the object as freed. */
+ metadata_update_state(meta, KFENCE_OBJECT_FREED);
+
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+ /* Protect to detect use-after-frees. */
+ kfence_protect((unsigned long)addr);
+
+ /* Add it to the tail of the freelist for reuse. */
+ raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+ KFENCE_WARN_ON(!list_empty(&meta->list));
+ list_add_tail(&meta->list, &kfence_freelist);
+ kcsan_end_scoped_access(&assert_page_exclusive);
+ raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+
+ atomic_long_dec(&counters[KFENCE_COUNTER_ALLOCATED]);
+ atomic_long_inc(&counters[KFENCE_COUNTER_FREES]);
+}
+
+static void rcu_guarded_free(struct rcu_head *h)
+{
+ struct kfence_metadata *meta = container_of(h, struct kfence_metadata, rcu_head);
+
+ kfence_guarded_free((void *)meta->addr, meta);
+}
+
+static bool __init kfence_initialize_pool(void)
+{
+ unsigned long addr;
+ struct page *pages;
+ int i;
+
+ if (!arch_kfence_initialize_pool())
+ return false;
+
+ addr = (unsigned long)__kfence_pool;
+ pages = virt_to_page(addr);
+
+ /*
+ * Set up object pages: they must have PG_slab set, to avoid freeing
+ * these as real pages.

+{
+ debugfs_create_file("stats", 0444, kfence_dir, NULL, &stats_fops);
+ debugfs_create_file("objects", 0400, kfence_dir, NULL, &objects_fops);
+ return 0;
+}
+
+late_initcall(kfence_debugfs_init);
+
+/* === Allocation Gate Timer ================================================ */
+
+/*
+ * Set up delayed work, which will enable and disable the static key. We need to
+ * use a work queue (rather than a simple timer), since enabling and disabling a
+ * static key cannot be done from an interrupt.
+ */
+static struct delayed_work kfence_timer;
+static void toggle_allocation_gate(struct work_struct *work)
+{
+ if (!READ_ONCE(kfence_enabled))
+ return;
+
+ /* Enable static key, and await allocation to happen. */
+ atomic_set(&allocation_gate, 0);
+ static_branch_enable(&kfence_allocation_key);
+ wait_event(allocation_wait, atomic_read(&allocation_gate) != 0);
+
+ /* Disable static key and reset timer. */
+ static_branch_disable(&kfence_allocation_key);
+ schedule_delayed_work(&kfence_timer, msecs_to_jiffies(kfence_sample_interval));
+}
+static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);
+
+/* === Public interface ===================================================== */
+
+void __init kfence_init(void)
+{
+ /* Setting kfence_sample_interval to 0 on boot disables KFENCE. */
+ if (!kfence_sample_interval)
+ return;
+
+ if (!kfence_initialize_pool()) {
+ pr_err("%s failed\n", __func__);
+ return;
+ }
+
+ WRITE_ONCE(kfence_enabled, true);
+ schedule_delayed_work(&kfence_timer, 0);
+ pr_info("initialized - using %lu bytes for %d objects", KFENCE_POOL_SIZE,

+ const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ /*
+ * Read locklessly -- if there is a race with __kfence_alloc(), this is
+ * either a use-after-free or invalid access.
+ */
+ return meta ? meta->size : 0;
+}
+
+void *kfence_object_start(const void *addr)
+{
+ const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ /*
+ * Read locklessly -- if there is a race with __kfence_alloc(), this is
+ * either a use-after-free or invalid access.
+ */
+ return meta ? (void *)meta->addr : NULL;
+}
+
+void __kfence_free(void *addr)
+{
+ struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ /*
+ * If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
+ * the object, as the object page may be recycled for other-typed
+ * objects once it has been freed.
+ */
+ if (unlikely(meta->cache->flags & SLAB_TYPESAFE_BY_RCU))
+ call_rcu(&meta->rcu_head, rcu_guarded_free);
+ else
+ kfence_guarded_free(addr, meta);
+}
+
+bool kfence_handle_page_fault(unsigned long addr)
+{
+ const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE;
+ struct kfence_metadata *to_report = NULL;
+ enum kfence_error_type error_type;
+ unsigned long flags;
+
+ if (!is_kfence_address((void *)addr))
+ return false;
+
+ if (!READ_ONCE(kfence_enabled)) /* If disabled at runtime ... */
+ return kfence_unprotect(addr); /* ... unprotect and proceed. */
+
+ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
+
+ if (page_index % 2) {
+ /* This is a redzone, report a buffer overflow. */
+ struct kfence_metadata *meta;
+ int distance = 0;
+
+ meta = addr_to_metadata(addr - PAGE_SIZE);
+ if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
+ to_report = meta;
+ /* Data race ok; distance calculation approximate. */
+ distance = addr - data_race(meta->addr + meta->size);

new file mode 100644
index 000000000000..5905628d4faa
--- /dev/null
+++ b/mm/kfence/kfence.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef MM_KFENCE_KFENCE_H
+#define MM_KFENCE_KFENCE_H
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+
+#include "../slab.h" /* for struct kmem_cache */
+
+/* For non-debug builds, avoid leaking kernel pointers into dmesg. */
+#ifdef CONFIG_DEBUG_KERNEL
+#define PTR_FMT "%px"
+#else
+#define PTR_FMT "%p"
+#endif
+
+/*
+ * Get the canary byte pattern for @addr. Use a pattern that varies based on the
+ * lower 3 bits of the address, to detect memory corruptions with higher
+ * probability, where similar constants are used.
+ */
+#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))

+ * The size of the original allocation.
+ */
+ size_t size;

+
+void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta);
+
+#endif /* MM_KFENCE_KFENCE_H */
diff --git a/mm/kfence/report.c b/mm/kfence/report.c
new file mode 100644
index 000000000000..0375867e85b3
--- /dev/null
+++ b/mm/kfence/report.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdarg.h>
+
+#include <linux/kernel.h>
+#include <linux/lockdep.h>
+#include <linux/printk.h>
+#include <linux/seq_file.h>
+#include <linux/stacktrace.h>
+#include <linux/string.h>
+
+#include <asm/kfence.h>
+
+#include "kfence.h"
+
+/* Helper function to either print to a seq_file or to console. */
+__printf(2, 3)
+static void seq_con_printf(struct seq_file *seq, const char *fmt, ...)
+{
+ va_list args;
+
+ va_start(args, fmt);
+ if (seq)
+ seq_vprintf(seq, fmt, args);
+ else
+ vprintk(fmt, args);
+ va_end(args);
+}
+
+/*
+ * Get the number of stack entries to skip to get out of MM internals. @type
+ * is optional, and if set to NULL, assumes an allocation or free stack.
+ */
+static int get_stack_skipnr(const unsigned long stack_entries[], int num_entries,
+ const enum kfence_error_type *type)
+{
+ char buf[64];
+ int skipnr, fallback = 0;
+ bool is_access_fault = false;
+
+ if (type) {
+ /* Depending on error type, find different stack entries. */
+ switch (*type) {
+ case KFENCE_ERROR_UAF:
+ case KFENCE_ERROR_OOB:
+ case KFENCE_ERROR_INVALID:
+ is_access_fault = true;
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ case KFENCE_ERROR_INVALID_FREE:
+ break;
+ }
+ }
+
+ for (skipnr = 0; skipnr < num_entries; skipnr++) {
+ int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]);
+
+ if (is_access_fault) {
+ if (!strncmp(buf, KFENCE_SKIP_ARCH_FAULT_HANDLER, len))
+ goto found;
+ } else {
+ if (str_has_prefix(buf, "kfence_") || str_has_prefix(buf, "__kfence_"))
+ fallback = skipnr + 1; /* In case of tail calls into kfence. */
+
+ /* Also match the *_bulk() variants by only checking prefixes. */
+ if (str_has_prefix(buf, "kfree") ||
+ str_has_prefix(buf, "kmem_cache_free") ||
+ str_has_prefix(buf, "__kmalloc") ||
+ str_has_prefix(buf, "kmem_cache_alloc"))
+ goto found;
+ }
+ }
+ if (fallback < num_entries)
+ return fallback;
+found:
+ skipnr++;
+ return skipnr < num_entries ? skipnr : 0;
+}
+
+static void kfence_print_stack(struct seq_file *seq, const struct kfence_metadata *meta,
+ bool show_alloc)
+{
+ const unsigned long *entries = show_alloc ? meta->alloc_stack : meta->free_stack;
+ const int nentries = show_alloc ? meta->num_alloc_stack : meta->num_free_stack;
+
+ if (nentries) {
+ /* Skip allocation/free internals stack. */
+ int i = get_stack_skipnr(entries, nentries, NULL);
+
+ /* stack_trace_seq_print() does not exist; open code our own. */
+ for (; i < nentries; i++)
+ seq_con_printf(seq, " %pS\n", (void *)entries[i]);
+ } else {
+ seq_con_printf(seq, " no %s stack\n", show_alloc ? "allocation" : "deallocation");
+ }
+}
+
+void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta)
+{
+ const int size = abs(meta->size);
+ const unsigned long start = meta->addr;
+ const struct kmem_cache *const cache = meta->cache;
+
+ lockdep_assert_held(&meta->lock);
+
+ if (meta->state == KFENCE_OBJECT_UNUSED) {
+ seq_con_printf(seq, "kfence-#%zd unused\n", meta - kfence_metadata);
+ return;
+ }
+
+ seq_con_printf(seq,
+ "kfence-#%zd [0x" PTR_FMT "-0x" PTR_FMT
+ ", size=%d, cache=%s] allocated in:\n",
+ meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size,
+ (cache && cache->name) ? cache->name : "<destroyed>");
+ kfence_print_stack(seq, meta, true);
+
+ if (meta->state == KFENCE_OBJECT_FREED) {

+ seq_con_printf(seq, "\nfreed in:\n");

+	int skipnr = get_stack_skipnr(stack_entries, num_stack_entries, &type);
+	const ptrdiff_t object_index = meta ? meta - kfence_metadata : -1;
+
+	/* Require non-NULL meta, except if KFENCE_ERROR_INVALID. */
+	if (WARN_ON(type != KFENCE_ERROR_INVALID && !meta))
+		return;
+
+	if (meta)
+		lockdep_assert_held(&meta->lock);
+	/*
+	 * Because we may generate reports in printk-unfriendly parts of the
+	 * kernel, such as scheduler code, the use of printk() could deadlock.
+	 * Until such time that all printing code here is safe in all parts of
+	 * the kernel, accept the risk, and just get our message out (given the
+	 * system might already behave unpredictably due to the memory error).
+	 * As such, also disable lockdep to hide warnings, and avoid disabling
+	 * lockdep for the rest of the kernel.
+	 */
+	lockdep_off();
+
+	pr_err("==================================================================\n");
+	/* Print report header. */
+	switch (type) {
+	case KFENCE_ERROR_OOB:
+		pr_err("BUG: KFENCE: out-of-bounds in %pS\n\n", (void *)stack_entries[skipnr]);
+		pr_err("Out-of-bounds access at 0x" PTR_FMT " (%s of kfence-#%zd):\n",
+		       (void *)address, address < meta->addr ? "left" : "right", object_index);
+		break;
+	case KFENCE_ERROR_UAF:
+		pr_err("BUG: KFENCE: use-after-free in %pS\n\n", (void *)stack_entries[skipnr]);
+		pr_err("Use-after-free access at 0x" PTR_FMT " (in kfence-#%zd):\n",
+		       (void *)address, object_index);
+		break;
+	case KFENCE_ERROR_CORRUPTION:
+		pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]);
+		pr_err("Corrupted memory at 0x" PTR_FMT " ", (void *)address);
+		print_diff_canary((u8 *)address, 16);
+		pr_cont(" (in kfence-#%zd):\n", object_index);
+		break;
+	case KFENCE_ERROR_INVALID:
+		pr_err("BUG: KFENCE: invalid access in %pS\n\n", (void *)stack_entries[skipnr]);
+		pr_err("Invalid access at 0x" PTR_FMT ":\n", (void *)address);
+		break;
+	case KFENCE_ERROR_INVALID_FREE:
+		pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]);
+		pr_err("Invalid free of 0x" PTR_FMT " (in kfence-#%zd):\n", (void *)address,
+		       object_index);

2.28.0.618.gf4bc123cb7-goog
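
The frame-skipping logic of get_stack_skipnr() in the report code above can be tried out in user space. The following is only a sketch under stated assumptions: it takes symbol names directly instead of formatting addresses with "%ps", FAULT_HANDLER_NAME and stack_skipnr() are hypothetical names, and the fault handler is matched with an exact strcmp() rather than the length-limited strncmp() used in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for KFENCE_SKIP_ARCH_FAULT_HANDLER (x86 value). */
#define FAULT_HANDLER_NAME "asm_exc_page_fault"

static bool has_prefix(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Return the index of the first frame worth printing, or 0 if nothing
 * suitable was found. 'names' holds symbol names, innermost frame first.
 */
static int stack_skipnr(const char *const names[], int num_entries,
			bool is_access_fault)
{
	int skipnr, fallback = 0;

	for (skipnr = 0; skipnr < num_entries; skipnr++) {
		const char *name = names[skipnr];

		if (is_access_fault) {
			/* Frames up to the fault handler entry are noise. */
			if (strcmp(name, FAULT_HANDLER_NAME) == 0)
				goto found;
		} else {
			/* Remember the frame after the last detector-internal one. */
			if (has_prefix(name, "kfence_") || has_prefix(name, "__kfence_"))
				fallback = skipnr + 1;

			/* Prefix checks also cover the *_bulk() variants. */
			if (has_prefix(name, "kfree") ||
			    has_prefix(name, "kmem_cache_free") ||
			    has_prefix(name, "__kmalloc") ||
			    has_prefix(name, "kmem_cache_alloc"))
				goto found;
		}
	}
	if (fallback < num_entries)
		return fallback;
found:
	skipnr++;
	return skipnr < num_entries ? skipnr : 0;
}
```

For an access fault, the report then starts at the frame directly after the fault handler entry. For allocation and free stacks, it starts at the caller of the allocator entry point.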

Marco Elver

Sep 15, 2020, 9:21:04 AM

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

From: Alexander Potapenko <gli...@google.com>

Add architecture specific implementation details for KFENCE and enable

KFENCE for the x86 architecture. In particular, this implements the
required interface in <asm/kfence.h> for setting up the pool and
providing helper functions for protecting and unprotecting pages.

For x86, we need to ensure that the pool uses 4K pages, which is done
using the set_memory_4k() helper function.

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

arch/x86/Kconfig | 2 ++
arch/x86/include/asm/kfence.h | 60 +++++++++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 4 +++
3 files changed, 66 insertions(+)
create mode 100644 arch/x86/include/asm/kfence.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..e22dc722698c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -144,6 +144,8 @@ config X86
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if X86_64
select HAVE_ARCH_KASAN_VMALLOC if X86_64
+ select HAVE_ARCH_KFENCE
+ select HAVE_ARCH_KFENCE_STATIC_POOL
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
new file mode 100644
index 000000000000..cf09e377faf9
--- /dev/null
+++ b/arch/x86/include/asm/kfence.h
@@ -0,0 +1,60 @@

+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _ASM_X86_KFENCE_H
+#define _ASM_X86_KFENCE_H
+
+#include <linux/bug.h>
+#include <linux/kfence.h>
+
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+#include <asm/set_memory.h>
+#include <asm/tlbflush.h>
+
+/* The alignment should be at least a 4K page. */
+#define KFENCE_POOL_ALIGNMENT PAGE_SIZE
+
+/*
+ * The page fault handler entry function, up to which the stack trace is
+ * truncated in reports.
+ */
+#define KFENCE_SKIP_ARCH_FAULT_HANDLER "asm_exc_page_fault"
+
+/* Force 4K pages for __kfence_pool. */
+static inline bool arch_kfence_initialize_pool(void)
+{
+	unsigned long addr;
+
+	for (addr = (unsigned long)__kfence_pool; is_kfence_address((void *)addr);
+	     addr += PAGE_SIZE) {
+		unsigned int level;
+
+		if (!lookup_address(addr, &level))
+			return false;
+
+		if (level != PG_LEVEL_4K)
+			set_memory_4k(addr, 1);
+	}
+
+	return true;
+}
+
+/* Protect the given page and flush TLBs. */
+static inline bool kfence_protect_page(unsigned long addr, bool protect)
+{
+	unsigned int level;
+	pte_t *pte = lookup_address(addr, &level);
+
+	if (!pte || level != PG_LEVEL_4K)
+		return false;
+
+	if (protect)
+		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
+	else
+		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
+
+	flush_tlb_one_kernel(addr);
+	return true;
+}
+
+#endif /* _ASM_X86_KFENCE_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6e3e8a124903..423e15ad5eb6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -9,6 +9,7 @@
#include <linux/kdebug.h> /* oops_begin/end, ... */
#include <linux/extable.h> /* search_exception_tables */
#include <linux/memblock.h> /* max_low_pfn */
+#include <linux/kfence.h> /* kfence_handle_page_fault */
#include <linux/kprobes.h> /* NOKPROBE_SYMBOL, ... */
#include <linux/mmiotrace.h> /* kmmio_handler, ... */
#include <linux/perf_event.h> /* perf_sw_event */
@@ -701,6 +702,9 @@ no_context(struct pt_regs *regs, unsigned long error_code,
}
#endif

+ if (kfence_handle_page_fault(address))
+ return;
+
/*
* 32-bit:
*
--
2.28.0.618.gf4bc123cb7-goog
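
The core of kfence_protect_page() on x86 is a single-bit update of the page-table entry: clearing the Present bit makes any access to the page fault, and setting it again restores access. A minimal user-space sketch of just that bit manipulation follows. SKETCH_PAGE_PRESENT and sketch_protect_pte() are hypothetical names used for illustration, and the sketch deliberately omits what only the kernel can do (writing the live PTE with set_pte() and flushing the TLB entry).

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of an x86 page-table entry is the Present bit. */
#define SKETCH_PAGE_PRESENT ((uint64_t)1 << 0)

/*
 * Return the PTE value with the Present bit cleared (protect == true)
 * or set (protect == false). All other bits are left untouched.
 */
static uint64_t sketch_protect_pte(uint64_t pte, bool protect)
{
	if (protect)
		return pte & ~SKETCH_PAGE_PRESENT;
	return pte | SKETCH_PAGE_PRESENT;
}
```

Because only one bit changes, protecting and unprotecting are exact inverses for a page that was present to begin with, which is what allows a guarded page to be made accessible again after a report.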

Marco Elver

Sep 15, 2020, 9:21:07 AM

Add architecture specific implementation details for KFENCE and enable

KFENCE for the arm64 architecture. In particular, this implements the
required interface in <asm/kfence.h>. Currently, the arm64 version does
not yet use a statically allocated memory pool, at the cost of a pointer
load for each is_kfence_address().

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---

For ARM64, we would like to solicit feedback on what the best option is
to obtain a constant address for __kfence_pool. One option is to declare
a memory range in the memory layout to be dedicated to KFENCE (as is
done for KASAN); however, it is unclear whether this is the best
available option. We would like to avoid touching the memory layout.

---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/kfence.h | 39 +++++++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 4 ++++
3 files changed, 44 insertions(+)
create mode 100644 arch/arm64/include/asm/kfence.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..1acc6b2877c3 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -132,6 +132,7 @@ config ARM64
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN
+ select HAVE_ARCH_KFENCE if (!ARM64_16K_PAGES && !ARM64_64K_PAGES)
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
new file mode 100644
index 000000000000..608dde80e5ca
--- /dev/null
+++ b/arch/arm64/include/asm/kfence.h
@@ -0,0 +1,39 @@

+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_KFENCE_H
+#define __ASM_KFENCE_H
+
+#include <linux/kfence.h>
+#include <linux/log2.h>
+#include <linux/mm.h>
+
+#include <asm/cacheflush.h>
+
+#define KFENCE_SKIP_ARCH_FAULT_HANDLER "el1_sync"
+
+/*
+ * FIXME: Support HAVE_ARCH_KFENCE_STATIC_POOL: Use the statically allocated
+ * __kfence_pool, to avoid the extra pointer load for is_kfence_address(). By
+ * default, however, we do not have struct pages for static allocations.
+ */
+
+static inline bool arch_kfence_initialize_pool(void)
+{
+	const unsigned int num_pages = ilog2(roundup_pow_of_two(KFENCE_POOL_SIZE / PAGE_SIZE));
+	struct page *pages = alloc_pages(GFP_KERNEL, num_pages);
+
+	if (!pages)
+		return false;
+
+	__kfence_pool = page_address(pages);
+	return true;
+}
+
+static inline bool kfence_protect_page(unsigned long addr, bool protect)
+{
+	set_memory_valid(addr, 1, !protect);
+
+	return true;
+}
+
+#endif /* __ASM_KFENCE_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f07333e86c2f..d5b72ecbeeea 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -10,6 +10,7 @@
#include <linux/acpi.h>
#include <linux/bitfield.h>
#include <linux/extable.h>
+#include <linux/kfence.h>
#include <linux/signal.h>
#include <linux/mm.h>
#include <linux/hardirq.h>
@@ -310,6 +311,9 @@ static void __do_kernel_fault(unsigned long addr, unsigned int esr,
"Ignoring spurious kernel translation fault at virtual address %016lx\n", addr))
return;

+ if (kfence_handle_page_fault(addr))
+ return;
+
if (is_el1_permission_fault(addr, esr, regs)) {
if (esr & ESR_ELx_WNR)
msg = "write to read-only memory";
--
2.28.0.618.gf4bc123cb7-goog
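
Note that the value computed in arch_kfence_initialize_pool() above is the allocation order handed to alloc_pages(), that is, the base-2 logarithm of the page count rounded up to a power of two, even though the variable is called num_pages. A small user-space sketch of the same computation, with the hypothetical helper name sketch_order_for_pages() and a plain loop in place of ilog2(roundup_pow_of_two()):

```c
#include <stddef.h>

/*
 * Smallest order such that (1 << order) pages cover nr_pages.
 * Equivalent to ilog2(roundup_pow_of_two(nr_pages)) for nr_pages >= 1.
 */
static unsigned int sketch_order_for_pages(unsigned long nr_pages)
{
	unsigned int order = 0;

	while ((1UL << order) < nr_pages)
		order++;
	return order;
}
```

Assuming the pool size follows the formula in the documentation patch of this series (255 objects, 4 KiB pages, 2 MiB in total), the pool spans 512 pages and the resulting order is 9.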

Marco Elver

Sep 15, 2020, 9:21:10 AM

From: Alexander Potapenko <gli...@google.com>

Inserts KFENCE hooks into the SLAB allocator.

We note the addition of the 'orig_size' argument to slab_alloc*()
functions, to be able to pass the originally requested size to KFENCE.
When KFENCE is disabled, there is no additional overhead, since these
functions are __always_inline.

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

+ kmemleak_free_recursive(objp, cachep->flags);
+ return;
+ }
+

/* Put the object into the quarantine, don't touch it for now. */
if (kasan_slab_free(cachep, objp, _RET_IP_))
return;
@@ -3479,7 +3495,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
*/
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
- void *ret = slab_alloc(cachep, flags, _RET_IP_);
+ void *ret = slab_alloc(cachep, flags, cachep->object_size, _RET_IP_);

2.28.0.618.gf4bc123cb7-goog

Marco Elver

Sep 15, 2020, 9:21:11 AM

From: Alexander Potapenko <gli...@google.com>

Inserts KFENCE hooks into the SLUB allocator.

We note the addition of the 'orig_size' argument to slab_alloc*()
functions, to be able to pass the originally requested size to KFENCE.
When KFENCE is disabled, there is no additional overhead, since these
functions are __always_inline.

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

mm/slub.c | 72 ++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 53 insertions(+), 19 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d4177aecedf6..5c5a13a7857c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -27,6 +27,7 @@
#include <linux/ctype.h>
#include <linux/debugobjects.h>
#include <linux/kallsyms.h>
+#include <linux/kfence.h>
#include <linux/memory.h>
#include <linux/math64.h>
#include <linux/fault-inject.h>
@@ -1557,6 +1558,11 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
void *old_tail = *tail ? *tail : *head;
int rsize;

+ if (is_kfence_address(next)) {
+ slab_free_hook(s, next);
+ return true;
+ }
+
/* Head and tail of the reconstructed freelist */
*head = NULL;
*tail = NULL;
@@ -2660,7 +2666,8 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
* already disabled (which is the case for bulk allocation).
*/
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c)
+ unsigned long addr, struct kmem_cache_cpu *c,
+ size_t orig_size)
{
void *freelist;
struct page *page;
@@ -2763,7 +2770,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* cpu changes by refetching the per cpu area pointer.
*/
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c)
+ unsigned long addr, struct kmem_cache_cpu *c,
+ size_t orig_size)
{
void *p;
unsigned long flags;
@@ -2778,7 +2786,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c = this_cpu_ptr(s->cpu_slab);
#endif

- p = ___slab_alloc(s, gfpflags, node, addr, c);
+ p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
local_irq_restore(flags);
return p;
}
@@ -2805,7 +2813,7 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
* Otherwise we can simply pick the next object from the lockless free list.
*/
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
- gfp_t gfpflags, int node, unsigned long addr)
+ gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
{
void *object;
struct kmem_cache_cpu *c;
@@ -2816,6 +2824,11 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags);
if (!s)
return NULL;
+
+ object = kfence_alloc(s, orig_size, gfpflags);
+ if (unlikely(object))
+ goto out;
+
redo:
/*
* Must read kmem_cache cpu data via this cpu ptr. Preemption is
@@ -2853,7 +2866,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
object = c->freelist;
page = c->page;
if (unlikely(!object || !node_match(page, node))) {
- object = __slab_alloc(s, gfpflags, node, addr, c);
+ object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
stat(s, ALLOC_SLOWPATH);
} else {
void *next_object = get_freepointer_safe(s, object);
@@ -2889,20 +2902,21 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
memset(object, 0, s->object_size);

+out:
slab_post_alloc_hook(s, objcg, gfpflags, 1, &object);

return object;
}

static __always_inline void *slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, unsigned long addr)
+ gfp_t gfpflags, unsigned long addr, size_t orig_size)
{
- return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr);
+ return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr, orig_size);
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
- void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+ void *ret = slab_alloc(s, gfpflags, _RET_IP_, s->object_size);

trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size,
s->size, gfpflags);
@@ -2914,7 +2928,7 @@ EXPORT_SYMBOL(kmem_cache_alloc);
#ifdef CONFIG_TRACING
void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
- void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+ void *ret = slab_alloc(s, gfpflags, _RET_IP_, size);
trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags);
ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
@@ -2925,7 +2939,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_trace);
#ifdef CONFIG_NUMA
void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
{
- void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_);
+ void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_, s->object_size);

trace_kmem_cache_alloc_node(_RET_IP_, ret,
s->object_size, s->size, gfpflags, node);
@@ -2939,7 +2953,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
gfp_t gfpflags,
int node, size_t size)
{
- void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_);
+ void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_, size);

trace_kmalloc_node(_RET_IP_, ret,
size, s->size, gfpflags, node);
@@ -2973,6 +2987,9 @@ static void __slab_free(struct kmem_cache *s, struct page *page,

stat(s, FREE_SLOWPATH);

+ if (kfence_free(head))
+ return;
+
if (kmem_cache_debug(s) &&
!free_debug_processing(s, page, head, tail, cnt, addr))
return;
@@ -3216,6 +3233,13 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
df->s = cache_from_obj(s, object); /* Support for memcg */
}

+ if (is_kfence_address(object)) {
+ slab_free_hook(df->s, object);
+ WARN_ON(!kfence_free(object));
+ p[size] = NULL; /* mark object processed */
+ return size;
+ }
+
/* Start new detached freelist */
df->page = page;
set_freepointer(df->s, object, NULL);
@@ -3290,8 +3314,14 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);

for (i = 0; i < size; i++) {

- void *object = c->freelist;
+ void *object = kfence_alloc(s, s->object_size, flags);

+ if (unlikely(object)) {
+ p[i] = object;
+ continue;
+ }
+
+ object = c->freelist;
if (unlikely(!object)) {
/*
* We may have removed an object from c->freelist using
@@ -3307,7 +3337,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* of re-populating per CPU c->freelist
*/
p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
- _RET_IP_, c);
+ _RET_IP_, c, s->object_size);
if (unlikely(!p[i]))
goto error;

@@ -3962,7 +3992,7 @@ void *__kmalloc(size_t size, gfp_t flags)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc(s, flags, _RET_IP_);
+ ret = slab_alloc(s, flags, _RET_IP_, size);

trace_kmalloc(_RET_IP_, ret, size, s->size, flags);

@@ -4010,7 +4040,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc_node(s, flags, node, _RET_IP_);
+ ret = slab_alloc_node(s, flags, node, _RET_IP_, size);

trace_kmalloc_node(_RET_IP_, ret, size, s->size, flags, node);

@@ -4036,6 +4066,7 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
struct kmem_cache *s;
unsigned int offset;
size_t object_size;

+ bool is_kfence = is_kfence_address(ptr);

ptr = kasan_reset_tag(ptr);

@@ -4048,10 +4079,13 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
to_user, 0, n);

/* Find offset within object. */
- offset = (ptr - page_address(page)) % s->size;
+ if (is_kfence)
+ offset = ptr - kfence_object_start(ptr);
+ else
+ offset = (ptr - page_address(page)) % s->size;

/* Adjust for redzone and reject if within the redzone. */
- if (kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
+ if (!is_kfence && kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
if (offset < s->red_left_pad)
usercopy_abort("SLUB object in left red zone",
s->name, to_user, offset, n);
@@ -4460,7 +4494,7 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc(s, gfpflags, caller);
+ ret = slab_alloc(s, gfpflags, caller, size);

/* Honor the call site pointer we received. */
trace_kmalloc(caller, ret, size, s->size, gfpflags);
@@ -4491,7 +4525,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc_node(s, gfpflags, node, caller);
+ ret = slab_alloc_node(s, gfpflags, node, caller, size);

/* Honor the call site pointer we received. */
trace_kmalloc_node(caller, ret, size, s->size, gfpflags, node);
--
2.28.0.618.gf4bc123cb7-goog
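
The hook pattern used throughout the SLUB patch above is: on allocation, ask KFENCE first and fall through to the regular path when it declines (the common case); on free, check whether the pointer belongs to the KFENCE pool before touching the regular freelists. A toy user-space sketch of that pattern follows. All sketch_* names are hypothetical, the pool holds a single object for brevity, malloc()/free() stand in for the regular slab paths, and a plain flag stands in for the sample-interval gate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_POOL_SIZE 4096

static char sketch_pool[SKETCH_POOL_SIZE];	/* stand-in for __kfence_pool */
static bool sketch_pool_busy;			/* single slot, for brevity */
static bool sketch_gate_open;			/* set when a sample is due */

/* Range check playing the role of is_kfence_address(). */
static bool sketch_is_pool_address(const void *p)
{
	uintptr_t a = (uintptr_t)p, start = (uintptr_t)sketch_pool;

	return a >= start && a - start < SKETCH_POOL_SIZE;
}

/* Returns a pool object if a sample is due, else NULL (the common case). */
static void *sketch_guarded_alloc(size_t orig_size)
{
	if (!sketch_gate_open || sketch_pool_busy ||
	    orig_size == 0 || orig_size > SKETCH_POOL_SIZE)
		return NULL;
	sketch_gate_open = false;
	sketch_pool_busy = true;
	/* Place the object flush against the right edge of the slot. */
	return sketch_pool + SKETCH_POOL_SIZE - orig_size;
}

/* Returns true if the object belonged to the pool and was released. */
static bool sketch_guarded_free(void *p)
{
	if (!sketch_is_pool_address(p))
		return false;
	sketch_pool_busy = false;
	return true;
}

static void *sketch_alloc(size_t size)
{
	void *object = sketch_guarded_alloc(size);

	if (object)
		return object;	/* sampled: skip the regular path */
	return malloc(size);
}

static void sketch_free(void *p)
{
	if (sketch_guarded_free(p))
		return;		/* pool objects never reach the regular path */
	free(p);
}
```

The sketch also suggests why the originally requested size has to be passed down: placing the object against a slot boundary needs the real size, not the size class of the cache that would otherwise have served the request.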

Marco Elver

Sep 15, 2020, 9:21:15 AM

From: Alexander Potapenko <gli...@google.com>

We make KFENCE compatible with KASAN for testing KFENCE itself. In
particular, KASAN helps to catch any potential corruptions to KFENCE
state, or other corruptions that may be a result of freepointer
corruptions in the main allocators.

To indicate that the combination of the two is generally discouraged,
CONFIG_EXPERT=y should be set. It also gives us the nice property that
KFENCE will be build-tested by allyesconfig builds.

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

lib/Kconfig.kfence | 2 +-
mm/kasan/common.c | 7 +++++++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence

index 6a90fef41832..872bcbdd8cc4 100644

--- a/lib/Kconfig.kfence
+++ b/lib/Kconfig.kfence
@@ -10,7 +10,7 @@ config HAVE_ARCH_KFENCE_STATIC_POOL

menuconfig KFENCE
bool "KFENCE: low-overhead sampling-based memory safety error detector"
- depends on HAVE_ARCH_KFENCE && !KASAN && (SLAB || SLUB)
+ depends on HAVE_ARCH_KFENCE && (!KASAN || EXPERT) && (SLAB || SLUB)
depends on JUMP_LABEL # To ensure performance, require jump labels
select STACKTRACE
help
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 950fd372a07e..f5c49f0fdeff 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -18,6 +18,7 @@
#include <linux/init.h>
#include <linux/kasan.h>
#include <linux/kernel.h>
+#include <linux/kfence.h>
#include <linux/kmemleak.h>
#include <linux/linkage.h>
#include <linux/memblock.h>
@@ -396,6 +397,9 @@ static bool __kasan_slab_free(struct kmem_cache *cache, void *object,
tagged_object = object;
object = reset_tag(object);

+ if (is_kfence_address(object))
+ return false;
+
if (unlikely(nearest_obj(cache, virt_to_head_page(object), object) !=
object)) {
kasan_report_invalid_free(tagged_object, ip);
@@ -444,6 +448,9 @@ static void *__kasan_kmalloc(struct kmem_cache *cache, const void *object,
if (unlikely(object == NULL))
return NULL;

+ if (is_kfence_address(object))

+ return (void *)object;
+
redzone_start = round_up((unsigned long)(object + size),
KASAN_SHADOW_SCALE_SIZE);
redzone_end = round_up((unsigned long)object + cache->object_size,
--

2.28.0.618.gf4bc123cb7-goog

Marco Elver

Sep 15, 2020, 9:21:17 AM

From: Alexander Potapenko <gli...@google.com>

Add compatibility with KMEMLEAK, by making KMEMLEAK aware of the KFENCE
memory pool. This allows building debug kernels with both enabled, which
also helped in debugging KFENCE.

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

v2:
* Rework using delete_object_part() [suggested by Catalin Marinas].
---
mm/kmemleak.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5e252d91eb14..feff16068e8e 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -97,6 +97,7 @@
#include <linux/atomic.h>

#include <linux/kasan.h>
+#include <linux/kfence.h>
#include <linux/kmemleak.h>
#include <linux/memory_hotplug.h>

@@ -1948,6 +1949,11 @@ void __init kmemleak_init(void)
KMEMLEAK_GREY, GFP_ATOMIC);
create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
KMEMLEAK_GREY, GFP_ATOMIC);
+#if defined(CONFIG_KFENCE) && defined(CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL)
+ /* KFENCE objects are located in .bss, which may confuse kmemleak. Skip them. */
+ delete_object_part((unsigned long)__kfence_pool, KFENCE_POOL_SIZE);
+#endif
+
/* only register .data..ro_after_init if not within .data */
if (&__start_ro_after_init < &_sdata || &__end_ro_after_init > &_edata)
create_object((unsigned long)__start_ro_after_init,
--
2.28.0.618.gf4bc123cb7-goog

Marco Elver

Sep 15, 2020, 9:21:19 AM

Lockdep checks that dynamic key registration is only performed on keys
that are not static objects. With KFENCE, it is possible that such a
dynamically allocated key is a KFENCE object which may, however, be
allocated from a static memory pool (if HAVE_ARCH_KFENCE_STATIC_POOL).

Therefore, ignore KFENCE-allocated objects in static_obj().

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---

kernel/locking/lockdep.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 54b74fabf40c..0cf5d5ecbd31 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -38,6 +38,7 @@
#include <linux/seq_file.h>
#include <linux/spinlock.h>
#include <linux/kallsyms.h>
+#include <linux/kfence.h>
#include <linux/interrupt.h>
#include <linux/stacktrace.h>
#include <linux/debug_locks.h>
@@ -755,6 +756,13 @@ static int static_obj(const void *obj)
if (arch_is_kernel_initmem_freed(addr))
return 0;

+ /*
+ * KFENCE objects may be allocated from a static memory pool, but are
+ * not actually static objects.
+ */
+ if (is_kfence_address(obj))
+ return 0;
+
/*
* static variable?
*/
--
2.28.0.618.gf4bc123cb7-goog
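
The reasoning behind the static_obj() change above can be illustrated in user space: an address-range test for "static" storage gives the wrong answer for objects handed out from a pool that itself lives in static memory, so the pool range has to be excluded first. The following is a sketch only. sketch_bss, sketch_pool and the sketch_* helpers are hypothetical stand-ins, not the kernel's section symbols or its is_kfence_address().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_POOL_SIZE 8192

/* Stand-ins: a "static" region that happens to contain the pool. */
static char sketch_bss[4 * SKETCH_POOL_SIZE];
static char *const sketch_pool = sketch_bss + SKETCH_POOL_SIZE;

static bool sketch_in_range(const void *p, const char *start, size_t len)
{
	uintptr_t a = (uintptr_t)p, s = (uintptr_t)start;

	return a >= s && a - s < len;
}

static bool sketch_is_pool_address(const void *p)
{
	return sketch_in_range(p, sketch_pool, SKETCH_POOL_SIZE);
}

/* Returns 1 for objects with static storage, 0 otherwise. */
static int sketch_static_obj(const void *obj)
{
	/* Pool objects sit in static memory but are allocated dynamically. */
	if (sketch_is_pool_address(obj))
		return 0;
	return sketch_in_range(obj, sketch_bss, sizeof(sketch_bss));
}
```

Without the early return, a lock key embedded in a pool-allocated object would be classified as static.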

Marco Elver

Sep 15, 2020, 9:21:21 AM

Add KFENCE documentation in dev-tools/kfence.rst, and add to index.

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---

v2:
* Many clarifications based on comments from Andrey Konovalov.
* Document CONFIG_KFENCE_SAMPLE_INTERVAL=0 usage.
* Make use-cases between KASAN and KFENCE clearer.
* Be clearer about the fact the pool is fixed size.
* Update based on reporting changes.
* Explicitly mention max supported allocation size is PAGE_SIZE.
---
Documentation/dev-tools/index.rst | 1 +
Documentation/dev-tools/kfence.rst | 291 +++++++++++++++++++++++++++++
2 files changed, 292 insertions(+)

create mode 100644 Documentation/dev-tools/kfence.rst

diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index f7809c7b1ba9..1b1cf4f5c9d9 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -22,6 +22,7 @@ whole; patches welcome!
ubsan
kmemleak
kcsan
+ kfence
gdb-kernel-debugging
kgdb
kselftest
diff --git a/Documentation/dev-tools/kfence.rst b/Documentation/dev-tools/kfence.rst

new file mode 100644
index 000000000000..efe86b1b1074
--- /dev/null
+++ b/Documentation/dev-tools/kfence.rst
@@ -0,0 +1,291 @@

+.. SPDX-License-Identifier: GPL-2.0
+
+Kernel Electric-Fence (KFENCE)
+==============================
+
+Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety
+error detector. KFENCE detects heap out-of-bounds access, use-after-free, and
+invalid-free errors.
+
+KFENCE is designed to be enabled in production kernels, and has near zero
+performance overhead. Compared to KASAN, KFENCE trades precision for
+performance. The main motivation behind KFENCE's design is that, with enough
+total uptime, KFENCE will detect bugs in code paths not typically exercised by
+non-production test workloads. One way to quickly achieve a large enough total
+uptime is to deploy the tool across a large fleet of machines.
+
+Usage
+-----
+
+To enable KFENCE, configure the kernel with::
+
+ CONFIG_KFENCE=y
+
+To build a kernel with KFENCE support, but disabled by default (to enable, set
+``kfence.sample_interval`` to a non-zero value), configure the kernel with::
+
+ CONFIG_KFENCE=y
+ CONFIG_KFENCE_SAMPLE_INTERVAL=0
+
+KFENCE provides several other configuration options to customize behaviour (see
+the respective help text in ``lib/Kconfig.kfence`` for more info).
+
+Tuning performance
+~~~~~~~~~~~~~~~~~~
+
+The most important parameter is KFENCE's sample interval, which can be set via
+the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
+sample interval determines the frequency with which heap allocations will be
+guarded by KFENCE. The default is configurable via the Kconfig option
+``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
+disables KFENCE.
+
+The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
+further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
+255), the number of available guarded objects can be controlled. Each object
+requires 2 pages, one for the object itself and the other one used as a guard
+page; object pages are interleaved with guard pages, and every object page is
+therefore surrounded by two guard pages.
+
+The total memory dedicated to the KFENCE memory pool can be computed as::
+
+ ( #objects + 1 ) * 2 * PAGE_SIZE
+
+Using the default config, and assuming a page size of 4 KiB, results in
+dedicating 2 MiB to the KFENCE memory pool.
+
+Error reports
+~~~~~~~~~~~~~
+
+A typical out-of-bounds access looks like this::
+
+ ==================================================================
+ BUG: KFENCE: out-of-bounds in test_out_of_bounds_read+0xa3/0x22b
+
+ Out-of-bounds access at 0xffffffffb672efff (left of kfence-#17):
+ test_out_of_bounds_read+0xa3/0x22b
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated in:
+ test_alloc+0xf3/0x25b
+ test_out_of_bounds_read+0x98/0x22b
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+The header of the report provides a short summary of the function involved in
+the access. It is followed by more detailed information about the access and
+its origin. Note that real kernel addresses are only shown for
+``CONFIG_DEBUG_KERNEL=y`` builds.
+
+Use-after-free accesses are reported as::
+
+ ==================================================================
+ BUG: KFENCE: use-after-free in test_use_after_free_read+0xb3/0x143
+
+ Use-after-free access at 0xffffffffb673dfe0 (in kfence-#24):
+ test_use_after_free_read+0xb3/0x143
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated in:
+ test_alloc+0xf3/0x25b
+ test_use_after_free_read+0x76/0x143
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ freed in:
+ test_use_after_free_read+0xa8/0x143
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+KFENCE also reports on invalid frees, such as double-frees::
+
+ ==================================================================
+ BUG: KFENCE: invalid free in test_double_free+0xdc/0x171
+
+ Invalid free of 0xffffffffb6741000:
+ test_double_free+0xdc/0x171
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated in:
+ test_alloc+0xf3/0x25b
+ test_double_free+0x76/0x171
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ freed in:
+ test_double_free+0xa8/0x171
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+KFENCE also uses pattern-based redzones on the other side of an object's guard
+page, to detect out-of-bounds writes on the unprotected side of the object.
+These are reported on frees::
+
+ ==================================================================
+ BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184
+
+ Corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . ] (in kfence-#69):
+ test_kmalloc_aligned_oob_write+0xef/0x184
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated in:
+ test_alloc+0xf3/0x25b
+ test_kmalloc_aligned_oob_write+0x57/0x184
+ kunit_try_run_case+0x51/0x85
+ kunit_generic_run_threadfn_adapter+0x16/0x30
+ kthread+0x137/0x160
+ ret_from_fork+0x22/0x30
+
+ CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+ ==================================================================
+
+For such errors, the address where the corruption occurred as well as the
+invalidly written bytes (offset from the address) are shown; '.' denotes
+untouched bytes. In the example above, ``0xac`` is the value written to the
+invalid address at offset 0, and the remaining '.' denote that no following
+bytes have been touched. Note that real values are only shown for
+``CONFIG_DEBUG_KERNEL=y`` builds; to avoid information disclosure in non-debug
+builds, '!' is used instead to denote invalidly written bytes.

+
+Guarded allocations are set up based on the sample interval. After expiration
+of the sample interval, the next allocation through the main allocator (SLAB or
+SLUB) returns a guarded allocation from the KFENCE object pool (allocation
+sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
+the next allocation is set up after the expiration of the interval. To "gate" a
+KFENCE allocation through the main allocator's fast-path without overhead,
+KFENCE relies on static branches via the static keys infrastructure. The static
+branch is toggled to redirect the allocation to KFENCE.
+
+KFENCE objects each reside on a dedicated page, at either the left or right
+page boundaries selected at random. The pages to the left and right of the
+object page are "guard pages", whose attributes are changed to a protected
+state, and cause page faults on any attempted access. Such page faults are then
+intercepted by KFENCE, which handles the fault gracefully by reporting an
+out-of-bounds access.
+
+To detect out-of-bounds writes to memory within the object's page itself,
+KFENCE also uses pattern-based redzones. For each object page, a redzone is set
+up for all non-object memory. For typical alignments, the redzone is only
+required on the unguarded side of an object. Because KFENCE must honor the
+cache's requested alignment, special alignments may result in unprotected gaps
+on either side of an object, all of which are redzoned.
+
+It is worth highlighting that KASAN and KFENCE are complementary, with
+different target environments. For instance, KASAN is the better debugging aid
+where test cases or reproducers exist: because KFENCE has a lower chance of
+detecting the error, debugging with it would require more effort. Deployments
+at scale that cannot afford to enable KASAN, however, would benefit from using
+KFENCE to discover bugs in code paths not exercised by test cases or fuzzers.
--
2.28.0.618.gf4bc123cb7-goog
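
A note on the memory corruption example in the documentation above: the
numbers in that report are consistent with placing the object as far right
in its page as the cache alignment allows, and redzoning whatever gap is
left before the guard page. The sketch below is userspace C, not the KFENCE
implementation. The helper names, the 4 KiB page size and the align-down
placement rule are assumptions made for illustration; the alignment of 8
bytes for a 73-byte kmalloc object under SLUB is taken from the test suite
comment in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Round addr down to a multiple of align; align must be a power of two. */
static inline uint64_t align_down_sketch(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/* Start of an object pushed as far right in its page as align allows. */
static inline uint64_t right_start_sketch(uint64_t page, size_t size,
					  uint64_t align)
{
	assert(size > 0 && size <= SKETCH_PAGE_SIZE);
	return align_down_sketch(page + SKETCH_PAGE_SIZE - size, align);
}

/* Bytes between the object's end and the right guard page (redzoned gap). */
static inline size_t right_gap_sketch(uint64_t page, size_t size,
				      uint64_t align)
{
	uint64_t end = right_start_sketch(page, size, align) + size;

	return (size_t)(page + SKETCH_PAGE_SIZE - end);
}
```

For the page at 0xffffffffb6797000 with size=73 and align=8, the sketch
yields a start address of 0xffffffffb6797fb0 and a 7-byte gap, matching
kfence-#69 and the seven entries in "[ 0xac . . . . . . ]". A 32-byte object
with the same alignment leaves no gap, so an overflow past it faults on the
guard page immediately.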

Marco Elver

Sep 15, 2020, 9:21:23 AM9/15/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Add KFENCE test suite, testing various error detection scenarios. Makes
use of KUnit for test organization. Since KFENCE's interface to obtain
error reports is via the console, the test verifies that KFENCE outputs
expected reports to the console.

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---
v2:

* Update for shortened memory corruption report.
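
Note on test_alloc(): the retry loop keeps an allocation only if it fits the
requested policy, and it tells "left" from "right" objects purely by page
alignment of the returned address. The decision can be sketched in userspace
as below. policy_accepts_sketch() and the is_kfence flag (standing in for
is_kfence_address()) are illustrative names, and a 4 KiB page size is
assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

enum allocation_policy {
	ALLOCATE_ANY,   /* KFENCE, any side. */
	ALLOCATE_LEFT,  /* KFENCE, left side of page. */
	ALLOCATE_RIGHT, /* KFENCE, right side of page. */
	ALLOCATE_NONE,  /* No KFENCE allocation. */
};

/*
 * Returns true if an allocation at addr satisfies the policy. is_kfence
 * stands in for the result of is_kfence_address() in the kernel.
 */
static inline bool policy_accepts_sketch(enum allocation_policy policy,
					 bool is_kfence, uint64_t addr)
{
	bool page_aligned = (addr % SKETCH_PAGE_SIZE) == 0;

	if (!is_kfence)
		return policy == ALLOCATE_NONE;

	switch (policy) {
	case ALLOCATE_ANY:
		return true;
	case ALLOCATE_LEFT:
		return page_aligned;
	case ALLOCATE_RIGHT:
		return !page_aligned;
	default:
		return false; /* ALLOCATE_NONE rejects KFENCE objects. */
	}
}
```

One consequence of this rule is that an object of exactly PAGE_SIZE bytes is
always page-aligned and therefore always counts as "left", which is why
test_gfpzero() uses ALLOCATE_ANY.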
---
lib/Kconfig.kfence | 13 +
mm/kfence/Makefile | 3 +
mm/kfence/kfence_test.c | 777 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 793 insertions(+)
create mode 100644 mm/kfence/kfence_test.c

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
index 872bcbdd8cc4..46d9b6693abb 100644
--- a/lib/Kconfig.kfence
+++ b/lib/Kconfig.kfence
@@ -62,4 +62,17 @@ config KFENCE_STRESS_TEST_FAULTS

The option is only to test KFENCE; set to 0 if you are unsure.

+config KFENCE_KUNIT_TEST
+ tristate "KFENCE integration test suite" if !KUNIT_ALL_TESTS
+ default KUNIT_ALL_TESTS
+ depends on TRACEPOINTS && KUNIT
+ help
+ Test suite for KFENCE, testing various error detection scenarios with
+ various allocation types, and checking that reports are correctly
+ output to console.
+
+ Say Y here if you want the test to be built into the kernel and run
+ during boot; say M if you want the test to build as a module; say N
+ if you are unsure.
+
endif # KFENCE
diff --git a/mm/kfence/Makefile b/mm/kfence/Makefile
index d991e9a349f0..6872cd5e5390 100644
--- a/mm/kfence/Makefile
+++ b/mm/kfence/Makefile
@@ -1,3 +1,6 @@
# SPDX-License-Identifier: GPL-2.0

obj-$(CONFIG_KFENCE) := core.o report.o
+
+CFLAGS_kfence_test.o := -g -fno-omit-frame-pointer -fno-optimize-sibling-calls
+obj-$(CONFIG_KFENCE_KUNIT_TEST) += kfence_test.o
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
new file mode 100644
index 000000000000..30ac88b678ca
--- /dev/null
+++ b/mm/kfence/kfence_test.c
@@ -0,0 +1,777 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test cases for KFENCE memory safety error detector. Since the interface with
+ * which KFENCE's reports are obtained is via the console, this is the output we
+ * should verify. Each test case checks for the presence (or absence) of
+ * generated reports. Relies on 'console' tracepoint to capture reports as they
+ * appear in the kernel log.
+ *
+ * Copyright (C) 2020, Google LLC.
+ * Author: Alexander Potapenko <gli...@google.com>
+ * Marco Elver <el...@google.com>
+ */
+
+#include <kunit/test.h>
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/kfence.h>
+#include <linux/mm.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/tracepoint.h>
+#include <trace/events/printk.h>
+
+#include "kfence.h"
+
+/* Report as observed from console. */
+static struct {
+ spinlock_t lock;
+ int nlines;
+ char lines[2][512];
+} observed = {
+ .lock = __SPIN_LOCK_UNLOCKED(observed.lock),
+};
+
+/* Probe for console output: obtains observed lines of interest. */
+static void probe_console(void *ignore, const char *buf, size_t len)
+{
+ unsigned long flags;
+ int nlines;
+
+ spin_lock_irqsave(&observed.lock, flags);
+ nlines = observed.nlines;
+
+ if (strnstr(buf, "BUG: KFENCE: ", len) && strnstr(buf, "test_", len)) {
+ /*
+ * KFENCE report and related to the test.
+ *
+ * The provided @buf is not NUL-terminated; copy no more than
+ * @len bytes and let strscpy() add the missing NUL-terminator.
+ */
+ strscpy(observed.lines[0], buf, min(len + 1, sizeof(observed.lines[0])));
+ nlines = 1;
+ } else if (nlines == 1 && (strnstr(buf, "at 0x", len) || strnstr(buf, "of 0x", len))) {
+ strscpy(observed.lines[nlines++], buf, min(len + 1, sizeof(observed.lines[0])));
+ }
+
+ WRITE_ONCE(observed.nlines, nlines); /* Publish new nlines. */
+ spin_unlock_irqrestore(&observed.lock, flags);
+}
+
+/* Check if a report related to the test exists. */
+static bool report_available(void)
+{
+ return READ_ONCE(observed.nlines) == ARRAY_SIZE(observed.lines);
+}
+
+/* Information we expect in a report. */
+struct expect_report {
+ enum kfence_error_type type; /* The type of error. */
+ void *fn; /* Function pointer to expected function where access occurred. */
+ char *addr; /* Address at which the bad access occurred. */
+};
+
+/* Check observed report matches information in @r. */
+static bool report_matches(const struct expect_report *r)
+{
+ bool ret = false;
+ unsigned long flags;
+ typeof(observed.lines) expect;
+ const char *end;
+ char *cur;
+
+ /* Double-checked locking. */
+ if (!report_available())
+ return false;
+
+ /* Generate expected report contents. */
+
+ /* Title */
+ cur = expect[0];
+ end = &expect[0][sizeof(expect[0]) - 1];
+ switch (r->type) {
+ case KFENCE_ERROR_OOB:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: out-of-bounds");
+ break;
+ case KFENCE_ERROR_UAF:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: use-after-free");
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: memory corruption");
+ break;
+ case KFENCE_ERROR_INVALID:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid access");
+ break;
+ case KFENCE_ERROR_INVALID_FREE:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid free");
+ break;
+ }
+
+ scnprintf(cur, end - cur, " in %pS", r->fn);
+ /* The exact offset won't match, remove it; also strip module name. */
+ cur = strchr(expect[0], '+');
+ if (cur)
+ *cur = '\0';
+
+ /* Access information */
+ cur = expect[1];
+ end = &expect[1][sizeof(expect[1]) - 1];
+
+ switch (r->type) {
+ case KFENCE_ERROR_OOB:
+ cur += scnprintf(cur, end - cur, "Out-of-bounds access at");
+ break;
+ case KFENCE_ERROR_UAF:
+ cur += scnprintf(cur, end - cur, "Use-after-free access at");
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ cur += scnprintf(cur, end - cur, "Corrupted memory at");
+ break;
+ case KFENCE_ERROR_INVALID:
+ cur += scnprintf(cur, end - cur, "Invalid access at");
+ break;
+ case KFENCE_ERROR_INVALID_FREE:
+ cur += scnprintf(cur, end - cur, "Invalid free of");
+ break;
+ }
+
+ cur += scnprintf(cur, end - cur, " 0x" PTR_FMT, (void *)r->addr);
+
+ spin_lock_irqsave(&observed.lock, flags);
+ if (!report_available())
+ goto out; /* A new report is being captured. */
+
+ /* Finally match expected output to what we actually observed. */
+ ret = strstr(observed.lines[0], expect[0]) && strstr(observed.lines[1], expect[1]);
+out:
+ spin_unlock_irqrestore(&observed.lock, flags);
+ return ret;
+}
+
+/* ===== Test cases ===== */
+
+#define TEST_PRIV_WANT_MEMCACHE ((void *)1)
+
+/* Cache used by tests; if NULL, allocate from kmalloc instead. */
+static struct kmem_cache *test_cache;
+
+static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t flags,
+ void (*ctor)(void *))
+{
+ if (test->priv != TEST_PRIV_WANT_MEMCACHE)
+ return size;
+
+ kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor);
+
+ /*
+ * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any
+ * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to
+ * allocate via memcg, if enabled.
+ */
+ flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT;
+ test_cache = kmem_cache_create("test", size, 1, flags, ctor);
+ KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache");
+
+ return size;
+}
+
+static void test_cache_destroy(void)
+{
+ if (!test_cache)
+ return;
+
+ kmem_cache_destroy(test_cache);
+ test_cache = NULL;
+}
+
+static inline size_t kmalloc_cache_alignment(size_t size)
+{
+ return kmalloc_caches[kmalloc_type(GFP_KERNEL)][kmalloc_index(size)]->align;
+}
+
+/* Must always inline to match stack trace against caller. */
+static __always_inline void test_free(void *ptr)
+{
+ if (test_cache)
+ kmem_cache_free(test_cache, ptr);
+ else
+ kfree(ptr);
+}
+
+/*
+ * If this should be a KFENCE allocation, and on which side the allocation and
+ * the closest guard page should be.
+ */
+enum allocation_policy {
+ ALLOCATE_ANY, /* KFENCE, any side. */
+ ALLOCATE_LEFT, /* KFENCE, left side of page. */
+ ALLOCATE_RIGHT, /* KFENCE, right side of page. */
+ ALLOCATE_NONE, /* No KFENCE allocation. */
+};
+
+/*
+ * Try to get a guarded allocation from KFENCE. Uses either kmalloc() or the
+ * current test_cache if set up.
+ */
+static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocation_policy policy)
+{
+ void *alloc;
+ unsigned long timeout, resched_after;
+ const char *policy_name;
+
+ switch (policy) {
+ case ALLOCATE_ANY:
+ policy_name = "any";
+ break;
+ case ALLOCATE_LEFT:
+ policy_name = "left";
+ break;
+ case ALLOCATE_RIGHT:
+ policy_name = "right";
+ break;
+ case ALLOCATE_NONE:
+ policy_name = "none";
+ break;
+ }
+
+ kunit_info(test, "%s: size=%zu, gfp=%x, policy=%s, cache=%i\n", __func__, size, gfp,
+ policy_name, !!test_cache);
+
+ /*
+ * 100x the sample interval should be more than enough to ensure we get
+ * a KFENCE allocation eventually.
+ */
+ timeout = jiffies + msecs_to_jiffies(100 * CONFIG_KFENCE_SAMPLE_INTERVAL);
+ /*
+ * Especially for non-preemption kernels, ensure the allocation-gate
+ * timer has time to catch up.
+ */
+ resched_after = jiffies + msecs_to_jiffies(CONFIG_KFENCE_SAMPLE_INTERVAL);
+ do {
+ if (test_cache)
+ alloc = kmem_cache_alloc(test_cache, gfp);
+ else
+ alloc = kmalloc(size, gfp);
+
+ if (is_kfence_address(alloc)) {
+ if (policy == ALLOCATE_ANY)
+ return alloc;
+ if (policy == ALLOCATE_LEFT && IS_ALIGNED((unsigned long)alloc, PAGE_SIZE))
+ return alloc;
+ if (policy == ALLOCATE_RIGHT &&
+ !IS_ALIGNED((unsigned long)alloc, PAGE_SIZE))
+ return alloc;
+ } else if (policy == ALLOCATE_NONE)
+ return alloc;
+
+ test_free(alloc);
+
+ if (time_after(jiffies, resched_after))
+ cond_resched();
+ } while (time_before(jiffies, timeout));
+
+ KUNIT_ASSERT_TRUE_MSG(test, false, "failed to allocate from KFENCE");
+ return NULL; /* Unreachable. */
+}
+
+static void test_out_of_bounds_read(struct kunit *test)
+{
+ size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_OOB,
+ .fn = test_out_of_bounds_read,
+ };
+ char *buf;
+
+ setup_test_cache(test, size, 0, NULL);
+
+ /*
+ * If we don't have our own cache, adjust based on alignment, so that we
+ * actually access guard pages on either side.
+ */
+ if (!test_cache)
+ size = kmalloc_cache_alignment(size);
+
+ /* Test both sides. */
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT);
+ expect.addr = buf - 1;
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+ test_free(buf);
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
+ expect.addr = buf + size;
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+ test_free(buf);
+}
+
+static void test_use_after_free_read(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_UAF,
+ .fn = test_use_after_free_read,
+ };
+
+ setup_test_cache(test, size, 0, NULL);
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ test_free(expect.addr);
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+static void test_double_free(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_INVALID_FREE,
+ .fn = test_double_free,
+ };
+
+ setup_test_cache(test, size, 0, NULL);
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ test_free(expect.addr);
+ test_free(expect.addr); /* Double-free. */
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+static void test_invalid_addr_free(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_INVALID_FREE,
+ .fn = test_invalid_addr_free,
+ };
+ char *buf;
+
+ setup_test_cache(test, size, 0, NULL);
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ expect.addr = buf + 1; /* Free on invalid address. */
+ test_free(expect.addr); /* Invalid address free. */
+ test_free(buf); /* No error. */
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/*
+ * KFENCE is unable to detect an OOB if the allocation's alignment requirements
+ * leave a gap between the object and the guard page. Specifically, an
+ * allocation of e.g. 73 bytes is aligned on 8 and 128 bytes for SLUB or SLAB
+ * respectively. Therefore it is impossible for the allocated object to adhere
+ * to either of the page boundaries.
+ *
+ * However, we test that an access to memory beyond the gap results in KFENCE
+ * detecting an OOB access.
+ */
+static void test_kmalloc_aligned_oob_read(struct kunit *test)
+{
+ const size_t size = 73;
+ const size_t align = kmalloc_cache_alignment(size);
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_OOB,
+ .fn = test_kmalloc_aligned_oob_read,
+ };
+ char *buf;
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
+
+ /*
+ * The object is offset to the right, so there won't be an OOB to the
+ * left of it.
+ */
+ READ_ONCE(*(buf - 1));
+ KUNIT_EXPECT_FALSE(test, report_available());
+
+ /*
+ * @buf must be aligned on @align, therefore buf + size belongs to the
+ * same page -> no OOB.
+ */
+ READ_ONCE(*(buf + size));
+ KUNIT_EXPECT_FALSE(test, report_available());
+
+ /* Overflowing by @align bytes will result in an OOB. */
+ expect.addr = buf + size + align;
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+
+ test_free(buf);
+}
+
+static void test_kmalloc_aligned_oob_write(struct kunit *test)
+{
+ const size_t size = 73;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_CORRUPTION,
+ .fn = test_kmalloc_aligned_oob_write,
+ };
+ char *buf;
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
+ /*
+ * The object is offset to the right, so we won't get a page
+ * fault immediately after it.
+ */
+ expect.addr = buf + size;
+ WRITE_ONCE(*expect.addr, READ_ONCE(*expect.addr) + 1);
+ KUNIT_EXPECT_FALSE(test, report_available());
+ test_free(buf);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test cache shrinking and destroying with KFENCE. */
+static void test_shrink_memcache(struct kunit *test)
+{
+ const size_t size = 32;
+ void *buf;
+
+ setup_test_cache(test, size, 0, NULL);
+ KUNIT_EXPECT_TRUE(test, test_cache);
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ kmem_cache_shrink(test_cache);
+ test_free(buf);
+
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+static void ctor_set_x(void *obj)
+{
+ /* Every object has at least 8 bytes. */
+ memset(obj, 'x', 8);
+}
+
+/* Ensure that SL*B does not modify KFENCE objects on bulk free. */
+static void test_free_bulk(struct kunit *test)
+{
+ int iter;
+
+ for (iter = 0; iter < 5; iter++) {
+ const size_t size = setup_test_cache(test, 8 + prandom_u32_max(300), 0,
+ (iter & 1) ? ctor_set_x : NULL);
+ void *objects[] = {
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT),
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE),
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT),
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE),
+ test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE),
+ };
+
+ kmem_cache_free_bulk(test_cache, ARRAY_SIZE(objects), objects);
+ KUNIT_ASSERT_FALSE(test, report_available());
+ test_cache_destroy();
+ }
+}
+
+/* Test init-on-free works. */
+static void test_init_on_free(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_UAF,
+ .fn = test_init_on_free,
+ };
+ int i;
+
+ if (!IS_ENABLED(CONFIG_INIT_ON_FREE_DEFAULT_ON))
+ return;
+ /* Assume it hasn't been disabled on command line. */
+
+ setup_test_cache(test, size, 0, NULL);
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ for (i = 0; i < size; i++)
+ expect.addr[i] = i + 1;
+ test_free(expect.addr);
+
+ for (i = 0; i < size; i++) {
+ /*
+ * This may fail if the page was recycled by KFENCE and then
+ * written to again -- this however, is near impossible with a
+ * default config.
+ */
+ KUNIT_EXPECT_EQ(test, expect.addr[i], (char)0);
+
+ if (!i) /* Only check first access to not fail test if page is ever re-protected. */
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+ }
+}
+
+/* Ensure that constructors work properly. */
+static void test_memcache_ctor(struct kunit *test)
+{
+ const size_t size = 32;
+ char *buf;
+ int i;
+
+ setup_test_cache(test, size, 0, ctor_set_x);
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+
+ for (i = 0; i < 8; i++)
+ KUNIT_EXPECT_EQ(test, buf[i], (char)'x');
+
+ test_free(buf);
+
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+/* Test that memory is zeroed if requested. */
+static void test_gfpzero(struct kunit *test)
+{
+ const size_t size = PAGE_SIZE; /* PAGE_SIZE so we can use ALLOCATE_ANY. */
+ char *buf1, *buf2;
+ int i;
+
+ if (CONFIG_KFENCE_SAMPLE_INTERVAL > 100) {
+ kunit_warn(test, "skipping ... would take too long\n");
+ return;
+ }
+
+ setup_test_cache(test, size, 0, NULL);
+ buf1 = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ for (i = 0; i < size; i++)
+ buf1[i] = i + 1;
+ test_free(buf1);
+
+ /* Try to get same address again -- this can take a while. */
+ for (i = 0;; i++) {
+ buf2 = test_alloc(test, size, GFP_KERNEL | __GFP_ZERO, ALLOCATE_ANY);
+ if (buf1 == buf2)
+ break;
+ test_free(buf2);
+
+ if (i == CONFIG_KFENCE_NUM_OBJECTS) {
+ kunit_warn(test, "giving up ... cannot get same object back\n");
+ return;
+ }
+ }
+
+ for (i = 0; i < size; i++)
+ KUNIT_EXPECT_EQ(test, buf2[i], (char)0);
+
+ test_free(buf2);
+
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+static void test_invalid_access(struct kunit *test)
+{
+ const struct expect_report expect = {
+ .type = KFENCE_ERROR_INVALID,
+ .fn = test_invalid_access,
+ .addr = &__kfence_pool[10],
+ };
+
+ READ_ONCE(__kfence_pool[10]);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test SLAB_TYPESAFE_BY_RCU works. */
+static void test_memcache_typesafe_by_rcu(struct kunit *test)
+{
+ const size_t size = 32;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_UAF,
+ .fn = test_memcache_typesafe_by_rcu,
+ };
+
+ setup_test_cache(test, size, SLAB_TYPESAFE_BY_RCU, NULL);
+ KUNIT_EXPECT_TRUE(test, test_cache); /* Want memcache. */
+
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ *expect.addr = 42;
+
+ rcu_read_lock();
+ test_free(expect.addr);
+ KUNIT_EXPECT_EQ(test, *expect.addr, (char)42);
+ rcu_read_unlock();
+
+ /* No reports yet, memory should not have been freed on access. */
+ KUNIT_EXPECT_FALSE(test, report_available());
+ rcu_barrier(); /* Wait for free to happen. */
+
+ /* Expect use-after-free. */
+ KUNIT_EXPECT_EQ(test, *expect.addr, (char)42);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+}
+
+/* Test krealloc(). */
+static void test_krealloc(struct kunit *test)
+{
+ const size_t size = 32;
+ const struct expect_report expect = {
+ .type = KFENCE_ERROR_UAF,
+ .fn = test_krealloc,
+ .addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY),
+ };
+ char *buf = expect.addr;
+ int i;
+
+ KUNIT_EXPECT_FALSE(test, test_cache);
+ KUNIT_EXPECT_EQ(test, ksize(buf), size); /* Precise size match after KFENCE alloc. */
+ for (i = 0; i < size; i++)
+ buf[i] = i + 1;
+
+ /* Check that we successfully change the size. */
+ buf = krealloc(buf, size * 3, GFP_KERNEL); /* Grow. */
+ /* Note: Might no longer be a KFENCE alloc. */
+ KUNIT_EXPECT_GE(test, ksize(buf), size * 3);
+ for (i = 0; i < size; i++)
+ KUNIT_EXPECT_EQ(test, buf[i], (char)(i + 1));
+ for (; i < size * 3; i++) /* Fill the extra bytes. */
+ buf[i] = i + 1;
+
+ buf = krealloc(buf, size * 2, GFP_KERNEL); /* Shrink. */
+ KUNIT_EXPECT_GE(test, ksize(buf), size * 2);
+ for (i = 0; i < size * 2; i++)
+ KUNIT_EXPECT_EQ(test, buf[i], (char)(i + 1));
+
+ buf = krealloc(buf, 0, GFP_KERNEL); /* Free. */
+ KUNIT_EXPECT_EQ(test, (unsigned long)buf, (unsigned long)ZERO_SIZE_PTR);
+ KUNIT_ASSERT_FALSE(test, report_available()); /* No reports yet! */
+
+ READ_ONCE(*expect.addr); /* Ensure krealloc() actually freed earlier KFENCE object. */
+ KUNIT_ASSERT_TRUE(test, report_matches(&expect));
+}
+
+/* Test that some objects from a bulk allocation belong to KFENCE pool. */
+static void test_memcache_alloc_bulk(struct kunit *test)
+{
+ const size_t size = 32;
+ bool pass = false;
+ unsigned long timeout;
+
+ setup_test_cache(test, size, 0, NULL);
+ KUNIT_EXPECT_TRUE(test, test_cache); /* Want memcache. */
+ /*
+ * 100x the sample interval should be more than enough to ensure we get
+ * a KFENCE allocation eventually.
+ */
+ timeout = jiffies + msecs_to_jiffies(100 * CONFIG_KFENCE_SAMPLE_INTERVAL);
+ do {
+ void *objects[100];
+ int i, num = kmem_cache_alloc_bulk(test_cache, GFP_ATOMIC, ARRAY_SIZE(objects),
+ objects);
+ if (!num)
+ continue;
+ for (i = 0; i < ARRAY_SIZE(objects); i++) {
+ if (is_kfence_address(objects[i])) {
+ pass = true;
+ break;
+ }
+ }
+ kmem_cache_free_bulk(test_cache, num, objects);
+ /*
+ * kmem_cache_alloc_bulk() disables interrupts, and calling it
+ * in a tight loop may not give KFENCE a chance to switch the
+ * static branch. Call cond_resched() to let KFENCE chime in.
+ */
+ cond_resched();
+ } while (!pass && time_before(jiffies, timeout));
+
+ KUNIT_EXPECT_TRUE(test, pass);
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+/*
+ * KUnit does not provide a way to pass arguments to tests, so we encode
+ * additional info in the name. Set up 2 tests per test case, one using the
+ * default allocator, and another using a custom memcache (suffix '-memcache').
+ */
+#define KFENCE_KUNIT_CASE(test_name) \
+ { .run_case = test_name, .name = #test_name }, \
+ { .run_case = test_name, .name = #test_name "-memcache" }
+
+static struct kunit_case kfence_test_cases[] = {
+ KFENCE_KUNIT_CASE(test_out_of_bounds_read),
+ KFENCE_KUNIT_CASE(test_use_after_free_read),
+ KFENCE_KUNIT_CASE(test_double_free),
+ KFENCE_KUNIT_CASE(test_invalid_addr_free),
+ KFENCE_KUNIT_CASE(test_free_bulk),
+ KFENCE_KUNIT_CASE(test_init_on_free),
+ KUNIT_CASE(test_kmalloc_aligned_oob_read),
+ KUNIT_CASE(test_kmalloc_aligned_oob_write),
+ KUNIT_CASE(test_shrink_memcache),
+ KUNIT_CASE(test_memcache_ctor),
+ KUNIT_CASE(test_invalid_access),
+ KUNIT_CASE(test_gfpzero),
+ KUNIT_CASE(test_memcache_typesafe_by_rcu),
+ KUNIT_CASE(test_krealloc),
+ KUNIT_CASE(test_memcache_alloc_bulk),
+ {},
+};
+
+/* ===== End test cases ===== */
+
+static int test_init(struct kunit *test)
+{
+ unsigned long flags;
+ int i;
+
+ spin_lock_irqsave(&observed.lock, flags);
+ for (i = 0; i < ARRAY_SIZE(observed.lines); i++)
+ observed.lines[i][0] = '\0';
+ observed.nlines = 0;
+ spin_unlock_irqrestore(&observed.lock, flags);
+
+ /* Any test with 'memcache' in its name will want a memcache. */
+ if (strstr(test->name, "memcache"))
+ test->priv = TEST_PRIV_WANT_MEMCACHE;
+ else
+ test->priv = NULL;
+
+ return 0;
+}
+
+static void test_exit(struct kunit *test)
+{
+ test_cache_destroy();
+}
+
+static struct kunit_suite kfence_test_suite = {
+ .name = "kfence",
+ .test_cases = kfence_test_cases,
+ .init = test_init,
+ .exit = test_exit,
+};
+static struct kunit_suite *kfence_test_suites[] = { &kfence_test_suite, NULL };
+
+static void register_tracepoints(struct tracepoint *tp, void *ignore)
+{
+ check_trace_callback_type_console(probe_console);
+ if (!strcmp(tp->name, "console"))
+ WARN_ON(tracepoint_probe_register(tp, probe_console, NULL));
+}
+
+static void unregister_tracepoints(struct tracepoint *tp, void *ignore)
+{
+ if (!strcmp(tp->name, "console"))
+ tracepoint_probe_unregister(tp, probe_console, NULL);
+}
+
+/*
+ * We only want to do tracepoints setup and teardown once, therefore we have to
+ * customize the init and exit functions and cannot rely on kunit_test_suite().
+ */
+static int __init kfence_test_init(void)
+{
+ /*
+ * Because we want to be able to build the test as a module, we need to
+ * iterate through all known tracepoints, since the static registration
+ * won't work here.
+ */
+ for_each_kernel_tracepoint(register_tracepoints, NULL);
+ return __kunit_test_suites_init(kfence_test_suites);
+}
+
+static void kfence_test_exit(void)
+{
+ __kunit_test_suites_exit(kfence_test_suites);
+ for_each_kernel_tracepoint(unregister_tracepoints, NULL);
+ tracepoint_synchronize_unregister();
+}
+
+late_initcall(kfence_test_init);
+module_exit(kfence_test_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Alexander Potapenko <gli...@google.com>, Marco Elver <el...@google.com>");
--
2.28.0.618.gf4bc123cb7-goog

Dmitry Vyukov

Sep 15, 2020, 9:49:59 AM9/15/20

to Marco Elver, Andrew Morton, Alexander Potapenko, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, Qian Cai, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM

I see all of my comments from v1 are resolved. So this is:

Reviewed-by: Dmitry Vyukov <dvy...@google.com>

for the series.

SeongJae Park

Sep 15, 2020, 9:59:08 AM9/15/20

to Marco Elver, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, linu...@vger.kernel.org, pet...@infradead.org, dave....@linux.intel.com, linu...@kvack.org, edum...@google.com, h...@zytor.com, wi...@kernel.org, cor...@lwn.net, x...@kernel.org, kasa...@googlegroups.com, mi...@redhat.com, linux-ar...@lists.infradead.org, arya...@virtuozzo.com, kees...@chromium.org, pau...@kernel.org, ja...@google.com, andre...@google.com, c...@lca.pw, lu...@kernel.org, tg...@linutronix.de, dvy...@google.com, gre...@linuxfoundation.org, linux-...@vger.kernel.org, b...@alien8.de

On Mon, 7 Sep 2020 15:40:46 +0200 Marco Elver <el...@google.com> wrote:

> From: Alexander Potapenko <gli...@google.com>
>
> This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
> low-overhead sampling-based memory safety error detector of heap
> use-after-free, invalid-free, and out-of-bounds access errors.
>
> KFENCE is designed to be enabled in production kernels, and has near
> zero performance overhead. Compared to KASAN, KFENCE trades performance
> for precision. The main motivation behind KFENCE's design, is that with
> enough total uptime KFENCE will detect bugs in code paths not typically
> exercised by non-production test workloads. One way to quickly achieve a
> large enough total uptime is when the tool is deployed across a large
> fleet of machines.
>
> KFENCE objects each reside on a dedicated page, at either the left or
> right page boundaries. The pages to the left and right of the object
> page are "guard pages", whose attributes are changed to a protected
> state, and cause page faults on any attempted access to them. Such page
> faults are then intercepted by KFENCE, which handles the fault
> gracefully by reporting a memory access error.
>
> Guarded allocations are set up based on a sample interval (can be set
> via kfence.sample_interval). After expiration of the sample interval, a
> guarded allocation from the KFENCE object pool is returned to the main
> allocator (SLAB or SLUB). At this point, the timer is reset, and the
> next allocation is set up after the expiration of the interval.
>
> To enable/disable a KFENCE allocation through the main allocator's
> fast-path without overhead, KFENCE relies on static branches via the
> static keys infrastructure. The static branch is toggled to redirect the
> allocation to KFENCE. To date, we have verified by running synthetic
> benchmarks (sysbench I/O workloads) that a kernel compiled with KFENCE
> is performance-neutral compared to the non-KFENCE baseline.
>
> For more details, see Documentation/dev-tools/kfence.rst (added later in
> the series).

Such an interesting feature! I left some trivial comments below.

>
> Co-developed-by: Marco Elver <el...@google.com>
> Signed-off-by: Marco Elver <el...@google.com>
> Signed-off-by: Alexander Potapenko <gli...@google.com>
> ---
>  MAINTAINERS            |  11 +
>  include/linux/kfence.h | 174 ++++++++++
>  init/main.c            |   2 +
>  lib/Kconfig.debug      |   1 +
>  lib/Kconfig.kfence     |  58 ++++
>  mm/Makefile            |   1 +
>  mm/kfence/Makefile     |   3 +
>  mm/kfence/core.c       | 730 +++++++++++++++++++++++++++++++++++++++++
>  mm/kfence/kfence.h     | 104 ++++++
>  mm/kfence/report.c     | 201 ++++++++++++
>  10 files changed, 1285 insertions(+)
>  create mode 100644 include/linux/kfence.h
>  create mode 100644 lib/Kconfig.kfence
>  create mode 100644 mm/kfence/Makefile
>  create mode 100644 mm/kfence/core.c
>  create mode 100644 mm/kfence/kfence.h
>  create mode 100644 mm/kfence/report.c
[...]
> diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
> new file mode 100644
> index 000000000000..7ac91162edb0
> --- /dev/null
> +++ b/lib/Kconfig.kfence
> @@ -0,0 +1,58 @@

> +# SPDX-License-Identifier: GPL-2.0-only
> +
> +config HAVE_ARCH_KFENCE
> + bool
> +
> +config HAVE_ARCH_KFENCE_STATIC_POOL
> + bool
> + help
> + If the architecture supports using the static pool.
> +
> +menuconfig KFENCE
> + bool "KFENCE: low-overhead sampling-based memory safety error detector"
> + depends on HAVE_ARCH_KFENCE && !KASAN && (SLAB || SLUB)
> + depends on JUMP_LABEL # To ensure performance, require jump labels
> + select STACKTRACE
> + help
> + KFENCE is low-overhead sampling-based detector for heap out-of-bounds
> + access, use-after-free, and invalid-free errors. KFENCE is designed
> + to have negligible cost to permit enabling it in production
> + environments.
> +
> + See <file:Documentation/dev-tools/kfence.rst> for more details.

This patch doesn't provide the file yet. Why don't you add the reference with
the patch introducing the file?

> +
> + Note that, KFENCE is not a substitute for explicit testing with tools
> + such as KASAN. KFENCE can detect a subset of bugs that KASAN can

> + detect (therefore enabling KFENCE together with KASAN does not make
> + sense), albeit at very different performance profiles.
[...]
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> new file mode 100644
> index 000000000000..e638d1f64a32
> --- /dev/null
> +++ b/mm/kfence/core.c
> @@ -0,0 +1,730 @@

> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define pr_fmt(fmt) "kfence: " fmt

[...]

> +
> +static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
> +{
> + long index;
> +
> + /* The checks do not affect performance; only called from slow-paths. */
> +
> + if (!is_kfence_address((void *)addr))
> + return NULL;

> +
> + /*

> + * May be an invalid index if called with an address at the edge of
> + * __kfence_pool, in which case we would report an "invalid access"
> + * error.
> + */
> + index = ((addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2)) - 1;

The outermost parentheses seem unnecessary.

> + if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
> + return NULL;
> +
> + return &kfence_metadata[index];
> +}
> +
> +static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *meta)
> +{
> + unsigned long offset = ((meta - kfence_metadata) + 1) * PAGE_SIZE * 2;

The innermost parentheses seem unnecessary.

> + unsigned long pageaddr = (unsigned long)&__kfence_pool[offset];
> +
> + /* The checks do not affect performance; only called from slow-paths. */
> +
> + /* Only call with a pointer into kfence_metadata. */
> + if (KFENCE_WARN_ON(meta < kfence_metadata ||
> + meta >= kfence_metadata + ARRAY_SIZE(kfence_metadata)))

Is there a reason to use ARRAY_SIZE(kfence_metadata) instead of
CONFIG_KFENCE_NUM_OBJECTS?

> + return 0;
> +

> + /*
> + * This metadata object only ever maps to 1 page; verify the calculation
> + * happens and that the stored address was not corrupted.
> + */
> + if (KFENCE_WARN_ON(ALIGN_DOWN(meta->addr, PAGE_SIZE) != pageaddr))

> + return 0;
> +
> + return pageaddr;
> +}
[...]

> +void __init kfence_init(void)
> +{
> + /* Setting kfence_sample_interval to 0 on boot disables KFENCE. */
> + if (!kfence_sample_interval)
> + return;
> +
> + if (!kfence_initialize_pool()) {
> + pr_err("%s failed\n", __func__);

> + return;
> + }
> +

> + schedule_delayed_work(&kfence_timer, 0);
> + WRITE_ONCE(kfence_enabled, true);

> + pr_info("initialized - using %zu bytes for %d objects", KFENCE_POOL_SIZE,

> + CONFIG_KFENCE_NUM_OBJECTS);
> + if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
> + pr_cont(" at 0x%px-0x%px\n", (void *)__kfence_pool,
> + (void *)(__kfence_pool + KFENCE_POOL_SIZE));

Why don't you use PTR_FMT, which is defined in 'kfence.h'?

> + else
> + pr_cont("\n");
> +}

[...]
> diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
> new file mode 100644
> index 000000000000..25ce2c0dc092
> --- /dev/null
> +++ b/mm/kfence/kfence.h
> @@ -0,0 +1,104 @@

> +/* SPDX-License-Identifier: GPL-2.0 */
> +

> +#ifndef MM_KFENCE_KFENCE_H
> +#define MM_KFENCE_KFENCE_H
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +
> +#include "../slab.h" /* for struct kmem_cache */
> +
> +/* For non-debug builds, avoid leaking kernel pointers into dmesg. */
> +#ifdef CONFIG_DEBUG_KERNEL
> +#define PTR_FMT "%px"
> +#else
> +#define PTR_FMT "%p"
> +#endif
> +
> +/*
> + * Get the canary byte pattern for @addr. Use a pattern that varies based on the
> + * lower 3 bits of the address, to detect memory corruptions with higher
> + * probability, where similar constants are used.
> + */

> +#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)addr & 0x7))

> +
> +/* Maximum stack depth for reports. */
> +#define KFENCE_STACK_DEPTH 64
> +
> +/* KFENCE object states. */
> +enum kfence_object_state {
> + KFENCE_OBJECT_UNUSED, /* Object is unused. */
> + KFENCE_OBJECT_ALLOCATED, /* Object is currently allocated. */
> + KFENCE_OBJECT_FREED, /* Object was allocated, and then freed. */

Aligning the comments would look better (the same applies to the comments below).

> +};
[...]
> diff --git a/mm/kfence/report.c b/mm/kfence/report.c
> new file mode 100644
> index 000000000000..8c28200e7433
> --- /dev/null
> +++ b/mm/kfence/report.c
> @@ -0,0 +1,201 @@
> +// SPDX-License-Identifier: GPL-2.0
[...]
> +/* Get the number of stack entries to skip to get out of MM internals. */

> +static int get_stack_skipnr(const unsigned long stack_entries[], int num_entries,

> + enum kfence_error_type type)
> +{

> + char buf[64];
> + int skipnr, fallback = 0;
> +

> + for (skipnr = 0; skipnr < num_entries; skipnr++) {
> + int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]);
> +

> + /* Depending on error type, find different stack entries. */

> + switch (type) {

> + case KFENCE_ERROR_UAF:
> + case KFENCE_ERROR_OOB:
> + case KFENCE_ERROR_INVALID:

> + if (!strncmp(buf, KFENCE_SKIP_ARCH_FAULT_HANDLER, len))

It seems KFENCE_SKIP_ARCH_FAULT_HANDLER is not defined yet?

> + goto found;
> + break;
[...]

Thanks,
SeongJae Park

Marco Elver

Sep 15, 2020, 10:14:57 AM9/15/20

to SeongJae Park, gli...@google.com, ak...@linux-foundation.org, catalin...@arm.com, c...@linux.com, rien...@google.com, iamjoon...@lge.com, mark.r...@arm.com, pen...@kernel.org, linu...@vger.kernel.org, pet...@infradead.org, dave....@linux.intel.com, linu...@kvack.org, edum...@google.com, h...@zytor.com, wi...@kernel.org, cor...@lwn.net, x...@kernel.org, kasa...@googlegroups.com, mi...@redhat.com, linux-ar...@lists.infradead.org, arya...@virtuozzo.com, kees...@chromium.org, pau...@kernel.org, ja...@google.com, andre...@google.com, c...@lca.pw, lu...@kernel.org, tg...@linutronix.de, dvy...@google.com, gre...@linuxfoundation.org, linux-...@vger.kernel.org, b...@alien8.de

On Tue, Sep 15, 2020 at 03:57PM +0200, SeongJae Park wrote:
[...]

>
> Such an interesting feature! I left some trivial comments below.

Thank you!

Sure, will fix for v3.

Will fix.

> > + if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
> > + return NULL;
> > +
> > + return &kfence_metadata[index];
> > +}
> > +
> > +static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *meta)
> > +{
> > + unsigned long offset = ((meta - kfence_metadata) + 1) * PAGE_SIZE * 2;
>
> The innermost parentheses seem unnecessary.

Will fix.

> > + unsigned long pageaddr = (unsigned long)&__kfence_pool[offset];
> > +
> > + /* The checks do not affect performance; only called from slow-paths. */
> > +
> > + /* Only call with a pointer into kfence_metadata. */
> > + if (KFENCE_WARN_ON(meta < kfence_metadata ||
> > + meta >= kfence_metadata + ARRAY_SIZE(kfence_metadata)))
>
> Is there a reason to use ARRAY_SIZE(kfence_metadata) instead of
> CONFIG_KFENCE_NUM_OBJECTS?

They're equivalent. We can switch it. (Although I don't see one being
superior to the other.. maybe we save on compile-time?)

It's unnecessary, since all this is conditional on
IS_ENABLED(CONFIG_DEBUG_KERNEL) and we can just avoid the indirection
through PTR_FMT.

Will fix.

Correct, it'll be defined in <asm/kfence.h> in the x86 and arm64
patches. Leaving this is fine, since no architecture has selected
HAVE_ARCH_KFENCE in this patch yet; as a result, we also can't break the
build even if this is undefined.

Thanks,
-- Marco

SeongJae Park

Sep 15, 2020, 10:27:13 AM9/15/20

to Marco Elver, SeongJae Park, mark.r...@arm.com, linu...@vger.kernel.org, pet...@infradead.org, catalin...@arm.com, dave....@linux.intel.com, linu...@kvack.org, edum...@google.com, gli...@google.com, h...@zytor.com, c...@linux.com, wi...@kernel.org, cor...@lwn.net, x...@kernel.org, kasa...@googlegroups.com, mi...@redhat.com, dvy...@google.com, rien...@google.com, arya...@virtuozzo.com, kees...@chromium.org, pau...@kernel.org, ja...@google.com, andre...@google.com, c...@lca.pw, lu...@kernel.org, tg...@linutronix.de, ak...@linux-foundation.org, linux-ar...@lists.infradead.org, gre...@linuxfoundation.org, linux-...@vger.kernel.org, pen...@kernel.org, b...@alien8.de, iamjoon...@lge.com

On Tue, 15 Sep 2020 16:14:49 +0200 Marco Elver <el...@google.com> wrote:

> On Tue, Sep 15, 2020 at 03:57PM +0200, SeongJae Park wrote:
> [...]
> >
> > Such an interesting feature! I left some trivial comments below.
>
> Thank you!
[...]

> > > +
> > > + /* Only call with a pointer into kfence_metadata. */
> > > + if (KFENCE_WARN_ON(meta < kfence_metadata ||
> > > + meta >= kfence_metadata + ARRAY_SIZE(kfence_metadata)))
> >
> > Is there a reason to use ARRAY_SIZE(kfence_metadata) instead of
> > CONFIG_KFENCE_NUM_OBJECTS?
>
> They're equivalent. We can switch it. (Although I don't see one being
> superior to the other.. maybe we save on compile-time?)

I prefer CONFIG_KFENCE_NUM_OBJECTS here just because it's more widely used in
the code. Also, I personally think it's easier to read.

[...]

> > > + pr_info("initialized - using %zu bytes for %d objects", KFENCE_POOL_SIZE,
> > > + CONFIG_KFENCE_NUM_OBJECTS);
> > > + if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
> > > + pr_cont(" at 0x%px-0x%px\n", (void *)__kfence_pool,
> > > + (void *)(__kfence_pool + KFENCE_POOL_SIZE));
> >
> > Why don't you use PTR_FMT, which is defined in 'kfence.h'?
>
> It's unnecessary, since all this is conditional on
> IS_ENABLED(CONFIG_DEBUG_KERNEL) and we can just avoid the indirection
> through PTR_FMT.

Ok, agreed.

[...]

> > > + for (skipnr = 0; skipnr < num_entries; skipnr++) {
> > > + int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]);
> > > +
> > > + /* Depending on error type, find different stack entries. */
> > > + switch (type) {
> > > + case KFENCE_ERROR_UAF:
> > > + case KFENCE_ERROR_OOB:
> > > + case KFENCE_ERROR_INVALID:
> > > + if (!strncmp(buf, KFENCE_SKIP_ARCH_FAULT_HANDLER, len))
> >
> > It seems KFENCE_SKIP_ARCH_FAULT_HANDLER is not defined yet?
>
> Correct, it'll be defined in <asm/kfence.h> in the x86 and arm64
> patches. Leaving this is fine, since no architecture has selected
> HAVE_ARCH_KFENCE in this patch yet; as a result, we also can't break the
> build even if this is undefined.

Ah, got it. Thank you for the kind explanation.

Thanks,
SeongJae Park

>
> Thanks,
> -- Marco

Christopher Lameter

Sep 17, 2020, 5:37:12 AM9/17/20

to Marco Elver, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On Tue, 15 Sep 2020, Marco Elver wrote:

> @@ -3206,7 +3207,7 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
> }
>
> static __always_inline void *
> -slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
> +slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_size,
> unsigned long caller)
> {

The size of the object is available via a field in kmem_cache. And a
pointer to the current kmem_cache is already passed to the function. Why
is there a need to add an additional parameter?

Christopher Lameter

Sep 17, 2020, 5:40:10 AM9/17/20

to Marco Elver, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, c...@lca.pw, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On Tue, 15 Sep 2020, Marco Elver wrote:

> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> {
> - void *ret = slab_alloc(s, gfpflags, _RET_IP_);
> + void *ret = slab_alloc(s, gfpflags, _RET_IP_, s->object_size);

The additional size parameter is a part of a struct kmem_cache that is
already passed to the function. Why does the parameter list need to be
expanded?

Alexander Potapenko

Sep 17, 2020, 5:48:12 AM9/17/20

to Christopher Lameter, Marco Elver, Andrew Morton, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Dave Hansen, David Rientjes, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, Qian Cai, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

> > static __always_inline void *
> > -slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
> > +slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_size,
> > unsigned long caller)
> > {
>
> The size of the object is available via a field in kmem_cache. And a
> pointer to the current kmem_cache is already passed to the function. Why
> is there a need to add an additional parameter?

That's because we want to do our best to detect bugs on kmalloc-allocated
objects. kmalloc uses size classes, so e.g. when allocating 272 bytes the
object will be padded to 512. As a result, placing that padded object at the
end of the page won't really help to detect out-of-bounds accesses that are
off by less than 240 bytes (512 - 272).

We probably need to better clarify this in the patch description.
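
To make the arithmetic concrete, here is a minimal user-space sketch (not
kernel code). The helper names size_class() and undetected_slack() are
hypothetical, and the power-of-two rounding only approximates kmalloc's real
size classes, which also include a few non-power-of-two caches.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: round a request up to the next power-of-two size
 * class, starting at 8 bytes. This is an approximation for illustration.
 */
static size_t size_class(size_t size)
{
	size_t c = 8;

	while (c < size)
		c <<= 1;
	return c;
}

/*
 * Number of bytes past the end of the requested object that can be touched
 * without reaching the guard page, when the object is right-aligned on its
 * page using placed_size instead of the size the caller asked for.
 */
static size_t undetected_slack(size_t orig_size, size_t placed_size)
{
	assert(placed_size >= orig_size);
	return placed_size - orig_size;
}
```

In this simplified model a 272-byte request lands in the 512-byte class, so
right-aligning by the cache size leaves 240 bytes of slack, while
right-aligning by the original size leaves none. Real alignment requirements
may still add a few bytes.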

--
Alexander Potapenko
Software Engineer


Alexander Potapenko

Sep 17, 2020, 5:51:32 AM9/17/20

to Christopher Lameter, Marco Elver, Andrew Morton, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Dave Hansen, David Rientjes, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, Qian Cai, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

See my response to the similar question about the SLAB allocator:
https://lore.kernel.org/linux-arm-kernel/CAG_fn=XMc8NPZPFtUE=rdoR=XJH4F+TxZs-w5...@mail.gmail.com/

Qian Cai

Sep 18, 2020, 7:17:21 AM9/18/20

to Marco Elver, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Does anybody else grow tired of all those different *imperfect* versions of
in-kernel memory safety error detectors? KASAN-generic, KFENCE, KASAN-tag-based
etc. Then, we have old things like page_poison, SLUB debugging, debug_pagealloc
etc. which are pretty much ineffective at detecting bugs these days compared to
KASAN. Can't we work towards having a single implementation and clean up all
this mess?

Marco Elver

Sep 18, 2020, 7:59:28 AM9/18/20

to Qian Cai, Andrew Morton, Alexander Potapenko, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn, Jonathan Cameron, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

If you have suggestions on how to get a zero-overhead, precise
("perfect") memory safety error detector without new hardware
extensions, we're open to suggestions -- many people over many years
have researched this problem, and while we're making progress for C
(and C++), the fact remains that what you're asking is likely
impossible. This might be useful background:
https://arxiv.org/pdf/1802.09517.pdf

The fact remains that requirements and environments vary across
applications and usecases. Maybe for one usecase (debugging, test env)
normal KASAN is just fine. But that doesn't work for production, where
we want to have max performance.

MTE will get us closer (no silicon yet, and ARM64 only for now), but
depending on the implementation it might come with small overheads,
although these should be quite acceptable for most environments given the
increasing processing power that modern CPUs deliver.

Yet for other environments, where even a small performance regression
is unacceptable, and where it's infeasible to capture in tests what
the workloads execute, KFENCE is a very attractive option.

There have also been discussions on using Rust in the kernel [1], but
this is just not feasible for core kernel code in the near future
(even then, you'll still need dynamic error detection tools for all
the unsafe bits, of which there are many in an OS kernel).
[1] https://lwn.net/Articles/829858/

Thanks,
-- Marco

Marco Elver

Sep 21, 2020, 9:26:37 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst

[1] http://llvm.org/docs/GwpAsan.html
[2] https://linux.die.net/man/3/efence

v3:
* Rewrite SLAB/SLUB patch descriptions to clarify need for 'orig_size'.

* Various smaller fixes (see details in patches).

v2: https://lkml.kernel.org/r/20200915132046....@google.com

kernel/locking/lockdep.c | 8 +
lib/Kconfig.debug | 1 +

create mode 100644 include/linux/kfence.h
create mode 100644 lib/Kconfig.kfence
create mode 100644 mm/kfence/Makefile
create mode 100644 mm/kfence/core.c
create mode 100644 mm/kfence/kfence.h

create mode 100644 mm/kfence/kfence_test.c
create mode 100644 mm/kfence/report.c

--

2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:26:40 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

From: Alexander Potapenko <gli...@google.com>

This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors.

KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE gives up precision
in exchange for performance. The main motivation behind KFENCE's design
is that, with enough total uptime, KFENCE will detect bugs in code paths
not typically exercised by non-production test workloads. One way to
quickly achieve a large enough total uptime is to deploy the tool across
a large fleet of machines.

KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error. To detect out-of-bounds
writes to memory within the object's page itself, KFENCE also uses
pattern-based redzones. The following figure illustrates the page
layout:

---+-----------+-----------+-----------+-----------+-----------+---

---+-----------+-----------+-----------+-----------+-----------+---
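
As an illustration of the layout described above, the following user-space
sketch computes where an object page and its two guard pages sit inside the
pool. The slot formula, (index + 1) * PAGE_SIZE * 2, mirrors
metadata_to_pageaddr() and addr_to_metadata() as quoted earlier in this
thread. The 4 KiB page size, the value 255 for the object count, and the
helper names are assumptions made only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096UL	/* assumed page size */
#define NUM_OBJECTS 255		/* stand-in for CONFIG_KFENCE_NUM_OBJECTS */

/* Base address of the page that holds object number index. */
static uintptr_t object_page(uintptr_t pool, long index)
{
	assert(index >= 0 && index < NUM_OBJECTS);
	return pool + (uintptr_t)(index + 1) * PAGE_SIZE * 2;
}

/* Guard pages directly to the left and right of an object page. */
static uintptr_t left_guard(uintptr_t pool, long index)
{
	return object_page(pool, index) - PAGE_SIZE;
}

static uintptr_t right_guard(uintptr_t pool, long index)
{
	return object_page(pool, index) + PAGE_SIZE;
}

/*
 * Inverse mapping. Every two-page slot (object page plus the guard page to
 * its right) maps to one index. The first two pages of the pool hold no
 * object, so they map to -1, as does anything outside the valid range.
 */
static long addr_to_index(uintptr_t pool, uintptr_t addr)
{
	long index;

	if (addr < pool)
		return -1;
	index = (long)((addr - pool) / (PAGE_SIZE * 2)) - 1;
	if (index < 0 || index >= NUM_OBJECTS)
		return -1;
	return index;
}
```

Note that an object's left guard page falls into the slot of the previous
object, which is why a fault on a guard page can belong to either neighbour.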

Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval, a
guarded allocation from the KFENCE object pool is returned to the main
allocator (SLAB or SLUB). At this point, the timer is reset, and the
next allocation is set up after the expiration of the interval.

To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. To date, we have verified by running synthetic
benchmarks (sysbench I/O workloads) that a kernel compiled with KFENCE
is performance-neutral compared to the non-KFENCE baseline.
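
The sampling mechanism can be pictured with a small user-space analogue.
This is only a sketch: a C11 atomic flag stands in for the static branch
(which in the kernel is patched code rather than a memory load), and the
names gate_open, sample_timer_fired() and should_sample() are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the static key: true while the next allocation is sampled. */
static atomic_bool gate_open;

/* Called by the periodic timer each time the sample interval expires. */
static void sample_timer_fired(void)
{
	atomic_store(&gate_open, true);
}

/*
 * Called on every allocation. Returns true for at most one caller per
 * interval; all other callers fall through to the regular allocator.
 */
static bool should_sample(void)
{
	/* Fast path: the gate is almost always closed. */
	if (!atomic_load_explicit(&gate_open, memory_order_relaxed))
		return false;

	/* Slow path: close the gate; only the caller that flips it wins. */
	return atomic_exchange(&gate_open, false);
}
```

The exchange in the slow path means that even if several threads see the
gate open at once, only one of them gets the guarded allocation, and the
gate stays closed until the timer fires again.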

For more details, see Documentation/dev-tools/kfence.rst (added later in
the series).

Reviewed-by: Dmitry Vyukov <dvy...@google.com>

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

v3:
* Reports by SeongJae Park:
* Remove reference to Documentation/dev-tools/kfence.rst.
* Remove redundant braces.
* Use CONFIG_KFENCE_NUM_OBJECTS instead of ARRAY_SIZE(...).
* Align some comments.
* Add figure from Documentation/dev-tools/kfence.rst added later in
series to patch description.

---
 MAINTAINERS            |  11 +
 include/linux/kfence.h | 174 ++++++++++
 init/main.c            |   2 +
 lib/Kconfig.debug      |   1 +
 lib/Kconfig.kfence     |  63 ++++
 mm/Makefile            |   1 +
 mm/kfence/Makefile     |   3 +
 mm/kfence/core.c       | 733 +++++++++++++++++++++++++++++++++++++++++
 mm/kfence/kfence.h     | 102 ++++++
 mm/kfence/report.c     | 219 ++++++++++++
 10 files changed, 1309 insertions(+)
 create mode 100644 include/linux/kfence.h
 create mode 100644 lib/Kconfig.kfence
 create mode 100644 mm/kfence/Makefile
 create mode 100644 mm/kfence/core.c
 create mode 100644 mm/kfence/kfence.h
 create mode 100644 mm/kfence/report.c

diff --git a/MAINTAINERS b/MAINTAINERS
index b5cfab015bd6..863899ed9a29 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9673,6 +9673,17 @@ F: include/linux/keyctl.h
F: include/uapi/linux/keyctl.h
F: security/keys/

+KFENCE
+M: Alexander Potapenko <gli...@google.com>
+M: Marco Elver <el...@google.com>
+R: Dmitry Vyukov <dvy...@google.com>
+L: kasa...@googlegroups.com
+S: Maintained
+F: Documentation/dev-tools/kfence.rst
+F: include/linux/kfence.h
+F: lib/Kconfig.kfence
+F: mm/kfence/
+
KFIFO
M: Stefani Seibold <ste...@seibold.net>
S: Maintained
diff --git a/include/linux/kfence.h b/include/linux/kfence.h
new file mode 100644
index 000000000000..8128ba7b5e90
--- /dev/null
+++ b/include/linux/kfence.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_KFENCE_H
+#define _LINUX_KFENCE_H
+
+#include <linux/mm.h>
+#include <linux/percpu.h>
+#include <linux/static_key.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_KFENCE

+
+/*

+ return true;
+}
+

+/**
+ * kfence_handle_page_fault() - perform page fault handling for KFENCE pages
+ * @addr: faulting address
+ *
+ * Return:
+ * * false - address outside KFENCE pool,
+ * * true - page fault handled by KFENCE, no additional handling required.
+ *
+ * A page fault inside KFENCE pool indicates a memory error, such as an
+ * out-of-bounds access, a use-after-free or an invalid memory access. In these
+ * cases KFENCE prints an error message and marks the offending page as
+ * present, so that the kernel can proceed.
+ */
+bool __must_check kfence_handle_page_fault(unsigned long addr);
+

+#else /* CONFIG_KFENCE */
+

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
new file mode 100644
index 000000000000..4c2ea1c722de
--- /dev/null
+++ b/lib/Kconfig.kfence
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config HAVE_ARCH_KFENCE
+ bool
+
+config HAVE_ARCH_KFENCE_STATIC_POOL
+ bool
+ help
+ If the architecture supports using the static pool.
+
+menuconfig KFENCE
+ bool "KFENCE: low-overhead sampling-based memory safety error detector"
+ depends on HAVE_ARCH_KFENCE && !KASAN && (SLAB || SLUB)
+ depends on JUMP_LABEL # To ensure performance, require jump labels
+ select STACKTRACE
+ help
+ KFENCE is a low-overhead sampling-based detector for heap out-of-bounds
+ access, use-after-free, and invalid-free errors. KFENCE is designed
+ to have negligible cost to permit enabling it in production
+ environments.
+

+ Note that KFENCE is not a substitute for explicit testing with tools
+ such as KASAN. KFENCE can detect a subset of bugs that KASAN can

new file mode 100644
index 000000000000..d991e9a349f0
--- /dev/null
+++ b/mm/kfence/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_KFENCE) := core.o report.o
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
new file mode 100644
index 000000000000..4af407837830
--- /dev/null
+++ b/mm/kfence/core.c
@@ -0,0 +1,733 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "kfence: " fmt
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/debugfs.h>
+#include <linux/kcsan-checks.h>
+#include <linux/kfence.h>
+#include <linux/list.h>
+#include <linux/lockdep.h>
+#include <linux/moduleparam.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/string.h>
+
+#include <asm/kfence.h>
+
+#include "kfence.h"
+
+/* Disables KFENCE on the first warning assuming an irrecoverable error. */
+#define KFENCE_WARN_ON(cond) \
+ ({ \
+ const bool __cond = WARN_ON(cond); \
+ if (unlikely(__cond)) \
+ WRITE_ONCE(kfence_enabled, false); \
+ __cond; \
+ })
+
+#ifndef CONFIG_KFENCE_STRESS_TEST_FAULTS /* Only defined with CONFIG_EXPERT. */
+#define CONFIG_KFENCE_STRESS_TEST_FAULTS 0
+#endif
+
+/* === Data ================================================================= */
+
+static unsigned long kfence_sample_interval __read_mostly = CONFIG_KFENCE_SAMPLE_INTERVAL;
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+#define MODULE_PARAM_PREFIX "kfence."
+module_param_named(sample_interval, kfence_sample_interval, ulong, 0600);
+
+static bool kfence_enabled __read_mostly;

+
+/*

+
+static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
+{
+ long index;
+
+ /* The checks do not affect performance; only called from slow-paths. */
+
+ if (!is_kfence_address((void *)addr))
+ return NULL;
+
+ /*
+ * May be an invalid index if called with an address at the edge of
+ * __kfence_pool, in which case we would report an "invalid access"
+ * error.
+ */
+ index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1;
+ if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
+ return NULL;
+
+ return &kfence_metadata[index];
+}
+
+static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *meta)
+{
+ unsigned long offset = (meta - kfence_metadata + 1) * PAGE_SIZE * 2;
+ unsigned long pageaddr = (unsigned long)&__kfence_pool[offset];
+
+ /* The checks do not affect performance; only called from slow-paths. */
+
+ /* Only call with a pointer into kfence_metadata. */
+ if (KFENCE_WARN_ON(meta < kfence_metadata ||
+ meta >= kfence_metadata + CONFIG_KFENCE_NUM_OBJECTS))
+ return 0;
+
+ /*
+ * This metadata object only ever maps to 1 page; verify that the
+ * calculation is correct and that the stored address was not corrupted.
+ */
+ if (KFENCE_WARN_ON(ALIGN_DOWN(meta->addr, PAGE_SIZE) != pageaddr))
+ return 0;
+
+ return pageaddr;
+}

+
+/*
+ * Update the object's metadata state, including updating the alloc/free stacks
+ * depending on the state transition.
+ */
+static noinline void metadata_update_state(struct kfence_metadata *meta,
+ enum kfence_object_state next)
+{
+ unsigned long *entries = next == KFENCE_OBJECT_FREED ? meta->free_stack : meta->alloc_stack;
+ /*
+ * Skip over 1 (this) function; noinline ensures we do not accidentally
+ * skip over the caller by never inlining.
+ */
+ const int nentries = stack_trace_save(entries, KFENCE_STACK_DEPTH, 1);
+
+ lockdep_assert_held(&meta->lock);
+
+ if (next == KFENCE_OBJECT_FREED)
+ meta->num_free_stack = nentries;
+ else
+ meta->num_alloc_stack = nentries;
+
+ /*
+ * Pairs with READ_ONCE() in
+ * kfence_shutdown_cache(),
+ * kfence_handle_page_fault().
+ */
+ WRITE_ONCE(meta->state, next);
+}
+
+/* Write canary byte to @addr. */
+static inline bool set_canary_byte(u8 *addr)
+{
+ *addr = KFENCE_CANARY_PATTERN(addr);
+ return true;
+}
+
+/* Check canary byte at @addr. */
+static inline bool check_canary_byte(u8 *addr)
+{
+ if (*addr == KFENCE_CANARY_PATTERN(addr))
+ return true;
+
+ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
+ kfence_report_error((unsigned long)addr, addr_to_metadata((unsigned long)addr),
+ KFENCE_ERROR_CORRUPTION);
+ return false;
+}
+
+static inline void for_each_canary(const struct kfence_metadata *meta, bool (*fn)(u8 *))
+{
+ unsigned long addr;
+
+ lockdep_assert_held(&meta->lock);
+
+ for (addr = ALIGN_DOWN(meta->addr, PAGE_SIZE); addr < meta->addr; addr++) {
+ if (!fn((u8 *)addr))
+ break;
+ }
+
+ for (addr = meta->addr + meta->size; addr < PAGE_ALIGN(meta->addr); addr++) {
+ if (!fn((u8 *)addr))
+ break;
+ }
+}
+
+static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp)
+{
+ struct kfence_metadata *meta = NULL;
+ unsigned long flags;
+ void *addr;
+
+ /* Try to obtain a free object. */
+ raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+ if (!list_empty(&kfence_freelist)) {
+ meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
+ list_del_init(&meta->list);
+ }
+ raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+ if (!meta)
+ return NULL;
+
+ if (unlikely(!raw_spin_trylock_irqsave(&meta->lock, flags))) {
+ /*
+ * This is extremely unlikely -- we are reporting on a
+ * use-after-free, which locked meta->lock, and the reporting
+ * code via printk calls kmalloc() which ends up in
+ * kfence_alloc() and tries to grab the same object that we're
+ * reporting on. While it has never been observed, lockdep does
+ * report that there is a possibility of deadlock. Fix it by
+ * using trylock and bailing out gracefully.
+ */
+ raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+ /* Put the object back on the freelist. */
+ list_add_tail(&meta->list, &kfence_freelist);
+ raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+
+ return NULL;
+ }
+
+ meta->addr = metadata_to_pageaddr(meta);
+ /* Unprotect if we're reusing this page. */
+ if (meta->state == KFENCE_OBJECT_FREED)
+ kfence_unprotect(meta->addr);
+
+ /*
+ * Note: for allocations made before RNG initialization,
+ * prandom_u32_max() will always return zero. We still benefit from
+ * enabling KFENCE as early as possible, even when the RNG is not yet
+ * available, as this will allow KFENCE to detect bugs due to earlier
+ * allocations. The only downside is that the out-of-bounds accesses
+ * detected are deterministic for such allocations.
+ */
+ if (prandom_u32_max(2)) {
+ /* Allocate on the "right" side, re-calculate address. */
+ meta->addr += PAGE_SIZE - size;
+ meta->addr = ALIGN_DOWN(meta->addr, cache->align);
+ }
+
+ /* Update remaining metadata. */
+ metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED);
+ /* Pairs with READ_ONCE() in kfence_shutdown_cache(). */
+ WRITE_ONCE(meta->cache, cache);
+ meta->size = size;
+ for_each_canary(meta, set_canary_byte);
+ virt_to_page(meta->addr)->slab_cache = cache;
+
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+ /* Memory initialization. */
+
+ /*
+ * We check slab_want_init_on_alloc() ourselves, rather than letting
+ * SL*B do the initialization, as otherwise we might overwrite KFENCE's
+ * redzone.
+ */
+ addr = (void *)meta->addr;
+ if (unlikely(slab_want_init_on_alloc(gfp, cache)))
+ memzero_explicit(addr, size);
+ if (cache->ctor)
+ cache->ctor(addr);
+
+ if (CONFIG_KFENCE_STRESS_TEST_FAULTS && !prandom_u32_max(CONFIG_KFENCE_STRESS_TEST_FAULTS))
+ kfence_protect(meta->addr); /* Random "faults" by protecting the object. */
+
+ atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCATED]);
+ atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]);
+
+ return addr;
+}
+
+static void kfence_guarded_free(void *addr, struct kfence_metadata *meta)
+{
+ struct kcsan_scoped_access assert_page_exclusive;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+
+ if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
+ /* Invalid or double-free, bail out. */
+ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
+ kfence_report_error((unsigned long)addr, meta, KFENCE_ERROR_INVALID_FREE);
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+ return;
+ }
+
+ /* Detect racy use-after-free, or incorrect reallocation of this page by KFENCE. */
+ kcsan_begin_scoped_access((void *)ALIGN_DOWN((unsigned long)addr, PAGE_SIZE), PAGE_SIZE,
+ KCSAN_ACCESS_SCOPED | KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT,
+ &assert_page_exclusive);
+
+ if (CONFIG_KFENCE_STRESS_TEST_FAULTS)
+ kfence_unprotect((unsigned long)addr); /* To check canary bytes. */
+
+ /* Restore page protection if there was an OOB access. */
+ if (meta->unprotected_page) {
+ kfence_protect(meta->unprotected_page);
+ meta->unprotected_page = 0;
+ }
+
+ /* Check canary bytes for memory corruption. */
+ for_each_canary(meta, check_canary_byte);

+
+ /*

+{
+ unsigned long addr;
+ struct page *pages;
+ int i;
+
+ if (!arch_kfence_initialize_pool())
+ return false;
+
+ addr = (unsigned long)__kfence_pool;
+ pages = virt_to_page(addr);
+
+ /*
+ * Set up object pages: they must have PG_slab set, to avoid freeing
+ * these as real pages.
+ *
+ * We also want to avoid inserting kfence_free() in the kfree()
+ * fast-path in SLUB, and therefore need to ensure kfree() correctly
+ * enters __slab_free() slow-path.
+ */
+ for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
+ if (!i || (i % 2))
+ continue;
+
+ __SetPageSlab(&pages[i]);
+ }
+
+ /*
+ * Protect the first 2 pages. The first page is mostly unnecessary, and
+ * merely serves as an extended guard page. However, adding one
+ * additional page in the beginning gives us an even number of pages,
+ * which simplifies the mapping of address to metadata index.
+ */
+ for (i = 0; i < 2; i++) {
+ if (unlikely(!kfence_protect(addr)))
+ return false;
+
+ addr += PAGE_SIZE;
+ }
+
+ for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
+ struct kfence_metadata *meta = &kfence_metadata[i];
+
+ /* Initialize metadata. */
+ INIT_LIST_HEAD(&meta->list);
+ raw_spin_lock_init(&meta->lock);
+ meta->state = KFENCE_OBJECT_UNUSED;
+ meta->addr = addr; /* Initialize for validation in metadata_to_pageaddr(). */
+ list_add_tail(&meta->list, &kfence_freelist);
+
+ /* Protect the right redzone. */
+ if (unlikely(!kfence_protect(addr + PAGE_SIZE)))
+ return false;
+
+ addr += 2 * PAGE_SIZE;
+ }
+
+ return true;
+}
+
+/* === DebugFS Interface ==================================================== */
+
+static int stats_show(struct seq_file *seq, void *v)
+{
+ int i;
+
+ seq_printf(seq, "enabled: %i\n", READ_ONCE(kfence_enabled));
+ for (i = 0; i < KFENCE_COUNTER_COUNT; i++)
+ seq_printf(seq, "%s: %ld\n", counter_names[i], atomic_long_read(&counters[i]));
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(stats);
+
+/*
+ * debugfs seq_file operations for /sys/kernel/debug/kfence/objects.
+ * start_object() and next_object() return the object index + 1, because NULL is used
+ * to stop iteration.
+ */
+static void *start_object(struct seq_file *seq, loff_t *pos)
+{
+ if (*pos < CONFIG_KFENCE_NUM_OBJECTS)
+ return (void *)((long)*pos + 1);
+ return NULL;
+}
+
+static void stop_object(struct seq_file *seq, void *v)
+{
+}
+
+static void *next_object(struct seq_file *seq, void *v, loff_t *pos)
+{
+ ++*pos;
+ if (*pos < CONFIG_KFENCE_NUM_OBJECTS)
+ return (void *)((long)*pos + 1);
+ return NULL;
+}
+
+static int show_object(struct seq_file *seq, void *v)
+{
+ struct kfence_metadata *meta = &kfence_metadata[(long)v - 1];
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+ kfence_print_object(seq, meta);
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+ seq_puts(seq, "---------------------------------\n");
+
+ return 0;
+}
+
+static const struct seq_operations object_seqops = {
+ .start = start_object,
+ .next = next_object,
+ .stop = stop_object,
+ .show = show_object,
+};
+
+static int open_objects(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &object_seqops);
+}
+
+static const struct file_operations objects_fops = {
+ .open = open_objects,
+ .read = seq_read,
+ .llseek = seq_lseek,
+};
+
+static int __init kfence_debugfs_init(void)
+{
+ struct dentry *kfence_dir = debugfs_create_dir("kfence", NULL);
+
+ debugfs_create_file("stats", 0444, kfence_dir, NULL, &stats_fops);
+ debugfs_create_file("objects", 0400, kfence_dir, NULL, &objects_fops);
+ return 0;
+}
+
+late_initcall(kfence_debugfs_init);
+
+/* === Allocation Gate Timer ================================================ */
+
+/*
+ * Set up delayed work, which will enable and disable the static key. We need to
+ * use a work queue (rather than a simple timer), since enabling and disabling a
+ * static key cannot be done from an interrupt.
+ */
+static struct delayed_work kfence_timer;
+static void toggle_allocation_gate(struct work_struct *work)
+{
+ if (!READ_ONCE(kfence_enabled))
+ return;
+
+ /* Enable static key, and await allocation to happen. */
+ atomic_set(&allocation_gate, 0);
+ static_branch_enable(&kfence_allocation_key);
+ wait_event(allocation_wait, atomic_read(&allocation_gate) != 0);
+
+ /* Disable static key and reset timer. */
+ static_branch_disable(&kfence_allocation_key);
+ schedule_delayed_work(&kfence_timer, msecs_to_jiffies(kfence_sample_interval));
+}
+static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);
+
+/* === Public interface ===================================================== */
+
+void __init kfence_init(void)
+{
+ /* Setting kfence_sample_interval to 0 on boot disables KFENCE. */
+ if (!kfence_sample_interval)
+ return;
+
+ if (!kfence_initialize_pool()) {
+ pr_err("%s failed\n", __func__);
+ return;
+ }
+
+ WRITE_ONCE(kfence_enabled, true);
+ schedule_delayed_work(&kfence_timer, 0);
+ pr_info("initialized - using %lu bytes for %d objects", KFENCE_POOL_SIZE,
+ CONFIG_KFENCE_NUM_OBJECTS);
+ if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
+ pr_cont(" at 0x%px-0x%px\n", (void *)__kfence_pool,
+ (void *)(__kfence_pool + KFENCE_POOL_SIZE));
+ else
+ pr_cont("\n");
+}
+
+bool kfence_shutdown_cache(struct kmem_cache *s)
+{
+ unsigned long flags;
+ struct kfence_metadata *meta;
+ int i;
+
+ for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
+ bool in_use;
+
+ meta = &kfence_metadata[i];
+
+ /*
+ * If we observe some inconsistent cache and state pair where we
+ * should have returned false here, cache destruction is racing
+ * with either kmem_cache_alloc() or kmem_cache_free(). Taking
+ * the lock will not help, as different critical section
+ * serialization will have the same outcome.
+ */
+ if (READ_ONCE(meta->cache) != s ||
+ READ_ONCE(meta->state) != KFENCE_OBJECT_ALLOCATED)
+ continue;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+ in_use = meta->cache == s && meta->state == KFENCE_OBJECT_ALLOCATED;
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+ if (in_use)
+ return false;
+ }
+
+ for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
+ meta = &kfence_metadata[i];
+
+ /* See above. */
+ if (READ_ONCE(meta->cache) != s || READ_ONCE(meta->state) != KFENCE_OBJECT_FREED)
+ continue;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+ if (meta->cache == s && meta->state == KFENCE_OBJECT_FREED)
+ meta->cache = NULL;
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+ }
+
+ return true;
+}
+
+void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
+{
+ /*
+ * allocation_gate only needs to become non-zero, so it doesn't make
+ * sense to continue writing to it and pay the associated contention
+ * cost, in case we have a large number of concurrent allocations.
+ */
+ if (atomic_read(&allocation_gate) || atomic_inc_return(&allocation_gate) > 1)
+ return NULL;
+ wake_up(&allocation_wait);
+
+ if (!READ_ONCE(kfence_enabled))
+ return NULL;
+
+ if (size > PAGE_SIZE)
+ return NULL;
+
+ return kfence_guarded_alloc(s, size, flags);
+}
+
+size_t kfence_ksize(const void *addr)
+{
+ const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ /*
+ * Read locklessly -- if there is a race with __kfence_alloc(), this is
+ * either a use-after-free or invalid access.
+ */
+ return meta ? meta->size : 0;
+}
+
+void *kfence_object_start(const void *addr)
+{
+ const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ /*
+ * Read locklessly -- if there is a race with __kfence_alloc(), this is
+ * either a use-after-free or invalid access.
+ */
+ return meta ? (void *)meta->addr : NULL;
+}
+
+void __kfence_free(void *addr)
+{
+ struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
+
+ /*
+ * If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
+ * the object, as the object page may be recycled for other-typed
+ * objects once it has been freed.
+ */
+ if (unlikely(meta->cache->flags & SLAB_TYPESAFE_BY_RCU))
+ call_rcu(&meta->rcu_head, rcu_guarded_free);
+ else
+ kfence_guarded_free(addr, meta);
+}
+
+bool kfence_handle_page_fault(unsigned long addr)
+{
+ const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE;
+ struct kfence_metadata *to_report = NULL;
+ enum kfence_error_type error_type;
+ unsigned long flags;

+
+ if (!is_kfence_address((void *)addr))

+
+ /*

diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
new file mode 100644
index 000000000000..2f606a3f58b6
--- /dev/null
+++ b/mm/kfence/kfence.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef MM_KFENCE_KFENCE_H
+#define MM_KFENCE_KFENCE_H
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+
+#include "../slab.h" /* for struct kmem_cache */
+
+/* For non-debug builds, avoid leaking kernel pointers into dmesg. */
+#ifdef CONFIG_DEBUG_KERNEL
+#define PTR_FMT "%px"
+#else
+#define PTR_FMT "%p"
+#endif
+
+/*
+ * Get the canary byte pattern for @addr. Use a pattern that varies based on the
+ * lower 3 bits of the address, to detect memory corruptions with higher
+ * probability, where similar constants are used.
+ */
+#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))
+
+/* Maximum stack depth for reports. */
+#define KFENCE_STACK_DEPTH 64
+
+/* KFENCE object states. */
+enum kfence_object_state {
+ KFENCE_OBJECT_UNUSED, /* Object is unused. */
+ KFENCE_OBJECT_ALLOCATED, /* Object is currently allocated. */
+ KFENCE_OBJECT_FREED, /* Object was allocated, and then freed. */
+};
+
+/* KFENCE metadata per guarded allocation. */
+struct kfence_metadata {
+ struct list_head list; /* Freelist node; access under kfence_freelist_lock. */
+ struct rcu_head rcu_head; /* For delayed freeing. */
+
+ /*
+ * Lock protecting below data; to ensure consistency of the below data,
+ * since the following may execute concurrently: __kfence_alloc(),
+ * __kfence_free(), kfence_handle_page_fault(). However, note that we
+ * cannot grab the same metadata off the freelist twice, and multiple
+ * __kfence_alloc() cannot run concurrently on the same metadata.
+ */
+ raw_spinlock_t lock;
+
+ /* The current state of the object; see above. */
+ enum kfence_object_state state;
+
+ /*
+ * Allocated object address; cannot be calculated from size, because of
+ * alignment requirements.
+ *
+ * Invariant: ALIGN_DOWN(addr, PAGE_SIZE) is constant.
+ */
+ unsigned long addr;
+
+ /*
+ * The size of the original allocation.
+ */
+ size_t size;
+
+ /*
+ * The kmem_cache cache of the last allocation; NULL if never allocated
+ * or the cache has already been destroyed.
+ */
+ struct kmem_cache *cache;
+
+ /*
+ * In case of an invalid access, the page that was unprotected; we
+ * optimistically only store one address.
+ */
+ unsigned long unprotected_page;
+
+ /* Allocation and free stack information. */
+ int num_alloc_stack;
+ int num_free_stack;
+ unsigned long alloc_stack[KFENCE_STACK_DEPTH];
+ unsigned long free_stack[KFENCE_STACK_DEPTH];
+};
+
+extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
+
+/* KFENCE error types for report generation. */
+enum kfence_error_type {
+ KFENCE_ERROR_OOB, /* Detected an out-of-bounds access. */
+ KFENCE_ERROR_UAF, /* Detected a use-after-free access. */
+ KFENCE_ERROR_CORRUPTION, /* Detected a memory corruption on free. */
+ KFENCE_ERROR_INVALID, /* Invalid access of unknown type. */
+ KFENCE_ERROR_INVALID_FREE, /* Invalid free. */
+};
+
+void kfence_report_error(unsigned long address, const struct kfence_metadata *meta,
+ enum kfence_error_type type);
+
+void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta);
+
+#endif /* MM_KFENCE_KFENCE_H */
diff --git a/mm/kfence/report.c b/mm/kfence/report.c
new file mode 100644
index 000000000000..0375867e85b3
--- /dev/null
+++ b/mm/kfence/report.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdarg.h>
+
+#include <linux/kernel.h>
+#include <linux/lockdep.h>
+#include <linux/printk.h>
+#include <linux/seq_file.h>
+#include <linux/stacktrace.h>
+#include <linux/string.h>
+
+#include <asm/kfence.h>
+
+#include "kfence.h"
+
+/* Helper function to either print to a seq_file or to console. */
+__printf(2, 3)
+static void seq_con_printf(struct seq_file *seq, const char *fmt, ...)
+{
+ va_list args;
+
+ va_start(args, fmt);
+ if (seq)
+ seq_vprintf(seq, fmt, args);
+ else
+ vprintk(fmt, args);
+ va_end(args);
+}
+
+/*
+ * Get the number of stack entries to skip to get out of MM internals. @type is
+ * optional, and if set to NULL, assumes an allocation or free stack.
+ */
+static int get_stack_skipnr(const unsigned long stack_entries[], int num_entries,
+ const enum kfence_error_type *type)
+{
+ char buf[64];
+ int skipnr, fallback = 0;
+ bool is_access_fault = false;
+
+ if (type) {
+ /* Depending on error type, find different stack entries. */
+ switch (*type) {
+ case KFENCE_ERROR_UAF:
+ case KFENCE_ERROR_OOB:
+ case KFENCE_ERROR_INVALID:
+ is_access_fault = true;
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ case KFENCE_ERROR_INVALID_FREE:
+ break;
+ }
+ }
+
+ for (skipnr = 0; skipnr < num_entries; skipnr++) {
+ int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]);
+
+ if (is_access_fault) {
+ if (!strncmp(buf, KFENCE_SKIP_ARCH_FAULT_HANDLER, len))
+ goto found;
+ } else {
+ if (str_has_prefix(buf, "kfence_") || str_has_prefix(buf, "__kfence_"))
+ fallback = skipnr + 1; /* In case of tail calls into kfence. */
+
+ /* Also match the *_bulk() variants by only checking prefixes. */
+ if (str_has_prefix(buf, "kfree") ||
+ str_has_prefix(buf, "kmem_cache_free") ||
+ str_has_prefix(buf, "__kmalloc") ||
+ str_has_prefix(buf, "kmem_cache_alloc"))
+ goto found;
+ }
+ }
+ if (fallback < num_entries)
+ return fallback;
+found:
+ skipnr++;
+ return skipnr < num_entries ? skipnr : 0;
+}
+
+static void kfence_print_stack(struct seq_file *seq, const struct kfence_metadata *meta,
+ bool show_alloc)
+{
+ const unsigned long *entries = show_alloc ? meta->alloc_stack : meta->free_stack;
+ const int nentries = show_alloc ? meta->num_alloc_stack : meta->num_free_stack;
+
+ if (nentries) {
+ /* Skip allocation/free internals stack. */
+ int i = get_stack_skipnr(entries, nentries, NULL);
+
+ /* stack_trace_seq_print() does not exist; open code our own. */
+ for (; i < nentries; i++)
+ seq_con_printf(seq, " %pS\n", (void *)entries[i]);
+ } else {
+ seq_con_printf(seq, " no %s stack\n", show_alloc ? "allocation" : "deallocation");
+ }
+}
+
+void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta)
+{
+ const int size = abs(meta->size);
+ const unsigned long start = meta->addr;
+ const struct kmem_cache *const cache = meta->cache;
+
+ lockdep_assert_held(&meta->lock);
+
+ if (meta->state == KFENCE_OBJECT_UNUSED) {
+ seq_con_printf(seq, "kfence-#%zd unused\n", meta - kfence_metadata);
+ return;
+ }
+
+ seq_con_printf(seq,
+ "kfence-#%zd [0x" PTR_FMT "-0x" PTR_FMT
+ ", size=%d, cache=%s] allocated in:\n",
+ meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size,
+ (cache && cache->name) ? cache->name : "<destroyed>");
+ kfence_print_stack(seq, meta, true);
+
+ if (meta->state == KFENCE_OBJECT_FREED) {
+ seq_con_printf(seq, "\nfreed in:\n");
+ kfence_print_stack(seq, meta, false);
+ }
+}
+
+/*
+ * Show bytes at @addr that are different from the expected canary values, up to
+ * @max_bytes.
+ */
+static void print_diff_canary(const u8 *addr, size_t max_bytes)
+{
+ const u8 *max_addr = min((const u8 *)PAGE_ALIGN((unsigned long)addr), addr + max_bytes);
+
+ pr_cont("[");
+ for (; addr < max_addr; addr++) {
+ if (*addr == KFENCE_CANARY_PATTERN(addr))
+ pr_cont(" .");
+ else if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
+ pr_cont(" 0x%02x", *addr);
+ else /* Do not leak kernel memory in non-debug builds. */
+ pr_cont(" !");
+ }
+ pr_cont(" ]");
+}
+
+void kfence_report_error(unsigned long address, const struct kfence_metadata *meta,
+ enum kfence_error_type type)
+{
+ unsigned long stack_entries[KFENCE_STACK_DEPTH] = { 0 };
+ int num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 1);
+ int skipnr = get_stack_skipnr(stack_entries, num_stack_entries, &type);
+ const ptrdiff_t object_index = meta ? meta - kfence_metadata : -1;
+
+ /* Require non-NULL meta, except if KFENCE_ERROR_INVALID. */
+ if (WARN_ON(type != KFENCE_ERROR_INVALID && !meta))
+ return;
+
+ if (meta)
+ lockdep_assert_held(&meta->lock);
+ /*
+ * Because we may generate reports in printk-unfriendly parts of the
+ * kernel, such as scheduler code, the use of printk() could deadlock.
+ * Until such time that all printing code here is safe in all parts of
+ * the kernel, accept the risk, and just get our message out (given the
+ * system might already behave unpredictably due to the memory error).
+ * As such, also disable lockdep to hide warnings, and avoid disabling
+ * lockdep for the rest of the kernel.
+ */
+ lockdep_off();
+
+ pr_err("==================================================================\n");
+ /* Print report header. */
+ switch (type) {
+ case KFENCE_ERROR_OOB:
+ pr_err("BUG: KFENCE: out-of-bounds in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Out-of-bounds access at 0x" PTR_FMT " (%s of kfence-#%zd):\n",
+ (void *)address, address < meta->addr ? "left" : "right", object_index);
+ break;
+ case KFENCE_ERROR_UAF:
+ pr_err("BUG: KFENCE: use-after-free in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Use-after-free access at 0x" PTR_FMT " (in kfence-#%zd):\n",
+ (void *)address, object_index);
+ break;
+ case KFENCE_ERROR_CORRUPTION:
+ pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Corrupted memory at 0x" PTR_FMT " ", (void *)address);
+ print_diff_canary((u8 *)address, 16);
+ pr_cont(" (in kfence-#%zd):\n", object_index);
+ break;
+ case KFENCE_ERROR_INVALID:
+ pr_err("BUG: KFENCE: invalid access in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Invalid access at 0x" PTR_FMT ":\n", (void *)address);
+ break;
+ case KFENCE_ERROR_INVALID_FREE:
+ pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]);
+ pr_err("Invalid free of 0x" PTR_FMT " (in kfence-#%zd):\n", (void *)address,
+ object_index);
+ break;
+ }
+
+ /* Print stack trace and object info. */
+ stack_trace_print(stack_entries + skipnr, num_stack_entries - skipnr, 0);
+
+ if (meta) {
+ pr_err("\n");
+ kfence_print_object(NULL, meta);
+ }
+
+ /* Print report footer. */
+ pr_err("\n");
+ dump_stack_print_info(KERN_DEFAULT);
+ pr_err("==================================================================\n");
+
+ lockdep_on();
+
+ if (panic_on_warn)
+ panic("panic_on_warn set ...\n");
+
+ /* We encountered a memory unsafety error, taint the kernel! */
+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+}
--
2.28.0.681.g6f77f65b4e-goog
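
For reference, the address arithmetic in addr_to_metadata() and
metadata_to_pageaddr() above can be checked with a small userspace sketch. It
works on byte offsets from the start of the pool rather than on kernel
addresses. SKETCH_PAGE_SIZE (4096), SKETCH_NUM_OBJECTS (255) and all helper
names are assumptions chosen for illustration, not values or functions taken
from this patch.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE   4096UL	/* assumed page size */
#define SKETCH_NUM_OBJECTS 255L		/* assumed stand-in for CONFIG_KFENCE_NUM_OBJECTS */

/* Byte offset (from the pool start) of the object page for metadata @index. */
static unsigned long index_to_page_offset(long index)
{
	return (unsigned long)(index + 1) * SKETCH_PAGE_SIZE * 2;
}

/* Metadata index for a byte offset into the pool, or -1 if no object applies. */
static long offset_to_index(unsigned long offset)
{
	long index = (long)(offset / (SKETCH_PAGE_SIZE * 2)) - 1;

	if (index < 0 || index >= SKETCH_NUM_OBJECTS)
		return -1;
	return index;
}

/* Walks the layout: two leading guard pages, then object/guard page pairs. */
static void layout_self_check(void)
{
	long i;

	/* The two leading guard pages belong to no object. */
	assert(offset_to_index(0) == -1);
	assert(offset_to_index(SKETCH_PAGE_SIZE) == -1);

	/* Object 0 lives in page 2; its right guard (page 3) maps to it too. */
	assert(offset_to_index(2 * SKETCH_PAGE_SIZE) == 0);
	assert(offset_to_index(3 * SKETCH_PAGE_SIZE) == 0);

	/* Both directions of the mapping agree for every object. */
	for (i = 0; i < SKETCH_NUM_OBJECTS; i++)
		assert(offset_to_index(index_to_page_offset(i)) == i);
}
```

Because of the extra leading page, each object page and the guard page to its
right share one 2-page slot, which is why a single integer division is enough
to find the metadata entry.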

Marco Elver

Sep 21, 2020, 9:26:42 AM

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

From: Alexander Potapenko <gli...@google.com>

Add architecture-specific implementation details for KFENCE and enable
KFENCE for the x86 architecture. In particular, this implements the
required interface in <asm/kfence.h> for setting up the pool and
providing helper functions for protecting and unprotecting pages.

For x86, we need to ensure that the pool uses 4K pages, which is done
using the set_memory_4k() helper function.
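
The effect of the protect/unprotect helpers can be approximated in userspace
with mmap() and mprotect(). The sketch below is an analogy under stated
assumptions, not the x86 implementation from this patch (which clears and sets
_PAGE_PRESENT in the PTE). It assumes Linux, where touching a PROT_NONE page
raises SIGSEGV; all function names are hypothetical. Calling mprotect() from a
signal handler is not guaranteed by POSIX, although it is common practice on
Linux.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static char *pool;				/* 3 pages: guard | object | guard */
static size_t page_size;
static volatile sig_atomic_t fault_count;	/* number of guard-page hits */

/* Count the fault, then unprotect the page so the access can be retried. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;

	(void)sig;
	(void)ctx;
	if (addr < (uintptr_t)pool || addr >= (uintptr_t)pool + 3 * page_size)
		_exit(1);			/* not one of our pages: give up */

	fault_count++;
	mprotect((void *)(addr & ~((uintptr_t)page_size - 1)), page_size,
		 PROT_READ | PROT_WRITE);
}

/* Map the pool, install the handler, and protect both guard pages. */
static bool guard_pool_init(void)
{
	struct sigaction sa = { 0 };

	page_size = (size_t)sysconf(_SC_PAGESIZE);
	pool = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (pool == MAP_FAILED)
		return false;

	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL))
		return false;

	return mprotect(pool, page_size, PROT_NONE) == 0 &&
	       mprotect(pool + 2 * page_size, page_size, PROT_NONE) == 0;
}

/* Place an object of @size bytes flush against the right guard page. */
static char *guarded_alloc_right(size_t size)
{
	return pool + 2 * page_size - size;	/* alignment ignored in this sketch */
}

/* Usage example: an in-bounds write is silent, a 1-byte overflow is caught. */
static void guard_self_check(void)
{
	volatile char *obj;

	assert(guard_pool_init());
	obj = guarded_alloc_right(16);
	obj[15] = 1;				/* last valid byte: no fault */
	assert(fault_count == 0);
	obj[16] = 2;				/* first byte of the right guard page */
	assert(fault_count == 1);
}
```

As with kfence_handle_page_fault(), the handler makes the offending page
accessible again so that execution can continue after the error is recorded.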

Reviewed-by: Dmitry Vyukov <dvy...@google.com>
Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

arch/x86/Kconfig | 2 ++
arch/x86/include/asm/kfence.h | 60 +++++++++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 4 +++
3 files changed, 66 insertions(+)
create mode 100644 arch/x86/include/asm/kfence.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..e22dc722698c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -144,6 +144,8 @@ config X86
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if X86_64
select HAVE_ARCH_KASAN_VMALLOC if X86_64
+ select HAVE_ARCH_KFENCE
+ select HAVE_ARCH_KFENCE_STATIC_POOL
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
new file mode 100644
index 000000000000..cf09e377faf9
--- /dev/null
+++ b/arch/x86/include/asm/kfence.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _ASM_X86_KFENCE_H
+#define _ASM_X86_KFENCE_H
+
+#include <linux/bug.h>
+#include <linux/kfence.h>
+
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+#include <asm/set_memory.h>
+#include <asm/tlbflush.h>
+
+/* The alignment should be at least a 4K page. */
+#define KFENCE_POOL_ALIGNMENT PAGE_SIZE

+
+/*

+ * The page fault handler entry function, up to which the stack trace is
+ * truncated in reports.
+ */
+#define KFENCE_SKIP_ARCH_FAULT_HANDLER "asm_exc_page_fault"
+
+/* Force 4K pages for __kfence_pool. */
+static inline bool arch_kfence_initialize_pool(void)

+{
+ unsigned long addr;
+

+ for (addr = (unsigned long)__kfence_pool; is_kfence_address((void *)addr);
+ addr += PAGE_SIZE) {
+ unsigned int level;
+
+ if (!lookup_address(addr, &level))

+ return false;
+

+ if (level != PG_LEVEL_4K)
+ set_memory_4k(addr, 1);

+ }
+
+ return true;
+}
+

+/* Protect the given page and flush TLBs. */

+static inline bool kfence_protect_page(unsigned long addr, bool protect)
+{

+ unsigned int level;
+ pte_t *pte = lookup_address(addr, &level);
+
+ if (!pte || level != PG_LEVEL_4K)

+ return false;
+

+ if (protect)
+ set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
+ else
+ set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
+
+ flush_tlb_one_kernel(addr);

+ return true;
+}
+

+#endif /* _ASM_X86_KFENCE_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6e3e8a124903..423e15ad5eb6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -9,6 +9,7 @@
#include <linux/kdebug.h> /* oops_begin/end, ... */
#include <linux/extable.h> /* search_exception_tables */
#include <linux/memblock.h> /* max_low_pfn */
+#include <linux/kfence.h> /* kfence_handle_page_fault */
#include <linux/kprobes.h> /* NOKPROBE_SYMBOL, ... */
#include <linux/mmiotrace.h> /* kmmio_handler, ... */
#include <linux/perf_event.h> /* perf_sw_event */
@@ -701,6 +702,9 @@ no_context(struct pt_regs *regs, unsigned long error_code,
}
#endif

+ if (kfence_handle_page_fault(address))
+ return;
+
/*
* 32-bit:
*
--

2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:26:44 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Add architecture specific implementation details for KFENCE and enable
KFENCE for the arm64 architecture. In particular, this implements the
required interface in <asm/kfence.h>. Currently, the arm64 version does
not yet use a statically allocated memory pool, at the cost of a pointer
load for each is_kfence_address().
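
Since the arm64 pool comes from alloc_pages(), which takes an allocation
order rather than a page count, the pool size has to be converted to an
order. The following standalone sketch mirrors the
ilog2(roundup_pow_of_two(...)) computation in userspace C. The sketch_
helpers are hypothetical reimplementations for illustration, and the
pool sizes used in the usage note are example values, not values taken
from this series.

```c
#include <stddef.h>

/* Floor of log2(n) for n >= 1 (userspace stand-in for ilog2()). */
static unsigned int sketch_ilog2(size_t n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Smallest power of two >= n (stand-in for roundup_pow_of_two()). */
static size_t sketch_roundup_pow_of_two(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Page allocation order needed to cover pool_size bytes. */
static unsigned int sketch_pool_order(size_t pool_size, size_t page_size)
{
	return sketch_ilog2(sketch_roundup_pow_of_two(pool_size / page_size));
}
```

For instance, a pool of 512 pages of 4096 bytes gives order 9, a pool of
510 pages is rounded up and also gives order 9, and a pool of 513 pages
gives order 10.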

Reviewed-by: Dmitry Vyukov <dvy...@google.com>

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---

For ARM64, we would like to solicit feedback on what the best option is
to obtain a constant address for __kfence_pool. One option is to declare
a memory range in the memory layout to be dedicated to KFENCE (like is
done for KASAN), however, it is unclear if this is the best available
option. We would like to avoid touching the memory layout.
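
To illustrate why a constant pool address matters, here is a minimal
userspace sketch of a range check in the spirit of is_kfence_address().
It is not the implementation from this series: the sketch_ names and the
pool size are assumptions made for the example. With a static array the
base address is known at link time, whereas with a pointer the base must
be loaded from memory on every check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_POOL_SIZE (512u * 4096u)	/* assumed example pool size */

/* Static pool: the base address is a link-time constant. */
static char sketch_static_pool[SKETCH_POOL_SIZE];

/* Dynamic pool: the base must be loaded from this variable on each check. */
static char *sketch_dynamic_pool;

/*
 * Single-comparison range check: if addr is below the pool, the unsigned
 * subtraction wraps around to a large value and the test fails.
 */
static bool sketch_in_pool(const char *pool, const void *addr)
{
	return pool && (uintptr_t)addr - (uintptr_t)pool < SKETCH_POOL_SIZE;
}

static bool sketch_is_static_pool_addr(const void *addr)
{
	return sketch_in_pool(sketch_static_pool, addr);
}

static bool sketch_is_dynamic_pool_addr(const void *addr)
{
	return sketch_in_pool(sketch_dynamic_pool, addr);
}
```

Both variants perform the same comparison; the only difference is where
the pool base comes from, which is the extra pointer load mentioned in
the commit message.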
---

arch/arm64/Kconfig | 1 +

arch/arm64/include/asm/kfence.h | 39 +++++++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 4 ++++
3 files changed, 44 insertions(+)
create mode 100644 arch/arm64/include/asm/kfence.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..1acc6b2877c3 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -132,6 +132,7 @@ config ARM64

select HAVE_ARCH_JUMP_LABEL_RELATIVE

select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN
+ select HAVE_ARCH_KFENCE if (!ARM64_16K_PAGES && !ARM64_64K_PAGES)
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h

new file mode 100644

index 000000000000..608dde80e5ca
--- /dev/null
+++ b/arch/arm64/include/asm/kfence.h

@@ -0,0 +1,39 @@

+/* SPDX-License-Identifier: GPL-2.0 */
+

+#ifndef __ASM_KFENCE_H
+#define __ASM_KFENCE_H
+
+#include <linux/kfence.h>
+#include <linux/log2.h>
+#include <linux/mm.h>
+
+#include <asm/cacheflush.h>
+
+#define KFENCE_SKIP_ARCH_FAULT_HANDLER "el1_sync"
+
+/*
+ * FIXME: Support HAVE_ARCH_KFENCE_STATIC_POOL: Use the statically allocated
+ * __kfence_pool, to avoid the extra pointer load for is_kfence_address(). By
+ * default, however, we do not have struct pages for static allocations.
+ */
+

+static inline bool arch_kfence_initialize_pool(void)
+{

+ const unsigned int num_pages = ilog2(roundup_pow_of_two(KFENCE_POOL_SIZE / PAGE_SIZE));
+ struct page *pages = alloc_pages(GFP_KERNEL, num_pages);
+
+ if (!pages)

+ return false;
+
+ __kfence_pool = page_address(pages);

+ return true;
+}
+

+static inline bool kfence_protect_page(unsigned long addr, bool protect)
+{

+ set_memory_valid(addr, 1, !protect);

+
+ return true;
+}
+

+#endif /* __ASM_KFENCE_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f07333e86c2f..d5b72ecbeeea 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -10,6 +10,7 @@
#include <linux/acpi.h>
#include <linux/bitfield.h>
#include <linux/extable.h>
+#include <linux/kfence.h>
#include <linux/signal.h>
#include <linux/mm.h>
#include <linux/hardirq.h>
@@ -310,6 +311,9 @@ static void __do_kernel_fault(unsigned long addr, unsigned int esr,
"Ignoring spurious kernel translation fault at virtual address %016lx\n", addr))
return;

+ if (kfence_handle_page_fault(addr))
+ return;
+
if (is_el1_permission_fault(addr, esr, regs)) {
if (esr & ESR_ELx_WNR)
msg = "write to read-only memory";
--

2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:26:48 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

From: Alexander Potapenko <gli...@google.com>

Inserts KFENCE hooks into the SLAB allocator.

To pass the originally requested size to KFENCE, add an argument
'orig_size' to slab_alloc*(). The additional argument is required to
preserve the requested original size for kmalloc() allocations, which
use size classes (e.g. an allocation of 272 bytes will return an object
of size 512). Therefore, kmem_cache::size does not represent the
kmalloc-caller's requested size, and we must introduce the argument
'orig_size' to propagate the originally requested size to KFENCE.

Without the originally requested size, we would not be able to detect
out-of-bounds accesses for objects placed at the end of a KFENCE object
page if the object's size is not equal to that of the kmalloc size class
it was bucketed into.
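
The effect can be worked through with a small standalone calculation.
This is a simplified sketch under stated assumptions: a 4K page,
power-of-two size classes only, and hypothetical sketch_ helper names
that are not part of the patch. It computes how many bytes past the end
of the requested object would remain accessible before the right guard
page is reached.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed 4K page */

/* Smallest power-of-two size class >= size (simplified model of kmalloc). */
static size_t sketch_size_class(size_t size)
{
	size_t c = 8;

	while (c < size)
		c <<= 1;
	return c;
}

/*
 * Offset within the page of an object of placed_size bytes that is put as
 * far right as the alignment allows. align must be a power of two.
 */
static size_t sketch_right_offset(size_t placed_size, size_t align)
{
	return (SKETCH_PAGE_SIZE - placed_size) & ~(align - 1);
}

/*
 * Bytes between the end of the requested object and the right guard page.
 * Out-of-bounds accesses within this gap do not fault.
 */
static size_t sketch_undetected_gap(size_t requested, size_t placed_size,
				    size_t align)
{
	return SKETCH_PAGE_SIZE -
	       (sketch_right_offset(placed_size, align) + requested);
}
```

In this model, a 272-byte request placed using its 512-byte size class
leaves a 240-byte gap before the guard page, while placing it using the
originally requested 272 bytes (with 8-byte alignment) leaves no gap.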

When KFENCE is disabled, there is no additional overhead, since
slab_alloc*() functions are __always_inline.

Reviewed-by: Dmitry Vyukov <dvy...@google.com>

Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

v3:
* Rewrite patch description to clarify need for 'orig_size'
[reported by Christopher Lameter].

---
mm/slab.c | 46 ++++++++++++++++++++++++++++++++++------------
mm/slab_common.c | 6 +++++-
2 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 3160dff6fd76..30aba06ae02b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -100,6 +100,7 @@
#include <linux/seq_file.h>
#include <linux/notifier.h>

#include <linux/kallsyms.h>
+#include <linux/kfence.h>

#include <linux/cpu.h>
#include <linux/sysctl.h>
#include <linux/module.h>

@@ -3206,7 +3207,7 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
}

static __always_inline void *
-slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_size,
unsigned long caller)
{

+ kmemleak_free_recursive(objp, cachep->flags);
+ return;
+ }
+

+ bool is_kfence = is_kfence_address(ptr);

ptr = kasan_reset_tag(ptr);

/* Find and validate object. */
cachep = page->slab_cache;
- objnr = obj_to_index(cachep, page, (void *)ptr);
- BUG_ON(objnr >= cachep->num);
+ if (!is_kfence) {
+ objnr = obj_to_index(cachep, page, (void *)ptr);
+ BUG_ON(objnr >= cachep->num);
+ }

/* Find offset within object. */

- offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep);
+ if (is_kfence_address(ptr))

+ offset = ptr - kfence_object_start(ptr);
+ else

2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:26:51 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

From: Alexander Potapenko <gli...@google.com>

Inserts KFENCE hooks into the SLUB allocator.

To pass the originally requested size to KFENCE, add an argument
'orig_size' to slab_alloc*(). The additional argument is required to
preserve the requested original size for kmalloc() allocations, which
use size classes (e.g. an allocation of 272 bytes will return an object
of size 512). Therefore, kmem_cache::size does not represent the
kmalloc-caller's requested size, and we must introduce the argument
'orig_size' to propagate the originally requested size to KFENCE.

Without the originally requested size, we would not be able to detect
out-of-bounds accesses for objects placed at the end of a KFENCE object
page if the object's size is not equal to that of the kmalloc size class
it was bucketed into.

When KFENCE is disabled, there is no additional overhead, since
slab_alloc*() functions are __always_inline.

Reviewed-by: Dmitry Vyukov <dvy...@google.com>
Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---
v3:
* Rewrite patch description to clarify need for 'orig_size'
[reported by Christopher Lameter].
---

mm/slub.c | 72 ++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 53 insertions(+), 19 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d4177aecedf6..5c5a13a7857c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -27,6 +27,7 @@
#include <linux/ctype.h>
#include <linux/debugobjects.h>

#include <linux/kallsyms.h>
+#include <linux/kfence.h>

#include <linux/memory.h>
#include <linux/math64.h>
#include <linux/fault-inject.h>
@@ -1557,6 +1558,11 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
void *old_tail = *tail ? *tail : *head;
int rsize;

+ if (is_kfence_address(next)) {
+ slab_free_hook(s, next);

+ return true;
+ }
+

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
- void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+ void *ret = slab_alloc(s, gfpflags, _RET_IP_, s->object_size);

+ return size;
+ }
+

/* Start new detached freelist */
df->page = page;
set_freepointer(df->s, object, NULL);

@@ -3290,8 +3314,14 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
c = this_cpu_ptr(s->cpu_slab);

for (i = 0; i < size; i++) {

+ bool is_kfence = is_kfence_address(ptr);

ptr = kasan_reset_tag(ptr);

@@ -4048,10 +4079,13 @@ void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
to_user, 0, n);

/* Find offset within object. */

- offset = (ptr - page_address(page)) % s->size;
+ if (is_kfence)

+ offset = ptr - kfence_object_start(ptr);
+ else

+ offset = (ptr - page_address(page)) % s->size;

/* Adjust for redzone and reject if within the redzone. */
- if (kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
+ if (!is_kfence && kmem_cache_debug_flags(s, SLAB_RED_ZONE)) {
if (offset < s->red_left_pad)
usercopy_abort("SLUB object in left red zone",
s->name, to_user, offset, n);
@@ -4460,7 +4494,7 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc(s, gfpflags, caller);
+ ret = slab_alloc(s, gfpflags, caller, size);

/* Honor the call site pointer we received. */
trace_kmalloc(caller, ret, size, s->size, gfpflags);
@@ -4491,7 +4525,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

- ret = slab_alloc_node(s, gfpflags, node, caller);
+ ret = slab_alloc_node(s, gfpflags, node, caller, size);

/* Honor the call site pointer we received. */
trace_kmalloc_node(caller, ret, size, s->size, gfpflags, node);
--

2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:26:52 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

From: Alexander Potapenko <gli...@google.com>

We make KFENCE compatible with KASAN for testing KFENCE itself. In
particular, KASAN helps to catch any potential corruptions to KFENCE
state, or other corruptions that may be a result of freepointer
corruptions in the main allocators.

To indicate that the combination of the two is generally discouraged,
CONFIG_EXPERT=y should be set. It also gives us the nice property that
KFENCE will be build-tested by allyesconfig builds.

Reviewed-by: Dmitry Vyukov <dvy...@google.com>
Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

lib/Kconfig.kfence | 2 +-
mm/kasan/common.c | 7 +++++++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
index 4c2ea1c722de..6825c1c07a10 100644
--- a/lib/Kconfig.kfence
+++ b/lib/Kconfig.kfence

@@ -10,7 +10,7 @@ config HAVE_ARCH_KFENCE_STATIC_POOL

menuconfig KFENCE
bool "KFENCE: low-overhead sampling-based memory safety error detector"
- depends on HAVE_ARCH_KFENCE && !KASAN && (SLAB || SLUB)
+ depends on HAVE_ARCH_KFENCE && (!KASAN || EXPERT) && (SLAB || SLUB)

depends on JUMP_LABEL # To ensure performance, require jump labels

select STACKTRACE
help
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 950fd372a07e..f5c49f0fdeff 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -18,6 +18,7 @@
#include <linux/init.h>
#include <linux/kasan.h>
#include <linux/kernel.h>

+#include <linux/kfence.h>
#include <linux/kmemleak.h>

#include <linux/linkage.h>
#include <linux/memblock.h>
@@ -396,6 +397,9 @@ static bool __kasan_slab_free(struct kmem_cache *cache, void *object,
tagged_object = object;
object = reset_tag(object);

+ if (is_kfence_address(object))
+ return false;
+
if (unlikely(nearest_obj(cache, virt_to_head_page(object), object) !=
object)) {
kasan_report_invalid_free(tagged_object, ip);
@@ -444,6 +448,9 @@ static void *__kasan_kmalloc(struct kmem_cache *cache, const void *object,
if (unlikely(object == NULL))
return NULL;

+ if (is_kfence_address(object))
+ return (void *)object;
+
redzone_start = round_up((unsigned long)(object + size),
KASAN_SHADOW_SCALE_SIZE);
redzone_end = round_up((unsigned long)object + cache->object_size,
--

2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:26:55 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

From: Alexander Potapenko <gli...@google.com>

Add compatibility with KMEMLEAK, by making KMEMLEAK aware of the KFENCE
memory pool. This allows building debug kernels with both enabled, which
also helped in debugging KFENCE.

Reviewed-by: Dmitry Vyukov <dvy...@google.com>
Co-developed-by: Marco Elver <el...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
---

v2:
* Rework using delete_object_part() [suggested by Catalin Marinas].
---
mm/kmemleak.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5e252d91eb14..feff16068e8e 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -97,6 +97,7 @@
#include <linux/atomic.h>

#include <linux/kasan.h>

+#include <linux/kfence.h>
#include <linux/kmemleak.h>

#include <linux/memory_hotplug.h>

@@ -1948,6 +1949,11 @@ void __init kmemleak_init(void)
KMEMLEAK_GREY, GFP_ATOMIC);
create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
KMEMLEAK_GREY, GFP_ATOMIC);
+#if defined(CONFIG_KFENCE) && defined(CONFIG_HAVE_ARCH_KFENCE_STATIC_POOL)
+ /* KFENCE objects are located in .bss, which may confuse kmemleak. Skip them. */
+ delete_object_part((unsigned long)__kfence_pool, KFENCE_POOL_SIZE);
+#endif
+
/* only register .data..ro_after_init if not within .data */
if (&__start_ro_after_init < &_sdata || &__end_ro_after_init > &_edata)
create_object((unsigned long)__start_ro_after_init,
--

2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:27:00 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Add KFENCE documentation in dev-tools/kfence.rst, and add to index.

Reviewed-by: Dmitry Vyukov <dvy...@google.com>

Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---

v3:
* Re-introduce reference to Documentation/dev-tools/kfence.rst.

v2:
* Many clarifications based on comments from Andrey Konovalov.
* Document CONFIG_KFENCE_SAMPLE_INTERVAL=0 usage.
* Make use-cases between KASAN and KFENCE clearer.
* Be clearer about the fact the pool is fixed size.
* Update based on reporting changes.
* Explicitly mention max supported allocation size is PAGE_SIZE.
---

Documentation/dev-tools/index.rst | 1 +
Documentation/dev-tools/kfence.rst | 291 +++++++++++++++++++++++++++++
lib/Kconfig.kfence | 2 +
3 files changed, 294 insertions(+)

create mode 100644 Documentation/dev-tools/kfence.rst

diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index f7809c7b1ba9..1b1cf4f5c9d9 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -22,6 +22,7 @@ whole; patches welcome!
ubsan
kmemleak
kcsan
+ kfence
gdb-kernel-debugging
kgdb
kselftest
diff --git a/Documentation/dev-tools/kfence.rst b/Documentation/dev-tools/kfence.rst

new file mode 100644

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
index 6825c1c07a10..872bcbdd8cc4 100644
--- a/lib/Kconfig.kfence
+++ b/lib/Kconfig.kfence
@@ -19,6 +19,8 @@ menuconfig KFENCE

to have negligible cost to permit enabling it in production

environments.

+ See <file:Documentation/dev-tools/kfence.rst> for more details.

+
Note that, KFENCE is not a substitute for explicit testing with tools

such as KASAN. KFENCE can detect a subset of bugs that KASAN can

detect, albeit at very different performance profiles. If you can

--
2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:27:00 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Lockdep checks that dynamic key registration is only performed on keys
that are not static objects. With KFENCE, it is possible that such a
dynamically allocated key is a KFENCE object which may, however, be
allocated from a static memory pool (if HAVE_ARCH_KFENCE_STATIC_POOL).

Therefore, ignore KFENCE-allocated objects in static_obj().

Reviewed-by: Dmitry Vyukov <dvy...@google.com>
Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---

kernel/locking/lockdep.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 54b74fabf40c..0cf5d5ecbd31 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -38,6 +38,7 @@
#include <linux/seq_file.h>
#include <linux/spinlock.h>

#include <linux/kallsyms.h>
+#include <linux/kfence.h>

#include <linux/interrupt.h>
#include <linux/stacktrace.h>
#include <linux/debug_locks.h>
@@ -755,6 +756,13 @@ static int static_obj(const void *obj)
if (arch_is_kernel_initmem_freed(addr))
return 0;

+ /*

+ * KFENCE objects may be allocated from a static memory pool, but are
+ * not actually static objects.
+ */
+ if (is_kfence_address(obj))
+ return 0;
+
/*
* static variable?
*/
--

2.28.0.681.g6f77f65b4e-goog

Marco Elver

Sep 21, 2020, 9:27:03 AM9/21/20

to el...@google.com, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

Add KFENCE test suite, testing various error detection scenarios. Makes
use of KUnit for test organization. Since KFENCE's interface to obtain
error reports is via the console, the test verifies that KFENCE outputs
expected reports to the console.
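
The matching approach can be sketched in plain userspace C as follows.
This is an illustration rather than the test's actual code: the sketch_
function names and the sample console line in the usage note are made
up, and snprintf() stands in for the kernel's scnprintf(). The expected
title is built from the error kind and the symbol, the "+offset/size"
suffix is cut off because the exact offset will not match, and the
result is searched for in the observed console line.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Build "BUG: KFENCE: <kind> in <symbol>" and strip any "+offset..." suffix. */
static void sketch_expected_title(char *out, size_t len, const char *kind,
				  const char *symbol)
{
	char *plus;

	snprintf(out, len, "BUG: KFENCE: %s in %s", kind, symbol);

	/* The exact offset will not match; this also drops the module name. */
	plus = strchr(out, '+');
	if (plus)
		*plus = '\0';
}

/* Return true if the observed console line contains the expected title. */
static bool sketch_report_matches(const char *observed, const char *kind,
				  const char *symbol)
{
	char expect[256];

	sketch_expected_title(expect, sizeof(expect), kind, symbol);
	return strstr(observed, expect) != NULL;
}
```

For example, an observed line such as "BUG: KFENCE: out-of-bounds in
test_fn+0xa3/0x22b [kfence_test]" would match kind "out-of-bounds" with
symbol "test_fn+0x10/0x200 [kfence_test]", since only the text up to the
'+' is compared.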

Reviewed-by: Dmitry Vyukov <dvy...@google.com>
Co-developed-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Alexander Potapenko <gli...@google.com>
Signed-off-by: Marco Elver <el...@google.com>
---

v3:
* Lower line buffer size to avoid warnings of using more than 1024 bytes
stack usage [reported by kernel test robot <l...@intel.com>].

v2:
* Update for shortened memory corruption report.
---
lib/Kconfig.kfence | 13 +
mm/kfence/Makefile | 3 +
mm/kfence/kfence_test.c | 777 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 793 insertions(+)
create mode 100644 mm/kfence/kfence_test.c

diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
index 872bcbdd8cc4..46d9b6693abb 100644
--- a/lib/Kconfig.kfence
+++ b/lib/Kconfig.kfence

@@ -62,4 +62,17 @@ config KFENCE_STRESS_TEST_FAULTS

The option is only to test KFENCE; set to 0 if you are unsure.

+config KFENCE_KUNIT_TEST
+ tristate "KFENCE integration test suite" if !KUNIT_ALL_TESTS
+ default KUNIT_ALL_TESTS
+ depends on TRACEPOINTS && KUNIT
+ help
+ Test suite for KFENCE, testing various error detection scenarios with
+ various allocation types, and checking that reports are correctly
+ output to console.
+
+ Say Y here if you want the test to be built into the kernel and run
+ during boot; say M if you want the test to build as a module; say N

+ if you are unsure.
+

endif # KFENCE
diff --git a/mm/kfence/Makefile b/mm/kfence/Makefile
index d991e9a349f0..6872cd5e5390 100644
--- a/mm/kfence/Makefile
+++ b/mm/kfence/Makefile
@@ -1,3 +1,6 @@
# SPDX-License-Identifier: GPL-2.0

obj-$(CONFIG_KFENCE) := core.o report.o
+
+CFLAGS_kfence_test.o := -g -fno-omit-frame-pointer -fno-optimize-sibling-calls
+obj-$(CONFIG_KFENCE_KUNIT_TEST) += kfence_test.o
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c

new file mode 100644
index 000000000000..2ecd87668a74

+
+#include "kfence.h"
+

+/* Report as observed from console. */
+static struct {
+ spinlock_t lock;
+ int nlines;

+ char lines[2][256];

+} observed = {
+ .lock = __SPIN_LOCK_UNLOCKED(observed.lock),
+};
+
+/* Probe for console output: obtains observed lines of interest. */
+static void probe_console(void *ignore, const char *buf, size_t len)

+{
+ unsigned long flags;

+ int nlines;
+
+ spin_lock_irqsave(&observed.lock, flags);
+ nlines = observed.nlines;
+
+ if (strnstr(buf, "BUG: KFENCE: ", len) && strnstr(buf, "test_", len)) {

+ /*

+ unsigned long flags;

+ typeof(observed.lines) expect;
+ const char *end;
+ char *cur;
+
+ /* Double-checked locking. */
+ if (!report_available())

+ return false;
+

+ /* Generate expected report contents. */
+
+ /* Title */
+ cur = expect[0];
+ end = &expect[0][sizeof(expect[0]) - 1];
+ switch (r->type) {
+ case KFENCE_ERROR_OOB:
+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: out-of-bounds");

+ break;
+ case KFENCE_ERROR_UAF:

+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: use-after-free");

+ break;
+ case KFENCE_ERROR_CORRUPTION:

+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: memory corruption");

+ break;
+ case KFENCE_ERROR_INVALID:

+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid access");

+ break;
+ case KFENCE_ERROR_INVALID_FREE:

+ cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid free");

+ break;
+ }
+

+ scnprintf(cur, end - cur, " in %pS", r->fn);
+ /* The exact offset won't match, remove it; also strip module name. */
+ cur = strchr(expect[0], '+');
+ if (cur)
+ *cur = '\0';
+
+ /* Access information */
+ cur = expect[1];
+ end = &expect[1][sizeof(expect[1]) - 1];
+
+ switch (r->type) {
+ case KFENCE_ERROR_OOB:
+ cur += scnprintf(cur, end - cur, "Out-of-bounds access at");

+ break;
+ case KFENCE_ERROR_UAF:

+ cur += scnprintf(cur, end - cur, "Use-after-free access at");

+ break;
+ case KFENCE_ERROR_CORRUPTION:

+ cur += scnprintf(cur, end - cur, "Corrupted memory at");

+ break;
+ case KFENCE_ERROR_INVALID:

+ cur += scnprintf(cur, end - cur, "Invalid access at");

+ break;
+ case KFENCE_ERROR_INVALID_FREE:

+ cur += scnprintf(cur, end - cur, "Invalid free of");

+ break;
+ }
+

+ cur += scnprintf(cur, end - cur, " 0x" PTR_FMT, (void *)r->addr);
+
+ spin_lock_irqsave(&observed.lock, flags);
+ if (!report_available())
+ goto out; /* A new report is being captured. */
+
+ /* Finally match expected output to what we actually observed. */
+ ret = strstr(observed.lines[0], expect[0]) && strstr(observed.lines[1], expect[1]);
+out:
+ spin_unlock_irqrestore(&observed.lock, flags);
+ return ret;
+}
+
+/* ===== Test cases ===== */
+
+#define TEST_PRIV_WANT_MEMCACHE ((void *)1)
+
+/* Cache used by tests; if NULL, allocate from kmalloc instead. */
+static struct kmem_cache *test_cache;
+
+static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t flags,
+ void (*ctor)(void *))
+{
+ if (test->priv != TEST_PRIV_WANT_MEMCACHE)

+ return size;
+

+ kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor);

+
+ /*

+ * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any
+ * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to
+ * allocate via memcg, if enabled.
+ */
+ flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT;
+ test_cache = kmem_cache_create("test", size, 1, flags, ctor);
+ KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache");
+

+ return size;
+}
+

+static void test_cache_destroy(void)
+{
+ if (!test_cache)
+ return;
+
+ kmem_cache_destroy(test_cache);
+ test_cache = NULL;
+}
+
+static inline size_t kmalloc_cache_alignment(size_t size)
+{
+ return kmalloc_caches[kmalloc_type(GFP_KERNEL)][kmalloc_index(size)]->align;
+}
+
+/* Must always inline to match stack trace against caller. */
+static __always_inline void test_free(void *ptr)
+{
+ if (test_cache)
+ kmem_cache_free(test_cache, ptr);
+ else
+ kfree(ptr);

+}
+
+/*

+ * If this should be a KFENCE allocation, and on which side the allocation and
+ * the closest guard page should be.
+ */
+enum allocation_policy {
+ ALLOCATE_ANY, /* KFENCE, any side. */
+ ALLOCATE_LEFT, /* KFENCE, left side of page. */
+ ALLOCATE_RIGHT, /* KFENCE, right side of page. */
+ ALLOCATE_NONE, /* No KFENCE allocation. */

+};
+
+/*

+ * Try to get a guarded allocation from KFENCE. Uses either kmalloc() or the
+ * current test_cache if set up.
+ */
+static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocation_policy policy)
+{
+ void *alloc;
+ unsigned long timeout, resched_after;
+ const char *policy_name;
+
+ switch (policy) {
+ case ALLOCATE_ANY:
+ policy_name = "any";
+ break;
+ case ALLOCATE_LEFT:
+ policy_name = "left";
+ break;
+ case ALLOCATE_RIGHT:
+ policy_name = "right";
+ break;
+ case ALLOCATE_NONE:
+ policy_name = "none";
+ break;
+ }
+
+ kunit_info(test, "%s: size=%zu, gfp=%x, policy=%s, cache=%i\n", __func__, size, gfp,
+ policy_name, !!test_cache);
+
+ /*

+}
+
+/*
+ * KFENCE is unable to detect an OOB if the allocation's alignment requirements
+ * leave a gap between the object and the guard page. Specifically, an
+ * allocation of e.g. 73 bytes is aligned on 8 and 128 bytes for SLUB or SLAB
+ * respectively. Therefore it is impossible for the allocated object to adhere
+ * to either of the page boundaries.
+ *
+ * However, we test that an access to memory beyond the gap results in KFENCE
+ * detecting an OOB access.
+ */
+static void test_kmalloc_aligned_oob_read(struct kunit *test)
+{
+ const size_t size = 73;
+ const size_t align = kmalloc_cache_alignment(size);
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_OOB,
+ .fn = test_kmalloc_aligned_oob_read,
+ };
+ char *buf;
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
+
+ /*
+ * The object is offset to the right, so there won't be an OOB to the
+ * left of it.
+ */
+ READ_ONCE(*(buf - 1));
+ KUNIT_EXPECT_FALSE(test, report_available());
+
+ /*
+ * @buf must be aligned on @align, therefore buf + size belongs to the
+ * same page -> no OOB.
+ */
+ READ_ONCE(*(buf + size));
+ KUNIT_EXPECT_FALSE(test, report_available());
+
+ /* Overflowing by @align bytes will result in an OOB. */
+ expect.addr = buf + size + align;
+ READ_ONCE(*expect.addr);
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+
+ test_free(buf);
+}
+
+static void test_kmalloc_aligned_oob_write(struct kunit *test)
+{
+ const size_t size = 73;
+ struct expect_report expect = {
+ .type = KFENCE_ERROR_CORRUPTION,
+ .fn = test_kmalloc_aligned_oob_write,
+ };
+ char *buf;
+
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);

+ /*

+ };
+ int i;
+
+ if (!IS_ENABLED(CONFIG_INIT_ON_FREE_DEFAULT_ON))
+ return;
+ /* Assume it hasn't been disabled on command line. */
+
+ setup_test_cache(test, size, 0, NULL);
+ expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ for (i = 0; i < size; i++)
+ expect.addr[i] = i + 1;
+ test_free(expect.addr);
+
+ for (i = 0; i < size; i++) {
+ /*
+ * This may fail if the page was recycled by KFENCE and then
+ * written to again -- this however, is near impossible with a
+ * default config.
+ */
+ KUNIT_EXPECT_EQ(test, expect.addr[i], (char)0);
+
+ if (!i) /* Only check first access to not fail test if page is ever re-protected. */
+ KUNIT_EXPECT_TRUE(test, report_matches(&expect));
+ }
+}
+
+/* Ensure that constructors work properly. */
+static void test_memcache_ctor(struct kunit *test)
+{
+ const size_t size = 32;
+ char *buf;
+ int i;
+
+ setup_test_cache(test, size, 0, ctor_set_x);
+ buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+
+ for (i = 0; i < 8; i++)
+ KUNIT_EXPECT_EQ(test, buf[i], (char)'x');
+
+ test_free(buf);
+
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+/* Test that memory is zeroed if requested. */
+static void test_gfpzero(struct kunit *test)
+{
+ const size_t size = PAGE_SIZE; /* PAGE_SIZE so we can use ALLOCATE_ANY. */
+ char *buf1, *buf2;
+ int i;
+
+ if (CONFIG_KFENCE_SAMPLE_INTERVAL > 100) {
+ kunit_warn(test, "skipping ... would take too long\n");
+ return;
+ }
+
+ setup_test_cache(test, size, 0, NULL);
+ buf1 = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+ for (i = 0; i < size; i++)
+ buf1[i] = i + 1;
+ test_free(buf1);
+
+ /* Try to get same address again -- this can take a while. */
+ for (i = 0;; i++) {
+ buf2 = test_alloc(test, size, GFP_KERNEL | __GFP_ZERO, ALLOCATE_ANY);
+ if (buf1 == buf2)
+ break;
+ test_free(buf2);
+
+ if (i == CONFIG_KFENCE_NUM_OBJECTS) {
+ kunit_warn(test, "giving up ... cannot get same object back\n");
+ return;
+ }
+ }
+

+ int i;
+

+ pass = true;
+ break;
+ }
+ }
+ kmem_cache_free_bulk(test_cache, num, objects);
+ /*
+ * kmem_cache_alloc_bulk() disables interrupts, and calling it
+ * in a tight loop may not give KFENCE a chance to switch the
+ * static branch. Call cond_resched() to let KFENCE chime in.
+ */
+ cond_resched();
+ } while (!pass && time_before(jiffies, timeout));
+
+ KUNIT_EXPECT_TRUE(test, pass);
+ KUNIT_EXPECT_FALSE(test, report_available());
+}
+
+/*

+{
+ unsigned long flags;
+ int i;
+
+ spin_lock_irqsave(&observed.lock, flags);
+ for (i = 0; i < ARRAY_SIZE(observed.lines); i++)
+ observed.lines[i][0] = '\0';
+ observed.nlines = 0;
+ spin_unlock_irqrestore(&observed.lock, flags);
+
+ /* Any test with 'memcache' in its name will want a memcache. */
+ if (strstr(test->name, "memcache"))
+ test->priv = TEST_PRIV_WANT_MEMCACHE;
+ else
+ test->priv = NULL;
+
+ return 0;
+}
+
+static void test_exit(struct kunit *test)
+{
+ test_cache_destroy();
+}
+
+static struct kunit_suite kfence_test_suite = {
+ .name = "kfence",
+ .test_cases = kfence_test_cases,
+ .init = test_init,
+ .exit = test_exit,
+};
+static struct kunit_suite *kfence_test_suites[] = { &kfence_test_suite, NULL };
+
+static void register_tracepoints(struct tracepoint *tp, void *ignore)
+{
+ check_trace_callback_type_console(probe_console);
+ if (!strcmp(tp->name, "console"))
+ WARN_ON(tracepoint_probe_register(tp, probe_console, NULL));
+}
+
+static void unregister_tracepoints(struct tracepoint *tp, void *ignore)
+{
+ if (!strcmp(tp->name, "console"))
+ tracepoint_probe_unregister(tp, probe_console, NULL);
+}
+
+/*
+ * We only want to do tracepoints setup and teardown once, therefore we have to
+ * customize the init and exit functions and cannot rely on kunit_test_suite().
+ */
+static int __init kfence_test_init(void)
+{
+ /*
+ * Because we want to be able to build the test as a module, we need to
+ * iterate through all known tracepoints, since the static registration
+ * won't work here.
+ */
+ for_each_kernel_tracepoint(register_tracepoints, NULL);
+ return __kunit_test_suites_init(kfence_test_suites);
+}
+
+static void kfence_test_exit(void)
+{
+ __kunit_test_suites_exit(kfence_test_suites);
+ for_each_kernel_tracepoint(unregister_tracepoints, NULL);
+ tracepoint_synchronize_unregister();
+}
+
+late_initcall(kfence_test_init);
+module_exit(kfence_test_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Alexander Potapenko <gli...@google.com>, Marco Elver <el...@google.com>");
--

2.28.0.681.g6f77f65b4e-goog
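
The alignment comment above test_kmalloc_aligned_oob_read() can be checked with a little user-space arithmetic. This is only a sketch: it assumes 4 KiB pages and assumes that a right-side object is pushed as far right as possible and then rounded down to the cache alignment. All names below are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/*
 * Offset within the object page at which a right-side object starts,
 * assuming it is placed as far right as possible and then rounded down
 * to the cache alignment. 'align' must be a power of two.
 */
static inline size_t right_side_offset(size_t size, size_t align)
{
	assert(size > 0 && size <= SKETCH_PAGE_SIZE);
	assert(align != 0 && (align & (align - 1)) == 0);
	return (SKETCH_PAGE_SIZE - size) & ~(align - 1);
}

/* Bytes left between the end of the object and the right guard page. */
static inline size_t gap_to_guard(size_t size, size_t align)
{
	return SKETCH_PAGE_SIZE - (right_side_offset(size, align) + size);
}

/* Nonzero if (object start + delta) falls into the right guard page. */
static inline int hits_right_guard(size_t size, size_t align, size_t delta)
{
	return right_side_offset(size, align) + delta >= SKETCH_PAGE_SIZE;
}
```

Under these assumptions a 73-byte object with 8-byte alignment leaves a 7-byte gap, so buf + size stays on the object page while buf + size + align reaches the guard page, which is the pattern the test exercises.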

Dmitry Vyukov

Sep 21, 2020, 9:38:49 AM

to Andrew Morton, Alexander Potapenko, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, sjp...@amazon.com, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux-MM, Marco Elver

On Mon, Sep 21, 2020 at 3:26 PM Marco Elver <el...@google.com> wrote:
>
> This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
> low-overhead sampling-based memory safety error detector of heap
> use-after-free, invalid-free, and out-of-bounds access errors. This
> series enables KFENCE for the x86 and arm64 architectures, and adds
> KFENCE hooks to the SLAB and SLUB allocators.

Hi Andrew,

I wanted to ask what we can expect with respect to the timeline of
merging this into mm/upstream? The series got a few reviews and positive
feedback.

Thank you

Will Deacon

Sep 21, 2020, 10:31:11 AM

to Marco Elver, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, pau...@kernel.org, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On Mon, Sep 21, 2020 at 03:26:04PM +0200, Marco Elver wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the arm64 architecture. In particular, this implements the
> required interface in <asm/kfence.h>. Currently, the arm64 version does
> not yet use a statically allocated memory pool, at the cost of a pointer
> load for each is_kfence_address().
>
> Reviewed-by: Dmitry Vyukov <dvy...@google.com>
> Co-developed-by: Alexander Potapenko <gli...@google.com>
> Signed-off-by: Alexander Potapenko <gli...@google.com>
> Signed-off-by: Marco Elver <el...@google.com>
> ---
> For ARM64, we would like to solicit feedback on what the best option is
> to obtain a constant address for __kfence_pool. One option is to declare
> a memory range in the memory layout to be dedicated to KFENCE (like is
> done for KASAN), however, it is unclear if this is the best available
> option. We would like to avoid touching the memory layout.

Sorry for the delay on this.

Given that the pool is relatively small (i.e. when compared with our virtual
address space), dedicating an area of virtual space sounds like it makes
the most sense here. How early do you need it to be available?

An alternative approach would be to patch in the address at runtime, with
something like a static key to swizzle off the direct __kfence_pool load
once we're up and running.

Will

Alexander Potapenko

Sep 21, 2020, 10:58:29 AM

to Will Deacon, Marco Elver, Andrew Morton, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, sjp...@amazon.com, Thomas Gleixner, Vlastimil Babka, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

On Mon, Sep 21, 2020 at 4:31 PM Will Deacon <wi...@kernel.org> wrote:
>
> On Mon, Sep 21, 2020 at 03:26:04PM +0200, Marco Elver wrote:
> > Add architecture specific implementation details for KFENCE and enable
> > KFENCE for the arm64 architecture. In particular, this implements the
> > required interface in <asm/kfence.h>. Currently, the arm64 version does
> > not yet use a statically allocated memory pool, at the cost of a pointer
> > load for each is_kfence_address().
> >
> > Reviewed-by: Dmitry Vyukov <dvy...@google.com>
> > Co-developed-by: Alexander Potapenko <gli...@google.com>
> > Signed-off-by: Alexander Potapenko <gli...@google.com>
> > Signed-off-by: Marco Elver <el...@google.com>
> > ---
> > For ARM64, we would like to solicit feedback on what the best option is
> > to obtain a constant address for __kfence_pool. One option is to declare
> > a memory range in the memory layout to be dedicated to KFENCE (like is
> > done for KASAN), however, it is unclear if this is the best available
> > option. We would like to avoid touching the memory layout.
>
> Sorry for the delay on this.

NP, thanks for looking!

> Given that the pool is relatively small (i.e. when compared with our virtual
> address space), dedicating an area of virtual space sounds like it makes
> the most sense here. How early do you need it to be available?

Yes, having a dedicated address sounds good.
We're inserting kfence_init() into start_kernel() after timekeeping_init().
So way after mm_init(), if that matters.

> An alternative approach would be to patch in the address at runtime, with
> something like a static key to swizzle off the direct __kfence_pool load
> once we're up and running.

IIUC there's no such thing as address patching in the kernel at the
moment, at least static keys work differently?
I am not sure how much we need to randomize this address range (we
don't on x86 anyway).

> Will

Alexander Potapenko

Sep 21, 2020, 11:37:24 AM

to Will Deacon, Marco Elver, Andrew Morton, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, sjp...@amazon.com, Thomas Gleixner, Vlastimil Babka, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

The question is though, how big should that dedicated area be?
Right now KFENCE_NUM_OBJECTS can be up to 16383 (which makes the pool
size 64MB), but this number actually comes from the limitation on
static objects, so we might want to increase that number on arm64.

Paul E. McKenney

Sep 21, 2020, 1:13:27 PM

to Marco Elver, ak...@linux-foundation.org, gli...@google.com, h...@zytor.com, andre...@google.com, arya...@virtuozzo.com, lu...@kernel.org, b...@alien8.de, catalin...@arm.com, c...@linux.com, dave....@linux.intel.com, rien...@google.com, dvy...@google.com, edum...@google.com, gre...@linuxfoundation.org, hda...@sina.com, mi...@redhat.com, ja...@google.com, Jonathan...@huawei.com, cor...@lwn.net, iamjoon...@lge.com, kees...@chromium.org, mark.r...@arm.com, pen...@kernel.org, pet...@infradead.org, sjp...@amazon.com, tg...@linutronix.de, vba...@suse.cz, wi...@kernel.org, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, linux-ar...@lists.infradead.org, linu...@kvack.org

On Mon, Sep 21, 2020 at 03:26:11PM +0200, Marco Elver wrote:
> Add KFENCE test suite, testing various error detection scenarios. Makes
> use of KUnit for test organization. Since KFENCE's interface to obtain
> error reports is via the console, the test verifies that KFENCE outputs
> expected reports to the console.
>
> Reviewed-by: Dmitry Vyukov <dvy...@google.com>
> Co-developed-by: Alexander Potapenko <gli...@google.com>
> Signed-off-by: Alexander Potapenko <gli...@google.com>
> Signed-off-by: Marco Elver <el...@google.com>

[ . . . ]

> +/* Test SLAB_TYPESAFE_BY_RCU works. */
> +static void test_memcache_typesafe_by_rcu(struct kunit *test)
> +{
> + const size_t size = 32;
> + struct expect_report expect = {
> + .type = KFENCE_ERROR_UAF,
> + .fn = test_memcache_typesafe_by_rcu,
> + };
> +
> + setup_test_cache(test, size, SLAB_TYPESAFE_BY_RCU, NULL);
> + KUNIT_EXPECT_TRUE(test, test_cache); /* Want memcache. */
> +
> + expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
> + *expect.addr = 42;
> +
> + rcu_read_lock();
> + test_free(expect.addr);
> + KUNIT_EXPECT_EQ(test, *expect.addr, (char)42);
> + rcu_read_unlock();

It won't happen very often, but memory really could be freed at this point,
especially in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels ...

> + /* No reports yet, memory should not have been freed on access. */
> + KUNIT_EXPECT_FALSE(test, report_available());

... so the above statement needs to go before the rcu_read_unlock().

> + rcu_barrier(); /* Wait for free to happen. */

But you are quite right that the memory is not -guaranteed- to be freed
until we get here.

Thanx, Paul

Marco Elver

Sep 21, 2020, 1:37:26 PM

to Paul E. McKenney, Andrew Morton, Alexander Potapenko, H. Peter Anvin, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan Cameron, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, SeongJae Park, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

Ah, thanks for pointing it out.

> > + /* No reports yet, memory should not have been freed on access. */
> > + KUNIT_EXPECT_FALSE(test, report_available());
>
> ... so the above statement needs to go before the rcu_read_unlock().

You mean the comment (and not the KUNIT_EXPECT_FALSE that no reports
were generated), correct?

Admittedly, the whole comment is a bit imprecise, so I'll reword.

> > + rcu_barrier(); /* Wait for free to happen. */
>
> But you are quite right that the memory is not -guaranteed- to be freed
> until we get here.

Right, I'll update the comment.

Thanks,
-- Marco

Will Deacon

Sep 21, 2020, 1:44:08 PM

to Alexander Potapenko, Marco Elver, Andrew Morton, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, sjp...@amazon.com, Thomas Gleixner, Vlastimil Babka, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

What happens on x86 and why would we do something different?

Will

Paul E. McKenney

Sep 21, 2020, 1:48:29 PM

to Marco Elver, Andrew Morton, Alexander Potapenko, H. Peter Anvin, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan Cameron, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, SeongJae Park, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

I freely confess that I did not research exactly what might generate
a report. But if this KUNIT_EXPECT_FALSE() was just verifying that the
previous KUNIT_EXPECT_TRUE() did not trigger, then yes, the code is just
fine as it is.

Thanx, Paul

Marco Elver

Sep 22, 2020, 5:56:40 AM

to Will Deacon, Alexander Potapenko, Andrew Morton, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan Cameron, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, SeongJae Park, Thomas Gleixner, Vlastimil Babka, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

On Mon, 21 Sep 2020 at 19:44, Will Deacon <wi...@kernel.org> wrote:
[...]

> > > > > For ARM64, we would like to solicit feedback on what the best option is
> > > > > to obtain a constant address for __kfence_pool. One option is to declare
> > > > > a memory range in the memory layout to be dedicated to KFENCE (like is
> > > > > done for KASAN), however, it is unclear if this is the best available
> > > > > option. We would like to avoid touching the memory layout.
> > > >
> > > > Sorry for the delay on this.
> > >
> > > NP, thanks for looking!
> > >
> > > > Given that the pool is relatively small (i.e. when compared with our virtual
> > > > address space), dedicating an area of virtual space sounds like it makes
> > > > the most sense here. How early do you need it to be available?
> > >
> > > Yes, having a dedicated address sounds good.
> > > We're inserting kfence_init() into start_kernel() after timekeeping_init().
> > > So way after mm_init(), if that matters.
> >
> > The question is though, how big should that dedicated area be?
> > Right now KFENCE_NUM_OBJECTS can be up to 16383 (which makes the pool
> > size 64MB), but this number actually comes from the limitation on
> > static objects, so we might want to increase that number on arm64.
>
> What happens on x86 and why would we do something different?

On x86 we just do `char __kfence_pool[KFENCE_POOL_SIZE] ...;` to
statically allocate the pool. On arm64 this doesn't seem to work
because static memory doesn't have struct pages?
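
As a rough user-space illustration of why a statically allocated pool avoids the pointer load in is_kfence_address(): with a static array the base address is a link-time constant, whereas a dynamically assigned pool has to load its base from memory on every check. The names below are hypothetical, and the pool size (255 objects, 4 KiB pages, two pages per object plus one leading pair) is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096u /* assumption: 4 KiB pages */
#define SKETCH_NUM_OBJECTS 255u  /* assumption */
/* Assumed layout: two pages per object plus one leading pair of pages. */
#define SKETCH_POOL_SIZE   ((SKETCH_NUM_OBJECTS + 1u) * 2u * SKETCH_PAGE_SIZE)

static_assert(SKETCH_POOL_SIZE == 2u * 1024u * 1024u, "2 MiB under these assumptions");

/* Static pool: the base address is known at link time. */
static char sketch_pool[SKETCH_POOL_SIZE];

/* Dynamic pool: the base has to be read from this variable on each check. */
static char *sketch_pool_ptr;

static inline bool in_static_pool(const void *addr)
{
	/* A single unsigned subtract-and-compare covers both bounds. */
	return (uintptr_t)addr - (uintptr_t)sketch_pool < SKETCH_POOL_SIZE;
}

static inline bool in_dynamic_pool(const void *addr)
{
	return sketch_pool_ptr &&
	       (uintptr_t)addr - (uintptr_t)sketch_pool_ptr < SKETCH_POOL_SIZE;
}
```

The two checks differ only in where the base comes from, which is the cost the arm64 patch description refers to.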

Thanks,
-- Marco

SeongJae Park

Sep 25, 2020, 7:24:06 AM

to Marco Elver, ak...@linux-foundation.org, gli...@google.com, mark.r...@arm.com, hda...@sina.com, linu...@vger.kernel.org, pet...@infradead.org, catalin...@arm.com, dave....@linux.intel.com, linu...@kvack.org, edum...@google.com, h...@zytor.com, c...@linux.com, wi...@kernel.org, sjp...@amazon.com, cor...@lwn.net, x...@kernel.org, kasa...@googlegroups.com, mi...@redhat.com, vba...@suse.cz, rien...@google.com, arya...@virtuozzo.com, kees...@chromium.org, pau...@kernel.org, ja...@google.com, andre...@google.com, b...@alien8.de, lu...@kernel.org, Jonathan...@huawei.com, tg...@linutronix.de, dvy...@google.com, linux-ar...@lists.infradead.org, gre...@linuxfoundation.org, linux-...@vger.kernel.org, pen...@kernel.org, iamjoon...@lge.com

This patch doesn't introduce this file yet, right? How about using a separate
final patch for the MAINTAINERS update?

Other than that,

Reviewed-by: SeongJae Park <sjp...@amazon.de>

Thanks,
SeongJae Park

Marco Elver

Sep 25, 2020, 7:32:01 AM

to SeongJae Park, Andrew Morton, Alexander Potapenko, Mark Rutland, Hillf Danton, open list:DOCUMENTATION, Peter Zijlstra, Catalin Marinas, Dave Hansen, Linux Memory Management List, Eric Dumazet, H. Peter Anvin, Christoph Lameter, Will Deacon, Jonathan Corbet, the arch/x86 maintainers, kasan-dev, Ingo Molnar, Vlastimil Babka, David Rientjes, Andrey Ryabinin, Kees Cook, Paul E. McKenney, Jann Horn, Andrey Konovalov, Borislav Petkov, Andy Lutomirski, Jonathan Cameron, Thomas Gleixner, Dmitry Vyukov, Linux ARM, Greg Kroah-Hartman, LKML, Pekka Enberg, Joonsoo Kim

Sure.

> Other than that,
>
> Reviewed-by: SeongJae Park <sjp...@amazon.de>

Thanks!

Alexander Potapenko

Sep 25, 2020, 11:25:24 AM

to Will Deacon, Marco Elver, Andrew Morton, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitriy Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, sjp...@amazon.com, Thomas Gleixner, Vlastimil Babka, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

Will,

> Given that the pool is relatively small (i.e. when compared with our virtual
> address space), dedicating an area of virtual space sounds like it makes
> the most sense here. How early do you need it to be available?

How do we assign struct pages to a fixed virtual space area (I'm
currently experimenting with 0xffff7f0000000000-0xffff7f0000200000)?
Looks like filling page table entries (similarly to what's being done
in arch/arm64/mm/kasan_init.c) is not enough.
I thought maybe vmemmap_populate() would do the job, but it didn't
(virt_to_pfn() still returns invalid PFNs).

Marco Elver

Sep 28, 2020, 7:54:07 AM

to Will Deacon, Alexander Potapenko, Andrew Morton, H. Peter Anvin, Paul E. McKenney, Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan Cameron, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, SeongJae Park, Thomas Gleixner, Vlastimil Babka, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

On Mon, 21 Sep 2020 at 16:31, Will Deacon <wi...@kernel.org> wrote:
> On Mon, Sep 21, 2020 at 03:26:04PM +0200, Marco Elver wrote:
> > Add architecture specific implementation details for KFENCE and enable
> > KFENCE for the arm64 architecture. In particular, this implements the
> > required interface in <asm/kfence.h>. Currently, the arm64 version does
> > not yet use a statically allocated memory pool, at the cost of a pointer
> > load for each is_kfence_address().

[...]

> > For ARM64, we would like to solicit feedback on what the best option is
> > to obtain a constant address for __kfence_pool. One option is to declare
> > a memory range in the memory layout to be dedicated to KFENCE (like is
> > done for KASAN), however, it is unclear if this is the best available
> > option. We would like to avoid touching the memory layout.

> Given that the pool is relatively small (i.e. when compared with our virtual
> address space), dedicating an area of virtual space sounds like it makes
> the most sense here. How early do you need it to be available?

Note: we're going to send a v4 this or next week with a few other
minor fixes in it. But I think we just don't want to block the entire
series on figuring out what the static-pool arm64 version should do,
especially if we'll have a few iterations with only this patch here
changing.

So the plan will be:

1. Send v4, which could from our point-of-view be picked up for
merging. Unless of course there are more comments.

2. Work out the details for the static-pool arm64 version, since it
doesn't seem trivial to do the same thing as we do for x86. In
preparation for that, v4 will allow the __kfence_pool's attributes to
be defined entirely by <asm/kfence.h>, so that we can fiddle with
sections etc.

3. Send patch switching out the simpler arm64 version here for one
that places __kfence_pool at a static location.

Hopefully that plan is reasonable.

Thanks,
-- Marco

Andrey Konovalov

Sep 29, 2020, 8:21:08 AM

to Marco Elver, Andrew Morton, Alexander Potapenko, H. Peter Anvin, Paul E . McKenney, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, sjp...@amazon.com, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

On Mon, Sep 21, 2020 at 3:26 PM Marco Elver <el...@google.com> wrote:
>

With KFENCE + KASAN both enabled we need to bail out in all KASAN
hooks that get called from the allocator, right? Do I understand
correctly that these two are the only ones that are called for
KFENCE-allocated objects due to the way KFENCE is integrated into the
allocator?

Andrey Konovalov

Sep 29, 2020, 8:43:07 AM

to Marco Elver, Andrew Morton, Alexander Potapenko, H. Peter Anvin, Paul E . McKenney, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, sjp...@amazon.com, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

On Mon, Sep 21, 2020 at 3:26 PM Marco Elver <el...@google.com> wrote:
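> + index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1;

Why do we subtract 1 here? Do we have a metadata entry reserved for something?

> + /* Allocation and free stack information. */
> + int num_alloc_stack;
> + int num_free_stack;
> + unsigned long alloc_stack[KFENCE_STACK_DEPTH];
> + unsigned long free_stack[KFENCE_STACK_DEPTH];

It was a conscious decision not to use stackdepot, right? Perhaps it
makes sense to document the reason somewhere.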

Marco Elver

Sep 29, 2020, 9:11:44 AM

to Andrey Konovalov, Andrew Morton, Alexander Potapenko, H. Peter Anvin, Paul E . McKenney, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov, Catalin Marinas, Christoph Lameter, Dave Hansen, David Rientjes, Dmitry Vyukov, Eric Dumazet, Greg Kroah-Hartman, Hillf Danton, Ingo Molnar, Jann Horn, Jonathan...@huawei.com, Jonathan Corbet, Joonsoo Kim, Kees Cook, Mark Rutland, Pekka Enberg, Peter Zijlstra, sjp...@amazon.com, Thomas Gleixner, Vlastimil Babka, Will Deacon, the arch/x86 maintainers, open list:DOCUMENTATION, LKML, kasan-dev, Linux ARM, Linux Memory Management List

On Tue, Sep 29, 2020 at 02:42PM +0200, Andrey Konovalov wrote:
[...]

> > + */
> > + index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1;
>
> Why do we subtract 1 here? Do we have a metadata entry reserved for something?

Above the declaration of __kfence_pool it says:

* We allocate an even number of pages, as it simplifies calculations to map

* address to metadata indices; effectively, the very first page serves as an

* extended guard page, but otherwise has no special purpose.

Hopefully that clarifies the `- 1` here.
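
For illustration only, the effect of the `- 1` can be seen in a small user-space sketch. It assumes 4 KiB pages and uses hypothetical names; the inverse mapping is inferred from the quoted expression and is an assumption, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ul /* assumption: 4 KiB pages */

/*
 * Mirror of the quoted expression. Each object owns a pair of pages, and
 * the first pair in the pool holds no object, so addresses in it map to -1.
 */
static inline long sketch_addr_to_index(uintptr_t pool, uintptr_t addr)
{
	assert(addr >= pool);
	return (long)((addr - pool) / (SKETCH_PAGE_SIZE * 2)) - 1;
}

/* Assumed inverse: first byte of the page pair that maps to 'index'. */
static inline uintptr_t sketch_index_to_addr(uintptr_t pool, long index)
{
	assert(index >= 0);
	return pool + (uintptr_t)(index + 1) * 2 * SKETCH_PAGE_SIZE;
}
```

Because the leading pair maps to -1, metadata index 0 belongs to the second pair of pages, and a caller has to treat a negative result as "no object here".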

[...]

> > + /* Allocation and free stack information. */
> > + int num_alloc_stack;
> > + int num_free_stack;
> > + unsigned long alloc_stack[KFENCE_STACK_DEPTH];
> > + unsigned long free_stack[KFENCE_STACK_DEPTH];
>
> It was a conscious decision not to use stackdepot, right? Perhaps it
> makes sense to document the reason somewhere.

Yes; we want to avoid the dynamic allocations that stackdepot does.
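
As a rough user-space sketch of that trade-off: with fixed-size arrays embedded in the metadata, recording a stack trace is a bounded copy that never allocates, at the cost of truncating deep traces and reserving the space up front. The depth of 64 and all names below are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_STACK_DEPTH 64 /* assumed depth, for illustration only */

/* Hypothetical per-object record with the trace stored inline. */
struct sketch_track {
	int num_alloc_stack;
	unsigned long alloc_stack[SKETCH_STACK_DEPTH];
};

/*
 * Copy at most SKETCH_STACK_DEPTH entries into the record. No memory is
 * allocated; deeper traces are truncated. Returns the number stored.
 */
static inline int sketch_save_stack(struct sketch_track *t,
				    const unsigned long *entries, size_t n)
{
	assert(t != NULL && entries != NULL);
	if (n > SKETCH_STACK_DEPTH)
		n = SKETCH_STACK_DEPTH;
	memcpy(t->alloc_stack, entries, n * sizeof(*entries));
	t->num_alloc_stack = (int)n;
	return t->num_alloc_stack;
}
```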

[...]

Thanks,
-- Marco