Re: kasan: false use-after-scope warnings with KCOV

Dmitry Vyukov

Nov 28, 2017, 7:58:12 AM
to Mark Rutland, LKML, linux-ar...@lists.infradead.org, Andrey Ryabinin, Alexander Potapenko, kasan-dev
On Tue, Nov 28, 2017 at 1:35 PM, Mark Rutland <mark.r...@arm.com> wrote:
> Hi,
>
> As a heads-up, I'm seeing a number of what appear to be false-positive
> use-after-scope warnings when I enable both KCOV and KASAN (inline or outline),
> when using the Linaro 17.08 GCC7.1.1 for arm64. So far I haven't spotted these
> without KCOV selected, and I'm only seeing these for sanitize-use-after-scope.
>
> The reports vary depending on configuration even with the same trigger. I'm not
> sure if it's the reporting that's misleading, or whether the detection is going
> wrong.
>
> For example, with v4.15-rc1, defconfig + KCOV + KASAN_OUTLINE, I can trigger a
> splat:
>
> $ perf record true
>
> [ 37.577497] ==================================================================
> [ 37.584702] BUG: KASAN: use-after-scope in __alloc_pages_nodemask+0x104/0x1608
> [ 37.591883] Write of size 24 at addr ffff80092d65f160 by task perf/2430
> [ 37.598452]
> [ 37.599944] CPU: 1 PID: 2430 Comm: perf Not tainted 4.15.0-rc1-00001-gaf82bf81ebae #1
> [ 37.607725] Hardware name: ARM Juno development board (r1) (DT)
> [ 37.613605] Call trace:
> [ 37.616051] dump_backtrace+0x0/0x320
> [ 37.619700] show_stack+0x20/0x30
> [ 37.623005] dump_stack+0x108/0x174
> [ 37.626481] print_address_description+0x60/0x270
> [ 37.631162] kasan_report+0x210/0x2f0
> [ 37.634811] check_memory_region+0x148/0x198
> [ 37.639063] __asan_storeN+0x14/0x20
> [ 37.642624] __alloc_pages_nodemask+0x104/0x1608
> [ 37.647221] alloc_pages_vma+0xa0/0x2d8
> [ 37.651042] wp_page_copy+0x15c/0xee0
> [ 37.654689] do_wp_page+0x404/0xa70
> [ 37.658165] __handle_mm_fault+0xb28/0x13e0
> [ 37.662331] handle_mm_fault+0x290/0x390
> [ 37.666237] do_page_fault+0x32c/0x5c0
> [ 37.669969] do_mem_abort+0xa8/0x1e0
> [ 37.673528] el0_da+0x20/0x24
> [ 37.676477]
> [ 37.677961] The buggy address belongs to the page:
> [ 37.682730] page:ffff7e0024b597c0 count:0 mapcount:0 mapping: (null) index:0x0
> [ 37.690692] flags: 0x1fffc00000000000()
> [ 37.694518] raw: 1fffc00000000000 0000000000000000 0000000000000000 00000000ffffffff
> [ 37.702225] raw: 0000000000000000 ffff7e0024b597e0 0000000000000000 0000000000000000
> [ 37.709922] page dumped because: kasan: bad access detected
> [ 37.715457]
> [ 37.716941] Memory state around the buggy address:
> [ 37.721709] ffff80092d65f000: f2 f2 04 f2 f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2
> [ 37.728893] ffff80092d65f080: f2 f2 00 f2 f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2
> [ 37.736078] >ffff80092d65f100: f2 f2 00 f2 f2 f2 f2 f2 f2 f2 f8 f8 f8 f8 00 f2
> [ 37.743257] ^
> [ 37.749576] ffff80092d65f180: f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2 f3 f3
> [ 37.756761] ffff80092d65f200: f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 37.763939] ==================================================================
> [ 37.771117] Disabling lock debugging due to kernel taint
>
> $ ./scripts/faddr2line vmlinux __alloc_pages_nodemask+0x104/0x1608
> __alloc_pages_nodemask+0x104/0x1608:
> __alloc_pages_nodemask at mm/page_alloc.c:4215
>
> ... which is the declaration+initialisation of a local variable in
> __alloc_pages_nodemask:
>
> 4208 struct page *
> 4209 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> 4210 nodemask_t *nodemask)
> 4211 {
> 4212 struct page *page;
> 4213 unsigned int alloc_flags = ALLOC_WMARK_LOW;
> 4214 gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> 4215 struct alloc_context ac = { };
>
> ... which is clearly not a use-after-scope bug.
>
> If I separate the declaration and assignment, I get a splat corresponding to the
> assignment to ac.
>
> I wondered if we were missing some shadow initialisation, so I hacked a call to
> kasan_unpoison_task_stack() into dup_task_struct(), but this had no effect. I
> also wondered if this was the result of an overflow caused by instrumentation
> bloating the stack, but doubling my stack size (from 32K to 64K) also had no
> effect.

Hi Mark,

Has anything changed in your environment? Kernel? Compiler? Configs?

The last one that I debugged related to stack false positives was due
to incorrect DTLB flush after KASAN shadow initialization. But that
was on x86 and due to a missed backport to 4.4.

Please post disasm of the function. Instrumentation should have
cleared the shadow for ac in the prologue.

Mark Rutland

Nov 28, 2017, 9:14:03 AM
to Dmitry Vyukov, LKML, linux-ar...@lists.infradead.org, Andrey Ryabinin, Alexander Potapenko, kasan-dev
On Tue, Nov 28, 2017 at 01:57:49PM +0100, Dmitry Vyukov wrote:
> On Tue, Nov 28, 2017 at 1:35 PM, Mark Rutland <mark.r...@arm.com> wrote:
> > As a heads-up, I'm seeing a number of what appear to be false-positive
> > use-after-scope warnings when I enable both KCOV and KASAN (inline or outline),
> > when using the Linaro 17.08 GCC7.1.1 for arm64. So far I haven't spotted these
> > without KCOV selected, and I'm only seeing these for sanitize-use-after-scope.
> >
> > The reports vary depending on configuration even with the same trigger. I'm not
> > sure if it's the reporting that's misleading, or whether the detection is going
> > wrong.

> > For example, with v4.15-rc1, defconfig + KCOV + KASAN_OUTLINE, I can trigger a
> > splat:
> >
> > $ perf record true

> > [ 37.584702] BUG: KASAN: use-after-scope in __alloc_pages_nodemask+0x104/0x1608

> > $ ./scripts/faddr2line vmlinux __alloc_pages_nodemask+0x104/0x1608
> > __alloc_pages_nodemask+0x104/0x1608:
> > __alloc_pages_nodemask at mm/page_alloc.c:4215
> >
> > ... which is the declaration+initialisation of a local variable in
> > __alloc_pages_nodemask:

> > 4215 struct alloc_context ac = { };
> >
> > ... which is clearly not a use-after-scope bug.

> Has anything changed in your environment? Kernel? Compiler? Configs?

This is the first time I've used the Linaro 17.08 GCC 7.1.1
toolchain.

This is also the first time I've tested v4.15-rc1. I had a go with v4.14
(same toolchain, same config), and I saw the same problem.

Previously I was using the Linaro 17.05 GCC 6.3.1 toolchain, which did
not support -fsanitize-use-after-scope.

> The last one that I debugged related to stack false positives was due
> to incorrect DTLB flush after KASAN shadow initialization. But that
> was on x86 and due to a missed backport to 4.4.

The arm64 shadow initialization was recently reworked in v4.15-rc1, but
given I can trigger the same issue on v4.14, it doesn't seem likely
that's the problem.

> Please post disasm of the function. Instrumentation should have
> cleared the shadow for ac in the prologue.

The function is 1400+ instructions, so I've just included the prologue
below at the end of the mail.

IIUC the relevant call to __asan_storeN is on line ffff200008293230.

AFAICT, the prologue doesn't zero the shadow at all -- it only
initialises the non-zero bytes. IIRC, functions are meant to clean up
when they return, as we had to fix up for idle in commit:

0d97e6d8024c71cc ("arm64: kasan: clear stale stack poison")

I tested with idle disabled, which made no difference.
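
For anyone following the prologue at the end of this mail, the shadow address arithmetic it performs (shift right by 3, then add the constant built in x7) can be sketched in user space as below. This is only an illustration: the function names are made up for this example, and the offset is simply the value the mov/movk pair produces in this particular build, so it will differ with other configurations.

```c
#include <stdint.h>

/* One shadow byte describes 8 bytes of memory. */
#define SHADOW_SCALE_SHIFT 3

/*
 * Assumption: offset as built in x7 by the mov/movk pair in the
 * disassembly of this build; not a general constant.
 */
#define SHADOW_OFFSET 0xdfff200000000000ULL

/* Hypothetical helper: translate a memory address to its shadow address. */
static inline uint64_t mem_to_shadow(uint64_t addr)
{
	return (addr >> SHADOW_SCALE_SHIFT) + SHADOW_OFFSET;
}

/*
 * Hypothetical helper: position of the shadow byte within a 16-byte row
 * of the "Memory state around the buggy address" dump.
 */
static inline unsigned int shadow_row_index(uint64_t addr)
{
	return (unsigned int)(mem_to_shadow(addr) & 15);
}
```

Applied to the faulting address ffff80092d65f160 from the first report, this selects column 12 of the row starting at ffff80092d65f100, which holds f8 in that dump.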

I hacked a kasan_clear_task_stack(current) immediately before the call
to __alloc_pages_nodemask(), and I get a splat later in
__save_stack_trace() instead. So it looks like the shadow placed by
__alloc_pages_nodemask() isn't overlapping its stack variables.

... it looks suspiciously like something is setting up non-zero shadow
bytes, but not zeroing them upon return.
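
To make the suspected failure mode concrete, here is a toy user-space model of scope poisoning, under the assumption that a variable's shadow is cleared on scope entry and set to f8 on scope exit. The names and the tiny shadow array are invented for this sketch, and it ignores partially used 8-byte granules, so it is not how the kernel implements it.

```c
#include <stdint.h>
#include <string.h>

#define SHADOW_SCALE_SHIFT 3
#define USE_AFTER_SCOPE    0xf8	/* shadow value seen in the reports */

/* Toy shadow for a 1 KiB "stack": one shadow byte per 8 bytes. */
static uint8_t toy_shadow[1024 >> SHADOW_SCALE_SHIFT];

/* Scope exit: mark the variable dead. Assumes off is 8-byte aligned. */
static void toy_poison(size_t off, size_t size)
{
	memset(&toy_shadow[off >> SHADOW_SCALE_SHIFT], USE_AFTER_SCOPE,
	       (size + 7) >> SHADOW_SCALE_SHIFT);
}

/* Scope entry: mark the variable addressable again. */
static void toy_unpoison(size_t off, size_t size)
{
	memset(&toy_shadow[off >> SHADOW_SCALE_SHIFT], 0,
	       (size + 7) >> SHADOW_SCALE_SHIFT);
}

/* Returns nonzero if any granule in [off, off + size) is poisoned. */
static int toy_is_poisoned(size_t off, size_t size)
{
	size_t i;

	for (i = off >> SHADOW_SCALE_SHIFT;
	     i <= (off + size - 1) >> SHADOW_SCALE_SHIFT; i++)
		if (toy_shadow[i])
			return 1;
	return 0;
}
```

If a function returns after toy_poison() without a matching toy_unpoison(), the next function whose frame overlaps that range sees stale f8 bytes, and a perfectly valid store to a fresh local is reported as use-after-scope.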

Thanks,
Mark.

---->8----
ffff200008293130 <__alloc_pages_nodemask>:
ffff200008293130: d116c3ff sub sp, sp, #0x5b0
ffff200008293134: d2915665 mov x5, #0x8ab3 // #35507
ffff200008293138: f2a836a5 movk x5, #0x41b5, lsl #16
ffff20000829313c: d2c40007 mov x7, #0x200000000000 // #35184372088832
ffff200008293140: f2fbffe7 movk x7, #0xdfff, lsl #48
ffff200008293144: f000b666 adrp x6, ffff200009962000 <kallsyms_token_index+0x13d00>
ffff200008293148: a9007bfd stp x29, x30, [sp]
ffff20000829314c: 910003fd mov x29, sp
ffff200008293150: 910543a4 add x4, x29, #0x150
ffff200008293154: 910560c6 add x6, x6, #0x158
ffff200008293158: d343fc88 lsr x8, x4, #3
ffff20000829315c: 8b070104 add x4, x8, x7
ffff200008293160: a9151ba5 stp x5, x6, [x29, #336]
ffff200008293164: 90000005 adrp x5, ffff200008293000 <gfp_pfmemalloc_allowed+0x80>
ffff200008293168: 9104c0a5 add x5, x5, #0x130
ffff20000829316c: f90087a8 str x8, [x29, #264]
ffff200008293170: 529e4086 mov w6, #0xf204 // #61956
ffff200008293174: a90153f3 stp x19, x20, [sp, #16]
ffff200008293178: 72be5e46 movk w6, #0xf2f2, lsl #16
ffff20000829317c: a9025bf5 stp x21, x22, [sp, #32]
ffff200008293180: aa0303f6 mov x22, x3
ffff200008293184: a90363f7 stp x23, x24, [sp, #48]
ffff200008293188: 93407c55 sxtw x21, w2
ffff20000829318c: f90023f9 str x25, [sp, #64]
ffff200008293190: 52802234 mov w20, #0x111 // #273
ffff200008293194: a90573fb stp x27, x28, [sp, #80]
ffff200008293198: 2a0003fb mov w27, w0
ffff20000829319c: f900b3a5 str x5, [x29, #352]
ffff2000082931a0: 3204d3e5 mov w5, #0xf1f1f1f1 // #-235802127
ffff2000082931a4: b8276905 str w5, [x8, x7]
ffff2000082931a8: 529e5e45 mov w5, #0xf2f2 // #62194
ffff2000082931ac: 72be5e45 movk w5, #0xf2f2, lsl #16
ffff2000082931b0: 529e4007 mov w7, #0xf200 // #61952
ffff2000082931b4: 29009486 stp w6, w5, [x4, #4]
ffff2000082931b8: 72be5e47 movk w7, #0xf2f2, lsl #16
ffff2000082931bc: 29019486 stp w6, w5, [x4, #12]
ffff2000082931c0: 9105c3a0 add x0, x29, #0x170
ffff2000082931c4: 29029486 stp w6, w5, [x4, #20]
ffff2000082931c8: 9113c3bc add x28, x29, #0x4f0
ffff2000082931cc: 29039486 stp w6, w5, [x4, #28]
ffff2000082931d0: 91004393 add x19, x28, #0x10
ffff2000082931d4: 29049486 stp w6, w5, [x4, #36]
ffff2000082931d8: 72a02434 movk w20, #0x121, lsl #16
ffff2000082931dc: 29059486 stp w6, w5, [x4, #44]
ffff2000082931e0: 29069486 stp w6, w5, [x4, #52]
ffff2000082931e4: 29079486 stp w6, w5, [x4, #60]
ffff2000082931e8: 29089486 stp w6, w5, [x4, #68]
ffff2000082931ec: 29099486 stp w6, w5, [x4, #76]
ffff2000082931f0: 52be4006 mov w6, #0xf2000000 // #-234881024
ffff2000082931f4: 290a9487 stp w7, w5, [x4, #84]
ffff2000082931f8: 290b9487 stp w7, w5, [x4, #92]
ffff2000082931fc: 290c9487 stp w7, w5, [x4, #100]
ffff200008293200: 290d9487 stp w7, w5, [x4, #108]
ffff200008293204: 290f1487 stp w7, w5, [x4, #120]
ffff200008293208: 3204d7e5 mov w5, #0xf3f3f3f3 // #-202116109
ffff20000829320c: 29109486 stp w6, w5, [x4, #132]
ffff200008293210: d000cac4 adrp x4, ffff200009bed000 <page_wait_table+0x1280>
ffff200008293214: b9014fa1 str w1, [x29, #332]
ffff200008293218: 913ea099 add x25, x4, #0xfa8
ffff20000829321c: 9402f069 bl ffff20000834f3c0 <__asan_store4>
ffff200008293220: 52800022 mov w2, #0x1 // #1
ffff200008293224: aa1303e0 mov x0, x19
ffff200008293228: d2800301 mov x1, #0x18 // #24
ffff20000829322c: b90173a2 str w2, [x29, #368]
ffff200008293230: 9402f1ac bl ffff20000834f8e0 <__asan_storeN>
ffff200008293234: 9113c3a2 add x2, x29, #0x4f0

Mark Rutland

Nov 28, 2017, 10:24:11 AM
to Dmitry Vyukov, Andrey Ryabinin, kasan-dev, Alexander Potapenko, LKML, linux-ar...@lists.infradead.org, Dennis Zhou, Fengguang Wu
On Tue, Nov 28, 2017 at 02:13:55PM +0000, Mark Rutland wrote:
> On Tue, Nov 28, 2017 at 01:57:49PM +0100, Dmitry Vyukov wrote:
> > On Tue, Nov 28, 2017 at 1:35 PM, Mark Rutland <mark.r...@arm.com> wrote:
> > > As a heads-up, I'm seeing a number of what appear to be false-positive
> > > use-after-scope warnings when I enable both KCOV and KASAN (inline or outline),
> > > when using the Linaro 17.08 GCC7.1.1 for arm64. So far I haven't spotted these
> > > without KCOV selected, and I'm only seeing these for sanitize-use-after-scope.
> > >
> > > The reports vary depending on configuration even with the same trigger. I'm not
> > > sure if it's the reporting that's misleading, or whether the detection is going
> > > wrong.

> ... it looks suspiciously like something is setting up non-zero shadow
> bytes, but not zeroing them upon return.

It looks like this is the case.

The hack below detects leftover poison on an exception return *before*
the false-positive warning (example splat at the end of the email). With
scripts/Makefile.kasan hacked to not pass
-fsanitize-address-use-after-scope, I see no leftover poison.

Unfortunately, there's not enough information left to say where exactly
that happened.

Given the report that Andrey linked to [1], it looks like the compiler
is doing something wrong, and failing to clear some poison in some
cases. Dennis noted [2] that this appears to be the case where inline
functions are called in a loop.

It sounds like this is a general GCC 7.x problem, on both x86_64 and
arm64. As we don't have a smoking gun, it's still possible that
something else is corrupting the shadow, but it seems unlikely.

[1] https://lkml.kernel.org/r/20171128124534....@wfg-t540p.sh.intel.com
[2] https://lkml.kernel.org/r/20171127210...@localhost.corp.microsoft.com

Thanks,
Mark.

Hack
--------
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6d14b8f29b5f..8191e122d6f4 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -220,6 +220,8 @@ alternative_else_nop_endif
.endm

.macro kernel_exit, el
+ mov x0, sp
+ bl kasan_assert_task_stack_is_clean_below
.if \el != 0
disable_daif

diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 405bba487df5..dab8a51ee52f 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -37,6 +37,8 @@
#include <linux/vmalloc.h>
#include <linux/bug.h>

+#include <asm/stacktrace.h>
+
#include "kasan.h"
#include "../slab.h"

@@ -241,6 +243,33 @@ static __always_inline bool memory_is_poisoned(unsigned long addr, size_t size)
return memory_is_poisoned_n(addr, size);
}

+/*
+ * In some contexts (e.g. when returning from an exception), all shadow beyond
+ * a certain point on the stack should be clear. This helper can be called by
+ * assembly code to verify this is the case.
+ */
+asmlinkage void kasan_assert_task_stack_is_clean_below(unsigned long watermark)
+{
+ unsigned long base;
+
+ /*
+ * This is an arm64-specific hack. This should be fixed properly to
+ * discover and check the bounds of the current stack in an
+ * arch-agnostic manner.
+ */
+ if (!on_task_stack(current, watermark))
+ return;
+
+ /*
+ * Calculate the task stack base address. Avoid using 'current'
+ * because this function is called by early resume code which hasn't
+ * yet set up the percpu register (%gs).
+ */
+ base = watermark & ~(THREAD_SIZE - 1);
+
+ WARN_ON_ONCE(memory_is_poisoned(base, watermark - base));
+}
+
static __always_inline void check_memory_region_inline(unsigned long addr,
size_t size, bool write,
unsigned long ret_ip)
--------
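
The two steps the hack relies on (masking the stack pointer down to the stack base, then checking the shadow in between) can be sketched in plain C as below. The helper names are hypothetical, and the sketch assumes a power-of-two sized, naturally aligned stack, which is what the masking in the hack also assumes. Unlike the WARN_ON_ONCE() above, the scan returns the position of the first stale byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Base of a naturally aligned, power-of-two sized stack containing sp. */
static inline uint64_t stack_base(uint64_t sp, uint64_t thread_size)
{
	return sp & ~(thread_size - 1);
}

/* Index of the first non-zero shadow byte in shadow[0..n), or -1 if clean. */
static long first_poisoned(const uint8_t *shadow, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (shadow[i])
			return (long)i;
	return -1;
}
```

With sp = ffff80092c997ec0 from the splat below and an assumed 32K THREAD_SIZE, stack_base() yields ffff80092c990000, which matches the reported task.stack.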

Splat
--------
[ 186.951300] WARNING: CPU: 1 PID: 2429 at mm/kasan/kasan.c:270 kasan_assert_task_stack_is_clean_below+0x144/0x150
[ 186.961418] Modules linked in:
[ 186.964468] CPU: 1 PID: 2429 Comm: perf Not tainted 4.15.0-rc1-00001-g7780802c256e #6
[ 186.972249] Hardware name: ARM Juno development board (r1) (DT)
[ 186.978133] task: ffff800933fe6900 task.stack: ffff80092c990000
[ 186.984019] pstate: 200003c5 (nzCv DAIF -PAN -UAO)
[ 186.988789] pc : kasan_assert_task_stack_is_clean_below+0x144/0x150
[ 186.995022] lr : ret_fast_syscall+0x34/0x98
[ 186.999177] sp : ffff80092c997ec0
[ 187.002472] x29: ffff80092c997ff0 x28: ffff800933fe6900
[ 187.007760] x27: ffff200009264000 x26: 00000000000000f1
[ 187.013047] x25: 0000000000000124 x24: 0000000000000015
[ 187.018334] x23: 0000000060000000 x22: 0000ffffae4b7554
[ 187.023621] x21: 00000000ffffffff x20: 000060092de30000
[ 187.028908] x19: 0000000000000000 x18: 0000ffffd2ec5330
[ 187.034195] x17: 0000ffffae4b7530 x16: ffff200008270508
[ 187.039482] x15: 0000ffffae538588 x14: 0000000000000000
[ 187.044769] x13: ffffffffffffffff x12: ffffffffffffffff
[ 187.050060] x11: 1ffff00125932f33 x10: ffff100125932f33
[ 187.055349] x9 : dfff200000000000 x8 : dfff200000000008
[ 187.060638] x7 : 1ffff00125932fd7 x6 : ffff100125932fd7
[ 187.065927] x5 : ffff80092c997ebf x4 : ffff100125932fd8
[ 187.071217] x3 : dfff200000000000 x2 : ffff100125932e30
[ 187.076506] x1 : ffff100125932e28 x0 : 00000000000000f8
[ 187.081793] Call trace:
[ 187.084238] kasan_assert_task_stack_is_clean_below+0x144/0x150
[ 187.090122] ---[ end trace 9c3a99d1de859687 ]---
[ 187.212571] ==================================================================
[ 187.219786] BUG: KASAN: use-after-scope in __save_stack_trace+0x1c8/0x2f0
[ 187.226537] Read of size 4 at addr ffff800930e4f048 by task true/2432
[ 187.232935]
[ 187.234430] CPU: 2 PID: 2432 Comm: true Tainted: G W 4.15.0-rc1-00001-g7780802c256e #6
[ 187.243507] Hardware name: ARM Juno development board (r1) (DT)
[ 187.249389] Call trace:
[ 187.251830] dump_backtrace+0x0/0x320
[ 187.255477] show_stack+0x20/0x30
[ 187.258782] dump_stack+0x108/0x174
[ 187.262256] print_address_description+0x60/0x270
[ 187.266936] kasan_report+0x210/0x2f0
[ 187.270584] __asan_load4+0x84/0xa8
[ 187.274059] __save_stack_trace+0x1c8/0x2f0
[ 187.278224] save_stack_trace+0x24/0x30
[ 187.282044] kasan_kmalloc+0xd0/0x180
[ 187.285688] kasan_slab_alloc+0x14/0x20
[ 187.289508] kmem_cache_alloc+0x128/0x1e8
[ 187.293499] perf_event_mmap+0x2dc/0x968
[ 187.297405] mmap_region+0x24c/0xa60
[ 187.300963] do_mmap+0x404/0x640
[ 187.304178] vm_mmap_pgoff+0x15c/0x190
[ 187.307909] vm_mmap+0x70/0xb0
[ 187.310951] elf_map+0x114/0x150
[ 187.314165] load_elf_binary+0x728/0x1b84
[ 187.318158] search_binary_handler+0xe4/0x3b8
[ 187.322495] do_execveat_common.isra.12+0xaa4/0xc60
[ 187.327349] SyS_execve+0x48/0x60
[ 187.330650] el0_svc_naked+0x20/0x24
[ 187.334202]
[ 187.335685] The buggy address belongs to the page:
[ 187.340453] page:ffff7e0024c393c0 count:0 mapcount:0 mapping: (null) index:0x0
[ 187.348414] flags: 0x1fffc00000000000()
[ 187.352240] raw: 1fffc00000000000 0000000000000000 0000000000000000 00000000ffffffff
[ 187.359947] raw: 0000000000000000 ffff7e0024c393e0 0000000000000000 0000000000000000
[ 187.367643] page dumped because: kasan: bad access detected
[ 187.373178]
[ 187.374661] Memory state around the buggy address:
[ 187.379428] ffff800930e4ef00: f1 f1 f8 f2 f2 f2 f2 f2 f2 f2 00 00 f2 f2 f2 f2
[ 187.386612] ffff800930e4ef80: f2 f2 00 00 f2 f2 f3 f3 f3 f3 f8 f8 f8 f8 f8 f8
[ 187.393795] >ffff800930e4f000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 00 00 00 00 00 00
[ 187.400973] ^
[ 187.406516] ffff800930e4f080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 187.413699] ffff800930e4f100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 187.420877] ==================================================================
--------

Dmitry Vyukov

Nov 28, 2017, 12:52:53 PM
to Mark Rutland, Andrey Ryabinin, kasan-dev, Alexander Potapenko, LKML, linux-ar...@lists.infradead.org, Dennis Zhou, Fengguang Wu
On Tue, Nov 28, 2017 at 4:24 PM, Mark Rutland <mark.r...@arm.com> wrote:
>> > > As a heads-up, I'm seeing a number of what appear to be false-positive
>> > > use-after-scope warnings when I enable both KCOV and KASAN (inline or outline),
>> > > when using the Linaro 17.08 GCC7.1.1 for arm64. So far I haven't spotted these
>> > > without KCOV selected, and I'm only seeing these for sanitize-use-after-scope.
>> > >
>> > > The reports vary depending on configuration even with the same trigger. I'm not
>> > > sure if it's the reporting that's misleading, or whether the detection is going
>> > > wrong.
>
>> ... it looks suspiciously like something is setting up non-zero shadow
>> bytes, but not zeroing them upon return.
>
> It looks like this is the case.
>
> The hack below detects leftover poison on an exception return *before*
> the false-positive warning (example splat at the end of the email). With
> scripts/Makefile.kasan hacked to not pass
> -fsanitize-address-use-after-scope, I see no leftover poison.
>
> Unfortunately, there's not enough information left to say where exactly
> that happened.
>
> Given the report that Andrey linked to [1], it looks like the compiler
> is doing something wrong, and failing to clear some poison in some
> cases. Dennis noted [2] that this appears to be the case where inline
> functions are called in a loop.
>
> It sounds like this is a general GCC 7.x problem, on both x86_64 and
> arm64. As we don't have a smoking gun, it's still possible that
> something else is corrupting the shadow, but it seems unlikely.



We use gcc 7.1 extensively on x86_64 and have not seen any problems.

ASAN stack instrumentation actually contains information about frames.
I just never got around to using it in KASAN. But user-space ASAN
prints the following on stack bugs:

Address 0x7ffdb1c75140 is located in stack of thread T0 at offset 64 in frame
#0 0x527fff in main test.c:5

This frame has 2 object(s):
[32, 40) 'p'
[64, 68) 'x' <== Memory access at offset 64 is inside this variable

Function prologue contains code similar to this:

528062: 48 ba f0 7f 52 00 00 movabs $0x527ff0,%rdx
52806c: 48 be 9c e5 53 00 00 movabs $0x53e59c,%rsi
528076: 48 89 c7 mov %rax,%rdi
528079: 48 83 c7 20 add $0x20,%rdi
52807d: 49 89 c0 mov %rax,%r8
528080: 49 83 c0 40 add $0x40,%r8
528084: 48 c7 00 b3 8a b5 41 movq $0x41b58ab3,(%rax)
52808b: 48 89 70 08 mov %rsi,0x8(%rax)
52808f: 48 89 50 10 mov %rdx,0x10(%rax)

Here 0x41b58ab3 is the marker of the frame start, and after it 0x527ff0 and
0x53e59c should be pointers to globals that contain the function name and
other aux information. Note that this is on the stack itself, not in the shadow.
If you can find any 0x41b58ab3 in the corrupted part of the stack, you
can figure out which function has left garbage.

Ideally, we would check that the stack does not contain garbage at the
beginning of each function _before_ a new asan frame is created. That would
increase the chances of finding the 0x41b58ab3 marker and pinpointing the
offending function. Unfortunately, I can't think of any existing
hook... wait, __fentry__ seems to be in the perfect place.
One of these global pointers after the marker probably points to
struct kasan_global. I don't remember what the other one is.
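
A minimal sketch of the search described above, treating the stack as an array of machine words and walking towards lower addresses until the marker is found. The function name and the bounded-search parameter are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define ASAN_FRAME_MAGIC 0x41b58ab3UL

/*
 * Walk down from word index 'start', examining at most 'max_words' words.
 * Returns the index of the first word equal to the frame marker,
 * or -1 if none is found in that window.
 */
static long find_frame_marker(const uintptr_t *stack, size_t start,
			      size_t max_words)
{
	size_t i;

	for (i = 0; i < max_words && i <= start; i++)
		if (stack[start - i] == ASAN_FRAME_MAGIC)
			return (long)(start - i);
	return -1;
}
```

The two words following a hit would then be the candidates for the per-frame pointers mentioned above. A value on the stack can of course equal the marker by coincidence, so a hit is a hint rather than proof.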

Mark Rutland

Nov 29, 2017, 6:26:34 AM
to Dmitry Vyukov, Andrey Ryabinin, kasan-dev, Alexander Potapenko, LKML, linux-ar...@lists.infradead.org, Dennis Zhou, Fengguang Wu
On Tue, Nov 28, 2017 at 06:52:32PM +0100, Dmitry Vyukov wrote:
> On Tue, Nov 28, 2017 at 4:24 PM, Mark Rutland <mark.r...@arm.com> wrote:

> >> ... it looks suspiciously like something is setting up non-zero shadow
> >> bytes, but not zeroing them upon return.
> >
> > It looks like this is the case.
> >
> > The hack below detects leftover poison on an exception return *before*
> > the false-positive warning (example splat at the end of the email). With
> > scripts/Makefile.kasan hacked to not pass
> > -fsanitize-address-use-after-scope, I see no leftover poison.
> >
> > Unfortunately, there's not enough information left to say where exactly
> > that happened.

> ASAN stack instrumentation actually contains information about frames.
> I just never got around to using it in KASAN. But user-space ASAN
> prints the following on stack bugs:
>
> Address 0x7ffdb1c75140 is located in stack of thread T0 at offset 64 in frame
> #0 0x527fff in main test.c:5
>
> This frame has 2 object(s):
> [32, 40) 'p'
> [64, 68) 'x' <== Memory access at offset 64 is inside this variable
>
> Function prologue contains code similar to this:
>
> 528062: 48 ba f0 7f 52 00 00 movabs $0x527ff0,%rdx
> 52806c: 48 be 9c e5 53 00 00 movabs $0x53e59c,%rsi
> 528076: 48 89 c7 mov %rax,%rdi
> 528079: 48 83 c7 20 add $0x20,%rdi
> 52807d: 49 89 c0 mov %rax,%r8
> 528080: 49 83 c0 40 add $0x40,%r8
> 528084: 48 c7 00 b3 8a b5 41 movq $0x41b58ab3,(%rax)
> 52808b: 48 89 70 08 mov %rsi,0x8(%rax)
> 52808f: 48 89 50 10 mov %rdx,0x10(%rax)
>
> Here 0x41b58ab3 is the marker of the frame start, and after it 0x527ff0 and
> 0x53e59c should be pointers to globals that contain the function name and
> other aux information. Note that this is on the stack itself, not in the shadow.
> If you can find any 0x41b58ab3 in the corrupted part of the stack, you
> can figure out which function has left garbage.

Thanks for the info! I'll try to give this a go, but I'm probably not
going to have the chance to investigate much this week.

I'm afraid I'm not that good at reading x86 assembly. IIUC there are
records on the stack something like:

struct record {
u64 magic; /* 0x41b58ab3 */
char *func_name;
struct aux *data;
};

... is that correct?

Is there any documentation on this that I can refer to?

Thanks,
Mark.

Dmitry Vyukov

Nov 29, 2017, 6:41:27 AM
to Mark Rutland, Andrey Ryabinin, kasan-dev, Alexander Potapenko, LKML, linux-ar...@lists.infradead.org, Dennis Zhou, Fengguang Wu
I am not aware of any documentation other than the code. I think the
simplest starting point is AsanThread::GetStackFrameAccessByAddr() here:
http://llvm.org/viewvc/llvm-project/compiler-rt/trunk/lib/asan/asan_thread.cc?revision=310432&view=markup
It finds the struct with the description from the bad access address.
Looking at the code, yes, there is the magic, then a char* frame
description, and then the PC where the frame was allocated (usually the
function prologue, but I think it can also point to an alloca).
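
Written out as a C struct, that layout would look roughly like the sketch below. The struct and field names are invented for illustration; only the order (magic, description string, PC) comes from the description above.

```c
#include <stdint.h>

#define ASAN_FRAME_MAGIC 0x41b58ab3UL

/* Hypothetical name: three consecutive words at the start of an asan frame. */
struct asan_frame_header {
	uintptr_t   magic;		/* 0x41b58ab3 */
	const char *description;	/* frame description string */
	uintptr_t   pc;			/* where the frame was allocated */
};

/* Returns nonzero if the header carries the expected marker. */
static inline int frame_header_valid(const struct asan_frame_header *h)
{
	return h->magic == ASAN_FRAME_MAGIC;
}
```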

Andrey Ryabinin

Nov 29, 2017, 11:50:47 AM
to Dmitry Vyukov, Mark Rutland, kasan-dev, Alexander Potapenko, LKML, linux-ar...@lists.infradead.org, Dennis Zhou, Fengguang Wu
Yeah, it's probably two different problems.

Today kbuild reported another use-after-scope - http://lkml.kernel.org/r/<20171129052106....@wfg-t540p.sh.intel.com>
There is no struct leak plugin, and kcov instrumentation is also off. It's hard to tell whether it's a false positive or not; the code is a mess.
So until proven otherwise, I tend to think that this time it's a real bug.
.config attached, if someone wants to look. It's easily reproducible: just boot qemu and wait.




Dmitry Vyukov

Nov 29, 2017, 1:57:38 PM
to Andrey Ryabinin, Mark Rutland, kasan-dev, Alexander Potapenko, LKML, linux-ar...@lists.infradead.org, Dennis Zhou, Fengguang Wu
I hacked a quick prototype for printing frame info:

--- a/mm/kasan/report.c
+++ b/mm/kasan/report.c
@@ -289,6 +289,7 @@ static void print_shadow_for_address(const void *addr)
int i;
const void *shadow = kasan_mem_to_shadow(addr);
const void *shadow_row;
+ unsigned long *ptr;

shadow_row = (void *)round_down((unsigned long)shadow,
SHADOW_BYTES_PER_ROW)
@@ -320,6 +321,18 @@ static void print_shadow_for_address(const void *addr)

shadow_row += SHADOW_BYTES_PER_ROW;
}
+
+
+ ptr = (unsigned long *)((unsigned long)addr & ~7);
+ for (i = 0; i < 1000; i++, ptr--) {
+ if (*ptr == 0x41b58ab3) {
+ pr_err("\n");
+ pr_err("frame offset: %lu\n", (unsigned long)addr - (unsigned long)ptr);
+ pr_err("desc: '%s'\n", (const char*)*(ptr+1));
+ pr_err("func: %pS\n", (void*)*(ptr+2));
+ break;
+ }
+ }
}



And this gave me:


[ 26.763495] ==================================================================
[ 26.764454] BUG: KASAN: use-after-scope in __drm_mm_interval_first+0xc0/0x1e2
[ 26.765297] Read of size 8 at addr ffff88006cb3fbe0 by task swapper/0/1
[ 26.766081]
[ 26.766278] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.14.0-04319-gd17a1d97dc20-dirty #12
[ 26.767760] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 26.769419] Call Trace:
[ 26.769895] dump_stack+0xdb/0x17a
[ 26.770152] ? _atomic_dec_and_lock+0x12f/0x12f
[ 26.770152] ? show_regs_print_info+0x5b/0x5b
[ 26.770152] ? kasan_report+0x4d/0x247
[ 26.770152] ? __drm_mm_interval_first+0xc0/0x1e2
[ 26.770152] print_address_description+0x9a/0x232
[ 26.770152] ? __drm_mm_interval_first+0xc0/0x1e2
[ 26.770152] kasan_report+0x21e/0x247
[ 26.770152] __asan_report_load8_noabort+0x14/0x16
[ 26.770152] __drm_mm_interval_first+0xc0/0x1e2
[ 26.770152] assert_continuous+0x13e/0x22f
[ 26.770152] __igt_insert+0x665/0xc87
[ 26.770152] ? igt_bottomup+0xaa0/0xaa0
[ 26.770152] ? sched_clock_local+0x3c/0xfb
[ 26.770152] ? find_held_lock+0x33/0x103
[ 26.770152] ? next_prime_number+0x318/0x362
[ 26.770152] ? rcu_irq_enter_disabled+0xd/0xd
[ 26.770152] ? next_prime_number+0x337/0x362
[ 26.770152] igt_replace+0x4b/0xb3
[ 26.770152] test_drm_mm_init+0x118/0x172
[ 26.770152] ? drm_kms_helper_init+0xb/0xb
[ 26.770152] do_one_initcall+0x10f/0x21f
[ 26.770152] ? initcall_blacklisted+0x185/0x185
[ 26.770152] ? down_write_nested+0xa1/0x164
[ 26.770152] ? kasan_poison_shadow+0x2f/0x31
[ 26.770152] ? kasan_unpoison_shadow+0x14/0x35
[ 26.770152] kernel_init_freeable+0x2ae/0x339
[ 26.770152] ? rest_init+0x250/0x250
[ 26.770152] kernel_init+0xc/0x105
[ 26.770152] ? rest_init+0x250/0x250
[ 26.770152] ret_from_fork+0x24/0x30
[ 26.770152]
[ 26.770152] The buggy address belongs to the page:
[ 26.770152] page:ffff88007f39c5c8 count:0 mapcount:0 mapping: (null) index:0x0
[ 26.770152] flags: 0x1a01fff800000()
[ 26.770152] raw: 0001a01fff800000 0000000000000000 0000000000000000 00000000ffffffff
[ 26.770152] raw: ffff88007f39c5e8 ffff88007f39c5e8 0000000000000000
[ 26.770152] page dumped because: kasan: bad access detected
[ 26.790299]
[ 26.790299] Memory state around the buggy address:
[ 26.790299] ffff88006cb3fa80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
[ 26.790299] ffff88006cb3fb00: f1 f1 f1 00 f2 f2 f2 f2 f2 f2 f2 00 00 f2 f2 f2
[ 26.790299] >ffff88006cb3fb80: f2 f2 f2 f8 f8 f2 f2 f2 f2 f2 f2 f8 f8 f8 f8 f8
[ 26.790299] ^
[ 26.790299] ffff88006cb3fc00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f2
[ 26.790299] ffff88006cb3fc80: f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 26.790299]
[ 26.790299] frame offset: 232
[ 26.790299] desc: '5 32 8 3 __u 96 16 4 prng 160 16 7 state__ 224 160 3 tmp 416 224 2 mm '
[ 26.790299] func: __igt_insert+0x0/0xc87
[ 26.790299] ==================================================================


That desc string is: number of local objects, then for each object:
offset, size, name length, name.
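
As a sketch of how that format can be decoded, the user-space helper below walks a description string and reports which object covers a given frame offset. The function name and its interface are assumptions made for this example; nothing like it exists in the prototype above.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Find the object covering frame_off in a frame description of the form
 * "<count> <offset> <size> <name_len> <name> ...".
 * On a hit, copies the name into 'name' (capacity name_cap) and returns 1.
 * Returns 0 if no object covers the offset or the string is malformed.
 */
static int frame_object_at(const char *desc, unsigned long frame_off,
			   char *name, size_t name_cap)
{
	unsigned long count, i;
	char *p, *q;

	count = strtoul(desc, &p, 10);
	for (i = 0; i < count; i++) {
		unsigned long off, size, len;

		off = strtoul(p, &q, 10);
		if (q == p)
			return 0;
		p = q;
		size = strtoul(p, &q, 10);
		if (q == p)
			return 0;
		p = q;
		len = strtoul(p, &q, 10);
		if (q == p || *q != ' ')
			return 0;
		p = q + 1;		/* single space before the name */
		if (strlen(p) < len)
			return 0;
		if (frame_off >= off && frame_off < off + size) {
			if (len >= name_cap)
				return 0;
			memcpy(name, p, len);
			name[len] = '\0';
			return 1;
		}
		p += len;		/* skip the name, move to next object */
	}
	return 0;
}
```

Fed the desc string from the report and the frame offset 232, this lands in the object at offset 224 with size 160.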

So that's variable tmp in __igt_insert. Not too surprising looking at the code:


for (mode = insert_modes; mode->name; mode++) {
for (n = 0; n < count; n++) {
struct drm_mm_node tmp;

node = replace ? &tmp : &nodes[n];
memset(node, 0, sizeof(*node));
if (!expect_insert(&mm, node, size, 0, n, mode)) {
pr_err("%s insert failed, size %llu step %d\n",
mode->name, size, n);
goto out;
}

if (replace) {
drm_mm_replace_node(&tmp, &nodes[n]);
if (drm_mm_node_allocated(&tmp)) {
pr_err("replaced old-node still allocated! step %d\n",
n);
goto out;
}

if (!assert_node(&nodes[n], &mm, size, 0, n)) {
pr_err("replaced node did not inherit parameters, size %llu step %d\n",
size, n);
goto out;
}

if (tmp.start != nodes[n].start) {
pr_err("replaced node mismatch location expected [%llx + %llx], found [%llx + %llx]\n",
tmp.start, size,
nodes[n].start, nodes[n].size);
goto out;
}
}
}



I guess we need to finally do this for real. Also print global names.

Arnd Bergmann

Nov 29, 2017, 3:17:45 PM
to Mark Rutland, Dmitry Vyukov, Andrey Ryabinin, kasan-dev, Alexander Potapenko, LKML, Linux ARM, Dennis Zhou, Fengguang Wu
On Tue, Nov 28, 2017 at 4:24 PM, Mark Rutland <mark.r...@arm.com> wrote:
> On Tue, Nov 28, 2017 at 02:13:55PM +0000, Mark Rutland wrote:
>> On Tue, Nov 28, 2017 at 01:57:49PM +0100, Dmitry Vyukov wrote:
>> > On Tue, Nov 28, 2017 at 1:35 PM, Mark Rutland <mark.r...@arm.com> wrote:
>> > > As a heads-up, I'm seeing a number of what appear to be false-positive
>> > > use-after-scope warnings when I enable both KCOV and KASAN (inline or outline),
>> > > when using the Linaro 17.08 GCC7.1.1 for arm64. So far I haven't spotted these
>> > > without KCOV selected, and I'm only seeing these for sanitize-use-after-scope.
>> > >
>> > > The reports vary depending on configuration even with the same trigger. I'm not
>> > > sure if it's the reporting that's misleading, or whether the detection is going
>> > > wrong.
>
>> ... it looks suspiciously like something is setting up non-zero shadow
>> bytes, but not zeroing them upon return.
>
> It looks like this is the case.
>
> The hack below detects leftover poison on an exception return *before*
> the false-positive warning (example splat at the end of the email). With
> scripts/Makefile.kasan hacked to not pass
> -fsanitize-address-use-after-scope, I see no leftover poison.

That reminds me that we are still missing my patch to turn off
-fsanitize-address-use-after-scope by default and instead re-enable
CONFIG_FRAME_WARN when KASAN is turned on.

I spent about a year hunting down all the instances that produce more
than 2KB stack frames with KASAN (including asan-stack), they should
be disabled now, but we still have some seriously large stack frames with
-fsanitize-address-use-after-scope.

Maybe it's better to just completely disable -fsanitize-address-use-after-scope
when it has multiple independent problems.

Arnd

Dmitry Vyukov

Nov 29, 2017, 3:56:39 PM
to Arnd Bergmann, Mark Rutland, Andrey Ryabinin, kasan-dev, Alexander Potapenko, LKML, Linux ARM, Dennis Zhou, Fengguang Wu
On Wed, Nov 29, 2017 at 9:17 PM, Arnd Bergmann <ar...@arndb.de> wrote:
>>> > > As a heads-up, I'm seeing a number of what appear to be false-positive
>>> > > use-after-scope warnings when I enable both KCOV and KASAN (inline or outline),
>>> > > when using the Linaro 17.08 GCC7.1.1 for arm64. So far I haven't spotted these
>>> > > without KCOV selected, and I'm only seeing these for sanitize-use-after-scope.
>>> > >
>>> > > The reports vary depending on configuration even with the same trigger. I'm not
>>> > > sure if it's the reporting that's misleading, or whether the detection is going
>>> > > wrong.
>>
>>> ... it looks suspiciously like something is setting up non-zero shadow
>>> bytes, but not zeroing them upon return.
>>
>> It looks like this is the case.
>>
>> The hack below detects leftover poison on an exception return *before*
>> the false-positive warning (example splat at the end of the email). With
>> scripts/Makefile.kasan hacked to not pass
>> -fsanitize-address-use-after-scope, I see no leftover poison.
>
> That reminds me that we are still missing my patch to turn off
> -fsanitize-address-use-after-scope by default and instead re-enable
> CONFIG_FRAME_WARN when KASAN is turned on.
>
> I spent about a year hunting down all the instances that produce more
> than 2KB stack frames with KASAN (including asan-stack), they should
> be disabled now, but we still have some seriously large stack frames with
> -fsanitize-address-use-after-scope.
>
> Maybe it's better to just completely disable -fsanitize-address-use-after-scope
> when it has multiple independent problems.


This one is not a problem with KASAN. KASAN has detected a very real
and subtle bug in the code.

Mark Rutland

Nov 30, 2017, 4:30:21 AM
to Dmitry Vyukov, Andrey Ryabinin, kasan-dev, Alexander Potapenko, LKML, linux-ar...@lists.infradead.org, Dennis Zhou, Fengguang Wu
On Tue, Nov 28, 2017 at 06:52:32PM +0100, Dmitry Vyukov wrote:
FWIW, it looks like ASAN does go wrong on x86 under some conditions:

https://lkml.kernel.org/r/20171129175...@big-sky.attlocal.net

I note that in all cases reported so far, there's a GCC plugin involved,
so perhaps there's some bad interaction between the compiler passes.

Thanks,
Mark.