mm: GPF in find_get_pages_tag

46 views
Skip to first unread message

Dmitry Vyukov

unread,
Jul 5, 2016, 7:39:44 AM7/5/16
to Andrew Morton, Jan Kara, ross.z...@linux.intel.com, Kirill A. Shutemov, linu...@kvack.org, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal, Andrey Ryabinin, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
Hello,

The following program triggers GPF in find_get_pages_tag if run in
parallel loop for minutes:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 2 PID: 301 Comm: a.out Tainted: G W 4.7.0-rc5+ #28
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880063d12440 ti: ffff880067350000 task.ti: ffff880067350000
RIP: 0010:[<ffffffff816951a4>]
[< inline >] radix_tree_next_slot include/linux/radix-tree.h:473
[<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
RSP: 0018:ffff880067357840 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff880063d12c80
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000008
RBP: ffff880067357910 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff89f06360 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000000 R15: ffffed0007058ee5
FS: 00007f56e017c700(0000) GS:ffff88006d400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f56df97ae78 CR3: 0000000063d9e000 CR4: 00000000000006e0
Stack:
ffffffff81694efc ffff8800673578a8 0000010267357860 ffff880067357a50
ffff880065986aa0 1ffff1000ce6af11 0000000e00000000 ffff880067357a00
0000000000000003 0000000041b58ab3 ffffffff87e2a722 ffffffff81694e70
Call Trace:
[<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
[<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90
fs/ext4/inode.c:2516
[<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
[<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
[<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
[<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
[<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
[<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
[< inline >] vfs_fsync fs/sync.c:209
[<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
[< inline >] SYSC_fdatasync fs/sync.c:232
[<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
[<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
Code: 85 70 ff ff ff 49 d1 ec 4d 85 e4 4c 89 65 a8 74 65 e8 51 06 f0
ff 49 8d 7e 08 48 be 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80>
3c 30 00 0f 85 9c 05 00 00 4d 8b 6e 08 4c 89 eb 83 e3 03 48
RIP [< inline >] radix_tree_next_slot include/linux/radix-tree.h:473
RIP [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
RSP <ffff880067357840>
---[ end trace 33a0cc4dd9a49a67 ]---



// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int fd;
char buf[8192];
char filename[256];

void* thr(void* arg)
{
switch ((long)arg) {
case 0:
write(fd, buf, 0x1001ul);
break;
case 1:
fdatasync(fd);
break;
case 2:
ftruncate(fd, 2);
break;
case 3:
write(fd, buf, 0x20ul);
break;
case 5:
fd = open(filename, 0x50042ul, 0x41ul);
break;
}
return 0;
}

int main()
{
long i;
pthread_t th[10];

srand(getpid());
sprintf(filename, "./file%d", getpid());
fd = open(filename, 0x50042ul, 0x41ul);
for (i = 0; i < 10; i++) {
pthread_create(&th[i], 0, thr, (void*)(i % 5));
usleep(rand() % 10);
}
for (i = 0; i < 10; i++)
pthread_join(th[i], 0);
unlink(filename);
return 0;
}

The faulting instruction is:
ffffffff816951a4: 80 3c 30 00 cmpb $0x0,(%rax,%rsi,1)
So this is KASAN shadow check for NULL address.


The previous taint is not relevant, it is:

[ 74.786477] ------------[ cut here ]------------
[ 74.786885] WARNING: CPU: 2 PID: 717 at lib/stackdepot.c:119
depot_save_stack+0x34f/0x5b0
[ 74.787196] Stack depot reached limit capacity


On commit 1a0a02d1efa066001fd315c1b4df583d939fa2c4 (Jun 30).

Andrey Ryabinin

unread,
Jul 14, 2016, 7:22:54 AM7/14/16
to Andrew Morton, Jan Kara, ross.z...@linux.intel.com, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Sasha Levin, linux-...@vger.kernel.org, Andrey Ryabinin, Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, sta...@vger.kernel.org
radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
leading to crash:

RIP: [< inline >] radix_tree_next_slot include/linux/radix-tree.h:473
[<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
....
Call Trace:
[<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
[<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
[<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
[<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
[<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
[<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
[<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
[<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
[< inline >] vfs_fsync fs/sync.c:209
[<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
[< inline >] SYSC_fdatasync fs/sync.c:232
[<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
[<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207

We must reset iterator's tags to bail out from radix_tree_next_slot() and
go to the slow-path in radix_tree_next_chunk().

Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
Signed-off-by: Andrey Ryabinin <arya...@virtuozzo.com>
Reported-by: Dmitry Vyukov <dvy...@google.com>
Cc: Konstantin Khlebnikov <koc...@gmail.com>
Cc: Matthew Wilcox <wi...@linux.intel.com>
Cc: Hugh Dickins <hu...@google.com>
Cc: <sta...@vger.kernel.org>
---
include/linux/radix-tree.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index cb4b7e8..eca6f62 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -407,6 +407,7 @@ static inline __must_check
void **radix_tree_iter_retry(struct radix_tree_iter *iter)
{
iter->next_index = iter->index;
+ iter->tags = 0;
return NULL;
}

--
2.7.3

Konstantin Khlebnikov

unread,
Jul 14, 2016, 8:21:40 AM7/14/16
to Andrey Ryabinin, Andrew Morton, Jan Kara, ross.z...@linux.intel.com, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable
ACK

Originally retry could happen only at index 0 when first indirect node
installed:
in this case tags holds only 1 bit. Seems like now this happends at any index.

Ross Zwisler

unread,
Jul 14, 2016, 6:25:28 PM7/14/16
to Andrey Ryabinin, Andrew Morton, Jan Kara, ross.z...@linux.intel.com, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Sasha Levin, linux-...@vger.kernel.org, Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, sta...@vger.kernel.org
On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
> leading to crash:
>
> RIP: [< inline >] radix_tree_next_slot include/linux/radix-tree.h:473
> [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> ....
> Call Trace:
> [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
> [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
> [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
> [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
> [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
> [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
> [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
> [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
> [< inline >] vfs_fsync fs/sync.c:209
> [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
> [< inline >] SYSC_fdatasync fs/sync.c:232
> [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
> [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>
> We must reset iterator's tags to bail out from radix_tree_next_slot() and
> go to the slow-path in radix_tree_next_chunk().

This analysis doesn't make sense to me. In find_get_pages_tag(), when we call
radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
does a 'continue'. This will hop to the next iteration of the
radix_tree_for_each_tagged() loop, which will very check the exit condition of
the for() loop:

#define radix_tree_for_each_tagged(slot, root, iter, start, tag) \
for (slot = radix_tree_iter_init(iter, start) ; \
slot || (slot = radix_tree_next_chunk(root, iter, \
RADIX_TREE_ITER_TAGGED | tag)) ; \
slot = radix_tree_next_slot(slot, iter, \
RADIX_TREE_ITER_TAGGED))

So, we'll run the
slot || (slot = radix_tree_next_chunk(root, iter, \
RADIX_TREE_ITER_TAGGED | tag)) ; \

bit first. 'slot' is NULL, so we'll set it via radix_tree_next_chunk(). At
this point radix_tree_next_slot() hasn't been called.

radix_tree_next_chunk() will set up the iter->index, iter->next_index and
iter->tags before it returns. The next iteration of the loop in
find_get_pages_tag() will use the non-NULL slot provided by
radix_tree_next_chunk(), and only after that iteration will we call
radix_tree_next_slot() again. By then iter->tags should be up to date.

Do you have a test setup that reliably fails without this code but passes when
you zero out iter->tags?

I've been looking at this as well, but haven't been able to get a reliable
reproducer in my test setup.

Andrey Ryabinin

unread,
Jul 15, 2016, 4:52:03 AM7/15/16
to Ross Zwisler, Andrew Morton, Jan Kara, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Sasha Levin, linux-...@vger.kernel.org, Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, sta...@vger.kernel.org
This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
and only after that goes the condition statement.


> 'slot' is NULL, so we'll set it via radix_tree_next_chunk(). At
> this point radix_tree_next_slot() hasn't been called.
>
> radix_tree_next_chunk() will set up the iter->index, iter->next_index and
> iter->tags before it returns. The next iteration of the loop in
> find_get_pages_tag() will use the non-NULL slot provided by
> radix_tree_next_chunk(), and only after that iteration will we call
> radix_tree_next_slot() again. By then iter->tags should be up to date.
>
> Do you have a test setup that reliably fails without this code but passes when
> you zero out iter->tags?
>


Yup, I run Dmitry's reproducer in a parallel loop:
$ while true; do ./a.out & done

Usually it takes just couple minutes maximum.

Ross Zwisler

unread,
Jul 15, 2016, 3:00:42 PM7/15/16
to Andrey Ryabinin, Ross Zwisler, Andrew Morton, Jan Kara, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Sasha Levin, linux-...@vger.kernel.org, Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, sta...@vger.kernel.org
Right...*sigh*... Thanks for the sanity check. :)

> > 'slot' is NULL, so we'll set it via radix_tree_next_chunk(). At
> > this point radix_tree_next_slot() hasn't been called.
> >
> > radix_tree_next_chunk() will set up the iter->index, iter->next_index and
> > iter->tags before it returns. The next iteration of the loop in
> > find_get_pages_tag() will use the non-NULL slot provided by
> > radix_tree_next_chunk(), and only after that iteration will we call
> > radix_tree_next_slot() again. By then iter->tags should be up to date.
> >
> > Do you have a test setup that reliably fails without this code but passes when
> > you zero out iter->tags?
> >
>
>
> Yup, I run Dmitry's reproducer in a parallel loop:
> $ while true; do ./a.out & done
>
> Usually it takes just couple minutes maximum.

Cool - I was able to get this to work on my system as well by upping the
thread count.

In looking at this more, I agree that your patch fixes this particular bug,
but I think that ultimately we might want something more general.

IIUC, the real issue is that we shouldn't be running through
radix_tree_next_slot() with a NULL 'slot' parameter. In the end I think it's
fine to zero out iter->tags in radix_tree_iter_retry(), but really we want to
guarantee that we just bail out of radix_tree_next_slot() if we have a NULL
'slot'.

I've run this patch in my test setup, and it fixes the reproducer provided by
Dmitry. I've also run xfstests against it with out any failures.

--- 8< ---
From 533beefac12f61f467aeb72e2d2c46685247b9bc Mon Sep 17 00:00:00 2001
From: Ross Zwisler <ross.z...@linux.intel.com>
Date: Fri, 15 Jul 2016 12:46:38 -0600
Subject: [PATCH] radix-tree: 'slot' can be NULL in radix_tree_next_slot()

There are four cases I can see where we could end up with a NULL 'slot' in
radix_tree_next_slot() (there might be more):

1) radix_tree_iter_retry() via a non-tagged iteration like
radix_tree_for_each_slot(). In this case we currently aren't seeing a bug
because radix_tree_iter_retry() sets

iter->next_index = iter->index;

which means that in in the else case in radix_tree_next_slot(), 'count' is
zero, so we skip over the while() loop and effectively just return NULL
without ever dereferencing 'slot'.

2) radix_tree_iter_retry() via tagged iteration like
radix_tree_for_each_tagged(). With the current code this case is
unhandled and we have seen it result in a kernel crash when we dereference
'slot'.

3) radix_tree_iter_next() via via a non-tagged iteration like
radix_tree_for_each_slot(). This currently happens in shmem_tag_pins()
and shmem_partial_swap_usage().

I think that this case is currently unhandled. Unlike with
radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
up executing code in the while() loop in radix_tree_next_slot() that assumes
'slot' is valid.

I haven't actually seen this crash on a test setup, but I don't think the
current code is safe.

4) radix_tree_iter_next() via tagged iteration like
radix_tree_for_each_tagged(). This happens in shmem_wait_for_pins().

radix_tree_iter_next() zeros out iter->tags, so we end up exiting
radix_tree_next_slot() here:

if (flags & RADIX_TREE_ITER_TAGGED) {
void *canon = slot;

iter->tags >>= 1;
if (unlikely(!iter->tags))
return NULL;

Really we want to guarantee that we just bail out of
radix_tree_next_slot() if we have a NULL 'slot'. This is a more explicit
way of handling all the 4 above cases.

Signed-off-by: Ross Zwisler <ross.z...@linux.intel.com>
---
include/linux/radix-tree.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index cb4b7e8..840308d 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
static __always_inline void **
radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
{
+ if (unlikely(!slot))
+ return NULL;
+
if (flags & RADIX_TREE_ITER_TAGGED) {
void *canon = slot;

--
2.9.0

Ross Zwisler

unread,
Jul 15, 2016, 3:03:18 PM7/15/16
to Dmitry Vyukov, Andrew Morton, Jan Kara, ross.z...@linux.intel.com, Kirill A. Shutemov, linu...@kvack.org, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal, Andrey Ryabinin, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
This open() code is unreachable because the thread argument will only be 0-4,
right? Should this be "case 4"?

Dmitry Vyukov

unread,
Jul 15, 2016, 3:08:09 PM7/15/16
to syzkaller, Andrew Morton, Jan Kara, ross.z...@linux.intel.com, Kirill A. Shutemov, linu...@kvack.org, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal, Andrey Ryabinin, Kostya Serebryany, Alexander Potapenko, Sasha Levin
I am not sure. I think it I just copy-pasted the program that
triggered the crash for me. Andrey should have a valid reproducer, in
the other thread he said that he can reproduce it. Andrey, did you
change 5 to 4?



>> }
>> return 0;
>> }
>>
>> int main()
>> {
>> long i;
>> pthread_t th[10];
>>
>> srand(getpid());
>> sprintf(filename, "./file%d", getpid());
>> fd = open(filename, 0x50042ul, 0x41ul);
>> for (i = 0; i < 10; i++) {
>> pthread_create(&th[i], 0, thr, (void*)(i % 5));
>> usleep(rand() % 10);
>> }
>> for (i = 0; i < 10; i++)
>> pthread_join(th[i], 0);
>> unlink(filename);
>> return 0;
>> }
>>
>> The faulting instruction is:
>> ffffffff816951a4: 80 3c 30 00 cmpb $0x0,(%rax,%rsi,1)
>> So this is KASAN shadow check for NULL address.
>>
>>
>> The previous taint is not relevant, it is:
>>
>> [ 74.786477] ------------[ cut here ]------------
>> [ 74.786885] WARNING: CPU: 2 PID: 717 at lib/stackdepot.c:119
>> depot_save_stack+0x34f/0x5b0
>> [ 74.787196] Stack depot reached limit capacity
>>
>>
>> On commit 1a0a02d1efa066001fd315c1b4df583d939fa2c4 (Jun 30).
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Ross Zwisler

unread,
Jul 15, 2016, 4:21:45 PM7/15/16
to Dmitry Vyukov, syzkaller, Andrew Morton, Jan Kara, ross.z...@linux.intel.com, Kirill A. Shutemov, linu...@kvack.org, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal, Andrey Ryabinin, Kostya Serebryany, Alexander Potapenko, Sasha Levin
Ah, sorry if I wasn't clear. I don't think you need the open() call to have a
valid reproducer. In mine, in fact, I only use the first three - the error
happens with a combination of write(), fdatasync() and ftruncate().

I just wanted to note that the test program (which was autogenerated?) had an
unreachable case in the switch() statement. :)

Thanks for this testing, by the way!

Dmitry Vyukov

unread,
Jul 15, 2016, 4:26:19 PM7/15/16
to syzkaller, Andrew Morton, Jan Kara, ross.z...@linux.intel.com, Kirill A. Shutemov, linu...@kvack.org, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal, Andrey Ryabinin, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Fri, Jul 15, 2016 at 10:21 PM, Ross Zwisler
Ah, OK. I modified the program by hand to make it trigger the bug more
frequently. So I think the bug was introduced by me. Generator should
not generate dead cases.

Andrew Morton

unread,
Jul 15, 2016, 4:57:35 PM7/15/16
to Ross Zwisler, Andrey Ryabinin, Jan Kara, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Sasha Levin, linux-...@vger.kernel.org, Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, sta...@vger.kernel.org
On Fri, 15 Jul 2016 13:00:40 -0600 Ross Zwisler <ross.z...@linux.intel.com> wrote:

>
> ...
>
> In looking at this more, I agree that your patch fixes this particular bug,
> but I think that ultimately we might want something more general.
>
> ...
>
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
> static __always_inline void **
> radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
> {
> + if (unlikely(!slot))
> + return NULL;
> +
> if (flags & RADIX_TREE_ITER_TAGGED) {
> void *canon = slot;

I'll hang onto Andrey's
radix-tree-fix-radix_tree_iter_retry-for-tagged-iterators.patch for
now, plan to send it in to Linus Wednesdayish. If we can get the above
settled down prior to that then I shall swap over.

Konstantin Khlebnikov

unread,
Jul 16, 2016, 9:45:32 AM7/16/16
to Ross Zwisler, Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable
This is becase distance between ->index and ->next_index now could be
more that one?

We could fix that by adding "iter->index = iter->next_index - 1;" into
radix_tree_iter_next()
right after updating next_index and tweak multi-order itreration logic
if it depends on that.

I'd like to keep radix_tree_next_slot() as small as possible because
this is supposed to be a fast-path.

Konstantin Khlebnikov

unread,
Jul 17, 2016, 1:57:27 PM7/17/16
to Ross Zwisler, Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable
Support of multi-order entries in iterator is ridiculously over-engineered.
If radix_tree_next_chunk() finds multi-order entry it must return chunk
with size 1, radix_tree_next_slot() should know nothing about that.
I'll try to fix that.

Ross Zwisler

unread,
Jul 18, 2016, 7:09:42 PM7/18/16
to Andrew Morton, Ross Zwisler, Andrey Ryabinin, Jan Kara, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Sasha Levin, linux-...@vger.kernel.org, Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, sta...@vger.kernel.org
Sure, that works for me.

Ross Zwisler

unread,
Jul 19, 2016, 12:24:09 AM7/19/16
to Konstantin Khlebnikov, Ross Zwisler, Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable
On Sat, Jul 16, 2016 at 04:45:31PM +0300, Konstantin Khlebnikov wrote:
> On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
> <ross.z...@linux.intel.com> wrote:
<>
> > 3) radix_tree_iter_next() via via a non-tagged iteration like
> > radix_tree_for_each_slot(). This currently happens in shmem_tag_pins()
> > and shmem_partial_swap_usage().
> >
> > I think that this case is currently unhandled. Unlike with
> > radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
> > case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
> > up executing code in the while() loop in radix_tree_next_slot() that assumes
> > 'slot' is valid.
> >
> > I haven't actually seen this crash on a test setup, but I don't think the
> > current code is safe.
>
> This is becase distance between ->index and ->next_index now could be
> more that one?
>
> We could fix that by adding "iter->index = iter->next_index - 1;" into
> radix_tree_iter_next()
> right after updating next_index and tweak multi-order itreration logic
> if it depends on that.
>
> I'd like to keep radix_tree_next_slot() as small as possible because
> this is supposed to be a fast-path.

I think it'll be exactly one?

iter->next_index = __radix_tree_iter_add(iter, 1);

So iter->index will be X, iter->next_index will be X+1, accounting for the
iterator's shift. So, basically, whatever your height is, you'll be set up to
process one more entry, I think.

This means that radix_tree_chunk_size() will return 1. I guess with the
current logic this is safe:

static __always_inline void **
radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
{
...
} else {
long count = radix_tree_chunk_size(iter);
void *canon = slot;

while (--count > 0) {
/* code that assumes 'slot' is non-NULL */

So 'count' will be 1, the prefix decrement will make it 0, so we won't execute
the code in the while() loop. So maybe all the cases are covered after all.

It seems like we need some unit tests in tools/testing/radix-tree around this
- I'll try and find time to add them this week.

I just feel like this isn't very organized. We have a lot of code in
radix_tree_next_slot() that assumes that 'slot' is non-NULL, but we don't
check it anywhere. We know it *can* be NULL, but we just happen to have
things set up so that none of the code that uses 'slot' is executed.

I personally feel like a quick check for slot==NULL at the beginning of the
function is the simplest way to keep ourselves safe, and it doesn't seem like
we'd be adding that much overhead.

Konstantin Khlebnikov

unread,
Jul 19, 2016, 5:07:38 AM7/19/16
to Ross Zwisler, Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov, linu...@kvack.org, Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable
Either fix is fine now. I working on better design for multiorder iterator which
moves all that logic from radix_tree_next_slot() into radix_tree_next_chunk().

Most likely I'll change tree structure a little. For example I think sibling
entries chould hold offset to head entry and order rather than a pointer to it.
Or maybe size: support of non-power-of-2 entries is interesting feature too.
Reply all
Reply to author
Forward
0 new messages