Re: ucount: use-after-free read in inc_ucount & dec_ucount


Dmitry Vyukov

Mar 4, 2017, 6:45:09 AM3/4/17
to Nikolay Borisov, Eric W. Biederman, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
On Sat, Mar 4, 2017 at 11:58 AM, Nikolay Borisov
<n.boris...@gmail.com> wrote:
> [Addressing Dmitry Vyukov to ask for syzkaller clarification]
>
> On 3.03.2017 18:30, Eric W. Biederman wrote:
>> Nikolay Borisov <n.boris...@gmail.com> writes:
>>
>>> [Added containers ml, Eric Biederman and Jan Kara]. Please,
>>> next time don't add random people but take the time to see who touched
>>> the code.
>>
>> Comments below.
>>
>>> On 3.03.2017 14:16, JongHwan Kim wrote:
>>>> I've got the following report with syzkaller fuzzer
>>>>
>>>>
>>>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit .
>>>>
>>>> ==================================================================
>>>> BUG: KASAN: use-after-free in __read_once_size
>>>> include/linux/compiler.h:254 [inline] at addr ffff88006d399bc4
>>>> BUG: KASAN: use-after-free in atomic_read
>>>> arch/x86/include/asm/atomic.h:26 [inline] at addr ffff88006d399bc4
>>>> BUG: KASAN: use-after-free in atomic_dec_if_positive
>>>> include/linux/atomic.h:616 [inline] at addr ffff88006d399bc4
>>>> BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>> at addr ffff88006d399bc4
>>>> Read of size 4 by task syz-executor3/19713
>>>> CPU: 1 PID: 19713 Comm: syz-executor3 Not tainted 4.10.0+ #4
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>> Call Trace:
>>>> __dump_stack lib/dump_stack.c:15 [inline]
>>>> dump_stack+0x115/0x1cf lib/dump_stack.c:51
>>>> kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>>>> print_address_description mm/kasan/report.c:200 [inline]
>>>> kasan_report_error mm/kasan/report.c:289 [inline]
>>>> kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>>>> kasan_report mm/kasan/report.c:331 [inline]
>>>> __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
>>>> __read_once_size include/linux/compiler.h:254 [inline]
>>>> atomic_read arch/x86/include/asm/atomic.h:26 [inline]
>>>> atomic_dec_if_positive include/linux/atomic.h:616 [inline]
>>>> dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>> dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>> inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>> fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>> fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>> fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>> inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>> __fput+0x327/0x7e0 fs/file_table.c:208
>>>> ____fput+0x15/0x20 fs/file_table.c:244
>>>> task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>> exit_task_work include/linux/task_work.h:21 [inline]
>>>> do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>> do_group_exit+0x149/0x400 kernel/exit.c:977
>>>> get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>> do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>> exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>> prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>> syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>> entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>
>>> So PID 19713 is exiting, and as part of that it's freeing its file
>>> descriptors, one of which is apparently an inotify fd. And this one has
>>> already been freed.
>>>
>>>
>>>> RIP: 0033:0x44fb79
>>>> RSP: 002b:00007ffd0f00f6d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000ca
>>>> RAX: fffffffffffffdfc RBX: 0000000000708024 RCX: 000000000044fb79
>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708024
>>>> RBP: 00000000000ae8e6 R08: 0000000000708000 R09: 000000160000000d
>>>> R10: 00007ffd0f00f710 R11: 0000000000000206 R12: 0000000000708000
>>>> R13: 0000000000708024 R14: 00000000000ae8a1 R15: 0000000000000016
>>>> Object at ffff88006d399b88, in cache kmalloc-96 size: 96
>>>> Allocated:
>>>> PID = 19691
>>>> save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>> save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>> set_track mm/kasan/kasan.c:514 [inline]
>>>> kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
>>>> kmem_cache_alloc_trace+0xfb/0x280 mm/slub.c:2745
>>>> kmalloc include/linux/slab.h:490 [inline]
>>>> kzalloc include/linux/slab.h:663 [inline]
>>>> get_ucounts kernel/ucount.c:140 [inline]
>>>> inc_ucount+0x538/0xa70 kernel/ucount.c:195
>>>> inotify_new_group+0x309/0x410 fs/notify/inotify/inotify_user.c:655
>>>> SYSC_inotify_init1 fs/notify/inotify/inotify_user.c:682 [inline]
>>>> SyS_inotify_init1 fs/notify/inotify/inotify_user.c:669 [inline]
>>>> sys_inotify_init+0x17/0x80 fs/notify/inotify/inotify_user.c:696
>>>> entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>
>>> However, it was actually allocated by a different process, PID 19691.
>>>
>>>
>>>> Freed:
>>>> PID = 19708
>>>> save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>> save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>> set_track mm/kasan/kasan.c:514 [inline]
>>>> kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
>>>> slab_free_hook mm/slub.c:1357 [inline]
>>>> slab_free_freelist_hook mm/slub.c:1379 [inline]
>>>> slab_free mm/slub.c:2961 [inline]
>>>> kfree+0xe8/0x2c0 mm/slub.c:3882
>>>> put_ucounts+0x1dd/0x270 kernel/ucount.c:172
>>>> dec_ucount+0x172/0x210 kernel/ucount.c:220
>>>> dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>> inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>> fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>> fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>> fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>> inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>> __fput+0x327/0x7e0 fs/file_table.c:208
>>>> ____fput+0x15/0x20 fs/file_table.c:244
>>>> task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>> exit_task_work include/linux/task_work.h:21 [inline]
>>>> do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>> do_group_exit+0x149/0x400 kernel/exit.c:977
>>>> get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>> do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>> exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>> prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>> syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>> entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>
>>> And yet we have a third process which freed it, PID 19708. So there is
>>> some dance happening with this fd: it's allocated by one process and
>>> handed over to two more, which free it. Is this a valid usage scenario
>>> for inotify descriptors?
>>
>> They are file descriptors, so passing them around is valid. That is
>> something Unix domain sockets have allowed since the dawn of Linux.
>>
>> The dance would need to be the fd being passed to the additional
>> processes and then closed in the original before being closed
>> in the processes the fd was passed to.
>>
>> If those additional processes last longer than the original process this
>> is easy to achieve.
>>
>> My guess is that someone just taught syzkaller to pass file descriptors
>> around. So this may be an old bug. Either that, or syzkaller hasn't
>> been looking at linux-next with KASAN enabled in the kernel.
>
> Dmitry, can you tell if syzkaller tests sending file descriptors across
> sockets? The call traces here show multiple processes being involved in
> different operations on the exact same file descriptor.
>
> Also JongHwan, can you provide the full, compilable reproducer to try
> and track this issue down?


syzkaller can pass descriptors across sockets, but currently only
within a single multi-threaded process.

Are you sure it's the same descriptor? It seems to me that it's struct
ucounts, which is shared via the global ucounts_hashtable, so no fd
sharing between user processes is required.

Unless I am missing something, we want:

@@ -154,7 +155,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
 			ucounts = new;
 		}
 	}
-	if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
+	if (!atomic_add_unless(&ucounts->count, 1, 0))
 		ucounts = NULL;
 	spin_unlock_irq(&ucounts_lock);
 	return ucounts;

no?

put_ucounts drops the last reference; then get_ucounts finds the
ucounts and successfully increments the refcount, since it's not
INT_MAX (it's 0), and starts using it. Meanwhile, put_ucounts proceeds
to unconditionally delete the ucounts.

Dmitry Vyukov

Mar 4, 2017, 6:51:09 AM3/4/17
to Nikolay Borisov, Eric W. Biederman, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
It also seems that a concurrent put_ucounts can make get_ucounts fail
_spuriously_, which does not look good.
Don't we want something along the following lines?

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8a11fc0cb459..233c8e46acd5 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -143,19 +143,18 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
 
 		new->ns = ns;
 		new->uid = uid;
-		atomic_set(&new->count, 0);
+		atomic_set(&new->count, 1);
 
 		spin_lock_irq(&ucounts_lock);
 		ucounts = find_ucounts(ns, uid, hashent);
 		if (ucounts) {
+			atomic_inc(&ucounts->count);
 			kfree(new);
 		} else {
 			hlist_add_head(&new->node, hashent);
 			ucounts = new;
 		}
 	}
-	if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
-		ucounts = NULL;
 	spin_unlock_irq(&ucounts_lock);
 	return ucounts;
 }
@@ -166,7 +165,10 @@ static void put_ucounts(struct ucounts *ucounts)
 
 	if (atomic_dec_and_test(&ucounts->count)) {
 		spin_lock_irqsave(&ucounts_lock, flags);
-		hlist_del_init(&ucounts->node);
+		if (atomic_read(&ucounts->count) == 0)
+			hlist_del_init(&ucounts->node);
+		else
+			ucounts = NULL;
 		spin_unlock_irqrestore(&ucounts_lock, flags);
 
 		kfree(ucounts);

Dmitry Vyukov

Mar 4, 2017, 6:57:31 AM3/4/17
to Nikolay Borisov, Eric W. Biederman, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
I was able to reproduce it by adding udelay(100) into put_ucounts
after the decrement. The workload was just lots of parallel processes
creating private inotify instances.


==================================================================
BUG: KASAN: use-after-free in atomic_dec_if_positive
include/linux/compiler.h:254 [inline] at addr ffff88006c86de3c
BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210
kernel/ucount.c:219 at addr ffff88006c86de3c
Read of size 4 by task udevadm/1644
CPU: 2 PID: 1644 Comm: udevadm Not tainted 4.10.0+ #278
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
kasan_object_err+0x1c/0x70 mm/kasan/report.c:166
print_address_description mm/kasan/report.c:204 [inline]
kasan_report_error mm/kasan/report.c:288 [inline]
kasan_report.part.2+0x198/0x440 mm/kasan/report.c:310
kasan_report mm/kasan/report.c:330 [inline]
__asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:330
atomic_dec_if_positive include/linux/compiler.h:254 [inline]
dec_ucount+0x1e5/0x210 kernel/ucount.c:219
dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
__fput+0x332/0x7f0 fs/file_table.c:208
____fput+0x15/0x20 fs/file_table.c:244
task_work_run+0x18a/0x260 kernel/task_work.c:116
tracehook_notify_resume include/linux/tracehook.h:191 [inline]
exit_to_usermode_loop+0x23b/0x2a0 arch/x86/entry/common.c:160
prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:259
entry_SYSCALL_64_fastpath+0xc0/0xc2
RIP: 0033:0x7f14a12fa2b0
RSP: 002b:00007ffdeea0a6c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f14a12fa2b0
RDX: 0000000000e03000 RSI: 00007f14a15b0e40 RDI: 0000000000000003
RBP: 0000000000000000 R08: 00007f14a1bf37a0 R09: 0000000000000000
R10: 1999999999999999 R11: 0000000000000246 R12: 0000000000000078
R13: 0000000000000000 R14: 431bde82d7b634db R15: 0000000000de2130
Object at ffff88006c86de00, in cache kmalloc-96 size: 96
Allocated:
PID = 0
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
save_stack+0x43/0xd0 mm/kasan/kasan.c:502
set_track mm/kasan/kasan.c:514 [inline]
kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:605
kmem_cache_alloc_trace+0x10b/0x6e0 mm/slab.c:3637
kmalloc include/linux/slab.h:490 [inline]
kzalloc include/linux/slab.h:663 [inline]
get_ucounts kernel/ucount.c:140 [inline]
inc_ucount+0x482/0x950 kernel/ucount.c:197
inc_mnt_namespaces fs/namespace.c:2843 [inline]
alloc_mnt_ns+0xfe/0x560 fs/namespace.c:2874
create_mnt_ns+0x6e/0x310 fs/namespace.c:2984
init_mount_tree fs/namespace.c:3224 [inline]
mnt_init+0x2b8/0x471 fs/namespace.c:3276
vfs_caches_init+0xaa/0x156 fs/dcache.c:3626
start_kernel+0x72e/0x7d2 init/main.c:648
x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
x86_64_start_kernel+0x13c/0x149 arch/x86/kernel/head64.c:176
verify_cpu+0x0/0xfc
Freed:
PID = 1644
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
save_stack+0x43/0xd0 mm/kasan/kasan.c:502
set_track mm/kasan/kasan.c:514 [inline]
kasan_slab_free+0x6f/0xb0 mm/kasan/kasan.c:578
__cache_free mm/slab.c:3513 [inline]
kfree+0xd3/0x250 mm/slab.c:3830
put_ucounts+0x341/0x3e0 kernel/ucount.c:174
dec_ucount+0x172/0x210 kernel/ucount.c:222
dec_inotify_watches fs/notify/inotify/inotify.h:47 [inline]
inotify_ignored_and_remove_idr+0x80/0x90 fs/notify/inotify/inotify_user.c:501
inotify_freeing_mark+0x1d/0x30 fs/notify/inotify/inotify_fsnotify.c:125
__fsnotify_free_mark+0x13c/0x2b0 fs/notify/mark.c:203
fsnotify_detach_group_marks+0xd2/0x180 fs/notify/mark.c:508
fsnotify_destroy_group+0x62/0x120 fs/notify/group.c:70
inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
__fput+0x332/0x7f0 fs/file_table.c:208
____fput+0x15/0x20 fs/file_table.c:244
task_work_run+0x18a/0x260 kernel/task_work.c:116
tracehook_notify_resume include/linux/tracehook.h:191 [inline]
exit_to_usermode_loop+0x23b/0x2a0 arch/x86/entry/common.c:160
prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:259
entry_SYSCALL_64_fastpath+0xc0/0xc2
Memory state around the buggy address:
ffff88006c86dd00: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
ffff88006c86dd80: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
>ffff88006c86de00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff88006c86de80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
ffff88006c86df00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
==================================================================

Dmitry Vyukov

Mar 4, 2017, 7:01:30 AM3/4/17
to Nikolay Borisov, Eric W. Biederman, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

This is broken as it stands. Something more elaborate is needed.

Nikolay Borisov

Mar 4, 2017, 7:10:38 AM3/4/17
to Dmitry Vyukov, Eric W. Biederman, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
How about this :

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8a11fc0cb459..b817ac0e587c 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -166,11 +166,15 @@ static void put_ucounts(struct ucounts *ucounts)
 
 	if (atomic_dec_and_test(&ucounts->count)) {
 		spin_lock_irqsave(&ucounts_lock, flags);
-		hlist_del_init(&ucounts->node);
-		spin_unlock_irqrestore(&ucounts_lock, flags);
-
-		kfree(ucounts);
+		if (!atomic_read(&ucounts->count)) {
+			hlist_del_init(&ucounts->node);
+			spin_unlock_irqrestore(&ucounts_lock, flags);
+			kfree(ucounts);
+			return;
+		}
 	}
+
+	spin_unlock_irqrestore(&ucounts_lock, flags);
 }



This makes the atomic_dec_and_test and the hashtable removal
effectively atomic with respect to each other.

Dmitry Vyukov

Mar 4, 2017, 7:15:45 AM3/4/17
to Nikolay Borisov, Eric W. Biederman, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
This won't work.
Consider the following scenario.
Thread 0 decrements count to 0 here:
    if (atomic_dec_and_test(&ucounts->count)) {
Then thread 1 calls get_ucounts, increments count to 1, then calls
put_ucounts, decrements count to 0, and unhashes and frees ucounts.
Now thread 0 does:
    if (!atomic_read(&ucounts->count)) {
but ucounts is already freed!

쪼르

Mar 4, 2017, 7:35:26 AM3/4/17
to Dmitry Vyukov, Nikolay Borisov, Eric W. Biederman, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
Hi, this is my new report about dec_ucount.
P.S. Sorry for the rough report; this is my first time posting to LKML.
Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit .

==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:254 [inline] at addr ffff8800372f21fc
BUG: KASAN: use-after-free in atomic_read arch/x86/include/asm/atomic.h:26 [inline] at addr ffff8800372f21fc
BUG: KASAN: use-after-free in atomic_dec_if_positive include/linux/atomic.h:616 [inline] at addr ffff8800372f21fc
BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210 kernel/ucount.c:217 at addr ffff8800372f21fc
Read of size 4 by task syz-executor0/19190
CPU: 1 PID: 19190 Comm: syz-executor0 Not tainted 4.10.0+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x115/0x1cf lib/dump_stack.c:51
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
 print_address_description mm/kasan/report.c:200 [inline]
 kasan_report_error mm/kasan/report.c:289 [inline]
 kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
 kasan_report mm/kasan/report.c:331 [inline]
 __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
 __read_once_size include/linux/compiler.h:254 [inline]
 atomic_read arch/x86/include/asm/atomic.h:26 [inline]
 atomic_dec_if_positive include/linux/atomic.h:616 [inline]
 dec_ucount+0x1e5/0x210 kernel/ucount.c:217
 dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
 inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
 fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
 fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
 fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
 inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
 __fput+0x327/0x7e0 fs/file_table.c:208
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x18a/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0xa45/0x1b20 kernel/exit.c:873
 do_group_exit+0x149/0x400 kernel/exit.c:977
 get_signal+0x7d5/0x1810 kernel/signal.c:2313
 do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
 exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
 prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
 syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
 entry_SYSCALL_64_fastpath+0xc0/0xc2
RIP: 0033:0x44fb79
RSP: 002b:00007f048a1cacf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000708218 RCX: 000000000044fb79
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708218
RBP: 00000000007081f8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fff84d701af R14: 00007f048a1cb9c0 R15: 000000000000000e
Object at ffff8800372f21c0, in cache kmalloc-96 size: 96
Allocated:
PID = 19163
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
 kmem_cache_alloc_trace+0xfb/0x280 mm/slub.c:2745
 kmalloc include/linux/slab.h:490 [inline]
 kzalloc include/linux/slab.h:663 [inline]
 get_ucounts kernel/ucount.c:140 [inline]
 inc_ucount+0x538/0xa70 kernel/ucount.c:195
 inotify_new_group+0x309/0x410 fs/notify/inotify/inotify_user.c:655
 SYSC_inotify_init1 fs/notify/inotify/inotify_user.c:682 [inline]
 SyS_inotify_init1 fs/notify/inotify/inotify_user.c:669 [inline]
 sys_inotify_init+0x17/0x80 fs/notify/inotify/inotify_user.c:696
 entry_SYSCALL_64_fastpath+0x1f/0xc2
Freed:
PID = 19163
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kfree+0xe8/0x2c0 mm/slub.c:3882
 put_ucounts+0x1dd/0x270 kernel/ucount.c:172
 dec_ucount+0x172/0x210 kernel/ucount.c:220
 dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
 inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
 fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
 fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
 fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
 inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
 __fput+0x327/0x7e0 fs/file_table.c:208
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x18a/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0xa45/0x1b20 kernel/exit.c:873
 do_group_exit+0x149/0x400 kernel/exit.c:977
 get_signal+0x7d5/0x1810 kernel/signal.c:2313
 do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
 exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
 prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
 syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
 entry_SYSCALL_64_fastpath+0xc0/0xc2
Memory state around the buggy address:
 ffff8800372f2080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8800372f2100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8800372f2180: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                                ^
 ffff8800372f2200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8800372f2280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 19190 Comm: syz-executor0 Tainted: G    B           4.10.0+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x115/0x1cf lib/dump_stack.c:51
 panic+0x1b4/0x392 kernel/panic.c:179
 kasan_end_report+0x5b/0x60 mm/kasan/report.c:141
 kasan_report_error mm/kasan/report.c:293 [inline]
 kasan_report.part.1+0x40a/0x4e0 mm/kasan/report.c:311
 kasan_report mm/kasan/report.c:331 [inline]
 __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
 __read_once_size include/linux/compiler.h:254 [inline]
 atomic_read arch/x86/include/asm/atomic.h:26 [inline]
 atomic_dec_if_positive include/linux/atomic.h:616 [inline]
 dec_ucount+0x1e5/0x210 kernel/ucount.c:217
 dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
 inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
 fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
 fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
 fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
 inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
 __fput+0x327/0x7e0 fs/file_table.c:208
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x18a/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0xa45/0x1b20 kernel/exit.c:873
 do_group_exit+0x149/0x400 kernel/exit.c:977
 get_signal+0x7d5/0x1810 kernel/signal.c:2313
 do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
 exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
 prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
 syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
 entry_SYSCALL_64_fastpath+0xc0/0xc2
RIP: 0033:0x44fb79
RSP: 002b:00007f048a1cacf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000708218 RCX: 000000000044fb79
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708218
RBP: 00000000007081f8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fff84d701af R14: 00007f048a1cb9c0 R15: 000000000000000e
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


Syzkaller reproducer:
# {Threaded:false Collide:false Repeat:true Procs:1 Sandbox:setuid Repro:false}
semget$private(0x0, 0x400001003, 0x181)


C reproducer:
// autogenerated by syzkaller (http://github.com/google/syzkaller)

#ifndef __NR_semget
#define __NR_semget 64
#endif

#define __STDC_VERSION__ 201112L

#define _GNU_SOURCE

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <linux/capability.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <linux/kvm.h>
#include <linux/sched.h>
#include <net/if_arp.h>

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <grp.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

const int kFailStatus = 67;
const int kErrorStatus = 68;
const int kRetryStatus = 69;

__attribute__((noreturn)) void doexit(int status)
{
  volatile unsigned i;
  syscall(__NR_exit_group, status);
  for (i = 0;; i++) {
  }
}

__attribute__((noreturn)) void fail(const char* msg, ...)
{
  int e = errno;
  fflush(stdout);
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  doexit(e == ENOMEM ? kRetryStatus : kFailStatus);
}

__attribute__((noreturn)) void exitf(const char* msg, ...)
{
  int e = errno;
  fflush(stdout);
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  doexit(kRetryStatus);
}

static int flag_debug;

void debug(const char* msg, ...)
{
  if (!flag_debug)
    return;
  va_list args;
  va_start(args, msg);
  vfprintf(stdout, msg, args);
  va_end(args);
  fflush(stdout);
}

__thread int skip_segv;
__thread jmp_buf segv_env;

static void segv_handler(int sig, siginfo_t* info, void* uctx)
{
  uintptr_t addr = (uintptr_t)info->si_addr;
  const uintptr_t prog_start = 1 << 20;
  const uintptr_t prog_end = 100 << 20;
  if (__atomic_load_n(&skip_segv, __ATOMIC_RELAXED) &&
      (addr < prog_start || addr > prog_end)) {
    debug("SIGSEGV on %p, skipping\n", addr);
    _longjmp(segv_env, 1);
  }
  debug("SIGSEGV on %p, exiting\n", addr);
  doexit(sig);
  for (;;) {
  }
}

static void install_segv_handler()
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_NODEFER | SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);
  sigaction(SIGBUS, &sa, NULL);
}

#define NONFAILING(...)                                                \
  {                                                                    \
    __atomic_fetch_add(&skip_segv, 1, __ATOMIC_SEQ_CST);               \
    if (_setjmp(segv_env) == 0) {                                      \
      __VA_ARGS__;                                                     \
    }                                                                  \
    __atomic_fetch_sub(&skip_segv, 1, __ATOMIC_SEQ_CST);               \
  }

#define BITMASK_LEN(type, bf_len) (type)((1ull << (bf_len)) - 1)

#define BITMASK_LEN_OFF(type, bf_off, bf_len)                          \
  (type)(BITMASK_LEN(type, (bf_len)) << (bf_off))

#define STORE_BY_BITMASK(type, addr, val, bf_off, bf_len)              \
  if ((bf_off) == 0 && (bf_len) == 0) {                                \
    *(type*)(addr) = (type)(val);                                      \
  } else {                                                             \
    type new_val = *(type*)(addr);                                     \
    new_val &= ~BITMASK_LEN_OFF(type, (bf_off), (bf_len));             \
    new_val |= ((type)(val)&BITMASK_LEN(type, (bf_len))) << (bf_off);  \
    *(type*)(addr) = new_val;                                          \
  }

static uintptr_t execute_syscall(int nr, uintptr_t a0, uintptr_t a1,
                                 uintptr_t a2, uintptr_t a3,
                                 uintptr_t a4, uintptr_t a5,
                                 uintptr_t a6, uintptr_t a7,
                                 uintptr_t a8)
{
  switch (nr) {
  default:
    return syscall(nr, a0, a1, a2, a3, a4, a5);
  }
}

static void setup_main_process()
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = SIG_IGN;
  syscall(SYS_rt_sigaction, 0x20, &sa, NULL, 8);
  syscall(SYS_rt_sigaction, 0x21, &sa, NULL, 8);
  install_segv_handler();

  char tmpdir_template[] = "./syzkaller.XXXXXX";
  char* tmpdir = mkdtemp(tmpdir_template);
  if (!tmpdir)
    fail("failed to mkdtemp");
  if (chmod(tmpdir, 0777))
    fail("failed to chmod");
  if (chdir(tmpdir))
    fail("failed to chdir");
}

static void loop();

static void sandbox_common()
{
  prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
  setpgrp();
  setsid();

  struct rlimit rlim;
  rlim.rlim_cur = rlim.rlim_max = 128 << 20;
  setrlimit(RLIMIT_AS, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 1 << 20;
  setrlimit(RLIMIT_FSIZE, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 1 << 20;
  setrlimit(RLIMIT_STACK, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 0;
  setrlimit(RLIMIT_CORE, &rlim);

  unshare(CLONE_NEWNS);
  unshare(CLONE_NEWIPC);
  unshare(CLONE_IO);
}

static int do_sandbox_setuid(int executor_pid, bool enable_tun)
{
  int pid = fork();
  if (pid)
    return pid;

  sandbox_common();

  const int nobody = 65534;
  if (setgroups(0, NULL))
    fail("failed to setgroups");
  if (syscall(SYS_setresgid, nobody, nobody, nobody))
    fail("failed to setresgid");
  if (syscall(SYS_setresuid, nobody, nobody, nobody))
    fail("failed to setresuid");

  loop();
  doexit(1);
}

static void remove_dir(const char* dir)
{
  DIR* dp;
  struct dirent* ep;
  int iter = 0;
retry:
  dp = opendir(dir);
  if (dp == NULL) {
    if (errno == EMFILE) {
      exitf("opendir(%s) failed due to NOFILE, exiting", dir);
    }
    exitf("opendir(%s) failed", dir);
  }
  while ((ep = readdir(dp))) {
    if (strcmp(ep->d_name, ".") == 0 || strcmp(ep->d_name, "..") == 0)
      continue;
    char filename[FILENAME_MAX];
    snprintf(filename, sizeof(filename), "%s/%s", dir, ep->d_name);
    struct stat st;
    if (lstat(filename, &st))
      exitf("lstat(%s) failed", filename);
    if (S_ISDIR(st.st_mode)) {
      remove_dir(filename);
      continue;
    }
    int i;
    for (i = 0;; i++) {
      debug("unlink(%s)\n", filename);
      if (unlink(filename) == 0)
        break;
      if (errno == EROFS) {
        debug("ignoring EROFS\n");
        break;
      }
      if (errno != EBUSY || i > 100)
        exitf("unlink(%s) failed", filename);
      debug("umount(%s)\n", filename);
      if (umount2(filename, MNT_DETACH))
        exitf("umount(%s) failed", filename);
    }
  }
  closedir(dp);
  int i;
  for (i = 0;; i++) {
    debug("rmdir(%s)\n", dir);
    if (rmdir(dir) == 0)
      break;
    if (i < 100) {
      if (errno == EROFS) {
        debug("ignoring EROFS\n");
        break;
      }
      if (errno == EBUSY) {
        debug("umount(%s)\n", dir);
        if (umount2(dir, MNT_DETACH))
          exitf("umount(%s) failed", dir);
        continue;
      }
      if (errno == ENOTEMPTY) {
        if (iter < 100) {
          iter++;
          goto retry;
        }
      }
    }
    exitf("rmdir(%s) failed", dir);
  }
}

static uint64_t current_time_ms()
{
  struct timespec ts;

  if (clock_gettime(CLOCK_MONOTONIC, &ts))
    fail("clock_gettime failed");
  return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static void test();

void loop()
{
  int iter;
  for (iter = 0;; iter++) {
    char cwdbuf[256];
    sprintf(cwdbuf, "./%d", iter);
    if (mkdir(cwdbuf, 0777))
      fail("failed to mkdir");
    int pid = fork();
    if (pid < 0)
      fail("fork failed");
    if (pid == 0) {
      prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
      setpgrp();
      if (chdir(cwdbuf))
        fail("failed to chdir");
      test();
      doexit(0);
    }
    int status = 0;
    uint64_t start = current_time_ms();
    for (;;) {
      int res = waitpid(-1, &status, __WALL | WNOHANG);
      if (res == pid)
        break;
      usleep(1000);
      if (current_time_ms() - start > 5 * 1000) {
        kill(-pid, SIGKILL);
        kill(pid, SIGKILL);
        while (waitpid(-1, &status, __WALL) != pid) {
        }
        break;
      }
    }
    remove_dir(cwdbuf);
  }
}

long r[1];
void test()
{
  memset(r, -1, sizeof(r));
  r[0] = execute_syscall(__NR_semget, 0x0ul, 0x400001003ul, 0x181ul, 0,
                         0, 0, 0, 0, 0);
}
int main()
{
  setup_main_process();
  int pid = do_sandbox_setuid(0, false);
  int status = 0;
  while (waitpid(pid, &status, __WALL) != pid) {
  }
  return 0;
}

Dmitry Vyukov

unread,
Mar 4, 2017, 7:38:49 AM3/4/17
to Nikolay Borisov, Eric W. Biederman, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
What may work is if put_ucounts re-looks up the ucounts. If it can
still find it and count==0, then it is the right time to delete it. If
it can't find the ucounts, then somebody else has beaten it to the
deletion.
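The re-lookup idea can be sketched in userspace C11, with a pthread mutex standing in for ucounts_lock and a single-bucket list for the hash table. All names here (make_obj, get_obj, put_obj, table_find) are illustrative, not kernel API, and the pointer comparison in put_obj ignores ABA, which is fine for a sketch:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct obj {
    atomic_int count;
    struct obj *next;      /* single-bucket chain, stands in for hlist_node */
    int key;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj *table;  /* one bucket is enough for the sketch */

/* caller must hold table_lock */
static struct obj *table_find(int key)
{
    for (struct obj *o = table; o; o = o->next)
        if (o->key == key)
            return o;
    return NULL;
}

static struct obj *make_obj(int key)    /* insert with one reference held */
{
    struct obj *o = calloc(1, sizeof(*o));
    if (!o)
        return NULL;
    o->key = key;
    atomic_store(&o->count, 1);
    pthread_mutex_lock(&table_lock);
    o->next = table;
    table = o;
    pthread_mutex_unlock(&table_lock);
    return o;
}

static struct obj *get_obj(int key)
{
    pthread_mutex_lock(&table_lock);
    struct obj *o = table_find(key);
    if (o)
        atomic_fetch_add(&o->count, 1);
    pthread_mutex_unlock(&table_lock);
    return o;
}

static void put_obj(struct obj *obj)
{
    int key = obj->key;     /* capture before the count can reach zero */
    if (atomic_fetch_sub(&obj->count, 1) != 1)
        return;             /* we were not the last reference */
    /* The count hit zero: re-lookup under the lock and free only if we
     * still find the very same object with a zero count.  Otherwise
     * somebody else re-took the reference or has already freed it. */
    pthread_mutex_lock(&table_lock);
    struct obj *found = table_find(key);
    if (found == obj && atomic_load(&obj->count) == 0) {
        struct obj **pp = &table;
        while (*pp != obj)
            pp = &(*pp)->next;
        *pp = obj->next;    /* unhash; nothing else can find it now */
        pthread_mutex_unlock(&table_lock);
        free(obj);
        return;
    }
    pthread_mutex_unlock(&table_lock);
}
```

The key property is that the free only ever happens under the lock, after a fresh lookup confirmed nobody resurrected the object.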

Eric W. Biederman

unread,
Mar 4, 2017, 7:02:51 PM3/4/17
to Dmitry Vyukov, Nikolay Borisov, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
I believe what we want is atomic_dec_and_lock_irqsave.

As that does not exist we can just do:

@@ -164,13 +164,16 @@ static void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;

+ /* Unless the count is 1 decrement the quick way */
+ if (atomic_add_unless(&ucounts->count, -1, 1))
+ return;
+
+ spin_lock_irqsave(&ucounts_lock, flags);
if (atomic_dec_and_test(&ucounts->count)) {
- spin_lock_irqsave(&ucounts_lock, flags);
hlist_del_init(&ucounts->node);
- spin_unlock_irqrestore(&ucounts_lock, flags);
-
kfree(ucounts);
}
+ spin_unlock_irqrestore(&ucounts_lock, flags);
}

static inline bool atomic_inc_below(atomic_t *v, int u)


AKA take the spin_lock around the dec_and_test.
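The patch is the classic fast-path/slow-path split: an unlocked decrement whenever we are clearly not the last holder, and the final decrement only under the lock. A userspace approximation with C11 atomics, where the kernel's atomic_add_unless(v, -1, 1) is open-coded as a CAS loop and the struct and function names are hypothetical:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Open-coded userspace equivalent of the kernel's
 * atomic_add_unless(v, -1, 1): decrement v unless it is 1,
 * returning true iff the decrement happened. */
static bool atomic_dec_unless_one(atomic_int *v)
{
    int old = atomic_load(v);
    while (old != 1) {
        if (atomic_compare_exchange_weak(v, &old, old - 1))
            return true;
    }
    return false;
}

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct counted {
    atomic_int count;
    bool hashed;        /* stands in for being on the hash chain */
};

/* Fast path: drop a reference without the lock when we are clearly not
 * the last holder.  Slow path: take the lock first, then do the final
 * decrement, so it cannot race with a locked lookup-and-increment.
 * Returns true when the object was unhashed (the caller would kfree). */
static bool put_counted(struct counted *c)
{
    if (atomic_dec_unless_one(&c->count))
        return false;
    pthread_mutex_lock(&lock);
    bool freed = false;
    if (atomic_fetch_sub(&c->count, 1) == 1) {
        c->hashed = false;      /* hlist_del_init() in the real code */
        freed = true;
    }
    pthread_mutex_unlock(&lock);
    return freed;
}
```

Because the 1-to-0 transition only ever happens with the lock held, a locked lookup can never observe a zero count and resurrect a doomed object.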

Arguably, since we would then always decrement under the ucounts_lock,
we could stop using atomics and reduce ucounts->count to a simple
integer, just always taking the lock. That would cut down our number
of atomic operations and speed up the code a little.

But that might be a bit much for a simple bug fix.

If you folks can verify the fix above closes the race and stops the
problems I would appreciate it.

Eric


Dmitry Vyukov

unread,
Mar 5, 2017, 5:53:36 AM3/5/17
to Eric W. Biederman, Nikolay Borisov, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
Nice. I think it should work.

I would also do:

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8a11fc0cb459..233c8e46acd5 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -143,19 +143,18 @@ static struct ucounts *get_ucounts(struct
user_namespace *ns, kuid_t uid)

new->ns = ns;
new->uid = uid;
- atomic_set(&new->count, 0);
+ atomic_set(&new->count, 1);

spin_lock_irq(&ucounts_lock);
ucounts = find_ucounts(ns, uid, hashent);
if (ucounts) {
+ atomic_inc(&ucounts->count);
kfree(new);
} else {
hlist_add_head(&new->node, hashent);
ucounts = new;
}
}
- if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
- ucounts = NULL;
spin_unlock_irq(&ucounts_lock);
return ucounts;
}



Eric W. Biederman

unread,
Mar 5, 2017, 1:46:03 PM3/5/17
to Dmitry Vyukov, Nikolay Borisov, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
No. As that allows ucounts->count to be incremented to the point
it goes negative. Counter wrap-around is just as bad as imperfect
locking if you can trigger it, and has been a cause of use-after-free
errors etc.

So it is a feature that if the count is maxed out for a given kuid that
get_ucounts will fail.
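The wrap-around Eric warns about is easy to demonstrate: atomic arithmetic wraps silently, so an unchecked increment at INT_MAX yields a negative count, while a saturating check in the spirit of atomic_add_unless(v, 1, INT_MAX) refuses to hand out the reference. A userspace C11 sketch (function names are made up for illustration):

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Unchecked take-a-reference: C11 atomic arithmetic wraps silently
 * (the kernel's atomic_t behaves the same way), so INT_MAX + 1
 * becomes INT_MIN without any trap. */
static int bump_unchecked(atomic_int *v)
{
    atomic_fetch_add(v, 1);
    return atomic_load(v);
}

/* Saturating take-a-reference in the spirit of
 * atomic_add_unless(v, 1, INT_MAX): refuse once the count is maxed. */
static bool bump_checked(atomic_int *v)
{
    int old = atomic_load(v);
    while (old != INT_MAX) {
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}
```

A negative count then makes every subsequent dec_and_test decision nonsense, which is how wrap-around turns into use-after-free.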

Eric


Eric W. Biederman

unread,
Mar 5, 2017, 4:05:13 PM3/5/17
to 쪼르, Dmitry Vyukov, Nikolay Borisov, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
쪼르 <zzor...@gmail.com> writes:

> Hi, this is my new report about dec_ucount:
> P.S. Sorry for the rough report; this is my first time posting to lkml.
> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit
> .

You are doing well. Thank you very much for the report.

Thank you for the reproducer. Unfortunately I am not able to reproduce
the bug with the code you have posted here.

From the initial mailing the code said:

> Syzkaller reproducer:
> # {Threaded:false Collide:false Repeat:true Procs:4 Sandbox:setuid
> Repro:false}
> inotify_init()

The code you posted says:

> Syzkaller reproducer:
> # {Threaded:false Collide:false Repeat:true Procs:1 Sandbox:setuid Repro:false}
> semget$private(0x0, 0x400001003, 0x181)

So I expect syzkaller did not generate the same code when you ran it
again. That is easy to miss if you haven't used a tool like that
much.

If someone knows how to get the code that syzkaller would generate that
matches the original reproducer I would very much appreciate it so that
we can confirm the bug we have spotted in the code is the bug syzkaller
found.

Until that point I am going to fix the obvious bug in the code and hope
that fixes the problem.

Eric

Eric W. Biederman

unread,
Mar 5, 2017, 4:45:59 PM3/5/17
to Dmitry Vyukov, Nikolay Borisov, JongHwan Kim, conta...@lists.linux-foundation.org, Jan Kara, syzkaller

Always increment/decrement ucount->count under the ucounts_lock. The
increments are there already and moving the decrements there means the
locking logic of the code is simpler. This simplification in the
locking logic fixes a race between put_ucounts and get_ucounts that
could result in a use-after-free because the count could go zero then
be found by get_ucounts and then be freed by put_ucounts.

A bug, presumably this one, was found by a combination of syzkaller and
KASAN. JongHwan Kim reported the syzkaller failure and Dmitry Vyukov
spotted the race in the code.

Reported-by: JongHwan Kim <zzor...@gmail.com>
Reported-by: Dmitry Vyukov <dvy...@google.com>
Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
include/linux/user_namespace.h | 2 +-
kernel/ucount.c | 18 +++++++++++-------
2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 363e0e8082a9..61071b6d2d12 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -69,7 +69,7 @@ struct ucounts {
struct hlist_node node;
struct user_namespace *ns;
kuid_t uid;
- atomic_t count;
+ int count;
atomic_t ucount[UCOUNT_COUNTS];
};

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 68716403b261..73696faa80dd 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -143,7 +143,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)

new->ns = ns;
new->uid = uid;
- atomic_set(&new->count, 0);
+ new->count = 0;

spin_lock_irq(&ucounts_lock);
ucounts = find_ucounts(ns, uid, hashent);
@@ -154,8 +154,10 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
ucounts = new;
}
}
- if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
+ if (ucounts->count == INT_MAX)
ucounts = NULL;
+ else
+ ucounts->count += 1;
spin_unlock_irq(&ucounts_lock);
return ucounts;
}
@@ -164,13 +166,15 @@ static void put_ucounts(struct ucounts *ucounts)
{
unsigned long flags;

- if (atomic_dec_and_test(&ucounts->count)) {
- spin_lock_irqsave(&ucounts_lock, flags);
+ spin_lock_irqsave(&ucounts_lock, flags);
+ ucounts->count -= 1;
+ if (!ucounts->count)
hlist_del_init(&ucounts->node);
- spin_unlock_irqrestore(&ucounts_lock, flags);
+ else
+ ucounts = NULL;
+ spin_unlock_irqrestore(&ucounts_lock, flags);

- kfree(ucounts);
- }
+ kfree(ucounts);
}

static inline bool atomic_inc_below(atomic_t *v, int u)
--
2.10.1
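The overall shape of the patched code, a plain int guarded by one lock in both get and put, can be modeled in userspace C. obj_get/obj_put and the single-bucket table below are illustrative stand-ins, not the kernel's API:

```c
#include <limits.h>
#include <pthread.h>
#include <stdlib.h>

struct obj {
    struct obj *next;
    int key;
    int count;            /* plain int: only touched under table_lock */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj *table;

/* caller must hold table_lock */
static struct obj *find_locked(int key)
{
    for (struct obj *o = table; o; o = o->next)
        if (o->key == key)
            return o;
    return NULL;
}

/* Mirrors the patched get_ucounts(): lookup-or-insert, then take a
 * reference, all under the one lock.  Fails if the count is saturated. */
static struct obj *obj_get(int key)
{
    struct obj *new = calloc(1, sizeof(*new));
    if (!new)
        return NULL;
    new->key = key;

    pthread_mutex_lock(&table_lock);
    struct obj *o = find_locked(key);
    if (o) {
        free(new);
    } else {
        new->next = table;
        table = new;
        o = new;
    }
    if (o->count == INT_MAX)
        o = NULL;
    else
        o->count += 1;
    pthread_mutex_unlock(&table_lock);
    return o;
}

/* Mirrors the patched put_ucounts(): the decrement happens under the
 * same lock, so obj_get() can never observe a zero count. */
static void obj_put(struct obj *obj)
{
    pthread_mutex_lock(&table_lock);
    obj->count -= 1;
    if (!obj->count) {
        struct obj **pp = &table;
        while (*pp != obj)
            pp = &(*pp)->next;
        *pp = obj->next;  /* unhash before dropping the lock */
    } else {
        obj = NULL;
    }
    pthread_mutex_unlock(&table_lock);
    free(obj);            /* free(NULL) is a no-op, like kfree(NULL) */
}
```

With one lock covering lookup, increment, and decrement, the race window between the count reaching zero and the object leaving the hash table simply no longer exists.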

Dmitry Vyukov

unread,
Mar 6, 2017, 4:14:07 AM3/6/17
to Eric W. Biederman, 쪼르, Nikolay Borisov, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
Reliably reproducing such bugs is not possible (what would you expect
a reproducer to look like?). Your best bet is to write a stress test
that provokes the bug, add some sleeps into the kernel code and run it
for a while with KASAN. That should reproduce it within minutes.

Eric W. Biederman

unread,
Mar 6, 2017, 11:38:08 AM3/6/17
to Dmitry Vyukov, 쪼르, Nikolay Borisov, conta...@lists.linux-foundation.org, Jan Kara, syzkaller
I was not asking for a reliable reproducer. I was asking what code was
run that triggered the error.

I don't have a clue what the randomly generated code that prompted the
original kernel error is and it doesn't appear anyone else does either.

The only hint I have is:
>>> Syzkaller reproducer:
>>> # {Threaded:false Collide:false Repeat:true Procs:4 Sandbox:setuid
>>> Repro:false}
>>> inotify_init()

The code that was posted did not call inotify_init and so I believe that
was a completely different random piece of code, that has nothing to do
with this issue.

I don't know syzkaller and it looks non-trivial to install on my system
and play around with. So I am going to leave futzing with syzkaller to
people who have been able to figure it out.

Until I have a reasonable understanding of what the code was doing that
triggered the error I can't say with any certainty that the reported bug
was fixed.

I would love to be able to say that it looks like the bug that caused
the error report was fixed.

Eric

Dmitry Vyukov

unread,
Mar 6, 2017, 1:43:31 PM3/6/17
to Eric W. Biederman, 쪼르, Nikolay Borisov, conta...@lists.linux-foundation.org, Jan Kara, syzkaller

Andrei Vagin

unread,
Mar 6, 2017, 3:39:24 PM3/6/17
to Eric W. Biederman, Dmitry Vyukov, syzkaller, conta...@lists.linux-foundation.org, Jan Kara, JongHwan Kim
On Sun, Mar 05, 2017 at 03:41:06PM -0600, Eric W. Biederman wrote:
>
> Always increment/decrement ucount->count under the ucounts_lock. The
> increments are there already and moving the decrements there means the
> locking logic of the code is simpler. This simplification in the
> locking logic fixes a race between put_ucounts and get_ucounts that
> could result in a use-after-free because the count could go zero then
> be found by get_ucounts and then be freed by put_ucounts.
>
> A bug presumably this one was found by a combination of syzkaller and
> KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
> spotted the race in the code.
>

Reviewed-by: Andrei Vagin <ava...@gmail.com>

I think we can rework this in the future so that ucounts will be
RCU-protected.
> _______________________________________________
> Containers mailing list
> Conta...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Eric W. Biederman

unread,
Mar 6, 2017, 4:30:56 PM3/6/17
to Andrei Vagin, Dmitry Vyukov, syzkaller, conta...@lists.linux-foundation.org, Jan Kara, JongHwan Kim
Andrei Vagin <ava...@gmail.com> writes:

> On Sun, Mar 05, 2017 at 03:41:06PM -0600, Eric W. Biederman wrote:
>>
>> Always increment/decrement ucount->count under the ucounts_lock. The
>> increments are there already and moving the decrements there means the
>> locking logic of the code is simpler. This simplification in the
>> locking logic fixes a race between put_ucounts and get_ucounts that
>> could result in a use-after-free because the count could go zero then
>> be found by get_ucounts and then be freed by put_ucounts.
>>
>> A bug presumably this one was found by a combination of syzkaller and
>> KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
>> spotted the race in the code.
>>
>
> Reviewed-by: Andrei Vagin <ava...@gmail.com>
>
> I think we can rework this in a future so that ucount will be rcu
> protected.

Agreed. Although I would like to see a benchmark that motivated that.
So far my impression is that all of these counts are in the noise.
Which is why I have aimed more at simplicity than the fastest possible
data structures.


Eric