KASAN: use-after-free Read in task_is

syzbot

unread,

Oct 21, 2018, 3:10:03 AM10/21/18

to jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, se...@hallyn.com, syzkall...@googlegroups.com

Hello,

syzbot found the following crash on:

HEAD commit: 270b77a0f30e Merge tag 'drm-fixes-2018-10-20-1' of git://a..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=116f4ad9400000
kernel config: https://syzkaller.appspot.com/x/.config?x=b3f55cb3dfcc6c33
dashboard link: https://syzkaller.appspot.com/bug?extid=a9ac39bf55329e206219
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a9ac39...@syzkaller.appspotmail.com

binder: 12719:12721 got transaction to context manager from process owning
it
binder: BINDER_SET_CONTEXT_MGR already set
binder: 12719:12724 ioctl 40046207 0 returned -16
binder: 12719:12721 BC_ACQUIRE_DONE u0000000000000000 no match
==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:188
[inline]
BUG: KASAN: use-after-free in task_is_descendant.part.2+0x610/0x670
security/yama/yama_lsm.c:295
Read of size 8 at addr ffff8801c4666b20 by task syz-executor3/12722

CPU: 1 PID: 12722 Comm: syz-executor3 Not tainted 4.19.0-rc8+ #70
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
__read_once_size include/linux/compiler.h:188 [inline]
task_is_descendant.part.2+0x610/0x670 security/yama/yama_lsm.c:295
task_is_descendant security/yama/yama_lsm.c:282 [inline]
yama_ptrace_access_check+0x215/0x10fc security/yama/yama_lsm.c:371
security_ptrace_access_check+0x54/0xb0 security/security.c:260
__ptrace_may_access+0x564/0x950 kernel/ptrace.c:331
ptrace_attach+0x1fa/0x640 kernel/ptrace.c:379
__do_sys_ptrace kernel/ptrace.c:1130 [inline]
__se_sys_ptrace kernel/ptrace.c:1110 [inline]
__x64_sys_ptrace+0x229/0x260 kernel/ptrace.c:1110
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569
Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f7df12c4c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000065
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 0000000000457569
RDX: 0000000000000000 RSI: 0000000000000265 RDI: 0000000000000010
RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7df12c56d4
R13: 00000000004c30c5 R14: 00000000004d4ab8 R15: 00000000ffffffff

Allocated by task 8920:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
alloc_task_struct_node kernel/fork.c:157 [inline]
dup_task_struct kernel/fork.c:802 [inline]
copy_process+0x1ff4/0x8780 kernel/fork.c:1707
_do_fork+0x1cb/0x11d0 kernel/fork.c:2166
__do_sys_clone kernel/fork.c:2273 [inline]
__se_sys_clone kernel/fork.c:2267 [inline]
__x64_sys_clone+0xbf/0x150 kernel/fork.c:2267
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 18:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kmem_cache_free+0x83/0x290 mm/slab.c:3756
free_task_struct kernel/fork.c:162 [inline]
free_task+0x16e/0x1f0 kernel/fork.c:416
__put_task_struct+0x2e6/0x620 kernel/fork.c:689
put_task_struct include/linux/sched/task.h:96 [inline]
delayed_put_task_struct+0x2ff/0x4c0 kernel/exit.c:181
__rcu_reclaim kernel/rcu/rcu.h:236 [inline]
rcu_do_batch kernel/rcu/tree.c:2576 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline]
__rcu_process_callbacks kernel/rcu/tree.c:2847 [inline]
rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864
__do_softirq+0x30b/0xad8 kernel/softirq.c:292

The buggy address belongs to the object at ffff8801c4666640
which belongs to the cache task_struct(65:syz3) of size 6080
The buggy address is located 1248 bytes inside of
6080-byte region [ffff8801c4666640, ffff8801c4667e00)
The buggy address belongs to the page:
page:ffffea0007119980 count:1 mapcount:0 mapping:ffff8801c2336800 index:0x0
compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffffea00070a0008 ffffea0006080688 ffff8801c2336800
raw: 0000000000000000 ffff8801c4666640 0000000100000001 ffff8801bdb40c00
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff8801bdb40c00

Memory state around the buggy address:
ffff8801c4666a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c4666a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801c4666b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801c4666b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c4666c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.

Tetsuo Handa

unread,

Oct 21, 2018, 3:12:53 AM10/21/18

to se...@hallyn.com, Oleg Nesterov, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 2018/10/21 16:10, syzbot wrote:
> BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:188 [inline]
> BUG: KASAN: use-after-free in task_is_descendant.part.2+0x610/0x670 security/yama/yama_lsm.c:295
> Read of size 8 at addr ffff8801c4666b20 by task syz-executor3/12722
>
> CPU: 1 PID: 12722 Comm: syz-executor3 Not tainted 4.19.0-rc8+ #70
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
> print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
> kasan_report_error mm/kasan/report.c:354 [inline]
> kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
> __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
> __read_once_size include/linux/compiler.h:188 [inline]
> task_is_descendant.part.2+0x610/0x670 security/yama/yama_lsm.c:295

Do we need to hold

write_lock_irq(&tasklist_lock);

rather than

rcu_read_lock();

when accessing

"struct task_struct"->real_parent

?

Oleg Nesterov

unread,

Oct 22, 2018, 5:54:44 AM10/22/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

Well, if "task" is stable (can't exit), then I think

rcu_dereference(task->real_parent)

is fine, we know that ->real_parent did not pass exit_notif() yet.

However, task_is_descendant() looks unnecessarily complicated, it could be

static int task_is_descendant(struct task_struct *parent,
struct task_struct *child)
{
int rc = 0;
struct task_struct *walker;

if (!parent || !child)
return 0;

rcu_read_lock();
for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
if (same_thread_group(parent, walker)) {
rc = 1;
break;
}
rcu_read_unlock();

return rc;
}

And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?

Oleg.

Tetsuo Handa

unread,

Oct 22, 2018, 6:07:00 AM10/22/18

to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

OK.

>
> However, task_is_descendant() looks unnecessarily complicated, it could be
>
> static int task_is_descendant(struct task_struct *parent,
> struct task_struct *child)
> {
> int rc = 0;
> struct task_struct *walker;
>
> if (!parent || !child)
> return 0;
>
> rcu_read_lock();
> for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
> if (same_thread_group(parent, walker)) {
> rc = 1;
> break;
> }
> rcu_read_unlock();
>
> return rc;
> }
>
> And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
> task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?

Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
released. Do we still need pid_alive(child) ?

Oleg Nesterov

unread,

Oct 22, 2018, 9:46:38 AM10/22/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/22, Tetsuo Handa wrote:
>
> > However, task_is_descendant() looks unnecessarily complicated, it could be
> >
> > static int task_is_descendant(struct task_struct *parent,
> > struct task_struct *child)
> > {
> > int rc = 0;
> > struct task_struct *walker;
> >
> > if (!parent || !child)
> > return 0;
> >
> > rcu_read_lock();
> > for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
> > if (same_thread_group(parent, walker)) {
> > rc = 1;
> > break;
> > }
> > rcu_read_unlock();
> >
> > return rc;
> > }
> >
> > And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
> > task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?
>
> Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
> released. Do we still need pid_alive(child) ?

get_task_struct(child) can only ensure that this task_struct can't be freed.

Suppose that this child exits after get_task_struct(), then its real_parent exits
too and calls call_rcu(delayed_put_task_struct).

Now, when task_is_descendant() is called, rcu_read_lock() can happen after rcu gp,
iow child->parent can be already freed/reused/unmapped.

We need to ensure that child is still protected by RCU.

Oleg.

Tetsuo Handa

unread,

Oct 24, 2018, 10:15:53 PM10/24/18

to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

Oleg Nesterov wrote:
> On 10/22, Tetsuo Handa wrote:
> > > And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
> > > task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?
> >
> > Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
> > released. Do we still need pid_alive(child) ?
>
> get_task_struct(child) can only ensure that this task_struct can't be freed.

The report says that it is a use-after-free read at

walker = rcu_dereference(walker->real_parent);

which means that walker was already released.

>
> Suppose that this child exits after get_task_struct(), then its real_parent exits
> too and calls call_rcu(delayed_put_task_struct).
>
> Now, when task_is_descendant() is called, rcu_read_lock() can happen after rcu gp,
> iow child->parent can be already freed/reused/unmapped.
>
> We need to ensure that child is still protected by RCU.

I wonder whether pid_alive() test helps.

We can get

[ 40.620318] parent or walker is dead.
[ 40.624146] tracee is dead.

messages using below patch and reproducer.

----------
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99cfddd..0d9d786 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -385,6 +385,7 @@ static int ptrace_attach(struct task_struct *task, long request,
if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
goto out;

+ schedule_timeout_killable(HZ);
task_lock(task);
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
task_unlock(task);
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index ffda91a..a231ec6 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -283,6 +283,11 @@ static int task_is_descendant(struct task_struct *parent,
return 0;

rcu_read_lock();
+ if (!pid_alive(parent) || !pid_alive(walker)) {
+ rcu_read_unlock();
+ printk("parent or walker is dead.\n");
+ return 0;
+ }
if (!thread_group_leader(parent))
parent = rcu_dereference(parent->group_leader);
while (walker->pid > 0) {
@@ -315,6 +320,10 @@ static int ptracer_exception_found(struct task_struct *tracer,
bool found = false;

rcu_read_lock();
+ if (!pid_alive(tracee)) {
+ printk("tracee is dead.\n");
+ goto unlock;
+ }

/*
* If there's already an active tracing relationship, then make an
----------

----------
#include <unistd.h>
#include <sys/ptrace.h>
#include <poll.h>

int main(int argc, char *argv[])
{
if (fork() == 0) {
ptrace(PTRACE_ATTACH, getppid(), NULL, NULL);
_exit(0);
}
poll(NULL, 0, 100);
return 0;
}
----------

But since "child" has at least one reference, reading "child->real_parent" should
be safe. Therefore, I think that bailing out due to pid_is_alive(child) == false
(like above patch does) cannot avoid this problem...

Kees Cook

unread,

Oct 25, 2018, 4:19:49 AM10/25/18

to Oleg Nesterov, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

task_is_descendant() is called under rcu_read_lock() in both
ptracer_exception_found() and yama_ptrace_access_check() so I don't
understand how any of the tasks could get freed? This is walking
group_leader and real_parent -- are these not stable under rcu_lock()?

--
Kees Cook

Oleg Nesterov

unread,

Oct 25, 2018, 7:14:02 AM10/25/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/25, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > On 10/22, Tetsuo Handa wrote:
> > > > And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
> > > > task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?
> > >
> > > Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
> > > released. Do we still need pid_alive(child) ?
> >
> > get_task_struct(child) can only ensure that this task_struct can't be freed.
>
> The report says that it is a use-after-free read at
>
> walker = rcu_dereference(walker->real_parent);
>
> which means that walker was already released.

quite possibly I missed something, but I am not sure I understand your concerns...

So again, suppose that "child" is already dead. Its task_struct can't be freed,
but child->real_parent can point to the already freed memory.

This means that the 1st walker = rcu_dereference(walker->real_parent) is fine,
this simply reads the child->real_parent pointer, but on the second iteration

walker = rcu_dereference(walker->real_parent);

reads the alredy freed memory.

> I wonder whether pid_alive() test helps.
>
> We can get
>
> [ 40.620318] parent or walker is dead.
> [ 40.624146] tracee is dead.
>
> messages using below patch and reproducer.

again, I do not understand, this all looks correct...

> ----------
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 99cfddd..0d9d786 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -385,6 +385,7 @@ static int ptrace_attach(struct task_struct *task, long request,
> if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
> goto out;
>
> + schedule_timeout_killable(HZ);
> task_lock(task);
> retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> task_unlock(task);
> diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
> index ffda91a..a231ec6 100644
> --- a/security/yama/yama_lsm.c
> +++ b/security/yama/yama_lsm.c
> @@ -283,6 +283,11 @@ static int task_is_descendant(struct task_struct *parent,
> return 0;
>
> rcu_read_lock();
> + if (!pid_alive(parent) || !pid_alive(walker)) {
> + rcu_read_unlock();
> + printk("parent or walker is dead.\n");

This is what we need to do, except I think we should change yama_ptrace_access_check().
And iiuc parent == current, pid_alive(parent) looks unnecessary. Although we need to
check ptracer_exception_found(), may be it needs some changes too.

And yes, task_is_descendant() can hit the dead child, if nothing else it can
be killed. This can explain the kasan report.

> @@ -315,6 +320,10 @@ static int ptracer_exception_found(struct task_struct *tracer,
> bool found = false;
>
> rcu_read_lock();
> + if (!pid_alive(tracee)) {
> + printk("tracee is dead.\n");
> + goto unlock;

Sure, this is possible too.

> But since "child" has at least one reference, reading "child->real_parent" should
> be safe. Therefore, I think that bailing out due to pid_is_alive(child) == false
> (like above patch does) cannot avoid this problem...

Why?

OK. Lets ignore ptracer_exception_found() for the moment. Why do you think the
patch below can't help?

Oleg.

--- x/security/yama/yama_lsm.c
+++ x/security/yama/yama_lsm.c
@@ -368,7 +368,8 @@ static int yama_ptrace_access_check(stru
break;
case YAMA_SCOPE_RELATIONAL:
rcu_read_lock();
- if (!task_is_descendant(current, child) &&
+ if (!pid_alive(child) ||
+ !task_is_descendant(current, child) &&
!ptracer_exception_found(current, child) &&
!ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
rc = -EPERM;

Kees Cook

unread,

Oct 25, 2018, 7:36:15 AM10/25/18

to Oleg Nesterov, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

On Thu, Oct 25, 2018 at 12:13 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> So again, suppose that "child" is already dead. Its task_struct can't be freed,
> but child->real_parent can point to the already freed memory.

I can't find a path for "child" to be released. I see task_lock()
always called on it before it ends up in Yama.

(Well, I can't find the lock for switch_mm(), but I assume that's safe
because it's switching to the task.)

> This means that the 1st walker = rcu_dereference(walker->real_parent) is fine,
> this simply reads the child->real_parent pointer, but on the second iteration
>
> walker = rcu_dereference(walker->real_parent);
>
> reads the alredy freed memory.

What does rcu_read_lock() protect actually protect here? I thought
none of the task structs would be freed until after all
rcu_read_unlock() finished.

> OK. Lets ignore ptracer_exception_found() for the moment. Why do you think the
> patch below can't help?
>
> Oleg.
>
> --- x/security/yama/yama_lsm.c
> +++ x/security/yama/yama_lsm.c
> @@ -368,7 +368,8 @@ static int yama_ptrace_access_check(stru
> break;
> case YAMA_SCOPE_RELATIONAL:
> rcu_read_lock();
> - if (!task_is_descendant(current, child) &&
> + if (!pid_alive(child) ||
> + !task_is_descendant(current, child) &&
> !ptracer_exception_found(current, child) &&
> !ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
> rc = -EPERM;
>

Hm, documentation there says:
* If pid_alive fails, then pointers within the task structure
* can be stale and must not be dereferenced.

What is the safe pattern for checking vs rcu?

-Kees

--
Kees Cook

Tetsuo Handa

unread,

Oct 25, 2018, 7:47:52 AM10/25/18

to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 2018/10/25 20:13, Oleg Nesterov wrote:
> On 10/25, Tetsuo Handa wrote:
>>
>> Oleg Nesterov wrote:
>>> On 10/22, Tetsuo Handa wrote:
>>>>> And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
>>>>> task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?
>>>>
>>>> Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
>>>> released. Do we still need pid_alive(child) ?
>>>
>>> get_task_struct(child) can only ensure that this task_struct can't be freed.
>>
>> The report says that it is a use-after-free read at
>>
>> walker = rcu_dereference(walker->real_parent);
>>
>> which means that walker was already released.
>
> quite possibly I missed something, but I am not sure I understand your concerns...
>
> So again, suppose that "child" is already dead. Its task_struct can't be freed,
> but child->real_parent can point to the already freed memory.

Yes.

But if child->real_parent is pointing to the already freed memory,
why does pid_alive(child) == true help?

>
> This means that the 1st walker = rcu_dereference(walker->real_parent) is fine,
> this simply reads the child->real_parent pointer,

Yes.

> but on the second iteration
>
> walker = rcu_dereference(walker->real_parent);
>
> reads the alredy freed memory.

Yes.

There are two task_is_descendant() callers, and one of them is not passing current.

>
> And yes, task_is_descendant() can hit the dead child, if nothing else it can
> be killed. This can explain the kasan report.

The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
isn't it?

How can we check that that pointer is pointing to already freed memory? As soon as

walker = rcu_dereference(walker->real_parent);

is executed, task_alive(walker) will try to read from already freed memory...

Oleg Nesterov

unread,

Oct 25, 2018, 7:52:51 AM10/25/18

to Kees Cook, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

On 10/25, Kees Cook wrote:
>
> task_is_descendant() is called under rcu_read_lock() in both
> ptracer_exception_found() and yama_ptrace_access_check() so I don't
> understand how any of the tasks could get freed? This is walking
> group_leader and real_parent -- are these not stable under rcu_lock()?

group_leader/real_parent/etc are no longer rcu-protected after the exiting
child calls release_task() which in particular removes the child from
children/thread_group lists.

OK. Suppose you have an rcu-protected list, and each element also has a
reference counter so you can do something

struct elt {
atomic_t ctr;
struct list_head list;
int pid;
};

rcu_read_lock();
list_for_each_entry(elt, &LIST, list) {
if (elt->pid == 100) {
atomic_inc(&elt->ctr); // get_task_struct()
break;
}
}
rcu_read_unlock();

do_something(elt);

This code is fine. This elt can't be freed, you have a reference. But once
you drop rcu lock you can't trust elt->list.next! So, for example, you can
not do

rcu_read_lock();
list_for_each_entry_continue_rcu(elt, &LIST, list) {
...
}
rcu_read_unlock();

too late, elt.list.next can be already freed, or it can be freed while you
iterate the list.

Another simple example. Suppose you have a global PTR protected by rcu. So
ignoring the necessary rcu_dereference this code is fine:

rcu_read_lock();
if (ptr = PTR)
do_something(ptr);
rcu_read_unlcok();

But this is not:

ptr = PTR;
rcu_read_lock();
if (ptr)
do_something(ptr);
rcu_read_unlock();

basically the same thing...

Oleg.

Oleg Nesterov

unread,

Oct 25, 2018, 8:05:26 AM10/25/18

to Kees Cook, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

On 10/25, Kees Cook wrote:
>

> On Thu, Oct 25, 2018 at 12:13 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> > So again, suppose that "child" is already dead. Its task_struct can't be freed,
> > but child->real_parent can point to the already freed memory.
>
> I can't find a path for "child" to be released. I see task_lock()
> always called on it before it ends up in Yama.

Are you saying that yama_ptrace_access_check() is always called under
task_lock(child) ? Yes, it seems so

So what? Say, ptrace_attach() can hit a dead task. It should notice this
and fail later, but security_ptrace_access_check() is called before that.

> > This means that the 1st walker = rcu_dereference(walker->real_parent) is fine,
> > this simply reads the child->real_parent pointer, but on the second iteration
> >
> > walker = rcu_dereference(walker->real_parent);
> >
> > reads the alredy freed memory.
>
> What does rcu_read_lock() protect actually protect here? I thought
> none of the task structs would be freed until after all
> rcu_read_unlock() finished.

See another email I sent you a minute ago.

Oleg.

Oleg Nesterov

unread,

Oct 25, 2018, 8:17:15 AM10/25/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/25, Tetsuo Handa wrote:
>
> On 2018/10/25 20:13, Oleg Nesterov wrote:
> >
> > So again, suppose that "child" is already dead. Its task_struct can't be freed,
> > but child->real_parent can point to the already freed memory.
>
> Yes.
>
> But if child->real_parent is pointing to the already freed memory,
> why does pid_alive(child) == true help?

Hmm. Because pid_alive(child) == true && child->real_parent is freed must not
be possible? As long as we check pid_alive() under rcu_read_lock().

> >> @@ -283,6 +283,11 @@ static int task_is_descendant(struct task_struct *parent,
> >> return 0;
> >>
> >> rcu_read_lock();
> >> + if (!pid_alive(parent) || !pid_alive(walker)) {
> >> + rcu_read_unlock();
> >> + printk("parent or walker is dead.\n");
> >
> > This is what we need to do, except I think we should change yama_ptrace_access_check().
> > And iiuc parent == current, pid_alive(parent) looks unnecessary. Although we need to
> > check ptracer_exception_found(), may be it needs some changes too.
>
> There are two task_is_descendant() callers, and one of them is not passing current.

As I said below, please ignore ptracer_exception_found(), another caller for now,
perhaps it needs some changes too. I even have a vague feeling that I have already
blamed this function some time ago...

> > And yes, task_is_descendant() can hit the dead child, if nothing else it can
> > be killed. This can explain the kasan report.
>
> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
> isn't it?

Yes. and you know, I am all confused. I no longer can understand you :/

> How can we check that that pointer is pointing to already freed memory? As soon as
>
> walker = rcu_dereference(walker->real_parent);
>
> is executed, task_alive(walker) will try to read from already freed memory...

Of course we should not do it this way. The patch I sent doesn't...

Oleg.

Oleg Nesterov

unread,

Oct 25, 2018, 9:01:36 AM10/25/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/25, Oleg Nesterov wrote:
>
> As I said below, please ignore ptracer_exception_found(), another caller for now,
> perhaps it needs some changes too. I even have a vague feeling that I have already
> blamed this function some time ago...

Heh, yes, 3 years ago ;)

https://lore.kernel.org/lkml/20150106184...@redhat.com/

I can't understand my email today, but note that I tried to point out that
task_is_descendant() can dereference the freed mem.

And yes, task_is_descendant() is overcompicated for no reason, afaics.

Oleg.

Tetsuo Handa

unread,

Oct 25, 2018, 9:15:09 AM10/25/18

to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 2018/10/25 21:17, Oleg Nesterov wrote:
>>> And yes, task_is_descendant() can hit the dead child, if nothing else it can
>>> be killed. This can explain the kasan report.
>>
>> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
>> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
>> isn't it?
>
> Yes. and you know, I am all confused. I no longer can understand you :/

Why don't we need to check every time like shown below?
Why checking only once is sufficient?

--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -285,7 +285,7 @@ static int task_is_descendant(struct task_struct *parent,
rcu_read_lock();

if (!thread_group_leader(parent))
parent = rcu_dereference(parent->group_leader);

- while (walker->pid > 0) {
+ while (pid_alive(walker) && walker->pid > 0) {
if (!thread_group_leader(walker))
walker = rcu_dereference(walker->group_leader);
if (walker == parent) {

Oleg Nesterov

unread,

Oct 25, 2018, 11:55:10 AM10/25/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/25, Tetsuo Handa wrote:
>

> On 2018/10/25 21:17, Oleg Nesterov wrote:
> >>> And yes, task_is_descendant() can hit the dead child, if nothing else it can
> >>> be killed. This can explain the kasan report.
> >>
> >> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
> >> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
> >> isn't it?
> >
> > Yes. and you know, I am all confused. I no longer can understand you :/
>
> Why don't we need to check every time like shown below?
> Why checking only once is sufficient?

Why do you think it is not sufficient?

Again, I can be easily wrong, rcu is not simple, but so far I think we need
a single check at the start.

> --- a/security/yama/yama_lsm.c
> +++ b/security/yama/yama_lsm.c
> @@ -285,7 +285,7 @@ static int task_is_descendant(struct task_struct *parent,
> rcu_read_lock();
> if (!thread_group_leader(parent))
> parent = rcu_dereference(parent->group_leader);
> - while (walker->pid > 0) {
> + while (pid_alive(walker) && walker->pid > 0) {

OK. To simplify, ets suppose that task_is_descendant() is called with tasklist
lock held. And lets suppose that all tasks are single-threaded.

Then we obviously need a single check at the start, we need to ensure that the
child was not removed from its ->real_parent->children list. The latter means
that if ->real_parent exits, the child will be re-parented and its ->real_parent
will be updated.

So we could do

read_lock(tasklist);

if (list_empty(child->sibling))
// it is dead, removed from ->children list, we can't trust
// child->real_parent
return -EWHATEVER;

task_is_descendant(current, child);

But note that we can safely use pid_alive(child) instead, detach_pid() and
list_del_init(&p->sibling) happen "at the same time" since we hold tasklist.

(And btw, I suggested several times to rename it, or add another helper with
a better name. Note also that we could check, say, ->sighand != NULL with
the same effect.)

Now. Why do you think rcu_read_lock() differs in that we need to check
pid_alive() at every step?

Suppose that one of the grand parents exits, and it is going to be freed. Again,
to (over)simplify the things, lets suppose that release_task() does

synchronize_rcu();
free_task(p);

at the end. Now, can

rcu_read_lock();
if (pid_alive(child)) {
while (child->pid)
child = child->real_parent;
}
rcu_read_unlock();

hit the already freed ->real_parent ? Say, the freed child->real_parent->real_parent.

Lets denote P1 = child->real_parent, P2 = P1->real_parent. Can P2 be already freed?

This is only possible if synchronize_rcu() above was called before rcu_read_lock(),
see the last sentence below.

If P1->real_parent is still P2, then P1 has already exited too. And we still observe
that child->real_parent == P1, this too is only possible if child has exited, so we
must see pid_alive() == F.

Why must we see pid_alive() == F without tasklist? It must be true, release_task()
is serialized by tasklist_lock, but why we can't get the stale value under
rcu_read_lock() ?

Because our rcu read-lock critical section extends beyond the return from
synchronize_rcu(), and thus we must have a full memory barrier _between_
that synchronize_rcu() and our rcu_read_lock(). We must see all memory updates,
including thread_pid = NULL which makes pid_alive() == F.

Do you see any hole?

Oleg.

Oleg Nesterov

unread,

Oct 25, 2018, 12:26:00 PM10/25/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/25, Oleg Nesterov wrote:
>

> Because our rcu read-lock critical section extends beyond the return from
> synchronize_rcu(), and thus we must have a full memory barrier _between_
> that synchronize_rcu() and our rcu_read_lock(). We must see all memory updates,
> including thread_pid = NULL which makes pid_alive() == F.

In case I was not clear....

Suppose we have int X = 0. If some CPU does

X = 1;
synchronize_rcu();

and another CPU does

rcu_read_lock();
x = X;
rcu_read_unlock();

then x should be == 1 in case when rcu_read_unlock() happens _after_ return
from synchronize_rcu().

Oleg.

Tetsuo Handa

unread,

Oct 26, 2018, 8:24:03 AM10/26/18

to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 2018/10/26 0:55, Oleg Nesterov wrote:
> On 10/25, Tetsuo Handa wrote:
>>
>> On 2018/10/25 21:17, Oleg Nesterov wrote:
>>>>> And yes, task_is_descendant() can hit the dead child, if nothing else it can
>>>>> be killed. This can explain the kasan report.
>>>>
>>>> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
>>>> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
>>>> isn't it?
>>>
>>> Yes. and you know, I am all confused. I no longer can understand you :/
>>
>> Why don't we need to check every time like shown below?
>> Why checking only once is sufficient?
>
> Why do you think it is not sufficient?
>
> Again, I can be easily wrong, rcu is not simple, but so far I think we need
> a single check at the start.
>

Hmm, this report is difficult to guess what happened.

Since the "child" passed to task_is_descendant() has at least one reference
count taken by find_get_task_by_vpid(), rcu_dereference(walker->real_parent)
in the first iteration

while (child->pid > 0) {
if (!thread_group_leader(child))
walker = rcu_dereference(child->group_leader);
if (walker == parent) {
rc = 1;
break;
}
walker = rcu_dereference(walker->real_parent);
}

must not trigger use-after-free bug. Thus, when this use-after-free was
detected at rcu_dereference(walker->real_parent), the memory pointed by
"walker" must have been released between

while (walker->pid > 0) {

if (!thread_group_leader(walker))
walker = rcu_dereference(walker->group_leader);

and

walker = rcu_dereference(walker->real_parent);
}

because otherwise use-after-free would have been reported at walker->pid
or thread_group_leader(walker) or rcu_dereference(walker->group_leader).

Is my understanding correct?

Then, what pid_alive(child) is testing? It is not memory pointed by "child" but
memory pointed by "walker" (i.e. parent of "child" or parent of parent of "child"
or ... ) which is triggering use-after-free.

Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
when p2 tried to attach on p1, p2->real_parent was pointing to already
(or about to be) freed p1.

Even if pid_alive(p2) test can guarantee that p1 won't be released,
how can pid_alive(p3) test guarantee that p1 won't be released?
p1 can be released any moment because it has already waited for RCU
grace period, can't it?

ptrace(PTRACE_ATTACH, vpid_of_p2) {
p2 = find_get_task_by_vpid(vpid_of_p2);
ptrace_attach(p2, PTRACE_ATTACH, addr, data) {
mutex_lock_interruptible(&p2->signal->cred_guard_mutex);
// p1 starts exit()ing here.
task_lock(p2);
__ptrace_may_access(p2) {
// p2->real_parent starts pointing to already freed p1.
security_ptrace_access_check(p2, PTRACE_MODE_ATTACH) {
yama_ptrace_access_check() {
task_is_descendant(current, p2) {
walker = p2;
rcu_read_lock();
if (pid_alive(p2)) { // If true
if (p2->pid > 0) { // will be true
p1 = rcu_dereference(p2->real_parent); // might be OK due to pid_alive(p2) == true?
}
}
rcu_read_unlock();
}
}
}
}
task_unlock(p2);
mutex_unlock(&p2->signal->cred_guard_mutex);
}
put_task_struct(p2);
}

ptrace(PTRACE_ATTACH, vpid_of_p3) {
p3 = find_get_task_by_vpid(vpid_of_p3);
ptrace_attach(p3, PTRACE_ATTACH, addr, data) {
mutex_lock_interruptible(&p3->signal->cred_guard_mutex);
// p1 starts exit()ing here.
task_lock(p3);
__ptrace_may_access(p3) {
// p2->real_parent starts pointing to already freed p1.
security_ptrace_access_check(p3, PTRACE_MODE_ATTACH) {
yama_ptrace_access_check() {
task_is_descendant(current, p3) {
walker = p3;
rcu_read_lock();
if (pid_alive(p3)) { // If true
if (p3->pid > 0) { // will be true
p2 = rcu_dereference(p3->real_parent); // will be OK if above assumption is OK.
if (p2->pid > 0) { // will be true
p1 = rcu_dereference(p2->real_parent); // will read already (or about to be) freed p1 address
if (p1->pid > 0) { // Oops here or
if (!thread_group_leader(p1)) // oops here or
p1 = rcu_dereference(p1->group_leader); // oops here or
p0 = rcu_dereference(p1->real_parent); // oops here, or not oops because releasing after this
}
}
}
}
rcu_read_unlock();
}
}
}
}
task_unlock(p3);
mutex_unlock(&p3->signal->cred_guard_mutex);
}
put_task_struct(p3);
}

Oleg Nesterov

unread,

Oct 26, 2018, 9:04:47 AM10/26/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/26, Tetsuo Handa wrote:
>
> Since the "child" passed to task_is_descendant() has at least one reference
> count taken by find_get_task_by_vpid(), rcu_dereference(walker->real_parent)
> in the first iteration
>
> while (child->pid > 0) {
> if (!thread_group_leader(child))
> walker = rcu_dereference(child->group_leader);
> if (walker == parent) {
> rc = 1;
> break;
> }
> walker = rcu_dereference(walker->real_parent);
> }
>
> must not trigger use-after-free bug.

Yes,

> Thus, when this use-after-free was
> detected at rcu_dereference(walker->real_parent), the memory pointed by
> "walker" must have been released between
>
> while (walker->pid > 0) {
> if (!thread_group_leader(walker))
> walker = rcu_dereference(walker->group_leader);
>
> and
>
> walker = rcu_dereference(walker->real_parent);
> }

this is possible too, rcu callback which frees the task_struct can run
int between

> because otherwise use-after-free would have been reported at walker->pid
> or thread_group_leader(walker) or rcu_dereference(walker->group_leader).

Well, I do not know how much kasan is "percise", but if it is "perfect"
then you are right,

> Then, what pid_alive(child) is testing?

I tried to explain this in my previous email. I even mentioned that we could
do another check, say, ->sighand != NULL, or list_empty(child->sibling) with
the same effect.

> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
> when p2 tried to attach on p1, p2->real_parent was pointing to already
> (or about to be) freed p1.

No, p2->real_parent will be updated. If p1 exits it will re-parent its
children including p2.

Again, did you read my previous email?

Let me repeat, of course I can be wrong. But so far I am answering the
same questions again and again...

Oleg.

Tetsuo Handa

unread,

Oct 26, 2018, 9:51:59 AM10/26/18

to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 2018/10/26 22:04, Oleg Nesterov wrote:
>> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
>> when p2 tried to attach on p1, p2->real_parent was pointing to already
>> (or about to be) freed p1.
>
> No, p2->real_parent will be updated. If p1 exits it will re-parent its
> children including p2.

My error.

Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited

when someone tried to attach on p2, p2->real_parent was pointing to already

(or about to be) freed p1.

So, the puzzle part is why p2->real_parent was still pointing p1 even after
p1 was freed...

>
> Again, did you read my previous email?

Yes. But I still can't be convinced that pid_alive() test helps.

Oleg Nesterov

unread,

Oct 26, 2018, 10:39:43 AM10/26/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/26, Tetsuo Handa wrote:
>

> On 2018/10/26 22:04, Oleg Nesterov wrote:
> >> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
> >> when p2 tried to attach on p1, p2->real_parent was pointing to already
> >> (or about to be) freed p1.
> >
> > No, p2->real_parent will be updated. If p1 exits it will re-parent its
> > children including p2.
>
> My error.
>
> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
> when someone tried to attach on p2, p2->real_parent was pointing to already
> (or about to be) freed p1.

I don't see a difference.

If p1 exits it will re-parent p2, p2->real_parent will be updated.

> So, the puzzle part is why p2->real_parent was still pointing p1 even after
> p1 was freed...

I don't understand the question.

Once again. TASK->real_parent can point to the freed mem only if a) TASK exits,
and b) _after_ that its parent TASK->real_parent exits too.

> > Again, did you read my previous email?
>
> Yes. But I still can't be convinced that pid_alive() test helps.

Well, I don't understand which part of my explanations is not clear to you.

OR. Perhaps I am wrong and do not understand your concerns.

Oleg.

Tetsuo Handa

unread,

Oct 26, 2018, 11:04:44 AM10/26/18

to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 2018/10/26 23:39, Oleg Nesterov wrote:
> On 10/26, Tetsuo Handa wrote:
>> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
>> when someone tried to attach on p2, p2->real_parent was pointing to already
>> (or about to be) freed p1.
>
> I don't see a difference.
>
> If p1 exits it will re-parent p2, p2->real_parent will be updated.
>
>> So, the puzzle part is why p2->real_parent was still pointing p1 even after
>> p1 was freed...
>
> I don't understand the question.
>
> Once again. TASK->real_parent can point to the freed mem only if a) TASK exits,
> and b) _after_ that its parent TASK->real_parent exits too.

Oh, p2 exited and then p1 also exited when someone tried to attach on p2.
Then, p2->real_parent can point to already (or about to be) freed p1.

>
>>> Again, did you read my previous email?
>>
>> Yes. But I still can't be convinced that pid_alive() test helps.
>
> Well, I don't understand which part of my explanations is not clear to you.

OK. Checking pid_alive() should help.

(By the way, if p->real_parent were updated to point to init_task when p exits,
we could omit pid_alive() check?)

Oleg Nesterov

unread,

Oct 26, 2018, 11:22:14 AM10/26/18

to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com

On 10/27, Tetsuo Handa wrote:
>
> On 2018/10/26 23:39, Oleg Nesterov wrote:
> > On 10/26, Tetsuo Handa wrote:
> >> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
> >> when someone tried to attach on p2, p2->real_parent was pointing to already
> >> (or about to be) freed p1.
> >
> > I don't see a difference.
> >
> > If p1 exits it will re-parent p2, p2->real_parent will be updated.
> >
> >> So, the puzzle part is why p2->real_parent was still pointing p1 even after
> >> p1 was freed...
> >
> > I don't understand the question.
> >
> > Once again. TASK->real_parent can point to the freed mem only if a) TASK exits,
> > and b) _after_ that its parent TASK->real_parent exits too.
>
> Oh, p2 exited and then p1 also exited when someone tried to attach on p2.
> Then, p2->real_parent can point to already (or about to be) freed p1.

Then we must see pid_alive(p2) == F as I already explained you in my
yesterday's email.

Because if p1 exits _after_ p2, then pid_alive(p2) == F must be already
completed, p1 can't exit (I mean, release_task(p1) can't be called) until
release_task(p2) does detach_pid() and drops tasklist_lock.

> (By the way, if p->real_parent were updated to point to init_task when p exits,
> we could omit pid_alive() check?)

Sorry, I don't understand the question...

Ignoring the PR_SET_CHILD_SUBREAPER parents, ignoring the exiting sub-threads
which reparent to group leader, ->real_parent is alwasy updated to point to
init_task. But I don't see why we could omit pid_alive() check.

Oleg.

Kees Cook

unread,

Oct 26, 2018, 12:09:22 PM10/26/18

to Oleg Nesterov, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

On Thu, Oct 25, 2018 at 2:01 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> On 10/25, Oleg Nesterov wrote:
>> perhaps it needs some changes too. I even have a vague feeling that I have already
>> blamed this function some time ago...
>
> Heh, yes, 3 years ago ;)
>
> https://lore.kernel.org/lkml/20150106184...@redhat.com/
>
> I can't understand my email today, but note that I tried to point out that
> task_is_descendant() can dereference the freed mem.

Instead of:

while (walker->pid > 0) {

should it simply be "while (pid_liave(walker)) {"? And add a
pid_alive(parent) after rcu_read_lock()?

> And yes, task_is_descendant() is overcompicated for no reason, afaics.

Yeah, agreed. I'll fix this up. Just to make sure I'm not crazy: the
real_parent of all tasks in a thread group are the same, yes? The
trouble I was trying to deal with for the complication was where a
non-leader thread would add an exception to the checking, and the
tasks wouldn't match. (As far as I can see, though, using
same_thread_group() should fix it.)

-Kees

--
Kees Cook

Oleg Nesterov

unread,

Oct 29, 2018, 8:24:02 AM10/29/18

to Kees Cook, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

On 10/26, Kees Cook wrote:
>
> On Thu, Oct 25, 2018 at 2:01 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> > On 10/25, Oleg Nesterov wrote:
> >> perhaps it needs some changes too. I even have a vague feeling that I have already
> >> blamed this function some time ago...
> >
> > Heh, yes, 3 years ago ;)
> >
> > https://lore.kernel.org/lkml/20150106184...@redhat.com/
> >
> > I can't understand my email today, but note that I tried to point out that
> > task_is_descendant() can dereference the freed mem.
>
> Instead of:
>
> while (walker->pid > 0) {
>
> should it simply be "while (pid_liave(walker)) {"?

No, this would be wrong. Probably walker->pid > 0 is not the best check,
but we do not need to change it for correctness.

> And add a
> pid_alive(parent) after rcu_read_lock()?

So you too do not read my emails ;)

I still think we need a single pid_alive() check and I even sent the patch.
Attached again at the end.

To clarify, let me repeat that ptracer_exception_found() may need some fixes
too, right now I am only talking about task_is_descendant().

> > And yes, task_is_descendant() is overcompicated for no reason, afaics.
>
> Yeah, agreed. I'll fix this up.

I have already posted this code, this is what I think it should do.

static int task_is_descendant(struct task_struct *parent,

struct task_struct *child)
{
struct task_struct *walker;

for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent)) {
if (same_thread_group(parent, walker))

return 1;
}

return 0;
}

This version differs in that I removed the parent/child != NULL at the start
and rcu_read_lock(), it should be held by the caller anyway.

> Just to make sure I'm not crazy: the
> real_parent of all tasks in a thread group are the same, yes?

Well, yes and no. So if

same_thread_group(t1, t2) == T

then
same_thread_group(t1->real_parent, t2->real_parent) == T

which means that real_parent of all tasks in a thread group is the same
_process_.

But t1->real_parent and t2->real_parent are not necessarily the same task.

Oleg Nesterov

unread,

Oct 29, 2018, 11:05:53 AM10/29/18

to Kees Cook, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

let me change the subject to avoid the confusion with the already confusing
disccussion about task_is_descendant().

On 10/29, Oleg Nesterov wrote:
>
> I still think we need a single pid_alive() check and I even sent the patch.
> Attached again at the end.
>
> To clarify, let me repeat that ptracer_exception_found() may need some fixes
> too, right now I am only talking about task_is_descendant().

so yes, the ptracer_relations code looks very broken to me, but perhaps I
misread this code, please correct me.

RCU can only protect the ptracer_relations list itself, you can do nothing
with (say) relation->tracer. relation->tracer can be already freed when
ptracer_exception_found() checks relation->tracee == tracee.

Not only pid_alive(parent) can not help in this case, pid_alive(parent) is
equally unsafe because, again, this memory can be freed.

security_task_free(tsk) is called right before free_task(tsk), there is no
a gp pass in between, and of course we can't rely on the ->invalid check.

_At first glance_ we can fix this if we simply turn both ->tracer/tracee
pointers into "signal_struct *", then we can turn all same_thread_group()'s
into walker->signal == parent which doesn't need to dereference the possibly-
freed parent. This also allows to remove all thread_group_leader() checks.
We need to ensure that false-positive is not possible (if, say, ->tracer
was already re-allocated and points to another task->signal), but this
doesn't look difficult.

Oleg.

syzbot

unread,

Nov 9, 2018, 10:25:04 PM11/9/18

to jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, ol...@redhat.com, penguin...@i-love.sakura.ne.jp, penguin...@i-love.sakura.ne.jp, se...@hallyn.com, syzkall...@googlegroups.com

syzbot has found a reproducer for the following crash on:

HEAD commit: 3541833fd1f2 Merge tag 's390-4.20-2' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16d3cad5400000
kernel config: https://syzkaller.appspot.com/x/.config?x=8f559fee2fc3375a
dashboard link: https://syzkaller.appspot.com/bug?extid=a9ac39bf55329e206219
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1494bb0b400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a9ac39...@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_UP): veth1: link is not ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:182
[inline]
BUG: KASAN: use-after-free in task_is_descendant.part.3+0x610/0x670
security/yama/yama_lsm.c:295
Read of size 8 at addr ffff8801c46f24e0 by task syz-executor2/9973

CPU: 0 PID: 9973 Comm: syz-executor2 Not tainted 4.20.0-rc1+ #107

Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]

dump_stack+0x244/0x39d lib/dump_stack.c:113
print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
__read_once_size include/linux/compiler.h:182 [inline]
task_is_descendant.part.3+0x610/0x670 security/yama/yama_lsm.c:295
task_is_descendant security/yama/yama_lsm.c:282 [inline]
yama_ptrace_access_check+0x215/0x10fc security/yama/yama_lsm.c:371
security_ptrace_access_check+0x54/0xb0 security/security.c:271
__ptrace_may_access+0x5c8/0x980 kernel/ptrace.c:336
ptrace_attach+0x1fa/0x640 kernel/ptrace.c:389
__do_sys_ptrace kernel/ptrace.c:1139 [inline]
__se_sys_ptrace kernel/ptrace.c:1119 [inline]
__x64_sys_ptrace+0x229/0x260 kernel/ptrace.c:1119
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569
Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f2ed4dbfc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000065
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000000457569
RDX: 0000000000000000 RSI: 000000000000021d RDI: 0000000000004206
RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2ed4dc06d4
R13: 00000000004c33bd R14: 00000000004d50e0 R15: 00000000ffffffff

Allocated by task 5873:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
alloc_task_struct_node kernel/fork.c:158 [inline]
dup_task_struct kernel/fork.c:843 [inline]
copy_process+0x2026/0x87a0 kernel/fork.c:1751
_do_fork+0x1cb/0x11d0 kernel/fork.c:2216
__do_sys_clone kernel/fork.c:2323 [inline]
__se_sys_clone kernel/fork.c:2317 [inline]
__x64_sys_clone+0xbf/0x150 kernel/fork.c:2317
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kmem_cache_free+0x83/0x290 mm/slab.c:3760
free_task_struct kernel/fork.c:163 [inline]
free_task+0x16e/0x1f0 kernel/fork.c:457
__put_task_struct+0x2e6/0x620 kernel/fork.c:730
put_task_struct include/linux/sched/task.h:96 [inline]
delayed_put_task_struct+0x2ff/0x4c0 kernel/exit.c:181
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2437 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
__do_softirq+0x308/0xb7e kernel/softirq.c:292

The buggy address belongs to the object at ffff8801c46f2000
which belongs to the cache task_struct(65:syz2) of size 6080
The buggy address is located 1248 bytes inside of
6080-byte region [ffff8801c46f2000, ffff8801c46f37c0)
The buggy address belongs to the page:
page:ffffea000711bc80 count:1 mapcount:0 mapping:ffff8801c204fc80 index:0x0
compound_mapcount: 0
flags: 0x2fffc0000010200(slab|head)
raw: 02fffc0000010200 ffffea00073cf808 ffffea000697c908 ffff8801c204fc80
raw: 0000000000000000 ffff8801c46f2000 0000000100000001 ffff8801d8404a80
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff8801d8404a80

Memory state around the buggy address:
ffff8801c46f2380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c46f2400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801c46f2480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801c46f2500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c46f2580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

syzbot

unread,

Nov 10, 2018, 6:46:05 AM11/10/18

to jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, ol...@redhat.com, penguin...@i-love.sakura.ne.jp, penguin...@i-love.sakura.ne.jp, se...@hallyn.com, syzkall...@googlegroups.com

syzbot has found a reproducer for the following crash on:

HEAD commit: aa4330e15c26 Merge tag 'devicetree-fixes-for-4.20-2' of gi..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17abaa47400000

kernel config: https://syzkaller.appspot.com/x/.config?x=8f559fee2fc3375a
dashboard link: https://syzkaller.appspot.com/bug?extid=a9ac39bf55329e206219
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

userspace arch: i386
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=108ae6d5400000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=17bd86a3400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a9ac39...@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:182
[inline]
BUG: KASAN: use-after-free in task_is_descendant.part.3+0x610/0x670
security/yama/yama_lsm.c:295

Read of size 8 at addr ffff8801ae718560 by task syz-executor198/4607

CPU: 1 PID: 4607 Comm: syz-executor198 Not tainted 4.20.0-rc1+ #232

Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x244/0x39d lib/dump_stack.c:113
print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
__read_once_size include/linux/compiler.h:182 [inline]
task_is_descendant.part.3+0x610/0x670 security/yama/yama_lsm.c:295
task_is_descendant security/yama/yama_lsm.c:282 [inline]
yama_ptrace_access_check+0x215/0x10fc security/yama/yama_lsm.c:371
security_ptrace_access_check+0x54/0xb0 security/security.c:271
__ptrace_may_access+0x5c8/0x980 kernel/ptrace.c:336
ptrace_attach+0x1fa/0x640 kernel/ptrace.c:389

__do_compat_sys_ptrace kernel/ptrace.c:1284 [inline]
__se_compat_sys_ptrace kernel/ptrace.c:1266 [inline]
__ia32_compat_sys_ptrace+0x1d2/0x260 kernel/ptrace.c:1266
do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
do_fast_syscall_32+0x34d/0xfb2 arch/x86/entry/common.c:397
entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7feba29
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000f7fe71ec EFLAGS: 00000296 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 0000000000004206 RCX: 00000000000014b5
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 5668:

save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
alloc_task_struct_node kernel/fork.c:158 [inline]
dup_task_struct kernel/fork.c:843 [inline]
copy_process+0x2026/0x87a0 kernel/fork.c:1751
_do_fork+0x1cb/0x11d0 kernel/fork.c:2216

__do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
__se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
__ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
do_fast_syscall_32+0x34d/0xfb2 arch/x86/entry/common.c:397
entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

Freed by task 16:

save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kmem_cache_free+0x83/0x290 mm/slab.c:3760
free_task_struct kernel/fork.c:163 [inline]
free_task+0x16e/0x1f0 kernel/fork.c:457
__put_task_struct+0x2e6/0x620 kernel/fork.c:730
put_task_struct include/linux/sched/task.h:96 [inline]
delayed_put_task_struct+0x2ff/0x4c0 kernel/exit.c:181
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2437 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
__do_softirq+0x308/0xb7e kernel/softirq.c:292

The buggy address belongs to the object at ffff8801ae718080
which belongs to the cache task_struct of size 6080

The buggy address is located 1248 bytes inside of

6080-byte region [ffff8801ae718080, ffff8801ae719840)

The buggy address belongs to the page:

page:ffffea0006b9c600 count:1 mapcount:0 mapping:ffff8801da970180 index:0x0

compound_mapcount: 0
flags: 0x2fffc0000010200(slab|head)

raw: 02fffc0000010200 ffffea0006b99488 ffffea0006b9c688 ffff8801da970180
raw: 0000000000000000 ffff8801ae718080 0000000100000001 0000000000000000

page dumped because: kasan: bad access detected

Memory state around the buggy address:

ffff8801ae718400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801ae718480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801ae718500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801ae718580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801ae718600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Tetsuo Handa

unread,

Jan 10, 2019, 6:05:55 AM1/10/19

to Oleg Nesterov, Kees Cook, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

Hello, Kees.

syzbot is hitting this problem as of linux-next-20190110.
When a patch will be proposed?

Kees Cook

unread,

Jan 10, 2019, 1:48:09 PM1/10/19

to Tetsuo Handa, Oleg Nesterov, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkaller-bugs

On Thu, Jan 10, 2019 at 3:06 AM Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
> syzbot is hitting this problem as of linux-next-20190110.
> When a patch will be proposed?

Hi! Sorry, this got delayed over the holidays. Let me finish the patch
I was working on and get it published.

Thanks!

--
Kees Cook

Oleg Nesterov

unread,

Jan 16, 2019, 12:40:09 PM1/16/19

to Tetsuo Handa, Kees Cook, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com

On 01/10, Tetsuo Handa wrote:
>
> syzbot is hitting this problem as of linux-next-20190110.
> When a patch will be proposed?

Well. I have already suggested the patch below several times. It won't fix all
problems in this code (I forgot the details but iirc ptracer_exception_found()
is broken too, task_is_descendant() should be cleanuped, may be something else),
but probably this trivial fix makes sense for the start?

Reply all

Reply to author

Forward

KASAN: use-after-free Read in task_is_descendant

syzbot

Tetsuo Handa

Oleg Nesterov

Tetsuo Handa

Oleg Nesterov

Tetsuo Handa

Kees Cook

Oleg Nesterov

Kees Cook

Tetsuo Handa

Oleg Nesterov

Oleg Nesterov

Oleg Nesterov

Oleg Nesterov

Tetsuo Handa

Oleg Nesterov

Oleg Nesterov

Tetsuo Handa

Oleg Nesterov

Tetsuo Handa

Oleg Nesterov

Tetsuo Handa

Oleg Nesterov

Kees Cook

Oleg Nesterov

Oleg Nesterov

syzbot

syzbot

Tetsuo Handa

Kees Cook

Oleg Nesterov