KASAN: use-after-free Read in task_is_descendant

39 views
Skip to first unread message

syzbot

unread,
Oct 21, 2018, 3:10:03 AM10/21/18
to jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, se...@hallyn.com, syzkall...@googlegroups.com
Hello,

syzbot found the following crash on:

HEAD commit: 270b77a0f30e Merge tag 'drm-fixes-2018-10-20-1' of git://a..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=116f4ad9400000
kernel config: https://syzkaller.appspot.com/x/.config?x=b3f55cb3dfcc6c33
dashboard link: https://syzkaller.appspot.com/bug?extid=a9ac39bf55329e206219
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a9ac39...@syzkaller.appspotmail.com

binder: 12719:12721 got transaction to context manager from process owning
it
binder: BINDER_SET_CONTEXT_MGR already set
binder: 12719:12724 ioctl 40046207 0 returned -16
binder: 12719:12721 BC_ACQUIRE_DONE u0000000000000000 no match
==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:188
[inline]
BUG: KASAN: use-after-free in task_is_descendant.part.2+0x610/0x670
security/yama/yama_lsm.c:295
Read of size 8 at addr ffff8801c4666b20 by task syz-executor3/12722

CPU: 1 PID: 12722 Comm: syz-executor3 Not tainted 4.19.0-rc8+ #70
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
__read_once_size include/linux/compiler.h:188 [inline]
task_is_descendant.part.2+0x610/0x670 security/yama/yama_lsm.c:295
task_is_descendant security/yama/yama_lsm.c:282 [inline]
yama_ptrace_access_check+0x215/0x10fc security/yama/yama_lsm.c:371
security_ptrace_access_check+0x54/0xb0 security/security.c:260
__ptrace_may_access+0x564/0x950 kernel/ptrace.c:331
ptrace_attach+0x1fa/0x640 kernel/ptrace.c:379
__do_sys_ptrace kernel/ptrace.c:1130 [inline]
__se_sys_ptrace kernel/ptrace.c:1110 [inline]
__x64_sys_ptrace+0x229/0x260 kernel/ptrace.c:1110
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569
Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f7df12c4c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000065
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 0000000000457569
RDX: 0000000000000000 RSI: 0000000000000265 RDI: 0000000000000010
RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7df12c56d4
R13: 00000000004c30c5 R14: 00000000004d4ab8 R15: 00000000ffffffff

Allocated by task 8920:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
alloc_task_struct_node kernel/fork.c:157 [inline]
dup_task_struct kernel/fork.c:802 [inline]
copy_process+0x1ff4/0x8780 kernel/fork.c:1707
_do_fork+0x1cb/0x11d0 kernel/fork.c:2166
__do_sys_clone kernel/fork.c:2273 [inline]
__se_sys_clone kernel/fork.c:2267 [inline]
__x64_sys_clone+0xbf/0x150 kernel/fork.c:2267
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 18:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kmem_cache_free+0x83/0x290 mm/slab.c:3756
free_task_struct kernel/fork.c:162 [inline]
free_task+0x16e/0x1f0 kernel/fork.c:416
__put_task_struct+0x2e6/0x620 kernel/fork.c:689
put_task_struct include/linux/sched/task.h:96 [inline]
delayed_put_task_struct+0x2ff/0x4c0 kernel/exit.c:181
__rcu_reclaim kernel/rcu/rcu.h:236 [inline]
rcu_do_batch kernel/rcu/tree.c:2576 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline]
__rcu_process_callbacks kernel/rcu/tree.c:2847 [inline]
rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864
__do_softirq+0x30b/0xad8 kernel/softirq.c:292

The buggy address belongs to the object at ffff8801c4666640
which belongs to the cache task_struct(65:syz3) of size 6080
The buggy address is located 1248 bytes inside of
6080-byte region [ffff8801c4666640, ffff8801c4667e00)
The buggy address belongs to the page:
page:ffffea0007119980 count:1 mapcount:0 mapping:ffff8801c2336800 index:0x0
compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffffea00070a0008 ffffea0006080688 ffff8801c2336800
raw: 0000000000000000 ffff8801c4666640 0000000100000001 ffff8801bdb40c00
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff8801bdb40c00

Memory state around the buggy address:
ffff8801c4666a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c4666a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801c4666b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801c4666b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c4666c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.

Tetsuo Handa

unread,
Oct 21, 2018, 3:12:53 AM10/21/18
to se...@hallyn.com, Oleg Nesterov, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 2018/10/21 16:10, syzbot wrote:
> BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:188 [inline]
> BUG: KASAN: use-after-free in task_is_descendant.part.2+0x610/0x670 security/yama/yama_lsm.c:295
> Read of size 8 at addr ffff8801c4666b20 by task syz-executor3/12722
>
> CPU: 1 PID: 12722 Comm: syz-executor3 Not tainted 4.19.0-rc8+ #70
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
>  print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>  __read_once_size include/linux/compiler.h:188 [inline]
>  task_is_descendant.part.2+0x610/0x670 security/yama/yama_lsm.c:295

Do we need to hold

write_lock_irq(&tasklist_lock);

rather than

rcu_read_lock();

when accessing

"struct task_struct"->real_parent

?

Oleg Nesterov

unread,
Oct 22, 2018, 5:54:44 AM10/22/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
Well, if "task" is stable (can't exit), then I think

rcu_dereference(task->real_parent)

is fine, we know that ->real_parent did not pass exit_notif() yet.

However, task_is_descendant() looks unnecessarily complicated, it could be

static int task_is_descendant(struct task_struct *parent,
struct task_struct *child)
{
int rc = 0;
struct task_struct *walker;

if (!parent || !child)
return 0;

rcu_read_lock();
for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
if (same_thread_group(parent, walker)) {
rc = 1;
break;
}
rcu_read_unlock();

return rc;
}

And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?

Oleg.


Tetsuo Handa

unread,
Oct 22, 2018, 6:07:00 AM10/22/18
to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
OK.

>
> However, task_is_descendant() looks unnecessarily complicated, it could be
>
> static int task_is_descendant(struct task_struct *parent,
> struct task_struct *child)
> {
> int rc = 0;
> struct task_struct *walker;
>
> if (!parent || !child)
> return 0;
>
> rcu_read_lock();
> for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
> if (same_thread_group(parent, walker)) {
> rc = 1;
> break;
> }
> rcu_read_unlock();
>
> return rc;
> }
>
> And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
> task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?

Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
released. Do we still need pid_alive(child) ?

Oleg Nesterov

unread,
Oct 22, 2018, 9:46:38 AM10/22/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/22, Tetsuo Handa wrote:
>
> > However, task_is_descendant() looks unnecessarily complicated, it could be
> >
> > static int task_is_descendant(struct task_struct *parent,
> > struct task_struct *child)
> > {
> > int rc = 0;
> > struct task_struct *walker;
> >
> > if (!parent || !child)
> > return 0;
> >
> > rcu_read_lock();
> > for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent))
> > if (same_thread_group(parent, walker)) {
> > rc = 1;
> > break;
> > }
> > rcu_read_unlock();
> >
> > return rc;
> > }
> >
> > And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
> > task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?
>
> Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
> released. Do we still need pid_alive(child) ?

get_task_struct(child) can only ensure that this task_struct can't be freed.

Suppose that this child exits after get_task_struct(), then its real_parent exits
too and calls call_rcu(delayed_put_task_struct).

Now, when task_is_descendant() is called, rcu_read_lock() can happen after rcu gp,
iow child->parent can be already freed/reused/unmapped.

We need to ensure that child is still protected by RCU.

Oleg.

Tetsuo Handa

unread,
Oct 24, 2018, 10:15:53 PM10/24/18
to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
Oleg Nesterov wrote:
> On 10/22, Tetsuo Handa wrote:
> > > And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
> > > task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?
> >
> > Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
> > released. Do we still need pid_alive(child) ?
>
> get_task_struct(child) can only ensure that this task_struct can't be freed.

The report says that it is a use-after-free read at

walker = rcu_dereference(walker->real_parent);

which means that walker was already released.

>
> Suppose that this child exits after get_task_struct(), then its real_parent exits
> too and calls call_rcu(delayed_put_task_struct).
>
> Now, when task_is_descendant() is called, rcu_read_lock() can happen after rcu gp,
> iow child->parent can be already freed/reused/unmapped.
>
> We need to ensure that child is still protected by RCU.

I wonder whether pid_alive() test helps.

We can get

[ 40.620318] parent or walker is dead.
[ 40.624146] tracee is dead.

messages using below patch and reproducer.

----------
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99cfddd..0d9d786 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -385,6 +385,7 @@ static int ptrace_attach(struct task_struct *task, long request,
if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
goto out;

+ schedule_timeout_killable(HZ);
task_lock(task);
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
task_unlock(task);
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index ffda91a..a231ec6 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -283,6 +283,11 @@ static int task_is_descendant(struct task_struct *parent,
return 0;

rcu_read_lock();
+ if (!pid_alive(parent) || !pid_alive(walker)) {
+ rcu_read_unlock();
+ printk("parent or walker is dead.\n");
+ return 0;
+ }
if (!thread_group_leader(parent))
parent = rcu_dereference(parent->group_leader);
while (walker->pid > 0) {
@@ -315,6 +320,10 @@ static int ptracer_exception_found(struct task_struct *tracer,
bool found = false;

rcu_read_lock();
+ if (!pid_alive(tracee)) {
+ printk("tracee is dead.\n");
+ goto unlock;
+ }

/*
* If there's already an active tracing relationship, then make an
----------

----------
#include <unistd.h>
#include <sys/ptrace.h>
#include <poll.h>

int main(int argc, char *argv[])
{
if (fork() == 0) {
ptrace(PTRACE_ATTACH, getppid(), NULL, NULL);
_exit(0);
}
poll(NULL, 0, 100);
return 0;
}
----------

But since "child" has at least one reference, reading "child->real_parent" should
be safe. Therefore, I think that bailing out due to pid_is_alive(child) == false
(like above patch does) cannot avoid this problem...

Kees Cook

unread,
Oct 25, 2018, 4:19:49 AM10/25/18
to Oleg Nesterov, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
task_is_descendant() is called under rcu_read_lock() in both
ptracer_exception_found() and yama_ptrace_access_check() so I don't
understand how any of the tasks could get freed? This is walking
group_leader and real_parent -- are these not stable under rcu_lock()?

--
Kees Cook

Oleg Nesterov

unread,
Oct 25, 2018, 7:14:02 AM10/25/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/25, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > On 10/22, Tetsuo Handa wrote:
> > > > And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
> > > > task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?
> > >
> > > Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
> > > released. Do we still need pid_alive(child) ?
> >
> > get_task_struct(child) can only ensure that this task_struct can't be freed.
>
> The report says that it is a use-after-free read at
>
> walker = rcu_dereference(walker->real_parent);
>
> which means that walker was already released.

quite possibly I missed something, but I am not sure I understand your concerns...

So again, suppose that "child" is already dead. Its task_struct can't be freed,
but child->real_parent can point to the already freed memory.

This means that the 1st walker = rcu_dereference(walker->real_parent) is fine,
this simply reads the child->real_parent pointer, but on the second iteration

walker = rcu_dereference(walker->real_parent);

reads the alredy freed memory.

> I wonder whether pid_alive() test helps.
>
> We can get
>
> [ 40.620318] parent or walker is dead.
> [ 40.624146] tracee is dead.
>
> messages using below patch and reproducer.

again, I do not understand, this all looks correct...

> ----------
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 99cfddd..0d9d786 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -385,6 +385,7 @@ static int ptrace_attach(struct task_struct *task, long request,
> if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
> goto out;
>
> + schedule_timeout_killable(HZ);
> task_lock(task);
> retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> task_unlock(task);
> diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
> index ffda91a..a231ec6 100644
> --- a/security/yama/yama_lsm.c
> +++ b/security/yama/yama_lsm.c
> @@ -283,6 +283,11 @@ static int task_is_descendant(struct task_struct *parent,
> return 0;
>
> rcu_read_lock();
> + if (!pid_alive(parent) || !pid_alive(walker)) {
> + rcu_read_unlock();
> + printk("parent or walker is dead.\n");

This is what we need to do, except I think we should change yama_ptrace_access_check().
And iiuc parent == current, pid_alive(parent) looks unnecessary. Although we need to
check ptracer_exception_found(), may be it needs some changes too.

And yes, task_is_descendant() can hit the dead child, if nothing else it can
be killed. This can explain the kasan report.

> @@ -315,6 +320,10 @@ static int ptracer_exception_found(struct task_struct *tracer,
> bool found = false;
>
> rcu_read_lock();
> + if (!pid_alive(tracee)) {
> + printk("tracee is dead.\n");
> + goto unlock;

Sure, this is possible too.

> But since "child" has at least one reference, reading "child->real_parent" should
> be safe. Therefore, I think that bailing out due to pid_is_alive(child) == false
> (like above patch does) cannot avoid this problem...

Why?

OK. Lets ignore ptracer_exception_found() for the moment. Why do you think the
patch below can't help?

Oleg.

--- x/security/yama/yama_lsm.c
+++ x/security/yama/yama_lsm.c
@@ -368,7 +368,8 @@ static int yama_ptrace_access_check(stru
break;
case YAMA_SCOPE_RELATIONAL:
rcu_read_lock();
- if (!task_is_descendant(current, child) &&
+ if (!pid_alive(child) ||
+ !task_is_descendant(current, child) &&
!ptracer_exception_found(current, child) &&
!ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
rc = -EPERM;

Kees Cook

unread,
Oct 25, 2018, 7:36:15 AM10/25/18
to Oleg Nesterov, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
On Thu, Oct 25, 2018 at 12:13 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> So again, suppose that "child" is already dead. Its task_struct can't be freed,
> but child->real_parent can point to the already freed memory.

I can't find a path for "child" to be released. I see task_lock()
always called on it before it ends up in Yama.

(Well, I can't find the lock for switch_mm(), but I assume that's safe
because it's switching to the task.)

> This means that the 1st walker = rcu_dereference(walker->real_parent) is fine,
> this simply reads the child->real_parent pointer, but on the second iteration
>
> walker = rcu_dereference(walker->real_parent);
>
> reads the alredy freed memory.

What does rcu_read_lock() protect actually protect here? I thought
none of the task structs would be freed until after all
rcu_read_unlock() finished.

> OK. Lets ignore ptracer_exception_found() for the moment. Why do you think the
> patch below can't help?
>
> Oleg.
>
> --- x/security/yama/yama_lsm.c
> +++ x/security/yama/yama_lsm.c
> @@ -368,7 +368,8 @@ static int yama_ptrace_access_check(stru
> break;
> case YAMA_SCOPE_RELATIONAL:
> rcu_read_lock();
> - if (!task_is_descendant(current, child) &&
> + if (!pid_alive(child) ||
> + !task_is_descendant(current, child) &&
> !ptracer_exception_found(current, child) &&
> !ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
> rc = -EPERM;
>

Hm, documentation there says:
* If pid_alive fails, then pointers within the task structure
* can be stale and must not be dereferenced.

What is the safe pattern for checking vs rcu?

-Kees

--
Kees Cook

Tetsuo Handa

unread,
Oct 25, 2018, 7:47:52 AM10/25/18
to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 2018/10/25 20:13, Oleg Nesterov wrote:
> On 10/25, Tetsuo Handa wrote:
>>
>> Oleg Nesterov wrote:
>>> On 10/22, Tetsuo Handa wrote:
>>>>> And again, I do not know how/if yama ensures that child is rcu-protected, perhaps
>>>>> task_is_descendant() needs to check pid_alive(child) right after rcu_read_lock() ?
>>>>
>>>> Since the caller (ptrace() path) called get_task_struct(child), child itself can't be
>>>> released. Do we still need pid_alive(child) ?
>>>
>>> get_task_struct(child) can only ensure that this task_struct can't be freed.
>>
>> The report says that it is a use-after-free read at
>>
>> walker = rcu_dereference(walker->real_parent);
>>
>> which means that walker was already released.
>
> quite possibly I missed something, but I am not sure I understand your concerns...
>
> So again, suppose that "child" is already dead. Its task_struct can't be freed,
> but child->real_parent can point to the already freed memory.

Yes.

But if child->real_parent is pointing to the already freed memory,
why does pid_alive(child) == true help?

>
> This means that the 1st walker = rcu_dereference(walker->real_parent) is fine,
> this simply reads the child->real_parent pointer,

Yes.

> but on the second iteration
>
> walker = rcu_dereference(walker->real_parent);
>
> reads the alredy freed memory.

Yes.
There are two task_is_descendant() callers, and one of them is not passing current.

>
> And yes, task_is_descendant() can hit the dead child, if nothing else it can
> be killed. This can explain the kasan report.

The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
isn't it?

How can we check that that pointer is pointing to already freed memory? As soon as

walker = rcu_dereference(walker->real_parent);

is executed, task_alive(walker) will try to read from already freed memory...

Oleg Nesterov

unread,
Oct 25, 2018, 7:52:51 AM10/25/18
to Kees Cook, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
On 10/25, Kees Cook wrote:
>
> task_is_descendant() is called under rcu_read_lock() in both
> ptracer_exception_found() and yama_ptrace_access_check() so I don't
> understand how any of the tasks could get freed? This is walking
> group_leader and real_parent -- are these not stable under rcu_lock()?

group_leader/real_parent/etc are no longer rcu-protected after the exiting
child calls release_task() which in particular removes the child from
children/thread_group lists.

OK. Suppose you have an rcu-protected list, and each element also has a
reference counter so you can do something

struct elt {
atomic_t ctr;
struct list_head list;
int pid;
};

rcu_read_lock();
list_for_each_entry(elt, &LIST, list) {
if (elt->pid == 100) {
atomic_inc(&elt->ctr); // get_task_struct()
break;
}
}
rcu_read_unlock();

do_something(elt);

This code is fine. This elt can't be freed, you have a reference. But once
you drop rcu lock you can't trust elt->list.next! So, for example, you can
not do

rcu_read_lock();
list_for_each_entry_continue_rcu(elt, &LIST, list) {
...
}
rcu_read_unlock();

too late, elt.list.next can be already freed, or it can be freed while you
iterate the list.

Another simple example. Suppose you have a global PTR protected by rcu. So
ignoring the necessary rcu_dereference this code is fine:

rcu_read_lock();
if (ptr = PTR)
do_something(ptr);
rcu_read_unlcok();

But this is not:

ptr = PTR;
rcu_read_lock();
if (ptr)
do_something(ptr);
rcu_read_unlock();

basically the same thing...

Oleg.

Oleg Nesterov

unread,
Oct 25, 2018, 8:05:26 AM10/25/18
to Kees Cook, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
On 10/25, Kees Cook wrote:
>
> On Thu, Oct 25, 2018 at 12:13 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> > So again, suppose that "child" is already dead. Its task_struct can't be freed,
> > but child->real_parent can point to the already freed memory.
>
> I can't find a path for "child" to be released. I see task_lock()
> always called on it before it ends up in Yama.

Are you saying that yama_ptrace_access_check() is always called under
task_lock(child) ? Yes, it seems so

So what? Say, ptrace_attach() can hit a dead task. It should notice this
and fail later, but security_ptrace_access_check() is called before that.


> > This means that the 1st walker = rcu_dereference(walker->real_parent) is fine,
> > this simply reads the child->real_parent pointer, but on the second iteration
> >
> > walker = rcu_dereference(walker->real_parent);
> >
> > reads the alredy freed memory.
>
> What does rcu_read_lock() protect actually protect here? I thought
> none of the task structs would be freed until after all
> rcu_read_unlock() finished.

See another email I sent you a minute ago.

Oleg.

Oleg Nesterov

unread,
Oct 25, 2018, 8:17:15 AM10/25/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/25, Tetsuo Handa wrote:
>
> On 2018/10/25 20:13, Oleg Nesterov wrote:
> >
> > So again, suppose that "child" is already dead. Its task_struct can't be freed,
> > but child->real_parent can point to the already freed memory.
>
> Yes.
>
> But if child->real_parent is pointing to the already freed memory,
> why does pid_alive(child) == true help?

Hmm. Because pid_alive(child) == true && child->real_parent is freed must not
be possible? As long as we check pid_alive() under rcu_read_lock().

> >> @@ -283,6 +283,11 @@ static int task_is_descendant(struct task_struct *parent,
> >> return 0;
> >>
> >> rcu_read_lock();
> >> + if (!pid_alive(parent) || !pid_alive(walker)) {
> >> + rcu_read_unlock();
> >> + printk("parent or walker is dead.\n");
> >
> > This is what we need to do, except I think we should change yama_ptrace_access_check().
> > And iiuc parent == current, pid_alive(parent) looks unnecessary. Although we need to
> > check ptracer_exception_found(), may be it needs some changes too.
>
> There are two task_is_descendant() callers, and one of them is not passing current.

As I said below, please ignore ptracer_exception_found(), another caller for now,
perhaps it needs some changes too. I even have a vague feeling that I have already
blamed this function some time ago...

> > And yes, task_is_descendant() can hit the dead child, if nothing else it can
> > be killed. This can explain the kasan report.
>
> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
> isn't it?

Yes. and you know, I am all confused. I no longer can understand you :/

> How can we check that that pointer is pointing to already freed memory? As soon as
>
> walker = rcu_dereference(walker->real_parent);
>
> is executed, task_alive(walker) will try to read from already freed memory...

Of course we should not do it this way. The patch I sent doesn't...

Oleg.

Oleg Nesterov

unread,
Oct 25, 2018, 9:01:36 AM10/25/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/25, Oleg Nesterov wrote:
>
> As I said below, please ignore ptracer_exception_found(), another caller for now,
> perhaps it needs some changes too. I even have a vague feeling that I have already
> blamed this function some time ago...

Heh, yes, 3 years ago ;)

https://lore.kernel.org/lkml/20150106184...@redhat.com/

I can't understand my email today, but note that I tried to point out that
task_is_descendant() can dereference the freed mem.

And yes, task_is_descendant() is overcompicated for no reason, afaics.

Oleg.

Tetsuo Handa

unread,
Oct 25, 2018, 9:15:09 AM10/25/18
to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 2018/10/25 21:17, Oleg Nesterov wrote:
>>> And yes, task_is_descendant() can hit the dead child, if nothing else it can
>>> be killed. This can explain the kasan report.
>>
>> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
>> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
>> isn't it?
>
> Yes. and you know, I am all confused. I no longer can understand you :/

Why don't we need to check every time like shown below?
Why checking only once is sufficient?

--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -285,7 +285,7 @@ static int task_is_descendant(struct task_struct *parent,
rcu_read_lock();
if (!thread_group_leader(parent))
parent = rcu_dereference(parent->group_leader);
- while (walker->pid > 0) {
+ while (pid_alive(walker) && walker->pid > 0) {
if (!thread_group_leader(walker))
walker = rcu_dereference(walker->group_leader);
if (walker == parent) {

Oleg Nesterov

unread,
Oct 25, 2018, 11:55:10 AM10/25/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/25, Tetsuo Handa wrote:
>
> On 2018/10/25 21:17, Oleg Nesterov wrote:
> >>> And yes, task_is_descendant() can hit the dead child, if nothing else it can
> >>> be killed. This can explain the kasan report.
> >>
> >> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
> >> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
> >> isn't it?
> >
> > Yes. and you know, I am all confused. I no longer can understand you :/
>
> Why don't we need to check every time like shown below?
> Why checking only once is sufficient?

Why do you think it is not sufficient?

Again, I can be easily wrong, rcu is not simple, but so far I think we need
a single check at the start.

> --- a/security/yama/yama_lsm.c
> +++ b/security/yama/yama_lsm.c
> @@ -285,7 +285,7 @@ static int task_is_descendant(struct task_struct *parent,
> rcu_read_lock();
> if (!thread_group_leader(parent))
> parent = rcu_dereference(parent->group_leader);
> - while (walker->pid > 0) {
> + while (pid_alive(walker) && walker->pid > 0) {

OK. To simplify, ets suppose that task_is_descendant() is called with tasklist
lock held. And lets suppose that all tasks are single-threaded.

Then we obviously need a single check at the start, we need to ensure that the
child was not removed from its ->real_parent->children list. The latter means
that if ->real_parent exits, the child will be re-parented and its ->real_parent
will be updated.

So we could do

read_lock(tasklist);

if (list_empty(child->sibling))
// it is dead, removed from ->children list, we can't trust
// child->real_parent
return -EWHATEVER;

task_is_descendant(current, child);

But note that we can safely use pid_alive(child) instead, detach_pid() and
list_del_init(&p->sibling) happen "at the same time" since we hold tasklist.

(And btw, I suggested several times to rename it, or add another helper with
a better name. Note also that we could check, say, ->sighand != NULL with
the same effect.)


Now. Why do you think rcu_read_lock() differs in that we need to check
pid_alive() at every step?

Suppose that one of the grand parents exits, and it is going to be freed. Again,
to (over)simplify the things, lets suppose that release_task() does

synchronize_rcu();
free_task(p);

at the end. Now, can

rcu_read_lock();
if (pid_alive(child)) {
while (child->pid)
child = child->real_parent;
}
rcu_read_unlock();

hit the already freed ->real_parent ? Say, the freed child->real_parent->real_parent.

Lets denote P1 = child->real_parent, P2 = P1->real_parent. Can P2 be already freed?

This is only possible if synchronize_rcu() above was called before rcu_read_lock(),
see the last sentence below.

If P1->real_parent is still P2, then P1 has already exited too. And we still observe
that child->real_parent == P1, this too is only possible if child has exited, so we
must see pid_alive() == F.

Why must we see pid_alive() == F without tasklist? It must be true, release_task()
is serialized by tasklist_lock, but why we can't get the stale value under
rcu_read_lock() ?

Because our rcu read-lock critical section extends beyond the return from
synchronize_rcu(), and thus we must have a full memory barrier _between_
that synchronize_rcu() and our rcu_read_lock(). We must see all memory updates,
including thread_pid = NULL which makes pid_alive() == F.

Do you see any hole?

Oleg.

Oleg Nesterov

unread,
Oct 25, 2018, 12:26:00 PM10/25/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/25, Oleg Nesterov wrote:
>
> Because our rcu read-lock critical section extends beyond the return from
> synchronize_rcu(), and thus we must have a full memory barrier _between_
> that synchronize_rcu() and our rcu_read_lock(). We must see all memory updates,
> including thread_pid = NULL which makes pid_alive() == F.

In case I was not clear....

Suppose we have int X = 0. If some CPU does

X = 1;
synchronize_rcu();

and another CPU does

rcu_read_lock();
x = X;
rcu_read_unlock();

then x should be == 1 in case when rcu_read_unlock() happens _after_ return
from synchronize_rcu().

Oleg.

Tetsuo Handa

unread,
Oct 26, 2018, 8:24:03 AM10/26/18
to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 2018/10/26 0:55, Oleg Nesterov wrote:
> On 10/25, Tetsuo Handa wrote:
>>
>> On 2018/10/25 21:17, Oleg Nesterov wrote:
>>>>> And yes, task_is_descendant() can hit the dead child, if nothing else it can
>>>>> be killed. This can explain the kasan report.
>>>>
>>>> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
>>>> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
>>>> isn't it?
>>>
>>> Yes. and you know, I am all confused. I no longer can understand you :/
>>
>> Why don't we need to check every time like shown below?
>> Why checking only once is sufficient?
>
> Why do you think it is not sufficient?
>
> Again, I can be easily wrong, rcu is not simple, but so far I think we need
> a single check at the start.
>

Hmm, this report is difficult to guess what happened.

Since the "child" passed to task_is_descendant() has at least one reference
count taken by find_get_task_by_vpid(), rcu_dereference(walker->real_parent)
in the first iteration

while (child->pid > 0) {
if (!thread_group_leader(child))
walker = rcu_dereference(child->group_leader);
if (walker == parent) {
rc = 1;
break;
}
walker = rcu_dereference(walker->real_parent);
}

must not trigger use-after-free bug. Thus, when this use-after-free was
detected at rcu_dereference(walker->real_parent), the memory pointed by
"walker" must have been released between

while (walker->pid > 0) {
if (!thread_group_leader(walker))
walker = rcu_dereference(walker->group_leader);

and

walker = rcu_dereference(walker->real_parent);
}

because otherwise use-after-free would have been reported at walker->pid
or thread_group_leader(walker) or rcu_dereference(walker->group_leader).

Is my understanding correct?



Then, what pid_alive(child) is testing? It is not memory pointed by "child" but
memory pointed by "walker" (i.e. parent of "child" or parent of parent of "child"
or ... ) which is triggering use-after-free.

Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
when p2 tried to attach on p1, p2->real_parent was pointing to already
(or about to be) freed p1.

Even if pid_alive(p2) test can guarantee that p1 won't be released,
how can pid_alive(p3) test guarantee that p1 won't be released?
p1 can be released any moment because it has already waited for RCU
grace period, can't it?


ptrace(PTRACE_ATTACH, vpid_of_p2) {
p2 = find_get_task_by_vpid(vpid_of_p2);
ptrace_attach(p2, PTRACE_ATTACH, addr, data) {
mutex_lock_interruptible(&p2->signal->cred_guard_mutex);
// p1 starts exit()ing here.
task_lock(p2);
__ptrace_may_access(p2) {
// p2->real_parent starts pointing to already freed p1.
security_ptrace_access_check(p2, PTRACE_MODE_ATTACH) {
yama_ptrace_access_check() {
task_is_descendant(current, p2) {
walker = p2;
rcu_read_lock();
if (pid_alive(p2)) { // If true
if (p2->pid > 0) { // will be true
p1 = rcu_dereference(p2->real_parent); // might be OK due to pid_alive(p2) == true?
}
}
rcu_read_unlock();
}
}
}
}
task_unlock(p2);
mutex_unlock(&p2->signal->cred_guard_mutex);
}
put_task_struct(p2);
}

ptrace(PTRACE_ATTACH, vpid_of_p3) {
p3 = find_get_task_by_vpid(vpid_of_p3);
ptrace_attach(p3, PTRACE_ATTACH, addr, data) {
mutex_lock_interruptible(&p3->signal->cred_guard_mutex);
// p1 starts exit()ing here.
task_lock(p3);
__ptrace_may_access(p3) {
// p2->real_parent starts pointing to already freed p1.
security_ptrace_access_check(p3, PTRACE_MODE_ATTACH) {
yama_ptrace_access_check() {
task_is_descendant(current, p3) {
walker = p3;
rcu_read_lock();
if (pid_alive(p3)) { // If true
if (p3->pid > 0) { // will be true
p2 = rcu_dereference(p3->real_parent); // will be OK if above assumption is OK.
if (p2->pid > 0) { // will be true
p1 = rcu_dereference(p2->real_parent); // will read already (or about to be) freed p1 address
if (p1->pid > 0) { // Oops here or
if (!thread_group_leader(p1)) // oops here or
p1 = rcu_dereference(p1->group_leader); // oops here or
p0 = rcu_dereference(p1->real_parent); // oops here, or not oops because releasing after this
}
}
}
}
rcu_read_unlock();
}
}
}
}
task_unlock(p3);
mutex_unlock(&p3->signal->cred_guard_mutex);
}
put_task_struct(p3);
}

Oleg Nesterov

unread,
Oct 26, 2018, 9:04:47 AM10/26/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/26, Tetsuo Handa wrote:
>
> Since the "child" passed to task_is_descendant() has at least one reference
> count taken by find_get_task_by_vpid(), rcu_dereference(walker->real_parent)
> in the first iteration
>
> while (child->pid > 0) {
> if (!thread_group_leader(child))
> walker = rcu_dereference(child->group_leader);
> if (walker == parent) {
> rc = 1;
> break;
> }
> walker = rcu_dereference(walker->real_parent);
> }
>
> must not trigger use-after-free bug.

Yes,

> Thus, when this use-after-free was
> detected at rcu_dereference(walker->real_parent), the memory pointed by
> "walker" must have been released between
>
> while (walker->pid > 0) {
> if (!thread_group_leader(walker))
> walker = rcu_dereference(walker->group_leader);
>
> and
>
> walker = rcu_dereference(walker->real_parent);
> }

this is possible too, rcu callback which frees the task_struct can run
int between

> because otherwise use-after-free would have been reported at walker->pid
> or thread_group_leader(walker) or rcu_dereference(walker->group_leader).

Well, I do not know how much kasan is "percise", but if it is "perfect"
then you are right,

> Then, what pid_alive(child) is testing?

I tried to explain this in my previous email. I even mentioned that we could
do another check, say, ->sighand != NULL, or list_empty(child->sibling) with
the same effect.

> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
> when p2 tried to attach on p1, p2->real_parent was pointing to already
> (or about to be) freed p1.

No, p2->real_parent will be updated. If p1 exits it will re-parent its
children including p2.

Again, did you read my previous email?

Let me repeat, of course I can be wrong. But so far I am answering the
same questions again and again...

Oleg.

Tetsuo Handa

unread,
Oct 26, 2018, 9:51:59 AM10/26/18
to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 2018/10/26 22:04, Oleg Nesterov wrote:
>> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
>> when p2 tried to attach on p1, p2->real_parent was pointing to already
>> (or about to be) freed p1.
>
> No, p2->real_parent will be updated. If p1 exits it will re-parent its
> children including p2.

My error.

Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
when someone tried to attach on p2, p2->real_parent was pointing to already
(or about to be) freed p1.

So, the puzzle part is why p2->real_parent was still pointing p1 even after
p1 was freed...

>
> Again, did you read my previous email?

Yes. But I still can't be convinced that pid_alive() test helps.

Oleg Nesterov

unread,
Oct 26, 2018, 10:39:43 AM10/26/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/26, Tetsuo Handa wrote:
>
> On 2018/10/26 22:04, Oleg Nesterov wrote:
> >> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
> >> when p2 tried to attach on p1, p2->real_parent was pointing to already
> >> (or about to be) freed p1.
> >
> > No, p2->real_parent will be updated. If p1 exits it will re-parent its
> > children including p2.
>
> My error.
>
> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
> when someone tried to attach on p2, p2->real_parent was pointing to already
> (or about to be) freed p1.

I don't see a difference.

If p1 exits it will re-parent p2, p2->real_parent will be updated.

> So, the puzzle part is why p2->real_parent was still pointing p1 even after
> p1 was freed...

I don't understand the question.

Once again. TASK->real_parent can point to the freed mem only if a) TASK exits,
and b) _after_ that its parent TASK->real_parent exits too.

> > Again, did you read my previous email?
>
> Yes. But I still can't be convinced that pid_alive() test helps.

Well, I don't understand which part of my explanations is not clear to you.

OR. Perhaps I am wrong and do not understand your concerns.

Oleg.

Tetsuo Handa

unread,
Oct 26, 2018, 11:04:44 AM10/26/18
to Oleg Nesterov, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 2018/10/26 23:39, Oleg Nesterov wrote:
> On 10/26, Tetsuo Handa wrote:
>> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
>> when someone tried to attach on p2, p2->real_parent was pointing to already
>> (or about to be) freed p1.
>
> I don't see a difference.
>
> If p1 exits it will re-parent p2, p2->real_parent will be updated.
>
>> So, the puzzle part is why p2->real_parent was still pointing p1 even after
>> p1 was freed...
>
> I don't understand the question.
>
> Once again. TASK->real_parent can point to the freed mem only if a) TASK exits,
> and b) _after_ that its parent TASK->real_parent exits too.

Oh, p2 exited and then p1 also exited when someone tried to attach on p2.
Then, p2->real_parent can point to already (or about to be) freed p1.

>
>>> Again, did you read my previous email?
>>
>> Yes. But I still can't be convinced that pid_alive() test helps.
>
> Well, I don't understand which part of my explanations is not clear to you.

OK. Checking pid_alive() should help.

(By the way, if p->real_parent were updated to point to init_task when p exits,
we could omit pid_alive() check?)

Oleg Nesterov

unread,
Oct 26, 2018, 11:22:14 AM10/26/18
to Tetsuo Handa, se...@hallyn.com, syzbot, jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, syzkall...@googlegroups.com
On 10/27, Tetsuo Handa wrote:
>
> On 2018/10/26 23:39, Oleg Nesterov wrote:
> > On 10/26, Tetsuo Handa wrote:
> >> Suppose p1 == p2->real_parent and p2 == p3->real_parent, and p1 exited
> >> when someone tried to attach on p2, p2->real_parent was pointing to already
> >> (or about to be) freed p1.
> >
> > I don't see a difference.
> >
> > If p1 exits it will re-parent p2, p2->real_parent will be updated.
> >
> >> So, the puzzle part is why p2->real_parent was still pointing p1 even after
> >> p1 was freed...
> >
> > I don't understand the question.
> >
> > Once again. TASK->real_parent can point to the freed mem only if a) TASK exits,
> > and b) _after_ that its parent TASK->real_parent exits too.
>
> Oh, p2 exited and then p1 also exited when someone tried to attach on p2.
> Then, p2->real_parent can point to already (or about to be) freed p1.

Then we must see pid_alive(p2) == F as I already explained you in my
yesterday's email.

Because if p1 exits _after_ p2, then pid_alive(p2) == F must be already
completed, p1 can't exit (I mean, release_task(p1) can't be called) until
release_task(p2) does detach_pid() and drops tasklist_lock.

> (By the way, if p->real_parent were updated to point to init_task when p exits,
> we could omit pid_alive() check?)

Sorry, I don't understand the question...

Ignoring the PR_SET_CHILD_SUBREAPER parents, ignoring the exiting sub-threads
which reparent to group leader, ->real_parent is alwasy updated to point to
init_task. But I don't see why we could omit pid_alive() check.

Oleg.

Kees Cook

unread,
Oct 26, 2018, 12:09:22 PM10/26/18
to Oleg Nesterov, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
On Thu, Oct 25, 2018 at 2:01 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> On 10/25, Oleg Nesterov wrote:
>> perhaps it needs some changes too. I even have a vague feeling that I have already
>> blamed this function some time ago...
>
> Heh, yes, 3 years ago ;)
>
> https://lore.kernel.org/lkml/20150106184...@redhat.com/
>
> I can't understand my email today, but note that I tried to point out that
> task_is_descendant() can dereference the freed mem.

Instead of:

while (walker->pid > 0) {

should it simply be "while (pid_liave(walker)) {"? And add a
pid_alive(parent) after rcu_read_lock()?

> And yes, task_is_descendant() is overcompicated for no reason, afaics.

Yeah, agreed. I'll fix this up. Just to make sure I'm not crazy: the
real_parent of all tasks in a thread group are the same, yes? The
trouble I was trying to deal with for the complication was where a
non-leader thread would add an exception to the checking, and the
tasks wouldn't match. (As far as I can see, though, using
same_thread_group() should fix it.)

-Kees

--
Kees Cook

Oleg Nesterov

unread,
Oct 29, 2018, 8:24:02 AM10/29/18
to Kees Cook, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
On 10/26, Kees Cook wrote:
>
> On Thu, Oct 25, 2018 at 2:01 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> > On 10/25, Oleg Nesterov wrote:
> >> perhaps it needs some changes too. I even have a vague feeling that I have already
> >> blamed this function some time ago...
> >
> > Heh, yes, 3 years ago ;)
> >
> > https://lore.kernel.org/lkml/20150106184...@redhat.com/
> >
> > I can't understand my email today, but note that I tried to point out that
> > task_is_descendant() can dereference the freed mem.
>
> Instead of:
>
> while (walker->pid > 0) {
>
> should it simply be "while (pid_liave(walker)) {"?

No, this would be wrong. Probably walker->pid > 0 is not the best check,
but we do not need to change it for correctness.

> And add a
> pid_alive(parent) after rcu_read_lock()?

So you too do not read my emails ;)

I still think we need a single pid_alive() check and I even sent the patch.
Attached again at the end.

To clarify, let me repeat that ptracer_exception_found() may need some fixes
too, right now I am only talking about task_is_descendant().

> > And yes, task_is_descendant() is overcompicated for no reason, afaics.
>
> Yeah, agreed. I'll fix this up.

I have already posted this code, this is what I think it should do.


static int task_is_descendant(struct task_struct *parent,
struct task_struct *child)
{
struct task_struct *walker;

for (walker = child; walker->pid; walker = rcu_dereference(walker->real_parent)) {
if (same_thread_group(parent, walker))
return 1;
}

return 0;
}

This version differs in that I removed the parent/child != NULL at the start
and rcu_read_lock(), it should be held by the caller anyway.

> Just to make sure I'm not crazy: the
> real_parent of all tasks in a thread group are the same, yes?

Well, yes and no. So if

same_thread_group(t1, t2) == T

then
same_thread_group(t1->real_parent, t2->real_parent) == T

which means that real_parent of all tasks in a thread group is the same
_process_.

But t1->real_parent and t2->real_parent are not necessarily the same task.

Oleg Nesterov

unread,
Oct 29, 2018, 11:05:53 AM10/29/18
to Kees Cook, Tetsuo Handa, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
let me change the subject to avoid the confusion with the already confusing
disccussion about task_is_descendant().

On 10/29, Oleg Nesterov wrote:
>
> I still think we need a single pid_alive() check and I even sent the patch.
> Attached again at the end.
>
> To clarify, let me repeat that ptracer_exception_found() may need some fixes
> too, right now I am only talking about task_is_descendant().

so yes, the ptracer_relations code looks very broken to me, but perhaps I
misread this code, please correct me.

RCU can only protect the ptracer_relations list itself, you can do nothing
with (say) relation->tracer. relation->tracer can be already freed when
ptracer_exception_found() checks relation->tracee == tracee.

Not only pid_alive(parent) can not help in this case, pid_alive(parent) is
equally unsafe because, again, this memory can be freed.

security_task_free(tsk) is called right before free_task(tsk), there is no
a gp pass in between, and of course we can't rely on the ->invalid check.

_At first glance_ we can fix this if we simply turn both ->tracer/tracee
pointers into "signal_struct *", then we can turn all same_thread_group()'s
into walker->signal == parent which doesn't need to dereference the possibly-
freed parent. This also allows to remove all thread_group_leader() checks.
We need to ensure that false-positive is not possible (if, say, ->tracer
was already re-allocated and points to another task->signal), but this
doesn't look difficult.

Oleg.

syzbot

unread,
Nov 9, 2018, 10:25:04 PM11/9/18
to jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, ol...@redhat.com, penguin...@i-love.sakura.ne.jp, penguin...@i-love.sakura.ne.jp, se...@hallyn.com, syzkall...@googlegroups.com
syzbot has found a reproducer for the following crash on:

HEAD commit: 3541833fd1f2 Merge tag 's390-4.20-2' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16d3cad5400000
kernel config: https://syzkaller.appspot.com/x/.config?x=8f559fee2fc3375a
dashboard link: https://syzkaller.appspot.com/bug?extid=a9ac39bf55329e206219
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1494bb0b400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a9ac39...@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_UP): veth1: link is not ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:182
[inline]
BUG: KASAN: use-after-free in task_is_descendant.part.3+0x610/0x670
security/yama/yama_lsm.c:295
Read of size 8 at addr ffff8801c46f24e0 by task syz-executor2/9973

CPU: 0 PID: 9973 Comm: syz-executor2 Not tainted 4.20.0-rc1+ #107
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x244/0x39d lib/dump_stack.c:113
print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
__read_once_size include/linux/compiler.h:182 [inline]
task_is_descendant.part.3+0x610/0x670 security/yama/yama_lsm.c:295
task_is_descendant security/yama/yama_lsm.c:282 [inline]
yama_ptrace_access_check+0x215/0x10fc security/yama/yama_lsm.c:371
security_ptrace_access_check+0x54/0xb0 security/security.c:271
__ptrace_may_access+0x5c8/0x980 kernel/ptrace.c:336
ptrace_attach+0x1fa/0x640 kernel/ptrace.c:389
__do_sys_ptrace kernel/ptrace.c:1139 [inline]
__se_sys_ptrace kernel/ptrace.c:1119 [inline]
__x64_sys_ptrace+0x229/0x260 kernel/ptrace.c:1119
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569
Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f2ed4dbfc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000065
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000000457569
RDX: 0000000000000000 RSI: 000000000000021d RDI: 0000000000004206
RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2ed4dc06d4
R13: 00000000004c33bd R14: 00000000004d50e0 R15: 00000000ffffffff

Allocated by task 5873:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
alloc_task_struct_node kernel/fork.c:158 [inline]
dup_task_struct kernel/fork.c:843 [inline]
copy_process+0x2026/0x87a0 kernel/fork.c:1751
_do_fork+0x1cb/0x11d0 kernel/fork.c:2216
__do_sys_clone kernel/fork.c:2323 [inline]
__se_sys_clone kernel/fork.c:2317 [inline]
__x64_sys_clone+0xbf/0x150 kernel/fork.c:2317
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kmem_cache_free+0x83/0x290 mm/slab.c:3760
free_task_struct kernel/fork.c:163 [inline]
free_task+0x16e/0x1f0 kernel/fork.c:457
__put_task_struct+0x2e6/0x620 kernel/fork.c:730
put_task_struct include/linux/sched/task.h:96 [inline]
delayed_put_task_struct+0x2ff/0x4c0 kernel/exit.c:181
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2437 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
__do_softirq+0x308/0xb7e kernel/softirq.c:292

The buggy address belongs to the object at ffff8801c46f2000
which belongs to the cache task_struct(65:syz2) of size 6080
The buggy address is located 1248 bytes inside of
6080-byte region [ffff8801c46f2000, ffff8801c46f37c0)
The buggy address belongs to the page:
page:ffffea000711bc80 count:1 mapcount:0 mapping:ffff8801c204fc80 index:0x0
compound_mapcount: 0
flags: 0x2fffc0000010200(slab|head)
raw: 02fffc0000010200 ffffea00073cf808 ffffea000697c908 ffff8801c204fc80
raw: 0000000000000000 ffff8801c46f2000 0000000100000001 ffff8801d8404a80
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff8801d8404a80

Memory state around the buggy address:
ffff8801c46f2380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c46f2400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801c46f2480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801c46f2500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c46f2580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

syzbot

unread,
Nov 10, 2018, 6:46:05 AM11/10/18
to jmo...@namei.org, kees...@chromium.org, linux-...@vger.kernel.org, linux-secu...@vger.kernel.org, ol...@redhat.com, penguin...@i-love.sakura.ne.jp, penguin...@i-love.sakura.ne.jp, se...@hallyn.com, syzkall...@googlegroups.com
syzbot has found a reproducer for the following crash on:

HEAD commit: aa4330e15c26 Merge tag 'devicetree-fixes-for-4.20-2' of gi..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17abaa47400000
kernel config: https://syzkaller.appspot.com/x/.config?x=8f559fee2fc3375a
dashboard link: https://syzkaller.appspot.com/bug?extid=a9ac39bf55329e206219
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
userspace arch: i386
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=108ae6d5400000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=17bd86a3400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a9ac39...@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:182
[inline]
BUG: KASAN: use-after-free in task_is_descendant.part.3+0x610/0x670
security/yama/yama_lsm.c:295
Read of size 8 at addr ffff8801ae718560 by task syz-executor198/4607

CPU: 1 PID: 4607 Comm: syz-executor198 Not tainted 4.20.0-rc1+ #232
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x244/0x39d lib/dump_stack.c:113
print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
__read_once_size include/linux/compiler.h:182 [inline]
task_is_descendant.part.3+0x610/0x670 security/yama/yama_lsm.c:295
task_is_descendant security/yama/yama_lsm.c:282 [inline]
yama_ptrace_access_check+0x215/0x10fc security/yama/yama_lsm.c:371
security_ptrace_access_check+0x54/0xb0 security/security.c:271
__ptrace_may_access+0x5c8/0x980 kernel/ptrace.c:336
ptrace_attach+0x1fa/0x640 kernel/ptrace.c:389
__do_compat_sys_ptrace kernel/ptrace.c:1284 [inline]
__se_compat_sys_ptrace kernel/ptrace.c:1266 [inline]
__ia32_compat_sys_ptrace+0x1d2/0x260 kernel/ptrace.c:1266
do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
do_fast_syscall_32+0x34d/0xfb2 arch/x86/entry/common.c:397
entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7feba29
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000f7fe71ec EFLAGS: 00000296 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 0000000000004206 RCX: 00000000000014b5
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 5668:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
alloc_task_struct_node kernel/fork.c:158 [inline]
dup_task_struct kernel/fork.c:843 [inline]
copy_process+0x2026/0x87a0 kernel/fork.c:1751
_do_fork+0x1cb/0x11d0 kernel/fork.c:2216
__do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
__se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
__ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
do_fast_syscall_32+0x34d/0xfb2 arch/x86/entry/common.c:397
entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

Freed by task 16:
save_stack+0x43/0xd0 mm/kasan/kasan.c:448
set_track mm/kasan/kasan.c:460 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
__cache_free mm/slab.c:3498 [inline]
kmem_cache_free+0x83/0x290 mm/slab.c:3760
free_task_struct kernel/fork.c:163 [inline]
free_task+0x16e/0x1f0 kernel/fork.c:457
__put_task_struct+0x2e6/0x620 kernel/fork.c:730
put_task_struct include/linux/sched/task.h:96 [inline]
delayed_put_task_struct+0x2ff/0x4c0 kernel/exit.c:181
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2437 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
__do_softirq+0x308/0xb7e kernel/softirq.c:292

The buggy address belongs to the object at ffff8801ae718080
which belongs to the cache task_struct of size 6080
The buggy address is located 1248 bytes inside of
6080-byte region [ffff8801ae718080, ffff8801ae719840)
The buggy address belongs to the page:
page:ffffea0006b9c600 count:1 mapcount:0 mapping:ffff8801da970180 index:0x0
compound_mapcount: 0
flags: 0x2fffc0000010200(slab|head)
raw: 02fffc0000010200 ffffea0006b99488 ffffea0006b9c688 ffff8801da970180
raw: 0000000000000000 ffff8801ae718080 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff8801ae718400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801ae718480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801ae718500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801ae718580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801ae718600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Tetsuo Handa

unread,
Jan 10, 2019, 6:05:55 AM1/10/19
to Oleg Nesterov, Kees Cook, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
Hello, Kees.

syzbot is hitting this problem as of linux-next-20190110.
When a patch will be proposed?

Kees Cook

unread,
Jan 10, 2019, 1:48:09 PM1/10/19
to Tetsuo Handa, Oleg Nesterov, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkaller-bugs
On Thu, Jan 10, 2019 at 3:06 AM Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
> syzbot is hitting this problem as of linux-next-20190110.
> When a patch will be proposed?

Hi! Sorry, this got delayed over the holidays. Let me finish the patch
I was working on and get it published.

Thanks!

--
Kees Cook

Oleg Nesterov

unread,
Jan 16, 2019, 12:40:09 PM1/16/19
to Tetsuo Handa, Kees Cook, Serge E. Hallyn, syzbot, James Morris, LKML, linux-security-module, syzkall...@googlegroups.com
On 01/10, Tetsuo Handa wrote:
>
> syzbot is hitting this problem as of linux-next-20190110.
> When a patch will be proposed?

Well. I have already suggested the patch below several times. It won't fix all
problems in this code (I forgot the details but iirc ptracer_exception_found()
is broken too, task_is_descendant() should be cleanuped, may be something else),
but probably this trivial fix makes sense for the start?
Reply all
Reply to author
Forward
0 new messages