[PATCH 0/2] wait/ptrace: assume __WALL if the child is traced

18 views
Skip to first unread message

Oleg Nesterov

unread,
Mar 15, 2016, 7:34:34 PM3/15/16
to Andrew Morton, Denys Vlasenko, Dmitry Vyukov, Jan Kratochvil, Michael Kerrisk (man-pages), Pedro Alves, Roland McGrath, syzk...@googlegroups.com, linux-...@vger.kernel.org
Resend. And sorry for the huge delay. The patches are the same, I only
updated the changelog a little bit.

The previous discussion was a bit confusing, but iirc/iiuc nobody really
argued with this change. In particular strace/gdb maintainers do not think
it can break something.

To remind, 1 and 2 do not depend on each other. But if we decide to not
fix the kernel, then 2/2 makes much more sense. Most init's use waitid()
which doesn't allow __WALL, so the user-space fix will be more complicated
without this patch.

And just in case let me repeat that I agree, PTRACE_TRACEME is ugly. And
probably it should not succeed after re-parenting (in fact I personally
think PTRACE_TRACEME should not even exist). But imho it is too late to
try change this ancient interface, at least I strongly dislike the idea
to add something like is_global_init() check into ptrace_traceme(). And
any "sane" restriction here can break something too, plus this will
complicate the rules.

Oleg.

Oleg Nesterov

unread,
Mar 15, 2016, 7:34:53 PM3/15/16
to Andrew Morton, Denys Vlasenko, Dmitry Vyukov, Jan Kratochvil, Michael Kerrisk (man-pages), Pedro Alves, Roland McGrath, syzk...@googlegroups.com, linux-...@vger.kernel.org
The following program (simplified version of generated by syzkaller)

#include <pthread.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <stdio.h>
#include <signal.h>

void *thread_func(void *arg)
{
ptrace(PTRACE_TRACEME, 0,0,0);
return 0;
}

int main(void)
{
pthread_t thread;

if (fork())
return 0;

while (getppid() != 1)
;

pthread_create(&thread, NULL, thread_func, NULL);
pthread_join(thread, NULL);
return 0;
}

creates an unreapable zombie if /sbin/init doesn't use __WALL.

This is not a kernel bug, at least in a sense that everything works as
expected: debugger should reap a traced sub-thread before it can reap
the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.

Unfortunately, it seems that /sbin/init in most (all?) distributions
doesn't use it and we have to change the kernel to avoid the problem.
Note also that most init's use sys_waitid() which doesn't allow __WALL,
so the necessary user-space fix is not that trivial.

This patch just adds the "ptrace" check into eligible_child(). To some
degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal
is mostly ignored when the tracee reports to debugger. Or WSTOPPED, the
tracer doesn't need to set this flag to wait for the stopped tracee.

This obviously means the user-visible change: __WCLONE and __WALL no
longer have any meaning for debugger. And I can only hope that this
won't break something, but at least strace/gdb won't suffer.

We could make a more conservative change. Say, we can take __WCLONE
into account, or !thread_group_leader(). But it would be nice to not
complicate these historical/confusing checks.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
Reported-by: Dmitry Vyukov <dvy...@google.com>
Cc: <sta...@vger.kernel.org>
---
kernel/exit.c | 29 ++++++++++++++++++++---------
1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 10e0882..c112abb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -916,17 +916,28 @@ static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
task_pid_type(p, wo->wo_type) == wo->wo_pid;
}

-static int eligible_child(struct wait_opts *wo, struct task_struct *p)
+static int
+eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
{
if (!eligible_pid(wo, p))
return 0;
- /* Wait for all children (clone and not) if __WALL is set;
- * otherwise, wait for clone children *only* if __WCLONE is
- * set; otherwise, wait for non-clone children *only*. (Note:
- * A "clone" child here is one that reports to its parent
- * using a signal other than SIGCHLD.) */
- if (((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE))
- && !(wo->wo_flags & __WALL))
+
+ /*
+ * Wait for all children (clone and not) if __WALL is set or
+ * if it is traced by us.
+ */
+ if (ptrace || (wo->wo_flags & __WALL))
+ return 1;
+
+ /*
+ * Otherwise, wait for clone children *only* if __WCLONE is set;
+ * otherwise, wait for non-clone children *only*.
+ *
+ * Note: a "clone" child here is one that reports to its parent
+ * using a signal other than SIGCHLD, or a non-leader thread which
+ * we can only see if it is traced by us.
+ */
+ if ((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE))
return 0;

return 1;
@@ -1298,7 +1309,7 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
if (unlikely(exit_state == EXIT_DEAD))
return 0;

- ret = eligible_child(wo, p);
+ ret = eligible_child(wo, ptrace, p);
if (!ret)
return ret;

--
2.5.0

Oleg Nesterov

unread,
Mar 15, 2016, 7:34:56 PM3/15/16
to Andrew Morton, Denys Vlasenko, Dmitry Vyukov, Jan Kratochvil, Michael Kerrisk (man-pages), Pedro Alves, Roland McGrath, syzk...@googlegroups.com, linux-...@vger.kernel.org
I see no reason why waitid() can't support other linux-specific flags
allowed in sys_wait4().

In particular this change can help if we reconsider the previous change
which adds the "automagical" __WALL for debugger.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
kernel/exit.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index c112abb..9db1f4a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1533,7 +1533,8 @@ SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
enum pid_type type;
long ret;

- if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED))
+ if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED|
+ __WNOTHREAD|__WCLONE|__WALL))
return -EINVAL;
if (!(options & (WEXITED|WSTOPPED|WCONTINUED)))
return -EINVAL;
--
2.5.0

Reply all
Reply to author
Forward
0 new messages