Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Re: "run seccomp after ptrace" changes expose "missing PTRACE_EVENT_EXIT" bug

21 views
Skip to first unread message

Robert O'Callahan

unread,
Aug 3, 2016, 8:00:05 PM8/3/16
to
I work on rr (http://rr-project.org/), a record-and-replay
reverse-execution debugger which is a heavy user of ptrace and
seccomp. The recent change to perform syscall-entry PTRACE_SYSCALL
stops before PTRACE_EVENT_SECCOMP stops broke rr, which is fine
because I'm fixing rr and this change actually makes rr faster
(thanks!). However, it exposed an existing kernel bug which creates a
problem for us, and which I'm not sure how to fix.

The problem is that if a tracee task is in a PTRACE_EVENT_SECCOMP
trap, or has been resumed after such a trap but not yet been
scheduled, and another task in the thread-group calls exit_group(),
then the tracee task exits without the ptracer receiving a
PTRACE_EVENT_EXIT notification. Small-ish testcase here:
https://gist.github.com/rocallahan/1344f7d01183c233d08a2c6b93413068.

The bug happens because when __seccomp_filter() detects
fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
that task is descheduled, __schedule() notices that there is a fatal
signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
That prevents the ptracer's waitpid() from returning the ptrace event.
A more detailed analysis is here:
https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.

This bug has been in the kernel for a while. rr never hit it before
because we trace all threads and mostly run only one tracee thread at
a time. Immediately after each PTRACE_EVENT_SECCOMP notification we'd
issue a PTRACE_SYSCALL to get that task to the syscall-entry
PTRACE_SYSCALL stop, so there was never an opportunity for one tracee
thread to call exit_group while another tracee was in the problematic
part of __seccomp_filter(). Unfortunately now there is no way for us
to avoid that possibility.

My guess is that __seccomp_filter() should dequeue the fatal signal it
detects before calling do_exit(), to behave more like get_signal(). Is
that correct, and if so, what would be the right way to do that?

Thanks,
Robert O'Callahan
--
lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf toD
selthor stor edna siewaoeodm or v sstvr esBa kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea lurpr
.a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr esn

Kees Cook

unread,
Aug 4, 2016, 1:30:05 AM8/4/16
to
Thanks for the detailed analysis! I'll take a look at what can be done
here. Off the top of my head, I don't see a problem with what you're
suggesting. Let me see what I can come up with.

-Kees

>
> Thanks,
> Robert O'Callahan
> --
> lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf
> toD
> selthor stor edna siewaoeodm or v sstvr esBa kbvted,t
> rdsme,aoreseoouoto
> o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea lurpr
> .a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr
> esn



--
Kees Cook
Brillo & Chrome OS Security

Kyle Huey

unread,
Aug 10, 2016, 4:00:09 PM8/10/16
to
This fixes rr. It doesn't quite fix the provided testcase, because the testcase fails to wait on the tracee after awakening from the nanosleep. Instead the testcase immediately does a PTHREAD_CONT, discarding the PTHREAD_EVENT_EXIT. The slightly modified testcase at https://gist.github.com/khuey/3c43ac247c72cef8c956c does pass.

I don't see any obvious way to dequeue only the fatal signal, so instead I dequeue them all. Since none of these signals will ever be delivered it shouldn't affect the executing task.

Suggested-by: Robert O'Callahan <rob...@ocallahan.org>
Signed-off-by: Kyle Huey <kh...@kylehuey.com>
---
kernel/seccomp.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ef6c6c3..728074d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -609,8 +609,20 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
* Terminating the task now avoids executing a system
* call that may not be intended.
*/
- if (fatal_signal_pending(current))
+ if (fatal_signal_pending(current)) {
+ /*
+ * Swallow the signals we will never deliver.
+ * If we do not do this, the PTRACE_EVENT_EXIT will
+ * be suppressed by those signals.
+ */
+ siginfo_t info;
+
+ spin_lock_irq(&current->sighand->siglock);
+ while (dequeue_signal(current, &current->blocked, &info));
+ spin_unlock_irq(&current->sighand->siglock);
+
do_exit(SIGSYS);
+ }
/* Check if the tracer forced the syscall to be skipped. */
this_syscall = syscall_get_nr(current, task_pt_regs(current));
if (this_syscall < 0)
--
2.7.4

Kees Cook

unread,
Aug 10, 2016, 5:40:07 PM8/10/16
to
On Wed, Aug 10, 2016 at 12:50 PM, Kyle Huey <m...@kylehuey.com> wrote:
> This fixes rr. It doesn't quite fix the provided testcase, because the testcase fails to wait on the tracee after awakening from the nanosleep. Instead the testcase immediately does a PTHREAD_CONT, discarding the PTHREAD_EVENT_EXIT. The slightly modified testcase at https://gist.github.com/khuey/3c43ac247c72cef8c956c does pass.
>
> I don't see any obvious way to dequeue only the fatal signal, so instead I dequeue them all. Since none of these signals will ever be delivered it shouldn't affect the executing task.
>
> Suggested-by: Robert O'Callahan <rob...@ocallahan.org>
> Signed-off-by: Kyle Huey <kh...@kylehuey.com>

Oh excellent! Thank you for the patch; I've been chasing other things
this week and hadn't had time yet to dig into this. I agree with your
rationale: the signals aren't being delivered anyway, so drop them to
keep ptrace operating as expected.

Oleg, does this pass a quick sanity-check from you? For context, the
original problem description was:


The problem is that if a tracee task is in a PTRACE_EVENT_SECCOMP
trap, or has been resumed after such a trap but not yet been
scheduled, and another task in the thread-group calls exit_group(),
then the tracee task exits without the ptracer receiving a
PTRACE_EVENT_EXIT notification. Small-ish testcase here:
https://gist.github.com/rocallahan/1344f7d01183c233d08a2c6b93413068.

The bug happens because when __seccomp_filter() detects
fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
that task is descheduled, __schedule() notices that there is a fatal
signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
That prevents the ptracer's waitpid() from returning the ptrace event.
A more detailed analysis is here:
https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.


Thanks!

-Kees
--
Kees Cook
Nexus Security

Kees Cook

unread,
Aug 10, 2016, 7:30:05 PM8/10/16
to
I took a closer look at this now (while chatting off-list with Kyle).
I think the dequeue solution isn't right (it's papering over the real
bug). The primary issue is this:

/*
* The delivery of a fatal signal during event
* notification may silently skip tracer notification.
* Terminating the task now avoids executing a system
* call that may not be intended.
*/
if (fatal_signal_pending(current))
do_exit(SIGSYS);

The comment here actually describes very nearly exactly what we're
trying to fix. When a fatal signal is delivered during event
notification, i.e. RET_TRACE happened, seccomp started to try to
notify the tracer, process caught a fatal signal so tracer never
changed syscall state, and then seccomp continues and we want to be
sure that the syscall is not executed because we have reason to
believe that the tracer got bypassed. However, calling do_exit() here
puts us in the unexpected scheduler state. Instead of the more
paranoid do_exit(), we can just do "goto skip;" here to make sure the
syscall is skipped and then let the signals get delivered correctly
(which will kill the process, matching the original goal here).

Testing the "goto skip;" here with your test case shows it passing
again. (And still passes the seccomp regression tests.) I'll send a
new patch...

-Kees
0 new messages