SIGSEGV when using "perf record -g" with 3.13-rc* kernel

Waiman Long

unread,

Jan 10, 2014, 10:30:02 AM1/10/14

to

Peter,

I recently encountered a strange problem using 3.13-rc* kernel that
did not happen in a 3.12 kernel. When I ran the high_systime benchmark
of the AIM7 test suite, I saw the following errors:

Child terminated by signal #11

core dumped

Child terminated by signal #11

Child process called exit(), status = 139

core dumped

Child terminated by signal #11

This only happened when I monitored the running of the benchmark using
"perf record -g". There was no problem if callchain was not enabled.
Adding debug code to the kernel showed the following stack trace:

Call Trace:
<NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
[<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8105f1f1>] force_sig_info+0x131/0x140
[<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
[<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
[<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
[<ffffffff81042cc9>] no_context+0x159/0x1f0
[<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
[<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
[<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
[<ffffffff81578fc2>] __do_page_fault+0x362/0x480
[<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
[<ffffffff815791be>] do_page_fault+0xe/0x10
[<ffffffff81575962>] page_fault+0x22/0x30
[<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
[<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
[<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
[<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
[<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
[<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
[<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
[<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
[<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
[<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
[<ffffffff8157664a>] nmi_handle+0x8a/0x170
[<ffffffff81576848>] default_do_nmi+0x68/0x210
[<ffffffff81576a80>] do_nmi+0x90/0xe0
[<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
[<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
[<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
[<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
<<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
[<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
[<ffffffff81578fc2>] __do_page_fault+0x362/0x480
[<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
[<ffffffff815791be>] do_page_fault+0xe/0x10
[<ffffffff81575962>] page_fault+0x22/0x30
---[ end trace 037bf09d279751ec ]---

So this is a double page faults. Looking at relevant changes in
3.13 kernel, I spotted the following one patch that modified the
perf_callchain_user() function shown up in the stack trace above:

perf: Fix arch_perf_out_copy_user default

@@ -2041,7 +2041,7 @@ perf_callchain_user(struct perf_callchain_entry
*entry, struct pt_regs *regs)
frame.return_address = 0;

bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
- if (bytes != sizeof(frame))
+ if (bytes != 0)
break;

if (!valid_user_frame(fp, sizeof(frame)))

I wondered if it was the cause of the SIGSEGV. Please let me know your
thought on that.

-Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra

unread,

Jan 10, 2014, 12:00:01 PM1/10/14

to

On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote:
> Peter,

Hurm, that's an expected double fault, not something we should take the
process down for.

I'll have to look at how all that works for a bit.

Peter Zijlstra

unread,

Jan 10, 2014, 12:10:02 PM1/10/14

to

How easily can you reproduce this? Could you test something like the
below, which would allow us to take double faults from NMI context.

---
arch/x86/mm/fault.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9ff85bb8dd69..18c498d4274d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,

/* Are we prepared to handle this kernel fault? */
if (fixup_exception(regs)) {
- if (current_thread_info()->sig_on_uaccess_error && signal) {
+ if (!in_nmi() && current_thread_info()->sig_on_uaccess_error && signal) {
tsk->thread.trap_nr = X86_TRAP_PF;
tsk->thread.error_code = error_code | PF_USER;
tsk->thread.cr2 = address;

Peter Zijlstra

unread,

Jan 10, 2014, 12:50:01 PM1/10/14

to

Andy, introduced all this in 4fc3490114bb ("x86-64: Set siginfo and
context on vsyscall emulation faults").

It looks like your initial userspace fault hit the magic button and ends
up in emulate_vsyscall. Right at that point we trigger a PMI, which
tries to do a stack-trace. That stack-trace also stumbles into unmapped
memory (might be the same) and faults again.

Now at that point, we usually just give up on the callchain and proceed
like normal, however because of this double fault emulate-vsyscall
SIGSEGV magic you loose.

So the below might well be a valid fix.. Anybody? Andy?

Andy Lutomirski

unread,

Jan 10, 2014, 2:00:03 PM1/10/14

to

Yuck -- when I wrote that thing, I hadn't imagined that an interrupt
(there's nothing particularly special about NMIs here, I think) would
try to access user memory. The fix below looks okay, but IMO it needs
a big fat comment explaining what's going on.

Is there a way to ask whether the previous entry into the kernel came
from user space? The valid "sig_on_uaccess_error" case happens when
the current fault was triggered by a fault from userspace. The
invalid case (and any invalid case from, say, an int3 that a
tracepoint stuck in there) would be a page fault triggered by a fault
handler that in turn started in kernel space (in particular, in
emulate_vsyscall).

For that matter, why does current_thread_info() work from an NMI at
all? Does the NMI vector not have its own stack? The call that
installs it is set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK).

In any case, this at least needs a comment. I don't see why this same
bug couldn't be triggered by non-NMI based tracing mechanisms, though.

Sigh, corner cases of corner cases...

>
>> How easily can you reproduce this? Could you test something like the
>> below, which would allow us to take double faults from NMI context.
>>
>> ---
>> arch/x86/mm/fault.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 9ff85bb8dd69..18c498d4274d 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>>
>> /* Are we prepared to handle this kernel fault? */
>> if (fixup_exception(regs)) {
>> - if (current_thread_info()->sig_on_uaccess_error && signal) {
>> + if (!in_nmi() && current_thread_info()->sig_on_uaccess_error && signal) {
>> tsk->thread.trap_nr = X86_TRAP_PF;
>> tsk->thread.error_code = error_code | PF_USER;
>> tsk->thread.cr2 = address;

--Andy

Waiman Long

unread,

Jan 10, 2014, 2:40:02 PM1/10/14

to

The error can be readily reproducible in my current setup.

> ---
> arch/x86/mm/fault.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ff85bb8dd69..18c498d4274d 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>
> /* Are we prepared to handle this kernel fault? */
> if (fixup_exception(regs)) {
> - if (current_thread_info()->sig_on_uaccess_error&& signal) {

> + if (!in_nmi()&& current_thread_info()->sig_on_uaccess_error&& signal) {

> tsk->thread.trap_nr = X86_TRAP_PF;
> tsk->thread.error_code = error_code | PF_USER;
> tsk->thread.cr2 = address;

Yes, this change fixed the error that I got. I no longer see SIGSEGV
when I run the test.

I did tried to back out your "perf: Fix arch_perf_out_copy_user default"
patch, but it didn't fix the problem.

Thank for the quick turnaround!

-Longman

Waiman Long

unread,

Jan 10, 2014, 2:50:02 PM1/10/14

to

The processes that got the SIGSEGV were all running shell scripts. I am
not totally sure that they were running in user space when getting the
PMIs, but are likely the case.

-Longman

Andy Lutomirski

unread,

Jan 10, 2014, 3:00:03 PM1/10/14

to

The error you're seeing is:

- Userspace code calls into the vsyscall page (e.g. old code calling
gettimeofday).
- Calls to the vsyscall page trap, so you end up executing in kernel
space in emulate_vsyscall.
- perf is running, so you end up in an NMI handler that triggers
while sig_on_uaccess_fault is true.
- The perf callchain code tries to read from a bad userspace pointer
(not sure why -- the ip value in the vsyscall page *is* readable).
That traps (as expected), but the trap handler injects SIGSEGV due to
sig_on_uaccess_fault==1.

The cleanest fix I can think of is to change the condition to
sig_on_uaccess_fault==1 && (the get_user caller was a single kernel
entry away from userspace). In the failure you're seeing, there was
an NMI in there.

--Andy

Peter Zijlstra

unread,

Jan 10, 2014, 3:10:01 PM1/10/14

to

On Fri, Jan 10, 2014 at 10:54:52AM -0800, Andy Lutomirski wrote:
> Yuck -- when I wrote that thing, I hadn't imagined that an interrupt
> (there's nothing particularly special about NMIs here, I think) would
> try to access user memory. The fix below looks okay, but IMO it needs
> a big fat comment explaining what's going on.

Agreed on both points, we can equally trigger this using software
timers, so any interrupt must be exempt. And yes a comment!

> Is there a way to ask whether the previous entry into the kernel came
> from user space?

Not afaik, but in_interrupt() gets us any interrupt context, whatever
remains must be task context. Still not quite the same, but close enough
I think.

> The valid "sig_on_uaccess_error" case happens when
> the current fault was triggered by a fault from userspace. The
> invalid case (and any invalid case from, say, an int3 that a
> tracepoint stuck in there) would be a page fault triggered by a fault
> handler that in turn started in kernel space (in particular, in
> emulate_vsyscall).
>
> For that matter, why does current_thread_info() work from an NMI at
> all? Does the NMI vector not have its own stack? The call that
> installs it is set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK).

NMIs do have their own stack, however x86_64 grabs kernel_stack from a
per-cpu variable, not rsp.

> In any case, this at least needs a comment. I don't see why this same
> bug couldn't be triggered by non-NMI based tracing mechanisms, though.
>
> Sigh, corner cases of corner cases...

:-)

Something like this perhaps?

---
Subject: x86, mm: Allow double faults from interrupts

Waiman managed to trigger a PMI while in a emulate_vsyscall() fault, the
PMI in turn managed to trigger a fault while obtaining a stack trace.
This triggered the double fault logic and killed the process dead.

Fix this by explicitly excluding interrupts from the double fault logic.

Reported-by: Waiman Long <waima...@hp.com>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
---
arch/x86/mm/fault.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9ff85bb8dd69..4c8e32986aad 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -641,6 +641,20 @@ no_context(struct pt_regs *regs, unsigned long error_code,

/* Are we prepared to handle this kernel fault? */
if (fixup_exception(regs)) {

+ /*
+ * Any interrupt that takes a fault gets the fixup. This
+ * makes the below double fault logic only apply to a
+ * task double faulting from task context.
+ */
+ if (in_interrupt())
+ return;
+
+ /*
+ * Per the above we're !in_interrupt(), aka. task context.
+ *
+ * In this case we need to make sure we're not double faulting
+ * through the emulate_vsyscall() logic.
+ */

if (current_thread_info()->sig_on_uaccess_error && signal) {

tsk->thread.trap_nr = X86_TRAP_PF;
tsk->thread.error_code = error_code | PF_USER;

@@ -649,6 +663,10 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* XXX: hwpoison faults will set the wrong code. */
force_sig_info_fault(signal, si_code, address, tsk, 0);
}
+
+ /*
+ * Barring that, we can do the fixup and be happy.
+ */
return;

Peter Zijlstra

unread,

Jan 10, 2014, 3:20:02 PM1/10/14

to

On Fri, Jan 10, 2014 at 11:56:07AM -0800, Andy Lutomirski wrote:
> - The perf callchain code tries to read from a bad userspace pointer
> (not sure why -- the ip value in the vsyscall page *is* readable).
> That traps (as expected), but the trap handler injects SIGSEGV due to
> sig_on_uaccess_fault==1.

So the perf thing tries to walk the entire stack in order to obtain a
callchain. Its fairly common to hit crap when attempting this :-)

Peter Zijlstra

unread,

Jan 10, 2014, 3:20:02 PM1/10/14

to

On Fri, Jan 10, 2014 at 02:37:10PM -0500, Waiman Long wrote:
> >@@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> >
> > /* Are we prepared to handle this kernel fault? */
> > if (fixup_exception(regs)) {
> >- if (current_thread_info()->sig_on_uaccess_error&& signal) {
> >+ if (!in_nmi()&& current_thread_info()->sig_on_uaccess_error&& signal) {
> > tsk->thread.trap_nr = X86_TRAP_PF;
> > tsk->thread.error_code = error_code | PF_USER;
> > tsk->thread.cr2 = address;
>
> Yes, this change fixed the error that I got. I no longer see SIGSEGV when I
> run the test.

Awesome, just send a more elaborate version that should have the same
effect.

> I did tried to back out your "perf: Fix arch_perf_out_copy_user default"
> patch, but it didn't fix the problem.

The culprit was: e00b12e64be9 ("perf/x86: Further optimize
copy_from_user_nmi()")

Andy Lutomirski

unread,

Jan 10, 2014, 3:30:02 PM1/10/14

to

My only real nit is the same thing that confused me the first time I
read your email -- when I see "double fault", I think #DF. How about
something like "It's possible for an interrupt to occur while
sig_on_uaccess_error is set. In that case, uaccess faults that are
caused by the interrupt handler should not send signals."?

--Andy

Waiman Long

unread,

Jan 15, 2014, 10:40:03 AM1/15/14

to

> if (current_thread_info()->sig_on_uaccess_error&& signal) {

> tsk->thread.trap_nr = X86_TRAP_PF;
> tsk->thread.error_code = error_code | PF_USER;
> @@ -649,6 +663,10 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> /* XXX: hwpoison faults will set the wrong code. */
> force_sig_info_fault(signal, si_code, address, tsk, 0);
> }
> +
> + /*
> + * Barring that, we can do the fixup and be happy.
> + */
> return;
> }
>

Are you going to send out an official patch to fix this problem? I
really like to see it merged into 3.13 before it is released.

-Longman

tip-bot for Peter Zijlstra

unread,

Jan 16, 2014, 8:50:03 AM1/16/14

to

Commit-ID: c026b3591e4f2a4993df773183704bb31634e0bd
Gitweb: http://git.kernel.org/tip/c026b3591e4f2a4993df773183704bb31634e0bd
Author: Peter Zijlstra <pet...@infradead.org>
AuthorDate: Fri, 10 Jan 2014 21:06:03 +0100
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 16 Jan 2014 09:19:48 +0100

x86, mm, perf: Allow recursive faults from interrupts

Waiman managed to trigger a PMI while in a emulate_vsyscall() fault,
the PMI in turn managed to trigger a fault while obtaining a stack

trace. This triggered the sig_on_uaccess_error recursive fault logic

and killed the process dead.

Fix this by explicitly excluding interrupts from the recursive fault
logic.

Reported-and-Tested-by: Waiman Long <waima...@hp.com>
Fixes: e00b12e64be9 ("perf/x86: Further optimize copy_from_user_nmi()")
Cc: Aswin Chandramouleeswaran <as...@hp.com>
Cc: Scott J Norton <scott....@hp.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Andy Lutomirski <lu...@amacapital.net>
Cc: Arnaldo Carvalho de Melo <ac...@ghostprotocols.net>
Cc: Andrew Morton <ak...@linux-foundation.org>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
Link: http://lkml.kernel.org/r/2014011020...@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mi...@kernel.org>

---
arch/x86/mm/fault.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c

index 9ff85bb..9d591c8 100644

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -641,6 +641,20 @@ no_context(struct pt_regs *regs, unsigned long error_code,

/* Are we prepared to handle this kernel fault? */
if (fixup_exception(regs)) {
+ /*

+ * Any interrupt that takes a fault gets the fixup. This makes
+ * the below recursive fault logic only apply to a faults from
+ * task context.

+ */
+ if (in_interrupt())
+ return;
+
+ /*
+ * Per the above we're !in_interrupt(), aka. task context.
+ *

+ * In this case we need to make sure we're not recursively
+ * faulting through the emulate_vsyscall() logic.

Waiman Long

unread,

Jan 17, 2014, 1:20:03 PM1/17/14

to

Will that be picked up by Linus as it is a 3.13 regression?

-Longman

Andy Lutomirski

unread,

Jan 17, 2014, 2:20:01 PM1/17/14

to

Does anyone actually know why this regressed recently? The buggy code
has been there for quite a while.

--Andy

Waiman Long

unread,

Jan 17, 2014, 3:10:03 PM1/17/14

to

Yes, the bug was there for a while, but a recent change by Peter (see
the "Fixes:" line above) made it much easier to hit it.

-Longman

Andy Lutomirski

unread,

Jan 17, 2014, 4:10:03 PM1/17/14

to

Thanks!

So I feel slightly better now -- this particular bug didn't actually
exist when I wrote the offending code :) But that also means that
this should really be fixed in 3.13.

--Andy