
frequent lockups in 3.18rc4


Dave Jones

Nov 14, 2014, 4:31:40 PM
to Linux Kernel, Linus Torvalds
I'm not sure how long this goes back (3.17 was fine afair) but I'm
seeing these several times a day lately..


NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570]
irq event stamp: 74224
hardirqs last enabled at (74223): [<ffffffff9c875664>] restore_args+0x0/0x30
hardirqs last disabled at (74224): [<ffffffff9c8759aa>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (74222): [<ffffffff9c07f43a>] __do_softirq+0x26a/0x6f0
softirqs last disabled at (74209): [<ffffffff9c07fb4d>] irq_exit+0x13d/0x170
CPU: 3 PID: 25570 Comm: trinity-c129 Not tainted 3.18.0-rc4+ #83 [loadavg: 198.04 186.66 181.58 24/442 26708]
task: ffff880213442f00 ti: ffff8801ea714000 task.ti: ffff8801ea714000
RIP: 0010:[<ffffffff9c11e98a>] [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0
RSP: 0018:ffff8801ea717a08 EFLAGS: 00000202
RAX: ffff880213442f00 RBX: ffffffff9c875664 RCX: 0000000000000006
RDX: 0000000000001370 RSI: ffff880213443790 RDI: ffff880213442f00
RBP: ffff8801ea717a68 R08: ffff880242b56690 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ea717978
R13: ffff880213442f00 R14: ffff8801ea714000 R15: ffff880213442f00
FS: 00007f240994e700(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000004 CR3: 000000019a017000 CR4: 00000000001407e0
DR0: 00007fb3367e0000 DR1: 00007f82542ab000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffffffff9ce4c620 0000000000000000 ffffffff9c048b20 ffff8801ea717b18
0000000000000003 0000000052e0da3d ffffffff9cc7ef3c 0000000000000002
ffffffff9c048b20 ffff8801ea717b18 0000000000000001 0000000000000003
Call Trace:
[<ffffffff9c048b20>] ? leave_mm+0x210/0x210
[<ffffffff9c048b20>] ? leave_mm+0x210/0x210
[<ffffffff9c11ead6>] smp_call_function_single+0x66/0x110
[<ffffffff9c048b20>] ? leave_mm+0x210/0x210
[<ffffffff9c11f021>] smp_call_function_many+0x2f1/0x390
[<ffffffff9c049300>] flush_tlb_mm_range+0xe0/0x370
[<ffffffff9c1d95a2>] tlb_flush_mmu_tlbonly+0x42/0x50
[<ffffffff9c1d9cb5>] tlb_finish_mmu+0x45/0x50
[<ffffffff9c1daf59>] zap_page_range_single+0x119/0x170
[<ffffffff9c1db140>] unmap_mapping_range+0x140/0x1b0
[<ffffffff9c1c7edd>] shmem_fallocate+0x43d/0x540
[<ffffffff9c0b111b>] ? preempt_count_sub+0xab/0x100
[<ffffffff9c0cdac7>] ? prepare_to_wait+0x27/0x80
[<ffffffff9c2287f3>] ? __sb_start_write+0x103/0x1d0
[<ffffffff9c223aba>] do_fallocate+0x12a/0x1c0
[<ffffffff9c1f0bd3>] SyS_madvise+0x3d3/0x890
[<ffffffff9c1a40d2>] ? context_tracking_user_exit+0x52/0x260
[<ffffffff9c013ebd>] ? syscall_trace_enter_phase2+0x10d/0x3d0
[<ffffffff9c874c89>] tracesys_phase2+0xd4/0xd9
Code: 63 c7 48 89 de 48 89 df 48 c7 c2 c0 50 1d 00 48 03 14 c5 40 b9 f2 9c e8 d5 ea 2b 00 84 c0 74 0b e9 bc 00 00 00 0f 1f 40 00 f3 90 <f6> 43 18 01 75 f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00
Kernel panic - not syncing: softlockup: hung tasks


I've got a local hack to dump loadavg on traces, and as you can see in that
example, the machine was really busy, but we were at least making progress
before the trace spewed, and the machine rebooted. (I have the reboot-on-lockup sysctl
set; without it, the machine just wedges indefinitely shortly after the spew).
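The reboot-on-lockup setup being referred to is presumably the standard softlockup sysctls; a minimal sketch (the 30-second panic timeout is an assumed value, not from the mail):

```shell
# Turn a detected soft lockup into a panic, then reboot automatically.
# kernel.softlockup_panic and kernel.panic are the standard knobs; the
# 30-second reboot delay is an assumption for illustration.
sysctl -w kernel.softlockup_panic=1
sysctl -w kernel.panic=30
```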

The trace doesn't really enlighten me as to what we should be doing
to prevent this though.

ideas?
I can try to bisect it, but it takes hours before it happens,
so it might take days to complete, and the next few weeks are
complicated timewise..

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Linus Torvalds

Nov 14, 2014, 5:01:36 PM
to Dave Jones, Linux Kernel, the arch/x86 maintainers
On Fri, Nov 14, 2014 at 1:31 PM, Dave Jones <da...@redhat.com> wrote:
> I'm not sure how long this goes back (3.17 was fine afair) but I'm
> seeing these several times a day lately..

Hmm. I don't see what would have changed in this area since v3.17.
There's a TLB range fix in mm/memory.c, but for the life of me I can't
see how that would possibly matter the way x86 does TLB flushing (if
the range fix does something bad and the range goes too large, x86
will just end up doing a full TLB invalidate instead).

Plus, judging by the fact that there's a stale "leave_mm+0x210/0x210"
(wouldn't that be the *next* function, namely do_flush_tlb_all())
pointer on the stack, I suspect that whole range-flushing doesn't even
trigger, and we are flushing everything.

But since you say "several times a day", just for fun, can you test
the follow-up patch to that one-liner fix that Will Deacon posted
today (Subject: "[PATCH] mmu_gather: move minimal range calculations
into generic code"). That does some further cleanup in this area.

I don't see any changes to the x86 IPI or TLB flush handling, but
maybe I'm missing something, so I'm adding the x86 maintainers to the
cc.

> I've got a local hack to dump loadavg on traces, and as you can see in that
> example, the machine was really busy, but we were at least making progress
> before the trace spewed, and the machine rebooted. (I have the reboot-on-lockup sysctl
> set; without it, the machine just wedges indefinitely shortly after the spew).
>
> The trace doesn't really enlighten me as to what we should be doing
> to prevent this though.
>
> ideas?

I can't say I have any ideas except to point at the TLB range patch,
and quite frankly, I don't see how that would matter.

If Will's patch doesn't make a difference, what about reverting that
ce9ec37bddb6? Although it really *is* an "obvious bugfix", and I really
don't see why any of this would be noticeable on x86 (it triggered
issues on ARM64, but that was because ARM64 cared much more about the
exact range).

> I can try to bisect it, but it takes hours before it happens,
> so it might take days to complete, and the next few weeks are
> complicated timewise..

Hmm. Even narrowing it down a bit might help, ie if you could get say
four bisections in over a day, and see if that at least says "ok, it's
likely one of these pulls".

But yeah, I can see it being painful, so maybe a quick check of the
TLB ones, even if I can't for the life see why they would possibly
matter.

Linus

---
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570]
> irq event stamp: 74224
> hardirqs last enabled at (74223): [<ffffffff9c875664>] restore_args+0x0/0x30
> hardirqs last disabled at (74224): [<ffffffff9c8759aa>] apic_timer_interrupt+0x6a/0x80
> softirqs last enabled at (74222): [<ffffffff9c07f43a>] __do_softirq+0x26a/0x6f0
> softirqs last disabled at (74209): [<ffffffff9c07fb4d>] irq_exit+0x13d/0x170
> CPU: 3 PID: 25570 Comm: trinity-c129 Not tainted 3.18.0-rc4+ #83 [loadavg: 198.04 186.66 181.58 24/442 26708]
> RIP: 0010:[<ffffffff9c11e98a>] [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0
> Call Trace:
> [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> [<ffffffff9c11ead6>] smp_call_function_single+0x66/0x110
> [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> [<ffffffff9c11f021>] smp_call_function_many+0x2f1/0x390
> [<ffffffff9c049300>] flush_tlb_mm_range+0xe0/0x370
> [<ffffffff9c1d95a2>] tlb_flush_mmu_tlbonly+0x42/0x50
> [<ffffffff9c1d9cb5>] tlb_finish_mmu+0x45/0x50
> [<ffffffff9c1daf59>] zap_page_range_single+0x119/0x170
> [<ffffffff9c1db140>] unmap_mapping_range+0x140/0x1b0
> [<ffffffff9c1c7edd>] shmem_fallocate+0x43d/0x540
> [<ffffffff9c223aba>] do_fallocate+0x12a/0x1c0
> [<ffffffff9c1f0bd3>] SyS_madvise+0x3d3/0x890
> [<ffffffff9c874c89>] tracesys_phase2+0xd4/0xd9
> Kernel panic - not syncing: softlockup: hung tasks

Dave Jones

Nov 14, 2014, 5:31:17 PM
to Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Fri, Nov 14, 2014 at 02:01:27PM -0800, Linus Torvalds wrote:

> Plus, judging by the fact that there's a stale "leave_mm+0x210/0x210"
> (wouldn't that be the *next* function, namely do_flush_tlb_all())
> pointer on the stack, I suspect that whole range-flushing doesn't even
> trigger, and we are flushing everything.
>
> But since you say "several times a day", just for fun, can you test
> the follow-up patch to that one-liner fix that Will Deacon posted
> today (Subject: "[PATCH] mmu_gather: move minimal range calculations
> into generic code"). That does some further cleanup in this area.

I'll give it a shot. Should know by the morning if it changes anything.

> > The trace doesn't really enlighten me as to what we should be doing
> > to prevent this though.
> >
> > ideas?
>
> I can't say I have any ideas except to point at the TLB range patch,
> and quite frankly, I don't see how that would matter.
>
> If Will's patch doesn't make a difference, what about reverting that
> ce9ec37bddb6? Although it really *is* an "obvious bugfix", and I really
> don't see why any of this would be noticeable on x86 (it triggered
> issues on ARM64, but that was because ARM64 cared much more about the
> exact range).

Digging through the serial console logs, there was one other trace variant,
which is even less informative..

NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c104:19168]
irq event stamp: 223186
hardirqs last enabled at (223185): [<ffffffff941a4092>] context_tracking_user_exit+0x52/0x260
hardirqs last disabled at (223186): [<ffffffff948756aa>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (187030): [<ffffffff9407f43a>] __do_softirq+0x26a/0x6f0
softirqs last disabled at (187017): [<ffffffff9407fb4d>] irq_exit+0x13d/0x170
CPU: 3 PID: 19168 Comm: trinity-c104 Not tainted 3.18.0-rc4+ #82 [loadavg: 99.30 85.88 82.88 9/303 19302]
task: ffff88023f8b4680 ti: ffff880157418000 task.ti: ffff880157418000
RIP: 0010:[<ffffffff941a4094>] [<ffffffff941a4094>] context_tracking_user_exit+0x54/0x260
RSP: 0018:ffff88015741bee8 EFLAGS: 00000246
RAX: ffff88023f8b4680 RBX: ffffffff940b111b RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88023f8b4680
RBP: ffff88015741bef8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88015741bf58 R14: ffff88023f8b4ae8 R15: ffff88023f8b4b18
FS: 00007f9a0789b740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003dfa1b7c90 CR3: 0000000165f3c000 CR4: 00000000001407e0
DR0: 00000000ffffffbf DR1: 00007f2c0c3d9000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
0000000000080000 ffff88015741c000 ffff88015741bf78 ffffffff94013d35
ffff88015741bf28 ffffffff940d865d 0000000000004b02 0000000000000000
00007f9a071bb000 ffffffff943d816b 0000000000000000 0000000000000000
Call Trace:
[<ffffffff94013d35>] syscall_trace_enter_phase1+0x125/0x1a0
[<ffffffff940d865d>] ? trace_hardirqs_on_caller+0x16d/0x210
[<ffffffff943d816b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff9487487f>] tracesys+0x14/0x4a
Code: fa e8 51 0a f3 ff 48 c7 c7 26 52 cd 94 e8 f5 21 24 00 65 8b 04 25 f4 f8 1c 00 83 f8 01 74 28 f6 c7 02 74 13 e8 6e 46 f3 ff 53 9d <5b> 41 5c 5d c3 0f 1f 80 00 00 00 00 53 9d e8 19 0a f3 ff eb eb

It looks like I've been seeing these since 3.18-rc1 though,
but for those, the machine crashed before the trace even made it
over usb-serial, leaving just the "NMI watchdog" line.


> > I can try to bisect it, but it takes hours before it happens,
> > so it might take days to complete, and the next few weeks are
> > complicated timewise..
>
> Hmm. Even narrowing it down a bit might help, ie if you could get say
> four bisections in over a day, and see if that at least says "ok, it's
> likely one of these pulls".
>
> But yeah, I can see it being painful, so maybe a quick check of the
> TLB ones, even if I can't for the life see why they would possibly
> matter.

Assuming the NMI watchdog traces I saw in rc1 are the same problem,
I'll see if I can bisect between .17 and .18rc1 on Monday, and see
if that yields anything interesting.

Dave

Thomas Gleixner

Nov 14, 2014, 5:55:42 PM
to Linus Torvalds, Dave Jones, Linux Kernel, the arch/x86 maintainers
On Fri, 14 Nov 2014, Linus Torvalds wrote:
> On Fri, Nov 14, 2014 at 1:31 PM, Dave Jones <da...@redhat.com> wrote:
> > I'm not sure how long this goes back (3.17 was fine afair) but I'm
> > seeing these several times a day lately..
>
> Plus, judging by the fact that there's a stale "leave_mm+0x210/0x210"
> (wouldn't that be the *next* function, namely do_flush_tlb_all())
> pointer on the stack, I suspect that whole range-flushing doesn't even
> trigger, and we are flushing everything.

This stale entry is not relevant here because the thing is stuck in
generic_exec_single().

> > NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570]
> > RIP: 0010:[<ffffffff9c11e98a>] [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0

> > Call Trace:
> > [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> > [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> > [<ffffffff9c11ead6>] smp_call_function_single+0x66/0x110
> > [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> > [<ffffffff9c11f021>] smp_call_function_many+0x2f1/0x390
> > [<ffffffff9c049300>] flush_tlb_mm_range+0xe0/0x370

flush_tlb_mm_range()
.....
out:
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), mm, start, end);

which calls

smp_call_function_many() via native_flush_tlb_others()

which is either inlined or not on the stack, since the invocation of
smp_call_function_many() is a tail call.

So from smp_call_function_many() we end up via
smp_call_function_single() in generic_exec_single().

So the only ways to get stuck there are:

csd_lock(csd);
and
csd_lock_wait(csd);

The called function is flush_tlb_func() and I really can't see why
that would get stuck at all.

So this looks more like a smp function call fuckup.

I assume Dave is running that stuff on KVM. So it might be worthwhile
to look at the IPI magic there.

Thanks,

tglx

Dave Jones

Nov 14, 2014, 6:32:40 PM
to Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Fri, Nov 14, 2014 at 11:55:30PM +0100, Thomas Gleixner wrote:

> So this looks more like a smp function call fuckup.
>
> I assume Dave is running that stuff on KVM. So it might be worthwhile
> to look at the IPI magic there.

no, bare metal.

Dave

Thomas Gleixner

Nov 14, 2014, 7:36:54 PM
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Fri, 14 Nov 2014, Dave Jones wrote:

> On Fri, Nov 14, 2014 at 11:55:30PM +0100, Thomas Gleixner wrote:
>
> > So this looks more like a smp function call fuckup.
> >
> > I assume Dave is running that stuff on KVM. So it might be worthwhile
> > to look at the IPI magic there.
>
> no, bare metal.

Ok, but that does not change the fact that we are stuck in
smp_function_call land.

Enabling softlockup_all_cpu_backtrace will probably not help much as
we will end up waiting for csd_lock again :(

Is the machine still accessible when this happens? If yes, we might
enable a few trace points and functions and read out the trace
buffer. If not, we could just panic the machine and dump the trace
buffer over serial.
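A sketch of what enabling a few trace points could look like with ftrace; the chosen events and filtered functions here are illustrative assumptions, and ftrace_dump_on_oops makes the kernel dump the trace buffer to the console when it panics:

```shell
# Illustrative ftrace setup (the event/function choices are assumed,
# not from the original mail); requires tracefs/debugfs mounted.
cd /sys/kernel/debug/tracing
echo smp_call_function_single smp_call_function_many > set_ftrace_filter
echo function > current_tracer
echo 1 > events/irq/enable                      # irq entry/exit events
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops   # dump buffer on panic
```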

Sigh

tglx

Linus Torvalds

Nov 14, 2014, 9:00:12 PM
to Thomas Gleixner, Dave Jones, Linux Kernel, the arch/x86 maintainers
On Fri, Nov 14, 2014 at 2:55 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
>
> This stale entry is not relevant here because the thing is stuck in
> generic_exec_single().

That wasn't really my argument. The fact that "do_flush_tlb_all()" was
left over on the stack frame implies that we're not doing the
range-flush, and if it was some odd bug with a negative range or
something like that (due to the fix in commit ce9ec37bddb6), I'd
expect the lockup to be due to a hung do_kernel_range_flush() or
something. But the range flushing never even happens.

> So from smp_call_function_many() we end up via
> smp_call_function_single() in generic_exec_single().
>
> So the only ways to get stuck there are:
>
> csd_lock(csd);
> and
> csd_lock_wait(csd);

Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
end. The disassembly looks like

29: f3 90 pause
2b:* f6 43 18 01 testb $0x1,0x18(%rbx) <-- trapping instruction
2f: 75 f8 jne 0x29
31: 31 c0 xor %eax,%eax

and that "xor %eax,%eax" seems to be part of the "return 0"
immediately afterwards.

But that's not entirely conclusive, it's just a strong hint.

It does sound like there might be some IPI issue. I just don't see
*any* changes in this area since 3.17. Some unrelated APIC change? I
don't see that either. As you noted, there are KVM changes, but
apparently that isn't involved either.

Linus

Dave Jones

Nov 14, 2014, 9:41:03 PM
to Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Sat, Nov 15, 2014 at 01:36:41AM +0100, Thomas Gleixner wrote:
> On Fri, 14 Nov 2014, Dave Jones wrote:
>
> > On Fri, Nov 14, 2014 at 11:55:30PM +0100, Thomas Gleixner wrote:
> >
> > > So this looks more like a smp function call fuckup.
> > >
> > > I assume Dave is running that stuff on KVM. So it might be worthwhile
> > > to look at the IPI magic there.
> >
> > no, bare metal.
>
> Ok, but that does not change the fact that we are stuck in
> smp_function_call land.
>
> Enabling softlockup_all_cpu_backtrace will probably not help much as
> we will end up waiting for csd_lock again :(
>
> Is the machine still accessible when this happens? If yes, we might
> enable a few trace points and functions and read out the trace
> buffer. If not, we could just panic the machine and dump the trace
> buffer over serial.

No, it wedges solid. Even though it says something like "CPU3 locked up",
apparently all cores also get stuck.
9 times out of 10 it doesn't stay alive long enough to even get the full
trace out over usb-serial.

Dave

Dave Jones

Nov 15, 2014, 4:34:24 PM
to Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Fri, Nov 14, 2014 at 02:01:27PM -0800, Linus Torvalds wrote:

> But since you say "several times a day", just for fun, can you test
> the follow-up patch to that one-liner fix that Will Deacon posted
> today (Subject: "[PATCH] mmu_gather: move minimal range calculations
> into generic code"). That does some further cleanup in this area.

A few hours ago it hit the NMI watchdog again with that patch applied.
Incomplete trace, but it looks different based on what did make it over.
Different RIP at least.

[65155.054155] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [trinity-c127:12559]
[65155.054573] irq event stamp: 296752
[65155.054589] hardirqs last enabled at (296751): [<ffffffff9d87403d>] _raw_spin_unlock_irqrestore+0x5d/0x80
[65155.054625] hardirqs last disabled at (296752): [<ffffffff9d875cea>] apic_timer_interrupt+0x6a/0x80
[65155.054657] softirqs last enabled at (296188): [<ffffffff9d259943>] bdi_queue_work+0x83/0x270
[65155.054688] softirqs last disabled at (296184): [<ffffffff9d259920>] bdi_queue_work+0x60/0x270
[65155.054721] CPU: 1 PID: 12559 Comm: trinity-c127 Not tainted 3.18.0-rc4+ #84 [loadavg: 209.68 187.90 185.33 34/431 17515]
[65155.054795] task: ffff88023f664680 ti: ffff8801649f0000 task.ti: ffff8801649f0000
[65155.054820] RIP: 0010:[<ffffffff9d87403f>] [<ffffffff9d87403f>] _raw_spin_unlock_irqrestore+0x5f/0x80
[65155.054852] RSP: 0018:ffff8801649f3be8 EFLAGS: 00000292
[65155.054872] RAX: ffff88023f664680 RBX: 0000000000000007 RCX: 0000000000000007
[65155.054895] RDX: 00000000000029e0 RSI: ffff88023f664ea0 RDI: ffff88023f664680
[65155.054919] RBP: ffff8801649f3bf8 R08: 0000000000000000 R09: 0000000000000000
[65155.055956] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[65155.056985] R13: ffff8801649f3b58 R14: ffffffff9d3e7d0e R15: 00000000000003e0
[65155.058037] FS: 00007f0dc957c700(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[65155.059083] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[65155.060121] CR2: 00007f0dc958e000 CR3: 000000022f31e000 CR4: 00000000001407e0
[65155.061152] DR0: 00007f54162bc000 DR1: 00007feb92c3d000 DR2: 0000000000000000
[65155.062180] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[65155.063202] Stack:

And that's all she wrote.

> If Will's patch doesn't make a difference, what about reverting that
> ce9ec37bddb6? Although it really *is* a "obvious bugfix", and I really
> don't see why any of this would be noticeable on x86 (it triggered
> issues on ARM64, but that was because ARM64 cared much more about the
> exact range).

I'll try that next, and check in on it tomorrow.

Dave

Dave Jones

Nov 15, 2014, 8:40:57 PM
to Linus Torvalds, Linux Kernel, the arch/x86 maintainers
No luck. Died even faster this time.

[ 772.459481] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [modprobe:31400]
[ 772.459858] irq event stamp: 3362
[ 772.459872] hardirqs last enabled at (3361): [<ffffffff941a437c>] context_tracking_user_enter+0x9c/0x2c0
[ 772.459907] hardirqs last disabled at (3362): [<ffffffff94875bea>] apic_timer_interrupt+0x6a/0x80
[ 772.459937] softirqs last enabled at (0): [<ffffffff940764d5>] copy_process.part.26+0x635/0x1d80
[ 772.459968] softirqs last disabled at (0): [< (null)>] (null)
[ 772.459996] CPU: 3 PID: 31400 Comm: modprobe Not tainted 3.18.0-rc4+ #85 [loadavg: 207.70 163.33 92.64 11/433 31547]
[ 772.460086] task: ffff88022f0b2f00 ti: ffff88019a944000 task.ti: ffff88019a944000
[ 772.460110] RIP: 0010:[<ffffffff941a437e>] [<ffffffff941a437e>] context_tracking_user_enter+0x9e/0x2c0
[ 772.460142] RSP: 0018:ffff88019a947f00 EFLAGS: 00000282
[ 772.460161] RAX: ffff88022f0b2f00 RBX: 0000000000000000 RCX: 0000000000000000
[ 772.460184] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88022f0b2f00
[ 772.460207] RBP: ffff88019a947f10 R08: 0000000000000000 R09: 0000000000000000
[ 772.460229] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88019a947e90
[ 772.460252] R13: ffffffff940f6d04 R14: ffff88019a947ec0 R15: ffff8802447cd640
[ 772.460294] FS: 00007f3b71ee4700(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[ 772.460362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 772.460391] CR2: 00007fffdad5af58 CR3: 000000011608e000 CR4: 00000000001407e0
[ 772.460424] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 772.460447] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 772.460470] Stack:
[ 772.460480] ffff88019a947f58 00000000006233a8 ffff88019a947f40 ffffffff9401429d
[ 772.460512] 00000000006233a8 000000000041d68a 00000000006233a8 0000000000000000
[ 772.460543] 00000000006233a0 ffffffff94874fa4 000000001008feff 000507d93d73a434
[ 772.460574] Call Trace:
[ 772.461576] [<ffffffff9401429d>] syscall_trace_leave+0xad/0x2e0
[ 772.462572] [<ffffffff94874fa4>] int_check_syscall_exit_work+0x34/0x3d
[ 772.463575] Code: f8 1c 00 84 c0 75 46 48 c7 c7 51 53 cd 94 e8 aa 23 24 00 65 c7 04 25 f4 f8 1c 00 01 00 00 00 f6 c7 02 74 19 e8 84 43 f3 ff 53 9d <5b> 41 5c 5d c3 0f 1f 44 00 00 c3 0f 1f 80 00 00 00 00 53 9d e8
[ 772.465797] Kernel panic - not syncing: softlockup: hung tasks
[ 772.466821] CPU: 3 PID: 31400 Comm: modprobe Tainted: G L 3.18.0-rc4+ #85 [loadavg: 207.70 163.33 92.64 11/433 31547]
[ 772.468915] ffff88022f0b2f00 00000000de65d5f5 ffff880244603dc8 ffffffff94869e01
[ 772.470031] 0000000000000000 ffffffff94c7599b ffff880244603e48 ffffffff94866b21
[ 772.471085] ffff880200000008 ffff880244603e58 ffff880244603df8 00000000de65d5f5
[ 772.472141] Call Trace:
[ 772.473183] <IRQ> [<ffffffff94869e01>] dump_stack+0x4f/0x7c
[ 772.474253] [<ffffffff94866b21>] panic+0xcf/0x202
[ 772.475346] [<ffffffff94154d1e>] watchdog_timer_fn+0x27e/0x290
[ 772.476414] [<ffffffff94106297>] __run_hrtimer+0xe7/0x740
[ 772.477475] [<ffffffff94106b64>] ? hrtimer_interrupt+0x94/0x270
[ 772.478555] [<ffffffff94154aa0>] ? watchdog+0x40/0x40
[ 772.479627] [<ffffffff94106be7>] hrtimer_interrupt+0x117/0x270
[ 772.480703] [<ffffffff940303db>] local_apic_timer_interrupt+0x3b/0x70
[ 772.481777] [<ffffffff948777f3>] smp_apic_timer_interrupt+0x43/0x60
[ 772.482856] [<ffffffff94875bef>] apic_timer_interrupt+0x6f/0x80
[ 772.483915] <EOI> [<ffffffff941a437e>] ? context_tracking_user_enter+0x9e/0x2c0
[ 772.484972] [<ffffffff9401429d>] syscall_trace_leave+0xad/0x2e0
[ 772.486042] [<ffffffff94874fa4>] int_check_syscall_exit_work+0x34/0x3d
[ 772.487187] Kernel Offset: 0x13000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Linus Torvalds

Nov 16, 2014, 1:35:28 AM
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Sat, Nov 15, 2014 at 5:40 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > I'll try that next, and check in on it tomorrow.
>
> No luck. Died even faster this time.

Yeah, and your other lockups haven't even been TLB related. Not that
they look like anything else *either*.

I have no ideas left. I'd go for a bisection - rather than try random
things, at least bisection will get us a smaller set of suspects if
you can go through a few cycles of it. Even if you decide that you
want to run for most of a day before you are convinced it's all good,
a couple of days should get you a handful of bisection points (that's
assuming you hit a couple of bad ones too that turn bad in a shorter
while). And four or five bisections should get us from 11k commits down
to the ~600 commit range. That would be a huge improvement.
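The mechanics of such a bisection can be scripted with `git bisect run`; a toy sketch on a scratch repository (the real run would be on the kernel tree between v3.17 and v3.18-rc4, with a multi-hour trinity run standing in for the predicate):

```shell
# Toy illustration of "git bisect run" mechanics on a scratch repo.
# Ten commits are created; commit 7 "introduces the bug" by flipping
# the contents of the 'state' file from good to bad.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
for i in $(seq 1 10); do
    echo "$i" > n                        # make every commit distinct
    if [ "$i" -ge 7 ]; then echo bad > state; else echo good > state; fi
    git add n state
    git commit -q -m "commit $i"
done
first=$(git rev-list --max-parents=0 HEAD)   # root commit is known good
git bisect start HEAD "$first"               # HEAD is known bad
# the predicate exits 0 for good commits, non-zero for bad ones
out=$(git bisect run sh -c 'grep -q good state')
printf '%s\n' "$out" | grep "is the first bad commit"
git bisect reset >/dev/null 2>&1
```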

Linus

Markus Trippelsdorf

Nov 16, 2014, 5:06:47 AM
to Linus Torvalds, Dave Jones, Linux Kernel, the arch/x86 maintainers
On 2014.11.15 at 22:33 -0800, Linus Torvalds wrote:
> On Sat, Nov 15, 2014 at 5:40 PM, Dave Jones <da...@redhat.com> wrote:
> > >
> > > I'll try that next, and check in on it tomorrow.
> >
> > No luck. Died even faster this time.
>
> Yeah, and your other lockups haven't even been TLB related. Not that
> they look like anything else *either*.
>
> I have no ideas left. I'd go for a bisection

Before starting a bisection you could try disabling transparent hugepages.
There are strange bugs in this area that were introduced during this merge
window. See: https://lkml.org/lkml/2014/11/4/144
https://lkml.org/lkml/2014/11/4/904
http://thread.gmane.org/gmane.linux.kernel.mm/124451
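Disabling THP for a test run takes one write to the standard sysfs knob (or transparent_hugepage=never on the kernel command line):

```shell
# Turn transparent hugepages off at runtime; the active mode is shown
# in [brackets] when the file is read back.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
```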

--
Markus

Thomas Gleixner

Nov 16, 2014, 7:17:35 AM
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Fri, 14 Nov 2014, Dave Jones wrote:
> On Sat, Nov 15, 2014 at 01:36:41AM +0100, Thomas Gleixner wrote:
> > On Fri, 14 Nov 2014, Dave Jones wrote:
> >
> > > On Fri, Nov 14, 2014 at 11:55:30PM +0100, Thomas Gleixner wrote:
> > >
> > > > So this looks more like a smp function call fuckup.
> > > >
> > > > I assume Dave is running that stuff on KVM. So it might be worthwhile
> > > > to look at the IPI magic there.
> > >
> > > no, bare metal.
> >
> > Ok, but that does not change the fact that we are stuck in
> > smp_function_call land.
> >
> > Enabling softlockup_all_cpu_backtrace will probably not help much as
> > we will end up waiting for csd_lock again :(
> >
> > Is the machine still accessible when this happens? If yes, we might
> > enable a few trace points and functions and read out the trace
> > buffer. If not, we could just panic the machine and dump the trace
> > buffer over serial.
>
> No, it wedges solid. Even though it says something like "CPU3 locked up",
> apparently all cores also get stuck.

Does not surprise me. Once the smp function call machinery is wedged...

> 9 times out of 10 it doesn't stay alive long enough to even get the full
> trace out over usb-serial.

usb-serial is definitely not the best tool for stuff like this. I
wonder whether netconsole might give us some more info.
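A netconsole setup is a one-liner; a sketch with made-up addresses (the parameter format is source port@ip/interface, then target port@ip/mac):

```shell
# All addresses, ports, the interface, and the MAC below are assumed
# example values; adjust to the local network.
modprobe netconsole \
    netconsole=6665@192.168.1.5/eth0,6666@192.168.1.10/00:11:22:33:44:55
# and on the machine at 192.168.1.10, capture the logs with e.g.:
#   nc -u -l 6666
```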

Last time I looked into something like that on my laptop I had to
resort to a crash kernel to get anything useful out of the box.

Thanks,

tglx

Linus Torvalds

Nov 16, 2014, 1:33:33 PM
to Markus Trippelsdorf, Dave Jones, Linux Kernel, the arch/x86 maintainers
On Sun, Nov 16, 2014 at 2:06 AM, Markus Trippelsdorf
<mar...@trippelsdorf.de> wrote:
>
> Before starting a bisection you could try and disable transparent_hugepages.
> There are strange bugs, that were introduced during this merge-window, in this
> area. See: https://lkml.org/lkml/2014/11/4/144
> https://lkml.org/lkml/2014/11/4/904
> http://thread.gmane.org/gmane.linux.kernel.mm/124451

Those look different, and hopefully should be fixed by commit
1d5bfe1ffb5b ("mm, compaction: prevent infinite loop in
compact_zone"). Which admittedly isn't in -rc4 (it went in on
Thursday), but I think Dave tends to run git-of-the-day rather than
last rc, so he probably already had it.

I *think* that if it was the infinite compaction problem, you'd have
the soft-lockup reports showing that. Dave's are in random places.
Which is odd.

Linus

Don Zickus

Nov 17, 2014, 10:07:41 AM
to Dave Jones, Linux Kernel, Linus Torvalds
Hi Dave,

When I usually see stuff like this, it is because another cpu is blocking
the IPI sent by smp_call_function_many from finishing, so this cpu waits
forever.

The problem usually becomes obvious with a dump of all cpus at the time
the lockup is detected.

Can you try adding 'softlockup_all_cpu_backtrace=1' to the kernel
commandline? That should dump all the cpus to see if anything stands out.
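For reference, the same knob is also reachable at runtime via sysctl, so a reboot isn't strictly needed to turn it on:

```shell
# Boot-time: add softlockup_all_cpu_backtrace=1 to the kernel command
# line. Runtime equivalent (standard sysctl):
sysctl -w kernel.softlockup_all_cpu_backtrace=1
```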

Though I don't normally see it traverse down to smp_call_function_single.

Anyway something to try.

Cheers,
Don

Dave Jones

Nov 17, 2014, 12:04:20 PM
to Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Sat, Nov 15, 2014 at 10:33:19PM -0800, Linus Torvalds wrote:

> > > I'll try that next, and check in on it tomorrow.
> >
> > No luck. Died even faster this time.
>
> Yeah, and your other lockups haven't even been TLB related. Not that
> they look like anything else *either*.
>
> I have no ideas left. I'd go for a bisection - rather than try random
> things, at least bisection will get us a smaller set of suspects if
> you can go through a few cycles of it. Even if you decide that you
> want to run for most of a day before you are convinced it's all good,
> a couple of days should get you a handful of bisection points (that's
> assuming you hit a couple of bad ones too that turn bad in a shorter
> while). And four or five bisections should get us from 11k commits down
> to the ~600 commit range. That would be a huge improvement.

Great start to the week: I decided to confirm my recollection that .17
was ok, only to hit this within 10 minutes.

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
Call Trace:
<NMI> [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
[<ffffffff9583bcc0>] panic+0xd4/0x207
[<ffffffff95150908>] watchdog_overflow_callback+0x118/0x120
[<ffffffff95193dbe>] __perf_event_overflow+0xae/0x340
[<ffffffff95192230>] ? perf_event_task_disable+0xa0/0xa0
[<ffffffff9501a7bf>] ? x86_perf_event_set_period+0xbf/0x150
[<ffffffff95194be4>] perf_event_overflow+0x14/0x20
[<ffffffff95020676>] intel_pmu_handle_irq+0x206/0x410
[<ffffffff9501966b>] perf_event_nmi_handler+0x2b/0x50
[<ffffffff95007bb2>] nmi_handle+0xd2/0x390
[<ffffffff95007ae5>] ? nmi_handle+0x5/0x390
[<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
[<ffffffff950080a2>] default_do_nmi+0x72/0x1c0
[<ffffffff950082a8>] do_nmi+0xb8/0x100
[<ffffffff9584b9aa>] end_repeat_nmi+0x1e/0x2e
[<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
[<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
[<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
<<EOE>> <IRQ> [<ffffffff95101685>] lock_hrtimer_base.isra.18+0x25/0x50
[<ffffffff951019d3>] hrtimer_try_to_cancel+0x33/0x1f0
[<ffffffff95101baa>] hrtimer_cancel+0x1a/0x30
[<ffffffff95113557>] tick_nohz_restart+0x17/0x90
[<ffffffff95114533>] __tick_nohz_full_check+0xc3/0x100
[<ffffffff9511457e>] nohz_full_kick_work_func+0xe/0x10
[<ffffffff95188894>] irq_work_run_list+0x44/0x70
[<ffffffff951888ea>] irq_work_run+0x2a/0x50
[<ffffffff9510109b>] update_process_times+0x5b/0x70
[<ffffffff95113325>] tick_sched_handle.isra.20+0x25/0x60
[<ffffffff95113801>] tick_sched_timer+0x41/0x60
[<ffffffff95102281>] __run_hrtimer+0x81/0x480
[<ffffffff951137c0>] ? tick_sched_do_timer+0xb0/0xb0
[<ffffffff95102977>] hrtimer_interrupt+0x117/0x270
[<ffffffff950346d7>] local_apic_timer_interrupt+0x37/0x60
[<ffffffff9584c44f>] smp_apic_timer_interrupt+0x3f/0x50
[<ffffffff9584a86f>] apic_timer_interrupt+0x6f/0x80
<EOI> [<ffffffff950d3f3a>] ? lock_release_holdtime.part.28+0x9a/0x160
[<ffffffff950ef3b7>] ? rcu_is_watching+0x27/0x60
[<ffffffff9508cb75>] kill_pid_info+0xf5/0x130
[<ffffffff9508ca85>] ? kill_pid_info+0x5/0x130
[<ffffffff9508ccd3>] SYSC_kill+0x103/0x330
[<ffffffff9508cc7c>] ? SYSC_kill+0xac/0x330
[<ffffffff9519b592>] ? context_tracking_user_exit+0x52/0x1a0
[<ffffffff950d6f1d>] ? trace_hardirqs_on_caller+0x16d/0x210
[<ffffffff950d6fcd>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff950137ad>] ? syscall_trace_enter+0x14d/0x330
[<ffffffff9508f44e>] SyS_kill+0xe/0x10
[<ffffffff95849b24>] tracesys+0xdd/0xe2
Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

It could be a completely different cause for lockup, but seeing this now
has me wondering if perhaps it's something unrelated to the kernel.
I have recollection of running late .17rc's for days without incident,
and I'm pretty sure .17 was ok too. But a few weeks ago I did upgrade
that test box to the Fedora 21 beta. Which means I have a new gcc.
I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
get 4.8 back on there and see if that's any better.

Dave

Linus Torvalds

Nov 17, 2014, 2:59:43 PM
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Mon, Nov 17, 2014 at 9:03 AM, Dave Jones <da...@redhat.com> wrote:
>
> > It could be a completely different cause for lockup, but seeing this now
> has me wondering if perhaps it's something unrelated to the kernel.
> I have recollection of running late .17rc's for days without incident,
> and I'm pretty sure .17 was ok too. But a few weeks ago I did upgrade
> that test box to the Fedora 21 beta. Which means I have a new gcc.
> I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
> get 4.8 back on there and see if that's any better.

I'm not sure if I should be relieved or horrified.

Horrified, I think.

It really would be a wonderful thing to have some kind of "compiler
bisection" with mixed object files to see exactly which file it
miscompiles (and by "miscompiles" it might just be a kernel bug where
we are missing a barrier or something, and older gcc's just happened
to not show it - so it could still easily be a kernel problem).

Linus

Linus Torvalds

Nov 17, 2014, 4:22:22 PM
to Thomas Gleixner, Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
On Fri, Nov 14, 2014 at 5:59 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
> end.

Btw, looking at this, I grew really suspicious of this code in csd_unlock():

WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));

because that makes no sense at all. It basically removes a sanity
check, yet that sanity check makes a hell of a lot of sense. Unlocking
a CSD that is not locked is *wrong*.

The crazy code comes from commit c84a83e2aaab ("smp: don't warn
about csd->flags having CSD_FLAG_LOCK cleared for !wait") by Jens, but
the explanation and the code is pure crap.

There is no way in hell that it is ever correct to unlock an entry
that isn't locked, so that whole CSD_FLAG_WAIT thing is buggy as hell.

The explanation in commit c84a83e2aaab says that "blk-mq reuses the
request potentially immediately" and claims that that is somehow ok,
but that's utter BS. Even if you don't ever wait for it, the CSD lock
bit fundamentally also protects the "csd->llist" pointer. So what that
commit actually does is to just remove a safety check, and do so in a
very unsafe manner. And apparently blk-mq re-uses something THAT IS
STILL ACTIVELY IN USE. That's just horrible.

Now, I think we might do this differently, by doing the "csd_unlock()"
after we have loaded everything from the csd, but *before* actually
calling the callback function. That would seem to be equivalent
(interrupts are disabled, so this will not result in the func()
possibly called twice), more efficient, _and_ not remove a useful
check.

Hmm? Completely untested patch attached. Jens, does this still work for you?

Am I missing something?

Linus
patch.diff

Thomas Gleixner

Nov 17, 2014, 5:31:17 PM
to Linus Torvalds, Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
Yes. :)

> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -19,7 +19,6 @@
>
> enum {
> CSD_FLAG_LOCK = 0x01,
> - CSD_FLAG_WAIT = 0x02,
> };
>
> struct call_function_data {
> @@ -126,7 +125,7 @@ static void csd_lock(struct call_single_data *csd)
>
> static void csd_unlock(struct call_single_data *csd)
> {
> - WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> + WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
>
> /*
> * ensure we're all done before releasing data:
> @@ -173,9 +172,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
> csd->func = func;
> csd->info = info;
>
> - if (wait)
> - csd->flags |= CSD_FLAG_WAIT;
> -
> /*
> * The list addition should be visible before sending the IPI
> * handler locks the list to pull the entry off it because of
> @@ -250,8 +246,11 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
> }
>
> llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> - csd->func(csd->info);
> + smp_call_func_t func = csd->func;
> + void *info = csd->info;
> csd_unlock(csd);
> +
> + func(info);

No, that won't work for synchronous calls:

CPU 0 CPU 1

csd_lock(csd);
queue_csd();
ipi();
func = csd->func;
info = csd->info;
csd_unlock(csd);
csd_lock_wait();
func(info);

The csd_lock_wait() side will succeed and therefore assume that the
call has been completed while the function has not been called at
all. Interesting explosions to follow.

The proper solution is to revert that commit and properly analyze the
problem which Jens was trying to solve and work from there.

Thanks,

tglx

Thomas Gleixner

Nov 17, 2014, 5:43:28 PM
to Linus Torvalds, Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
So a combo of both (Jens and yours) might do the trick. Patch below.

I think what Jens was trying to solve is:

CPU 0 CPU 1

csd_lock(csd);
queue_csd();
ipi();
csd->func(csd->info);
wait_for_completion(csd);
complete(csd);
reuse_csd(csd);
csd_unlock(csd);

Thanks,

tglx

Index: linux/kernel/smp.c
===================================================================
--- linux.orig/kernel/smp.c
+++ linux/kernel/smp.c
@@ -126,7 +126,7 @@ static void csd_lock(struct call_single_

static void csd_unlock(struct call_single_data *csd)
{
- WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+ WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

/*
* ensure we're all done before releasing data:
@@ -250,8 +250,23 @@ static void flush_smp_call_function_queu
}

llist_for_each_entry_safe(csd, csd_next, entry, llist) {
- csd->func(csd->info);
- csd_unlock(csd);
+
+ /*
+ * For synchronous calls we are not allowed to unlock
+ * before the callback returned. For the async case
> + * it's the responsibility of the caller to keep
+ * csd->info consistent while the callback runs.
+ */
+ if (csd->flags & CSD_FLAG_WAIT) {
+ csd->func(csd->info);
+ csd_unlock(csd);
+ } else {
+ smp_call_func_t func = csd->func;
+ void *info = csd->info;
+
+ csd_unlock(csd);
+ func(info);
+ }
}

/*

Jens Axboe

Nov 17, 2014, 5:58:47 PM
to Thomas Gleixner, Linus Torvalds, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
Maybe... The above looks ok to me from a functional point of view, but
now I can't convince myself that the blk-mq use case is correct.

I'll try and back out the original patch and reproduce the issue; that
should jog my memory and give me full understanding of what the issue I
faced back then was.

--
Jens Axboe

Jens Axboe

Nov 17, 2014, 6:05:07 PM
to Linus Torvalds, Thomas Gleixner, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
On 11/17/2014 02:22 PM, Linus Torvalds wrote:
> On Fri, Nov 14, 2014 at 5:59 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
>>
>> Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
>> end.
>
> Btw, looking at this, I grew really suspicious of this code in csd_unlock():
>
> WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
>
> because that makes no sense at all. It basically removes a sanity
> check, yet that sanity check makes a hell of a lot of sense. Unlocking
> a CSD that is not locked is *wrong*.
>
> The crazy code comes from commit c84a83e2aaab ("smp: don't warn
> about csd->flags having CSD_FLAG_LOCK cleared for !wait") by Jens, but
> the explanation and the code is pure crap.
>
> There is no way in hell that it is ever correct to unlock an entry
> that isn't locked, so that whole CSD_FLAG_WAIT thing is buggy as hell.
>
> The explanation in commit c84a83e2aaab says that "blk-mq reuses the
> request potentially immediately" and claims that that is somehow ok,
> but that's utter BS. Even if you don't ever wait for it, the CSD lock
> bit fundamentally also protects the "csd->llist" pointer. So what that
> commit actually does is to just remove a safety check, and do so in a
> very unsafe manner. And apparently blk-mq re-uses something THAT IS
> STILL ACTIVELY IN USE. That's just horrible.

I agree that this description is probably utter crap. And now I do
actually remember the issue at hand. The resource here is the tag, that
decides what request we'll use, and subsequently what call_single_data
storage is used. When this was originally done, blk-mq cleared the
request from the function callback, instead of doing it at allocation
time. The assumption here was cache hotness. That in turn also cleared
->csd, which meant that the flags got zeroed and csd_unlock() was
naturally unhappy. THAT was the reuse case, not that the request would
get reused before we had finished the IPI fn callback since that would
obviously create other badness. Now I'm not sure what made me create
that patch, which in retrospect is a bad hammer for this problem.

blk-mq doesn't do the init-at-finish time anymore, so it should not be
hit by the issue. But if we do bring that back, then it would still work
fine with Thomas' patch, since we unlock prior to running the callback.

--
Jens Axboe

Thomas Gleixner

Nov 17, 2014, 6:17:31 PM
to Jens Axboe, Linus Torvalds, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
So that's exactly what I described in my other reply.

csd_lock(csd);
queue_csd();
ipi();
csd->func(csd->info);
wait_for_completion(csd);
complete(csd);
reuse_csd(csd);
csd_unlock(csd);

When you call complete() nothing can rely on csd anymore, except for
the smp core code ....

> THAT was the reuse case, not that the request would get reused
> before we had finished the IPI fn callback since that would
> obviously create other badness. Now I'm not sure what made me create
> that patch, which in retrospect is a bad hammer for this problem.

Performance blindness?

> blk-mq doesn't do the init-at-finish time anymore, so it should not be
> hit by the issue. But if we do bring that back, then it would still work
> fine with Thomas' patch, since we unlock prior to running the callback.

So if blk-mq is not relying on that, then we really should back out
that stuff for 3.18 and tag it for stable.

Treating sync and async function calls differently makes sense,
because any async caller which cannot deal with the unlock-before-call
scheme is broken by definition already today. But that's material for
next.

Thanks,

tglx

Linus Torvalds

Nov 17, 2014, 6:59:16 PM
to Thomas Gleixner, Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
On Mon, Nov 17, 2014 at 2:43 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
>>
>> No, that won't work for synchronous calls:

Right you are.

> So a combo of both (Jens and yours) might do the trick. Patch below.

Yeah, I guess that would work. The important part is that *if*
somebody really reuses the csd, we'd better have a release barrier
(which csd_unlock() does, although badly - but this probably isn't
that performance-critical) *before* we call the function, because
otherwise there's no real serialization for the reuse.

Of course, most of these things are presumably always per-cpu data
structures, so the whole worry about "csd" being accessed from
different CPU's probably doesn't even exist, and this all works fine
as-is anyway, even in the presence of odd memory ordering issues.

Judging from Jens' later email, it looks like we simply don't need
this code at all any more, though, and we could just revert the
commit.

NOTE! I don't think this actually has anything to do with the actual
problem that Dave saw. I just reacted to that WARN_ON() when I was
looking at the code, and it made me go "that looks extremely
suspicious".

Particularly on x86, with strong memory ordering, I don't think that
any random accesses to 'csd' after the call to 'csd->func()' could
actually matter. I just felt very nervous about the claim that
somebody can reuse the csd immediately, that smelled bad to me from a
*conceptual* standpoint, even if I suspect it works perfectly fine in
practice.

Anyway, I've found *another* race condition, which (again) doesn't
actually seem to be an issue on x86.

In particular, "csd_lock()" does things pretty well, in that it does a
smp_mb() after setting the lock bit, so certainly nothing afterwards
will leak out of that locked region.

But look at csd_lock_wait(). It just does

while (csd->flags & CSD_FLAG_LOCK)
cpu_relax();

and basically there's no memory barriers there. Now, on x86, this is a
non-issue, since all reads act as an acquire, but at least in *theory*
we have this completely unordered read going on. So any subsequent
memory operations (ie after the return from generic_exec_single())
could in theory see data from *before* the read.

So that whole kernel/smp.c locking looks rather dubious. The smp_mb()
in csd_lock() is overkill (a "smp_store_release()" should be
sufficient), and I think that the read of csd->flags in csd_lock_wait()
should be a smp_load_acquire().

Again, none of this has anything to do with Dave's problem. The memory
ordering issues really cannot be an issue on x86, I'm just saying that
there's code there that makes me a bit uncomfortable.

Linus

Thomas Gleixner

Nov 17, 2014, 7:15:30 PM
to Linus Torvalds, Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
On Mon, 17 Nov 2014, Linus Torvalds wrote:
> On Mon, Nov 17, 2014 at 2:43 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> > So a combo of both (Jens and yours) might do the trick. Patch below.
>
> Yeah, I guess that would work. The important part is that *if*
> somebody really reuses the csd, we'd better have a release barrier
> (which csd_unlock() does, although badly - but this probably isn't
> that performance-critical) *before* we call the function, because
> otherwise there's no real serialization for the reuse.

Indeed.

> Of course, most of these things are presumably always per-cpu data
> structures, so the whole worry about "csd" being accessed from
> different CPU's probably doesn't even exist, and this all works fine
> as-is anyway, even in the presence of odd memory ordering issues.
>
> Judging from Jens' later email, it looks like we simply don't need
> this code at all any more, though, and we could just revert the
> commit.

Right. Reverting it is the proper solution for now. Though we should
really think about the async separation later. It makes a lot of
sense.

> NOTE! I don't think this actually has anything to do with the actual
> problem that Dave saw. I just reacted to that WARN_ON() when I was
> looking at the code, and it made me go "that looks extremely
> suspicious".

One thing I was looking into today is the increased use of irq_work
which uses IPIs as well. Not sure whether that's related, but it's
not off my radar yet.

But the possible compiler wreckage (or exposed kernel wreckage) is
frightening in several aspects ...

> Particularly on x86, with strong memory ordering, I don't think that
> any random accesses to 'csd' after the call to 'csd->func()' could
> actually matter. I just felt very nervous about the claim that
> somebody can reuse the csd immediately, that smelled bad to me from a
> *conceptual* standpoint, even if I suspect it works perfectly fine in
> practice.
>
> Anyway, I've found *another* race condition, which (again) doesn't
> actually seem to be an issue on x86.
>
> In particular, "csd_lock()" does things pretty well, in that it does a
> smp_mb() after setting the lock bit, so certainly nothing afterwards
> will leak out of that locked region.
>
> But look at csd_lock_wait(). It just does
>
> while (csd->flags & CSD_FLAG_LOCK)
> cpu_relax();
>
> and basically there's no memory barriers there. Now, on x86, this is a
> non-issue, since all reads act as an acquire, but at least in *theory*
> we have this completely unordered read going on. So any subsequent
> memory oeprations (ie after the return from generic_exec_single()
> could in theory see data from *before* the read.

True.

> So that whole kernel/smp.c locking looks rather dubious. The smp_mb()
> in csd_lock() is overkill (a "smp_store_release()" should be
> sufficient), and I think that the read of csd->flags in csd_lock_wait()
> should be a smp_load_acquire().
>
> Again, none of this has anything to do with Dave's problem. The memory
> ordering issues really cannot be an issue on x86, I'm just saying that
> there's code there that makes me a bit uncomfortable.

Right you are and we should fix it asap.

Thanks,

tglx

Dave Jones

Nov 17, 2014, 9:10:26 PM
to Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Mon, Nov 17, 2014 at 11:59:34AM -0800, Linus Torvalds wrote:
> On Mon, Nov 17, 2014 at 9:03 AM, Dave Jones <da...@redhat.com> wrote:
> >
> > It could be a completely different cause for lockup, but seeing this now
> > has me wondering if perhaps it's something unrelated to the kernel.
> > I have recollection of running late .17rc's for days without incident,
> > and I'm pretty sure .17 was ok too. But a few weeks ago I did upgrade
> > that test box to the Fedora 21 beta. Which means I have a new gcc.
> > I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
> > get 4.8 back on there and see if that's any better.
>
> I'm not sure if I should be relieved or horrified.
>
> Horrified, I think.
>
> It really would be a wonderful thing to have some kind of "compiler
> bisection" with mixed object files to see exactly which file it
> miscompiles (and by "miscompiles" it might just be a kernel bug where
> we are missing a barrier or something, and older gcc's just happened
> to not show it - so it could still easily be a kernel problem).

After wasting countless hours rolling back to Fedora 20 and gcc 4.8.1,
I saw the exact same trace on 3.17, so now I don't know what to think.

So it's great that it's not a regression vs .17, but otoh, who knows
how far back this goes. This looks like a nightmarish bisect case, and
I've no idea why it's now happening so often.

I'll give Don's softlockup_all_cpu_backtrace=1 idea a try on 3.18rc5
and see if that shines any more light on this.

Deeply puzzling.

Dave

Linus Torvalds

Nov 17, 2014, 9:21:18 PM
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Mon, Nov 17, 2014 at 6:09 PM, Dave Jones <da...@redhat.com> wrote:
>
> After wasting countless hours rolling back to Fedora 20 and gcc 4.8.1,
> I saw the exact same trace on 3.17, so now I don't know what to think.

Uhhuh.

Has anything else changed? New trinity tests? If it has happened in as
little as ten minutes, and you don't recall having seen this until
about a week ago, it does sound like something changed.

But yeah, try the softlockup_all_cpu_backtrace, maybe there's a
pattern somewhere..

Linus

Jens Axboe

Nov 17, 2014, 9:23:49 PM
to Thomas Gleixner, Linus Torvalds, Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers
Right, and I didn't. It was the core use of csd->flags afterwards that
complained. blk-mq merely cleared ->flags in csd->func(), which
(granted) was a bit weird. So it was just storing to csd (before
unlock), but in an inappropriate way. It would obviously have broken a
sync invocation, but the block layer never does that.

>> THAT was the reuse case, not that the request would get reused
>> before we had finished the IPI fn callback since that would
>> obviously create other badness. Now I'm not sure what made me create
>> that patch, which in retrospect is a bad hammer for this problem.
>
> Performance blindness?

Possibly...

>> blk-mq doesn't do the init-at-finish time anymore, so it should not be
>> hit by the issue. But if we do bring that back, then it would still work
>> fine with Thomas' patch, since we unlock prior to running the callback.
>
> So if blk-mq is not relying on that, then we really should back out
> that stuff for 3.18 and tag it for stable.

Yeah, I'd be fine with doing that. I don't recall off the top of my head
when we stopped doing the clear at free time, but it was relatively
early. OK, so checked, and 3.15 does init at free time, and 3.16 and
later does it at allocation time. So the revert can only safely be
applied to 3.16 and later...

> Treating sync and async function calls differently makes sense,
> because any async caller which cannot deal with the unlock-before-call
> scheme is broken by definition already today. But that's material for
> next.

Agree.

--
Jens Axboe

Dave Jones

Nov 17, 2014, 9:39:47 PM
to Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Mon, Nov 17, 2014 at 06:21:08PM -0800, Linus Torvalds wrote:
> On Mon, Nov 17, 2014 at 6:09 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > After wasting countless hours rolling back to Fedora 20 and gcc 4.8.1,
> > I saw the exact same trace on 3.17, so now I don't know what to think.
>
> Uhhuh.
>
> Has anything else changed? New trinity tests? If it has happened in as
> little as ten minutes, and you don't recall having seen this until
> about a week ago, it does sound like something changed.

Looking at the trinity commits over the last month or so, there's a few
new things, but nothing that sounds like it would trip up a bug like
this. "generate random ascii strings" and "mess with fcntl's after
opening fd's on startup" being the stand-outs. Everything else is pretty
much cleanups and code-motion. There was a lot of work on the code
that tracks mmaps about a month ago, but that shouldn't have had any
visible runtime differences.

<runs git diff>

hm, something I changed not that long ago, which I didn't commit yet,
was that it now runs more child processes than it used to (was 64, now 256).
I've been running like that for a while though. I want to say that was
before .17, but I'm not 100% sure.

So it could be that I'm just generating a lot more load now.
I could drop that back down and see if it 'goes away' or at least
happens less, but it strikes me that there's something here that needs
fixing regardless.

Dave

Linus Torvalds

Nov 17, 2014, 9:51:34 PM
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Mon, Nov 17, 2014 at 6:39 PM, Dave Jones <da...@redhat.com> wrote:
>
> So it could be that I'm just generating a lot more load now.
> I could drop that back down and see if it 'goes away' or at least
> happens less, but it strikes me that there's something here that needs
> fixing regardless.

Oh, absolutely. It's more a question of "maybe what changed can give us a clue".

But if it's something like "more load", that's not going to help
pinpoint, and you might be better off just doing the all-cpu-backtrace
thing and hope that gives some pattern to appreciate..

Linus

Dave Jones

Nov 18, 2014, 9:52:53 AM
to Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Mon, Nov 17, 2014 at 06:51:25PM -0800, Linus Torvalds wrote:
> On Mon, Nov 17, 2014 at 6:39 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > So it could be that I'm just generating a lot more load now.
> > I could drop that back down and see if it 'goes away' or at least
> > happens less, but it strikes me that there's something here that needs
> > fixing regardless.
>
> Oh, absolutely. It's more a question of "maybe what changed can give us a clue".
>
> But if it's something like "more load", that's not going to help
> pinpoint, and you might be better off just doing the all-cpu-backtrace
> thing and hope that gives some pattern to appreciate..

Here's the first hit. Curiously, one cpu is missing.


NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
Modules linked in: dlci snd_seq_dummy fuse tun rfcomm bnep hidp scsi_transport_iscsi af_key llc2 can_raw nfnetlink can_bcm sctp libcrc32c nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_timer ptp shpchp snd pps_core soundcore nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc
irq event stamp: 2258092
hardirqs last enabled at (2258091): [<ffffffffa91a58b5>] get_page_from_freelist+0x555/0xaa0
hardirqs last disabled at (2258092): [<ffffffffa985396a>] apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (2244380): [<ffffffffa907b87f>] __do_softirq+0x24f/0x6f0
softirqs last disabled at (2244377): [<ffffffffa907c0dd>] irq_exit+0x13d/0x160
CPU: 1 PID: 17837 Comm: trinity-c180 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
task: ffff8801575e4680 ti: ffff880202434000 task.ti: ffff880202434000
RIP: 0010:[<ffffffffa91a0db0>] [<ffffffffa91a0db0>] bad_range+0x0/0x90
RSP: 0018:ffff8802024377a0 EFLAGS: 00000246
RAX: ffff8801575e4680 RBX: 0000000000000007 RCX: 0000000000000006
RDX: 0000000000002a20 RSI: ffffea0000887fc0 RDI: ffff88024d64c740
RBP: ffff880202437898 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000020 R14: 00000000001d8608 R15: 00000000001d8668
FS: 00007fd3b8960740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3b5ea0777 CR3: 00000001027cd000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffffffffa91a58c4 00000000000009e4 ffff8801575e4680 0000000000000001
ffff88024d64dd08 0000010000000000 0000000000000000 ffff8802024377f8
0000000000000000 ffff88024d64dd00 ffffffffa90ac411 ffffffff00000003
Call Trace:
[<ffffffffa91a58c4>] ? get_page_from_freelist+0x564/0xaa0
[<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
[<ffffffffa91a6030>] __alloc_pages_nodemask+0x230/0xd20
[<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
[<ffffffffa90d1e45>] ? mark_held_locks+0x75/0xa0
[<ffffffffa91f400e>] alloc_pages_vma+0xee/0x1b0
[<ffffffffa91b643e>] ? shmem_alloc_page+0x6e/0xc0
[<ffffffffa91b643e>] shmem_alloc_page+0x6e/0xc0
[<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
[<ffffffffa90ac58b>] ? preempt_count_sub+0x7b/0x100
[<ffffffffa93dcc46>] ? __percpu_counter_add+0x86/0xb0
[<ffffffffa91d50d6>] ? __vm_enough_memory+0x66/0x1c0
[<ffffffffa919ad65>] ? find_get_entry+0x5/0x230
[<ffffffffa933b10c>] ? cap_vm_enough_memory+0x4c/0x60
[<ffffffffa91b8ff0>] shmem_getpage_gfp+0x630/0xa40
[<ffffffffa90cee01>] ? match_held_lock+0x111/0x160
[<ffffffffa91b9442>] shmem_write_begin+0x42/0x70
[<ffffffffa919a684>] generic_perform_write+0xd4/0x1f0
[<ffffffffa919d5d2>] __generic_file_write_iter+0x162/0x350
[<ffffffffa92154a0>] ? new_sync_read+0xd0/0xd0
[<ffffffffa919d7ff>] generic_file_write_iter+0x3f/0xb0
[<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
[<ffffffffa92155e8>] do_iter_readv_writev+0x78/0xc0
[<ffffffffa9216e18>] do_readv_writev+0xd8/0x2a0
[<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
[<ffffffffa90cf426>] ? lock_release_holdtime.part.28+0xe6/0x160
[<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
[<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
[<ffffffffa90ac58b>] ? preempt_count_sub+0x7b/0x100
[<ffffffffa90e782e>] ? rcu_read_lock_held+0x6e/0x80
[<ffffffffa921706c>] vfs_writev+0x3c/0x50
[<ffffffffa92171dc>] SyS_writev+0x5c/0x100
[<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9
Code: 09 48 83 f2 01 83 e2 01 eb a3 90 48 c7 c7 a0 8c e4 a9 e8 44 e1 f2 ff 85 c0 75 d2 eb c1 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 48 b8 00 00 00 00 00 16 00 00 55 4c 8b 47 68 48
sending NMI to other CPUs:
NMI backtrace for cpu 2
CPU: 2 PID: 15913 Comm: trinity-c141 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
task: ffff880223229780 ti: ffff8801afca0000 task.ti: ffff8801afca0000
RIP: 0010:[<ffffffffa9116dbe>] [<ffffffffa9116dbe>] generic_exec_single+0xee/0x1a0
RSP: 0018:ffff8801afca3928 EFLAGS: 00000202
RAX: ffff8802443d9d00 RBX: ffff8801afca3930 RCX: ffff8802443d9dc0
RDX: ffff8802443d4d80 RSI: ffff8801afca3930 RDI: ffff8801afca3930
RBP: ffff8801afca3988 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000001 R14: ffff8801afca3a48 R15: ffffffffa9045bb0
FS: 00007fd3b8960740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000022f8bd000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffff8801afca3a08 ffff8802443d9dc0 ffffffffa9045bb0 ffff8801afca3a48
0000000000000003 000000007b19adc3 0000000000000001 00000000ffffffff
0000000000000001 ffffffffa9045bb0 ffff8801afca3a48 0000000000000001
Call Trace:
[<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
[<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
[<ffffffffa9116f3a>] smp_call_function_single+0x6a/0xe0
[<ffffffffa93b2e1f>] ? cpumask_next_and+0x4f/0xb0
[<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
[<ffffffffa9117679>] smp_call_function_many+0x2b9/0x320
[<ffffffffa9046370>] flush_tlb_mm_range+0xe0/0x370
[<ffffffffa91cc762>] tlb_flush_mmu_tlbonly+0x42/0x50
[<ffffffffa91cdd28>] unmap_single_vma+0x6b8/0x900
[<ffffffffa91ce06c>] zap_page_range_single+0xfc/0x160
[<ffffffffa91ce254>] unmap_mapping_range+0x134/0x190
[<ffffffffa91bb9dd>] shmem_fallocate+0x4fd/0x520
[<ffffffffa90c7c77>] ? prepare_to_wait+0x27/0x90
[<ffffffffa9213bc2>] do_fallocate+0x132/0x1d0
[<ffffffffa91e3228>] SyS_madvise+0x398/0x870
[<ffffffffa983f6c0>] ? rcu_read_lock_sched_held+0x4e/0x6a
[<ffffffffa9013877>] ? syscall_trace_enter_phase2+0xa7/0x2b0
[<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9
Code: 48 89 de 48 03 14 c5 60 74 f1 a9 48 89 df e8 0a fa 2a 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 0f 85 8e 00
NMI backtrace for cpu 0
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 37.091 msecs
CPU: 0 PID: 15851 Comm: trinity-c80 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 36/402 20526]
task: ffff8801874e8000 ti: ffff88022baec000 task.ti: ffff88022baec000
RIP: 0010:[<ffffffffa90ac450>] [<ffffffffa90ac450>] preempt_count_add+0x0/0xc0
RSP: 0000:ffff880244003c30 EFLAGS: 00000092
RAX: 0000000000000001 RBX: ffffffffa9edb560 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000001
RBP: ffff880244003c48 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: ffff8801874e88c8 [23543.271956] NMI backtrace for cpu 3
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 100.612 msecs
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 37/402 20526]
task: ffff880242b5c680 ti: ffff880242b78000 task.ti: ffff880242b78000
RIP: 0010:[<ffffffffa94251b5>] [<ffffffffa94251b5>] intel_idle+0xd5/0x180
RSP: 0018:ffff880242b7bdf8 EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880242b7bfd8 RDI: 0000000000000003
RBP: ffff880242b7be28 R08: 000000008baf8f3d R09: 0000000000000000
R10: 0000000000000000 R11: ffff880242b5cea0 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b78000
FS: 0000000000000000(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000b1c9ac CR3: 0000000029e11000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
0000000342b7be28 afc453cb003f4590 ffffe8ffff402200 0000000000000005
ffffffffa9eaa0c0 0000000000000003 ffff880242b7be78 ffffffffa96bbb45
0000156cc07cf6e3 ffffffffa9eaa290 0000000000000096 ffffffffa9f197b0
Call Trace:
[<ffffffffa96bbb45>] cpuidle_enter_state+0x55/0x300
[<ffffffffa96bbea7>] cpuidle_enter+0x17/0x20
[<ffffffffa90c88f5>] cpu_startup_entry+0x4e5/0x630
[<ffffffffa902d523>] start_secondary+0x1a3/0x220
Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 125.739 msecs

Linus Torvalds

Nov 18, 2014, 12:20:45 PM11/18/14
to Dave Jones, Linux Kernel, the arch/x86 maintainers, Don Zickus
On Tue, Nov 18, 2014 at 6:52 AM, Dave Jones <da...@redhat.com> wrote:
>
> Here's the first hit. Curiously, one cpu is missing.

That might be the CPU3 that isn't responding to IPIs due to some bug..

> NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
> RIP: 0010:[<ffffffffa91a0db0>] [<ffffffffa91a0db0>] bad_range+0x0/0x90

Hmm. Something looping in the page allocator? Not waiting for a lock,
but livelocked? I'm not seeing anything here that should trigger the
NMI watchdog at all.

Can the NMI watchdog get confused somehow?

> Call Trace:
> [<ffffffffa91a6030>] __alloc_pages_nodemask+0x230/0xd20
> [<ffffffffa91f400e>] alloc_pages_vma+0xee/0x1b0
> [<ffffffffa91b643e>] shmem_alloc_page+0x6e/0xc0
> [<ffffffffa91b8ff0>] shmem_getpage_gfp+0x630/0xa40
> [<ffffffffa91b9442>] shmem_write_begin+0x42/0x70
> [<ffffffffa919a684>] generic_perform_write+0xd4/0x1f0
> [<ffffffffa919d5d2>] __generic_file_write_iter+0x162/0x350
> [<ffffffffa919d7ff>] generic_file_write_iter+0x3f/0xb0
> [<ffffffffa92155e8>] do_iter_readv_writev+0x78/0xc0
> [<ffffffffa9216e18>] do_readv_writev+0xd8/0x2a0
> [<ffffffffa90cf426>] ? lock_release_holdtime.part.28+0xe6/0x160
> [<ffffffffa921706c>] vfs_writev+0x3c/0x50

And CPU2 is in that TLB flusher again:

> NMI backtrace for cpu 2
> RIP: 0010:[<ffffffffa9116dbe>] [<ffffffffa9116dbe>] generic_exec_single+0xee/0x1a0
> Call Trace:
> [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
> [<ffffffffa9116f3a>] smp_call_function_single+0x6a/0xe0
> [<ffffffffa9117679>] smp_call_function_many+0x2b9/0x320
> [<ffffffffa9046370>] flush_tlb_mm_range+0xe0/0x370
> [<ffffffffa91cc762>] tlb_flush_mmu_tlbonly+0x42/0x50
> [<ffffffffa91cdd28>] unmap_single_vma+0x6b8/0x900
> [<ffffffffa91ce06c>] zap_page_range_single+0xfc/0x160
> [<ffffffffa91ce254>] unmap_mapping_range+0x134/0x190

... and the code line implies that it's in that csd_lock_wait() loop,
again consistent with waiting for some other CPU. Presumably the
missing CPU3.

> NMI backtrace for cpu 0
> RIP: 0010:[<ffffffffa90ac450>] [<ffffffffa90ac450>] preempt_count_add+0x0/0xc0
> Call Trace:
> [<ffffffffa96bbb45>] cpuidle_enter_state+0x55/0x300
> [<ffffffffa96bbea7>] cpuidle_enter+0x17/0x20
> [<ffffffffa90c88f5>] cpu_startup_entry+0x4e5/0x630
> [<ffffffffa902d523>] start_secondary+0x1a3/0x220

And CPU0 is just in the idle loop (that RIP is literally the
instruction after the "mwait" according to the code line).

> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 125.739 msecs

... and that's us giving up on CPU3.

So it does look like CPU3 is the problem, but sadly, CPU3 is
apparently not listening, and doesn't even react to the NMI, much less
a TLB flush IPI.

Not reacting to NMI could be:
(a) some APIC state issue
(b) we're already stuck in a loop in the previous NMI handler
(c) what?

Anybody?

Linus

Thomas Gleixner

Nov 18, 2014, 1:54:38 PM11/18/14
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Tue, 18 Nov 2014, Dave Jones wrote:
> Here's the first hit. Curiously, one cpu is missing.

I don't think so.

> NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]

> irq event stamp: 2258092
> hardirqs last enabled at (2258091): [<ffffffffa91a58b5>] get_page_from_freelist+0x555/0xaa0
> hardirqs last disabled at (2258092): [<ffffffffa985396a>] apic_timer_interrupt+0x6a/0x80

So that means we are in the timer interrupt and handling
watchdog_timer_fn.

> CPU: 1 PID: 17837 Comm: trinity-c180 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
> task: ffff8801575e4680 ti: ffff880202434000 task.ti: ffff880202434000
> RIP: 0010:[<ffffffffa91a0db0>] [<ffffffffa91a0db0>] bad_range+0x0/0x90

So the softlockup tells us, that the high priority watchdog thread was
not able to touch the watchdog timestamp. That means this task was
hogging the CPU for 20+ seconds. I have no idea how that happens in
that call chain.
But this gets pages for a write into shmem and the other one below
does a madvise on a shmem map. Coincidence?

> sending NMI to other CPUs:

So here we kick the other cpus

> NMI backtrace for cpu 2
> CPU: 2 PID: 15913 Comm: trinity-c141 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
> task: ffff880223229780 ti: ffff8801afca0000 task.ti: ffff8801afca0000
> RIP: 0010:[<ffffffffa9116dbe>] [<ffffffffa9116dbe>] generic_exec_single+0xee/0x1a0
> [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
> [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
> [<ffffffffa9116f3a>] smp_call_function_single+0x6a/0xe0
> [<ffffffffa93b2e1f>] ? cpumask_next_and+0x4f/0xb0
> [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
> [<ffffffffa9117679>] smp_call_function_many+0x2b9/0x320
> [<ffffffffa9046370>] flush_tlb_mm_range+0xe0/0x370
> [<ffffffffa91cc762>] tlb_flush_mmu_tlbonly+0x42/0x50
> [<ffffffffa91cdd28>] unmap_single_vma+0x6b8/0x900
> [<ffffffffa91ce06c>] zap_page_range_single+0xfc/0x160
> [<ffffffffa91ce254>] unmap_mapping_range+0x134/0x190
> [<ffffffffa91bb9dd>] shmem_fallocate+0x4fd/0x520
> [<ffffffffa90c7c77>] ? prepare_to_wait+0x27/0x90
> [<ffffffffa9213bc2>] do_fallocate+0x132/0x1d0
> [<ffffffffa91e3228>] SyS_madvise+0x398/0x870
> [<ffffffffa983f6c0>] ? rcu_read_lock_sched_held+0x4e/0x6a
> [<ffffffffa9013877>] ? syscall_trace_enter_phase2+0xa7/0x2b0
> [<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9

We've seen that before

> NMI backtrace for cpu 0
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 37.091 msecs

So it complains that the backtrace handler took 37 msec, which is
indeed long for just dumping a stack trace.

> CPU: 0 PID: 15851 Comm: trinity-c80 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 36/402 20526]
> task: ffff8801874e8000 ti: ffff88022baec000 task.ti: ffff88022baec000
> RIP: 0010:[<ffffffffa90ac450>] [<ffffffffa90ac450>] preempt_count_add+0x0/0xc0
> RSP: 0000:ffff880244003c30 EFLAGS: 00000092
> RAX: 0000000000000001 RBX: ffffffffa9edb560 RCX: 0000000000000001
> RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000001
> RBP: ffff880244003c48 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000000 R11: ffff8801874e88c8 [23543.271956] NMI backtrace for cpu 3

So here we mangle CPU3 in and lose the backtrace for cpu0, which might
be the real interesting one ....

> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 100.612 msecs

This one takes 100ms.

> CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 37/402 20526]
> task: ffff880242b5c680 ti: ffff880242b78000 task.ti: ffff880242b78000
> RIP: 0010:[<ffffffffa94251b5>] [<ffffffffa94251b5>] intel_idle+0xd5/0x180

So that one is simply idle.

> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 125.739 msecs
>

And we get another backtrace handler taking too long. Of course we
cannot tell which of the 3 complaints comes from which cpu, because
the printk lacks a cpuid.

Thanks,

tglx

Thomas Gleixner

Nov 18, 2014, 2:28:15 PM11/18/14
to Linus Torvalds, Dave Jones, Linux Kernel, the arch/x86 maintainers, Don Zickus
On Tue, 18 Nov 2014, Linus Torvalds wrote:
> On Tue, Nov 18, 2014 at 6:52 AM, Dave Jones <da...@redhat.com> wrote:
> >
> > Here's the first hit. Curiously, one cpu is missing.
>
> That might be the CPU3 that isn't responding to IPIs due to some bug..
>
> > NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
> > RIP: 0010:[<ffffffffa91a0db0>] [<ffffffffa91a0db0>] bad_range+0x0/0x90
>
> Hmm. Something looping in the page allocator? Not waiting for a lock,
> but livelocked? I'm not seeing anything here that should trigger the
> NMI watchdog at all.
>
> Can the NMI watchdog get confused somehow?

That's the soft lockup detector which runs from the timer interrupt
not from NMI.

> So it does look like CPU3 is the problem, but sadly, CPU3 is
> apparently not listening, and doesn't even react to the NMI, much less

As I said in the other mail. It gets the NMI and reacts on it. It's
just mangled into the CPU0 backtrace.

Thanks,

tglx

Don Zickus

Nov 18, 2014, 4:26:11 PM11/18/14
to Thomas Gleixner, Linus Torvalds, Dave Jones, Linux Kernel, the arch/x86 maintainers
On Tue, Nov 18, 2014 at 08:28:01PM +0100, Thomas Gleixner wrote:
> On Tue, 18 Nov 2014, Linus Torvalds wrote:
> > On Tue, Nov 18, 2014 at 6:52 AM, Dave Jones <da...@redhat.com> wrote:
> > >
> > > Here's the first hit. Curiously, one cpu is missing.
> >
> > That might be the CPU3 that isn't responding to IPIs due to some bug..
> >
> > > NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
> > > RIP: 0010:[<ffffffffa91a0db0>] [<ffffffffa91a0db0>] bad_range+0x0/0x90
> >
> > Hmm. Something looping in the page allocator? Not waiting for a lock,
> > but livelocked? I'm not seeing anything here that should trigger the
> > NMI watchdog at all.
> >
> > Can the NMI watchdog get confused somehow?
>
> That's the soft lockup detector which runs from the timer interrupt
> not from NMI.
>
> > So it does look like CPU3 is the problem, but sadly, CPU3 is
> > apparently not listening, and doesn't even react to the NMI, much less
>
> As I said in the other mail. It gets the NMI and reacts on it. It's
> just mangled into the CPU0 backtrace.

I was going to reply about both points too. :-) Though the mangling looks
odd because we have spin_locks serializing the output for each cpu.

Another thing I wanted to ask DaveJ, did you recently turn on
CONFIG_PREEMPT? That would explain why you are seeing the softlockups
now. If you disable CONFIG_PREEMPT, do the softlockups disappear?

Cheers,
Don

Dave Jones

Nov 18, 2014, 4:31:39 PM11/18/14
to Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Tue, Nov 18, 2014 at 04:25:53PM -0500, Don Zickus wrote:

> I was going to reply about both points too. :-) Though the mangling looks
> odd because we have spin_locks serializing the output for each cpu.
>
> Another thing I wanted to ask DaveJ, did you recently turn on
> CONFIG_PREEMPT? That would explain why you are seeing the softlockups
> now. If you disable CONFIG_PREEMPT does the softlockups disappear.

I've had it on on my test box forever. I'll add turning it off to the
list of things to try.

Dave

Don Zickus

Nov 18, 2014, 4:55:57 PM11/18/14
to Thomas Gleixner, Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
Dave,

Can you provide another dump? The hope is we get something not mangled?

The other option we have done in RHEL is panic the system and let kdump
capture the memory. Then we can analyze the vmcore for the stack trace
cpu0 stored in memory to get a rough idea where it might be if the cpu
isn't responding very well.

Cheers,
Don

Dave Jones

Nov 18, 2014, 5:03:17 PM11/18/14
to Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:

> > So here we mangle CPU3 in and lose the backtrace for cpu0, which might
> > be the real interesting one ....
>
> Can you provide another dump? The hope is we get something not mangled?

Working on it..

> The other option we have done in RHEL is panic the system and let kdump
> capture the memory. Then we can analyze the vmcore for the stack trace
> cpu0 stored in memory to get a rough idea where it might be if the cpu
> isn't responding very well.

I don't know if it's because of the debug options I typically run with,
or that I'm perpetually cursed, but I've never managed to get kdump to
do anything useful. (The last time I tried it was actively harmful in
that not only did it fail to dump anything, it wedged the machine so
it didn't reboot after panic).

Unless there's some magic step missing from the documentation at
http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
then I'm not optimistic it'll be useful.

Dave

Dave Jones

Nov 18, 2014, 9:19:27 PM11/18/14
to Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:

> Can you provide another dump? The hope is we get something not mangled?

Ok, here's another instance.
This time around, we got all 4 cpu traces.

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
task: ffff88023fda4680 ti: ffff880101ee0000 task.ti: ffff880101ee0000
RIP: 0010:[<ffffffff8a1798b4>] [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
RSP: 0018:ffff880101ee3f00 EFLAGS: 00000282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8802445d1258
RDX: 0000000000000001 RSI: ffffffff8aac2c64 RDI: ffffffff8aa94505
RBP: ffff880101ee3f10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880101ee3ec0
R13: ffffffff8a38c577 R14: ffff880101ee3e70 R15: ffffffff8aa9ee99
FS: 00007f706c089740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001bf67f000 CR4: 00000000001407e0
DR0: 00007f0b19510000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffff880101ee3f58 00007f706bcd0000 ffff880101ee3f40 ffffffff8a012fc5
0000000000000000 0000000000007c1b 00007f706bcd0000 00007f706bcd0068
0000000000000000 ffffffff8a7d8624 000000001008feff 0000000000000000
Call Trace:
[<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
[<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
Code: 75 4d 48 c7 c7 64 2c ac 8a e8 e9 2c 21 00 65 c7 04 25 54 f7 1c 00 01 00 00 00 41 f7 c4 00 02 00 00 74 1c e8 8f 44 fd ff 41 54 9d <5b> 41 5c 5d c3 0f 1f 80 00 00 00 00 f3 c3 66 0f 1f 44 00 00 41
sending NMI to other CPUs:
NMI backtrace for cpu 0
CPU: 0 PID: 27716 Comm: kworker/0:1 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
Workqueue: events nohz_kick_work_fn
task: ffff88017c358000 ti: ffff8801d7124000 task.ti: ffff8801d7124000
RIP: 0010:[<ffffffff8a0ffb52>] [<ffffffff8a0ffb52>] smp_call_function_many+0x1b2/0x320
RSP: 0018:ffff8801d7127cb8 EFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff8802441d4dc0 RCX: 0000000000000038
RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8802445d84e8
RBP: ffff8801d7127d08 R08: ffff880243c3aa80 R09: 0000000000000000
R10: ffff880243c3aa80 R11: 0000000000000000 R12: ffffffff8a0fa2c0
R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000002
FS: 0000000000000000(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f70688ff000 CR3: 000000000ac11000 CR4: 00000000001407f0
DR0: 00007f0b19510000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffff8801d7127d28 0000000000000246 00000000d7127ce8 00000000001d4d80
ffff880206b99780 ffff88017f6ccc30 ffff8802441d3640 ffff8802441d9e00
ffffffff8ac4e340 0000000000000000 ffff8801d7127d18 ffffffff8a0fa3f5
Call Trace:
[<ffffffff8a0fa3f5>] tick_nohz_full_kick_all+0x35/0x70
[<ffffffff8a0ec8fe>] nohz_kick_work_fn+0xe/0x10
[<ffffffff8a08e61d>] process_one_work+0x1fd/0x590
[<ffffffff8a08e597>] ? process_one_work+0x177/0x590
[<ffffffff8a0c112e>] ? put_lock_stats.isra.23+0xe/0x30
[<ffffffff8a08eacb>] worker_thread+0x11b/0x490
[<ffffffff8a08e9b0>] ? process_one_work+0x590/0x590
[<ffffffff8a0942e9>] kthread+0xf9/0x110
[<ffffffff8a0c112e>] ? put_lock_stats.isra.23+0xe/0x30
[<ffffffff8a0941f0>] ? kthread_create_on_node+0x250/0x250
[<ffffffff8a7d82ac>] ret_from_fork+0x7c/0xb0
[<ffffffff8a0941f0>] ? kthread_create_on_node+0x250/0x250
Code: a5 c1 00 49 89 c7 41 89 c6 7d 7e 49 63 c7 48 8b 3b 48 03 3c c5 e0 6b d1 8a 0f b7 57 18 f6 c2 01 74 12 0f 1f 80 00 00 00 00 f3 90 <0f> b7 57 18 f6 c2 01 75 f5 83 ca 01 66 89 57 18 0f ae f0 48 8b
NMI backtrace for cpu 1
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 34.478 msecs
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 10/411 32140]
task: ffff880242b5de00 ti: ffff880242b64000 task.ti: ffff880242b64000
RIP: 0010:[<ffffffff8a3e14a5>] [<ffffffff8a3e14a5>] intel_idle+0xd5/0x180
RSP: 0018:ffff880242b67df8 EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880242b67fd8 RDI: 0000000000000001
RBP: ffff880242b67e28 R08: 000000008baf93be R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b64000
FS: 0000000000000000(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd3a34d000 CR3: 000000000ac11000 CR4: 00000000001407e0
DR0: 00007f0b19510000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
0000000142b67e28 84d4d6ff01c7f4fd ffffe8ffff002200 0000000000000005
ffffffff8acaa080 0000000000000001 ffff880242b67e78 ffffffff8a666075
000011f1e9f6e882 ffffffff8acaa250 ffff880242b64000 ffffffff8ad18f30
Call Trace:
[<ffffffff8a666075>] cpuidle_enter_state+0x55/0x1c0
[<ffffffff8a666297>] cpuidle_enter+0x17/0x20
[<ffffffff8a0bb323>] cpu_startup_entry+0x433/0x4e0
[<ffffffff8a02b763>] start_secondary+0x1a3/0x220
Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48
NMI backtrace for cpu 3
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 62.461 msecs
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 10/411 32140]
task: ffff880242b5c680 ti: ffff880242b78000 task.ti: ffff880242b78000
RIP: 0010:[<ffffffff8a3e14a5>] [<ffffffff8a3e14a5>] intel_idle+0xd5/0x180
RSP: 0018:ffff880242b7bdf8 EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880242b7bfd8 RDI: ffffffff8ac11000
RBP: ffff880242b7be28 R08: 000000008baf93be R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b78000
FS: 0000000000000000(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000ac11000 CR4: 00000000001407e0
DR0: 00007f0b19510000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
0000000342b7be28 43db433bbd06c740 ffffe8ffff402200 0000000000000005
ffffffff8acaa080 0000000000000003 ffff880242b7be78 ffffffff8a666075
000011f1d69aaf37 ffffffff8acaa250 ffff880242b78000 ffffffff8ad18f30
Call Trace:
[<ffffffff8a666075>] cpuidle_enter_state+0x55/0x1c0
[<ffffffff8a666297>] cpuidle_enter+0x17/0x20
[<ffffffff8a0bb323>] cpu_startup_entry+0x433/0x4e0
[<ffffffff8a02b763>] start_secondary+0x1a3/0x220
Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 89.635 msecs

Linus Torvalds

Nov 18, 2014, 11:41:07 PM11/18/14
to Dave Jones, Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <da...@redhat.com> wrote:
>
> NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> RIP: 0010:[<ffffffff8a1798b4>] [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> Call Trace:
> [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
> [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d

Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.

Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?

That makes me wonder: does the problem go away if you disable NOHZ?

> CPU: 0 PID: 27716 Comm: kworker/0:1 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> Workqueue: events nohz_kick_work_fn
> RIP: 0010:[<ffffffff8a0ffb52>] [<ffffffff8a0ffb52>] smp_call_function_many+0x1b2/0x320
> Call Trace:
> [<ffffffff8a0fa3f5>] tick_nohz_full_kick_all+0x35/0x70
> [<ffffffff8a0ec8fe>] nohz_kick_work_fn+0xe/0x10
> [<ffffffff8a08e61d>] process_one_work+0x1fd/0x590
> [<ffffffff8a08eacb>] worker_thread+0x11b/0x490
> [<ffffffff8a0942e9>] kthread+0xf9/0x110
> [<ffffffff8a7d82ac>] ret_from_fork+0x7c/0xb0

Yeah, there's certainly some NOHZ work going on on CPU0 too.


> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 10/411 32140]
> RIP: 0010:[<ffffffff8a3e14a5>] [<ffffffff8a3e14a5>] intel_idle+0xd5/0x180
> Call Trace:
> [<ffffffff8a666075>] cpuidle_enter_state+0x55/0x1c0
> [<ffffffff8a666297>] cpuidle_enter+0x17/0x20
> [<ffffffff8a0bb323>] cpu_startup_entry+0x433/0x4e0
> [<ffffffff8a02b763>] start_secondary+0x1a3/0x220

Nothing.

> CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 10/411 32140]
> RIP: 0010:[<ffffffff8a3e14a5>] [<ffffffff8a3e14a5>] intel_idle+0xd5/0x180
> [<ffffffff8a666075>] cpuidle_enter_state+0x55/0x1c0
> [<ffffffff8a666297>] cpuidle_enter+0x17/0x20
> [<ffffffff8a0bb323>] cpu_startup_entry+0x433/0x4e0
> [<ffffffff8a02b763>] start_secondary+0x1a3/0x220

Nothing.

Hmm. NOHZ?

Linus

Dave Jones

Nov 18, 2014, 11:59:38 PM11/18/14
to Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers
On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> > RIP: 0010:[<ffffffff8a1798b4>] [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> > Call Trace:
> > [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
> > [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
>
> Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
>
> Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?

I do.

> That makes me wonder: does the problem go away if you disable NOHZ?

I'll give it a try, and see what falls out overnight.

Dave

Dave Jones

Nov 19, 2014, 12:15:47 AM11/19/14
to Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers
On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:

> Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
>
> Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
>
> That makes me wonder: does the problem go away if you disable NOHZ?

Does nohz=off do enough? I couldn't convince myself after looking at
dmesg, and still seeing dynticks stuff in there.

I'll do a rebuild with all the CONFIG_NO_HZ stuff off, though it also changes
some other config stuff wrt timers.
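For reference, the knobs in question look roughly like this (3.18-era names; which ones appear depends on the tree). nohz=off only disables the dynticks behaviour at boot, while the Kconfig change removes the code paths entirely:

```
nohz=off                    # boot parameter: keep the periodic tick

# .config (rebuild required):
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_HZ_PERIODIC=y        # old-style always-ticking timer mode
```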

Dave

Don Zickus

Nov 19, 2014, 9:41:29 AM11/19/14
to Dave Jones, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers, vgo...@redhat.com
On Tue, Nov 18, 2014 at 05:02:54PM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:
>
> > > So here we mangle CPU3 in and lose the backtrace for cpu0, which might
> > > be the real interesting one ....
> >
> > Can you provide another dump? The hope is we get something not mangled?
>
> Working on it..
>
> > The other option we have done in RHEL is panic the system and let kdump
> > capture the memory. Then we can analyze the vmcore for the stack trace
> > cpu0 stored in memory to get a rough idea where it might be if the cpu
> > isn't responding very well.
>
> I don't know if it's because of the debug options I typically run with,
> or that I'm perpetually cursed, but I've never managed to get kdump to
> do anything useful. (The last time I tried it was actively harmful in
> that not only did it fail to dump anything, it wedged the machine so
> it didn't reboot after panic).
>
> Unless there's some magic step missing from the documentation at
> http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> then I'm not optimistic it'll be useful.

Well, I don't know when the last time you ran it, but I know the RH kexec
folks have started pursuing a Fedora-first package patch rule a couple of
years ago to ensure Fedora had a working kexec/kdump solution.

As for the wedging part, it was a common problem to have the kernel hang
while trying to boot the second kernel (and before console output
happened). So the problem makes sense and is unfortunate. I would
encourage you to try again. :-)
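For the record, the basic Fedora recipe is short (the crashkernel size varies per machine, and as noted above it may still wedge on some setups):

```
# reserve memory for the capture kernel, then enable the service:
grubby --update-kernel=ALL --args="crashkernel=256M"
systemctl enable kdump.service
reboot

# after a panic, the vmcore lands under /var/crash/; inspect with:
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/*/vmcore
```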

Though, it is transitioning to have the app built into the kernel to deal
with the whole secure boot thing, so that might be another can of worms.

I cc'd Vivek and he can let us know how well it works with F21.

Cheers,
Don

Dave Jones

Nov 19, 2014, 9:59:57 AM11/19/14
to Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> > RIP: 0010:[<ffffffff8a1798b4>] [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> > Call Trace:
> > [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
> > [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
>
> Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
>
> Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
>
> That makes me wonder: does the problem go away if you disable NOHZ?

Apparently not.

NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
RIP: 0010:[<ffffffff94175be7>] [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
FS: 00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
Call Trace:
[<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
[<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff947d41bf>] tracesys+0x14/0x4a
Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80
sending NMI to other CPUs:
NMI backtrace for cpu 1
CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
RIP: 0010:[<ffffffff940fb71e>] [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
RSP: 0018:ffff8801a99a7d18 EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
FS: 00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
Call Trace:
[<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
[<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
[<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0
[<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
[<ffffffff941671aa>] perf_event_read+0xca/0xd0
[<ffffffff94167240>] perf_event_read_value+0x90/0xe0
[<ffffffff941689c6>] perf_read+0x226/0x370
[<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
[<ffffffff941eafff>] vfs_read+0x9f/0x180
[<ffffffff941ebbd8>] SyS_read+0x58/0xd0
[<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9
Code: 48 89 de 48 03 14 c5 20 65 d1 94 48 89 df e8 8a 4b 28 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 0f 85 8e 00
NMI backtrace for cpu 0
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 35.055 msecs
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 11/410 27945]
task: ffffffff94c164c0 ti: ffffffff94c00000 task.ti: ffffffff94c00000
RIP: 0010:[<ffffffff943dd415>] [<ffffffff943dd415>] intel_idle+0xd5/0x180
RSP: 0018:ffffffff94c03e28 EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff94c03fd8 RDI: 0000000000000000
RBP: ffffffff94c03e58 R08: 000000008baf8b86 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffffffff94c00000
FS: 0000000000000000(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f162e060000 CR3: 0000000014c11000 CR4: 00000000001407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
0000000094c03e58 5955c5b31ad5e8cf ffffe8fffee031a8 0000000000000005
ffffffff94ca9dc0 0000000000000000 ffffffff94c03ea8 ffffffff94661f05
00001cb7dcf6fd93 ffffffff94ca9f90 ffffffff94c00000 ffffffff94d18870
Call[31557.908912] NMI backtrace for cpu 2
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 68.178 msecs
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 10/410 27945]
task: ffff880242b596f0 ti: ffff880242b6c000 task.ti: ffff880242b6c000
RIP: 0010:[<ffffffff943dd415>] [<ffffffff943dd415>] intel_idle+0xd5/0x180
RSP: 0018:ffff880242b6fdf8 EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880242b6ffd8 RDI: 0000000000000002
RBP: ffff880242b6fe28 R08: 000000008baf8b86 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b6c000
FS: 0000000000000000(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000014c11000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
0000000242b6fe28 da97aa9b9f42090a ffffe8ffff2031a8 0000000000000005
ffffffff94ca9dc0 0000000000000002 ffff880242b6fe78 ffffffff94661f05
00001cb7dcdd1af6 ffffffff94ca9f90 ffff880242b6c000 ffffffff94d18870
Call Trace:
[<ffffffff94661f05>] cpuidle_enter_state+0x55/0x1c0
[<ffffffff94662127>] cpuidle_enter+0x17/0x20
[<ffffffff940b94a3>] cpu_startup_entry+0x423/0x4d0
[<ffffffff9402b763>] start_secondary+0x1a3/0x220
Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 95.994 msecs


<tangent>

Also, today I learned we can reach the perf_event_read code from
read(). Given I had /proc/sys/kernel/perf_event_paranoid set to 1,
I'm not sure how this is even possible. The only user of perf_fops
is perf_event_open syscall _after_ it's checked that sysctl.

Oh, there's an ioctl path to perf too. Though trinity
doesn't know anything about it, so I find it surprising if it
managed to pull the right combination of entropy to make that
do the right thing. Still, that ioctl path probably needs
to also be checking that sysctl, shouldn't it?

Dave

Vivek Goyal

Nov 19, 2014, 10:03:48 AM11/19/14
to Don Zickus, Dave Jones, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 09:41:05AM -0500, Don Zickus wrote:
> On Tue, Nov 18, 2014 at 05:02:54PM -0500, Dave Jones wrote:
> > On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:
> >
> > > > So here we mangle CPU3 in and lose the backtrace for cpu0, which might
> > > > be the real interesting one ....
> > >
> > > Can you provide another dump? The hope is we get something not mangled?
> >
> > Working on it..
> >
> > > The other option we have done in RHEL is panic the system and let kdump
> > > capture the memory. Then we can analyze the vmcore for the stack trace
> > > cpu0 stored in memory to get a rough idea where it might be if the cpu
> > > isn't responding very well.
> >
> > I don't know if it's because of the debug options I typically run with,
> > or that I'm perpetually cursed, but I've never managed to get kdump to
> > do anything useful. (The last time I tried it was actively harmful in
> > that not only did it fail to dump anything, it wedged the machine so
> > it didn't reboot after panic).

Hi Dave Jones,

Not being able to capture the dump I can understand, but having wedged
the machine so that it does not reboot after a dump failure sounds bad.
So you could not get the machine to boot even after a power cycle? Do
you remember what was failing? I am curious to know what kdump did
to make the machine unbootable.

> >
> > Unless there's some magic step missing from the documentation at
> > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> > then I'm not optimistic it'll be useful.

I had a quick look at it and it basically looks fine. In Fedora, ideally
it is just a two-step process:

- Reserve memory using crashkernel. Say crashkernel=160M
- systemctl start kdump
- Crash the system or wait for it to crash.

So despite your bad experience in the past, I would encourage you to
give it a try.

>
> Well, I don't know when the last time you ran it, but I know the RH kexec
> folks have started pursuing a Fedora-first package patch rule a couple of
> years ago to ensure Fedora had a working kexec/kdump solution.

Yep, now we are putting everything in Fedora first, so it should be much
better. Hard to say the same thing about driver authors. Sometimes they
might have a driver working in RHEL but not necessarily upstream. I am
not sure if you ran into one of those issues.

Also recently I have seen issues with graphics drivers too.

>
> As for the wedging part, it was a common problem to have the kernel hang
> while trying to boot the second kernel (and before console output
> happened). So the problem makes sense and is unfortunate. I would
> encourage you to try again. :-)
>
> Though, it is transitioning to have the app built into the kernel to deal
> with the whole secure boot thing, so that might be another can of worms.

I doubt that secureboot bits will contribute to the failure.

Thanks
Vivek

Dave Jones

Nov 19, 2014, 10:39:16 AM11/19/14
to Vivek Goyal, Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 10:03:33AM -0500, Vivek Goyal wrote:

> Not being able to capture the dump I can understand, but having wedged
> the machine so that it does not reboot after a dump failure sounds bad.
> So you could not get the machine to boot even after a power cycle? Do
> you remember what was failing? I am curious to know what kdump did
> to make the machine unbootable.

Power cycling was fine, because then it booted into the non-kdump kernel.
The issue was when I caused that kernel to panic, it would just sit there
wedged, with no indication it even tried to switch to the kdump kernel.

> > > Unless there's some magic step missing from the documentation at
> > > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> > > then I'm not optimistic it'll be useful.
>
> I had a quick look at it and it basically looks fine. In Fedora, ideally
> it is just a two-step process:
>
> - Reserve memory using crashkernel. Say crashkernel=160M
> - systemctl start kdump
> - Crash the system or wait for it to crash.
>
> So despite your bad experience in the past, I would encourage you to
> give it a try.

'the past' here, is two weeks ago, on Fedora 21.

But, since then, I've reinstalled that box with Fedora 20 because I didn't
trust gcc 4.9, and on f20 things are actually even worse.

Right now it doesn't even create the image correctly:

dracut: *** Stripping files done ***
dracut: *** Store current command line parameters ***
dracut: *** Creating image file ***
dracut: *** Creating image file done ***
kdumpctl: cat: write error: Broken pipe
kdumpctl: kexec: failed to load kdump kernel
kdumpctl: Starting kdump: [FAILED]

It works if I run a Fedora kernel, but not with a self-built one.
And there's zero information as to what I'm doing wrong.

I saw something similar on F21, got past it somehow a few weeks ago,
but I can't remember what I had to do. Unfortunately that was still
fruitless as it didn't actually dump anything, leading to my frustration
with the state of kdump.

I'll try again when I put F21 back on that machine, but I'm
not particularly optimistic tbh.

Dave

Vivek Goyal

Nov 19, 2014, 11:28:24 AM11/19/14
to Dave Jones, Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers, WANG Chao, Baoquan He, Dave Young
On Wed, Nov 19, 2014 at 10:38:52AM -0500, Dave Jones wrote:
> On Wed, Nov 19, 2014 at 10:03:33AM -0500, Vivek Goyal wrote:
>
> > Not being able to capture the dump I can understand, but having wedged
> > the machine so that it does not reboot after a dump failure sounds bad.
> > So you could not get the machine to boot even after a power cycle? Do
> > you remember what was failing? I am curious to know what kdump did
> > to make the machine unbootable.
>
> Power cycling was fine, because then it booted into the non-kdump kernel.
> The issue was when I caused that kernel to panic, it would just sit there
> wedged, with no indication it even tried to switch to the kdump kernel.

I have seen cases where we fail to boot into the second kernel, and often
the failure happens very early, without any information on the graphical
console. I always have to hook up a serial console to get an idea of what
went wrong that early. It is not an ideal situation, but at the same time
I don't know how to improve it.

I am wondering whether in some cases we panic in the second kernel and
just sit there. Probably we should automatically append a kernel command
line parameter, say "panic=1", so that the second kernel reboots itself
if it panics.

By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
disable that as currently kexec/kdump stuff does not work with it. And
it hangs very early in the boot process and I had to hook serial console
to get following message on console.

arch/x86/boot/compressed/misc.c
error("32-bit relocation outside of kernel!\n");

I noticed that error() halts in a while loop after the error message.
Maybe there could be some way for it to reboot instead of halting in a
while loop.

>
> > > > Unless there's some magic step missing from the documentation at
> > > > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> > > > then I'm not optimistic it'll be useful.
> >
> > I had a quick look at it and it basically looks fine. In Fedora, ideally
> > it is just a two-step process:
> >
> > - Reserve memory using crashkernel. Say crashkernel=160M
> > - systemctl start kdump
> > - Crash the system or wait for it to crash.
> >
> > So despite your bad experience in the past, I would encourage you to
> > give it a try.
>
> 'the past' here, is two weeks ago, on Fedora 21.
>
> But, since then, I've reinstalled that box with Fedora 20 because I didn't
> trust gcc 4.9, and on f20 things are actually even worse.
>
> Right now it doesn't even create the image correctly:
>
> dracut: *** Stripping files done ***
> dracut: *** Store current command line parameters ***
> dracut: *** Creating image file ***
> dracut: *** Creating image file done ***
> kdumpctl: cat: write error: Broken pipe
> kdumpctl: kexec: failed to load kdump kernel
> kdumpctl: Starting kdump: [FAILED]

Hmmm..., can you please enable debugging in kdumpctl using "set -x",
do "touch /etc/kdump.conf; kdumpctl restart", and send me the debug output.

>
> It works if I run a Fedora kernel, but not with a self-built one.
> And there's zero information as to what I'm doing wrong.

I just tested F20 kdump on my box and it worked fine for me.

So for you the second kernel hangs and there is no info on the console?
Is there any possibility of hooking up a serial console, enabling early
printk, and seeing if something shows up there?

Apart from this, if you run into kdump issues in Fedora, please cc the
Fedora kexec mailing list too so that we are aware of it.

https://lists.fedoraproject.org/mailman/listinfo/kexec

Thanks
Vivek

Linus Torvalds

Nov 19, 2014, 12:22:18 PM11/19/14
to Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Frédéric Weisbecker, Andy Lutomirski, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 6:59 AM, Dave Jones <da...@redhat.com> wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> >
> > That makes me wonder: does the problem go away if you disable NOHZ?
>
> Apparently not.
>
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> RIP: 0010:[<ffffffff94175be7>] [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
> Call Trace:
> [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
> [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff947d41bf>] tracesys+0x14/0x4a

Ok, that's just crazy. This is the system call *entry* portion.

Last time it was the system call exit side, which at least made some
sense, because the "return to user space" thing actually has loops in
it to handle all events before we return to user space.

But the whole 'tracesys' part is only entered once, at the very
beginning of a system call. There's no loop over the work. That whole
call trace implies that the lockup happened just after we entered the
system call path from _user_ space.

And in fact, exactly like last time, the code line implies that the
timer interrupt happened on the return from the instruction, and
indeed in both cases the code looked like this (the registers
differed, but the "restore flags, start popping saved regs" was the
exact same):

26: 53 push %rbx
27: 9d popfq
28:* 5b pop %rbx <-- trapping instruction
29: 41 5c pop %r12

in both cases, the timer interrupt happened right after the "popfq",
but in both cases the value in the register that was used to restore
eflags was invalid. Here %rbx was 0x0000000100000046 (which is a valid
eflags value, but not the one we've actually restored!), and in your
previous oops (where it was %r12) it was completely invalid.

So it hasn't actually done the "push %rbx; popfq" part - there must be
a label at the return part, and context_tracking_user_exit() never
actually did the local_irq_save/restore at all. Which means that it
took one of the early exits instead:

        if (!context_tracking_is_enabled())
                return;

        if (in_interrupt())
                return;

So not only does this happen at early system call entry time, the
function that is claimed to lock up doesn't actually *do* anything.

Ho humm..

Oh, and to make matters worse, the only way this call chain can happen
is this in syscall_trace_enter_phase1():

        if (work & _TIF_NOHZ) {
                user_exit();
                work &= ~TIF_NOHZ;
        }

so there's still some NOHZ confusion there. It looks like TIF_NOHZ
gets set regardless of whether NOHZ is enabled or not..

I'm adding Frederic explicitly to the cc too, because this is just
fishy. I am starting to blame context tracking, because it has now
shown up twice in different guises, and TIF_NOHZ seems to be
implicated.

> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> RIP: 0010:[<ffffffff940fb71e>] [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> Call Trace:
> [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0
> [<ffffffff941671aa>] perf_event_read+0xca/0xd0
> [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
> [<ffffffff941689c6>] perf_read+0x226/0x370
> [<ffffffff941eafff>] vfs_read+0x9f/0x180

Hmm.. We've certainly seen a lot of smp_call, for various different
reasons in your traces..

I'm wondering if the smp-call ended up corrupting something on CPU3.
Because even _with_ TIF_NOHZ confusion, I don't see how system call
*entry* could cause a watchdog event. There are no loops, there are no
locks I see, there is just *nothing* there I can see.

Let's add Andy L to the cc too, in case he hasn't seen this. He's
been changing the lowlevel asm code, including very much this whole
"syscall_trace_enter_phase1" thing. Maybe he sees something I don't.

Andy?

> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 11/410 27945]
> RIP: 0010:[<ffffffff943dd415>] [<ffffffff943dd415>] intel_idle+0xd5/0x180
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 10/410 27945]
> RIP: 0010:[<ffffffff943dd415>] [<ffffffff943dd415>] intel_idle+0xd5/0x180

Nothing there.

> Also, today I learned we can reach the perf_event_read code from
> read(). Given I had /proc/sys/kernel/perf_event_paranoid set to 1,
> I'm not sure how this is even possible. The only user of perf_fops
> is perf_event_open syscall _after_ it's checked that sysctl.
>
> Oh, there's an ioctl path to perf too. Though trinity
> doesn't know anything about it, so I find it surprising if it
> managed to pull the right combination of entropy to make that
> do the right thing. Still, that ioctl path probably needs
> to also be checking that sysctl, shouldn't it?

Hmm. Perf people are already mostly on the list. Peter/Ingo/Arnaldo?

Linus

Linus Torvalds

Nov 19, 2014, 12:40:35 PM11/19/14
to Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Frédéric Weisbecker, Andy Lutomirski, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 9:22 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> So it hasn't actually done the "push %rbx; popfq" part - there must be
> a label at the return part, and context_tracking_user_exit() never
> actually did the local_irq_save/restore at all. Which means that it
> took one of the early exits instead:
>
>         if (!context_tracking_is_enabled())
>                 return;
>
>         if (in_interrupt())
>                 return;

Ho humm. Interesting. Neither of those should possibly have happened.

We "know" that "context_tracking_is_enabled()" must be true, because
the only way we get to context_tracking_user_exit() in the first place
is through "user_exit()", which does:

        if (context_tracking_is_enabled())
                context_tracking_user_exit();

and we know we shouldn't be in_interrupt(), because the backtrace is
the system call entry path, for chrissake!

So we definitely have some corruption going on. A few possibilities:

- either the register contents are corrupted (%rbx in your dump said
"0x0000000100000046", but the eflags we restored was 0x246)

- in_interrupt() is wrong, and we've had some irq_count() corruption.
I'd expect that to result in "scheduling while atomic" messages,
though, especially if it goes on long enough that you get a watchdog
event..

- there is something rotten in the land of
context_tracking_is_enabled(), which uses a static key.

- I have misread the whole trace, and am a moron. But your earlier
report really had some very similar things, just in
context_tracking_user_enter() instead of exit.

In your previous oops, the registers that was allegedly used to
restore %eflags was %r12:

28: 41 54 push %r12
2a: 9d popfq
2b:* 5b pop %rbx <-- trapping instruction
2c: 41 5c pop %r12
2e: 5d pop %rbp
2f: c3 retq

but:

R12: ffff880101ee3ec0
EFLAGS: 00000282

so again, it looks like we never actually did that "popfq"
instruction, and it would have exited through the (same) early exits.

But what an odd coincidence that it ended up in both of your reports
being *exactly* at that instruction after the "popf". If it had
actually *taken* the popf, I'd not be so surprised ("ok, popf enabled
interrupts, and there was an interrupt pending"), but since everything
seems to say that it came there through some control flow that did
*not* go through the popf, that's just a very odd coincidence.

And both context_tracking_user_enter() and exit() have that exact same
issue with the early returns. They shouldn't have happened in the
first place.

Frederic Weisbecker

Nov 19, 2014, 2:02:30 PM11/19/14
to Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo
I got a report lately involving context tracking. Not sure if it's
the same here, but the issue was that context tracking uses per-cpu data,
per-cpu allocation uses vmalloc, and vmalloc'ed areas can fault due to
lazy paging.

With that in mind, just about anything can happen. Some parts of the context
tracking code really aren't fault-safe (or more generally exception-safe).
That's because context tracking itself tracks exceptions.

So for example if we enter a syscall, we go to context_tracking_user_exit() then
vtime_user_enter() which _takes a lock_ with write_seqlock().

If an exception occurs before we unlock the seqlock (it's possible, for
example: account_user_time() -> task_group_account_field() ->
cpuacct_account_field() accesses a dynamically allocated per-cpu area,
which can fault), then the fault calls exception_enter() and then
user_exit(), which does all of the same again and deadlocks.

I can certainly fix that with a bit of recursion protection.

Now we just need to determine if the current case has the same cause.

Andy Lutomirski

Nov 19, 2014, 2:04:19 PM11/19/14
to Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
Wait, what? If something like kernel_stack ends with an unmapped pmd,
we are well and truly screwed.

--Andy

Andy Lutomirski

Nov 19, 2014, 2:15:42 PM11/19/14
to Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 9:22 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 6:59 AM, Dave Jones <da...@redhat.com> wrote:
>> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
>> >
>> > That makes me wonder: does the problem go away if you disable NOHZ?
>>
>> Apparently not.
>>
>> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
>> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
>> RIP: 0010:[<ffffffff94175be7>] [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
>> Call Trace:
>> [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
>> [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> [<ffffffff947d41bf>] tracesys+0x14/0x4a
>
> Ok, that's just crazy. This is the system call *entry* portion.
>
> Last time it was the system call exit side, which at least made some
> sense, because the "return to user space" thing actually has loops in
> it to handle all events before we return to user space.
>
> But the whole 'tracesys' part is only entered once, at the very
> beginning of a system call. There's no loop over the work. That whole
> call trace implies that the lockup happened just after we entered the
> system call path from _user_ space.

I suspect that the regression was triggered by the seccomp pull, since
that reworked a lot of this code.

>
> And in fact, exactly like last time, the code line implies that the
> timer interrupt happened on the return from the instruction, and
> indeed in both cases the code looked like this (the registers
> differed, but the "restore flags, start popping saved regs" was the
> exact same):
>
> 26: 53 push %rbx
> 27: 9d popfq
> 28:* 5b pop %rbx <-- trapping instruction
> 29: 41 5c pop %r12

Just to make sure I understand: it says "NMI watchdog", but this trace
is from a timer interrupt, not NMI, right?
Is it possible that we've managed to return to userspace with
interrupts off somehow? A loop in userspace that somehow has
interrupts off can cause all kinds of fun lockups.

I don't understand the logic of what enables TIF_NOHZ. That being
said in the new 3.18 code, if TIF_NOHZ is set, we use part of the fast
path instead of the full syscall slow path, which means that we
meander differently through the asm than we used to: we do
syscall_trace_enter_phase1, then a fast-path syscall, then we get to
sysret_careful, which does this:

        /*
         * We have a signal, or exit tracing or single-step.
         * These all wind up with the iret return path anyway,
         * so just join that path right now.
         */
        FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
        jmp int_check_syscall_exit_work


In 3.17, I don't think that code would run with context tracking on,
although I don't immediately see any bugs here.

>
>> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
>> RIP: 0010:[<ffffffff940fb71e>] [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
>> Call Trace:
>> [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0
>> [<ffffffff941671aa>] perf_event_read+0xca/0xd0
>> [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
>> [<ffffffff941689c6>] perf_read+0x226/0x370
>> [<ffffffff941eafff>] vfs_read+0x9f/0x180
>
> Hmm.. We've certainly seen a lot of smp_call, for various different
> reasons in your traces..
>
> I'm wondering if the smp-call ended up corrupting something on CPU3.
> Because even _with_ TIF_NOHZ confusion, I don't see how system call
> *entry* could cause a watchdog event. There are no loops, there are no
> locks I see, there is just *nothing* there I can see.
>

If we ever landed in userspace with interrupts off, this could happen
quite easily. It should be straightforward to add an assertion for
that, in trinity or in the kernel.

--Andy

> Let's add Andy L to the cc too, in case he hasn't seen this. He's
> been changing the lowlevel asm code, including very much this whole
> "syscall_trace_enter_phase1" thing. Maybe he sees something I don't.
>
> Andy?
>
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 11/410 27945]
>> RIP: 0010:[<ffffffff943dd415>] [<ffffffff943dd415>] intel_idle+0xd5/0x180
>> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 10/410 27945]
>> RIP: 0010:[<ffffffff943dd415>] [<ffffffff943dd415>] intel_idle+0xd5/0x180
>
> Nothing there.
>
>> Also, today I learned we can reach the perf_event_read code from
>> read(). Given I had /proc/sys/kernel/perf_event_paranoid set to 1,
>> I'm not sure how this is even possible. The only user of perf_fops
>> is perf_event_open syscall _after_ it's checked that sysctl.
>>
>> Oh, there's an ioctl path to perf too. Though trinity
>> doesn't know anything about it, so I find it surprising if it
>> managed to pull the right combination of entropy to make that
>> do the right thing. Still, that ioctl path probably needs
>> to also be checking that sysctl, shouldn't it?
>
> Hmm. Perf people are already mostly on the list. Peter/Ingo/Arnaldo?
>
> Linus



--
Andy Lutomirski
AMA Capital Management, LLC

Linus Torvalds

Nov 19, 2014, 2:38:19 PM11/19/14
to Andy Lutomirski, Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 11:15 AM, Andy Lutomirski <lu...@amacapital.net> wrote:
>
> I suspect that the regression was triggered by the seccomp pull, since
> that reworked a lot of this code.

Note that it turns out that Dave can apparently see the same problems
with 3.17, so it's not actually a regression. So it may have been
going on for a while.


> Just to make sure I understand: it says "NMI watchdog", but this trace
> is from a timer interrupt, not NMI, right?

Yeah. The kernel/watchdog.c code always says "NMI watchdog", but it's
actually just a regular timer function: watchdog_timer_fn() started
with hrtimer_start().

> Is it possible that we've managed to return to userspace with
> interrupts off somehow? A loop in userspace that somehow has
> interrupts off can cause all kinds of fun lockups.

That sounds unlikely, but possible if there is some stack corruption going on.

However, it wouldn't even explain things, because even if interrupts
had been disabled in user space, and even if that popf got executed,
this wouldn't be where they got enabled. That would be the :"sti" in
the system call entry path (hidden behind the ENABLE_INTERRUPTS
macro).

Of course, maybe Dave has paravirtualization enabled (what a crock
_that_ is), and there is something wrong with that whole code.

> I don't understand the logic of what enables TIF_NOHZ.

Yeah, that makes two of us. But..

> In 3.17, I don't think that code would run with context tracking on,
> although I don't immediately see any bugs here.

See above: the problem apparently isn't new. Although it is possible
that we have two different issues going on..

Linus

Andy Lutomirski

Nov 19, 2014, 4:01:52 PM11/19/14
to Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On 11/19/2014 06:59 AM, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <da...@redhat.com> wrote:
> > >
> > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> > > RIP: 0010:[<ffffffff8a1798b4>] [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> > > Call Trace:
> > > [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
> > > [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
> >
> > Hmm, if we are getting soft-lockups here, maybe it suggest too much exit-work.
> >
> > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
> >
> > That makes me wonder: does the problem go away if you disable NOHZ?
>
> Apparently not.

TIF_NOHZ is not the same thing as NOHZ. Can you try a kernel with
CONFIG_CONTEXT_TRACKING=n? Doing that may involve fiddling with RCU
settings a bit. The normal no HZ idle stuff has nothing to do with
TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
thread_info corruption going on here.

Hmm. This isn't a stack overflow, is it? That could cause all of these
problems quite easily, although I'd expect other symptoms, too.

>
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
> RIP: 0010:[<ffffffff94175be7>] [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120

This RIP should be impossible if context tracking is off.

> RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47

^^^^^^^^^

That is a strange coincidence. Where did 0x46 | (1<<32) come from?
That's a sensible interrupts-disabled flags value with the high part set
to 0x1. Those high bits are undefined, but they ought to all be zero.
Riddle me this: what are we doing in tracesys_phase2? This is a full
slow-path syscall. TIF_NOHZ doesn't cause that, I think. I'd love to
see the value of ti->flags here. Is trinity using ptrace?

Um. There's a bug. Patch coming after lunch. No clue whether it will
help here.

--Andy

Dave Jones

Nov 19, 2014, 4:49:39 PM
to Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:

> TIF_NOHZ is not the same thing as NOHZ. Can you try a kernel with
> CONFIG_CONTEXT_TRACKING=n? Doing that may involve fiddling with RCU
> settings a bit. The normal no HZ idle stuff has nothing to do with
> TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
> thread_info corruption going on here.

I'll try that next.

> > RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
>
> ^^^^^^^^^
>
> That is a strange coincidence. Where did 0x46 | (1<<32) come from?
> That's a sensible interrupts-disabled flags value with the high part set
> to 0x1. Those high bits are undefined, but they ought to all be zero.

This box is usually pretty solid, but it's been in service as a 24/7
fuzzing box for over a year now, so it's not outside the realm of
possibility that this could all be a hardware fault if some memory
has gone bad or the like. Unless we find something obvious in the
next few days, I'll try running memtest over the weekend (though
I've seen situations where that doesn't stress hardware enough to
manifest a problem, so it might not be entirely conclusive unless
it actually finds a fault).

I wish I had a second identical box to see if it would be reproducible.

> > [<ffffffff941689c6>] perf_read+0x226/0x370
> > [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
> > [<ffffffff941eafff>] vfs_read+0x9f/0x180
> > [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
> > [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9
>
> Riddle me this: what are we doing in tracesys_phase2? This is a full
> slow-path syscall. TIF_NOHZ doesn't cause that, I think. I'd love to
> see the value of ti->flags here. Is trinity using ptrace?

That's one of the few syscalls we actually blacklist (mostly because it
requires some more thinking: just passing it crap can get the fuzzer
into a confused state where it thinks child processes are dead when
they aren't, etc.). So it shouldn't be calling ptrace ever.

Andy Lutomirski

Nov 19, 2014, 4:56:33 PM
to Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, x...@kernel.org, Peter Zijlstra, Andy Lutomirski
TIF_NOHZ is 19 (i.e. _TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME |
_TIF_SINGLESTEP), not (1<<19).

This code is involved in Dave's trinity lockup, but I don't see why
it would cause any of the problems he's seeing, except inadvertently
by causing a different path through entry_64.S's syscall handling.

Signed-off-by: Andy Lutomirski <lu...@amacapital.net>
---
arch/x86/kernel/ptrace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 749b0e423419..e510618b2e91 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1484,7 +1484,7 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
*/
if (work & _TIF_NOHZ) {
user_exit();
- work &= ~TIF_NOHZ;
+ work &= ~_TIF_NOHZ;
}

#ifdef CONFIG_SECCOMP
--
1.9.3

Thomas Gleixner

Nov 19, 2014, 4:56:51 PM
to Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo
On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> I got a report lately involving context tracking. Not sure if it's
> the same here but the issue was that context tracking uses per cpu data
> and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> lazy paging.

This is complete nonsense. pcpu allocations are populated right
away. Otherwise no single line of kernel code which uses dynamically
allocated per cpu storage would be safe.

Thanks,

tglx

Borislav Petkov

Nov 19, 2014, 4:58:30 PM
to Dave Jones, Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Wed, Nov 19, 2014 at 04:47:43PM -0500, Dave Jones wrote:
> This box is usually pretty solid, but it's been in service as a 24/7
> fuzzing box for over a year now, so it's not outside the realm of
> possibility that this could all be a hardware fault if some memory
> has gone bad or the like.

You could grep old logs for "Hardware Error" and the usual suspects
coming from MCE/EDAC. Also /var/log/mcelog or something like that,
depending on what's running on that box.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Thomas Gleixner

Nov 19, 2014, 5:14:25 PM
to Andy Lutomirski, Dave Jones, Linus Torvalds, Don Zickus, Linux Kernel, x...@kernel.org, Peter Zijlstra
On Wed, 19 Nov 2014, Andy Lutomirski wrote:

> TIF_NOHZ is 19 (i.e. _TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME |
> _TIF_SINGLESTEP), not (1<<19).
>
> This code is involved in Dave's trinity lockup, but I don't see why
> it would cause any of the problems he's seeing, except inadvertently
> by causing a different path through entry_64.S's syscall handling.

Right, while it is wrong it does not explain the wreckage on 3.17,
which does not have that code.

Thanks,

tglx

Dave Jones

Nov 19, 2014, 5:20:14 PM
to Linus Torvalds, Andy Lutomirski, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 11:38:09AM -0800, Linus Torvalds wrote:

> > Is it possible that we've managed to return to userspace with
> > interrupts off somehow? A loop in userspace that somehow has
> > interrupts off can cause all kinds of fun lockups.
>
> That sounds unlikely, but if there is some stack corruption going on.
>
> However, it wouldn't even explain things, because even if interrupts
> had been disabled in user space, and even if that popf got executed,
> this wouldn't be where they got enabled. That would be the :"sti" in
> the system call entry path (hidden behind the ENABLE_INTERRUPTS
> macro).
>
> Of course, maybe Dave has paravirtualization enabled (what a crock
> _that_ is), and there is something wrong with that whole code.

I've had HYPERVISOR_GUEST disabled for a while, which also disables
the paravirt code afaics.

Dave

Frederic Weisbecker

Nov 19, 2014, 5:56:28 PM
to Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
> On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> > I got a report lately involving context tracking. Not sure if it's
> > the same here but the issue was that context tracking uses per cpu data
> > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> > lazy paging.
>
> This is complete nonsense. pcpu allocations are populated right
> away. Otherwise no single line of kernel code which uses dynamically
> allocated per cpu storage would be safe.

Note this isn't faulting because part of the allocation is swapped. No
it's all reserved in the physical memory, but it's a lazy allocation.
Part of it isn't yet addressed in the P[UGM?]D. That's what vmalloc_fault() is for.

So it's a non-blocking/sleeping fault which is why it's probably fine
most of the time except on code that isn't fault-safe. And I suspect that
most people assume that kernel data won't fault so probably some other
places have similar issues.

That's a long standing issue. We even had to convert the perf callchain
allocation to ad-hoc kmalloc() based per cpu allocation to get over vmalloc
faults. At that time, NMIs couldn't handle faults and many callchains were
populated in NMIs. We had serious crashes because of per cpu memory faults.

I think that lazy addressing is there for allocation performance reasons. But
still having faultable per-cpu memory is insane IMHO.

Dave Jones

Nov 19, 2014, 5:59:20 PM
to Borislav Petkov, Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Wed, Nov 19, 2014 at 10:58:14PM +0100, Borislav Petkov wrote:
> On Wed, Nov 19, 2014 at 04:47:43PM -0500, Dave Jones wrote:
> > This box is usually pretty solid, but it's been in service as a 24/7
> > fuzzing box for over a year now, so it's not outside the realm of
> > possibility that this could all be a hardware fault if some memory
> > has gone bad or the like.
>
> You could grep old logs for "Hardware Error" and the usual suspects
> coming from MCE/EDAC. Also /var/log/mcelog or something like that,
> depending on what's running on that box.

Nothing, but it wouldn't be the first time I'd seen a hardware fault
that didn't raise an MCE.

Dave

Andy Lutomirski

Nov 19, 2014, 5:59:33 PM
to Frederic Weisbecker, Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 2:56 PM, Frederic Weisbecker <fwei...@gmail.com> wrote:
> On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
>> On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
>> > I got a report lately involving context tracking. Not sure if it's
>> > the same here but the issue was that context tracking uses per cpu data
>> > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
>> > lazy paging.
>>
>> This is complete nonsense. pcpu allocations are populated right
>> away. Otherwise no single line of kernel code which uses dynamically
>> allocated per cpu storage would be safe.
>
> Note this isn't faulting because part of the allocation is swapped. No
> it's all reserved in the physical memory, but it's a lazy allocation.
> Part of it isn't yet addressed in the P[UGM?]D. That's what vmalloc_fault() is for.
>
> So it's a non-blocking/sleeping fault which is why it's probably fine
> most of the time except on code that isn't fault-safe. And I suspect that
> most people assume that kernel data won't fault so probably some other
> places have similar issues.
>
> That's a long standing issue. We even had to convert the perf callchain
> allocation to ad-hoc kmalloc() based per cpu allocation to get over vmalloc
> faults. At that time, NMIs couldn't handle faults and many callchains were
> populated in NMIs. We had serious crashes because of per cpu memory faults.

Is there seriously more than 512GB of per-cpu virtual space or
whatever's needed to exceed a single pgd on x86_64?

And there are definitely placed that access per-cpu data in contexts
in which a non-IST fault is not allowed. Maybe not dynamic per-cpu
data, though.

--Andy

>
> I think that lazy addressing is there for allocation performance reasons. But
> still having faultable per-cpu memory is insane IMHO.
>



--
Andy Lutomirski
AMA Capital Management, LLC

Frederic Weisbecker

Nov 19, 2014, 6:01:05 PM
to Andy Lutomirski, Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
Note those are non-sleeping faults. So probably most places are fine except
a few that really don't want an exception to mess up some state. I can
imagine some entry code that really doesn't want that.

Is kernel stack allocated by vmalloc or alloc_percpu()?

Andy Lutomirski

Nov 19, 2014, 6:07:47 PM
to Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
Any non-IST fault at all on the kernel_stack reference in system_call
is instant root on non-SMAP systems and instant double-fault or more
challenging root on SMAP systems. The issue is that rsp is
user-controlled, so the CPU cannot deliver a non-IST fault safely.

>
> Is kernel stack allocated by vmalloc or alloc_percpu()?

DEFINE_PER_CPU(unsigned long, kernel_stack)

Note that I'm talking about kernel_stack, not the kernel stack itself.
The actual stack is regular linearly-mapped memory, although I plan on
trying to change that, complete with all kinds of care to avoid double
faults.

--Andy



--
Andy Lutomirski
AMA Capital Management, LLC

Frederic Weisbecker

Nov 19, 2014, 6:08:07 PM
to Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 02:59:01PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 19, 2014 at 2:56 PM, Frederic Weisbecker <fwei...@gmail.com> wrote:
> > On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
> >> On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> >> > I got a report lately involving context tracking. Not sure if it's
> >> > the same here but the issue was that context tracking uses per cpu data
> >> > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> >> > lazy paging.
> >>
> >> This is complete nonsense. pcpu allocations are populated right
> >> away. Otherwise no single line of kernel code which uses dynamically
> >> allocated per cpu storage would be safe.
> >
> > Note this isn't faulting because part of the allocation is swapped. No
> > it's all reserved in the physical memory, but it's a lazy allocation.
> > Part of it isn't yet addressed in the P[UGM?]D. That's what vmalloc_fault() is for.
> >
> > So it's a non-blocking/sleeping fault which is why it's probably fine
> > most of the time except on code that isn't fault-safe. And I suspect that
> > most people assume that kernel data won't fault so probably some other
> > places have similar issues.
> >
> > That's a long standing issue. We even had to convert the perf callchain
> > allocation to ad-hoc kmalloc() based per cpu allocation to get over vmalloc
> > faults. At that time, NMIs couldn't handle faults and many callchains were
> > populated in NMIs. We had serious crashes because of per cpu memory faults.
>
> Is there seriously more than 512GB of per-cpu virtual space or
> whatever's needed to exceed a single pgd on x86_64?

No idea, I'm clueless about -mm details.

>
> And there are definitely placed that access per-cpu data in contexts
> in which a non-IST fault is not allowed. Maybe not dynamic per-cpu
> data, though.

It probably happens to be fine because the code that first accesses the
related data is fault-safe. Or maybe not, and some state is silently messed
up somewhere.

This doesn't leave a comfortable feeling.

Thomas Gleixner

Nov 19, 2014, 6:09:35 PM
to Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo
On Wed, 19 Nov 2014, Frederic Weisbecker wrote:

> On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
> > On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> > > I got a report lately involving context tracking. Not sure if it's
> > > the same here but the issue was that context tracking uses per cpu data
> > > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> > > lazy paging.
> >
> > This is complete nonsense. pcpu allocations are populated right
> > away. Otherwise no single line of kernel code which uses dynamically
> > allocated per cpu storage would be safe.
>
> Note this isn't faulting because part of the allocation is
> swapped. No it's all reserved in the physical memory, but it's a
> lazy allocation. Part of it isn't yet addressed in the
> P[UGM?]D. That's what vmalloc_fault() is for.

Sorry, I can't follow your argumentation here.

pcpu_alloc()
....
area_found:
....

/* clear the areas and return address relative to base address */
for_each_possible_cpu(cpu)
memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);

How would that memset fail to establish the mapping, which is
btw. already established via:

pcpu_populate_chunk()

already before that memset?

Are we talking about different per cpu allocators here or am I missing
something completely non obvious?

Thanks,

tglx

Frederic Weisbecker

Nov 19, 2014, 6:13:32 PM
to Andy Lutomirski, Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
On Wed, Nov 19, 2014 at 03:07:17PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 19, 2014 at 3:00 PM, Frederic Weisbecker <fwei...@gmail.com> wrote:
> > Note those are non-sleeping faults. So probably most places are fine except
> > a few that really don't want an exception to mess up some state. I can
> > imagine some entry code that really doesn't want that.
>
> Any non-IST fault at all on the kernel_stack reference in system_call
> is instant root on non-SMAP systems and instant double-fault or more
> challenging root on SMAP systems. The issue is that rsp is
> user-controlled, so the CPU cannot deliver a non-IST fault safely.

Heh.

> >
> > Is kernel stack allocated by vmalloc or alloc_percpu()?
>
> DEFINE_PER_CPU(unsigned long, kernel_stack)
>
> Note that I'm talking about kernel_stack, not the kernel stack itself.

Ah. Note, static allocation like DEFINE_PER_CPU() is probably fine. The
issue is on dynamic allocations: alloc_percpu().

> The actual stack is regular linearly-mapped memory, although I plan on
> trying to change that, complete with all kinds of care to avoid double
> faults.

If you do so, you must really ensure that the resulting memory will never
fault.

Frederic Weisbecker

Nov 19, 2014, 6:50:47 PM
to Thomas Gleixner, Tejun Heo, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo
That's the same allocator yeah. So if the whole memory is dereferenced,
faults shouldn't happen indeed.

Maybe that was a bug a few years ago but not anymore.

I'm surprised because I got a report from Dave that very much suggested
a vmalloc fault. See the discussion "Deadlock in vtime_account_user() vs itself across a page fault":

http://marc.info/?l=linux-kernel&m=141047612120263&w=2

Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
After all it's allocated with vzalloc() so that part could be skipped. The memset(0)
is passed the whole size though so it looks like the whole is dereferenced.

(cc'ing Tejun just in case).

Now if faults on percpu memory don't happen anymore, perhaps we are accessing some
other vmalloc'ed area. In the above report from Dave, the fault happened somewhere
in account_user_time().

Andy Lutomirski

Nov 19, 2014, 6:54:45 PM
to Thomas Gleixner, Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
I think that this will map them into init_mm->pgd and
current->active_mm->pgd, but it won't necessarily map them into the
rest of the pgds.

At the risk of suggesting something awful, if we preallocated all 256
or whatever kernel pmd pages at boot, this whole problem would go away
forever. It would only waste slightly under 1 MB of RAM (less on
extremely large systems).

--Andy

Thomas Gleixner

Nov 19, 2014, 7:01:25 PM
to Andy Lutomirski, Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
On Wed, 19 Nov 2014, Andy Lutomirski wrote:
> On Wed, Nov 19, 2014 at 3:09 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> > Sorry, I can't follow your argumentation here.
> >
> > pcpu_alloc()
> > ....
> > area_found:
> > ....
> >
> > /* clear the areas and return address relative to base address */
> > for_each_possible_cpu(cpu)
> > memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
> >
> > How would that memset fail to establish the mapping, which is
> > btw. already established via:
> >
> > pcpu_populate_chunk()
> >
> > already before that memset?
>
> I think that this will map them into init_mm->pgd and
> current->active_mm->pgd, but it won't necessarily map them into the
> rest of the pgds.

And why would mapping them into the kernel mapping, i.e. init_mm not
be sufficient?

We are talking about kernel memory and not some random user space
mapping.

Thanks,

tglx

Andy Lutomirski

Nov 19, 2014, 7:31:25 PM
to Thomas Gleixner, linux-...@vger.kernel.org, Arnaldo Carvalho de Melo, Peter Zijlstra, Linus Torvalds, Frederic Weisbecker, Don Zickus, Dave Jones, the arch/x86 maintainers
On Nov 19, 2014 4:00 PM, "Thomas Gleixner" <tg...@linutronix.de> wrote:
>
> On Wed, 19 Nov 2014, Andy Lutomirski wrote:
> > On Wed, Nov 19, 2014 at 3:09 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> > > Sorry, I can't follow your argumentation here.
> > >
> > > pcpu_alloc()
> > > ....
> > > area_found:
> > > ....
> > >
> > > /* clear the areas and return address relative to base address */
> > > for_each_possible_cpu(cpu)
> > > memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
> > >
> > > How would that memset fail to establish the mapping, which is
> > > btw. already established via:
> > >
> > > pcpu_populate_chunk()
> > >
> > > already before that memset?
> >
> > I think that this will map them into init_mm->pgd and
> > current->active_mm->pgd, but it won't necessarily map them into the
> > rest of the pgds.
>
> And why would mapping them into the kernel mapping, i.e. init_mm not
> be sufficient?

Because the kernel can run with any pgd loaded into cr3, and we rely
on vmalloc_fault to lazily populate pgds in all the non-init pgds as
needed. But this only happens if the first TLB-missing reference to
the pgd in question with any given cr3 value happens from a safe
context.

This is why I think that the grsec kernels will crash on very large
memory systems. They don't seem to get this right for the kernel
stack, and a page fault trying to access the stack is a big no-no.

--Andy

Linus Torvalds

Nov 19, 2014, 7:40:58 PM
to Andy Lutomirski, Thomas Gleixner, linux-...@vger.kernel.org, Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 4:30 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>
> This is why I think that the grsec kernels will crash on very large
> memory systems. They don't seem to get this right for the kernel
> stack, and a page fault trying to access the stack is a big no-no.

For something like a stack, that's trivial, you could just probe it
before the actual task switch.

So I wouldn't worry about the kernel stack itself (although I think
vmallocing it isn't likely worth it), I'd worry more about some other
random dynamic percpu allocation. Although they arguably shouldn't
happen for low-level code that cannot handle the dynamic
pgd-population. And they generally don't.

It's really tracing that tends to be a special case not because of any
particular low-level code issue, but because instrumenting itself
recursively tends to be a bad idea.

Linus

Andy Lutomirski

Nov 19, 2014, 7:49:55 PM
to Linus Torvalds, Thomas Gleixner, linux-...@vger.kernel.org, Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 4:40 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 4:30 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>
>> This is why I think that the grsec kernels will crash on very large
>> memory systems. They don't seem to get this right for the kernel
>> stack, and a page fault trying to access the stack is a big no-no.
>
> For something like a stack, that's trivial, you could just probe it
> before the actual task switch.

I thought so for a while, too, but now I disagree. On PGE hardware,
it seems entirely possible that the new stack would be in the TLB even
if it's not visible via cr3. Then, as soon as the TLB entry expires,
we double-fault.

>
> So I wouldn't worry about the kernel stack itself (although I think
> vmallocing it isn't likely worth it),

I don't want vmalloc to avoid low-order allocations -- I want it to
have guard pages. The fact that a user-triggerable stack overflow is
basically root right now and doesn't reliably OOPS scares me.

> I'd worry more about some other
> random dynamic percpu allocation. Although they arguably shouldn't
> happen for low-level code that cannot handle the dynamic
> pgd-population. And they generally don't.

This issue ought to be limited to nokprobes code, and I doubt that any
of that code touches dynamic per-cpu things.

>
> It's really tracing that tends to be a special case not because of any
> particular low-level code issue, but because instrumenting itself
> recursively tends to be a bad idea.
>
> Linus



--
Andy Lutomirski
AMA Capital Management, LLC

Linus Torvalds

Nov 19, 2014, 8:07:39 PM
to Andy Lutomirski, Thomas Gleixner, linux-...@vger.kernel.org, Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 4:49 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>
> I thought so for a while, too, but now I disagree. On PGE hardware,
> it seems entirely possible that the new stack would be in the TLB even
> if it's not visible via cr3. Then, as soon as the TLB entry expires,
> we double-fault.

Ahh. Good point.

> I don't want vmalloc to avoid low-order allocations -- I want it to
> have guard pages. The fact that a user-triggerable stack overflow is
> basically root right now and doesn't reliably OOPS scares me.

Well, if you do that, you would have to make the double-fault handler
aware of the stack issue anyway, and then you could just do the same
PGD repopulation that a page fault does and return (for the case where
you didn't overflow the stack, just had the page tables unpopulated -
obviously an actual stack overflow should do something more drastic).

Linus

Andy Lutomirski

Nov 19, 2014, 8:16:54 PM
to Linus Torvalds, Thomas Gleixner, linux-...@vger.kernel.org, Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 5:07 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 4:49 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>
>> I thought so for a while, too, but now I disagree. On PGE hardware,
>> it seems entirely possible that the new stack would be in the TLB even
>> if it's not visible via cr3. Then, as soon as the TLB entry expires,
>> we double-fault.
>
> Ahh. Good point.
>
>> I don't want vmalloc to avoid low-order allocations -- I want it to
>> have guard pages. The fact that a user-triggerable stack overflow is
>> basically root right now and doesn't reliably OOPS scares me.
>
> Well, if you do that, you would have to make the double-fault handler
> aware of the stack issue anyway, and then you could just do the same
> PGD repopulation that a page fault does and return (for the case where
> you didn't overflow the stack, just had the page tables unpopulated -
> obviously an actual stack overflow should do something more drastic).

And you were calling me crazy? :)

We could be restarting just about anything if that happens. Except
that if we double-faulted on a trap gate entry instead of an interrupt
gate entry, then we can't restart, and, unless we can somehow decode
the error code usefully (it's woefully undocumented), int 0x80 and
int3 might be impossible to handle correctly if it double-faults. And
please don't suggest moving int 0x80 to an IST stack :)

The SDM specifically says that you must not try to recover after a
double-fault. We do, however, recover from a double-fault in the
specific case of an iret failure during espfix64 processing (and I
even have a nice test case for it), but I think that hpa had a long
conversation with one of the microcode architects before he was okay
with that.

--Andy

Linus Torvalds

Nov 19, 2014, 9:42:38 PM
to Andy Lutomirski, Thomas Gleixner, linux-...@vger.kernel.org, Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 5:16 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>
> And you were calling me crazy? :)

Hey, I'm crazy like a fox.

> We could be restarting just about anything if that happens. Except
> that if we double-faulted on a trap gate entry instead of an interrupt
> gate entry, then we can't restart, and, unless we can somehow decode
> the error code usefully (it's woefully undocumented), int 0x80 and
> int3 might be impossible to handle correctly if it double-faults. And
> please don't suggest moving int 0x80 to an IST stack :)

No, no. So tell me if this won't work:

- when forking a new process, make sure we allocate the vmalloc stack
*before* we copy the vm

- this should guarantee that all new processes will at least have its
*own* stack always in its page tables, since vmalloc always fills in
the page table of the current page tables of the thread doing the
vmalloc.

HOWEVER, that leaves the task switch *to* that process, and making
sure that the stack pointer is ok in between the "switch %rsp" and
"switch %cr3".

So then we make the rule be: switch %cr3 *before* switching %rsp, and
only in between those places can we get in trouble. Yes/no?

And that small section is all with interrupts disabled, and nothing
should take an exception. The C code might take a double fault on a
regular access to the old stack (the *new* stack is guaranteed to be
mapped, but the old stack is not), but that should be very similar to
what we already do with "iret". So we can just fill in the page tables
and return.

For safety, add a percpu counter that is cleared before the %cr3
setting, to make sure that we only do a *single* double-fault, but it
really sounds pretty safe. No?

The only deadly thing would be NMI, but that's an IST anyway, so not
an issue. No other traps should be able to happen except the double
page table miss.

But hey, maybe I'm not crazy like a fox. Maybe I'm just plain crazy,
and I missed something else.

And no, I don't think the above is necessarily a *good* idea. But it
doesn't seem really overly complicated either.

Linus

Andy Lutomirski

Nov 20, 2014, 1:17:22 AM
to Linus Torvalds, Thomas Gleixner, linux-...@vger.kernel.org, Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 6:42 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 5:16 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>
>> And you were calling me crazy? :)
>
> Hey, I'm crazy like a fox.
>
>> We could be restarting just about anything if that happens. Except
>> that if we double-faulted on a trap gate entry instead of an interrupt
>> gate entry, then we can't restart, and, unless we can somehow decode
>> the error code usefully (it's woefully undocumented), int 0x80 and
>> int3 might be impossible to handle correctly if it double-faults. And
>> please don't suggest moving int 0x80 to an IST stack :)
>
> No, no. So tell me if this won't work:
>
> - when forking a new process, make sure we allocate the vmalloc stack
> *before* we copy the vm
>
> - this should guarantee that all new processes will at least always
> have their *own* stack in their page tables, since vmalloc always
> fills in the current page tables of the thread doing the vmalloc.

This gets interesting for kernel threads that don't really have an mm
in the first place, though.

>
> HOWEVER, that leaves the task switch *to* that process, and making
> sure that the stack pointer is ok in between the "switch %rsp" and
> "switch %cr3".
>
> So then we make the rule be: switch %cr3 *before* switching %rsp, and
> only in between those places can we get in trouble. Yes/no?
>

Kernel threads aside, sure. And we do it in this order anyway, I think.

> And that small section is all with interrupts disabled, and nothing
> should take an exception. The C code might take a double fault on a
> regular access to the old stack (the *new* stack is guaranteed to be
> mapped, but the old stack is not), but that should be very similar to
> what we already do with "iret". So we can just fill in the page tables
> and return.

Unless we try to dump the stack from an NMI or something, but that
should be fine regardless.

>
> For safety, add a percpu counter that is cleared before the %cr3
> setting, to make sure that we only do a *single* double-fault, but it
> really sounds pretty safe. No?

I wouldn't be surprised if that's just as expensive as just fixing up
the pgd in the first place. The fixup is just:

if (unlikely(pte_none(mm->pgd[pgd_address(rsp)]))) fix it;

or something like that.

>
> The only deadly thing would be NMI, but that's an IST anyway, so not
> an issue. No other traps should be able to happen except the double
> page table miss.
>
> But hey, maybe I'm not crazy like a fox. Maybe I'm just plain crazy,
> and I missed something else.

I actually kind of like it, other than the kernel thread issue.

We should arguably ditch lazy mm for kernel threads in favor of PCID,
but that's a different story. Or we could beg Intel to give us
separate kernel and user page table hierarchies.

--Andy

>
> And no, I don't think the above is necessarily a *good* idea. But it
> doesn't seem really overly complicated either.
>
> Linus



--
Andy Lutomirski
AMA Capital Management, LLC

Dave Young

Nov 20, 2014, 4:55:23 AM11/20/14
to Don Zickus, Dave Jones, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers, vgo...@redhat.com
On 11/19/14 at 09:41am, Don Zickus wrote:
> On Tue, Nov 18, 2014 at 05:02:54PM -0500, Dave Jones wrote:
> > On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:
> >
> > > > So here we mangle CPU3 in and lose the backtrace for cpu0, which might
> > > > be the real interesting one ....
> > >
> > > Can you provide another dump? The hope is we get something not mangled?
> >
> > Working on it..
> >
> > > The other option we have done in RHEL is panic the system and let kdump
> > > capture the memory. Then we can analyze the vmcore for the stack trace
> > > cpu0 stored in memory to get a rough idea where it might be if the cpu
> > > isn't responding very well.
> >
> > I don't know if it's because of the debug options I typically run with,
> > or that I'm perpetually cursed, but I've never managed to get kdump to
> > do anything useful. (The last time I tried it was actively harmful in
> > that not only did it fail to dump anything, it wedged the machine so
> > it didn't reboot after panic).
> >
> > Unless there's some magic step missing from the documentation at
> > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> > then I'm not optimistic it'll be useful.
>
> Well, I don't know when the last time you ran it was, but I know the RH kexec
> folks started pursuing a Fedora-first package patch rule a couple of
> years ago to ensure Fedora had a working kexec/kdump solution.

That started in Fedora 17, I think; before F17, Fedora's kdump support was very
limited. It has been getting better.

>
> As for the wedging part, it was a common problem to have the kernel hang
> while trying to boot the second kernel (and before console output
> happened). So the problem makes sense and is unfortunate. I would
> encourage you to try again. :-)

In Fedora we will have more such issues than RHEL because the kernel is updated
frequently. There are occasionally new problems in the upstream kernel, such as
the kASLR feature on x86.

The problem for Fedora is that kdump is not enabled by default, so users need to
explicitly specify the crashkernel reservation on the kernel cmdline and enable
the kdump service.

Very few bugs are reported by Fedora users, so I guess it is not well tested in
the Fedora community. Since Dave brought up this issue, it's at least good news
for us that someone is using it. We can then address the problems case by case.

Probably a good way to get more testing is to add the kdump anaconda addon by
default at installation time, so users can choose whether to enable kdump.

Thanks
Dave

Borislav Petkov

Nov 20, 2014, 5:33:58 AM11/20/14
to Dave Jones, Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Wed, Nov 19, 2014 at 05:18:42PM -0500, Dave Jones wrote:
> Nothing, but it wouldn't be the first time I'd seen a hardware fault
> that didn't raise an MCE.

And maybe it tried but it didn't manage to come out due to hard wedging. :-)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Tejun Heo

Nov 20, 2014, 7:23:58 AM11/20/14
to Frederic Weisbecker, Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo
Hello,

On Thu, Nov 20, 2014 at 12:50:36AM +0100, Frederic Weisbecker wrote:
> > Are we talking about different per cpu allocators here or am I missing
> > something completely non obvious?
>
> That's the same allocator yeah. So if the whole memory is dereferenced,
> faults shouldn't happen indeed.
>
> Maybe that was a bug a few years ago but not anymore.

It has always been like that, though. Percpu memory given out is always
populated and cleared.

> Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
> After all it's allocated with vzalloc() so that part could be skipped. The memset(0)

The vzalloc call is for the internal allocation bitmap not the actual
percpu memory area. The actual address areas for percpu memory are
obtained using pcpu_get_vm_areas() call and later get populated using
map_kernel_range_noflush() (flush is performed after mapping is
complete).

Trying to remember what happens with vmalloc_fault(). Ah okay, so
when a new PUD gets created for vmalloc area, we don't go through all
PGDs and update them. The PGD entries get faulted in lazily. Percpu
memory allocator clearing or not clearing the allocated area doesn't
have anything to do with it. The memory area is always fully
populated in the kernel page table. It's just that the population
happened while a different PGD was active and this PGD hasn't been
populated with the new PUD yet.

So, yeap, vmalloc_fault() can always happen when accessing vmalloc
areas and the only way to avoid that would be removing lazy PGD
population - going through all PGDs and populating new PUDs
immediately.

Thanks.

--
tejun

Frederic Weisbecker

Nov 20, 2014, 9:36:43 AM11/20/14
to Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers
On Wed, Nov 19, 2014 at 12:15:24AM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
>
> > Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
> >
> > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
> >
> > That makes me wonder: does the problem go away if you disable NOHZ?
>
> Does nohz=off do enough? I couldn't convince myself after looking at
> dmesg, and still seeing dynticks stuff in there.
>
> I'll do a rebuild with all the CONFIG_NO_HZ stuff off, though it also changes
> some other config stuff wrt timers.

You also need to disable context tracking. So you need to deactivate
CONFIG_RCU_USER_QS and CONFIG_CONTEXT_TRACKING_FORCE as well, and make sure
nothing else is turning on CONFIG_CONTEXT_TRACKING.

You can keep CONFIG_NO_HZ_IDLE though, just not CONFIG_NO_HZ_FULL.
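
[Editorial note] Put together, the settings Frederic lists correspond to a .config fragment roughly like this (option names as they existed in 3.18-era kernels; exact dependencies may differ per tree):

```
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_RCU_USER_QS is not set
# CONFIG_CONTEXT_TRACKING_FORCE is not set
# CONFIG_CONTEXT_TRACKING is not set
```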

Frederic Weisbecker

Nov 20, 2014, 10:04:25 AM11/20/14
to Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Wed, Nov 19, 2014 at 09:59:02AM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <da...@redhat.com> wrote:
> > >
> > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> > > RIP: 0010:[<ffffffff8a1798b4>] [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> > > Call Trace:
> > > [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
> > > [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
> >
> > Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
> >
> > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
> >
> > That makes me wonder: does the problem go away if you disable NOHZ?
>
> Apparently not.
>
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
> RIP: 0010:[<ffffffff94175be7>] [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
> RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
> RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
> RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
> R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
> FS: 00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
> ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
> 00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
> 0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
> Call Trace:
> [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
> [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff947d41bf>] tracesys+0x14/0x4a
> Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80
> sending NMI to other CPUs:
> NMI backtrace for cpu 1
> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
> RIP: 0010:[<ffffffff940fb71e>] [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> RSP: 0018:ffff8801a99a7d18 EFLAGS: 00000202
> RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
> RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
> RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
> FS: 00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
> ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
> 0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
> 0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
> Call Trace:
> [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
> [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
> [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0

One thing that happens a lot in your crashes is a CPU sending IPIs. Maybe
it's stuck polling on csd->lock or something. But it's not the CPU that
soft-lockups, at least not the first one that gets reported.

> [<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
> [<ffffffff941671aa>] perf_event_read+0xca/0xd0
> [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
> [<ffffffff941689c6>] perf_read+0x226/0x370
> [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
> [<ffffffff941eafff>] vfs_read+0x9f/0x180
> [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
> [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9

Frederic Weisbecker

Nov 20, 2014, 10:08:34 AM11/20/14
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
On Mon, Nov 17, 2014 at 12:03:59PM -0500, Dave Jones wrote:
> On Sat, Nov 15, 2014 at 10:33:19PM -0800, Linus Torvalds wrote:
>
> > > > I'll try that next, and check in on it tomorrow.
> > >
> > > No luck. Died even faster this time.
> >
> > Yeah, and your other lockups haven't even been TLB related. Not that
> > they look like anything else *either*.
> >
> > I have no ideas left. I'd go for a bisection - rather than try random
> > things, at least bisection will get us a smaller set of suspects if
> > you can go through a few cycles of it. Even if you decide that you
> > want to run for most of a day before you are convinced it's all good,
> > a couple of days should get you a handful of bisection points (that's
> > assuming you hit a couple of bad ones too that turn bad in a shorter
> > while). And 4 or five bisections should get us from 11k commits down
> > to the ~600 commit range. That would be a huge improvement.
>
> Great start to the week: I decided to confirm my recollection that .17
> was ok, only to hit this within 10 minutes.
>
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
> 0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
> ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
> ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
> Call Trace:
> <NMI> [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
> [<ffffffff9583bcc0>] panic+0xd4/0x207
> [<ffffffff95150908>] watchdog_overflow_callback+0x118/0x120
> [<ffffffff95193dbe>] __perf_event_overflow+0xae/0x340
> [<ffffffff95192230>] ? perf_event_task_disable+0xa0/0xa0
> [<ffffffff9501a7bf>] ? x86_perf_event_set_period+0xbf/0x150
> [<ffffffff95194be4>] perf_event_overflow+0x14/0x20
> [<ffffffff95020676>] intel_pmu_handle_irq+0x206/0x410
> [<ffffffff9501966b>] perf_event_nmi_handler+0x2b/0x50
> [<ffffffff95007bb2>] nmi_handle+0xd2/0x390
> [<ffffffff95007ae5>] ? nmi_handle+0x5/0x390
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff950080a2>] default_do_nmi+0x72/0x1c0
> [<ffffffff950082a8>] do_nmi+0xb8/0x100
> [<ffffffff9584b9aa>] end_repeat_nmi+0x1e/0x2e
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> <<EOE>> <IRQ> [<ffffffff95101685>] lock_hrtimer_base.isra.18+0x25/0x50
> [<ffffffff951019d3>] hrtimer_try_to_cancel+0x33/0x1f0

Ah that one got fixed in the merge window and in -stable, right?

> [<ffffffff95101baa>] hrtimer_cancel+0x1a/0x30
> [<ffffffff95113557>] tick_nohz_restart+0x17/0x90
> [<ffffffff95114533>] __tick_nohz_full_check+0xc3/0x100
> [<ffffffff9511457e>] nohz_full_kick_work_func+0xe/0x10
> [<ffffffff95188894>] irq_work_run_list+0x44/0x70
> [<ffffffff951888ea>] irq_work_run+0x2a/0x50
> [<ffffffff9510109b>] update_process_times+0x5b/0x70
> [<ffffffff95113325>] tick_sched_handle.isra.20+0x25/0x60
> [<ffffffff95113801>] tick_sched_timer+0x41/0x60
> [<ffffffff95102281>] __run_hrtimer+0x81/0x480
> [<ffffffff951137c0>] ? tick_sched_do_timer+0xb0/0xb0
> [<ffffffff95102977>] hrtimer_interrupt+0x117/0x270
> [<ffffffff950346d7>] local_apic_timer_interrupt+0x37/0x60
> [<ffffffff9584c44f>] smp_apic_timer_interrupt+0x3f/0x50
> [<ffffffff9584a86f>] apic_timer_interrupt+0x6f/0x80
> <EOI> [<ffffffff950d3f3a>] ? lock_release_holdtime.part.28+0x9a/0x160
> [<ffffffff950ef3b7>] ? rcu_is_watching+0x27/0x60
> [<ffffffff9508cb75>] kill_pid_info+0xf5/0x130
> [<ffffffff9508ca85>] ? kill_pid_info+0x5/0x130
> [<ffffffff9508ccd3>] SYSC_kill+0x103/0x330
> [<ffffffff9508cc7c>] ? SYSC_kill+0xac/0x330
> [<ffffffff9519b592>] ? context_tracking_user_exit+0x52/0x1a0
> [<ffffffff950d6f1d>] ? trace_hardirqs_on_caller+0x16d/0x210
> [<ffffffff950d6fcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff950137ad>] ? syscall_trace_enter+0x14d/0x330
> [<ffffffff9508f44e>] SyS_kill+0xe/0x10
> [<ffffffff95849b24>] tracesys+0xdd/0xe2
> Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> It could be a completely different cause for the lockup, but seeing this now
> has me wondering if perhaps it's something unrelated to the kernel.
> I have recollection of running late .17rc's for days without incident,
> and I'm pretty sure .17 was ok too. But a few weeks ago I did upgrade
> that test box to the Fedora 21 beta. Which means I have a new gcc.
> I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
> get 4.8 back on there and see if that's any better.
>
> Dave

Dave Jones

Nov 20, 2014, 10:25:58 AM11/20/14
to Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:

> TIF_NOHZ is not the same thing as NOHZ. Can you try a kernel with
> CONFIG_CONTEXT_TRACKING=n? Doing that may involve fiddling with RCU
> settings a bit. The normal no HZ idle stuff has nothing to do with
> TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
> thread_info corruption going on here.

Disabling CONTEXT_TRACKING didn't change the problem.
Unfortunately the full trace didn't make it over usb-serial this time. Grr.

Here's what came over serial..

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c35:11634]
CPU: 2 PID: 11634 Comm: trinity-c35 Not tainted 3.18.0-rc5+ #94 [loadavg: 164.79 157.30 155.90 37/409 11893]
task: ffff88014e0d96f0 ti: ffff880220eb4000 task.ti: ffff880220eb4000
RIP: 0010:[<ffffffff88379605>] [<ffffffff88379605>] copy_user_enhanced_fast_string+0x5/0x10
RSP: 0018:ffff880220eb7ef0 EFLAGS: 00010283
RAX: ffff880220eb4000 RBX: ffffffff887dac64 RCX: 0000000000006a18
RDX: 000000000000e02f RSI: 00007f766f466620 RDI: ffff88016f6a7617
RBP: ffff880220eb7f78 R08: 8000000000000063 R09: 0000000000000004
R10: 0000000000000010 R11: 0000000000000000 R12: ffffffff880bf50d
R13: 0000000000000001 R14: ffff880220eb4000 R15: 0000000000000001
FS: 00007f766f459740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f766f461000 CR3: 000000018b00e000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffffffff882f4225 ffff880183db5a00 0000000001743440 00007f766f0fb000
fffffffffffffeff 0000000000000000 0000000000008d79 00007f766f45f000
ffffffff8837adae 00ff880220eb7f38 000000003203f1ac 0000000000000001
Call Trace:
[<ffffffff882f4225>] ? SyS_add_key+0xd5/0x240
[<ffffffff8837adae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff887da092>] system_call_fastpath+0x12/0x17
Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00 00 00 00 0f 1f 00 89 d1 <f3> a4 31 c0 0f 1f 00 c3 90 90 90 0f 1f 00 83 fa 08 0f 82 95 00
sending NMI to other CPUs:


Here's a crappy phonecam pic of the screen.
http://codemonkey.org.uk/junk/IMG_4311.jpg
There's a bit of trace missing between the above and what was on
the screen, so we missed some CPUs.

Frederic Weisbecker

Nov 20, 2014, 10:28:51 AM11/20/14
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers, Andi Lutomirski
On Sat, Nov 15, 2014 at 08:40:06PM -0500, Dave Jones wrote:
> On Sat, Nov 15, 2014 at 04:34:05PM -0500, Dave Jones wrote:
> > On Fri, Nov 14, 2014 at 02:01:27PM -0800, Linus Torvalds wrote:
> >
> > > But since you say "several times a day", just for fun, can you test
> > > the follow-up patch to that one-liner fix that Will Deacon posted
> > > today (Subject: "[PATCH] mmu_gather: move minimal range calculations
> > > into generic code"). That does some further cleanup in this area.
> >
> > A few hours ago it hit the NMI watchdog again with that patch applied.
> > Incomplete trace, but it looks different based on what did make it over.
> > Different RIP at least.
> >
> > [65155.054155] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [trinity-c127:12559]
> > [65155.054573] irq event stamp: 296752
> > [65155.054589] hardirqs last enabled at (296751): [<ffffffff9d87403d>] _raw_spin_unlock_irqrestore+0x5d/0x80
> > [65155.054625] hardirqs last disabled at (296752): [<ffffffff9d875cea>] apic_timer_interrupt+0x6a/0x80
> > [65155.054657] softirqs last enabled at (296188): [<ffffffff9d259943>] bdi_queue_work+0x83/0x270
> > [65155.054688] softirqs last disabled at (296184): [<ffffffff9d259920>] bdi_queue_work+0x60/0x270
> > [65155.054721] CPU: 1 PID: 12559 Comm: trinity-c127 Not tainted 3.18.0-rc4+ #84 [loadavg: 209.68 187.90 185.33 34/431 17515]
> > [65155.054795] task: ffff88023f664680 ti: ffff8801649f0000 task.ti: ffff8801649f0000
> > [65155.054820] RIP: 0010:[<ffffffff9d87403f>] [<ffffffff9d87403f>] _raw_spin_unlock_irqrestore+0x5f/0x80
> > [65155.054852] RSP: 0018:ffff8801649f3be8 EFLAGS: 00000292
> > [65155.054872] RAX: ffff88023f664680 RBX: 0000000000000007 RCX: 0000000000000007
> > [65155.054895] RDX: 00000000000029e0 RSI: ffff88023f664ea0 RDI: ffff88023f664680
> > [65155.054919] RBP: ffff8801649f3bf8 R08: 0000000000000000 R09: 0000000000000000
> > [65155.055956] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > [65155.056985] R13: ffff8801649f3b58 R14: ffffffff9d3e7d0e R15: 00000000000003e0
> > [65155.058037] FS: 00007f0dc957c700(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> > [65155.059083] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [65155.060121] CR2: 00007f0dc958e000 CR3: 000000022f31e000 CR4: 00000000001407e0
> > [65155.061152] DR0: 00007f54162bc000 DR1: 00007feb92c3d000 DR2: 0000000000000000
> > [65155.062180] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> > [65155.063202] Stack:
> >
> > And that's all she wrote.
> >
> > > If Will's patch doesn't make a difference, what about reverting that
> > > ce9ec37bddb6? Although it really *is* a "obvious bugfix", and I really
> > > don't see why any of this would be noticeable on x86 (it triggered
> > > issues on ARM64, but that was because ARM64 cared much more about the
> > > exact range).
> >
> > I'll try that next, and check in on it tomorrow.
>
> No luck. Died even faster this time.
>
> [ 772.459481] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [modprobe:31400]
> [ 772.459858] irq event stamp: 3362
> [ 772.459872] hardirqs last enabled at (3361): [<ffffffff941a437c>] context_tracking_user_enter+0x9c/0x2c0
> [ 772.459907] hardirqs last disabled at (3362): [<ffffffff94875bea>] apic_timer_interrupt+0x6a/0x80
> [ 772.459937] softirqs last enabled at (0): [<ffffffff940764d5>] copy_process.part.26+0x635/0x1d80
> [ 772.459968] softirqs last disabled at (0): [< (null)>] (null)
> [ 772.459996] CPU: 3 PID: 31400 Comm: modprobe Not tainted 3.18.0-rc4+ #85 [loadavg: 207.70 163.33 92.64 11/433 31547]
> [ 772.460086] task: ffff88022f0b2f00 ti: ffff88019a944000 task.ti: ffff88019a944000
> [ 772.460110] RIP: 0010:[<ffffffff941a437e>] [<ffffffff941a437e>] context_tracking_user_enter+0x9e/0x2c0
> [ 772.460142] RSP: 0018:ffff88019a947f00 EFLAGS: 00000282
> [ 772.460161] RAX: ffff88022f0b2f00 RBX: 0000000000000000 RCX: 0000000000000000
> [ 772.460184] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88022f0b2f00
> [ 772.460207] RBP: ffff88019a947f10 R08: 0000000000000000 R09: 0000000000000000
> [ 772.460229] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88019a947e90
> [ 772.460252] R13: ffffffff940f6d04 R14: ffff88019a947ec0 R15: ffff8802447cd640
> [ 772.460294] FS: 00007f3b71ee4700(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> [ 772.460362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 772.460391] CR2: 00007fffdad5af58 CR3: 000000011608e000 CR4: 00000000001407e0
> [ 772.460424] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 772.460447] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 772.460470] Stack:
> [ 772.460480] ffff88019a947f58 00000000006233a8 ffff88019a947f40 ffffffff9401429d
> [ 772.460512] 00000000006233a8 000000000041d68a 00000000006233a8 0000000000000000
> [ 772.460543] 00000000006233a0 ffffffff94874fa4 000000001008feff 000507d93d73a434
> [ 772.460574] Call Trace:
> [ 772.461576] [<ffffffff9401429d>] syscall_trace_leave+0xad/0x2e0
> [ 772.462572] [<ffffffff94874fa4>] int_check_syscall_exit_work+0x34/0x3d
> [ 772.463575] Code: f8 1c 00 84 c0 75 46 48 c7 c7 51 53 cd 94 e8 aa 23 24 00 65 c7 04 25 f4 f8 1c 00 01 00 00 00 f6 c7 02 74 19 e8 84 43 f3 ff 53 9d <5b> 41 5c 5d c3 0f 1f 44 00 00 c3 0f 1f 80 00 00 00 00 53 9d e8
> [ 772.465797] Kernel panic - not syncing: softlockup: hung tasks
> [ 772.466821] CPU: 3 PID: 31400 Comm: modprobe Tainted: G L 3.18.0-rc4+ #85 [loadavg: 207.70 163.33 92.64 11/433 31547]
> [ 772.468915] ffff88022f0b2f00 00000000de65d5f5 ffff880244603dc8 ffffffff94869e01
> [ 772.470031] 0000000000000000 ffffffff94c7599b ffff880244603e48 ffffffff94866b21
> [ 772.471085] ffff880200000008 ffff880244603e58 ffff880244603df8 00000000de65d5f5
> [ 772.472141] Call Trace:
> [ 772.473183] <IRQ> [<ffffffff94869e01>] dump_stack+0x4f/0x7c
> [ 772.474253] [<ffffffff94866b21>] panic+0xcf/0x202
> [ 772.475346] [<ffffffff94154d1e>] watchdog_timer_fn+0x27e/0x290
> [ 772.476414] [<ffffffff94106297>] __run_hrtimer+0xe7/0x740
> [ 772.477475] [<ffffffff94106b64>] ? hrtimer_interrupt+0x94/0x270
> [ 772.478555] [<ffffffff94154aa0>] ? watchdog+0x40/0x40
> [ 772.479627] [<ffffffff94106be7>] hrtimer_interrupt+0x117/0x270
> [ 772.480703] [<ffffffff940303db>] local_apic_timer_interrupt+0x3b/0x70
> [ 772.481777] [<ffffffff948777f3>] smp_apic_timer_interrupt+0x43/0x60
> [ 772.482856] [<ffffffff94875bef>] apic_timer_interrupt+0x6f/0x80
> [ 772.483915] <EOI> [<ffffffff941a437e>] ? context_tracking_user_enter+0x9e/0x2c0
> [ 772.484972] [<ffffffff9401429d>] syscall_trace_leave+0xad/0x2e0

It looks like we are looping somewhere around syscall_trace_leave(). Maybe the
_TIF_WORK_SYSCALL_EXIT flags aren't cleared properly after some of them get
processed. Or something keeps setting a _TIF_WORK_SYSCALL_EXIT flag after they
get cleared, and we loop endlessly back to int_check_syscall_exit_work().

Andi did some work there lately. Cc'ing him.

> [ 772.486042] [<ffffffff94874fa4>] int_check_syscall_exit_work+0x34/0x3d
> [ 772.487187] Kernel Offset: 0x13000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Dave Jones

Nov 20, 2014, 11:11:23 AM11/20/14
to Vivek Goyal, Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers, WANG Chao, Baoquan He, Dave Young
On Wed, Nov 19, 2014 at 11:28:06AM -0500, Vivek Goyal wrote:

> I am wondering whether maybe in some cases we panic in the second kernel and
> sit there. Probably we should automatically append a kernel command line
> option, say "panic=1", so that it reboots itself if the second kernel panics.
>
> By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
> disable that, as currently the kexec/kdump stuff does not work with it. It
> hangs very early in the boot process, and I had to hook up a serial console
> to get the following message.

I did have that enabled. (Perhaps the kconfig should conflict?)

After rebuilding without it, this..

> > dracut: *** Stripping files done ***
> > dracut: *** Store current command line parameters ***
> > dracut: *** Creating image file ***
> > dracut: *** Creating image file done ***
> > kdumpctl: cat: write error: Broken pipe
> > kdumpctl: kexec: failed to load kdump kernel
> > kdumpctl: Starting kdump: [FAILED]

went away. It generated the image, and things looked good.
I did echo c > /proc/sysrq-trigger and got this..

SysRq : Trigger a crash
BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1192
in_atomic(): 0, irqs_disabled(): 0, pid: 8860, name: bash
3 locks held by bash/8860:
#0: (sb_writers#5){......}, at: [<ffffffff811eac13>] vfs_write+0x1b3/0x1f0
#1: (rcu_read_lock){......}, at: [<ffffffff8144a435>] __handle_sysrq+0x5/0x1b0
#2: (&mm->mmap_sem){......}, at: [<ffffffff8103cb20>] __do_page_fault+0x140/0x600
Preemption disabled at:[<ffffffff817ca332>] printk+0x5c/0x72

CPU: 1 PID: 8860 Comm: bash Not tainted 3.18.0-rc5+ #95 [loadavg: 0.54 0.24 0.09 2/143 8909]
00000000000004a8 00000000e1f75c1b ffff880236473c28 ffffffff817ce5c7
0000000000000000 0000000000000000 ffff880236473c58 ffffffff8109af8a
ffff880236473c58 0000000000000029 0000000000000000 ffff880236473d88
Call Trace:
[<ffffffff817ce5c7>] dump_stack+0x4f/0x7c
[<ffffffff8109af8a>] __might_sleep+0x12a/0x190
[<ffffffff8103cb3b>] __do_page_fault+0x15b/0x600
[<ffffffff811613b2>] ? irq_work_queue+0x62/0xd0
[<ffffffff8137ad7d>] ? trace_hardirqs_off_thunk+0x3a/0x3f
[<ffffffff8103cfec>] do_page_fault+0xc/0x10
[<ffffffff817dbcf2>] page_fault+0x22/0x30
[<ffffffff817ca332>] ? printk+0x5c/0x72
[<ffffffff81449ce6>] ? sysrq_handle_crash+0x16/0x20
[<ffffffff8144a567>] __handle_sysrq+0x137/0x1b0
[<ffffffff8144a435>] ? __handle_sysrq+0x5/0x1b0
[<ffffffff8144aa4a>] write_sysrq_trigger+0x4a/0x50
[<ffffffff81259f2d>] proc_reg_write+0x3d/0x80
[<ffffffff811eab1a>] vfs_write+0xba/0x1f0
[<ffffffff811eb628>] SyS_write+0x58/0xd0
[<ffffffff817da052>] system_call_fastpath+0x12/0x17
Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 1 PID: 8860 Comm: bash Not tainted 3.18.0-rc5+ #95 [loadavg: 0.54 0.24 0.09 1/143 8909]
task: ffff8800a1a60000 ti: ffff880236470000 task.ti: ffff880236470000
RIP: 0010:[<ffffffff81449ce6>] [<ffffffff81449ce6>] sysrq_handle_crash+0x16/0x20
RSP: 0018:ffff880236473e38 EFLAGS: 00010246
RAX: 000000000000000f RBX: ffffffff81cb4a00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff817ca332 RDI: 0000000000000063
RBP: ffff880236473e38 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000358 R11: 0000000000000357 R12: 0000000000000063
R13: 0000000000000000 R14: 0000000000000007 R15: 0000000000000000
FS: 00007fc652f4e740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000023a3b2000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffff880236473e78 ffffffff8144a567 ffffffff8144a435 0000000000000002
0000000000000002 00007fc652f51000 0000000000000002 ffff880236473f48
ffff880236473ea8 ffffffff8144aa4a 0000000000000002 00007fc652f51000
Call Trace:
[<ffffffff8144a567>] __handle_sysrq+0x137/0x1b0
[<ffffffff8144a435>] ? __handle_sysrq+0x5/0x1b0
[<ffffffff8144aa4a>] write_sysrq_trigger+0x4a/0x50
[<ffffffff81259f2d>] proc_reg_write+0x3d/0x80
[<ffffffff811eab1a>] vfs_write+0xba/0x1f0
[<ffffffff811eb628>] SyS_write+0x58/0xd0
[<ffffffff817da052>] system_call_fastpath+0x12/0x17
Code: 01 f4 45 39 a5 b4 00 00 00 75 e2 4c 89 ef e8 d2 f7 ff ff eb d8 0f 1f 44 00 00 55 c7 05 08 b7 7e 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 0f 1f 44 00 00 55 31 c0 48 89 e5
RIP [<ffffffff81449ce6>] sysrq_handle_crash+0x16/0x20
RSP <ffff880236473e38>
CR2: 0000000000000000

Which, aside from the sleeping-while-atomic thing (which isn't important),
does what I expected. Shortly after, it rebooted.

And then /var/crash was empty.

Dave Jones

unread,
Nov 20, 2014, 11:19:43 AM11/20/14
to Frederic Weisbecker, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
If that's true, that changes everything, and this might be more
bisectable. I did the test above on 3.17, but perhaps I should
try a run on 3.17.3.

Frederic Weisbecker

unread,
Nov 20, 2014, 11:42:41 AM11/20/14
to Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers
It might not be easier to bisect, because stable is a separate branch from the next -rc1.
And the above got fixed in -rc1, perhaps in the same merge window that introduced the new,
different issues. So you'll probably need to suppress the above issue in order to
bisect the others.

What you can do is bisect, and before every build apply the patches that
fix the above issue in -stable (those I just enumerated to gregkh in our
discussion with him). There are only 4. Just apply all of them before each
build, unless they are already applied.

I could give you a much simpler hack, but I fear it may apply chaotically depending on
whether the real fixes are applied fully, halfway, or not at all, with unpredictable
results. So let's stick with what we know to work.

Vivek Goyal

unread,
Nov 20, 2014, 11:48:30 AM11/20/14
to Dave Jones, Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers, WANG Chao, Baoquan He, Dave Young
On Thu, Nov 20, 2014 at 11:10:55AM -0500, Dave Jones wrote:
> On Wed, Nov 19, 2014 at 11:28:06AM -0500, Vivek Goyal wrote:
>
> > I am wondering whether in some cases we panic in the second kernel and
> > sit there. Probably we should automatically append a kernel command line
> > option, say "panic=1", so that it reboots itself if the second kernel
> > panics.
> >
> > By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
> > disable it, as kexec/kdump currently does not work with it. It hangs very
> > early in the boot process, and I had to hook up a serial console to get
> > the following message.
>
> I did have that enabled. (Perhaps the kconfig should conflict?)

Hi Dave,

Actually, kexec/kdump allows booting into a different kernel than the
running kernel. So one could have KEXEC and CONFIG_RANDOMIZE_BASE enabled
in the kernel at the same time but still boot into a kernel with
CONFIG_RANDOMIZE_BASE=n, and that should work. CONFIG_RANDOMIZE_BASE is
only a problem if it is enabled in the second kernel, so a kconfig
conflict might not be a good fit here.

These messages came from the first kernel. I think we failed very early
in the second kernel's boot.

Can we try the following and see if some additional messages show up on
the console and help us narrow down the problem?

- Enable verbose boot messages. CONFIG_X86_VERBOSE_BOOTUP=y

- Enable early printk in the second kernel (earlyprintk=ttyS0,115200).

You can either enable early printk in the first kernel and reboot; that
way the second kernel will automatically have it enabled. Or you can edit
"/etc/sysconfig/kdump" and append earlyprintk=<> to KDUMP_COMMANDLINE_APPEND.
You will need to restart the kdump service after this.

- Enable some debug output at runtime from the kexec purgatory. For that,
one needs to pass additional arguments to /sbin/kexec. You can edit the
/etc/sysconfig/kdump file and modify "KEXEC_ARGS" to pass additional
arguments to /sbin/kexec during kernel load. I use the following for my
serial console.

KEXEC_ARGS="--console-serial --serial=0x3f8 --serial-baud=115200"

You will need to restart the kdump service.

I hope the above gives us some information to work with, so we can figure
out where we failed while booting into the second kernel.

Thanks
Vivek

Vivek Goyal

unread,
Nov 20, 2014, 11:54:20 AM11/20/14
to Dave Jones, Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers, WANG Chao, Baoquan He, Dave Young
On Thu, Nov 20, 2014 at 11:10:55AM -0500, Dave Jones wrote:
> On Wed, Nov 19, 2014 at 11:28:06AM -0500, Vivek Goyal wrote:
>
> > I am wondering whether in some cases we panic in the second kernel and
> > sit there. Probably we should automatically append a kernel command line
> > option, say "panic=1", so that it reboots itself if the second kernel
> > panics.
> >
> > By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
> > disable it, as kexec/kdump currently does not work with it. It hangs very
> > early in the boot process, and I had to hook up a serial console to get
> > the following message.
>
> I did have that enabled. (Perhaps the kconfig should conflict?)

Hi Dave,

Can you please also send me your kernel config file? I will try it on my
machine and see if I can reproduce the problem there.

Thanks
Vivek

Dave Jones

unread,
Nov 20, 2014, 12:38:42 PM11/20/14
to Vivek Goyal, Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers, WANG Chao, Baoquan He, Dave Young
On Thu, Nov 20, 2014 at 11:48:09AM -0500, Vivek Goyal wrote:

> Can we try following and retry and see if some additional messages show
> up on console and help us narrow down the problem.
>
> - Enable verbose boot messages. CONFIG_X86_VERBOSE_BOOTUP=y
>
> - Enable early printk in second kernel. (earlyprintk=ttyS0,115200).
>
> You can either enable early printk in first kernel and reboot. That way
> second kernel will automatically have it enabled. Or you can edit
> "/etc/sysconfig/kdump" and append earlyprintk=<> to KDUMP_COMMANDLINE_APPEND.
> You will need to restart kdump service after this.
>
> - Enable some debug output during runtime from kexec purgatory. For that one
> needs to pass additional arguments to /sbin/kexec. You can edit
> /etc/sysconfig/kdump file and modify "KEXEC_ARGS" to pass additional
> arguments to /sbin/kexec during kernel load. I use following for my
> serial console.
>
> KEXEC_ARGS="--console-serial --serial=0x3f8 --serial-baud=115200"
>
> You will need to restart kdump service.

The only serial port on this machine is USB serial, which doesn't have I/O ports.

From my reading of the kexec man page, it doesn't look like I can tell
it to use ttyUSB0.

And because it relies on USB being initialized, this probably isn't
going to help much with early boot.

earlyprintk=tty0 didn't show anything extra after the sysrq-c oops;
likewise =ttyUSB0.

I'm going to try bisecting the problem I'm debugging again, so I'm not
going to dig into this much more today.

Dave

Linus Torvalds

unread,
Nov 20, 2014, 2:43:17 PM11/20/14
to Dave Jones, Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Thu, Nov 20, 2014 at 7:25 AM, Dave Jones <da...@redhat.com> wrote:
>
> Disabling CONTEXT_TRACKING didn't change the problem.
> > Unfortunately the full trace didn't make it over usb-serial this time. Grr.
>
> Here's what came over serial..
>
> NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c35:11634]
> RIP: 0010:[<ffffffff88379605>] [<ffffffff88379605>] copy_user_enhanced_fast_string+0x5/0x10
> RAX: ffff880220eb4000 RBX: ffffffff887dac64 RCX: 0000000000006a18
> RDX: 000000000000e02f RSI: 00007f766f466620 RDI: ffff88016f6a7617
> RBP: ffff880220eb7f78 R08: 8000000000000063 R09: 0000000000000004
> Call Trace:
> [<ffffffff882f4225>] ? SyS_add_key+0xd5/0x240
> [<ffffffff8837adae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff887da092>] system_call_fastpath+0x12/0x17

Ok, that's just about half-way through a ~57kB memory copy (you can see it
in the register state: %rdx contains the original size of the key
payload, %rcx the current remaining size: 57kB total, 27kB
left).

And it's holding absolutely zero locks, and not even doing anything
odd. It wasn't doing anything particularly odd before either, although
the kmalloc() of a 64kB area might just have caused a fair amount of
VM work, of course.

You know what? I'm seriously starting to think that these bugs aren't
actually real. Or rather, I don't think it's really a true softlockup,
because most of them seem to happen in totally harmless code.

So I'm wondering whether the real issue might not be just this:

[loadavg: 164.79 157.30 155.90 37/409 11893]

together with possibly a scheduler issue and/or a bug in the smpboot
thread logic (that the watchdog uses) or similar.

That's *especially* true if it turns out that the 3.17 problem you saw
was actually a perf bug that has already been fixed and is in stable.
We've been looking at kernel/smp.c changes, and looking for x86 IPI or
APIC changes, and found some suspicious (though, at least on x86,
harmless) code, and this exercise might be worth it for that reason. But
what if it's really just a scheduler regression?

There have been a *lot* more scheduler changes since 3.17 than the small
things we've looked at for x86 entry or IPI handling. And the
scheduler changes have been about things like overloaded scheduling
groups etc, and I could easily imagine that some bug *there* ends up
causing the watchdog process not to schedule.

Hmm? Scheduler people?

Linus

Dave Jones

unread,
Nov 20, 2014, 3:06:48 PM11/20/14
to Linus Torvalds, Andy Lutomirski, Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Thu, Nov 20, 2014 at 11:43:07AM -0800, Linus Torvalds wrote:

> You know what? I'm seriously starting to think that these bugs aren't
> actually real. Or rather, I don't think it's really a true softlockup,
> because most of them seem to happen in totally harmless code.
>
> So I'm wondering whether the real issue might not be just this:
>
> [loadavg: 164.79 157.30 155.90 37/409 11893]
>
> together with possibly a scheduler issue and/or a bug in the smpboot
> thread logic (that the watchdog uses) or similar.
>
> That's *especially* true if it turns out that the 3.17 problem you saw
> was actually a perf bug that has already been fixed and is in stable.
> We've been looking at kernel/smp.c changes, and looking for x86 IPI or
> APIC changes, and found some harmlessly (at least on x86) suspicious
> code and this exercise might be worth it for that reason, but what if
> it's really just a scheduler regression.

I started a run against 3.17 with the perf fixes. If that survives
today, I'll start a bisection tomorrow.

> There's been a *lot* more scheduler changes since 3.17 than the small
> things we've looked at for x86 entry or IPI handling. And the
> scheduler changes have been about things like overloaded scheduling
> groups etc, and I could easily imaging that some bug *there* ends up
> causing the watchdog process not to schedule.

One other data point: I put another box into service for testing,
but it's considerably slower (a ~6 year old Xeon vs the Haswell).
Maybe it's just that it's so much slower that it'll take longer (or slow
enough that the bug is masked), but that machine hasn't had a problem yet
in almost a day of runtime.

Dave

Linus Torvalds

unread,
Nov 20, 2014, 3:33:31 PM11/20/14
to Thomas Gleixner, Andy Lutomirski, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Wed, Nov 19, 2014 at 2:13 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
>
> Right, while it is wrong it does not explain the wreckage on 3.17,
> which does not have that code.

Thomas, I'm currently going off the assumption that I'll see this from
the x86 trees, and I can ignore the patch. It doesn't seem like this
is a particularly pressing bug.

If it's *not* going to show up as a pull request, holler, and I'll
just apply it.

Linus

Don Zickus

unread,
Nov 20, 2014, 3:39:02 PM11/20/14
to Linus Torvalds, Dave Jones, Andy Lutomirski, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Thu, Nov 20, 2014 at 11:43:07AM -0800, Linus Torvalds wrote:
> On Thu, Nov 20, 2014 at 7:25 AM, Dave Jones <da...@redhat.com> wrote:
> >
> > Disabling CONTEXT_TRACKING didn't change the problem.
> > Unfortunately the full trace didn't make it over usb-serial this time. Grr.
> >
> > Here's what came over serial..
> >
> > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c35:11634]
> > RIP: 0010:[<ffffffff88379605>] [<ffffffff88379605>] copy_user_enhanced_fast_string+0x5/0x10
> > RAX: ffff880220eb4000 RBX: ffffffff887dac64 RCX: 0000000000006a18
> > RDX: 000000000000e02f RSI: 00007f766f466620 RDI: ffff88016f6a7617
> > RBP: ffff880220eb7f78 R08: 8000000000000063 R09: 0000000000000004
> > Call Trace:
> > [<ffffffff882f4225>] ? SyS_add_key+0xd5/0x240
> > [<ffffffff8837adae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [<ffffffff887da092>] system_call_fastpath+0x12/0x17
>
> Ok, that's just about half-way in a ~57kB memory copy (you can see it
> in the register state: %rdx contains the original size of the key
> payload, rcx contains the current remaining size: 57kB total, 27kB
> left).
>
> And it's holding absolutely zero locks, and not even doing anything
> odd. It wasn't doing anything particularly odd before either, although
> the kmalloc() of a 64kB area might just have caused a fair amount of
> VM work, of course.

Just for clarification: softlockups are processes hogging the CPU (thus
blocking the high-priority per-cpu watchdog thread).

Hardlockups, on the other hand, are CPUs with interrupts disabled for too
long (thus blocking the timer interrupt).

That might coincide with your scheduler theory below. Don't know.

Cheers,
Don

Linus Torvalds

unread,
Nov 20, 2014, 3:51:26 PM11/20/14
to Don Zickus, Dave Jones, Andy Lutomirski, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra
On Thu, Nov 20, 2014 at 12:37 PM, Don Zickus <dzi...@redhat.com> wrote:
>
> Just for clarification, softlockups are processes hogging the cpu (thus
> blocking the high priority per-cpu watchdog thread).

Right. And there is no actual sign of any CPU hogging going on.
There's a single system call with a small payload (I think it's safe
to call 64kB small these days), no hugely contended CPU-spinning
locking, nada.

Thomas Gleixner

unread,
Nov 20, 2014, 4:58:36 PM11/20/14
to Tejun Heo, Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo
On Thu, 20 Nov 2014, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 12:50:36AM +0100, Frederic Weisbecker wrote:
> > > Are we talking about different per cpu allocators here or am I missing
> > > something completely non obvious?
> >
> > That's the same allocator yeah. So if the whole memory is dereferenced,
> > faults shouldn't happen indeed.
> >
> > Maybe that was a bug a few years ago but not anymore.
>
> It has been always like that tho. Percpu memory given out is always
> populated and cleared.
>
> > Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
> > After all it's allocated with vzalloc() so that part could be skipped. The memset(0)
>
> The vzalloc call is for the internal allocation bitmap not the actual
> percpu memory area. The actual address areas for percpu memory are
> obtained using pcpu_get_vm_areas() call and later get populated using
> map_kernel_range_noflush() (flush is performed after mapping is
> complete).
>
> Trying to remember what happens with vmalloc_fault(). Ah okay, so
> when a new PUD gets created for vmalloc area, we don't go through all
> PGDs and update them. The PGD entries get faulted in lazily. Percpu
> memory allocator clearing or not clearing the allocated area doesn't
> have anything to do with it. The memory area is always fully
> populated in the kernel page table. It's just that the population
> happened while a different PGD was active and this PGD hasn't been
> populated with the new PUD yet.

It's completely undocumented behaviour, whether it has been that way
forever or not. And I agree with Frederic that it is insane. Actually,
it's beyond insane, really.

> So, yeap, vmalloc_fault() can always happen when accessing vmalloc
> areas and the only way to avoid that would be removing lazy PGD
> population - going through all PGDs and populating new PUDs
> immediately.

There is no requirement to go through ALL PGDs and populate that stuff
immediately.

Lets look at the two types of allocations

1) Kernel percpu allocations

2) Per process/task percpu allocations

Of course we do not have a way to distinguish those, but we really
should have one.

#1 Kernel percpu allocations usually happen in the context of driver
bringup, subsystem initialization, interrupt setup etc.

So this is functionality which is not a hotpath and usually
requires some form of synchronization versus the rest of the system
anyway.

The per cpu population stuff is serialized with a mutex anyway, so
what's wrong with having a globally visible percpu sequence counter
which is incremented whenever a new allocation is populated or torn
down?

We can make that sequence counter a per cpu variable as well to
avoid the issues of a global variable (preferably a compile/boot
time allocated percpu variable, to avoid the obvious circulus
vitiosus).

Now after that increment the allocation side needs to wait for a
scheduling cycle on all cpus (we have mechanisms for that)

So in the scheduler if the same task gets reselected you check that
sequence count and update the PGD if different. If a task switch
happens then you also need to check the sequence count and act
accordingly.

If we make the sequence counter a percpu variable as outlined above
the overhead of checking this is just noise versus the other
nonsense we do in schedule().


#2 That's process related statistics and instrumentation stuff.

Now that just needs an immediate population of process->mm->pgd
aside from init_mm.pgd, but that's really not a big deal.

Of course that does not solve the issues we have with the current
infrastructure retroactively, but it allows us to avoid fuckups like
the one Frederic was talking about, where perf invented its own
kmalloc-based 'percpu' replacement just to work around the shortcoming
in a particular place.

What really frightens me is the well-hidden fuckup potential which
lurks around the corner, and the hard-to-debug, once-in-a-while
fallout which might be caused by this.

Thanks,

tglx

tip-bot for Andy Lutomirski

unread,
Nov 20, 2014, 5:05:21 PM11/20/14
to linux-ti...@vger.kernel.org, pet...@infradead.org, dzi...@redhat.com, mi...@kernel.org, lu...@amacapital.net, da...@redhat.com, torv...@linux-foundation.org, h...@zytor.com, linux-...@vger.kernel.org, tg...@linutronix.de
Commit-ID: b5e212a3051b65e426a513901d9c7001681c7215
Gitweb: http://git.kernel.org/tip/b5e212a3051b65e426a513901d9c7001681c7215
Author: Andy Lutomirski <lu...@amacapital.net>
AuthorDate: Wed, 19 Nov 2014 13:56:19 -0800
Committer: Thomas Gleixner <tg...@linutronix.de>
CommitDate: Thu, 20 Nov 2014 23:01:53 +0100

x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1

TIF_NOHZ is 19 (i.e. _TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME |
_TIF_SINGLESTEP), not (1<<19).

This code is involved in Dave's trinity lockup, but I don't see why
it would cause any of the problems he's seeing, except inadvertently
by causing a different path through entry_64.S's syscall handling.

Signed-off-by: Andy Lutomirski <lu...@amacapital.net>
Cc: Don Zickus <dzi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Dave Jones <da...@redhat.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Link: http://lkml.kernel.org/r/a6cd3b60a3f53afb6e1c8081b0ec30...@amacapital.net
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/x86/kernel/ptrace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 749b0e4..e510618 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1484,7 +1484,7 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		user_exit();
-		work &= ~TIF_NOHZ;
+		work &= ~_TIF_NOHZ;
 	}
 
 #ifdef CONFIG_SECCOMP

Andy Lutomirski

unread,
Nov 20, 2014, 5:06:59 PM11/20/14
to Thomas Gleixner, Tejun Heo, Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers, Peter Zijlstra, Arnaldo Carvalho de Melo
This seems like a reasonable idea, but I'd suggest a minor change:
rather than using a sequence number, track the number of kernel pgds.
That number should rarely change, and it's only one byte long. That
means that we can easily stick it in mm_context_t without making it
any bigger.

The count for init_mm could be copied into cpu_tlbstate, which is
always hot on context switch.

>
>
> #2 That's process related statistics and instrumentation stuff.
>
> Now that just needs a immediate population on the process->mm->pgd
> aside of the init_mm.pgd, but that's really not a big deal.
>
> Of course that does not solve the issues we have with the current
> infrastructure retroactively, but it allows us to avoid fuckups like
> the one Frederic was talking about that perf invented its own kmalloc
> based 'percpu' replacement just to workaround the shortcoming in a
> particular place.
>
> What really frightens me is the potential and well hidden fuckup
> potential which lurks around the corner and the hard to debug once in
> a while fallout which might be caused by this.

The annoying part of this is that pgd allocation is *so* rare that
bugs here can probably go unnoticed for a long time.

--Andy