Re: proc_flush_task oops

Eric W. Biederman

unread,

Dec 19, 2017, 8:54:52 PM12/19/17

to Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com

ebie...@xmission.com (Eric W. Biederman) writes:

> Dave Jones <da...@codemonkey.org.uk> writes:
>
>> On Tue, Dec 19, 2017 at 12:27:30PM -0600, Eric W. Biederman wrote:
>> > Dave Jones <da...@codemonkey.org.uk> writes:
>> >
>> > > On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
>> > >
>> > > > But I don't see what would have changed in this area recently.
>> > > >
>> > > > Do you end up saving the seeds that cause crashes? Is this
>> > > > reproducible? (Other than seeing it twoce, of course)
>> > >
>> > > Only clue so far, is every time I'm able to trigger it, the last thing
>> > > the child process that triggers it did, was an execveat.
>> >
>> > Is there any chance the excveat might be called from a child thread?
>>
>> If trinity choose one of the exec syscalls, it forks off an extra child
>> to do it in, on the off-chance that it succeeds, and we never return.
>> https://github.com/kernelslacker/trinity/blob/master/syscall.c#L139
>
> extrapid = fork();
> if (extrapid == 0) {
> /* grand-child */
> char childname[]="trinity-subchild";
> prctl(PR_SET_NAME, (unsigned long) &childname);
>
> __do_syscall(rec, GOING_AWAY);
> /* if this was for eg. an successful execve, we should never get here.
> * if it failed though... */
> _exit(EXIT_SUCCESS);
> }
>
> That is interesting.
>
>
> So the system call sequence is a fork which just succeeded and than an
> exec. That reduces the possibilities quite a lot.
>
> With pids there was a recent change that just replaced the pid hash
> table and the pid bitmap with and idr. It changes the locking somewhat
> and probably changes the timing so that might be the culprit.
>
> I am trying to figure out if there is an interface that would let
> ns_last_pid for a pid namespace be accessed before the first pid is
> allocated and I am not seeing it. It does not appear to be possible
> to mount a proc for a pid namespace you are not currently in.
>
> *Scratches my head* I am not seeing anything obvious.

Can you try this patch as you reproduce this issue?

diff --git a/kernel/pid.c b/kernel/pid.c
index b13b624e2c49..df9e5d4d8f83 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -210,6 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
goto out_unlock;
for ( ; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */
+ WARN_ON(!upid->ns->proc_mnt);
idr_replace(&upid->ns->idr, pid, upid->nr);
upid->ns->pid_allocated++;
}

If the warning triggers it means the bug is in alloc_pid and somehow
something has gotten past the is_child_reaper check.

If the warning does not trigger it means something is stomping proc_mnt.

In the entire kernel there are exactly two assignments to proc_mnt.
- kmem_cache_zalloc in create_pid_namespace.
- In pid_ns_prepare_proc where proc_mnt is set to a non-zero value.

On the 29th of Nov syzkaller also hit this and gave me this reproducer
that I can't figure out heads or tails of.

#{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:namespace Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf72, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
socket$inet6_dccp(0xa, 0x6, 0x0)
unshare(0x20000400)
sendmsg$unix(0xffffffffffffffff, &(0x7f0000001000-0x38)={&(0x7f0000239000-0x8)=@abs={0x0, 0x0, 0x0}, 0x8, &(0x7f0000008000)=[], 0x0, &(0x7f0000001000-0x10)=[@rights={0x200, 0x1, 0x1, [0xffffffffffffffff]}], 0x1, 0x0}, 0x0)
process_vm_writev(0x0, &(0x7f0000699000-0x70)=[{&(0x7f00006a5000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x4c}, {&(0x7f00007b9000-0x54)="", 0x0}, {&(0x7f00004f3000)="", 0x0}, {&(0x7f00002e3000-0xd6)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xd6}, {&(0x7f0000f2e000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x52}, {&(0x7f00008e5000-0x10)="00000000000000000000000000000000", 0x10}, {&(0x7f0000a3a000)="", 0x0}], 0x7, &(0x7f0000d05000)=[{&(0x7f0000d64000)="", 0x0}, {&(0x7f0000062000-0x93)="", 0x0}, {&(0x7f0000a16000-0x7e)="000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x7e}, {&(0x7f00003dc000-0x9a)="", 0x0}, {&(0x7f0000fe3000-0xc7)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xc7}], 0x5, 0x0)
pselect6(0x40, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20000, 0x0}, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f00000de000-0x40)={0xffffffffffffffe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, &(0x7f00008e6000-0x10)={0x0, 0x989680}, &(0x7f0000205000-0x10)={&(0x7f00006e4000-0x8)={0x0}, 0x8})
clone(0x20900, &(0x7f0000a94000-0x1)="6f", &(0x7f00002b8000-0x4)=0x0, &(0x7f000029e000)=0x0, &(0x7f00006fe000)="")
ioctl$KVM_ENABLE_CAP_CPU(0xffffffffffffffff, 0x4068aea3, &(0x7f0000e48000)={0x7b, 0x0, [0x1, 0x1, 0x800, 0x1], [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]})
epoll_ctl$EPOLL_CTL_DEL(0xffffffffffffffff, 0x2, 0xffffffffffffffff)

#{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:namespace Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf72, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
socket$inet6_dccp(0xa, 0x6, 0x0)
unshare(0x20000400)
sendmsg$unix(0xffffffffffffffff, &(0x7f0000001000-0x38)={&(0x7f0000239000-0x8)=@abs={0x0, 0x0, 0x0}, 0x8, &(0x7f0000008000)=[], 0x0, &(0x7f0000001000-0x10)=[@rights={0x200, 0x1, 0x1, [0xffffffffffffffff]}], 0x1, 0x0}, 0x0)
process_vm_writev(0x0, &(0x7f0000699000-0x70)=[{&(0x7f00006a5000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x4c}, {&(0x7f00007b9000-0x54)="", 0x0}, {&(0x7f00004f3000)="", 0x0}, {&(0x7f00002e3000-0xd6)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xd6}, {&(0x7f0000f2e000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x52}, {&(0x7f00008e5000-0x10)="00000000000000000000000000000000", 0x10}, {&(0x7f0000a3a000)="", 0x0}], 0x7, &(0x7f0000d05000)=[{&(0x7f0000d64000)="", 0x0}, {&(0x7f0000062000-0x93)="", 0x0}, {&(0x7f0000a16000-0x7e)="000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x7e}, {&(0x7f00003dc000-0x9a)="", 0x0}, {&(0x7f0000fe3000-0xc7)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xc7}], 0x5, 0x0)
pselect6(0x40, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20000, 0x0}, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f00000de000-0x40)={0xffffffffffffffe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, &(0x7f00008e6000-0x10)={0x0, 0x989680}, &(0x7f0000205000-0x10)={&(0x7f00006e4000-0x8)={0x0}, 0x8})
clone(0x20900, &(0x7f0000a94000-0x1)="6f", &(0x7f00002b8000-0x4)=0x0, &(0x7f000029e000)=0x0, &(0x7f00006fe000)="")
ioctl$KVM_ENABLE_CAP_CPU(0xffffffffffffffff, 0x4068aea3, &(0x7f0000e48000)={0x7b, 0x0, [0x1, 0x1, 0x800, 0x1], [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]})
epoll_ctl$EPOLL_CTL_DEL(0xffffffffffffffff, 0x2, 0xffffffffffffffff)

Eric

Dave Jones

unread,

Dec 20, 2017, 12:28:05 AM12/20/17

to Eric W. Biederman, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com

On Tue, Dec 19, 2017 at 07:54:24PM -0600, Eric W. Biederman wrote:

> > *Scratches my head* I am not seeing anything obvious.
>
> Can you try this patch as you reproduce this issue?
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index b13b624e2c49..df9e5d4d8f83 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -210,6 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> goto out_unlock;
> for ( ; upid >= pid->numbers; --upid) {
> /* Make the PID visible to find_pid_ns. */
> + WARN_ON(!upid->ns->proc_mnt);
> idr_replace(&upid->ns->idr, pid, upid->nr);
> upid->ns->pid_allocated++;
> }
>
>
> If the warning triggers it means the bug is in alloc_pid and somehow
> something has gotten past the is_child_reaper check.

You're onto something.

WARNING: CPU: 1 PID: 12020 at kernel/pid.c:213 alloc_pid+0x230/0x280
CPU: 1 PID: 12020 Comm: trinity-c29 Not tainted 4.15.0-rc4-think+ #3
RIP: 0010:alloc_pid+0x230/0x280
RSP: 0018:ffffc90009977d48 EFLAGS: 00010046
RAX: 0000000000000030 RBX: ffff8804fb431280 RCX: 8f5c28f5c28f5c29
RDX: ffff88050a00de40 RSI: ffffffff82005218 RDI: ffff8804fc6aa9a8
RBP: ffff8804fb431270 R08: 0000000000000000 R09: 0000000000000001
R10: ffffc90009977cc0 R11: eab94e31da7171b7 R12: ffff8804fb431260
R13: ffff8804fb431240 R14: ffffffff82005200 R15: ffff8804fb431268
FS: 00007f49b9065700(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f49b906a000 CR3: 00000004f7446001 CR4: 00000000001606e0
DR0: 00007f0b4c405000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
copy_process.part.41+0x14fa/0x1e30
_do_fork+0xe7/0x720
? rcu_read_lock_sched_held+0x6c/0x80
? syscall_trace_enter+0x2d7/0x340
do_syscall_64+0x60/0x210
entry_SYSCALL64_slow_path+0x25/0x25

followed immediately by...

Oops: 0000 [#1] SMP
CPU: 1 PID: 12020 Comm: trinity-c29 Tainted: G W 4.15.0-rc4-think+ #3
RIP: 0010:proc_flush_task+0x8e/0x1b0
RSP: 0018:ffffc90009977c40 EFLAGS: 00010286
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
RDX: 0000000000000000 RSI: ffffc90009977c50 RDI: 0000000000000000
RBP: ffffc90009977c63 R08: 0000000000000000 R09: 0000000000000002
R10: ffffc90009977b70 R11: ffffc90009977c64 R12: 0000000000000004
R13: 0000000000000000 R14: 0000000000000004 R15: ffff8804fb431240
FS: 00007f49b9065700(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000004f7446001 CR4: 00000000001606e0
DR0: 00007f0b4c405000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
? release_task+0xaf/0x680
release_task+0xd2/0x680
? wait_consider_task+0xb82/0xce0
wait_consider_task+0xbe9/0xce0
? do_wait+0xe1/0x330
do_wait+0x151/0x330
kernel_wait4+0x8d/0x150
? task_stopped_code+0x50/0x50
SYSC_wait4+0x95/0xa0
? rcu_read_lock_sched_held+0x6c/0x80
? syscall_trace_enter+0x2d7/0x340
? do_syscall_64+0x60/0x210
do_syscall_64+0x60/0x210
entry_SYSCALL64_slow_path+0x25/0x25

Dmitry Vyukov

unread,

Dec 20, 2017, 3:00:33 AM12/20/17

to Eric W. Biederman, Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com

You can ask syzbot to test a patch if there is a reproducer.
Instructions are at the bottom of the report email.

> #{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:namespace Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
> mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
> perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf72, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
> socket$inet6_dccp(0xa, 0x6, 0x0)
> unshare(0x20000400)
> sendmsg$unix(0xffffffffffffffff, &(0x7f0000001000-0x38)={&(0x7f0000239000-0x8)=@abs={0x0, 0x0, 0x0}, 0x8, &(0x7f0000008000)=[], 0x0, &(0x7f0000001000-0x10)=[@rights={0x200, 0x1, 0x1, [0xffffffffffffffff]}], 0x1, 0x0}, 0x0)
> process_vm_writev(0x0, &(0x7f0000699000-0x70)=[{&(0x7f00006a5000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x4c}, {&(0x7f00007b9000-0x54)="", 0x0}, {&(0x7f00004f3000)="", 0x0}, {&(0x7f00002e3000-0xd6)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xd6}, {&(0x7f0000f2e000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x52}, {&(0x7f00008e5000-0x10)="00000000000000000000000000000000", 0x10}, {&(0x7f0000a3a000)="", 0x0}], 0x7, &(0x7f0000d05000)=[{&(0x7f0000d64000)="", 0x0}, {&(0x7f0000062000-0x93)="", 0x0}, {&(0x7f0000a16000-0x7e)="000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x7e}, {&(0x7f00003dc000-0x9a)="", 0x0}, {&(0x7f0000fe3000-0xc7)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xc7}], 0x5, 0x0)
> pselect6(0x40, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20000, 0x0}, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f00000de000-0x40)={0xffffffffffffffe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, &(0x7f00008e6000-0x10)={0x0, 0x989680}, &(0x7f0000205000-0x10)={&(0x7f00006e4000-0x8)={0x0}, 0x8})
> clone(0x20900, &(0x7f0000a94000-0x1)="6f", &(0x7f00002b8000-0x4)=0x0, &(0x7f000029e000)=0x0, &(0x7f00006fe000)="")
> ioctl$KVM_ENABLE_CAP_CPU(0xffffffffffffffff, 0x4068aea3, &(0x7f0000e48000)={0x7b, 0x0, [0x1, 0x1, 0x800, 0x1], [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]})
> epoll_ctl$EPOLL_CTL_DEL(0xffffffffffffffff, 0x2, 0xffffffffffffffff)
>
> #{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:namespace Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
> mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
> perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf72, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
> socket$inet6_dccp(0xa, 0x6, 0x0)
> unshare(0x20000400)
> sendmsg$unix(0xffffffffffffffff, &(0x7f0000001000-0x38)={&(0x7f0000239000-0x8)=@abs={0x0, 0x0, 0x0}, 0x8, &(0x7f0000008000)=[], 0x0, &(0x7f0000001000-0x10)=[@rights={0x200, 0x1, 0x1, [0xffffffffffffffff]}], 0x1, 0x0}, 0x0)
> process_vm_writev(0x0, &(0x7f0000699000-0x70)=[{&(0x7f00006a5000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x4c}, {&(0x7f00007b9000-0x54)="", 0x0}, {&(0x7f00004f3000)="", 0x0}, {&(0x7f00002e3000-0xd6)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xd6}, {&(0x7f0000f2e000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x52}, {&(0x7f00008e5000-0x10)="00000000000000000000000000000000", 0x10}, {&(0x7f0000a3a000)="", 0x0}], 0x7, &(0x7f0000d05000)=[{&(0x7f0000d64000)="", 0x0}, {&(0x7f0000062000-0x93)="", 0x0}, {&(0x7f0000a16000-0x7e)="000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x7e}, {&(0x7f00003dc000-0x9a)="", 0x0}, {&(0x7f0000fe3000-0xc7)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xc7}], 0x5, 0x0)
> pselect6(0x40, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20000, 0x0}, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f00000de000-0x40)={0xffffffffffffffe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, &(0x7f00008e6000-0x10)={0x0, 0x989680}, &(0x7f0000205000-0x10)={&(0x7f00006e4000-0x8)={0x0}, 0x8})
> clone(0x20900, &(0x7f0000a94000-0x1)="6f", &(0x7f00002b8000-0x4)=0x0, &(0x7f000029e000)=0x0, &(0x7f00006fe000)="")
> ioctl$KVM_ENABLE_CAP_CPU(0xffffffffffffffff, 0x4068aea3, &(0x7f0000e48000)={0x7b, 0x0, [0x1, 0x1, 0x800, 0x1], [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]})
> epoll_ctl$EPOLL_CTL_DEL(0xffffffffffffffff, 0x2, 0xffffffffffffffff)
>
>
> Eric
>

> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bug...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/87mv2e17vz.fsf%40xmission.com.
> For more options, visit https://groups.google.com/d/optout.

Eric W. Biederman

unread,

Dec 20, 2017, 1:26:20 PM12/20/17

to Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Alexey Dobriyan

I am not seeing where things go wrong, but that puts the recent pid bitmap, bit
hash to idr change in the suspect zone.

Can you try reverting that change:

e8cfbc245e24 ("pid: remove pidhash")
95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")

While keeping the warning in place so we can see if this fixes the
allocation problem?

Eric

Dave Jones

unread,

Dec 20, 2017, 10:16:08 PM12/20/17

to Eric W. Biederman, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Alexey Dobriyan

On Wed, Dec 20, 2017 at 12:25:52PM -0600, Eric W. Biederman wrote:
> > >
> > > If the warning triggers it means the bug is in alloc_pid and somehow
> > > something has gotten past the is_child_reaper check.
> >
> > You're onto something.
> >

> I am not seeing where things go wrong, but that puts the recent pid bitmap, bit
> hash to idr change in the suspect zone.
>
> Can you try reverting that change:
>
> e8cfbc245e24 ("pid: remove pidhash")
> 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
>
> While keeping the warning in place so we can see if this fixes the
> allocation problem?

So I can't trigger this any more with those reverted. I seem to hit a
bunch of other long-standing bugs first. I'll keep running it
overnight, but it looks like this is where the problem lies.

Dave

Eric W. Biederman

unread,

Dec 21, 2017, 3:27:23 AM12/21/17

to Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Alexey Dobriyan, Oleg Nesterov, Rik van Riel, Andrew Morton

I would really like to hear from the people who made this change if they
are interested in tracking down this failure.

It might be as simple as the locking changed enough that the locking
instrumentation is now slowing things down, and opening up an old race.

I have stared at this code, and written some test programs and I can't
see what is going on. alloc_pid by design and in implementation (as far
as I can see) is always single threaded when allocating the first pid
in a pid namespace. idr_init always initialized idr_next to 0.

So how we can get past:

if (unlikely(is_child_reaper(pid))) {
if (pid_ns_prepare_proc(ns)) {
disable_pid_allocation(ns);
goto out_free;
}
}

with proc_mnt still set to NULL is a mystery to me.

Is there any chance the idr code doesn't always return the lowest valid
free number? So init gets assigned something other than 1?

Eric

Alexey Dobriyan

unread,

Dec 21, 2017, 5:38:14 AM12/21/17

to Eric W. Biederman, Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

On 12/21/17, Eric W. Biederman <ebie...@xmission.com> wrote:
> I have stared at this code, and written some test programs and I can't
> see what is going on. alloc_pid by design and in implementation (as far
> as I can see) is always single threaded when allocating the first pid
> in a pid namespace. idr_init always initialized idr_next to 0.
>
> So how we can get past:
>
> if (unlikely(is_child_reaper(pid))) {
> if (pid_ns_prepare_proc(ns)) {
> disable_pid_allocation(ns);
> goto out_free;
> }
> }
>
> with proc_mnt still set to NULL is a mystery to me.
>
> Is there any chance the idr code doesn't always return the lowest valid
> free number? So init gets assigned something other than 1?

Well, this theory is easy to test (attached).

There is a "valid" way to break the code via kernel.ns_last_pid:
unshare+write+fork but the reproducer doesn't seem to use it (or it does?)

pid1.diff

Dave Jones

unread,

Dec 21, 2017, 9:25:38 AM12/21/17

to Alexey Dobriyan, Eric W. Biederman, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
> On 12/21/17, Eric W. Biederman <ebie...@xmission.com> wrote:
> > I have stared at this code, and written some test programs and I can't
> > see what is going on. alloc_pid by design and in implementation (as far
> > as I can see) is always single threaded when allocating the first pid
> > in a pid namespace. idr_init always initialized idr_next to 0.
> >
> > So how we can get past:
> >
> > if (unlikely(is_child_reaper(pid))) {
> > if (pid_ns_prepare_proc(ns)) {
> > disable_pid_allocation(ns);
> > goto out_free;
> > }
> > }
> >
> > with proc_mnt still set to NULL is a mystery to me.
> >
> > Is there any chance the idr code doesn't always return the lowest valid
> > free number? So init gets assigned something other than 1?
>
> Well, this theory is easy to test (attached).

I'll give this a shot and report back when I get to the office.

> There is a "valid" way to break the code via kernel.ns_last_pid:
> unshare+write+fork but the reproducer doesn't seem to use it (or it does?)

that sysctl is root only, so that isn't at play here.

Dav

Eric W. Biederman

unread,

Dec 21, 2017, 11:41:51 AM12/21/17

to Dave Jones, Alexey Dobriyan, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

ns_capable(CAP_SYS_ADMIN) will allow root in a user namespace. So the
sysctl should be fuzzable.

The ns_last_pid sysctl is still not in play because it changes
task_active_pid_ns (aka the pid namespace of the callers pid) not
pid_ns_for_children. So it still is not in play.

Every time I think of a "valid" way to break the code, I double check
myself and find there are already checks in place to prevent that.

Eric

Dave Jones

unread,

Dec 21, 2017, 5:00:47 PM12/21/17

to Alexey Dobriyan, Eric W. Biederman, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:

> > with proc_mnt still set to NULL is a mystery to me.
> >
> > Is there any chance the idr code doesn't always return the lowest valid
> > free number? So init gets assigned something other than 1?
>
> Well, this theory is easy to test (attached).

I didn't hit this BUG, but I hit the same oops in proc_flush_task.

Dave

Eric W. Biederman

unread,

Dec 21, 2017, 8:31:56 PM12/21/17

to Dave Jones, Alexey Dobriyan, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

Scratch one idea.

If it isn't too much trouble can you try this.

I am wondering if somehow the proc_mnt that is NULL is somewhere in the
middle of the stack of pid namespaces.

This adds two warnings. The first just reports which pid namespace in
the stack of pid namespaces is problematic, and the pid number in that
pid namespace. Which should give a whole lot more to go by.

The second warning complains if we manage to create a pid namespace
where the parent pid namespace is not properly set up. The test to
prevent that looks quite robust, but at this point I don't know where to
look.

Thank you very much,
Eric

diff --git a/kernel/pid.c b/kernel/pid.c
index b13b624e2c49..a1e8734afbba 100644

--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -210,6 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
goto out_unlock;
for ( ; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */

+ WARN(!upid->ns->proc_mnt, "%ld/%d: %d no proc_mnt", (upid - pid->numbers), pid->level, upid->nr);

idr_replace(&upid->ns->idr, pid, upid->nr);
upid->ns->pid_allocated++;
}

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0b53eef7d34b..8f4c02c7223a 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -136,6 +136,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
ns->pid_allocated = PIDNS_ADDING;
INIT_WORK(&ns->proc_work, proc_cleanup_work);

+ WARN_ON(!parent_pid_ns->proc_mnt);
+
return ns;

out_free_idr:

Dave Jones

unread,

Dec 21, 2017, 10:35:02 PM12/21/17

to Eric W. Biederman, Alexey Dobriyan, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

On Thu, Dec 21, 2017 at 07:31:26PM -0600, Eric W. Biederman wrote:
> Dave Jones <da...@codemonkey.org.uk> writes:
>
> > On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
> >
> > > > with proc_mnt still set to NULL is a mystery to me.
> > > >
> > > > Is there any chance the idr code doesn't always return the lowest valid
> > > > free number? So init gets assigned something other than 1?
> > >
> > > Well, this theory is easy to test (attached).
> >
> > I didn't hit this BUG, but I hit the same oops in proc_flush_task.
>
> Scratch one idea.
>
> If it isn't too much trouble can you try this.
>
> I am wondering if somehow the proc_mnt that is NULL is somewhere in the
> middle of the stack of pid namespaces.
>
> This adds two warnings. The first just reports which pid namespace in
> the stack of pid namespaces is problematic, and the pid number in that
> pid namespace. Which should give a whole lot more to go by.
>
> The second warning complains if we manage to create a pid namespace
> where the parent pid namespace is not properly set up. The test to
> prevent that looks quite robust, but at this point I don't know where to
> look.

Progress ?

[ 1653.030190] ------------[ cut here ]------------
[ 1653.030852] 1/1: 2 no proc_mnt
[ 1653.030946] WARNING: CPU: 2 PID: 4420 at kernel/pid.c:213 alloc_pid+0x24f/0x2a0

Eric W. Biederman

unread,

Dec 22, 2017, 2:58:54 AM12/22/17

to Dave Jones, Alexey Dobriyan, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

Yes. I don't know why Alexey's patch did not fire but this is
confirmation that the first pid allocated was #2 and not #1.

Which explains the pid_mnt not being set, and it is definitely the new
code, changing from the pid bitmap+hash table to an idr.

So it looks like idr_alloc_cyclic in some configuration for the first
allocation returns value #2 instead of value #1.

I don't know that code, and it is quite complicated so I will have to
stare at it some more to even guess why it is doing that.

This is confirmation that reverting those pid changes will fix the
problem. As they are definitely at fault.

Hmm. After a little more staring I have a hunch what is going wrong.
It is just possible that there is a failure in alloc_pid during the
first pid allocation and then idr_next gets left at 2. I need to sleep
before I can think of a patch to test that.

Hmm. A failure and then restart would also explain why Alexey's patch
did not fire. An incomplete reset of state.

Eric

Alexey Dobriyan

unread,

Dec 22, 2017, 5:13:07 AM12/22/17

to Eric W. Biederman, Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

You are right (you are also right about sysctl :-\)

unshare
fork
alloc_pid in level 1 succeeds
alloc_pid in level 0 fails, ->idr_next is 2
fork
alloc pid 2
exit

Reliable reproducer and fail injection patch attached

I'd say proper fix is allocating pids in the opposite order
so that failure in the last layer doesn't move IDR cursor
in baby child pidns.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: proc_flush_task+0x7b/0x180
PGD 3dbb5067 P4D 3dbb5067 PUD 3d11c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2775 Comm: trinity Not tainted
4.15.0-rc4-00202-gead68f216110-dirty #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.10.2-1.fc27 04/01/2014
RIP: 0010:proc_flush_task+0x7b/0x180
RSP: 0018:ffffc9000016bca8 EFLAGS: 00010296
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000032
RDX: ffffc9000016bcbc RSI: ffffc9000016bcc8 RDI: 0000000000000001
RBP: ffffc9000016bcbb R08: 0000000036373732 R09: 0000000000000000
R10: 0000000000000000 R11: ffffc9000016bcbc R12: 0000000000000002
R13: 0000000000000000 R14: 0000000000000002 R15: ffff88003e2aee00
FS: 00007f9b52c86700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000

CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 0000000000000000 CR3: 000000003d24b000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
release_task+0x3c/0x430
? thread_group_cputime_adjusted+0x35/0x40
wait_consider_task+0x7a3/0x7d0
do_wait+0xe5/0x1f0
kernel_wait4+0x74/0x110
? task_stopped_code+0x50/0x50
SyS_wait4+0x77/0x80
? handle_mm_fault+0x4a/0x50
? __do_page_fault+0x1a9/0x3e0
? entry_SYSCALL_64_fastpath+0x13/0x6c
entry_SYSCALL_64_fastpath+0x13/0x6c

proc-flush-task.diff

proc-flush-task.c

Eric W. Biederman

unread,

Dec 22, 2017, 9:42:24 AM12/22/17

to Alexey Dobriyan, Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

I agree with you about changing the order. That will make
the code simpler and in the second loop actually conforming C code,
and fix the immediate problem.

I was worrying about the case where the mount of the proc filesystem
fails, but we call disable_pid_allocation in that case. So we won't try
to allocate a pid again. That seems better than any path where we might
have to reset the allocation state.

The nasty thing is that the pid bitmap+hashtable code also did not set
the pointer back and we did not have any problems. AKA the bug is not
new.

Which means with the new code allocating pid numbers is failing much
more often when allocating pids than the old code ever did. That is not
good.

Is it perhaps the GFP_ATOMIC in a context where we could otherwise
sleep?

Eric

Alexey Dobriyan

unread,

Dec 22, 2017, 11:11:27 AM12/22/17

to Eric W. Biederman, Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

On Fri, Dec 22, 2017 at 08:41:54AM -0600, Eric W. Biederman wrote:
> Alexey Dobriyan <adob...@gmail.com> writes:

> > unshare
> > fork
> > alloc_pid in level 1 succeeds
> > alloc_pid in level 0 fails, ->idr_next is 2
> > fork
> > alloc pid 2
> > exit
> >
> > Reliable reproducer and fail injection patch attached
> >
> > I'd say proper fix is allocating pids in the opposite order
> > so that failure in the last layer doesn't move IDR cursor
> > in baby child pidns.
>
> I agree with you about changing the order. That will make
> the code simpler and in the second loop actually conforming C code,
> and fix the immediate problem.

Something like that (barely tested)

---

include/linux/pid_namespace.h | 2 ++
kernel/pid.c | 15 +++++++++++----
kernel/pid_namespace.c | 3 ---
3 files changed, 13 insertions(+), 7 deletions(-)

--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -28,6 +28,8 @@ struct pid_namespace {
unsigned int pid_allocated;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32
unsigned int level;
struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -146,6 +146,7 @@ void free_pid(struct pid *pid)

struct pid *alloc_pid(struct pid_namespace *ns)

{
+ struct pid_namespace *pid_ns[MAX_PID_NS_LEVEL + 1];
struct pid *pid;
enum pid_type type;
int i, nr;
@@ -157,12 +158,19 @@ struct pid *alloc_pid(struct pid_namespace *ns)
if (!pid)
return ERR_PTR(retval);

- tmp = ns;
pid->level = ns->level;

- for (i = ns->level; i >= 0; i--) {
+ tmp = ns;
+ do {
+ pid_ns[tmp->level] = tmp;
+ tmp = tmp->parent;
+ } while (tmp->level != 0);
+
+ for (i = 0; i <= ns->level; i++) {
int pid_min = 1;

+ tmp = pid_ns[i];
+
idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);

@@ -189,7 +197,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)

pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
- tmp = tmp->parent;
}

if (unlikely(is_child_reaper(pid))) {
@@ -223,7 +230,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)

out_free:
spin_lock_irq(&pidmap_lock);
- while (++i <= ns->level)
+ while (--i >= 0)
idr_remove(&ns->idr, (pid->numbers + i)->nr);

spin_unlock_irq(&pidmap_lock);
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -80,9 +80,6 @@ static void proc_cleanup_work(struct work_struct *work)
pid_ns_release_proc(ns);
}

-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
-
static struct ucounts *inc_pid_namespaces(struct user_namespace *ns)
{
return inc_ucount(ns, current_euid(), UCOUNT_PID_NAMESPACES);

Eric W. Biederman

unread,

Dec 23, 2017, 10:12:47 PM12/23/17

to Alexey Dobriyan, Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

Alexey Dobriyan <adob...@gmail.com> writes:

> On Fri, Dec 22, 2017 at 08:41:54AM -0600, Eric W. Biederman wrote:
>> Alexey Dobriyan <adob...@gmail.com> writes:
>
>> > unshare
>> > fork
>> > alloc_pid in level 1 succeeds
>> > alloc_pid in level 0 fails, ->idr_next is 2
>> > fork
>> > alloc pid 2
>> > exit
>> >
>> > Reliable reproducer and fail injection patch attached
>> >
>> > I'd say proper fix is allocating pids in the opposite order
>> > so that failure in the last layer doesn't move IDR cursor
>> > in baby child pidns.
>>
>> I agree with you about changing the order. That will make
>> the code simpler and in the second loop actually conforming C code,
>> and fix the immediate problem.
>
> Something like that (barely tested)

I have thought about this some more and I think we can do better.

I don't like the on stack pid_ns array.

The only reason the code calls disable_pid_allocation is that we
don't handle this error.

The semantics of least surprise, are that if we run out of resources
while allocating something, trying again when more resources are
available will make it work.

So it looks like handling the error will improve the quality of the
implemenation, and be a simpler, less dangerous patch.

diff --git a/kernel/pid.c b/kernel/pid.c
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -226,6 +224,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)

while (++i <= ns->level)

idr_remove(&ns->idr, (pid->numbers + i)->nr);

+ /* On failure to allocate the first pid, reset the state */
+ if (ns->pid_allocated == PIDNS_ADDING)
+ idr_set_cursor(&ns->idr, 0);
+
spin_unlock_irq(&pidmap_lock);

kmem_cache_free(ns->pid_cachep, pid);

Eric

Eric W. Biederman

unread,

Dec 23, 2017, 10:16:34 PM12/23/17

to Alexey Dobriyan, Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkall...@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton

With the replacement of the pid bitmap and hashtable with an idr in
alloc_pid started occassionally failing when allocating the first pid
in a pid namespace. Things were not completely reset resulting in
the first allocated pid getting the number 2 (not 1). Which
further resulted in ns->proc_mnt not getting set and eventually
causing an oops in proc_flush_task.

Oops: 0000 [#1] SMP
CPU: 2 PID: 6743 Comm: trinity-c117 Not tainted 4.15.0-rc4-think+ #2
RIP: 0010:proc_flush_task+0x8e/0x1b0
RSP: 0018:ffffc9000bbffc40 EFLAGS: 00010286
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
RDX: 0000000000000000 RSI: ffffc9000bbffc50 RDI: 0000000000000000
RBP: ffffc9000bbffc63 R08: 0000000000000000 R09: 0000000000000002
R10: ffffc9000bbffb70 R11: ffffc9000bbffc64 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000003 R15: ffff8804c10d7840
FS: 00007f7cb8965700(0000) GS:ffff88050a200000(0000) knlGS:0000000000000000

CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 0000000000000000 CR3: 00000003e21ae003 CR4: 00000000001606e0
DR0: 00007fb1d6c22000 DR1: 0000000000000000 DR2: 0000000000000000

DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
? release_task+0xaf/0x680
release_task+0xd2/0x680
? wait_consider_task+0xb82/0xce0
wait_consider_task+0xbe9/0xce0
? do_wait+0xe1/0x330
do_wait+0x151/0x330
kernel_wait4+0x8d/0x150
? task_stopped_code+0x50/0x50
SYSC_wait4+0x95/0xa0
? rcu_read_lock_sched_held+0x6c/0x80
? syscall_trace_enter+0x2d7/0x340
? do_syscall_64+0x60/0x210
do_syscall_64+0x60/0x210
entry_SYSCALL64_slow_path+0x25/0x25

RIP: 0033:0x7f7cb82603aa
RSP: 002b:00007ffd60770bc8 EFLAGS: 00000246
ORIG_RAX: 000000000000003d
RAX: ffffffffffffffda RBX: 00007f7cb6cd4000 RCX: 00007f7cb82603aa
RDX: 000000000000000b RSI: 00007ffd60770bd0 RDI: 0000000000007cca
RBP: 0000000000007cca R08: 00007f7cb8965700 R09: 00007ffd607c7080
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd60770bd0 R14: 00007f7cb6cd4058 R15: 00000000cccccccd
Code: c1 e2 04 44 8b 60 30 48 8b 40 38 44 8b 34 11 48 c7 c2 60 3a f5 81 44 89 e1 4c 8b 68 58 e8 4b b4 77 00 89 44 24 14 48 8d 74 24 10 <49> 8b 7d 00 e8 b9 6a f9 ff 48 85 c0 74 1a 48 89 c7 48 89 44 24
RIP: proc_flush_task+0x8e/0x1b0 RSP: ffffc9000bbffc40
CR2: 0000000000000000
---[ end trace 53d67a6481059862 ]---

Improve the quality of the implementation by resetting the place to
start allocating pids on failure to allocate the first pid.

As improving the quality of the implementation is the goal remove the now
unnecesarry disable_pid_allocations call when we fail to mount proc.

Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
Fixes: 8ef047aaaeb8 ("pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid")
Reported-by: Dave Jones <da...@codemonkey.org.uk>
Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---

I have tested this by forcing an allocation failure after allocating pid
1, and the symptoms are identical to what Dave Jones has observed.

I have further confirmed that this change causes the failure to go away.

If no one objects I will send a pull request to Linus in the next day or two.

kernel/pid.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b13b624e2c49..1e8bb6550ec4 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -193,10 +193,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
}

if (unlikely(is_child_reaper(pid))) {
- if (pid_ns_prepare_proc(ns)) {
- disable_pid_allocation(ns);
+ if (pid_ns_prepare_proc(ns))
goto out_free;
- }
}

get_pid_ns(ns);

@@ -226,6 +224,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
while (++i <= ns->level)
idr_remove(&ns->idr, (pid->numbers + i)->nr);

+ /* On failure to allocate the first pid, reset the state */
+ if (ns->pid_allocated == PIDNS_ADDING)
+ idr_set_cursor(&ns->idr, 0);
+
spin_unlock_irq(&pidmap_lock);

kmem_cache_free(ns->pid_cachep, pid);

--
2.14.1

Reply all

Reply to author

Forward