Hi,
2017-10-05 22:29 GMT+02:00 Joel Weinberger <joel.we...@gmail.com>:
> Thanks! Indeed, --disable_new_pid solves the problem. Just to clarify what's
> going on here and make sure I fully understand:
>
> Starting in PID namespace NS1, -Me calls unshare() before execv(), by
> default creating a new PID namespace with CLONE_NEWPID, which we'll call
> NS2.
> However, unshare() does not put the current process itself into NS2, so the
> top process the execv runs in is still NS1.
> When the binary is run, if it forks/clones, it puts all the new processes
> into NS2.
> This includes clones to make threads. However, threads must be in the same
> PID namespace as their parent process. Since in this case the parent is in
> NS1 but new processes attempt to be put in NS2, thread creation fails.
> Running with --disable_new_pid simply gets rid of the CLONE_NEWPID flag to
> unshare, so NS2 is never created, and the threads are created in NS1.
>
> I'm guessing that this means another (convoluted) solution would be to fork
> in the binary passed to execv, grab the PID namespace, pass the FD for that
> namespace to the parent in NS1, then run setns() in the parent to put itself
> in NS2. I might test this just to verify, but it is almost certainly
> unnecessary for my purposes.
As per 'man pid_namespaces', both unshare() and setns() work in the
same way (only child processes go into the new namespace).
I understand the reasons for not making it the default behavior (glibc
would go crazy, as it caches PID/TID values), but surprisingly there's
no way to force it, or even to force it during execve(), which seems
OK and would solve some problems.
What one can try is to actually call unshare(CLONE_NEWPID), then open
/proc/self/ns/pid_for_children, and then call setns() on it. It probably
won't work, but it's still worth trying.
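
A minimal sketch of that experiment, in case someone wants to try it. It
assumes Linux >= 4.12 (for pid_for_children) and CAP_SYS_ADMIN; the helper
names ns_link, experiment and run_in_child are made up for illustration.
The calls run in a throwaway child so the calling process keeps its own
namespaces. Note that namespaces(7) says the pid_for_children link only
gains a value after the first child is created in the namespace, so the
open() step may well fail, and setns(2) documents that CLONE_NEWPID affects
only future children, so I'd expect pid_after to equal pid_before.

```python
import ctypes
import json
import os

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>
libc = ctypes.CDLL(None, use_errno=True)


def ns_link(name):
    """Return e.g. 'pid:[4026531836]' for /proc/self/ns/<name>, or None if unreadable."""
    try:
        return os.readlink(f"/proc/self/ns/{name}") or None
    except OSError:
        return None


def experiment():
    """unshare(CLONE_NEWPID), then setns() on pid_for_children; report what happened."""
    result = {"pid_before": ns_link("pid"), "failed_at": None, "error": None}
    if libc.unshare(CLONE_NEWPID) != 0:
        result.update(failed_at="unshare", error=os.strerror(ctypes.get_errno()))
        return result
    try:
        fd = os.open("/proc/self/ns/pid_for_children", os.O_RDONLY)
    except OSError as exc:
        result.update(failed_at="open", error=exc.strerror)
        return result
    if libc.setns(fd, CLONE_NEWPID) != 0:
        result.update(failed_at="setns", error=os.strerror(ctypes.get_errno()))
    os.close(fd)
    # If setns() had moved the caller itself, this would differ from pid_before.
    result["pid_after"] = ns_link("pid")
    return result


def run_in_child(func):
    """Fork, run func() in the child, and hand its JSON-encoded result to the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        try:
            os.write(write_fd, json.dumps(func()).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return json.loads(data)


if __name__ == "__main__":
    print(run_in_child(experiment))
```

Without privileges this just reports failed_at == "unshare", which is
harmless.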
BTW, there's another "interesting behavior" of CLONE_NEWPID. There
must always be a process inside this namespace (the so-called ns-init)
in order for subsequent clone()/fork() calls to work. Otherwise they
will fail with ENOMEM. Therefore, with -Me, nsjail creates a dummy
init which always stays inside this PID namespace.
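
The ENOMEM behavior can be observed with a sketch like the following. It
is only an illustration, not what nsjail does internally; fork_and_reap,
probe and run_probe are hypothetical names, and it needs CAP_SYS_ADMIN
for unshare() to succeed. The first forked child becomes PID 1 of the new
namespace and exits right away, so per pid_namespaces(7) the second fork
should be the one that reports ENOMEM.

```python
import ctypes
import errno
import os

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>
libc = ctypes.CDLL(None, use_errno=True)


def fork_and_reap():
    """Fork a child that exits at once; return None on success or the errno name."""
    try:
        pid = os.fork()
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    if pid == 0:
        os._exit(0)
    os.waitpid(pid, 0)
    return None


def probe():
    """unshare(CLONE_NEWPID), then fork twice and describe both outcomes."""
    if libc.unshare(CLONE_NEWPID) != 0:
        return "unshare failed: " + os.strerror(ctypes.get_errno())
    first = fork_and_reap()   # this child is the init of the new namespace
    second = fork_and_reap()  # attempted after that init has already exited
    return f"first fork: {first or 'ok'}, second fork: {second or 'ok'}"


def run_probe():
    """Run probe() in a throwaway child so the caller's namespaces stay untouched."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        try:
            os.write(write_fd, probe().encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        report = pipe.read().decode()
    os.waitpid(pid, 0)
    return report


if __name__ == "__main__":
    print(run_probe())
```

Keeping a long-lived dummy process as the first child, instead of letting
it exit, is what avoids the failure on later forks.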
So, -Me is somewhat incomplete, and I see no good way of making it
work so that it includes PID namespaces.