Unable to create pthreads when in MODE_STANDALONE_EXECVE


Joel Weinberger

Oct 4, 2017, 7:02:08 PM
to nsjail
Hi there. When running nsjail in MODE_STANDALONE_EXECVE via the -Me option, I'm unable to create pthreads, as they're failing with EINVAL. However, when I run with the same arguments to nsjail except with -Mo to run in MODE_STANDALONE_ONCE, the threads are happily created and run. Is this expected behavior? What could be changing in these modes to block pthread creation?

Thanks,
   Joel

Robert Święcki

Oct 4, 2017, 8:52:15 PM
to Joel Weinberger, nsjail
Hi,
The problem here is that with -Me unshare() is called instead of
clone(). In that case the current process is not put in the new PID
namespace, but all its descendants will be. The same would apply to
any new threads.

But it's impossible for two threads of one process to be in two
different PID namespaces, hence EINVAL in the subsequent clone().

The solution (probably) will be to use --disable_clone_newpid

It doesn't change much, as with -Me a process is not put in a separate
PID namespace anyway due to limitations in the Linux kernel.
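
The failure described above can be reproduced outside nsjail. What follows is only a minimal sketch, not nsjail's code: it assumes Linux, CAP_SYS_ADMIN (without it unshare() fails with EPERM), and a function name, thread_after_unshare, that is made up for illustration. The experiment runs in a forked helper so the calling process is left alone.

```python
import ctypes
import errno
import os
import threading

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>


def thread_after_unshare():
    """Fork a helper, unshare the PID namespace in it, then try to start a thread.

    Returns a short status string produced by the helper.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # helper process: keeps the experiment away from the caller
        os.close(read_fd)
        status = "thread started"
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWPID) != 0:
                code = errno.errorcode.get(ctypes.get_errno(), "unknown errno")
                status = "unshare failed: " + code
            else:
                worker = threading.Thread(target=lambda: None)
                worker.start()  # pthread_create() -> clone(CLONE_THREAD)
                worker.join()
        except RuntimeError:
            # CPython reports a failed pthread_create() as RuntimeError
            status = "thread creation failed"
        except Exception as exc:
            status = "error: " + type(exc).__name__
        finally:
            os.write(write_fd, status.encode())
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        status = pipe.read()
    os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    print(thread_after_unshare())
```

If the explanation above holds, a privileged run should report that thread creation failed, while an unprivileged run should report that unshare itself failed.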

--
Robert Święcki

Joel Weinberger

Oct 5, 2017, 4:29:47 PM
to nsjail
Thanks! Indeed, --disable_clone_newpid solves the problem. Just to clarify what's going on here and make sure I fully understand:
  • Starting in PID namespace NS1, -Me calls unshare() before execv(), by default creating a new PID namespace with CLONE_NEWPID, which we'll call NS2.
  • However, unshare() does not put the current process itself into NS2, so the top process the execv runs in is still in NS1.
  • When the binary is run, if it forks/clones, it puts all the new processes into NS2.
  • This includes clones to make threads. However, threads must be in the same PID namespace as their parent process. Since in this case the parent is in NS1 but new processes attempt to be put in NS2, thread creation fails.
  • Running with --disable_clone_newpid simply gets rid of the CLONE_NEWPID flag to unshare, so NS2 is never created, and the threads are created in NS1.
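
The first three points in the list above can be checked with a short experiment. This is a sketch under assumptions, not something taken from nsjail: it needs Linux with /proc mounted and CAP_SYS_ADMIN, and the names observe_unshare and NS_LINK are invented for the example. A forked helper calls unshare(CLONE_NEWPID), compares its own /proc/self/ns/pid link before and after, then forks one child and records what that child sees.

```python
import ctypes
import errno
import json
import os

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>
NS_LINK = "/proc/self/ns/pid"


def observe_unshare():
    """Run unshare(CLONE_NEWPID) in a forked helper and report what changed."""
    read_fd, write_fd = os.pipe()
    helper = os.fork()
    if helper == 0:
        os.close(read_fd)
        report = {}
        try:
            ns_before, pid_before = os.readlink(NS_LINK), os.getpid()
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWPID) != 0:
                code = errno.errorcode.get(ctypes.get_errno(), "unknown errno")
                report["error"] = "unshare: " + code
            else:
                # The caller itself is expected to stay in NS1 with its old PID.
                report["caller_ns_unchanged"] = os.readlink(NS_LINK) == ns_before
                report["caller_pid_unchanged"] = os.getpid() == pid_before
                child_r, child_w = os.pipe()
                child = os.fork()  # the first child is created in NS2
                if child == 0:
                    try:
                        msg = "%d %s" % (os.getpid(), os.readlink(NS_LINK))
                        os.write(child_w, msg.encode())
                    finally:
                        os._exit(0)
                os.close(child_w)
                with os.fdopen(child_r) as pipe:
                    child_pid, child_ns = pipe.read().split()
                os.waitpid(child, 0)
                report["child_pid_seen_by_child"] = int(child_pid)
                report["child_ns_differs"] = child_ns != ns_before
        except Exception as exc:
            report = {"error": type(exc).__name__}
        finally:
            os.write(write_fd, json.dumps(report).encode())
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        report = json.loads(pipe.read())
    os.waitpid(helper, 0)
    return report


if __name__ == "__main__":
    print(observe_unshare())
```

Without the required privilege the report contains only an "error" entry, so the interesting fields show up only when unshare() succeeds.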
I'm guessing that this means another (convoluted) solution would be to fork in the binary passed to execv, grab the PID namespace, pass the FD for that namespace to the parent in NS1, then run setns() in the parent to put itself in NS2. I might test this just to verify, but it is almost certainly unnecessary for my purposes.

In any case, thank you so much for the helpful response and explanation.
--Joel

Robert Święcki

Oct 5, 2017, 6:54:25 PM
to Joel Weinberger, nsjail
Hi,

2017-10-05 22:29 GMT+02:00 Joel Weinberger <joel.we...@gmail.com>:
> Thanks! Indeed, --disable_new_pid solves the problem. Just to clarify what's
> going on here and make sure I fully understand:
>
> Starting in PID namespace NS1, -Me calls unshare() before execv(), by
> default creating a new PID namespace with CLONE_NEWPID, which we'll call
> NS2.
> However, unshare() does not put the current process itself into NS2, so the
> top process the execv runs in is still NS1.
> When the binary is run, if it forks/clones, it puts all the new processes
> into NS2.
> This includes clones to make threads. However, threads must be in the same
> PID namespace as their parent process. Since in this case the parent is in
> NS1 but new processes attempt to be put in NS2, thread creation fails.
> Running with --disable_new_pid simply gets rid of the CLONE_NEWPID flag to
> unshare, so NS2 is never created, and the threads are created in NS1.
>
> I'm guessing that this means another (convoluted) solution would be to fork
> in the binary passed to execv, grab the PID namespace, pass the FD for that
> namespace to the parent in NS1, then run setns() in the parent to put itself
> in NS2. I might test this just to verify, but it is almost certainly
> unnecessary for my purposes.

As per 'man pid_namespaces' both unshare() and setns() work in the
same way (only child processes go into new namespace).

I understand the reasons for not making it the default behavior (glibc
would go crazy as it caches PID/TID values), but surprisingly there's
no way to force it, not even during execve(), which is seemingly
OK and would solve some problems.

What one can try is to actually unshare(CLONE_NEWPID), then open
/proc/self/ns/pid_for_children and then do setns() on it. It probably
won't work, but it's still worth trying.
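
The experiment suggested above could be sketched roughly as follows. Everything here is an assumption made for illustration rather than tested nsjail behavior: the function names are made up, the script needs Linux 4.12 or later (for pid_for_children) plus CAP_SYS_ADMIN, and it parks a child in the new namespace first because, as far as namespaces(7) describes it, the pid_for_children link only resolves once the namespace has its first process.

```python
import ctypes
import errno
import os

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>
NS_LINK = "/proc/self/ns/pid"


def _err():
    return errno.errorcode.get(ctypes.get_errno(), "unknown errno")


def experiment(libc):
    """unshare(), park an init in the new namespace, then setns() on pid_for_children."""
    ns_before = os.readlink(NS_LINK)
    if libc.unshare(CLONE_NEWPID) != 0:
        return "unshare failed: " + _err()
    park_r, park_w = os.pipe()
    init = os.fork()  # first child: init of the new namespace
    if init == 0:
        try:
            os.close(park_w)
            os.read(park_r, 1)  # blocks until the helper closes its write end
        finally:
            os._exit(0)
    os.close(park_r)
    try:
        fd = os.open("/proc/self/ns/pid_for_children", os.O_RDONLY)
        try:
            if libc.setns(fd, CLONE_NEWPID) != 0:
                return "setns failed: " + _err()
        finally:
            os.close(fd)
        if os.readlink(NS_LINK) == ns_before:
            return "setns succeeded, caller still in original namespace"
        return "setns succeeded, caller moved"
    finally:
        os.close(park_w)  # lets the parked init exit
        os.waitpid(init, 0)


def setns_on_pid_for_children():
    """Run the experiment in a forked helper and return its status string."""
    read_fd, write_fd = os.pipe()
    helper = os.fork()
    if helper == 0:
        os.close(read_fd)
        status = "error: unknown"
        try:
            status = experiment(ctypes.CDLL(None, use_errno=True))
        except Exception as exc:
            status = "error: " + type(exc).__name__
        finally:
            os.write(write_fd, status.encode())
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        status = pipe.read()
    os.waitpid(helper, 0)
    return status


if __name__ == "__main__":
    print(setns_on_pid_for_children())
```

Going by pid_namespaces(7), the likely outcome is that setns() succeeds but the caller stays where it was, which would match the "probably won't work" guess.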

BTW, there's another "interesting behavior" of CLONE_NEWPID. There
must always be a process inside this namespace (so called ns-init),
in order for subsequent clone/fork's to work. Otherwise clone/fork's
will end up with ENOMEM. Therefore with -Me, nsjail creates a dummy
init which is always inside this pid ns.
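
The ENOMEM behavior can be observed with a small script. This is a sketch under assumptions (Linux, CAP_SYS_ADMIN, a made-up function name), and it does not show how nsjail's dummy init is implemented: the first child after unshare() becomes the namespace init and exits immediately, and a second fork() is then attempted.

```python
import ctypes
import errno
import os

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>


def fork_after_init_exit():
    """Let the new namespace's init exit, then try to fork into it again."""
    read_fd, write_fd = os.pipe()
    helper = os.fork()
    if helper == 0:
        os.close(read_fd)
        status = "error: unknown"
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWPID) != 0:
                code = errno.errorcode.get(ctypes.get_errno(), "unknown errno")
                status = "unshare failed: " + code
            else:
                first = os.fork()  # becomes init (PID 1) of the new namespace
                if first == 0:
                    os._exit(0)  # init exits at once, leaving the namespace empty
                os.waitpid(first, 0)
                try:
                    second = os.fork()
                except OSError as exc:
                    code = errno.errorcode.get(exc.errno, "unknown errno")
                    status = "second fork failed: " + code
                else:
                    if second == 0:
                        os._exit(0)
                    os.waitpid(second, 0)
                    status = "second fork succeeded"
        except Exception as exc:
            status = "error: " + type(exc).__name__
        finally:
            os.write(write_fd, status.encode())
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        status = pipe.read()
    os.waitpid(helper, 0)
    return status


if __name__ == "__main__":
    print(fork_after_init_exit())
```

Keeping a long-lived first child in the namespace, which is the role of the dummy init mentioned above, is what lets later forks succeed.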

So, -Me is somewhat incomplete, and I see no good way of making it
work so that it includes PID namespaces.

Joel Weinberger

Oct 6, 2017, 1:06:58 PM
to Robert Święcki, nsjail
On Thu, Oct 5, 2017 at 3:54 PM Robert Święcki <rob...@swiecki.net> wrote:
> Hi,
>
> As per 'man pid_namespaces' both unshare() and setns() work in the
> same way (only child processes go into new namespace).

Oh, yikes, I misread the setns() explanation initially. Great catch!

> I understand the reasons for not making it the default behavior (glibc
> would go crazy as it caches PID/TID values), but surprisingly there's
> no way to force it, not even during execve(), which is seemingly
> OK and would solve some problems.
>
> What one can try is to actually unshare(CLONE_NEWPID), then open
> /proc/self/ns/pid_for_children and then do setns() on it. It probably
> won't work, but it's still worth trying.
>
> BTW, there's another "interesting behavior" of CLONE_NEWPID. There
> must always be a process inside this namespace (so called ns-init),
> in order for subsequent clone/fork's to work. Otherwise clone/fork's
> will end up with ENOMEM. Therefore with -Me, nsjail creates a dummy
> init which is always inside this pid ns.

Wow, crazy. Hence the reaper in pidInitNs(), I suppose.

Robert Święcki

Oct 6, 2017, 4:55:07 PM
to Joel Weinberger, nsjail
Yup

PS: If you or anybody else would like to give -Me a shot, and try to
make it work "as expected" (i.e. including the current process
entering the new PID namespace), that could be an interesting piece of
work. Some ideas include using CLONE_PARENT, to inform the parent
process (e.g. bash) that there are more processes that need to be
waited for than just its first child process, or making use of
/proc/self/ns/pid_for_children

Another alternative would be a patch for a kernel which would make PID
ns take effect across execve, e.g. through ioctl, e.g:

ioctl(fd="/proc/self/ns/pid",
SOMETHING_SOMETHING_ENABLE_PID_NS_ACROSS_EXECVE, 1);
--
Robert Święcki

Joel Weinberger

Oct 6, 2017, 6:05:35 PM
to Robert Święcki, nsjail
Don't think I'll be getting to it right now, but I'll definitely keep it in mind. For now, if I decide I should have a new PID namespace, I'll probably just set the namespace and then launch nsjail.
--Joel