Hi All,
In Singularity 2 (I'm testing with 2.6.0), if you 'shell' or 'exec' into a running instance that was started in a custom network namespace (created with the 'ip netns add' command, not with the -n|--net Singularity flag), the process(es) spawned will also use that custom network namespace. This is highly beneficial: each Singularity container instance can use its own 'ip netns' managed custom network namespace, and users of each container end up in that custom network namespace by default when running 'shell' or 'exec' against the instance, regardless of the network namespace of the user's own shell.
In Singularity 3 (I'm testing with 3.0.0-alpha.1-144-g345371f, git clone from 08/31/2018), this behavior appears to have changed. If an instance is started in a custom network namespace (created with the 'ip netns add' command, not with the -n|--net Singularity flag), the initial processes start in the custom network namespace, but future process(es) spawned by running 'shell' or 'exec' against that instance end up associated with the root (default) network namespace of the host system instead. Since non-root users cannot use 'ip netns exec' or 'nsenter' to explicitly start a process in a particular namespace, non-root 'shell' and 'exec' interactions with a Singularity 3 instance result in processes that essentially "escape" the custom network namespace and are instead exposed to the root network namespace and everything it contains (network interfaces, iptables rules, routing tables, etc.)...
Are there plans to incorporate Singularity 2's behavior described above into Singularity 3?
Although I realize that the -n|--net flag can be passed to 'instance start' to cause Singularity 3 to generate a new network namespace, using 'ip netns add' instead not only creates what is essentially a "named" custom network namespace (e.g. /var/run/netns/name-of-custom-network-namespace) but also makes it possible to use namespace-specific configuration files in /etc (see http://man7.org/linux/man-pages/man8/ip-netns.8.html ).
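To illustrate those namespace-specific configuration files, here is a minimal sketch of the ip-netns(8) behavior. It must be run as root, and the namespace name 'demo-ns' and the nameserver address are made-up illustrative values (not part of the setup described in this post):

```shell
#!/bin/bash
# Sketch only -- requires root and iproute2. 'demo-ns' and 10.0.0.53
# are illustrative values, not part of the setup described above.
if [ "$(id -u)" -ne 0 ]; then
  echo "run as root" >&2
  exit 0
fi
ip netns add demo-ns
mkdir -p /etc/netns/demo-ns
echo "nameserver 10.0.0.53" > /etc/netns/demo-ns/resolv.conf
# ip-netns(8) bind-mounts /etc/netns/demo-ns/resolv.conf over
# /etc/resolv.conf for processes started inside the namespace:
ip netns exec demo-ns cat /etc/resolv.conf
# Clean up.
ip netns delete demo-ns
rm -rf /etc/netns/demo-ns
```

Any process started with 'ip netns exec demo-ns ...' sees the override, because ip-netns(8) bind-mounts each file under /etc/netns/demo-ns/ over the corresponding file in /etc.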
Below I've included two sections ('Examples' and 'Relevant Code') to provide examples and references to relevant code, both with an excruciating (but hopefully helpful) amount of detail. ;)
If you have any questions or need additional information, please let me know.
Thanks!
Sean
Examples
Singularity 2
As an example, consider the following script being run as root when a host is started:
#!/bin/bash
ip netns add container1-hostname
ip netns exec container1-hostname su -c "singularity instance.start /home/nonrootuser/centos.simg container1" - nonrootuser
The script above creates a new custom network namespace named 'container1-hostname' and starts a Singularity instance named 'container1' (running as nonrootuser) inside that custom network namespace. The result in Singularity 2 is a single process for the instance ('singularity-instance: nonrootuser [container1]') that is using the 'container1-hostname' custom network namespace:
[root@singularity2 ~]# lsns -t net -o NS,PATH,TYPE,NPROCS,PPID,PID,USER,UID,COMMAND
NS PATH TYPE NPROCS PPID PID USER UID COMMAND
4026531956 /proc/1/ns/net net 94 0 1 root 0 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026532163 /proc/2695/ns/net net 1 1 2695 nonrootuser 1000 singularity-instance: nonrootuser [container1]
[root@singularity2 ~]# ls -i /var/run/netns/container1-hostname
4026532163 /var/run/netns/container1-hostname
In this example, 4026531956 is the inode containing the root network namespace used by default for all processes running on the host system (94 processes in this example) and 4026532163 is the inode containing our custom network namespace. Now let's say nonrootuser shells into the running instance that was started as their user when the host system booted:
[nonrootuser@singularity2 ~]$ singularity instance.list
DAEMON NAME PID CONTAINER IMAGE
container1 2695 /home/nonrootuser/centos.simg
[nonrootuser@singularity2 ~]$ singularity shell instance://container1
Singularity: Invoking an interactive shell within container...
Singularity centos.simg:~> readlink /proc/self/ns/net
net:[4026532163]
The resulting bash process for the Singularity shell ends up with our correct custom network namespace (the network namespace file descriptor for its process refers to inode 4026532163, which we saw earlier as the inode containing our 'container1-hostname' custom network namespace in this example). The same behavior applies to exec:
[nonrootuser@singularity2 ~]$ singularity exec instance://container1 readlink /proc/self/ns/net
net:[4026532163]
This is excellent since it means that all processes resulting from nonrootuser's interaction with that instance will use our custom network namespace!
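As an aside, the inode comparison used above can be wrapped in a tiny helper. This is just a sketch of the technique (the 'same_netns' function name is mine, not anything from Singularity): two processes share a network namespace exactly when their /proc/<pid>/ns/net symlinks resolve to the same 'net:[inode]' target.

```shell
#!/bin/bash
# Hypothetical helper (not part of Singularity): compare the targets
# of two processes' /proc/<pid>/ns/net symlinks.
same_netns() {
  [ "$(readlink /proc/"$1"/ns/net)" = "$(readlink /proc/"$2"/ns/net)" ]
}

# A child of this shell inherits its network namespace, so the two
# readlink targets match:
sleep 30 &
child=$!
if same_netns "$$" "$child"; then
  echo "PID $$ and PID $child share a network namespace"
else
  echo "PID $$ and PID $child are in different network namespaces"
fi
kill "$child" 2>/dev/null
```

This works unprivileged for your own processes, which is exactly the situation nonrootuser is in when checking where a 'shell' or 'exec' process landed.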
Singularity 3
Now let's look at the different behavior in Singularity 3. Consider the following script (identical to the script in the Singularity 2 example except for using 'instance start' instead of 'instance.start') being run as root when a host is started:
#!/bin/bash
ip netns add container1-hostname
ip netns exec container1-hostname su -c "singularity instance start /home/nonrootuser/centos.simg container1" - nonrootuser
The script above creates a new custom network namespace named 'container1-hostname' and starts a Singularity instance named 'container1' (running as nonrootuser) inside that custom network namespace. The result in Singularity 3 is two processes for the instance (parent process 'Singularity instance: nonrootuser [container1]' and child process 'sinit') that are both using the 'container1-hostname' custom network namespace:
[root@singularity3 ~]# lsns -t net -o NS,PATH,TYPE,NPROCS,PPID,PID,USER,UID,COMMAND
NS PATH TYPE NPROCS PPID PID USER UID COMMAND
4026531956 /proc/1/ns/net net 99 0 1 root 0 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026532163 /proc/2676/ns/net net 2 1 2676 nonrootuser 1000 Singularity instance: nonrootuser [container1]
[root@singularity3 ~]# ls -i /var/run/netns/container1-hostname
4026532163 /var/run/netns/container1-hostname
In this example, 4026531956 is the inode containing the root network namespace used by default for all processes running on the host system (99 processes in this example) and 4026532163 is the inode containing our custom network namespace. Now let's say nonrootuser shells into the running instance that was started as their user when the host system booted:
[nonrootuser@singularity3 ~]$ singularity instance list
INSTANCE NAME PID IMAGE
container1 2677 /home/nonrootuser/centos.simg
[nonrootuser@singularity3 ~]$ singularity shell instance://container1
Singularity :~> readlink /proc/self/ns/net
net:[4026531956]
Instead of the resulting bash process for the Singularity shell ending up in our correct custom network namespace like it did in Singularity 2, it ends up in the root network namespace (the network namespace file descriptor for its process refers to inode 4026531956, which we saw earlier as the inode containing the root/default network namespace in this example). :( The same behavior unfortunately applies to exec:
[nonrootuser@singularity3 ~]$ singularity exec instance://container1 readlink /proc/self/ns/net
net:[4026531956]
The result is that all processes started by a user connecting to that instance essentially "escape" the custom network namespace and are instead exposed to the root network namespace and everything it contains (network interfaces, iptables rules, routing tables, etc)...
Relevant Code
Singularity 2
In Singularity 2, when 'singularity exec instance://instancename command' is run, my current understanding is that the following occurs:
* /usr/bin/singularity (or whatever its path may be on your system) - 'exec' is passed as $SINGULARITY_COMMAND to the line 'exec $SINGULARITY_libexecdir/singularity/cli/$SINGULARITY_COMMAND.exec "$@"'.
* /usr/libexec/singularity/cli/exec.exec (or whatever its path may be on your system) - The line '. "$SINGULARITY_libexecdir/singularity/image-handler.sh"' runs image-handler.sh.
* /usr/libexec/singularity/image-handler.sh (or whatever its path may be on your system) - 'instance://instancename' being passed to exec results in the line '. "$SINGULARITY_libexecdir/singularity/handlers/image-instance.sh"' being run.
* /usr/libexec/singularity/handlers/image-instance.sh (or whatever its path may be on your system) - If an instance by that name is indeed running, the line 'SINGULARITY_DAEMON_JOIN=1' is run and SINGULARITY_DAEMON_JOIN is exported as a bash environment variable.
* /usr/libexec/singularity/cli/exec.exec (or whatever its path may be on your system) - The line 'exec "$SINGULARITY_libexecdir/singularity/bin/action-suid" "$@" <&0' is run if SINGULARITY_NOSUID isn't set.
* /usr/libexec/singularity/bin/action-suid (or whatever its path may be on your system) - Regardless of whether or not DAEMON_JOIN was set in the "Singularity registry" (i.e. if 'SINGULARITY_DAEMON_JOIN=1' was run in image-instance.sh), 'singularity_runtime_ns(SR_NS_ALL);' is run.
* /usr/lib64/singularity/libsingularity-runtime.so.1 (or whatever its path may be on your system) - In function definition 'int singularity_runtime_ns(unsigned int flags)', the conditional 'if ( singularity_registry_get("DAEMON_JOIN") )' returns true since DAEMON_JOIN was set to 1 in the "Singularity registry" ('SINGULARITY_DAEMON_JOIN=1' was run in image-instance.sh). As a result, the line 'return(_singularity_runtime_ns_join(flags));' is run (with 'flags' being an integer representation of SR_NS_ALL at this point).
* /usr/lib64/singularity/libsingularity-runtime.so.1 (or whatever its path may be on your system) - In function definition 'int _singularity_runtime_ns_join(unsigned int flags)', conditional 'if ( flags & SR_NS_NET )' returns true since SR_NS_NET is part of SR_NS_ALL (passed earlier) and the line 'retval += _singularity_runtime_ns_net_join();' is run.
* /usr/lib64/singularity/libsingularity-runtime.so.1 (or whatever its path may be on your system) - In function definition 'int _singularity_runtime_ns_net_join(void)', the file descriptor for the running instance's network namespace is opened and 'setns(net_fd, CLONE_NEWNET)' is run as part of a conditional. The setns system call (see http://man7.org/linux/man-pages/man2/setns.2.html ) associates the calling thread (in our case, the exec process) with the namespace referred to by the file descriptor it is passed. In this case, the file descriptor for the running instance's network namespace is passed in along with 'CLONE_NEWNET' (used for network namespaces) as the namespace type. The end result is that the process created from running 'singularity exec instance://instancename command' is assigned to whatever network namespace the running instance's process is in. Cool! :)
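As an aside, the effect of _singularity_runtime_ns_net_join() can be approximated from a root shell with nsenter(1), which performs the same open-then-setns-then-exec sequence on a namespace file. This is a sketch only; it assumes root, and 'demo-ns' is an illustrative namespace name rather than anything from the examples in this post:

```shell
#!/bin/bash
# Sketch only -- requires root. nsenter opens the namespace file,
# calls setns(fd, CLONE_NEWNET), and then execs the command, which is
# essentially what _singularity_runtime_ns_net_join() does with the
# running instance's network namespace file descriptor.
if [ "$(id -u)" -ne 0 ]; then
  echo "run as root" >&2
  exit 0
fi
ip netns add demo-ns
# The readlink target should report the same inode that 'ls -i' shows
# for the named namespace file:
nsenter --net=/var/run/netns/demo-ns readlink /proc/self/ns/net
ls -i /var/run/netns/demo-ns
# Clean up.
ip netns delete demo-ns
```

Of course, nsenter requires privilege, which is exactly why non-root users depend on Singularity performing this join on their behalf.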
Singularity 3
I haven't yet had time to examine the code execution steps for the same example command ('singularity exec instance://instancename command') in Singularity 3, but I have determined that it does indeed have code that supports a 'shell' or 'exec' process being assigned to a custom network namespace of a running instance. The issue is that the aforementioned behavior (assignment of a 'shell' or 'exec' process to a running instance's custom network namespace) only occurs if that instance was started with the -n|--net Singularity flag (which of course instructs Singularity to create its own custom network namespace). If the instance was started without that flag from a custom network namespace that was instead created with 'ip netns add', 'shell' and 'exec' processes run against an instance end up in the root network namespace instead of the desired custom network namespace.
One fundamental difference is that the JSON configuration for the running instance located at /var/run/singularity/instances/${USER}/nameofinstance.json only includes 'network' in its list of namespaces if that instance was started with the -n|--net Singularity flag. If that instance was started without that flag from a custom network namespace that was instead created with 'ip netns add', 'network' is not included in the list of namespaces...
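A quick way to see what a running instance recorded is to decode that JSON configuration and look for a 'network' entry. This is a sketch that assumes an instance named 'container1' exists and that jq is installed; the path follows the convention above:

```shell
#!/bin/bash
# Sketch: decode the instance's base64-encoded config and test whether
# a 'network' namespace entry was recorded. The path and instance name
# follow the examples in this post.
cfg="/var/run/singularity/instances/${USER}/container1.json"
if [ ! -f "$cfg" ]; then
  echo "no such instance config: $cfg" >&2
  exit 0
fi
# jq -r emits the .config string without surrounding quotes; jq -e
# sets a non-zero exit status when the select() produces no output.
if jq -r '.config' "$cfg" | base64 --decode \
    | jq -e '.engineConfig.ociConfig.linux.namespaces[]
             | select(.type == "network")' >/dev/null; then
  echo "network namespace recorded in config"
else
  echo "no network namespace in config"
fi
```

Run against the two instances below, this reports a network namespace only for the one started with -n|--net.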
For example, consider the following script being run as root:
#!/bin/bash
ip netns add container1-hostname
ip netns exec container1-hostname su -c "singularity instance start /home/nonrootuser/centos.simg container1" - nonrootuser
su -c "singularity instance start --net /home/nonrootuser/centos.simg container2" - nonrootuser
The result is two Singularity instances, container1 and container2. container1 was started inside of the 'container1-hostname' custom network namespace that was created with the 'ip netns add' command and the two resulting initial processes (parent process 'Singularity instance: nonrootuser [container1]' and child process 'sinit') are both using that namespace. container2 was instead started with the -n|--net Singularity flag to instruct Singularity to create its own custom network namespace. container2 also starts with two initial processes (parent process 'Singularity instance: nonrootuser [container2]' and child process 'sinit') but only the 'sinit' process ends up using the custom network namespace that Singularity created. The parent process 'Singularity instance: nonrootuser [container2]' ends up in the root (default) network namespace instead of the custom network namespace. Consider the following four processes:
nonrootuser 1157 Singularity instance: nonrootuser [container1]
nonrootuser 1158 \_ sinit
nonrootuser 1205 Singularity instance: nonrootuser [container2]
nonrootuser 1206 \_ sinit
PIDs 1157 and 1158 are the initial parent and child processes for container1 and PIDs 1205 and 1206 are the initial parent and child processes for container2. Let's confirm which network namespaces those processes are using:
[root@singularity3 ~]# lsns -t net -o NS,PATH,TYPE,NPROCS,PPID,PID,USER,UID,COMMAND
NS PATH TYPE NPROCS PPID PID USER UID COMMAND
4026531956 /proc/1/ns/net net 99 0 1 root 0 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026532163 /proc/1157/ns/net net 2 1 1157 nonrootuser 1000 Singularity instance: nonrootuser [container1]
4026532242 /proc/1206/ns/net net 1 1205 1206 nonrootuser 1000 sinit
In this example, 4026531956 is the inode containing the root network namespace used by default for all processes running on the host system (99 processes in this example), 4026532163 is the inode containing the 'container1-hostname' custom network namespace we created with 'ip netns add', and 4026532242 is the inode containing the custom network namespace that Singularity created when the container2 instance was started with the --net flag.
[root@singularity3 ~]# ls -i /var/run/netns/container1-hostname
4026532163 /var/run/netns/container1-hostname
[root@singularity3 ~]# readlink /proc/1157/ns/net
net:[4026532163]
[root@singularity3 ~]# readlink /proc/1158/ns/net
net:[4026532163]
[root@singularity3 ~]# readlink /proc/1205/ns/net
net:[4026531956]
[root@singularity3 ~]# readlink /proc/1206/ns/net
net:[4026532242]
Now let's compare the JSON configuration of both running instances, specifically the values from '.engineConfig.ociConfig.linux.namespaces':
[nonrootuser@singularity3 ~]$ cat /var/run/singularity/instances/nonrootuser/container1.json | jq '.config' | perl -pi -e "s/\"//g" | base64 --decode | jq '.engineConfig.ociConfig.linux.namespaces'
[
{
"type": "pid",
"path": "/proc/1158/ns/pid"
},
{
"type": "ipc",
"path": "/proc/1158/ns/ipc"
},
{
"type": "mount",
"path": "/proc/1158/ns/mnt"
}
]
[nonrootuser@singularity3 ~]$ cat /var/run/singularity/instances/nonrootuser/container2.json | jq '.config' | perl -pi -e "s/\"//g" | base64 --decode | jq '.engineConfig.ociConfig.linux.namespaces'
[
{
"type": "network",
"path": "/proc/1206/ns/net"
},
{
"type": "pid",
"path": "/proc/1206/ns/pid"
},
{
"type": "ipc",
"path": "/proc/1206/ns/ipc"
},
{
"type": "mount",
"path": "/proc/1206/ns/mnt"
}
]
As you can see, the 'network' namespace only appears in the JSON configuration for a running instance if that instance was started with the -n|--net Singularity flag. Now let's test and see what network namespace ends up being assigned to 'exec' processes run against both of those running instances:
[nonrootuser@singularity3 ~]$ singularity exec instance://container1 readlink /proc/self/ns/net
net:[4026531956]
[nonrootuser@singularity3 ~]$ singularity exec instance://container2 readlink /proc/self/ns/net
net:[4026532242]
'exec' (and 'shell') processes run against container1 end up in the root (default) network namespace of the host system instead of the desired 'container1-hostname' custom network namespace we created with 'ip netns add' earlier, even though both of the container1 processes are in that custom namespace! 'exec' (and 'shell') processes run against container2 end up in the custom network namespace created by Singularity.
It would be highly beneficial if Singularity 3 could be updated such that 'shell' and 'exec' processes run against an instance automatically receive the network namespace of the instance itself. That would be consistent with the behavior of Singularity 2 and would prevent 'shell' and 'exec' processes run against an instance from "escaping" a custom network namespace and ending up with completely different interfaces, iptables rules, routing tables, etc than anticipated.