Nsjail in docker

Mikołaj Rządca

Mar 5, 2019, 4:28:23 AM
to nsjail
Hello everyone,

Is there any way to run nsjail in docker without privileged mode?

Robert Święcki

Mar 5, 2019, 7:08:11 AM
to Mikołaj Rządca, nsjail, dominik.b....@gmail.com
If I'm not mistaken, docker --privileged mode is used just because of the personality() syscall, or are you seeing any other problems with it?

It was actually https://github.com/disconnect3d who implemented it, maybe he'll be able to help? I'm CCing him, since his e-mail seems to be published on his homepage.

--
Robert Święcki

dominik.b...@gmail.com

Mar 5, 2019, 8:42:42 AM
to nsjail

Dominik Czarnota

14:38 (3 minutes ago)
to nsjail
[re-posting this to the group after signing in, sorry if some of you got double notification]

> docker --privileged mode is used just because of the personality() syscall
There is much more to it than that. Docker itself uses Linux namespaces, cgroups and a default seccomp profile (per the docs: "The default seccomp profile will adjust to the selected capabilities, in order to allow use of facilities allowed by the capabilities, so you should not have to adjust this") to limit what a user can do inside the container, so without `--privileged` nsjail cannot do many of the things it needs (set up namespaces, cgroups and mounts).
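
To see what Docker actually leaves a container with, one option is to read the kernel's view of the current process. The sketch below is illustrative only: `show_confinement` is a hypothetical helper name, and it assumes a Linux `/proc` whose status file carries the `CapEff`, `CapBnd`, `Seccomp` and `NoNewPrivs` fields.

```shell
# Show the capability masks, seccomp mode and no_new_privs bit of a process.
# Defaults to the current process; pass a path to inspect a saved status file.
# (show_confinement is a hypothetical helper name, not part of nsjail or Docker.)
show_confinement() {
  grep -E '^(CapEff|CapBnd|Seccomp|NoNewPrivs):' "${1:-/proc/self/status}"
}

show_confinement
```

Running it in containers started with and without `--privileged` should show the difference in the capability masks and in the `Seccomp` mode (0 = disabled, 2 = filter). `capsh --decode=<mask>` from libcap can turn a mask into capability names.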

So:
> Is there any way to run nsjail in docker without privileged mode?
In theory: possibly. We would need to set the container up so that it can use everything nsjail needs.
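
A rough way to find out which of those pieces the container allows is to try creating each namespace type separately. This is only a sketch: it assumes util-linux `unshare` is installed in the image, and `try_ns` and `probe_namespaces` are hypothetical helper names. The namespace list mirrors the `clone_new*` flags that nsjail prints in its "Jail parameters" line.

```shell
# Try to create one namespace type with util-linux `unshare` and report the result.
# "failed" covers a kernel/seccomp/LSM refusal as well as a missing `unshare` binary.
try_ns() {
  name=$1; shift
  if unshare "$@" true 2>/dev/null; then
    echo "$name: ok"
  else
    echo "$name: failed"
  fi
}

# Probe every namespace type nsjail enables by default (hypothetical helper).
probe_namespaces() {
  try_ns user   --user
  try_ns mount  --mount
  try_ns pid    --pid --fork
  try_ns net    --net
  try_ns ipc    --ipc
  try_ns uts    --uts
  try_ns cgroup --cgroup
}

probe_namespaces
```

Each probe is independent, so a single `failed` line points at the namespace type to investigate. It does not exercise the mounts nsjail performs afterwards.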

So let's try this. We know that nsjail can work properly if we use `--privileged`:
```
$ docker run --rm -it --privileged disconnect3d/nsjail nsjail -R / /bin/ls -- /
[2019-03-05T13:12:33+0000] Mode: STANDALONE_ONCE
[2019-03-05T13:12:33+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/bin/ls', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:true, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[2019-03-05T13:12:33+0000] Mount point: src:'' dst:'/' flags:'MS_RDONLY' type:'tmpfs' options:'' is_dir:true
[2019-03-05T13:12:33+0000] Mount point: src:'/' dst:'/' flags:'MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE' type:'' options:'' is_dir:true
[2019-03-05T13:12:33+0000] Mount point: src:'' dst:'/proc' flags:'MS_RDONLY' type:'proc' options:'' is_dir:true
[2019-03-05T13:12:33+0000] Uid map: inside_uid:0 outside_uid:0 count:1 newuidmap:false
[2019-03-05T13:12:33+0000] [W][1] void cmdline::logParams(nsjconf_t*)():236 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[2019-03-05T13:12:33+0000] Gid map: inside_gid:0 outside_gid:0 count:1 newgidmap:false
[2019-03-05T13:12:33+0000] [W][1] void cmdline::logParams(nsjconf_t*)():247 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[2019-03-05T13:12:33+0000] Executing '/bin/ls' for '[STANDALONE MODE]'
bin  boot  dev etc  home  lib lib64  media  mnt  opt proc  root  run  sbin  srv  sys  tmp  usr  var
[2019-03-05T13:12:33+0000] PID: 6 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
```

So maybe we can drop `--privileged`, add all Linux capabilities and disable the seccomp profile instead?
```
$ docker run --rm -it --cap-add=ALL --security-opt seccomp=unconfined disconnect3d/nsjail nsjail -R / /bin/ls -- /
[2019-03-05T13:12:49+0000] Mode: STANDALONE_ONCE
[2019-03-05T13:12:49+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/bin/ls', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:true, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[2019-03-05T13:12:49+0000] Mount point: src:'' dst:'/' flags:'MS_RDONLY' type:'tmpfs' options:'' is_dir:true
[2019-03-05T13:12:49+0000] Mount point: src:'/' dst:'/' flags:'MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE' type:'' options:'' is_dir:true
[2019-03-05T13:12:49+0000] Mount point: src:'' dst:'/proc' flags:'MS_RDONLY' type:'proc' options:'' is_dir:true
[2019-03-05T13:12:49+0000] Uid map: inside_uid:0 outside_uid:0 count:1 newuidmap:false
[2019-03-05T13:12:49+0000] [W][1] void cmdline::logParams(nsjconf_t*)():236 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[2019-03-05T13:12:49+0000] Gid map: inside_gid:0 outside_gid:0 count:1 newgidmap:false
[2019-03-05T13:12:49+0000] [W][1] void cmdline::logParams(nsjconf_t*)():247 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[2019-03-05T13:12:49+0000] [E][1] bool mnt::initNsInternal(nsjconf_t*)():370 mount('/', '/', NULL, MS_REC|MS_PRIVATE, NULL): Permission denied
[2019-03-05T13:12:49+0000] [F][1] bool subproc::runChild(nsjconf_t*, int, int, int)():432 Launching child process failed
[2019-03-05T13:12:49+0000] [W][1] bool subproc::runChild(nsjconf_t*, int, int, int)():460 Received error message from the child process before it has been executed
[2019-03-05T13:12:49+0000] [E][1] int nsjail::standaloneMode(nsjconf_t*)():146 Couldn't launch the child process
```

But we are still getting:
> [2019-03-05T13:12:49+0000] [E][1] bool mnt::initNsInternal(nsjconf_t*)():370 mount('/', '/', NULL, MS_REC|MS_PRIVATE, NULL): Permission denied

And I am not sure how to make it work from here.

The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do. This flag exists to allow special use-cases, like running Docker within Docker.
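
One thing that may be worth checking here (an assumption, not verified on the setup above): on hosts with AppArmor, Docker confines containers with its `docker-default` profile, which denies `mount`, and neither `--cap-add=ALL` nor `seccomp=unconfined` lifts that. A minimal sketch for reading the current confinement from inside the container follows; `lsm_label` is a hypothetical helper name.

```shell
# Print the LSM (e.g. AppArmor) label of a process from its attr file.
# Defaults to the current process; prints "unavailable" when the file is
# missing or cannot be read (for example when no LSM exposes a label).
# (lsm_label is a hypothetical helper name.)
lsm_label() {
  label=$(cat "${1:-/proc/self/attr/current}" 2>/dev/null | tr -d '\0')
  echo "${label:-unavailable}"
}

lsm_label
```

If this prints `docker-default (enforce)`, starting the container with `--security-opt apparmor=unconfined` is one option to try.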

---

It is also worth asking: why would one want to run nsjail in a Docker container?
I used it because building nsjail and running it on different machines and distros can be a hassle, and Docker speeds this up (i.e. you have *one* nsjail build & run environment).

Also, if your nsjail config has to be run as root, that is probably equivalent to running it through Docker with `--privileged`.

Mikołaj Rządca

Mar 5, 2019, 8:54:32 AM
to Dominik Czarnota, Robert Święcki, nsjail, dominik.b....@gmail.com
Thank you very much for such a detailed response. There is a third-party app which uses nsjail for isolation. On the other hand, our build environment relies on Docker containers, and for security reasons we don't want to use "--privileged" mode unless it is necessary. I have a problem very similar to the one in your example, with the /proc mounting.

Mikołaj Rządca

Mar 13, 2019, 6:26:55 AM
to nsjail
OK, I got it working.

Could you explain the behavior below? :)

vagrant@localhost:~$ sudo docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt=no-new-privileges --cap-add SYS_ADMIN -v /proc:/new_proc disconnect3d/nsjail /bin/bash
root@471cd0e0c9f0:/# nsjail -R / /bin/ls -- /
[2019-03-13T10:23:51+0000] Mode: STANDALONE_ONCE
[2019-03-13T10:23:51+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/bin/ls', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:true, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[2019-03-13T10:23:51+0000] Mount point: src:'' dst:'/' flags:'MS_RDONLY' type:'tmpfs' options:'' is_dir:true
[2019-03-13T10:23:51+0000] Mount point: src:'/' dst:'/' flags:'MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE' type:'' options:'' is_dir:true
[2019-03-13T10:23:51+0000] Mount point: src:'' dst:'/proc' flags:'MS_RDONLY' type:'proc' options:'' is_dir:true
[2019-03-13T10:23:51+0000] Uid map: inside_uid:0 outside_uid:0 count:1 newuidmap:false
[2019-03-13T10:23:51+0000] [W][15] void cmdline::logParams(nsjconf_t*)():236 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[2019-03-13T10:23:51+0000] Gid map: inside_gid:0 outside_gid:0 count:1 newgidmap:false
[2019-03-13T10:23:51+0000] [W][15] void cmdline::logParams(nsjconf_t*)():247 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[2019-03-13T10:23:51+0000] Executing '/bin/ls' for '[STANDALONE MODE]'
bin  boot  dev etc  home  lib lib64  media  mnt  new_proc  opt  proc root  run  sbin  srv  sys  tmp usr  var
[2019-03-13T10:23:51+0000] PID: 16 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
root@471cd0e0c9f0:/# exit

vagrant@localhost:~$ sudo docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt=no-new-privileges --cap-add SYS_ADMIN disconnect3d/nsjail /bin/bash
root@c649a6874ef7:/# mkdir -p /tmp/nsjail.root/proc
root@c649a6874ef7:/# mount -t proc none /tmp/nsjail.root/proc
root@c649a6874ef7:/# nsjail -R / /bin/ls -- /
[2019-03-13T10:24:42+0000] Mode: STANDALONE_ONCE
[2019-03-13T10:24:42+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/bin/ls', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:true, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[2019-03-13T10:24:42+0000] Mount point: src:'' dst:'/' flags:'MS_RDONLY' type:'tmpfs' options:'' is_dir:true
[2019-03-13T10:24:42+0000] Mount point: src:'/' dst:'/' flags:'MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE' type:'' options:'' is_dir:true
[2019-03-13T10:24:42+0000] Mount point: src:'' dst:'/proc' flags:'MS_RDONLY' type:'proc' options:'' is_dir:true
[2019-03-13T10:24:42+0000] Uid map: inside_uid:0 outside_uid:0 count:1 newuidmap:false
[2019-03-13T10:24:42+0000] [W][17] void cmdline::logParams(nsjconf_t*)():236 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[2019-03-13T10:24:42+0000] Gid map: inside_gid:0 outside_gid:0 count:1 newgidmap:false
[2019-03-13T10:24:42+0000] [W][17] void cmdline::logParams(nsjconf_t*)():247 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[2019-03-13T10:24:42+0000] Executing '/bin/ls' for '[STANDALONE MODE]'
bin  boot  dev etc  home  lib lib64  media  mnt  opt proc  root  run  sbin  srv  sys  tmp  usr  var
[2019-03-13T10:24:42+0000] PID: 18 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
root@c649a6874ef7:/# exit

vagrant@localhost:~$ sudo docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt=no-new-privileges --cap-add SYS_ADMIN disconnect3d/nsjail /bin/bash
root@4f2ae317840d:/# nsjail -R / /bin/ls -- /
[2019-03-13T10:25:25+0000] Mode: STANDALONE_ONCE
[2019-03-13T10:25:25+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/bin/ls', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:true, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[2019-03-13T10:25:25+0000] Mount point: src:'' dst:'/' flags:'MS_RDONLY' type:'tmpfs' options:'' is_dir:true
[2019-03-13T10:25:25+0000] Mount point: src:'/' dst:'/' flags:'MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE' type:'' options:'' is_dir:true
[2019-03-13T10:25:25+0000] Mount point: src:'' dst:'/proc' flags:'MS_RDONLY' type:'proc' options:'' is_dir:true
[2019-03-13T10:25:25+0000] Uid map: inside_uid:0 outside_uid:0 count:1 newuidmap:false
[2019-03-13T10:25:25+0000] [W][15] void cmdline::logParams(nsjconf_t*)():236 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[2019-03-13T10:25:25+0000] Gid map: inside_gid:0 outside_gid:0 count:1 newgidmap:false
[2019-03-13T10:25:25+0000] [W][15] void cmdline::logParams(nsjconf_t*)():247 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[2019-03-13T10:25:25+0000] [W][1] bool mnt::mountPt(mount_t*, const char*, const char*)():204 mount('src:'' dst:'/proc' flags:'MS_RDONLY' type:'proc' options:'' is_dir:true') src:'none' dstpath:'/tmp/nsjail.0.root//proc' failed: Operation not permitted
[2019-03-13T10:25:25+0000] [W][1] bool mnt::mountPt(mount_t*, const char*, const char*)():209 procfs can only be mounted if the original /proc doesn't have any other file-systems mounted on top of it (e.g. /dev/null on top of /proc/kcore): Operation not permitted
[2019-03-13T10:25:25+0000] [F][1] bool subproc::runChild(nsjconf_t*, int, int, int)():432 Launching child process failed
[2019-03-13T10:25:25+0000] [W][15] bool subproc::runChild(nsjconf_t*, int, int, int)():460 Received error message from the child process before it has been executed
[2019-03-13T10:25:25+0000] [E][15] int nsjail::standaloneMode(nsjconf_t*)():146 Couldn't launch the child process
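
The last warning hints at a likely explanation (hedged: it is based on that message and on Docker's default masking of paths such as `/proc/kcore`, not on a verified trace). A new procfs can only be mounted from inside a user namespace if the mount namespace already contains a procfs instance that is not partly covered by other mounts. The first session provides one through `-v /proc:/new_proc`, the second through the manual `mount -t proc`. A small sketch to list what is mounted on top of `/proc` in a container follows; `proc_submounts` is a hypothetical helper name.

```shell
# List mount points located below /proc in a mount namespace.
# Field 5 of a mountinfo line is the mount point.
# Defaults to the current process; pass a path to inspect a saved mountinfo file.
# (proc_submounts is a hypothetical helper name.)
proc_submounts() {
  awk '$5 ~ /^\/proc\// { print $5 }' "${1:-/proc/self/mountinfo}"
}

proc_submounts
```

Any path printed here (for example `/proc/kcore`) means the container's own `/proc` is partly covered, which matches the condition described in the warning.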

Rory Nugent

Dec 5, 2023, 2:33:34 PM
to nsjail
I wanted to follow up here and see if there has been any further progress.

I am working on a similar scenario where we need nsjail functioning within a container on an embedded platform but we are not afforded privileged mode when the Docker container is started.

Is there any path forward for allowing nsjail to run in a non-privileged mode? Or are there any theories on how nsjail may need to be changed to support this?

Thank you in advance!