Need advice about fixing PROC mount failures in a DIY Linux container

Lew Pitcher

Jan 6, 2023, 8:27:31 PM

Hi, all

I've come late to the party, and have just started learning
about the ins and outs of Linux containers. To get a better
understanding of the subject, I decided to learn about the
underlying technologies by building my own container software.

I've modelled my DIY container on Brian Swetland's mkbox
container[1], and have a demonstration program that works
on my development system (a 64bit AMD Ryzen 5 3400G with
Radeon Vega Graphics, running Slackware Linux 14.2 with
the 4.4.301 kernel and all available patches applied).
[1] https://github.com/swetland/mkbox

However, when I run either Brian's mkbox or my demo program
on my "production" system (another 64bit AMD Ryzen 5 3400G
with Radeon Vega Graphics, running Slackware Linux 14.2 with
the 4.4.301 kernel and all available patches applied), the
container breaks while trying to mount the proc filesystem
to the new (isolated) root fs.

Specifically, I get an "Operation not permitted" error when
I try to
mount("proc","proc","proc",MS_REC,NULL)
/but/ ONLY ON THIS ONE SYSTEM.

This failure affects both my DIY container and Brian's mkbox
container.

With my DIY container, I've checked the capabilities given
to the container process, and they are identical and complete
on both systems. On both systems, I run the container process
(mine and Brian's) from the same unprivileged UID/GID.
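
For readers who want to repeat that capability check, one possible approach is to read the CapEff line from /proc/self/status inside the container process. This is an assumption on my part; the post does not say how the check was done. The helper names parse_cap_eff and read_own_cap_eff are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pull the hexadecimal CapEff mask out of text laid
 * out like /proc/<pid>/status. Returns 0 on success, -1 if not found. */
int parse_cap_eff(const char *status_text, unsigned long long *mask)
{
    const char *p = status_text;

    while (p != NULL && *p != '\0') {
        if (strncmp(p, "CapEff:", 7) == 0) {
            const char *q = p + 7;

            while (*q == ' ' || *q == '\t')
                q++;                    /* skip the separator after the label */
            if (!isxdigit((unsigned char)*q))
                return -1;              /* label present but no hex digits */
            *mask = strtoull(q, NULL, 16);
            return 0;
        }
        p = strchr(p, '\n');            /* advance to the start of the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}

/* Hypothetical helper: read the calling process's own effective set. */
int read_own_cap_eff(unsigned long long *mask)
{
    char line[256];
    int rc = -1;
    FILE *fp = fopen("/proc/self/status", "r");

    if (fp == NULL)
        return -1;
    while (rc != 0 && fgets(line, sizeof line, fp) != NULL)
        rc = parse_cap_eff(line, mask);
    fclose(fp);
    return rc;
}
```

Calling read_own_cap_eff() in the forked child just before the proc mount, and printing the mask in hex, would give a value that can be compared between the two machines.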

I have to conclude that there's a difference in the two
environments that causes this problem, but I don't know what
that difference is. Both systems use the same type of CPU, the
same amount of memory, the same 64-bit addressing mode,
the same kernel, and the same distribution (with the same
essential utilities).

There /are/ differences in the two systems:
On the development system, my user is a member of a
number of groups that it is not a member of on the
"production" system. I run a root pulseaudio (I have my
reasons) on the development system that I do not on
the "production" system. Et cetera.

Can anyone suggest an environmental factor or set of
factors that might cause this behaviour?

For reference, I include a copy of a minimal implementation
of my DIY container that illustrates the problem, along with
captures of both a successful run on my development system
and an unsuccessful run on my production system.

========== demo.c ==========
/*
** demonstrate selective problem with Slackware Linux 14.2
** user namespace creation (Kernel 4.4.301)
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sched.h>
#include <string.h>
#include <errno.h>

/* pivot_root() prototype not supplied by headers */
extern int pivot_root(const char *new_root, const char *put_old);

void Die(int line); /* generate error message and exit process */
#define DIE() Die(__LINE__)

int main(void)
{
  char *fauxRoot = "./.fauxroot", /* will be our new root filesystem */
       *oldRoot = ".oldroot", /* where pivot_root puts old root fs */
       *oldProc = ".oldproc", /* where we temp relocate /proc to */
       *newProc = "proc"; /* where we mount /proc to */
  pid_t init_pid;

  umask(0);

  rmdir(fauxRoot); if (mkdir(fauxRoot,0777)) DIE();

  if (unshare(CLONE_NEWUSER|CLONE_NEWNS|CLONE_NEWPID)) DIE();

  if (mount("none","/",NULL,MS_REC|MS_PRIVATE,NULL)) DIE();
  if (mount(fauxRoot,fauxRoot,NULL,MS_BIND|MS_NOSUID,NULL)) DIE();
  if (chdir(fauxRoot)) DIE();

  rmdir(oldRoot); if (mkdir(oldRoot,0751)) DIE();
  rmdir(oldProc); if (mkdir(oldProc,0755)) DIE();
  rmdir(newProc); if (mkdir(newProc,0755)) DIE();

  if (mount("/proc",oldProc,NULL,MS_BIND|MS_REC,NULL)) DIE();

  /* set new uid, gid */
  {
    FILE *map;

    if ((map = fopen("/proc/self/uid_map","w")) == NULL) DIE();
    fprintf(map,"0 %lu 1\n",(unsigned long)getuid());
    fclose(map);

    if ((map = fopen("/proc/self/setgroups","w")) == NULL) DIE();
    fwrite("deny",4,1,map);
    fclose(map);

    if ((map = fopen("/proc/self/gid_map","w")) == NULL) DIE();
    fprintf(map,"0 %lu 1\n",(unsigned long)getgid());
    fclose(map);
  }

  if (pivot_root(".",oldRoot)) DIE();
  if (umount2(oldRoot,MNT_DETACH)) DIE();
  if (rmdir(oldRoot)) DIE();

  switch (init_pid = fork())
  {
    case -1:
      DIE();
      break;

    case 0:
      if (mount("/proc",newProc,"proc",MS_REC,NULL)) DIE();
      if (umount2(oldProc,MNT_DETACH)) DIE();
      if (rmdir(oldProc)) DIE();
      printf("INIT: my pid is %lu\n",(unsigned long)getpid());
      break;

    default:
      printf("PARENT: INIT pid is %lu\n",(unsigned long)init_pid);
      wait(NULL);
      break;
  }

  return EXIT_SUCCESS;
}

void Die(int line)
{
  fprintf(stderr,"Error encountered at line %d: %s\n",line,strerror(errno));
  exit(EXIT_FAILURE);
}

========== successful execution on development system ==========
Script started on Fri 06 Jan 2023 08:20:12 PM EST
20:20 $ uname -a
Linux wordsworth 4.4.301 #1 SMP Mon Jan 31 20:27:28 CST 2022 x86_64 AMD Ryzen 5 3400G with Radeon Vega Graphics AuthenticAMD GNU/Linux
20:20 $ cat /etc/slackware-version
Slackware 14.2
20:20 $ rm demo
20:20 $ rm -rf .fauxroot
20:20 $ cc -o demo demo.c
20:20 $ ./demo
PARENT: INIT pid is 558
INIT: my pid is 1
20:20 $ ls -laR .fauxroot
.fauxroot:
total 12
drwxrwxrwx 3 lpitcher users 4096 Jan 6 20:20 .
drwxr-xr-x 6 lpitcher users 4096 Jan 6 20:20 ..
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:20 proc

.fauxroot/proc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:20 .
drwxrwxrwx 3 lpitcher users 4096 Jan 6 20:20 ..
20:21 $ exit
exit

Script done on Fri 06 Jan 2023 08:21:02 PM EST

========== unsuccessful execution on production system ==========
Script started on Fri Jan 6 20:21:11 2023
~/code/namespaces $ uname -a
Linux merlin 4.4.301 #1 SMP Mon Jan 31 20:27:28 CST 2022 x86_64 AMD Ryzen 5 3400G with Radeon Vega Graphics AuthenticAMD GNU/Linux
~/code/namespaces $ cat /etc/slackware-version
Slackware 14.2
~/code/namespaces $ rm demo
~/code/namespaces $ rm -rf .fauxroot
~/code/namespaces $ cc -o demo demo.c
~/code/namespaces $ ./demo
PARENT: INIT pid is 1651
Error encountered at line 77: Operation not permitted
~/code/namespaces $ nl -ba demo.c | grep ' 77'
77 if (mount("/proc",newProc,"proc",MS_REC,NULL)) DIE();
~/code/namespaces $ ls -laR .fauxroot
.fauxroot:
total 16
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 .
drwxr-xr-x 6 lpitcher users 4096 Jan 6 20:21 ..
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .oldproc
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 proc

.fauxroot/.oldproc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 ..

.fauxroot/proc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 ..
~/code/namespaces $ exit
exit

Script done on Fri Jan 6 20:22:50 2023

--
Lew Pitcher
"In Skills, We Trust"

Lew Pitcher

Jan 6, 2023, 9:12:46 PM

[snip]

Well, I can answer my own question, now. But the answer
leads to more questions.

The reason I get "Operation not permitted" on the
container /proc mount on my "production" system is that
I also run an nfs server on my "production" system (and
do not run one on my development system), and this nfs
server maintains two mountpoints within the /proc
filesystem.

Apparently, the attempt to mount /proc within my container
was blocked by the existence of these two mount points
(/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
rpc and nfs servers, and umounted these two mounts, I could
successfully run my demo container.

/Now/ the question is: how do I get my container /proc mount
to ignore or bypass these two nfsd mounts?
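
To see such mounts before starting the container, one option is to scan /proc/self/mountinfo for mount points that sit below /proc. What follows is only a minimal sketch, assuming the usual mountinfo layout in which the fifth whitespace-separated field is the mount point. The function names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the mount point (5th field) of one mountinfo
 * line into out. Returns 0 on success, -1 if the line does not parse. */
int parse_mount_point(const char *line, char out[1024])
{
    return sscanf(line, "%*d %*d %*s %*s %1023s", out) == 1 ? 0 : -1;
}

/* Returns 1 if mount_point lies strictly below /proc, 0 otherwise. */
int is_below_proc(const char *mount_point)
{
    return strncmp(mount_point, "/proc/", 6) == 0 && mount_point[6] != '\0';
}

/* Print every mount point below /proc found in an open mountinfo stream
 * and return how many were found. */
int list_mounts_below_proc(FILE *mountinfo)
{
    char line[4096];
    char mount_point[1024];
    int count = 0;

    while (fgets(line, sizeof line, mountinfo) != NULL) {
        if (parse_mount_point(line, mount_point) == 0 &&
            is_below_proc(mount_point)) {
            printf("%s\n", mount_point);
            count++;
        }
    }
    return count;
}
```

Passing the stream from fopen("/proc/self/mountinfo","r") to list_mounts_below_proc() on the "production" system should, if the diagnosis above holds, print the two nfs-related mount points while the nfs server is running.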

Jasen Betts

Jan 7, 2023, 2:30:39 AM

On 2023-01-07, Lew Pitcher <lew.p...@digitalfreehold.ca> wrote:
> On Sat, 07 Jan 2023 01:27:28 +0000, Lew Pitcher wrote:

>> I try to
>> mount("proc","proc","proc",MS_REC,NULL)
>> /but/ ONLY ON THIS ONE SYSTEM.

> Well, I can answer my own question, now. But the answer
> leads to more questions.
>
> The reason I get "Operation not permitted" on the
> container /proc mount on my "production" system is that
> I also run an nfs server on my "production" system (and
> do not run one on my development system), and this nfs
> server maintains two mountpoints within the /proc
> filesystem.
>
> Apparently, the attempt to mount /proc within my container
> was blocked by the existence of these two mount points
> (/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
> rpc and nfs servers, and umounted these two mounts, I could
> successfully run my demo container.
>
> /Now/ the question is: how do I get my container /proc mount
> to ignore or bypass these two nfsd mounts?

What's the difference between mount() and /bin/mount?

--
Jasen.
pǝsɹǝʌǝɹ sʇɥƃᴉɹ ll∀

John-Paul Stewart

Jan 7, 2023, 11:42:12 AM

[Followups set to comp.os.linux.misc since I don't read any of the other
groups]

On 1/6/23 21:12, Lew Pitcher wrote:
>
> The reason I get "Operation not permitted" on the
> container /proc mount on my "production" system is that
> I also run an nfs server on my "production" system (and
> do not run one on my development system), and this nfs
> server maintains two mountpoints within the /proc
> filesystem.
>
> Apparently, the attempt to mount /proc within my container
> was blocked by the existence of these two mount points
> (/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
> rpc and nfs servers, and umounted these two mounts, I could
> successfully run my demo container.
>
> /Now/ the question is: how do I get my container /proc mount
> to ignore or bypass these two nfsd mounts?

In your OP you showed that you've got MS_REC in the mountflags field,
which will cause a recursive mount; i.e., you've explicitly asked for
the inclusion of the NFS-related subtrees. Have you tried without that
flag? MS_BIND would seem a more appropriate choice instead, IMHO, since
it doesn't do the recursion. Then, by default, the subtrees will be
excluded.

See also the section on "Changing the propagation type of an existing
mount" in the mount(2) man page for other ways to prevent the NFS
subtrees from being processed recursively. That might be relevant if
you want to recurse into other parts of the /proc tree, just not the two
directories you've named.
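
For reference, the demo program already changes propagation once, in its mount("none","/",NULL,MS_REC|MS_PRIVATE,NULL) call. A minimal sketch of that kind of call as a helper follows; the name make_private is hypothetical. Note that mount(2) documents MS_REC as working together with MS_BIND or with the propagation flags, so whether it has any effect on a mount of type "proc" is something to verify rather than assume.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical helper: mark an existing mount point private, optionally
 * including every mount beneath it. Returns 0 or a negative errno value.
 * The source and filesystem type arguments are ignored for this kind of
 * call, so "none" and NULL are passed as in the demo program. */
int make_private(const char *mount_point, int recursive)
{
    unsigned long flags = MS_PRIVATE;

    if (recursive)
        flags |= MS_REC;
    if (mount("none", mount_point, NULL, flags, NULL) != 0)
        return -errno;
    return 0;
}
```

The caller needs mount privilege in its own namespace, which the demo has after unshare().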

Rainer Weikusat

Jan 9, 2023, 2:27:19 PM

Lew Pitcher <lew.p...@digitalfreehold.ca> writes:

[...]

> Well, I can answer my own question, now. But the answer
> leads to more questions.
>
> The reason I get "Operation not permitted" on the
> container /proc mount on my "production" system is that
> I also run an nfs server on my "production" system (and
> do not run one on my development system), and this nfs
> server maintains two mountpoints within the /proc
> filesystem.
>
> Apparently, the attempt to mount /proc within my container
> was blocked by the existence of these two mount points
> (/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
> rpc and nfs servers, and umounted these two mounts, I could
> successfully run my demo container.
>
> /Now/ the question is: how do I get my container /proc mount
> to ignore or bypass these two nfsd mounts?

Instead of doing a bind mount of a proc filesystem already mounted
somewhere, you could mount a new instance of it. The command for this
would be

mount -t proc proc <mount point>

You'll generally also want to mount sysfs, BTW.
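
In C, the rough equivalent of that command is a mount(2) call that names the filesystem type. A minimal sketch, with a hypothetical helper name (mount_fresh) and flags of 0 chosen only to mirror the bare command. The demo's failing call at line 77 has the same shape (type "proc"), so this is shown for comparison rather than as a confirmed fix.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical helper: mount a new instance of a pseudo filesystem such
 * as "proc" or "sysfs" on target. The source string is only a label for
 * these filesystems, so the type name is reused, as in
 * "mount -t proc proc <mount point>".
 * Returns 0 on success or a negative errno value on failure. */
int mount_fresh(const char *fstype, const char *target)
{
    if (mount(fstype, target, fstype, 0, NULL) != 0)
        return -errno;
    return 0;
}
```

A container init might call mount_fresh("proc", "proc") followed by mount_fresh("sysfs", "sys"), checking each result; the "sys" directory name is an assumption, since the demo only creates "proc".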

Joseph Rosevear

Feb 17, 2023, 8:18:00 PM

I'm going to try *again* to reply (having trouble):

Because I write bash code, your question looks to me like you are asking
for the difference between a shell function called mount and the mount
executable at /bin/mount.

For example you could write a function definition, source it, and then
use it to perform the real (or modified) /bin/mount. Here is such a
function definition with no modification:

mount() {

    local device
    local point

    device="$1"
    point="$2"

    /bin/mount "$device" "$point"

    return
}

Notice the "mount()" syntax in the function definition. That is what
prompted me to respond as I did.

-Joe

Henrik Carlqvist

Feb 18, 2023, 5:26:24 AM

On Sat, 18 Feb 2023 01:17:58 +0000, Joseph Rosevear wrote:

> On Sat, 7 Jan 2023 07:06:37 -0000 (UTC), Jasen Betts wrote:
>> What's the difference between mount() and /bin/mount?
>
> I'm going to try *again* to reply (having trouble):
>
> Because I write bash code, your question looks to me like you are asking
> for the difference between a shell function called mount and the mount
> executable at /bin/mount.
>
> For example you could write a function definition, source it, and then
> use it to perform the real (or modified) /bin/mount. Here is such a
> function definition with no modification:

Actually, it is exactly as you say that the mount command described by:

man 8 mount

is the executable /bin/mount

But, the mount call described by the man page:

man 2 mount

is not some bash function you are supposed to write yourself but the C
API system call to the Linux kernel. Lew, in his original post
(crossposted to 4 different newsgroups), was writing a C program called
demo.c. From such a C program it is best to use the C API and call
mount(), but it would also be possible to call system("/bin/mount ...").

regards Henrik
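
To make the second option concrete, here is a minimal sketch of going through /bin/mount with system() from C. The helper names are hypothetical, and the character filter is a deliberately crude assumption about what is safe to hand to the shell; the direct alternative is simply mount(source, target, fstype, 0, NULL) from <sys/mount.h>.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Returns 1 if s is non-empty and free of characters the shell would
 * treat specially (a conservative, hypothetical filter). */
static int shell_safe(const char *s)
{
    return s[0] != '\0' && strpbrk(s, " \t\n\"'\\$`;&|<>(){}*?!#~") == NULL;
}

/* Build "/bin/mount -t <fstype> <source> <target>" in buf.
 * Returns 0 on success, -1 if an argument is unsafe or buf is too small. */
int build_mount_command(char *buf, size_t size, const char *fstype,
                        const char *source, const char *target)
{
    int n;

    if (!shell_safe(fstype) || !shell_safe(source) || !shell_safe(target))
        return -1;
    n = snprintf(buf, size, "/bin/mount -t %s %s %s", fstype, source, target);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Run the command through the shell. Returns the exit status of
 * /bin/mount, or -1 if it could not be built, run, or exited abnormally. */
int run_bin_mount(const char *fstype, const char *source, const char *target)
{
    char cmd[512];
    int status;

    if (build_mount_command(cmd, sizeof cmd, fstype, source, target) != 0)
        return -1;
    status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The system() route costs an extra shell and process and only reports an exit status, while the direct call reports the reason for failure through errno, which is how the demo program obtained its "Operation not permitted" message.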