Shared files within a jail

Hans Zaunere

Nov 12, 2002, 10:08:47 PM

After much searching and contemplation, I've decided to ask the
question directly:

I'm implementing a jail server, which will provide a very limited set
of resources (Apache/MySQL/PHP). Setup is going well, however I've run
into a little snag that I hope can be worked out.

I want to allow the users the ability to compile and use their own
instances of Apache and MySQL from within the jail. But instead of
duplicating the basic system libs and bins, I'd like to maintain a
single repository of this, which can then be read-only from within the
jail. Options:

-- Symlinks won't work because of the chroot.
-- Mounts from within the jail aren't allowed, plus a single partition
can't be mounted multiple times, AFAIK.
-- I don't have NFS setup, and I would like to avoid it as much as
possible.
-- mount_null seems to be the answer, however the warning at the end of
the man page is scary.

Is there any combination of these (or anything I'm forgetting) that
could help me here? Is mount_null stable?

I've had an account on a jail server which had /shared visible within
the jail, and symlinks to /bin, /usr/lib and such. I'm not sure how
this was actually implemented, and I'd be interested if anyone has seen
or heard of any solutions to this type of problem.

Best,

=====
Hans Zaunere
New York PHP
http://nyphp.org
ha...@nyphp.org

To Unsubscribe: send mail to majo...@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

Daniel O'Connor

Nov 12, 2002, 10:36:50 PM

On Wed, 2002-11-13 at 13:38, Hans Zaunere wrote:
> -- Symlinks won't work because of the chroot.
> -- Mounts from within the jail aren't allowed, plus a single partition
> can't be mounted multiple times, AFAIK.
> -- I don't have NFS setup, and I would like to avoid it as much as
> possible.
> -- mount_null seems to be the answer, however the warning at the end of
> the man page is scary.
>
> Is there any combination of these (or anything I'm forgetting) that
> could help me here? Is mount_null stable?
>
> I've had an account on a jail server which had /shared visible within
> the jail, and symlinks to /bin, /usr/lib and such. I'm not sure how
> this was actually implemented, and I'd be interested if anyone has seen
> or heard of any solutions to this type of problem.

You should be able to use hardlinks for this sort of thing.

Make sure you mark them immutable though, otherwise someone in a jail
could compromise other users of those libraries [in another jail].

--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
-- Andrew Tanenbaum
GPG Fingerprint - 9A8C 569F 685A D928 5140 AE4B 319B 41F4 5D17 FDD5

Hans Zaunere

Nov 12, 2002, 10:47:26 PM

> > I've had an account on a jail server which had /shared visible
> > within the jail, and symlinks to /bin, /usr/lib and such. I'm not
> > sure how this was actually implemented, and I'd be interested if
> > anyone has seen or heard of any solutions to this type of problem.
>
> You should be able to use hardlinks for this sort of thing.

Two issues arise:
1) I'd like to be able to link an entire directory for convenience and
maintenance purposes.

2) Cross partition links not possible.

Number 2 is really the kicker, as far as I can tell. Is there some way
around this?

Hans

Matthew Dillon

Nov 13, 2002, 12:30:59 AM

Try using null mounts. The warning is in there because making the
null mount code work is a real hack and the authors aren't entirely
sure that everything's gotten covered. That said, use of a null mount
is certainly a lot safer if the stuff behind the mount is mostly
static.

Note that you can also use localhost NFS mounts to replicate pieces of
filesystems within jails, but you need to remember that the kernel
will wind up caching multiple copies of the data for these two cases
and that NFS has file locking issues.

Finally, keep in mind that disk space these days is quite cheap.
Duplicating the data is not as bad a way to go as you might think, and
it allows you to incrementally upgrade each jail. It may suffice to use
the null mount trick *only* for the big binaries and libraries that you
really want to share, and it may be reasonable to use softlinks to
accomplish it, like this:

JAIL FILESYSTEM:

    /           complete copy of /
    /usr        complete copy of /usr
    /mnt        null mount of the master /
    /mnt/usr    null mount of the master /usr

And then use softlinks to enforce binary sharing by default:

    /bin/*              instead of the binaries make softlinks to /mnt/bin
    /usr/bin/*          ... softlinks to /mnt/usr/bin
    /usr/lib/*          ... softlinks to /mnt/usr/lib
    /usr/local/lib/*    ... softlinks to /mnt/usr/local/lib
    /usr/local/bin/*    ... softlinks to /mnt/usr/local/bin

So that way the user can remove the softlink and install his own
copy of the software if he wishes, and mess with anything else as well.

That's just an example. There are a thousand ways to do it.

-Matt
Matthew Dillon
<dil...@backplane.com>
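
A minimal sketch of the softlink step in the layout above, assuming the master tree has already been null-mounted at /mnt inside the jail. The function name link_shared and the /master and /jails/j1 paths are hypothetical and only serve the illustration; the mount_null lines appear as comments because they need root on the host.

```shell
#!/bin/sh
# Sketch only: populate a jail directory with symlinks into the shared
# (null-mounted) copy. Function name and paths are illustrative.
#
# On the host, before starting the jail (not executed here, needs root):
#   mount_null -o ro /master     /jails/j1/mnt
#   mount_null -o ro /master/usr /jails/j1/mnt/usr

# link_shared MASTER_DIR JAIL_ROOT SUBDIR
#   For every entry in MASTER_DIR/SUBDIR, create JAIL_ROOT/SUBDIR/<name>
#   as a symlink to /mnt/SUBDIR/<name>. The link target is the path as
#   seen from inside the jail, so it still resolves after the chroot.
link_shared() {
    master=$1
    jail=$2
    sub=$3
    mkdir -p "$jail/$sub" || return 1
    for path in "$master/$sub"/*; do
        [ -e "$path" ] || continue      # empty directory: glob did not match
        name=${path##*/}
        # Leave anything the jail owner already installed untouched.
        if [ -e "$jail/$sub/$name" ] || [ -L "$jail/$sub/$name" ]; then
            continue
        fi
        ln -s "/mnt/$sub/$name" "$jail/$sub/$name"
    done
}
```

Called as `link_shared /master /jails/j1 usr/bin`, this would leave a user's own /usr/bin/foo in place and link everything else to the shared copy, which matches the "remove the softlink and install his own copy" idea.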

Cameron Grant

Nov 13, 2002, 1:33:31 AM

> Try using null mounts. The warning is in there because making the
> null mount code work is a real hack and the authors aren't entirely
> sure that everything's gotten covered. That said, use of a null mount
> is certainly a lot safer if the stuff behind the mount is mostly
> static.

null mounts, in -stable at least, are broken for this purpose. on
connection, sshd revoke()s some device- its pty, i assume, and when this
hits the nullfs layer a null pointer is dereferenced. if i had vfs-clue i'd
have fixed it when i found the panic about two weeks ago. when i overcame
this by putting the jails /dev on an nfs loopback, i managed to produce two
more different panics.

-cg

Daniel O'Connor

Nov 12, 2002, 10:56:08 PM

On Wed, 2002-11-13 at 14:17, Hans Zaunere wrote:
> Two issues arise:
> 1) I'd like to be able to link an entire directory for convenience and
> maintenance purposes.

Write a script :)

> 2) Cross partition links not possible.
>
> Number 2 is really the kicker, as far as I can tell. Is there some way
> around this?

Don't think so, you're stuck :(

--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
-- Andrew Tanenbaum
GPG Fingerprint - 9A8C 569F 685A D928 5140 AE4B 319B 41F4 5D17 FDD5
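
A minimal sketch of the kind of script suggested above for hard-linking a whole directory tree. The function name hardlink_tree is hypothetical, and the approach only works when the master tree and the jail live on the same filesystem, which is exactly the cross-partition limit discussed in this message.

```shell
#!/bin/sh
# Sketch only: mirror a master directory into a jail using hard links.
# Both trees must be on the same filesystem; ln(1) fails otherwise.
# File names containing newlines are not handled.

# hardlink_tree MASTER_DIR JAIL_DIR
hardlink_tree() {
    master=$1
    jail=$2
    # Recreate the directory skeleton first (directories cannot be hard-linked).
    (cd "$master" && find . -type d) | while IFS= read -r d; do
        mkdir -p "$jail/$d"
    done
    # Then link every regular file to the same relative path in the jail.
    (cd "$master" && find . -type f) | while IFS= read -r f; do
        ln -f "$master/$f" "$jail/$f"
    done
}

# Afterwards, on FreeBSD and as root, the shared files could be marked
# immutable so that no jail can alter them for the others (not run here):
#   chflags -R schg /path/to/master
```

Because file flags belong to the inode rather than to the name, setting the flag on the master copy would cover every hard link to it; it has to happen after linking, since new links to an immutable file are refused.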

Matthew Dillon

Nov 13, 2002, 1:43:38 AM

:> is certainly a lot safer if the stuff behind the mount is mostly
:> static.
:
:null mounts, in -stable at least, are broken for this purpose. on
:connection, sshd revoke()s some device- its pty, i assume, and when this
:hits the nullfs layer a null pointer is dereferenced. if i had vfs-clue i'd
:have fixed it when i found the panic about two weeks ago. when i overcame
:this by putting the jails /dev on an nfs loopback, i managed to produce two
:more different panics.
:
: -cg

Well, that sounds like an addressable bug. But I don't see any
particular reason why it would be a show-stopper. /dev doesn't
take up any significant amount of space, just copy it for each jail.

-Matt

Terry Lambert

Nov 13, 2002, 1:55:25 AM

Hans Zaunere wrote:
> I want to allow the users the ability to compile and use their own
> instances of Apache and MySQL from within the jail. But instead of
> duplicating the basic system libs and bins, I'd like to maintain a
> single repository of this, which can then be read-only from within the
> jail. Options:
>
> -- Symlinks won't work because of the chroot.
> -- Mounts from within the jail aren't allowed, plus a single partition
> can't be mounted multiple times, AFAIK.
> -- I don't have NFS setup, and I would like to avoid it as much as
> possible.
> -- mount_null seems to be the answer, however the warning at the end of
> the man page is scary.

It's less scary, since you will be mounting read-only.

-- Terry

Terry Lambert

Nov 13, 2002, 2:09:45 AM

Matthew Dillon wrote:
> Try using null mounts. The warning is in there because making the
> null mount code work is a real hack and the authors aren't entirely
> sure that everything's gotten covered. That said, use of a null mount
> is certainly a lot safer if the stuff behind the mount is mostly
> static.

The problem is in the VM object alias code. Specifically, the
getpages/putpages have to be implemented in terms of read/write,
so that there are not two vm_object_t's that refer to the same
data, since there is no "upcall" to notify of changes in a lower
layer, and therefore guarantee coherency.

This basically means that the "pig tricks" that most people who
don't know any better do, like using both mmap() and file I/O
against the same file, require explicit calls to msync() to
ensure cache coherency. Most people who write code these days
don't expect to have to call msync, and even if they expect to,
they're not entirely sure of when/why/how to call it.

This is the same reason that dropping the getpages/putpages VOPs
from the SMBFS implementation "fixes" the "cp" problem (by making
"cp" work like "dd", by converting the getpages() request into a
read() request, instead). But doing that introduces the same
cache coherency problems, again.

You can basically ignore this problem entirely, since your mounts
are going to be read-only, and you aren't going to have to worry
about someone dirtying pages through a nullfs mount.

> Note that you can also use localhost NFS mounts to replicate pieces of
> filesystems within jails, but you need to remember that the kernel
> will wind up caching multiple copies of the data for these two cases
> and that NFS has file locking issues.

Yes. This will also work, if the man page for nullfs turns out to
be "too scary". ;^). Same coherency issues.

> Finally, keep in mind that disk space these days is quite cheap.
> Duplicating the data is not as bad a way to go as you might think, and
> it allows you to incrementally upgrade each jail. It may suffice to use
> the null mount trick *only* for the big binaries and libraries that you
> really want to share, and it may be reasonable to use softlinks to
> accomplish it, like this:

And, in fact, this is what I tend to do. But since the case in point
is for MySQL/Apache/etc., there's probably a lot more jail instances
than what you are used to seeing. This is a shared hosting platform,
which is trying to pretend it's not shared, right?

If you go this route, you may want to bump up the number of inodes
by quite a bit above the default...

-- Terry
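
A rough way to size the inode remark above, assuming every jail gets its own copy (or its own set of softlinks) of a template tree. The function name inodes_needed and the template path are hypothetical; the count is an estimate, not a measurement of any particular setup.

```shell
#!/bin/sh
# Sketch only: rough inode demand when each jail carries its own copy
# of a template tree. Every file, directory and symlink costs one inode.

# inodes_needed TEMPLATE_DIR JAIL_COUNT
inodes_needed() {
    # find prints one line per entry, including TEMPLATE_DIR itself.
    per_jail=$(find "$1" | wc -l | tr -d ' ')
    echo $((per_jail * $2))
}
```

For example, `inodes_needed /jails/template 50` gives a figure to compare against the free inode count that `df -i` reports; if it does not fit, the inode density can be raised when the filesystem is created (newfs takes it as `-i`, bytes per inode).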

Terry Lambert

Nov 13, 2002, 2:11:54 AM

Cameron Grant wrote:
> null mounts, in -stable at least, are broken for this purpose. on
> connection, sshd revoke()s some device- its pty, i assume, and when this
> hits the nullfs layer a null pointer is dereferenced. if i had vfs-clue i'd
> have fixed it when i found the panic about two weeks ago. when i overcame
> this by putting the jails /dev on an nfs loopback, i managed to produce two
> more different panics.

1) Use devfs instead.

2) Mount a devfs instance in each jail. Problem solved.

-- Terry

Matthew Dillon

Nov 13, 2002, 6:14:03 AM

:> Try using null mounts. The warning is in there because making the
:> null mount code work is a real hack and the authors aren't entirely
:> sure that everything's gotten covered. That said, use of a null mount
:> is certainly a lot safer if the stuff behind the mount is mostly
:> static.
:
:The problem is in the VM object alias code. Specifically, the
:getpages/putpages have to be implemented in terms of read/write,
:so that there are not two vm_object_t's that refer to the same
:data, since there is no "upcall" to notify of changes in a lower
:layer, and therefore guarantee coherency.

I'm fairly sure the VM issues were fixed when VOP_GETVOBJECT was
added. A file accessed via a null mount will have the same VM object
as the file in the original filesystem. I'm not 100% sure about
that, I wasn't the one who did it, but I seem to recall it being
discussed.

-Matt

Pawel Jakub Dawidek

Nov 13, 2002, 6:23:25 AM

On Tue, Nov 12, 2002 at 07:08:47PM -0800, Hans Zaunere wrote:
+> -- mount_null seems to be the answer, however the warning at the end of
+> the man page is scary.
+>
+> Is there any combination of these (or anything I'm forgetting) that
+> could help me here? Is mount_null stable?

I've been using mount_null(8) for my jails for a long time and everything
works fine.

milla:root:~# mount | grep null | wc -l
22

--
Pawel Jakub Dawidek
UNIX Systems Administrator
http://garage.freebsd.pl
Am I Evil? Yes, I Am.

The Anarcat

Nov 13, 2002, 12:16:17 PM

On Tue Nov 12, 2002 at 11:11:54PM -0800, Terry Lambert wrote:
> Cameron Grant wrote:
> > null mounts, in -stable at least, are broken for this purpose. on
> > connection, sshd revoke()s some device- its pty, i assume, and when this
> > hits the nullfs layer a null pointer is dereferenced. if i had vfs-clue i'd
> > have fixed it when i found the panic about two weeks ago. when i overcame
> > this by putting the jails /dev on an nfs loopback, i managed to produce two
> > more different panics.
>
> 1) Use devfs instead.

On -stable?

A.

Dmitry Morozovsky

Nov 13, 2002, 2:20:57 PM

On Tue, 12 Nov 2002, Hans Zaunere wrote:

HZ> After much searching and contemplation, I've decided to ask the
HZ> question directly:
HZ>
HZ> I'm implementing a jail server, which will provide a very limited set
HZ> of resources (Apache/MySQL/PHP). Setup is going well, however I've run
HZ> into a little snag that I hope can be worked out.
HZ>
HZ> I want to allow the users the ability to compile and use their own
HZ> instances of Apache and MySQL from within the jail. But instead of
HZ> duplicating the basic system libs and bins, I'd like to maintain a
HZ> single repository of this, which can then be read-only from within the
HZ> jail. Options:
HZ>
HZ> -- Symlinks won't work because of the chroot.
HZ> -- Mounts from within the jail aren't allowed, plus a single partition
HZ> can't be mounted multiple times, AFAIK.
HZ> -- I don't have NFS setup, and I would like to avoid it as much as
HZ> possible.
HZ> -- mount_null seems to be the answer, however the warning at the end of
HZ> the man page is scary.
HZ>
HZ> Is there any combination of these (or anything I'm forgetting) that
HZ> could help me here? Is mount_null stable?
HZ>
HZ> I've had an account on a jail server which had /shared visible within
HZ> the jail, and symlinks to /bin, /usr/lib and such. I'm not sure how
HZ> this was actually implemented, and I'd be interested if anyone has seen
HZ> or heard of any solutions to this type of problem.

I did multiple sets of

null:/shared/J/usr /J/jailNN/usr
procfs /J/jailNN/proc
mfs:48k /J/jailNN/dev

with a bit of tweaking such as:
/bin and /sbin moved to ${JHOME}/usr/Rbin and /Rsbin and symlinked,
/usr/home and /usr/local have moved out to jail home and symlinked

for a standard jail there is also a useful mount such as

null:/shared/J/local /J/jailNN/local

... and it at least seems workable for some ten to twenty jails on a moderately
powerful (1.5 GHz Athlon with 512M of memory) machine. All jails are rather
lightweight (running only Apache/PHP besides the base system), though.

Sincerely,
D.Marck [DM5020, DM268-RIPE, DM3-RIPN]
------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***
------------------------------------------------------------------------
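
A minimal sketch of how the per-jail null mounts listed above could be generated for a range of jails. It only prints the commands (a dry run) and does not mount anything. The function name print_jail_mounts, the zero-padded two-digit jail numbers, and the read-only option are assumptions layered on the /shared/J and /J/jailNN layout from this message.

```shell
#!/bin/sh
# Sketch only: print (do not run) the null mounts for a range of jails,
# following the /shared/J -> /J/jailNN layout quoted above.

# print_jail_mounts FIRST LAST
print_jail_mounts() {
    n=$1
    while [ "$n" -le "$2" ]; do
        # Assumption: NN is a zero-padded two-digit jail number.
        jail=$(printf '/J/jail%02d' "$n")
        # -o ro keeps the shared tree read-only inside the jail.
        echo "mount_null -o ro /shared/J/usr $jail/usr"
        echo "mount_null -o ro /shared/J/local $jail/local"
        n=$((n + 1))
    done
}
```

Running `print_jail_mounts 1 20` and reviewing the output before piping it to sh (as root) keeps the actual mounting a deliberate, separate step.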

Hans Zaunere

Nov 13, 2002, 3:27:35 PM

--- Terry Lambert <tlam...@mindspring.com> wrote:
> Hans Zaunere wrote:
> > I want to allow the users the ability to compile and use their own
> > instances of Apache and MySQL from within the jail. But instead of
> > duplicating the basic system libs and bins, I'd like to maintain a
> > single repository of this, which can then be read-only from within
> the jail. Options:
> >
> > -- Symlinks won't work because of the chroot.
> > -- Mounts from within the jail aren't allowed, plus a single
> partition
> > can't be mounted multiple times, AFAIK.
> > -- I don't have NFS setup, and I would like to avoid it as much as
> > possible.
> > -- mount_null seems to be the answer, however the warning at the
> end of
> > the man page is scary.
>
> It's less scary, since you will be mounting read-only.

I thank everyone for their suggestions, and I think I will go with null
mounts, since it will in fact be read-only.

I'd like to add that I think a completion of mount_null (and taking out
the fright from the bottom of the man page :) would be greatly
appreciated, since the functionality it provides is very valuable to
running jails. I'm also looking forward to the next "version" of jail
implementation!

Thanks again all and keep up the excellent work,

=====
Hans Zaunere
New York PHP
http://nyphp.org
ha...@nyphp.org

Terry Lambert

Nov 13, 2002, 6:28:22 PM

Pawel Jakub Dawidek wrote:
> On Tue, Nov 12, 2002 at 07:08:47PM -0800, Hans Zaunere wrote:
> +> -- mount_null seems to be the answer, however the warning at the end of
> +> the man page is scary.
> +>
> +> Is there any combination of these (or anything I'm forgetting) that
> +> could help me here? Is mount_null stable?
>
> I've been using mount_null(8) for my jails for a long time and everything
> works fine.

Don't worry about it. It's only a problem for mmap'ed files
which are also read/written. Sheesh.

-- Terry

Pawel Jakub Dawidek

Nov 13, 2002, 6:45:12 PM

On Wed, Nov 13, 2002 at 12:27:35PM -0800, Hans Zaunere wrote:
+> [...] I'm also looking forward to the next "version" of jail
+> implementation!

Are you talking about jailNG? If I understand correctly, there will
be no jailNG; TrustedBSD features will handle the jail-related things.
Am I wrong?

Pawel Jakub Dawidek

Nov 13, 2002, 6:49:05 PM

On Wed, Nov 13, 2002 at 03:28:22PM -0800, Terry Lambert wrote:
+> Don't worry about it. It's only a problem for mmap'ed files
+> which are also read/written. Sheesh.

I have found one little bug in nullfs. I sent it some time ago
to hackers@, but without any response.

Here it is, maybe someone could check it:

-----[ start mail ]-----
I have found something like this, but I'm not sure if this
is a bug in nullfs:

# cd
# mkdir dir1
# mkdir dir1/dir2
# mkdir dir3
# mount_null dir1 dir3

Now a simple program, "test":

-----[ start ]-----
#include <sys/param.h>
#include <sys/syscall.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>	/* exit() */
#include <string.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char buf[MAXPATHLEN];

    /* I just want to be sure that I'm calling syscall directly. */
    if (syscall(SYS___getcwd, buf, sizeof buf) != 0) {
        fprintf(stderr, "%s: %s\n", argv[0], strerror(errno));
        exit(1);
    }
    printf("out: [%s]\n", buf);
    exit(0);
}
-----[ end ]-----

And now:

# cd ~/dir3/dir2
# /path/to/test
/path/to/test: Not a directory

Problem is here (line 571 in /sys/kern/vfs_cache.c):

    if (vp->v_dd->v_id != vp->v_ddid) {
        numcwdfail1++;
        free(buf, M_TEMP);
        return (ENOTDIR);
    }

If "dir3" is for example NFS mount-point there are no problems.
Any ideas?
-----[ end mail ]-----

Matthew Dillon

Nov 13, 2002, 6:58:10 PM

:> I'm fairly sure the VM issues were fixed when VOP_GETVOBJECT was
:> added. A file accessed via a null mount will have the same VM object
:> as the file in the original filesystem. I'm not 100% sure about
:> that, I wasn't the one who did it, but I seem to recall it being
:> discussed.
:
:VOP_GETVOBJECT is a different name, but the VOP was my suggestion,
:to allow an upper layer to obtain a backing object, and to
:collapse intermediate layers.
:
:The issue is that the NULLFS getpages falls through to the
:vfs_default.c vop_stdgetpages(), which calls the function
:vnode_pager_generic_getpages(), which in turn, calls VOP_BMAP(),
:which in null_vnops.c is vop_eopnotsupp(), so it falls back to
:vnode_pager_input_smlfs(), which VOP_BMAP()'s *again*, but off
:the device.
:
:At which point, you've lost coherency.
:
:-- Terry

It should be calling VOP_BMAP through the VP stored in the VM
object, which will be the underlying file, not the nullfs.

Terry Lambert

Nov 13, 2002, 8:00:24 PM

The Anarcat wrote:
> On Tue Nov 12, 2002 at 11:11:54PM -0800, Terry Lambert wrote:
> > Cameron Grant wrote:
> > > null mounts, in -stable at least, are broken for this purpose. on
> > > connection, sshd revoke()s some device- its pty, i assume, and when this
> > > hits the nullfs layer a null pointer is dereferenced. if i had vfs-clue i'd
> > > have fixed it when i found the panic about two weeks ago. when i overcame
> > > this by putting the jails /dev on an nfs loopback, i managed to produce two
> > > more different panics.
> >
> > 1) Use devfs instead.
>
> On -stable?

Yes.

-- Terry

The Anarcat

Nov 13, 2002, 8:04:45 PM

On Wed Nov 13, 2002 at 05:00:24PM -0800, Terry Lambert wrote:
> The Anarcat wrote:
> > On Tue Nov 12, 2002 at 11:11:54PM -0800, Terry Lambert wrote:
> > > 1) Use devfs instead.
> >
> > On -stable?
>
> Yes.

Wasn't -stable devfs retired some time ago?

A.
--
From the age of uniformity, from the age of solitude, from the age of
Big Brother, from the age of doublethink - greetings!

Terry Lambert

Nov 13, 2002, 6:24:42 PM

Matthew Dillon wrote:
> :> Try using null mounts. The warning is in there because making the
> :> null mount code work is a real hack and the authors aren't entirely
> :> sure that everything's gotten covered. That said, use of a null mount
> :> is certainly a lot safer if the stuff behind the mount is mostly
> :> static.
> :
> :The problem is in the VM object alias code. Specifically, the
> :getpages/putpages have to be implemented in terms of read/write,
> :so that there are not two vm_object_t's that refer to the same
> :data, since there is no "upcall" to notify of changes in a lower
> :layer, and therefore guarantee coherency.
>
> I'm fairly sure the VM issues were fixed when VOP_GETVOBJECT was
> added. A file accessed via a null mount will have the same VM object
> as the file in the original filesystem. I'm not 100% sure about
> that, I wasn't the one who did it, but I seem to recall it being
> discussed.

VOP_GETVOBJECT is a different name, but the VOP was my suggestion,
to allow an upper layer to obtain a backing object, and to
collapse intermediate layers.

The issue is that the NULLFS getpages falls through to the
vfs_default.c vop_stdgetpages(), which calls the function
vnode_pager_generic_getpages(), which in turn, calls VOP_BMAP(),
which in null_vnops.c is vop_eopnotsupp(), so it falls back to
vnode_pager_input_smlfs(), which VOP_BMAP()'s *again*, but off
the device.

At which point, you've lost coherency.

-- Terry

Terry Lambert

Nov 13, 2002, 8:55:45 PM

Pawel Jakub Dawidek wrote:
> On Wed, Nov 13, 2002 at 03:28:22PM -0800, Terry Lambert wrote:
> +> Don't worry about it. It's only a problem for mmap'ed files
> +> which are also read/written. Sheesh.
>
> I have found one little bug in nullfs. I sent it some time ago
> to hackers@, but without any response.

__getcwd(2) doesn't work like you think it works.

It works by looking up things in the directory name cache.

It's perfectly acceptable for it to fail.

This is why you are supposed to use getcwd(3), instead, which
can recover if the system call fails.

Realize that directories do not necessarily have valid parent
pointers hanging around.

By overloading the lookup cache, I can cause your program to
fail on NFS as well. You just aren't waving the right dead
chicken in your test case.

Terry Lambert

Nov 13, 2002, 9:04:02 PM

Matthew Dillon wrote:
> :VOP_GETVOBJECT is a different name, but the VOP was my suggestion,
> :to allow an upper layer to obtain a backing object, and to
> :collapse intermediate layers.
> :
> :The issue is that the NULLFS getpages falls through to the
> :vfs_default.c vop_stdgetpages(), which calls the function
> :vnode_pager_generic_getpages(), which in turn, calls VOP_BMAP(),
> :which in null_vnops.c is vop_eopnotsupp(), so it falls back to
> :vnode_pager_input_smlfs(), which VOP_BMAP()'s *again*, but off
> :the device.
> :
> :At which point, you've lost coherency.
>
> It should be calling VOP_BMAP through the VP stored in the VM
> object, which will be the underlying file, not the nullfs.

Probably, but it's not doing that. The NULLFS implements VOP_BMAP
as vop_eopnotsupp; it doesn't fall through. Even if it did fall
through, the vfs_default.c code is not really written with stacking
in mind, it's written with a local-media FS in mind. VOP_BMAP is
simply not implemented for NULLFS, and is nearly impossible to
implement correctly for a stacking VFS layer in any case, given
the object aliasing problem.

This is a deeply ingrained bug in FreeBSD's implementation of VFS
stacking.

The only safe workaround is to fall back to the read/write of
the buffers, and lose coherency between instances of the FS...
and that's what happens: you get coherency down, if you do
explicit msync's, but lose it back up into the other instances
local copies of the data.

-- Terry

Terry Lambert

Nov 13, 2002, 9:08:49 PM

The Anarcat wrote:
> On Wed Nov 13, 2002 at 05:00:24PM -0800, Terry Lambert wrote:
> > The Anarcat wrote:
> > > On Tue Nov 12, 2002 at 11:11:54PM -0800, Terry Lambert wrote:
> > > > 1) Use devfs instead.
> > >
> > > On -stable?
> >
> > Yes.
>
> Wasn't -stable devfs retired some time ago?

No. You are thinking of Julian's devfs, which PHK replaced with
PHK's devfs.

Matthew Dillon

Nov 13, 2002, 9:55:37 PM

:>
:> It should be calling VOP_BMAP through the VP stored in the VM
:> object, which will be the underlying file, not the nullfs.
:
:Probably, but it's not doing that. The NULLFS implements VOP_BMAP
:as vop_eopnotsupp; it doesn't fall through. Even if it did fall
:through, the vfs_default.c code is not really written with stacking
:in mind, it's written with a local-media FS in mind. VOP_BMAP is
:simply not implemented for NULLFS, and is nearly impossible to
:implement correctly for a stacking VFS layer in any case, given
:the object aliasing problem.
:
:This is a deeply ingrained bug in FreeBSD's implementation of VFS
:stacking.

I don't think it's doing that. As far as I can tell it is
calling VOP_GETPAGES, which will hit nullfs, and then nullfs should
simply call the underlying vnode's VOP_GETPAGES via the null_bypass()
function.

-Matt

Terry Lambert

Nov 14, 2002, 2:42:14 PM

Matthew Dillon wrote:
> So this patch is a hack. It returns special devices directly whenever
> possible but must still synthesize temporary vnodes for them for
> RENAME and DELETE operations. But short of rewriting a big chunk of
> the device tracking infrastructure there is no other solution.

If you are going to do that, why not just add:

static vop_t **nullfs_specop_p;
static struct vnodeopv_entry_desc nullfs_specop_entries[] = {
    ...
};
static struct vnodeopv_desc nullfs_specop_opv_desc =
    { &nullfs_specop_p, nullfs_specop_entries };
VNODEOP_SET(nullfs_specop_opv_desc);

???

That way the devices get exported directly (still), but the rename,
delete, and other code can be left alone.

It's really ugly to think of a "nullfs" doing this, though, so
I guess it's sixes on which approach is used. Told you it was
crufty. 8-(.

-- Terry

Matthew Dillon

Nov 14, 2002, 5:11:37 PM

:Matthew Dillon wrote:
:> So this patch is a hack. It returns special devices directly whenever
:> possible but must still synthesize temporary vnodes for them for
:> RENAME and DELETE operations. But short of rewriting a big chunk of
:> the device tracking infrastructure there is no other solution.
:
:If you are going to do that, why not just add:
:
:static vop_t **nullfs_specop_p;
:static struct vnodeopv_entry_desc nullfs_specop_entries[] = {
:    ...
:};
:static struct vnodeopv_desc nullfs_specop_opv_desc =
:    { &nullfs_specop_p, nullfs_specop_entries };
:VNODEOP_SET(nullfs_specop_opv_desc);
:
:???
:
:That way the devices get exported directly (still), but the rename,
:delete, and other code can be left alone.
:
:It's really ugly to think of a "nullfs" doing this, though, so
:I guess it's sixes on which approach is used. Told you it was
:crufty. 8-(.
:
:-- Terry

Hmm. That might just work since unionfs (with the patch) doesn't
try to cache non-regular vnodes (and nullfs doesn't try to
cache anything). It would allow us to call addalias() and track
v_rdev (though there might be a problem with sequencing, since
calling addalias while still holding lowervp or uppervp would temporarily
bump the count above 1 and possibly confuse the device driver into
believing that the device has been opened when it may not have been).
Unionfs and nullfs would still have to be aware of all the special
vnode types.

-Matt
Matthew Dillon
<dil...@backplane.com>

Matthew Dillon

Nov 14, 2002, 2:29:26 PM

Cameron and I have been working through some of the more blatant bugs.

Here is an intermediate patch for -stable, for both unionfs and nullfs.
There are still plenty of bugs left but this patch should fix the
major issues with devices.

Basically what is going on is that special vnode types like VBLK and VCHR
are also assumed to have special fields filled in which nullfs and unionfs
do not fill in when they synthesize a vnode. Unfortunately, some of
these fields *CAN'T* be filled in. For example, take a VCHR vnode.
The system expects v_rdev to be filled in. v_rdev cannot be safely
filled in without aliasing the device. The device cannot be safely
aliased because the system makes major assumptions in regards to the
alias/vnode-ref counts in order to determine whether a device close
or a revoke() can be done. If we alias the device, everything breaks.
I spent four hours trying to alias the device and couldn't get it to
work. Either it caused specfs to call d_close without the device first
being opened, or it caused revoke() to fail, or it thought the device
was opened multiple times when it wasn't, or it thought the device
was opened when it wasn't (that was why sshd hung, because the child
process closed the tty side of the pty and the pty side still thought
the tty side was open because the vnode was being cached by nullfs
or unionfs).

In short, FreeBSD's device tracking code needs to be seriously
rewritten. FreeBSD cannot distinguish between vnodes which have
d_open()'d (VOP_OPEN()'d) the device and vnodes which have not.

So this patch is a hack. It returns special devices directly whenever
possible but must still synthesize temporary vnodes for them for
RENAME and DELETE operations. But short of rewriting a big chunk of
the device tracking infrastructure there is no other solution.

-Matt

Index: kern/vfs_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.249.2.29
diff -u -r1.249.2.29 vfs_subr.c
--- kern/vfs_subr.c 13 Oct 2002 16:19:12 -0000 1.249.2.29
+++ kern/vfs_subr.c 14 Nov 2002 18:01:43 -0000
@@ -2115,10 +2115,12 @@
int count;

count = 0;
- simple_lock(&spechash_slock);
- SLIST_FOREACH(vq, &vp->v_hashchain, v_specnext)
- count += vq->v_usecount;
- simple_unlock(&spechash_slock);
+ if (vp->v_rdev) {
+ simple_lock(&spechash_slock);
+ SLIST_FOREACH(vq, &vp->v_hashchain, v_specnext)
+ count += vq->v_usecount;
+ simple_unlock(&spechash_slock);
+ }
return (count);
}

Index: miscfs/nullfs/null_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/nullfs/Attic/null_subr.c,v
retrieving revision 1.21.2.4
diff -u -r1.21.2.4 null_subr.c
--- miscfs/nullfs/null_subr.c 26 Jun 2001 04:20:09 -0000 1.21.2.4
+++ miscfs/nullfs/null_subr.c 14 Nov 2002 17:55:09 -0000
@@ -181,6 +181,7 @@
xp->null_vnode = vp;
vp->v_data = xp;
xp->null_lowervp = lowervp;
+
/*
* Before we insert our new node onto the hash chains,
* check to see if someone else has beaten us to it.
Index: miscfs/nullfs/null_vfsops.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/nullfs/Attic/null_vfsops.c,v
retrieving revision 1.35.2.3
diff -u -r1.35.2.3 null_vfsops.c
--- miscfs/nullfs/null_vfsops.c 26 Jul 2001 20:37:11 -0000 1.35.2.3
+++ miscfs/nullfs/null_vfsops.c 14 Nov 2002 17:55:09 -0000
@@ -246,6 +246,7 @@
*/
mntdata = mp->mnt_data;
mp->mnt_data = 0;
+ mp->mnt_flag &= ~MNT_LOCAL;
free(mntdata, M_NULLFSMNT);
return 0;
}
Index: miscfs/nullfs/null_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/nullfs/Attic/null_vnops.c,v
retrieving revision 1.38.2.6
diff -u -r1.38.2.6 null_vnops.c
--- miscfs/nullfs/null_vnops.c 31 Jul 2002 00:32:28 -0000 1.38.2.6
+++ miscfs/nullfs/null_vnops.c 14 Nov 2002 18:38:28 -0000
@@ -194,6 +194,7 @@
static int null_destroyvobject(struct vop_destroyvobject_args *ap);
static int null_getattr(struct vop_getattr_args *ap);
static int null_getvobject(struct vop_getvobject_args *ap);
+static int null_revoke(struct vop_revoke_args *ap);
static int null_inactive(struct vop_inactive_args *ap);
static int null_islocked(struct vop_islocked_args *ap);
static int null_lock(struct vop_lock_args *ap);
@@ -388,14 +389,39 @@
if (cnp->cn_flags & PDIRUNLOCK)
VOP_UNLOCK(dvp, LK_THISLAYER, p);
if ((error == 0 || error == EJUSTRETURN) && lvp != NULL) {
+ /*
+ * Return an appropriately synthesized node. Special
+ * file types (e.g. VBLK, VCHR, and others) are a real
+ * problem because the system makes assumptions about
+ * special fields in the vnode which we cannot safely
+ * duplicate. Unfortunately we have to synthesize nodes if
+ * we are going to do a deletion or rename to avoid
+ * confusing the bypass code.
+ *
+ * VCHR and VBLK are particularly difficult, because the
+ * rest of the system makes some bad assumptions on whether
+ * to close a device or whether the device is 'opened' multiple
+ * times simply based on the number of vnodes aliased to it
+ * and their ref counts.
+ */
+ int can_synthesize = 0;
+
+ if (cnp->cn_nameiop != LOOKUP && cnp->cn_nameiop != CREATE) {
+ can_synthesize = 1;
+ } else if (lvp->v_type == VDIR || lvp->v_type == VREG ||
+ lvp->v_type == VLNK) {
+ can_synthesize = 1;
+ }
if (ldvp == lvp) {
*ap->a_vpp = dvp;
VREF(dvp);
vrele(lvp);
- } else {
+ } else if (can_synthesize) {
error = null_node_create(dvp->v_mount, lvp, &vp);
if (error == 0)
*ap->a_vpp = vp;
+ } else {
+ *ap->a_vpp = lvp;
}
}
return (error);
@@ -726,6 +752,7 @@
VOP_UNLOCK(vp, LK_THISLAYER, p);

vput(lowervp);
+
/*
* Now it is safe to drop references to the lower vnode.
* VOP_INACTIVE() will be called by vrele() if necessary.
@@ -829,11 +856,31 @@
}

/*
+ * Revoke - just vgone the node. Device nodes are passed to the
+ * caller layer directly.
+ */
+static int
+null_revoke(ap)
+ struct vop_revoke_args /* {
+ struct vnode *a_vp;
+ int a_flags;
+ } */ *ap;
+{
+ struct vnode *lvp = NULLVPTOLOWERVP(ap->a_vp);
+
+ if (lvp == NULL)
+ return EINVAL;
+ vgone(ap->a_vp);
+ return (0);
+}
+
+/*
* Global vfs data structures
*/
vop_t **null_vnodeop_p;
static struct vnodeopv_entry_desc null_vnodeop_entries[] = {
{ &vop_default_desc, (vop_t *) null_bypass },
+ { &vop_revoke_desc, (vop_t *) null_revoke },
{ &vop_access_desc, (vop_t *) null_access },
{ &vop_createvobject_desc, (vop_t *) null_createvobject },
{ &vop_destroyvobject_desc, (vop_t *) null_destroyvobject },
Index: miscfs/union/union_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/union/Attic/union_subr.c,v
retrieving revision 1.43.2.2
diff -u -r1.43.2.2 union_subr.c
--- miscfs/union/union_subr.c 25 Dec 2001 01:44:45 -0000 1.43.2.2
+++ miscfs/union/union_subr.c 14 Nov 2002 19:02:30 -0000
@@ -369,6 +369,52 @@
vflag = VROOT;
}

+ /*
+ * We have to synthesize special nodes under certain circumstances,
+ * namely when a DELETE or RENAME is to be performed. But for anything
+ * that will open the vnode (LOOKUP, CREATE), we cannot safely return
+ * a synthesized vnode and must instead return the actual vnode.
+ * This is because the system makes assumptions not only about
+ * special fields in the vnode when non-normal vnodes are returned,
+ * but also makes assumptions based on the ref count in special vnodes.
+ * (see revoke() and the miscfs/specfs code for examples).
+ *
+ * (The docache flag is ignored in the direct case).
+ */
+ if (cnp && (cnp->cn_nameiop == LOOKUP || cnp->cn_nameiop == CREATE)) {
+ if (uppervp && uppervp->v_type != VREG &&
+ uppervp->v_type != VDIR && uppervp->v_type != VLNK) {
+ *vpp = uppervp;
+ if (upperdvp)
+ vrele(upperdvp);
+ if (lowervp)
+ vrele(lowervp);
+ vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY, p);
+ return(0);
+ } else if (lowervp && lowervp->v_type != VREG &&
+ lowervp->v_type != VDIR && lowervp->v_type != VLNK) {
+ *vpp = lowervp;
+ if (upperdvp)
+ vrele(upperdvp);
+ if (uppervp)
+ vrele(uppervp);
+ vn_lock(*vpp, LK_EXCLUSIVE | LK_RETRY, p);
+ return(0);
+ }
+ }
+
+ /*
+ * Do not cache special situations
+ */
+ if (uppervp && uppervp->v_type != VREG &&
+ uppervp->v_type != VDIR && uppervp->v_type != VLNK) {
+ docache = 0;
+ }
+ if (lowervp && lowervp->v_type != VREG &&
+ lowervp->v_type != VDIR && lowervp->v_type != VLNK) {
+ docache = 0;
+ }
+
loop:
if (!docache) {
un = 0;
@@ -538,7 +584,6 @@
/*
* Create new node rather then replace old node
*/
-
error = getnewvnode(VT_UNION, mp, union_vnodeop_p, vpp);
if (error) {
/*
Index: miscfs/union/union_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/union/Attic/union_vnops.c,v
retrieving revision 1.72
diff -u -r1.72 union_vnops.c
--- miscfs/union/union_vnops.c 15 Dec 1999 23:02:14 -0000 1.72
+++ miscfs/union/union_vnops.c 14 Nov 2002 18:02:50 -0000
@@ -98,6 +98,7 @@
static int union_revoke __P((struct vop_revoke_args *ap));
static int union_rmdir __P((struct vop_rmdir_args *ap));
static int union_poll __P((struct vop_poll_args *ap));
+static int union_kqfilter __P((struct vop_kqfilter_args *ap));
static int union_setattr __P((struct vop_setattr_args *ap));
static int union_strategy __P((struct vop_strategy_args *ap));
static int union_getpages __P((struct vop_getpages_args *ap));
@@ -1189,6 +1190,26 @@
}

static int
+union_kqfilter(ap)
+ struct vop_kqfilter_args /* {
+ struct vnode *a_vp;
+ ...
+ } */ *ap;
+{
+ struct vnode *ovp = OTHERVP(ap->a_vp);
+
+ ap->a_vp = ovp;
+ return (VCALL(ovp, VOFFSET(vop_kqfilter), ap));
+}
+
+/*
+ * Revoke access
+ *
+ * Note that if this is a device node, the lower or upper vp is already
+ * on the vnode alias list for the device and revoke will be called on it,
+ * so a duplicate call here would panic the box.
+ */
+static int
union_revoke(ap)
struct vop_revoke_args /* {
struct vnode *a_vp;
@@ -1198,9 +1219,9 @@
{
struct vnode *vp = ap->a_vp;

- if (UPPERVP(vp))
+ if (UPPERVP(vp) && vcount(UPPERVP(vp)) > 1)
VOP_REVOKE(UPPERVP(vp), ap->a_flags);
- if (LOWERVP(vp))
+ if (LOWERVP(vp) && vcount(LOWERVP(vp)) > 1)
VOP_REVOKE(LOWERVP(vp), ap->a_flags);
vgone(vp);
return (0);
@@ -1958,6 +1979,7 @@
{ &vop_open_desc, (vop_t *) union_open },
{ &vop_pathconf_desc, (vop_t *) union_pathconf },
{ &vop_poll_desc, (vop_t *) union_poll },
+ { &vop_kqfilter_desc, (vop_t *) union_kqfilter },
{ &vop_print_desc, (vop_t *) union_print },
{ &vop_read_desc, (vop_t *) union_read },
{ &vop_readdir_desc, (vop_t *) union_readdir },