
Vnode API change: add global vnode cache


J. Hannken-Illjes

Apr 6, 2014, 6:14:44 AM
to tech-kern@netbsd.org list
Currently all file systems have to implement their own cache of
vnode / fs node pairs. Most file systems use a copy and pasted
version of ufs_ihash.

So add a global vnode cache with lookup and remove:


/*
* Lookup a vnode / fs node pair by key and return it referenced through vpp.
*/
int
vcache_lookup(struct mount *mp, void *key, size_t key_len, struct vnode **vpp)

/*
* Remove a vnode / fs node pair from the cache.
*/
void
vcache_remove(struct mount *mp, void *key, size_t key_len)


On cache miss vcache_lookup() will call a new vfs operation:


/*
* Read an inode from disk and initialize this vnode / inode pair.
* Caller assures no other thread will try to load this inode.
*/
int
vfs_load_node(struct mount *mp, struct vnode *vp,
const void *key, size_t key_len, void **new_key)


to load and initialize this vnode / fs node pair. The vnode cache
guarantees this call to be exclusive, no other thread will try to
load this vnode / fs node pair. The vnode will not appear on any
list before this operation completes.
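
For illustration only, here is a minimal sketch (not taken from the diff; examplefs_vget() and the locking policy are assumptions of this sketch) of how a file system's lookup-by-inode-number could sit on top of this API:

/*
 * Hypothetical sketch, not from the diff: a file system's by-inode
 * lookup built on the proposed cache.  vcache_lookup() either finds
 * the existing vnode or allocates one and calls the new vfs operation
 * to load it; either way *vpp comes back referenced but unlocked.
 */
int
examplefs_vget(struct mount *mp, ino_t ino, struct vnode **vpp)
{
	int error;

	/* The inode number is the cache key. */
	error = vcache_lookup(mp, &ino, sizeof(ino), vpp);
	if (error)
		return error;

	/* Lock it only if the caller expects a locked vnode. */
	error = vn_lock(*vpp, LK_EXCLUSIVE);
	if (error) {
		vrele(*vpp);
		*vpp = NULL;
	}
	return error;
}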

Diff implementing this for file systems sharing ufs_ihash and replacing
VFS_VGET()/VOP_UNLOCK() sequences with vcache_lookup() is here:

http://www.netbsd.org/~hannken/vnode-pass6-1.diff

Comments or objections anyone?

--
J. Hannken-Illjes - han...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

David Holland

Apr 6, 2014, 3:12:09 PM
to J. Hannken-Illjes, tech-kern@netbsd.org list
On Sun, Apr 06, 2014 at 12:14:24PM +0200, J. Hannken-Illjes wrote:
> Currently all file systems have to implement their own cache of
> vnode / fs node pairs. Most file systems use a copy and pasted
> version of ufs_ihash.
>
> So add a global vnode cache with lookup and remove:

Heh, would you like the rest of my todo list? :-)

one surface comment:

> /*
> * Read an inode from disk and initialize this vnode / inode pair.
> * Caller assures no other thread will try to load this inode.
> */
> int
> vfs_load_node(struct mount *mp, struct vnode *vp,
> const void *key, size_t key_len, void **new_key)

Can we call this vn_load, or vncache_load, or vncache_enter, or
something of the sort that's more consistent with the rest of the
extant naming?

I will try to look at the diff later...

--
David A. Holland
dhol...@netbsd.org

J. Hannken-Illjes

Apr 6, 2014, 3:50:53 PM
to David Holland, tech-kern@netbsd.org list
Maybe it's not so clear without reading the diff: this is a vfs operation:

@@ -221,4 +221,6 @@ struct vfsops {
int (*vfs_sync) (struct mount *, int, struct kauth_cred *);
int (*vfs_vget) (struct mount *, ino_t, struct vnode **);
+ int (*vfs_load_node) (struct mount *, struct vnode *,
+ const void *, size_t, void **);
int (*vfs_fhtovp) (struct mount *, struct fid *,
struct vnode **);

> I will try to look at the diff later...

Please don't wait too long ...

Taylor R Campbell

Apr 6, 2014, 9:22:44 PM
to J. Hannken-Illjes, tech...@netbsd.org
Date: Sun, 6 Apr 2014 12:14:24 +0200
From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>

Currently all file systems have to implement their own cache of
vnode / fs node pairs. Most file systems use a copy and pasted
version of ufs_ihash.

So add a global vnode cache with lookup and remove:
[...]
http://www.netbsd.org/~hannken/vnode-pass6-1.diff

Comments or objections anyone?

Generally looks pretty good, and largely better than the draft I threw
together last month to do more or less the same thing.

There are a number of file systems that have copypasta of ufs_ihash;
adapting these too to vcache should be a piece of cake. Have you
surveyed the more different file systems -- e.g., nfs, coda, tmpfs,
puffs -- to see whether this abstraction will make sense there too?

Some specific comments:

int
vcache_lookup(struct mount *mp, void *key, size_t key_len, struct vnode **vpp)

Call it vcache_intern rather than vcache_lookup, since it inserts if
not already there?

Why `void *key' and not `const void *key'?

void
vcache_remove(struct mount *mp, void *key, size_t key_len)

Same about `void *key'.

int
vfs_load_node(struct mount *mp, struct vnode *vp,
const void *key, size_t key_len, void **new_key)

Same about `void **new_key'.

+static kmutex_t vcache_lock __cacheline_aligned;
+static kcondvar_t vcache_cv __cacheline_aligned;
+static rb_tree_t vcache_rb_tree __cacheline_aligned;

Do these all need to be cacheline-aligned separately? Is vcache_lock
not enough, since the others will be used only while vcache_lock is
held? Or, how about this?

struct {
kmutex_t lock;
kcondvar_t cv;
rb_tree_t tree;
} vcache __cacheline_aligned;

+int
+vcache_lookup(struct mount *mp, void *key, size_t key_len, struct vnode **vpp)
+{
...
+#ifdef DIAGNOSTICS

DIAGNOSTIC, not DIAGNOSTICS.

+ new_key = NULL;
+ *vpp = NULL;
+#endif

Why not just make these unconditional, or remove the unconditional
assertions that rely on them? We generally ought to avoid assertions
that are true only if DIAGNOSTIC -- it makes code harder to reason
about. If I see KASSERT(*vpp != NULL), I am going to assume that *vpp
is nonnull, whether DIAGNOSTIC is set or not.

+ if (node != NULL /* && node->vn_node == NULL */) {

Turn the commented fragment into a kassert.

+ new_node = kmem_alloc(sizeof(*new_node), KM_SLEEP);

Should use pool_cache(9) rather than kmem(9) for fixed allocation.

+ vp = vnalloc(NULL);
+ vp->v_usecount = 1;
+ vp->v_type = VNON;
+ vp->v_size = vp->v_writesize = VSIZENOTSET;

Is the plan to nix getnewvnode and ungetnewvnode? It would be good to
avoid long-term duplication of its body. But I suspect we'll want to
keep those for, e.g., tmpfs, in which case this should use getnewvnode
and ungetnewvnode instead of vnalloc/setup/teardown/vnfree.

+ /* Load the fs node. Exclusive as new_node->vn_vnode is NULL. */
+ error = VFS_LOAD_NODE(mp, vp, key, key_len, &new_key);
+ if (error == 0) {

Switch the sense of this branch so the main-line code is success and
the indented (and shorter) alternative is failure.

/*
+ * Look up and return a vnode/inode pair by inode number.
+ */
+int
+ufs_vget(struct mount *mp, ino_t ino, struct vnode **vpp)
+{

Does anything use VFS_VGET in ufs now? If so, why? If not, this
should just fail. (Eventually we ought to nix VFS_VGET, I think.)

The one call in ufs_rename.c should be easy to fix -- just replace

VOP_UNLOCK(vp);
error = VFS_VGET(mp, dotdot_ino, &dvp);
vrele(vp);

by

error = vcache_intern(mp, &dotdot_ino, sizeof(dotdot_ino), &dvp);
vput(vp);

J. Hannken-Illjes

Apr 7, 2014, 9:51:25 AM
to Taylor R Campbell, tech...@netbsd.org
On 07 Apr 2014, at 03:22, Taylor R Campbell <campbell+net...@mumble.net> wrote:

> Date: Sun, 6 Apr 2014 12:14:24 +0200
> From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>
>
> Currently all file systems have to implement their own cache of
> vnode / fs node pairs. Most file systems use a copy and pasted
> version of ufs_ihash.
>
> So add a global vnode cache with lookup and remove:
> [...]
> http://www.netbsd.org/~hannken/vnode-pass6-1.diff
>
> Comments or objections anyone?
>
> Generally looks pretty good, and largely better than the draft I threw
> together last month to do more or less the same thing.
>
> There are a number of file systems that have copypasta of ufs_ihash;
> adapting these too to vcache should be a piece of cake. Have you
> surveyed the more different file systems -- e.g., nfs, coda, tmpfs,
> puffs -- to see whether this abstraction will make sense there too?
>
> Some specific comments:
>
> int
> vcache_lookup(struct mount *mp, void *key, size_t key_len, struct vnode **vpp)
>
> Call it vcache_intern rather than vcache_lookup, since it inserts if
> not already there?

Ok.

> Why `void *key' and not `const void *key'?
>
> void
> vcache_remove(struct mount *mp, void *key, size_t key_len)
>
> Same about `void *key'.
>
> int
> vfs_load_node(struct mount *mp, struct vnode *vp,
> const void *key, size_t key_len, void **new_key)
>
> Same about `void **new_key'.

Ok - all const now.

> +static kmutex_t vcache_lock __cacheline_aligned;
> +static kcondvar_t vcache_cv __cacheline_aligned;
> +static rb_tree_t vcache_rb_tree __cacheline_aligned;
>
> Do these all need to be cacheline-aligned separately? Is vcache_lock
> not enough, since the others will be used only while vcache_lock is
> held? Or, how about this?
>
> struct {
> kmutex_t lock;
> kcondvar_t cv;
> rb_tree_t tree;
> } vcache __cacheline_aligned;

Ok - using a struct looks good.

> +int
> +vcache_lookup(struct mount *mp, void *key, size_t key_len, struct vnode **vpp)
> +{
> ...
> +#ifdef DIAGNOSTICS
>
> DIAGNOSTIC, not DIAGNOSTICS.
>
> + new_key = NULL;
> + *vpp = NULL;
> +#endif
>
> Why not just make these unconditional, or remove the unconditional
> assertions that rely on them? We generally ought to avoid assertions
> that are true only if DIAGNOSTIC -- it makes code harder to reason
> about. If I see KASSERT(*vpp != NULL), I am going to assume that *vpp
> is nonnull, whether DIAGNOSTIC is set or not.

Removed the #ifdef and always clear new_key and *vpp on top.

> + if (node != NULL /* && node->vn_node == NULL */) {
>
> Turn the commented fragment into a kassert.

Ok.

> + new_node = kmem_alloc(sizeof(*new_node), KM_SLEEP);
>
> Should use pool_cache(9) rather than kmem(9) for fixed allocation.

Ok.

> + vp = vnalloc(NULL);
> + vp->v_usecount = 1;
> + vp->v_type = VNON;
> + vp->v_size = vp->v_writesize = VSIZENOTSET;
>
> Is the plan to nix getnewvnode and ungetnewvnode? It would be good to
> avoid long-term duplication of its body. But I suspect we'll want to
> keep those for, e.g., tmpfs, in which case this should use getnewvnode
> and ungetnewvnode instead of vnalloc/setup/teardown/vnfree.

Not sure if we have to keep getnewvnode() in the long term.

It cannot be used here as we don't want the partially initialised
vnode (before VFS_LOAD_NODE) on the mnt_vnodelist.

If we end up replacing getnewvnode() with a two-stage allocator that
initialises in the first stage and finalises in the second stage we
may use it here.
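
Purely to illustrate the idea, a hypothetical shape such a two-stage allocator could take (these names exist nowhere; they are assumptions of this sketch):

/*
 * Hypothetical two-stage allocator -- a sketch only, nothing like
 * this exists in the tree or in the diff.
 */
struct vnode *getnewvnode_prepare(struct mount *mp); /* allocate; not on any list yet */
void getnewvnode_finalize(struct vnode *vp);  /* fs node loaded: publish on mnt_vnodelist */
void getnewvnode_abort(struct vnode *vp);     /* load failed: destroy quietly */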

> + /* Load the fs node. Exclusive as new_node->vn_vnode is NULL. */
> + error = VFS_LOAD_NODE(mp, vp, key, key_len, &new_key);
> + if (error == 0) {
>
> Switch the sense of this branch so the main-line code is success and
> the indented (and shorter) alternative is failure.

Ok.

> /*
> + * Look up and return a vnode/inode pair by inode number.
> + */
> +int
> +ufs_vget(struct mount *mp, ino_t ino, struct vnode **vpp)
> +{
>
> Does anything use VFS_VGET in ufs now? If so, why? If not, this
> should just fail. (Eventually we ought to nix VFS_VGET, I think.)

VFS_VGET() is used by NFS V3 readdirplus so we have to keep it.

> The one call in ufs_rename.c should be easy to fix -- just replace
>
> VOP_UNLOCK(vp);
> error = VFS_VGET(mp, dotdot_ino, &dvp);
> vrele(vp);
>
> by
>
> error = vcache_intern(mp, &dotdot_ino, sizeof(dotdot_ino), &dvp);
> vput(vp);

This is wrong, did you mean

error = vcache_intern(mp, &dotdot_ino, sizeof(dotdot_ino), &dvp);
vput(vp);
if (error)
return error;
error = vn_lock(dvp, LK_EXCLUSIVE);
if (error)
return error;

Anyway, not now.

New diff at http://www.netbsd.org/~hannken/vnode-pass6-2.diff

Mindaugas Rasiukevicius

Apr 7, 2014, 11:45:01 AM
to J. Hannken-Illjes, tech...@netbsd.org
"J. Hannken-Illjes" <han...@eis.cs.tu-bs.de> wrote:
> Currently all file systems have to implement their own cache of
> vnode / fs node pairs. Most file systems use a copy and pasted
> version of ufs_ihash.
>
> So add a global vnode cache with lookup and remove:
>
> <...>
>
> Diff implementing this for file systems sharing ufs_ihash and replacing
> VFS_VGET()/VOP_UNLOCK() sequences with vcache_lookup() is here:
>
> http://www.netbsd.org/~hannken/vnode-pass6-1.diff
>
> Comments or objections anyone?

I like this step a lot!

One concern though: can you benchmark ./build.sh on a machine with 8 or
more CPUs? I fear that vcache.lock might be a bit contended and we ought
to figure out how much. If the contention is not significant, it should
be fine for now. In the longer term, it should be moved to a different,
preferably lockless, data structure (something I am working on). For this,
the global vcache.cv should be replaced with either per-node cv or perhaps
just a kpause, so we would be prepared now rather than later.

--
Mindaugas

J. Hannken-Illjes

Apr 7, 2014, 11:51:05 AM
to Mindaugas Rasiukevicius, tech...@netbsd.org
On 07 Apr 2014, at 17:44, Mindaugas Rasiukevicius <rm...@netbsd.org> wrote:

> "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de> wrote:
>> Currently all file systems have to implement their own cache of
>> vnode / fs node pairs. Most file systems use a copy and pasted
>> version of ufs_ihash.
>>
>> So add a global vnode cache with lookup and remove:
>>
>> <...>
>>
>> Diff implementing this for file systems sharing ufs_ihash and replacing
>> VFS_VGET()/VOP_UNLOCK() sequences with vcache_lookup() is here:
>>
>> http://www.netbsd.org/~hannken/vnode-pass6-1.diff
>>
>> Comments or objections anyone?
>
> I like this step a lot!
>
> One concern though: can you benchmark ./build.sh on a machine with 8 or
> more CPUs? I fear that vcache.lock might be a bit contended and we ought
> to figure out how much.

I don't have such a setup. Contention should be the same as before as
ufs_ihash did exactly the same.

> If the contention is not significant, it should
> be fine for now. In the longer term, it should be moved to a different,
> preferably lockless, data structure (something I am working on). For this,
> the global vcache.cv should be replaced with either per-node cv or perhaps
> just a kpause, so we would be prepared now rather than later.

Let's change things when we need it. Not a big job as it is no longer
copy'n'pasted code.

Mindaugas Rasiukevicius

Apr 7, 2014, 11:57:10 AM
to Taylor R Campbell, J. Hannken-Illjes, tech...@netbsd.org
Taylor R Campbell <campbell+net...@mumble.net> wrote:
>
> Some specific comments:
>
> int
> vcache_lookup(struct mount *mp, void *key, size_t key_len, struct
> vnode **vpp)
>
> Call it vcache_intern rather than vcache_lookup, since it inserts if
> not already there?
>
> <...>

What is "intern"? Not intuitive. What is wrong with just vcache_get(),
or perhaps vcache_conget() for constructing-get? I would also prefer to
wrap the long line by using "vnode_t **vpp".

--
Mindaugas

Taylor R Campbell

Apr 7, 2014, 12:02:36 PM
to J. Hannken-Illjes, tech...@netbsd.org
Date: Mon, 7 Apr 2014 15:51:00 +0200
From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>

On 07 Apr 2014, at 03:22, Taylor R Campbell <campbell+net...@mumble.net> wrote:

> Is the plan to nix getnewvnode and ungetnewvnode? It would be good to
> avoid long-term duplication of its body. But I suspect we'll want to
> keep those for, e.g., tmpfs, in which case this should use getnewvnode
> and ungetnewvnode instead of vnalloc/setup/teardown/vnfree.

Not sure if we have to keep getnewvnode() in the long term.

It cannot be used here as we don't want the partially initialised
vnode (before VFS_LOAD_NODE) on the mnt_vnodelist.

If we end up replacing getnewvnode() with a two-stage allocator that
initialises in the first stage and finalises in the second stage we
may use it here.

In that case, could you set the VI_CHANGING bit in vp, rather than
node->vn_vnode = NULL in the vcache node, to indicate that the vnode
is being initialized in vcache_intern? That way,

(a) it is safe to put the vnode on the mount list -- anyone trying to
use it from there must use vget first --;

(b) there is only one place where waits happen, namely vget (vwait), so no
need for vcache.cv or some of the complex loop logic in vcache_intern;

(c) tmpfs can mirror the vcache_intern logic without more duplication
of code from getnewvnode; and

(d) as a tiny added bonus, omitting vcache.cv will make the vcache
struct fit in a single cache line on, e.g., amd64.

Generally, we should have only one mechanism for weak references to
vnodes, and currently that's vget. If we do want to take dholland's
idea of a separate structure for weak references to vnodes, then the
vnode mount list and anything else should use that too. But I think
vget is good enough for now.

Also, it occurs to me that we're going to have not two but multiple
copies of initialization code that is currently in getnewvnode,
because in your API, some is copied in vcache_intern and some must be
copied in VFS_LOAD_NODE.

So I'm inclined to say that vcache_intern should itself look a little
more like getnewvnode, and perhaps VFS_LOAD_NODE should be VOP_LOAD
instead. See the attached files for a sketch of this, and a tmpfs version.

Except for the VI_CHANGING part, the vcache_intern I attached uses
nothing internal to the vnode abstraction. To address VI_CHANGING, we
could make getnewvnode set it by default and require callers to use
some new routine, say vready, to clear it. Then vcache_intern, or
clones of it like tmpfs will want, can live entirely outside the vnode
API.
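
A rough sketch of the calling pattern this would give (vready(), example_loadnode() and example_vnodeop_p are inventions of this illustration, and the getnewvnode() argument list is only approximate):

/*
 * Hypothetical calling pattern for the proposed vready(); everything
 * named example_* here is invented for the sketch.
 */
error = getnewvnode(VT_NON, mp, example_vnodeop_p, NULL, &vp);
if (error)
	return error;
/* vp may already be visible (e.g. on the mount list), but vget() on
   it blocks until vready() or ungetnewvnode(). */

error = example_loadnode(mp, vp, key);	/* fs-specific initialisation */
if (error) {
	ungetnewvnode(vp);		/* give it back; waiters are woken */
	return error;
}

vready(vp);				/* clear VI_CHANGING, wake waiters */
*vpp = vp;
return 0;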

> Does anything use VFS_VGET in ufs now? If so, why? If not, this
> should just fail. (Eventually we ought to nix VFS_VGET, I think.)

VFS_VGET() is used by NFS V3 readdirplus so we have to keep it.

Ugh bletch. OK for now, but it would be nice if there were a better
way to do this.

Perhaps we ought at least to change VFS_VGET so it returns the vnode
unlocked, and once we do that, use it consistently to look up vnodes
by inode number rather than calling vcache_intern in multiple
different places.

This is wrong, did you mean

error = vcache_intern(mp, &dotdot_ino, sizeof(dotdot_ino), &dvp);
vput(vp);
if (error)
return error;
error = vn_lock(dvp, LK_EXCLUSIVE);
if (error)
return error;

Oops, yes.

Anyway, not now.

OK.
vcache_intern.c
tmpfs_vnode_get.c

Mindaugas Rasiukevicius

Apr 7, 2014, 12:04:38 PM
to J. Hannken-Illjes, tech...@netbsd.org
"J. Hannken-Illjes" <han...@eis.cs.tu-bs.de> wrote:
> On 07 Apr 2014, at 17:44, Mindaugas Rasiukevicius <rm...@netbsd.org>
> wrote:
>
> > "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de> wrote:
> >> Currently all file systems have to implement their own cache of
> >> vnode / fs node pairs. Most file systems use a copy and pasted
> >> version of ufs_ihash.
> >>
> >> So add a global vnode cache with lookup and remove:
> >>
> >> <...>
> >>
> >> Diff implementing this for file systems sharing ufs_ihash and replacing
> >> VFS_VGET()/VOP_UNLOCK() sequences with vcache_lookup() is here:
> >>
> >> http://www.netbsd.org/~hannken/vnode-pass6-1.diff
> >>
> >> Comments or objections anyone?
> >
> > I like this step a lot!
> >
> > One concern though: can you benchmark ./build.sh on a machine with 8 or
> > more CPUs? I fear that vcache.lock might be a bit contended and we
> > ought to figure out how much.
>
> I don't have such a setup. Contention should be the same as before as
> ufs_ihash did exactly the same.

Fair enough. A small side note: the lock now covers all file systems,
not only UFS.

--
Mindaugas

Taylor R Campbell

Apr 7, 2014, 12:04:58 PM
to Mindaugas Rasiukevicius, J. Hannken-Illjes, tech...@netbsd.org
Date: Mon, 7 Apr 2014 16:44:47 +0100
From: Mindaugas Rasiukevicius <rm...@netbsd.org>

One concern though: can you benchmark ./build.sh on a machine with 8 or
more CPUs? I fear that vcache.lock might be a bit contended and we ought
to figure out how much. If the contention is not significant, it should
be fine for now.

I can do this if you really want, but I doubt it'll make a difference
unless your src and obj trees are spread across multiple different
types of file systems, because currently for ufs, ufs_ihashlock and
ufs_hashlock are just as contended as vcache.lock would be.

Taylor R Campbell

Apr 7, 2014, 12:07:40 PM
to Mindaugas Rasiukevicius, J. Hannken-Illjes, tech...@netbsd.org
Date: Mon, 7 Apr 2014 16:56:58 +0100
From: Mindaugas Rasiukevicius <rm...@netbsd.org>

What is "intern"?

`Intern' means `lookup, or create and insert if not there'.

Mindaugas Rasiukevicius

Apr 7, 2014, 12:23:44 PM
to Taylor R Campbell, J. Hannken-Illjes, tech...@netbsd.org
Taylor R Campbell <campbell+net...@mumble.net> wrote:
The point is that I do not find it meaningful/intuitive. Many other
systems just use get(). If you want a more accurate name, I suggest conget()
or something more meaningful.

--
Mindaugas

Mindaugas Rasiukevicius

Apr 7, 2014, 12:27:12 PM
to Taylor R Campbell, J. Hannken-Illjes, tech...@netbsd.org
Taylor R Campbell <campbell+net...@mumble.net> wrote:
Should be fine, but once tmpfs is switched to use vcache, it might become
a little bit more visible for some workloads..

--
Mindaugas

Taylor R Campbell

Apr 7, 2014, 12:31:50 PM
to Mindaugas Rasiukevicius, J. Hannken-Illjes, tech...@netbsd.org
Date: Mon, 7 Apr 2014 17:27:01 +0100
From: Mindaugas Rasiukevicius <rm...@netbsd.org>

Should be fine, but once tmpfs is switched to use vcache, it might become
a little bit more visible for some workloads..

Why switch tmpfs to use vcache? That would add memory and time
overhead for no reason. struct tmpfs_node already has a map from key
to vnode -- the tn_vnode field. No need to manage an rb tree to
accomplish the same thing.

J. Hannken-Illjes

Apr 7, 2014, 12:32:28 PM
to Taylor R Campbell, tech...@netbsd.org
On 07 Apr 2014, at 18:02, Taylor R Campbell <campbell+net...@mumble.net> wrote:

> Date: Mon, 7 Apr 2014 15:51:00 +0200
> From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>
>
> On 07 Apr 2014, at 03:22, Taylor R Campbell <campbell+net...@mumble.net> wrote:
>
>> Is the plan to nix getnewvnode and ungetnewvnode? It would be good to
>> avoid long-term duplication of its body. But I suspect we'll want to
>> keep those for, e.g., tmpfs, in which case this should use getnewvnode
>> and ungetnewvnode instead of vnalloc/setup/teardown/vnfree.
>
> Not sure if we have to keep getnewvnode() in the long term.
>
> It cannot be used here as we don't want the partially initialised
> vnode (before VFS_LOAD_NODE) on the mnt_vnodelist.
>
> If we end up replacing getnewvnode() with a two-stage allocator that
> initialises in the first stage and finalises in the second stage we
> may use it here.
>
> In that case, could you set the VI_CHANGING bit in vp, rather than
> node->vn_vnode = NULL in the vcache node, to indicate that the vnode
> is being initialized in vcache_intern?

This will not work for layered file systems. They will share the
v_interlock from another vnode (in VFS_LOAD_NODE), so we had better not
use the interlock before loading the fs node.

>> Does anything use VFS_VGET in ufs now? If so, why? If not, this
>> should just fail. (Eventually we ought to nix VFS_VGET, I think.)
>
> VFS_VGET() is used by NFS V3 readdirplus so we have to keep it.
>
> Ugh bletch. OK for now, but it would be nice if there were a better
> way to do this.
>
> Perhaps we ought at least to change VFS_VGET so it returns the vnode
> unlocked, and once we do that, use it consistently to look up vnodes
> by inode number rather than calling vcache_intern in multiple
> different places.

Maybe in the far future. Already looked at it and it is a very big job
if we want to do it in one step.

Taylor R Campbell

Apr 7, 2014, 12:38:13 PM
to J. Hannken-Illjes, tech...@netbsd.org
Date: Mon, 7 Apr 2014 18:32:02 +0200
From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>

On 07 Apr 2014, at 18:02, Taylor R Campbell <campbell+net...@mumble.net> wrote:

> In that case, could you set the VI_CHANGING bit in vp, rather than
> node->vn_vnode = NULL in the vcache node, to indicate that the vnode
> is being initialized in vcache_intern?

This will not work for layered file systems. They will share the
v_interlock from another vnode (in VFS_LOAD_NODE), so we had better not
use the interlock before loading the fs node.

You can pass the shared interlock to vcache_intern, like I illustrated
in the code I attached.

> Perhaps we ought at least to change VFS_VGET so it returns the vnode
> unlocked, and once we do that, use it consistently to look up vnodes
> by inode number rather than calling vcache_intern in multiple
> different places.

Maybe in the far future. Already looked at it and it is a very big job
if we want to do it in one step.

We could do this for just ufs:

1. Add ufs_vget_unlocked to make the sole call to vcache_intern for
ufs.

2. Make ufs_vget call ufs_vget_unlocked and then lock the result.

3. Replace calls to VFS_VGET by ufs_vget_unlocked, rather than by
vcache_intern.

Then, when we fix VFS_VGET, we replace ufs_vget by ufs_vget_unlocked.

J. Hannken-Illjes

Apr 7, 2014, 12:55:05 PM
to Taylor R Campbell, tech...@netbsd.org

On 07 Apr 2014, at 18:38, Taylor R Campbell <campbell+net...@mumble.net> wrote:

> Date: Mon, 7 Apr 2014 18:32:02 +0200
> From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>
>
> On 07 Apr 2014, at 18:02, Taylor R Campbell <campbell+net...@mumble.net> wrote:
>
>> In that case, could you set the VI_CHANGING bit in vp, rather than
>> node->vn_vnode = NULL in the vcache node, to indicate that the vnode
>> is being initialized in vcache_intern?
>
> This will not work for layered file systems. They will share the
> v_interlock from another vnode (in VFS_LOAD_NODE), so we had better not
> use the interlock before loading the fs node.
>
> You can pass the shared interlock to vcache_intern, like I illustrated
> in the code I attached.

Sorry, I missed the new argument to vcache_intern. But there is still
a window between getnewvnode() and setting VI_CHANGING where another
thread may succeed with vget(), so for this to work either getnewvnode()
has to return with the interlock held, or it has to set VI_CHANGING
itself and all callers of getnewvnode() have to be changed.

>> Perhaps we ought at least to change VFS_VGET so it returns the vnode
>> unlocked, and once we do that, use it consistently to look up vnodes
>> by inode number rather than calling vcache_intern in multiple
>> different places.
>
> Maybe in the far future. Already looked at it and it is a very big job
> if we want to do it in one step.
>
> We could do this for just ufs:
>
> 1. Add ufs_vget_unlocked to make the sole call to vcache_intern for
> ufs.
>
> 2. Make ufs_vget call ufs_vget_unlocked and then lock the result.
>
> 3. Replace calls to VFS_VGET by ufs_vget_unlocked, rather than by
> vcache_intern.
>
> Then, when we fix VFS_VGET, we replace ufs_vget by ufs_vget_unlocked.

Ok -- tomorrow ...

Mindaugas Rasiukevicius

Apr 7, 2014, 1:07:22 PM
to Taylor R Campbell, J. Hannken-Illjes, tech...@netbsd.org
Taylor R Campbell <campbell+net...@mumble.net> wrote:
Err.. yes. I seem to forget conceptual details quickly after not working
on a codebase for a while. We have in-memory inodes after all, duh! :)

--
Mindaugas

Chuck Silvers

Apr 7, 2014, 1:28:49 PM
to J. Hannken-Illjes, tech-kern@netbsd.org list
On Sun, Apr 06, 2014 at 12:14:24PM +0200, J. Hannken-Illjes wrote:
> Currently all file systems have to implement their own cache of
> vnode / fs node pairs. Most file systems use a copy and pasted
> version of ufs_ihash.
>
> So add a global vnode cache with lookup and remove:
[...]

hi,

the principle of this is good, but I have a couple concerns about the details:

- why use an rb tree instead of a hash table?
other people are saying that the lock contention is the same,
but there are two differences:
- updating an rb tree touches a lot more memory, so the lock is held for
longer.
- different parts of a hash table can easily be protected by
different locks, but that doesn't work for a tree.

- it looks like this increases the effective size of a vnode by a net 40 bytes
(7 pointers added, 2 removed). it would be better to do this without using
more memory than we do now. is there any real reason to have a separate
structure for this rather than just putting the fields directly in
struct vnode? to avoid needing fields for the "key" pointer and key length,
each fs could supply its own comparison function.
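
For what it's worth, a hypothetical sketch of the shape that keyless variant could take (all names here are invented for illustration, not a proposal from the diff):

/*
 * Hypothetical sketch of the keyless alternative: the cache stores
 * only vnodes, and the fs supplies hash/compare callbacks instead of
 * a per-node (key pointer, key length) pair.  Invented names.
 */
struct vcache_key_ops {
	uint32_t (*vko_hash)(struct mount *, const void *key, size_t key_len);
	bool	 (*vko_match)(struct vnode *, const void *key, size_t key_len);
};

int	vcache_lookup_keyless(struct mount *mp,
	    const struct vcache_key_ops *ops,
	    const void *key, size_t key_len, struct vnode **vpp);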

-Chuck

Greg Troxel

Apr 7, 2014, 1:50:35 PM
to Mindaugas Rasiukevicius, Taylor R Campbell, J. Hannken-Illjes, tech...@netbsd.org
Agreed that it may be surprising, but intern has a long history of that
kind of meaning in Lisp:

http://www.lispworks.com/documentation/HyperSpec/Body/f_intern.htm

Alan Barrett

Apr 7, 2014, 2:47:19 PM
to tech...@netbsd.org
I would find "conget" confusing, while finding "intern" clear.

Essentially, to "intern" a string or an external representation
of an object means to create an internal representation of
the string or object, or to find an already existing internal
representation of an identical object, and (usually) to return a
reference to that internal representation. The wikipedia article
at <https://en.wikipedia.org/wiki/String_interning> is focused on
strings, but other objects can also be interned.

--apb (Alan Barrett)

Mindaugas Rasiukevicius

Apr 7, 2014, 11:35:11 PM
to Alan Barrett, tech...@netbsd.org
How about looking for precedents in the NetBSD source tree rather than
string manipulation in Java? Nothing really uses "intern". Perhaps not
a great naming, but other subsystems usually just use "get".

By the way, since Lisp was mentioned... I actually suggested "conget"
while thinking of "cons". It seems a bit easier to associate with
"construction". The first thing which comes to mind when seeing an
"intern" is a "student". :)

--
Mindaugas

Thor Lancelot Simon

Apr 8, 2014, 12:18:56 AM
to Taylor R Campbell, Mindaugas Rasiukevicius, J. Hannken-Illjes, tech...@netbsd.org
This will nail people building on tmpfs, no?

Thor

Alan Barrett

Apr 8, 2014, 2:22:30 AM
to tech...@netbsd.org
On Tue, 08 Apr 2014, Mindaugas Rasiukevicius wrote:
> Nothing [in NetBSD] really uses "intern". Perhaps not a great
> naming, but other subsystems usually just use "get".

Yes, that's a good argument for just using "get".

--apb (Alan Barrett)

J. Hannken-Illjes

Apr 8, 2014, 5:23:32 AM
to Chuck Silvers, tech-kern@netbsd.org list

On 07 Apr 2014, at 19:28, Chuck Silvers <ch...@chuq.com> wrote:

> On Sun, Apr 06, 2014 at 12:14:24PM +0200, J. Hannken-Illjes wrote:
>> Currently all file systems have to implement their own cache of
>> vnode / fs node pairs. Most file systems use a copy and pasted
>> version of ufs_ihash.
>>
>> So add a global vnode cache with lookup and remove:
> [...]
>
> hi,
>
> the principle of this is good, but I have a couple concerns about the details:
>
> - why use an rb tree instead of a hash table?
> other people are saying that the lock contention is the same,
> but there are two differences:
> - updating an rb tree touches a lot more memory, so the lock is held for
> longer.
> - different parts of a hash table can easily be protected by
> different locks, but that doesn't work for a tree.

The underlying data structure may change. Once at least ufs, layerfs and nfs
use this cache it will be easier to do some measurements.

> - it looks like this increases the effective size of a vnode by a net 40 bytes
> (7 pointers added, 2 removed). it would be better to do this without using
> more memory than we do now. is there any real reason to have a separate
> structure for this rather than just putting the fields directly in
> struct vnode?

Not all file systems will use the cache (tmpfs for example doesn't need it)
so to me it looks better to use an extra struct when needed.

> to avoid needing fields for the "key" pointer and key length,
> each fs could supply its own comparison function.

Going this route (more or less making the current ufs_ihash general)
makes it impossible to serialise node creation.

J. Hannken-Illjes

Apr 9, 2014, 5:11:02 AM
to Taylor R Campbell, tech...@netbsd.org
Sorry for being late -- back from the openssl disaster now ...

There is no need to do this VI_CHANGING etc. here. Primary goal of
the vnode cache is to always initialise vnode / fs node pairs before
they become visible or appear on any lists.

There is no need to make the vcache API look like getnewvnode(). But
getnewvnode() should change to a two-pass allocator that initialises
in the first pass and finalises, adds to mount list in the second pass.

Taylor R Campbell

Apr 9, 2014, 9:57:51 AM
to J. Hannken-Illjes, tech...@netbsd.org
Date: Wed, 9 Apr 2014 11:10:37 +0200
From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>

There is no need to do this VI_CHANGING etc. here. Primary goal of
the vnode cache is to always initialise vnode / fs node pairs before
they become visible or appear on any lists.

But the vnode/fsnode pair effectively already does appear on a list,
namely in the vcache -- it's just that you've added a new mechanism
for waiting for the vnode's state to change so that *fs_vget now has
to have two cv_wait loops, one in vcache and one in vget.

Either way, once someone has decided to create a vnode for an inode,
and before the inode has been loaded from disk, everyone else wanting
a vnode for that inode must wait for its state to change. I don't see
a compelling reason to have more than one mechanism to wait for vnode
state transitions.

There is no need to make the vcache API look like getnewvnode(). But
getnewvnode() should change to a two-pass allocator that initialises
in the first pass and finalises, adds to mount list in the second pass.

Splitting it into two parts sounds good to me. I don't think it
matters much which pass puts it in the mount list, though. I propose
we do the following:

1. Add vready(vp). After getnewvnode, you must call either
ungetnewvnode or vready. Anyone trying to vget a newly created vnode
blocks until ungetnewvnode or vready. (We can argue about names
separately -- this is morally equivalent to a two-pass allocator.)

I have a patch to do this in all file systems, and to break the
getnewvnode API so existing calls fail to compile. It passes all
tests.

2. Change VFS_ROOT, VFS_VGET, and VFS_FHTOVP to return the vnode
unlocked, and mechanically convert all callers to lock on success and
implementations to unlock on exit so that this patch is easy to audit.

I have a patch to do this in all file systems, and to break the
signatures so existing calls fail to compile. It is mostly debugged
-- only puffs and zfs tests are failing, probably a trivial bug.

3. Add vcache(9) to reduce the copypasta and variance in structure of
the *fs_vget or *fs_fhtovp routines. This can be done incrementally
per-file-system and requires changes only in *fs_vget or *fs_fhtovp.

4. Make *fs_root, *fs_vget, and *fs_fhtovp never lock the vnodes.
This is not a major API change because it really matters only for
fs-internal callers, so it can be done incrementally.

5. Omit needless lock/unlock pairs that are left as a result of (2),
and omit the needless unlock/vget ../relock dance now that (4) is
done. This is a clean-up step that doesn't need to be mixed with the
API changes or new functionality, and can also be done bit by bit.

J. Hannken-Illjes

Apr 9, 2014, 10:55:52 AM
to Taylor R Campbell, tech...@netbsd.org
On 09 Apr 2014, at 15:57, Taylor R Campbell <campbell+net...@mumble.net> wrote:

> Date: Wed, 9 Apr 2014 11:10:37 +0200
> From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>
>
> There is no need to do this VI_CHANGING etc. here. Primary goal of
> the vnode cache is to always initialise vnode / fs node pairs before
> they become visible or appear on any lists.
>
> But the vnode/fsnode pair effectively already does appear on a list,
> namely in the vcache -- it's just that you've added a new mechanism
> for waiting for the vnode's state to change so that *fs_vget now has
> to have two cv_wait loops, one in vcache and one in vget.
>
> Either way, once someone has decided to create a vnode for an inode,
> and before the inode has been loaded from disk, everyone else wanting
> a vnode for that inode must wait for its state to change. I don't see
> a compelling reason to have more than one mechanism to wait for vnode
> state transitions.

And I don't see a reason to use the same mechanism in vcache. A newly
created vnode is invisible until it is fully loaded.

Changing vcache as you proposed would require file systems to set up
vnodes in two places: the call to vcache_intern(), where they have to
set type, ops and interlock, and the load operation, where they have
to set data, size etc.

To me it looks better if one file system operation has to set up the
full vnode state.

Do you object to the vcache API from the latest diff?

Chuck Silvers

Apr 9, 2014, 1:09:27 PM
to J. Hannken-Illjes, tech-kern@netbsd.org list
On Tue, Apr 08, 2014 at 11:23:11AM +0200, J. Hannken-Illjes wrote:
>
> On 07 Apr 2014, at 19:28, Chuck Silvers <ch...@chuq.com> wrote:
>
> > On Sun, Apr 06, 2014 at 12:14:24PM +0200, J. Hannken-Illjes wrote:
> >> Currently all file systems have to implement their own cache of
> >> vnode / fs node pairs. Most file systems use a copy and pasted
> >> version of ufs_ihash.
> >>
> >> So add a global vnode cache with lookup and remove:
> > [...]
> >
> > hi,
> >
> > the principle of this is good, but I have a couple concerns about the details:
> >
> > - why use an rb tree instead of a hash table?
> > other people are saying that the lock contention is the same,
> > but there are two differences:
> > - updating an rb tree touches a lot more memory, so the lock is held for
> > longer.
> > - different parts of a hash table can easily be protected by
> > different locks, but that doesn't work for a tree.
>
> The underlying data structure may change. Once at least ufs, layerfs and nfs
> use this cache it will be easier to do some measurements.

ok, no point in arguing this further without some actual data.
but we should have that data before this is checked in.


> > - it looks like this increases the effective size of a vnode by a net 40 bytes
> > (7 pointers added, 2 removed). it would be better to do this without using
> > more memory than we do now. is there any real reason to have a separate
> > structure for this rather than just putting the fields directly in
> > struct vnode?
>
> Not all file systems will use the cache (tmpfs for example doesn't need it)
> so to me it looks better to use an extra struct when needed.

making every other file system use more memory just so that tmpfs can use less
doesn't seem like a good trade-off to me. the only reason that tmpfs wouldn't
need it right now is because tmpfs currently can't be NFS-exported, is that right?
if we fixed tmpfs to be NFS-exportable then it would need this?


> > to avoid needing fields for the "key" pointer and key length,
> > each fs could supply its own comparison function.
>
> Going this route (more or less making the current ufs_ihash general)
> makes it impossible to serialise node creation.

I don't follow, how does this make serialising node creation impossible?
(I assume that by "serialise node creation" you mean ensure preventing the
creation of multiple vnodes for the same file.)

-Chuck

J. Hannken-Illjes

Apr 9, 2014, 1:32:36 PM
to Chuck Silvers, tech-kern@netbsd.org list
No, tmpfs stores a back pointer to the vnode inside the tmpfs node and
therefore doesn't need a cache.

> if we fixed tmpfs to be NFS-exportable then it would need this?

No.

>
>
>>> to avoid needing fields for the "key" pointer and key length,
>>> each fs could supply its own comparison function.
>>
>> Going this route (more or less making the current ufs_ihash general)
>> makes it impossible to serialise node creation.
>
> I don't follow, how does this make serialising node creation impossible?
> (I assume that by "serialise node creation" you mean ensure preventing the
> creation of multiple vnodes for the same file.)

If the key is part of the file system node, we cannot prevent different
threads from loading the same inode before the inode is at least partially
initialised.

J. Hannken-Illjes

Apr 10, 2014, 6:48:55 AM
to Chuck Silvers, tech-kern@netbsd.org list
On 09 Apr 2014, at 19:09, Chuck Silvers <ch...@chuq.com> wrote:

> On Tue, Apr 08, 2014 at 11:23:11AM +0200, J. Hannken-Illjes wrote:
>>
>> On 07 Apr 2014, at 19:28, Chuck Silvers <ch...@chuq.com> wrote:
>>
>>> On Sun, Apr 06, 2014 at 12:14:24PM +0200, J. Hannken-Illjes wrote:
>>>> Currently all file systems have to implement their own cache of
>>>> vnode / fs node pairs. Most file systems use a copy and pasted
>>>> version of ufs_ihash.
>>>>
>>>> So add a global vnode cache with lookup and remove:
>>> [...]
>>>
>>> hi,
>>>
>>> the principle of this is good, but I have a couple concerns about the details:
>>>
>>> - why use an rb tree instead of a hash table?
>>> other people are saying that the lock contention is the same,
>>> but there are two differences:
>>> - updating an rb tree touches a lot more memory, so the lock is held for
>>> longer.
>>> - different parts of a hash table can easily be protected by
>>> different locks, but that doesn't work for a tree.
>>
>> The underlying data structure may change. Once at least ufs, layerfs and nfs
>> use this cache it will be easier to do some measurements.
>
> ok, no point in arguing this further without some actual data.
> but we should have that data before this is checked in.

On a 20-core amd64 (KVM virtualised on a 24-core CentOS 6.5) with
kernel GENERIC without DIAGNOSTIC I ran "dbench 30" on
a 2 GByte MD based UFS1 filesystem:

Throughput 45.5631 MB/sec 30 procs
Elapsed time: 721.75 seconds.

-- Adaptive mutex spin

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
100.00 390598 4034.18 vcache <all>
87.65 343624 3535.86 vcache vcache_intern+44
7.47 15761 301.37 vcache vcache_remove+21
3.59 14706 144.67 vcache vcache_intern+1fc
1.29 16503 52.19 vcache vcache_intern+14c
0.00 4 0.09 vcache vcache_intern+b0

-- Adaptive mutex sleep

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
100.00 12 5.43 vcache <all>
63.34 11 3.44 vcache vcache_intern+44
36.66 1 1.99 vcache vcache_intern+1fc

It is spinning about 0.6 % of the time -- doesn't look like contention to me.

Ilia Zykov

Apr 13, 2014, 5:58:20 PM
to J. Hannken-Illjes, tech-kern@netbsd.org list
Hello!
Could you consider the possibility of integrating the global vnode cache
with one more generic FS function, genfs_lookup()?
genfs_lookup() could be used by other file systems that don't need anything exotic.
The interface for genfs_lookup() could be as follows:

/*
 * Callback to initialize a newly created vnode (vnode / fs node pair).
 */
int
fs_fill_vnode(struct mount *mp, ino_t fileno, struct vnode *vp) {
	....
	vp->v_data = ...
	vp->v_op = ...
	vp->v_tag = ...
	....
}

/*
 * Called from mount() to set up a genfs_lookup instance for a new mp.
 */
int
genfs_init_lookup(struct mount *mp,
    int (*fillvnode)(struct mount *, ino_t, struct vnode *), /* &fs_fill_vnode */
    int (*readdir)(void *)) {
}
/*
 * Called from unmount() to destroy the genfs_lookup instance for mp.
 */
int
genfs_remove_lookup(struct mount *mp) {
}

/*
 * Generic lookup for file systems.
 */
int
genfs_lookup(void *v) {
	...
	/* find the fileno for vpp with the help of readdir() */
	/* try to find a vnode for this fileno in the cache:
	   vcache_lookup(mp, &fileno, sizeof(fileno), vpp) */
	/* if an existing (not new) vnode was found, goto return */
	/* else get and initialize a new vnode (vnode / fs node pair)
	   with the fillvnode() callback */
return:
	....
	/* maybe vget, vref, VOP_UNLOCK(*vpp) ... whatever is needed */
	return 0;
}

Thanks in advance.
Ilia.

J. Hannken-Illjes

Apr 14, 2014, 9:58:50 AM
to Chuck Silvers, tech-kern@netbsd.org list
On 07 Apr 2014, at 19:28, Chuck Silvers <ch...@chuq.com> wrote:

> On Sun, Apr 06, 2014 at 12:14:24PM +0200, J. Hannken-Illjes wrote:
>> Currently all file systems have to implement their own cache of
>> vnode / fs node pairs. Most file systems use a copy and pasted
>> version of ufs_ihash.
>>
>> So add a global vnode cache with lookup and remove:
> [...]
>
> hi,
>
> the principle of this is good, but I have a couple concerns about the details:
>
> - why use an rb tree instead of a hash table?
> other people are saying that the lock contention is the same,
> but there are two differences:
> - updating an rb tree touches a lot more memory, so the lock is held for
> longer.
> - different parts of a hash table can easily be protected by
> different locks, but that doesn't work for a tree.

Did some tests, "dbench 30" with pre-loaded vnode cache of 110592 nodes.
A hash table gives ~2.7% more throughput so I switched to a hash table.
Also replaced the cv_wait/cv_broadcast with kpause as insert collisions
are very rare and moved the common vnode initialisation to vnalloc.

New diff at http://www.netbsd.org/~hannken/vnode-pass6-3.diff

Plan to commit early wednesday ...

Taylor R Campbell

Apr 15, 2014, 12:12:18 PM
to J. Hannken-Illjes, Chuck Silvers, tech-kern@netbsd.org list
Date: Mon, 14 Apr 2014 15:58:28 +0200
From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>

Also replaced the cv_wait/cv_broadcast with kpause as insert collisions
are very rare and moved the common vnode initialisation to vnalloc.

New diff at http://www.netbsd.org/~hannken/vnode-pass6-3.diff

Plan to commit early wednesday ...

I still don't think this approach is right. It makes a long-term copy
of logic in getnewvnode (because this vcache(9) will not be applicable
everywhere), there's no reason to use kpause or any new condvars when
we already have one inside each vnode which we'll be using anyway in
vget, and it still increases the overhead of every vnode using it.

I don't object to the principle of VFS_LOAD_NODE, or VOP_LOAD, but at
the moment I think it will cause greater divergence between file
systems and maintenance headaches as a result.

As an alternative, I propose the set of patches at
<https://www.NetBSD.org/~riastradh/tmp/20140415/>, to do the
following:

1. Add vready(vp), which callers of getnewvnode are required to call
before anyone else can vget vp. This doesn't really add any new
states to vnodes -- it just changes the wait-for-initialization
mechanism from abusing the vnode lock to waiting on the vnode's
condvar.

2. Make VFS_VGET, VFS_FHTOVP, and VFS_ROOT return unlocked (but
referenced) vnodes.

3. Import a vcache-like abstraction I wrote a couple months ago,
vinotab (`vfs inode table'), which adds no memory overhead to vnodes
versus ufs_ihash.

4. Convert some file systems to use it (more to come later), and kill
various lock dances.

This changes all file systems at once to use vready and to change the
contract of VFS_VGET, VFS_FHTOVP, and VFS_ROOT in steps (1) and (2).
The remaining changes are incremental and entail no changes to the
VFS/vnode API itself -- notably, vinotab sits outside vfs_vnode.c and
only captures a pattern that file systems could use directly.

J. Hannken-Illjes

Apr 16, 2014, 10:06:12 AM
to Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
On 15 Apr 2014, at 18:11, Taylor R Campbell <campbell+net...@mumble.net> wrote:

> Date: Mon, 14 Apr 2014 15:58:28 +0200
> From: "J. Hannken-Illjes" <han...@eis.cs.tu-bs.de>
>
> Also replaced the cv_wait/cv_broadcast with kpause as insert collisions
> are very rare and moved the common vnode initialisation to vnalloc.
>
> New diff at http://www.netbsd.org/~hannken/vnode-pass6-3.diff
>
> Plan to commit early wednesday ...
>
> I still don't think this approach is right. It makes a long-term copy
> of logic in getnewvnode (because this vcache(9) will not be applicable
> everywhere), there's no reason to use kpause or any new condvars when
> we already have one inside each vnode which we'll be using anyway in
> vget, and it still increases the overhead of every vnode using it.
>
> I don't object to the principle of VFS_LOAD_NODE, or VOP_LOAD, but at
> the moment I think it will cause greater divergence between file
> systems and maintenance headaches as a result.

Still don't see the divergence and maintenance headaches.

This discussion is bogged down. If no new arguments come up
we should ask core@ for a decision.

David Holland

Apr 26, 2014, 3:23:05 PM
to Taylor R Campbell, J. Hannken-Illjes, Chuck Silvers, tech-kern@netbsd.org list
On Tue, Apr 15, 2014 at 04:11:58PM +0000, Taylor R Campbell wrote:
> New diff at http://www.netbsd.org/~hannken/vnode-pass6-3.diff
>
> Plan to commit early wednesday ...
>
> I still don't think this approach is right. It makes a long-term copy
> of logic in getnewvnode (because this vcache(9) will not be applicable
> everywhere), there's no reason to use kpause or any new condvars when
> we already have one inside each vnode which we'll be using anyway in
> vget, and it still increases the overhead of every vnode using it.
>
> I don't object to the principle of VFS_LOAD_NODE, or VOP_LOAD, but at
> the moment I think it will cause greater divergence between file
> systems and maintenance headaches as a result.
>
> As an alternative, I propose the set of patches at
> <https://www.NetBSD.org/~riastradh/tmp/20140415/>, to do the
> following:
> [...]

Wading into this after the fact, I see the following points:

- Duplicating the getnewvnode logic is definitely bad; we have enough
cruft without adding new redundant but not quite equivalent code
paths.

- It seems to me that in the long run, all of this baloney should be
hidden away inside the vfs layer; filesystems that use the vnode cache
should only need to call vcache_get, and the only thing that should
ever see a partially initialized vnode is VOP_LOAD and nothing outside
the vnode cache should ever see a vnode with refcount zero. This in
particular means that all the various forms of vget() should be hidden
away, as should getnewvnode. It seems to me like both of these patch
sets are steps in this direction, but we're far enough away from the
destination that it's hard to choose between them.

- A while back (one or two rounds ago) I floated an idea that had
separate vnode key objects for the same reason they appear here: to
make insertion atomic. Riastradh convinced me at the time that this
wasn't necessary, so I dropped it and never got around to posting it
on tech-kern. However, it had a characteristic that I don't see here:
locks in the vnode keys. The idea of that was to introduce a level of
indirection so that nothing like VI_CHANGING was needed: while loading
a vnode, you unlock the table but leave the key locked, and that takes
care, in an orderly way, of making sure nobody else sees it. If we're
going to end up with vnode key objects, I think they should contain
locks in favor of having exposed or semi-exposed state flags.

- There are still filesystems that do not use the vnode cache. Half
the problem with the legacy scheme we have is that it tries to provide
the expire half of the cache to all filesystems, even those that don't
use or need the lookup half of the cache. This is fundamentally wrong.
I think rather than trying to continue with this folly we should do
something different that keeps these filesystems separate.

- Question 1: How much overhead would it really cause if we didn't
bother to cycle tmpfs vnodes "in" and "out" but just materialized them
along with the tmpfs_node and kept them around until someone destroys
the tmpfs_node? vnodes have a fair amount of slop in them, but they
aren't that big and if you're using tmpfs you presumably have enough
memory to keep whole files in RAM anyway. In this case we could just
provide a constructor and destructor for non-cache vnodes, decouple
that from the lifecycle logic (and firewall it off with plenty of
assertions and such) and a lot of the silly goes away.

- Question 2: The other obvious candidates for not using the vnode
cache are virtual constructs like kernfs and procfs. ISTM that
"loading" and "unloading" their vnodes is equally pointless and that
the same solution applies. The question is: are there any other fses
that wouldn't use the vnode cache for some other reason, such that
this reasoning wouldn't apply? I can't think of any, and it seems like
anything where there are genuine load and unload operations for vnodes
vnodes should be using the cache, but that doesn't prove anything.


Also, it should be vcache_get(), not intern; with all "intern"
operations I can think of, you pass in the object to be interned,
not just a key for it. If we had vcache_intern(), the model would be
that you consed up a vnode, then passed it (along with the key) to
vcache_intern(), which would then either insert it into the cache and
return it, or destroy it and return the existing equivalent one. This
is not really what we want.

and FWIW, I favor VOP_LOAD over VFS_LOADVNODE.

--
David A. Holland
dhol...@netbsd.org

J. Hannken-Illjes

Apr 28, 2014, 4:57:10 AM
to David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
Agreed.

> This in
> particular means that all the various forms of vget() should be hidden
> away, as should getnewvnode. It seems to me like both of these patch
> sets are steps in this direction, but we're far enough away from the
> destination that it's hard to choose between them.
>
> - A while back (one or two rounds ago) I floated an idea that had
> separate vnode key objects for the same reason they appear here: to
> make insertion atomic. Riastradh convinced me at the time that this
> wasn't necessary, so I dropped it and never got around to posting it
> on tech-kern. However, it had a characteristic that I don't see here:
> locks in the vnode keys. The idea of that was to introduce a level of
> indirection so that nothing like VI_CHANGING was needed: while loading
> a vnode, you unlock the table but leave the key locked, and that takes
> care, in an orderly way, of making sure nobody else sees it. If we're
> going to end up with vnode key objects, I think they should contain
> locks in favor of having exposed or semi-exposed state flags.

While I agree that atomic insertion is a mandatory property, I
don't see the need for a per-key lock. As collisions are very rare,
waiting and retrying using the hash table lock and key->vnode being NULL,
or a per-table cv like I did before, is sufficient (at least for now).

> - There are still filesystems that do not use the vnode cache. Half
> the problem with the legacy scheme we have is that it tries to provide
> the expire half of the cache to all filesystems, even those that don't
> use or need the lookup half of the cache. This is fundamentally wrong.
> I think rather than trying to continue with this folly we should do
> something different that keeps these filesystems separate.
>
> - Question 1: How much overhead would it really cause if we didn't
> bother to cycle tmpfs vnodes "in" and "out" but just materialized them
> along with the tmpfs_node and kept them around until someone destroys
> the tmpfs_node? vnodes have a fair amount of slop in them, but they
> aren't that big and if you're using tmpfs you presumably have enough
> memory to keep whole files in RAM anyway. In this case we could just
> provide a constructor and destructor for non-cache vnodes, decouple
> that from the lifecycle logic (and firewall it off with plenty of
> assertions and such) and a lot of the silly goes away.

Once we reach the point where this decision has to be made, it may be
better to use the vnode cache for tmpfs than to make tmpfs the only
fs that is different.

> - Question 2: The other obvious candidates for not using the vnode
> cache are virtual constructs like kernfs and procfs. ISTM that
> "loading" and "unloading" their vnodes is equally pointless and that
> the same solution applies. The question is: are there any other fses
> that wouldn't use the vnode cache for some other reason, such that
> this reasoning wouldn't apply? I can't think of any, and it seems like
> anything where there are genuine load and unload operations for vnodes
> vnodes should be using the cache, but that doesn't prove anything.

Kernfs and procfs already use a hash table and are candidates for
the vnode cache.

> Also, it should be vcache_get(), not intern; with all "intern"
> operations I can think of, you pass in the object to be interned,
> not just a key for it. If we had vcache_intern(), the model would be
> that you consed up a vnode, then passed it (along with the key) to
> vcache_intern(), which would then either insert it into the cache and
> return it, or destroy it and return the existing equivalent one. This
> is not really what we want.

Agreed. And we may also need vcache_lookup() to look up a node without
loading it.

> and FWIW, I favor VOP_LOAD over VFS_LOADVNODE.

I prefer a fs operation over a vnode operation to initialise a
vnode. With a vnode operation we would either have to pass the vnode
operations vector to vcache_get(), which is ugly, or have to set
it from "*mp->mnt_op->vfs_opv_descs[0]->opv_desc_vector_p" before
calling VOP_LOAD(). To me it looks better to have the fs set up the
operations vector (which may not be the default one).


Could we finally agree on this interface:

vcache_get(mp, key, key_len, vpp) to lookup and possibly load a vnode.
vcache_lookup(mp, key, key_len, vpp) to lookup a vnode.
vcache_remove(mp, key, key_len) to remove a vnode from the cache.
VFS_LOAD_NODE(mp, vp, key, key_len, new_key) to initialise a vnode.
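
For illustration, usage from a file system would look roughly like this
(hypothetical "foofs" code, not taken from the diff):

int
foofs_vget(struct mount *mp, ino_t ino, struct vnode **vpp)
{

	/* Returns the vnode referenced; loads it via VFS_LOAD_NODE on a miss. */
	return vcache_get(mp, &ino, sizeof(ino), vpp);
}

int
foofs_reclaim(void *v)
{
	struct vop_reclaim_args *ap = v;
	struct vnode *vp = ap->a_vp;
	struct foofs_node *fp = vp->v_data;

	/* Drop the cache entry before the fs node goes away. */
	vcache_remove(vp->v_mount, &fp->f_ino, sizeof(fp->f_ino));
	/* ... free the fs node and clear vp->v_data ... */
	return 0;
}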

Ilia Zykov

Apr 28, 2014, 7:24:38 AM
to J. Hannken-Illjes, David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
Sorry to disturb you.

But I have a small comment about the subject.
I think it is unnecessary overhead to involve the mount point in the lookup for a global cache.
You save a few bytes of hash heads for every mount point, but you lose lookup efficiency.
Every lookup has one more compare for mp, and your hash table will be bigger.
In any case, having a hash head and hash lock for every mount point is the more flexible way, IMHO.
"struct mount" is a good place for this.

Ilia.

J. Hannken-Illjes

Apr 28, 2014, 7:36:43 AM
to Ilia Zykov, David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
First, one of the goals here is to have one table for all vnodes.

Second, regarding space, a single table is actually smaller: a hash table
allocates next_power_of_two(desiredvnodes) list heads, so every hash table
accounts for ~1.5 pointers per vnode.
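
For example, with desiredvnodes = 3000 each table gets 4096 list heads,
i.e. about 1.4 pointers per vnode; averaged over where desiredvnodes
falls between two powers of two that is the ~1.5 above -- and with
per-mount tables you pay it once per mount instead of once.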

Ilia Zykov

Apr 28, 2014, 11:29:58 AM
to J. Hannken-Illjes, David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
>
> First, one of the goals here is to have one table for all vnodes.
>

Allow me to disagree: what is the advantage of having only one table?
I see only disadvantages.
For instance:
you mount more than one FS, and one of them has a hardware problem, a bad sector ...
Do you then have problems with all of them, or only with that one?
I understand why the global functions are needed, but I don't understand why
a global table is needed.

Sorry for my stupidity.
Ilia.

J. Hannken-Illjes

Apr 28, 2014, 11:51:52 AM
to Ilia Zykov, David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
On 28 Apr 2014, at 17:29, Ilia Zykov <net...@izyk.ru> wrote:

>>
>> First, one of the goals here is to have one table for all vnodes.
>>
>
> Allow me to disagree: what is the advantage of having only one table?
> I see only disadvantages.
> For instance:
> you mount more than one FS, and one of them has a hardware problem, a bad sector ...
> Do you then have problems with all of them, or only with that one?

Other file systems are unaffected: if a fs node fails to load, the
partially initialised vnode gets destroyed and we're done. Loading
a node can only block other threads trying to load the same node.

Having per-mount tables would not change anything here.

Ilia Zykov

Apr 28, 2014, 1:15:40 PM
to J. Hannken-Illjes, David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
>
> Having per-mount tables would not change anything here.
>

Ok.
It is not a fundamental thing.
If needed, this can be changed later.

Thank you.
Ilia.

Ilia Zykov

Apr 30, 2014, 3:50:55 AM
to J. Hannken-Illjes, tech-kern@netbsd.org list, David Holland
Hello.
What do you think about DragonFly's VOP_NRESOLVE() API?
I think it resolves many problems.
It would not replace the other changes, but it would be a very good supplement.
If I understood it correctly, the VFS locks the name component before the
subsequent vnode operations.

Ilia.

J. Hannken-Illjes

Apr 30, 2014, 11:15:34 AM
to tech-kern@netbsd.org list
On 28 Apr 2014, at 10:56, J. Hannken-Illjes <han...@eis.cs.tu-bs.de> wrote:

<snip>
>
> Could we finally agree on this interface:
>
> vcache_get(mp, key, key_len, vpp) to lookup and possibly load a vnode.
> vcache_lookup(mp, key, key_len, vpp) to lookup a vnode.
> vcache_remove(mp, key, key_len) to remove a vnode from the cache.
> VFS_LOAD_NODE(mp, vp, key, key_len, new_key) to initialise a vnode.

Updated diff at http://www.netbsd.org/~hannken/vnode-pass6-4.diff

David Holland

May 1, 2014, 12:07:10 PM
to J. Hannken-Illjes, David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
On Mon, Apr 28, 2014 at 10:56:44AM +0200, J. Hannken-Illjes wrote:
> > Wading into this after the fact, I see the following points:
> >
> > - Duplicating the getnewvnode logic is definitely bad; we have enough
> > cruft without adding new redundant but not quite equivalent code
> > paths.
> >
> > - It seems to me that in the long run, all of this baloney should be
> > hidden away inside the vfs layer; filesystems that use the vnode cache
> > should only need to call vcache_get, and the only thing that should
> > ever see a partially initialized vnode is VOP_LOAD and nothing outside
> > the vnode cache should ever see a vnode with refcount zero.
>
> Agreed.

ok, in that case I think the only issue with the patch I looked at
Sunday (was it that long ago already?) is that it does duplicate the
getnewvnode logic -- if you've since fixed that I don't think I have
further substantive objections.

> > - A while back (one or two rounds ago) I floated an idea that had
> > separate vnode key objects for the same reason they appear here: to
> > make insertion atomic. Riastradh convinced me at the time that this
> > wasn't necessary, so I dropped it and never got around to posting it
> > on tech-kern. However, it had a characteristic that I don't see here:
> > locks in the vnode keys. The idea of that was to introduce a level of
> > indirection so that nothing like VI_CHANGING was needed: while loading
> > a vnode, you unlock the table but leave the key locked, and that takes
> > care, in an orderly way, of making sure nobody else sees it. If we're
> > going to end up with vnode key objects, I think they should contain
> > locks in favor of having exposed or semi-exposed state flags.
>
> While I agree that atomic insertion is a mandatory property, I
> don't see the need for a per-key lock. As collisions are very rare,
> waiting and retrying under the hash table lock -- using key->vnode
> being NULL or a per-table cv like I did before -- is sufficient (at least for now).

Having a lock and waiting on a cv and state bit are exactly
equivalent, except that if you have a lock,
- neither the state nor the mechanism is exposed.

I would consider that an advantage, but it's not a killer.

I don't think holding the table lock (or even a lock on part of the
table, like one hash chain) while loading a vnode is a good idea,
because a layer vnode is likely to come back to the table while
loading.

Hmm. I think one of the reasons I had a separate lock (and a refcount
on the key structure) in my scheme rather than just setting the vnode
pointer to null is to avoid a race on reclaim or revoke: to reclaim
you want to lock the key, unlock the table, do reclaim, and only then
null the vnode pointer and unlock the key. (Then you relock the table,
look up the key again, and take it out if the vnode pointer is still
null.) If you only null the vnode pointer before running reclaim you
either have to hold the table lock during reclaim (almost certainly
not a good idea -- just yesterday in another context I was dealing
with a fs that needed to load a vnode when reclaiming another) or you
have a race where the vnode can be reloaded while reclaim is still in
progress.

> > - Question 1: How much overhead would it really cause if we didn't
> > bother to cycle tmpfs vnodes "in" and "out" but just materialized them
> > along with the tmpfs_node and kept them around until someone destroys
> > the tmpfs_node? vnodes have a fair amount of slop in them, but they
> > aren't that big and if you're using tmpfs you presumably have enough
> > memory to keep whole files in RAM anyway. In this case we could just
> > provide a constructor and destructor for non-cache vnodes, decouple
> > that from the lifecycle logic (and firewall it off with plenty of
> > assertions and such) and a lot of the silly goes away.
>
> Once we reach a point where this decision has to be made it may be
> better to use the vnode cache for tmpfs than to make tmpfs the only
> fs being different.

I don't think that's necessary. For vnodes belonging to virtual
objects, there's very little to be gained by counting them under
maxvnodes or cycling them "out of memory". The actual underlying
objects can't go away and the vnode itself is just a fairly small
administrative structure.

(Also, with tmpfs files, given that the vnode is the uvm object
holding the file data, presumably there's some hack in place already
so it can't go away and thereby lose said file data? Or is the vnode a
uvm object that maps another uvm object that's the real file data? If
the latter, simplifying this would buy back the overhead of making
tmpfs vnodes permanently "resident".)

> > - Question 2: The other obvious candidates for not using the vnode
> > cache are virtual constructs like kernfs and procfs. ISTM that
> > "loading" and "unloading" their vnodes is equally pointless and that
> > the same solution applies. The question is: are there any other fses
> > that wouldn't use the vnode cache for some other reason, such that
> > this reasoning wouldn't apply? I can't think of any, and it seems like
> > anything where there are genuine load and unload operations for
> > vnodes should be using the cache, but that doesn't prove anything.
>
> Kernfs and procfs already use a hash table and are candidates for
> the vnode cache.

..is there a reason for this or is it just because that's what vfs
demands/demanded, and/or copypaste from other fses? I can't think of
any reason why it would be meaningful to "unload" a kernfs vnode. The
closest approximation to a reason I can think of is that it might make
the device vnode for /kern/rootdev permanently resident; but that's
pretty much true anyway.

> > Also, it should be vcache_get(), not intern; with all "intern"
> > operations I can think of, you pass in the object to be interned,
> > not just a key for it. If we had vcache_intern(), the model would be
> > that you consed up a vnode, then passed it (along with the key) to
> > vcache_intern(), which would then either insert it into the cache and
> > return it, or destroy it and return the existing equivalent one. This
> > is not really what we want.
>
> Agreed. And we may need vcache_lookup to lookup but not load a
> node too.

When would one do that? I suppose there are probably reasons, but I
can't think of any offhand. It's going to inherently be a race, also,
since if it returns NULL or ENOENT or whatever it does if the vnode
isn't loaded, the vnode could get loaded by the time the caller sees
this. Consequently I suspect that most of the uses that someone might
think they want would actually be wrong. :-/

> > and FWIW, I favor VOP_LOAD over VFS_LOADVNODE.
>
> I prefer a fs operation over a vnode operation to initialise a
> vnode. With a vnode operation we would either have to pass the vnode
> operations vector to vcache_get(), which is ugly, or have to set
> it from "*mp->mnt_op->vfs_opv_descs[0]->opv_desc_vector_p" before
> calling VOP_LOAD(). To me it looks better to have the fs set up the
> operations vector (which may not be the default one).

Yes, you're right.

> Could we finally agree on this interface:
>
> vcache_get(mp, key, key_len, vpp) to lookup and possibly load a vnode.
> vcache_lookup(mp, key, key_len, vpp) to lookup a vnode.
> vcache_remove(mp, key, key_len) to remove a vnode from the cache.
> VFS_LOAD_NODE(mp, vp, key, key_len, new_key) to initialise a vnode.

I would prefer VFS_LOADVNODE rather than VFS_LOAD_NODE, for the
following not-quite-cosmetic reasons:
- it should be "vnode" rather than "node" as a matter of standard
usage;
- when all the op names have one underscore it's easier to process
tables of them with editor macros. I know the renamelock ones
broke this property, and maybe they should be changed...

What calls vcache_remove? Is this the external interface for revoke?
(Reclaim should be triggered from inside the cache, at least in the
long run.) We need a way for the fs to mark vnodes as not worth
caching any more (e.g. because they've been unlinked), so they get
reclaimed as soon as the refcount hits zero, but I don't think
"remove" gives the right semantics.

If it's the interface for revoke, we should probably call it
vcache_revoke. If it's also the current path for getting reclaim
called, we should probably make a separate entry point for that (even
if it goes to the same logic) and hide it away later.

If it's something else, please tell me what I've forgotten :-)

David Holland

May 1, 2014, 12:09:49 PM
to Ilia Zykov, J. Hannken-Illjes, David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
On Mon, Apr 28, 2014 at 07:29:44PM +0400, Ilia Zykov wrote:
> > First, one of the goals here is to have one table for all vnodes.
>
> Allow me to disagree: what is the advantage of having only one table?
> I see only disadvantages.

The reason (the only reason, really) is to bound the total size of the
vnode cache across all fses and keep all the vnodes on a single
least-recently-used chain. That way if I do something that grinds in
/tmp for a while it will cycle in vnodes from the root fs and cycle
out the ones from /home, and if I go back to ~ it will cycle in the
ones from /home in favor of the ones from /tmp.

Otherwise it would be much simpler (and probably more scalable) to
have one cache per fs.

J. Hannken-Illjes

May 1, 2014, 12:50:15 PM
to David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
On 01 May 2014, at 18:06, David Holland <dholla...@netbsd.org> wrote:

> On Mon, Apr 28, 2014 at 10:56:44AM +0200, J. Hannken-Illjes wrote:
>>> Wading into this after the fact, I see the following points:
>>>
>>> - Duplicating the getnewvnode logic is definitely bad; we have enough
>>> cruft without adding new redundant but not quite equivalent code
>>> paths.
>>>
>>> - It seems to me that in the long run, all of this baloney should be
>>> hidden away inside the vfs layer; filesystems that use the vnode cache
>>> should only need to call vcache_get, and the only thing that should
>>> ever see a partially initialized vnode is VOP_LOAD and nothing outside
>>> the vnode cache should ever see a vnode with refcount zero.
>>
>> Agreed.
>
> ok, in that case I think the only issue with the patch I looked at
> Sunday (was it that long ago already?) is that it does duplicate the
> getnewvnode logic -- if you've since fixed that I don't think I have
> further substantive objections.

What do you mean with "duplicate the getnewvnode logic" here?
Vfs_busy()/vfs_unbusy() and vfs_insmntque()?

>>> - A while back (one or two rounds ago) I floated an idea that had
>>> separate vnode key objects for the same reason they appear here: to
>>> make insertion atomic. Riastradh convinced me at the time that this
>>> wasn't necessary, so I dropped it and never got around to posting it
>>> on tech-kern. However, it had a characteristic that I don't see here:
>>> locks in the vnode keys. The idea of that was to introduce a level of
>>> indirection so that nothing like VI_CHANGING was needed: while loading
>>> a vnode, you unlock the table but leave the key locked, and that takes
>>> care, in an orderly way, of making sure nobody else sees it. If we're
>>> going to end up with vnode key objects, I think they should contain
>>> locks in favor of having exposed or semi-exposed state flags.
>>
>> While I agree that atomic insertion is a mandatory property, I
>> don't see the need for a per-key lock. As collisions are very rare,
>> waiting and retrying under the hash table lock -- using key->vnode
>> being NULL or a per-table cv like I did before -- is sufficient (at least for now).
>
> Having a lock and waiting on a cv and state bit are exactly
> equivalent, except that if you have a lock,
> - neither the state nor the mechanism is exposed.

What state is exposed by "wait and retry"?

> I would consider that an advantage, but it's not a killer.
>
> I don't think holding the table lock (or even a lock on part of the
> table, like one hash chain) while loading a vnode is a good idea,
> because a layer vnode is likely to come back to the table while
> loading.

I never presented a patch where the node got loaded with the table lock
held.

> Hmm. I think one of the reasons I had a separate lock (and a refcount
> on the key structure) in my scheme rather than just setting the vnode
> pointer to null is to avoid a race on reclaim or revoke: to reclaim
> you want to lock the key, unlock the table, do reclaim, and only then
> null the vnode pointer and unlock the key. (Then you relock the table,
> look up the key again, and take it out if the vnode pointer is still
> null.) If you only null the vnode pointer before running reclaim you
> either have to hold the table lock during reclaim (almost certainly
> not a good idea -- just yesterday in another context I was dealing
> with a fs that needed to load a vnode when reclaiming another) or you
> have a race where the vnode can be reloaded while reclaim is still in
> progress.

What is the difference between a locked cache node and a cache node
with NULL vnode pointer (where the vnode pointer gets examined or
changed with the table lock held)?

>>> - Question 1: How much overhead would it really cause if we didn't
>>> bother to cycle tmpfs vnodes "in" and "out" but just materialized them
>>> along with the tmpfs_node and kept them around until someone destroys
>>> the tmpfs_node? vnodes have a fair amount of slop in them, but they
>>> aren't that big and if you're using tmpfs you presumably have enough
>>> memory to keep whole files in RAM anyway. In this case we could just
>>> provide a constructor and destructor for non-cache vnodes, decouple
>>> that from the lifecycle logic (and firewall it off with plenty of
>>> assertions and such) and a lot of the silly goes away.
>>
>> Once we reach a point where this decision has to be made it may be
>> better to use the vnode cache for tmpfs than to make tmpfs the only
>> fs being different.
>
> I don't think that's necessary. For vnodes belonging to virtual
> objects, there's very little to be gained by counting them under
> maxvnodes or cycling them "out of memory". The actual underlying
> objects can't go away and the vnode itself is just a fairly small
> administrative structure.

Sounds reasonable -- but we are not at a point to make this decision.

> (Also, with tmpfs files, given that the vnode is the uvm object
> holding the file data, presumably there's some hack in place already
> so it can't go away and thereby lose said file data? Or is the vnode a
> uvm object that maps another uvm object that's the real file data? If
> the latter, simplifying this would buy back the overhead of making
> tmpfs vnodes permanently "resident".)
>
>>> - Question 2: The other obvious candidates for not using the vnode
>>> cache are virtual constructs like kernfs and procfs. ISTM that
>>> "loading" and "unloading" their vnodes is equally pointless and that
>>> the same solution applies. The question is: are there any other fses
>>> that wouldn't use the vnode cache for some other reason, such that
>>> this reasoning wouldn't apply? I can't think of any, and it seems like
>>> anything where there are genuine load and unload operations for
>>> vnodes should be using the cache, but that doesn't prove anything.
>>
>> Kernfs and procfs already use a hash table and are candidates for
>> the vnode cache.
>
> ...is there a reason for this or is it just because that's what vfs
> demands/demanded, and/or copypaste from other fses? I can't think of
> any reason why it would be meaningful to "unload" a kernfs vnode. The
> closest approximation to a reason I can think of is that it might make
> the device vnode for /kern/rootdev permanently resident; but that's
> pretty much true anyway.
>
>>> Also, it should be vcache_get(), not intern; with all "intern"
>>> operations I can think of, you pass in the object to be interned,
>>> not just a key for it. If we had vcache_intern(), the model would be
>>> that you consed up a vnode, then passed it (along with the key) to
>>> vcache_intern(), which would then either insert it into the cache and
>>> return it, or destroy it and return the existing equivalent one. This
>>> is not really what we want.
>>
>> Agreed. And we may need vcache_lookup to lookup but not load a
>> node too.
>
> When would one do that? I suppose there are probably reasons, but I
> can't think of any offhand. It's going to inherently be a race, also,
> since if it returns NULL or ENOENT or whatever it does if the vnode
> isn't loaded, the vnode could get loaded by the time the caller sees
> this. Consequently I suspect that most of the uses that someone might
> think they want would actually be wrong. :-/

At least union will need something like this. The key of a union
node may change and there must be a way to "test" a key. The file
system will have to serialise to prevent races.

But it may be better to not expose vcache_lookup() before we have
a need for it.

>>> and FWIW, I favor VOP_LOAD over VFS_LOADVNODE.
>>
>> I prefer a fs operation over a vnode operation to initialise a
>> vnode. With a vnode operation we would either have to pass the vnode
>> operations vector to vcache_get(), which is ugly, or have to set
>> it from "*mp->mnt_op->vfs_opv_descs[0]->opv_desc_vector_p" before
>> calling VOP_LOAD(). To me it looks better to have the fs set up the
>> operations vector (which may not be the default one).
>
> Yes, you're right.
>
>> Could we finally agree on this interface:
>>
>> vcache_get(mp, key, key_len, vpp) to lookup and possibly load a vnode.
>> vcache_lookup(mp, key, key_len, vpp) to lookup a vnode.
>> vcache_remove(mp, key, key_len) to remove a vnode from the cache.
>> VFS_LOAD_NODE(mp, vp, key, key_len, new_key) to initialise a vnode.
>
> I would prefer VFS_LOADVNODE rather than VFS_LOAD_NODE, for the
> following not-quite-cosmetic reasons:
> - it should be "vnode" rather than "node" as a matter of standard
> usage;
> - when all the op names have one underscore it's easier to process
> tables of them with editor macros. I know the renamelock ones
> broke this property, and maybe they should be changed...

Ok.

> What calls vcache_remove? Is this the external interface for revoke?
> (Reclaim should be triggered from inside the cache, at least in the
> long run.) We need a way for the fs to mark vnodes as not worth
> caching any more (e.g. because they've been unlinked), so they get
> reclaimed as soon as the refcount hits zero, but I don't think
> "remove" gives the right semantics.

The cache in its current form does not change the vnode life cycle.
So vcache_remove() is called from the file systems reclaim operation.

This will likely change once the vnode cache becomes responsible for
the life cycle -- but one step after the other please.

> If it's the interface for revoke, we should probably call it
> vcache_revoke. If it's also the current path for getting reclaim
> called, we should probably make a separate entry point for that (even
> if it goes to the same logic) and hide it away later.
>
> If it's something else, please tell me what I've forgotten :-)
>
> --
> David A. Holland
> dhol...@netbsd.org

David Holland

May 1, 2014, 2:02:57 PM
to J. Hannken-Illjes, David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
On Thu, May 01, 2014 at 06:49:41PM +0200, J. Hannken-Illjes wrote:
> > ok, in that case I think the only issue with the patch I looked at
> > Sunday (was it that long ago already?) is that it does duplicate the
> > getnewvnode logic -- if you've since fixed that I don't think I have
> > further substantive objections.
>
> What do you mean with "duplicate the getnewvnode logic" here?
> Vfs_busy()/vfs_unbusy() and vfs_insmntque()?

ultimately, what Taylor was objecting to -- he convinced me last
weekend that it was a valid objection, but I don't have time to look
at the details right now :-/

> > Having a lock and waiting on a cv and state bit are exactly
> > equivalent, except that if you have a lock,
> > - neither the state nor the mechanism is exposed.
>
> What state is exposed by "wait and retry"?

The state you're waiting on. If the state in question is that the
vnode pointer inside the table is null, I suppose then it's decently
hidden away.

> > I would consider that an advantage, but it's not a killer.
> >
> > I don't think holding the table lock (or even a lock on part of the
> > table, like one hash chain) while loading a vnode is a good idea,
> > because a layer vnode is likely to come back to the table while
> > loading.
>
> I never presented a patch where the node got loaded with the table lock
> held.

ok good, I wasn't sure exactly what you meant and wanted to make sure
I'd addressed all the cases.

> > Hmm. I think one of the reasons I had a separate lock (and a refcount
> > on the key structure) in my scheme rather than just setting the vnode
> > pointer to null is to avoid a race on reclaim or revoke: to reclaim
> > you want to lock the key, unlock the table, do reclaim, and only then
> > null the vnode pointer and unlock the key. (Then you relock the table,
> > look up the key again, and take it out if the vnode pointer is still
> > null.) If you only null the vnode pointer before running reclaim you
> > either have to hold the table lock during reclaim (almost certainly
> > not a good idea -- just yesterday in another context I was dealing
> > with a fs that needed to load a vnode when reclaiming another) or you
> > have a race where the vnode can be reloaded while reclaim is still in
> > progress.
>
> What is the difference between a locked cache node and a cache node
> with NULL vnode pointer (where the vnode pointer gets examined or
> changed with the table lock held)?

My scheme was trying to address the fact that you need to hold the
vnode lock to call VOP_RECLAIM and it isn't a good idea to let the
table lock couple to the vnode lock.

[various rambling deleted]

Your scheme does this too, and on further reflection, I think your
scheme is better.

IIUC your scheme is: lock the table, look up the key, sleep while it's
present with a null vnode pointer; then either enter a key with a null
vnode pointer, unlock the table, load, relock the table, relookup, set
the vnode pointer, broadcast, unlock; or null the vnode pointer,
unlock the table, reclaim, relock the table, relookup, remove the key,
broadcast, unlock.

(This is by deduction rather than by reading the code, so maybe I've
got it all wrong. But this is the scheme I think is better than mine.)
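
I.e., in rough pseudocode -- my reading, not the code, with invented
names and error handling omitted:

/* vcache_get(), load path: */
	mutex_enter(&table_lock);
	while ((node = lookup(key)) != NULL && node->vn_vnode == NULL)
		cv_wait(&table_cv, &table_lock);	/* load in progress */
	if (node == NULL) {
		node = enter(key);			/* vn_vnode == NULL */
		mutex_exit(&table_lock);
		/* ... getnewvnode, VFS_LOADVNODE ... */
		mutex_enter(&table_lock);
		lookup(key)->vn_vnode = vp;
		cv_broadcast(&table_cv);
	}
	mutex_exit(&table_lock);

/* reclaim path: */
	mutex_enter(&table_lock);
	lookup(key)->vn_vnode = NULL;
	mutex_exit(&table_lock);
	/* ... VOP_RECLAIM ... */
	mutex_enter(&table_lock);
	remove(lookup(key));
	cv_broadcast(&table_cv);
	mutex_exit(&table_lock);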

> > I don't think that's necessary. For vnodes belonging to virtual
> > objects, there's very little to be gained by counting them under
> > maxvnodes or cycling them "out of memory". The actual underlying
> > objects can't go away and the vnode itself is just a fairly small
> > administrative structure.
>
> Sounds reasonable -- but we are not at a point to make this decision.

I think we are; we aren't at the point to *do* it, but knowing more or
less where we want to go I think we can make the design decision.

> >> Agreed. And we may need vcache_lookup to lookup but not load a
> >> node too.
> >
> > When would one do that? I suppose there are probably reasons, but I
> > can't think of any offhand. It's going to inherently be a race, also,
> > since if it returns NULL or ENOENT or whatever it does if the vnode
> > isn't loaded, the vnode could get loaded by the time the caller sees
> > this. Consequently I suspect that most of the uses that someone might
> > think they want would actually be wrong. :-/
>
> At least union will need something like this. The key of a union
> node may change and there must be a way to "test" a key. The file
> system will have to serialise to prevent races.

Ew. It's been a long time since I looked at onionfs and I've blocked
most of it out. How does this happen? It probably needs to be able to
look up its own vnodes by the keys or identities of either the upper
or lower vnodes under it, but I'd think it'd be better served to
manufacture its own keys and have its own table for the layer
translation. That or we could have it enter both and allow the vnode
cache to point to multiple references of the same vnode, but that
seems like a can of worms.

How does onionfs assign its inode numbers?

On the other hand, as Taylor noted, msdosfs needs to be able to change
the key of a vnode, so maybe we should just expose such a function. It
should be relatively straightforward as it doesn't need to do anything
besides lock and shuffle the table.
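
Something like a hypothetical

int
vcache_rekey(struct mount *mp, const void *old_key, size_t old_key_len,
    const void *new_key, size_t new_key_len)

(name and signature made up here, not in the diff) would cover both the
msdosfs case and the union case above.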

> But it may be better to not expose vcache_lookup() before we have
> a need for it.

Yeah.

> > I would prefer VFS_LOADVNODE rather than VFS_LOAD_NODE, [...]
>
> Ok.

Thanks.

> > What calls vcache_remove? Is this the external interface for revoke?
> > (Reclaim should be triggered from inside the cache, at least in the
> > long run.) We need a way for the fs to mark vnodes as not worth
> > caching any more (e.g. because they've been unlinked), so they get
> > reclaimed as soon as the refcount hits zero, but I don't think
> > "remove" gives the right semantics.
>
> The cache in its current form does not change the vnode life cycle.
> So vcache_remove() is called from the file systems reclaim operation.

Right, ok.

> This will likely change once the vnode cache becomes responsible for
> the life cycle -- but one step after the other please.

Yes indeed!

Ilia Zykov

May 1, 2014, 3:23:50 PM
to David Holland, J. Hannken-Illjes, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
On 01.05.2014 22:02, David Holland wrote:
> > > I don't think that's necessary. For vnodes belonging to virtual
> > > objects, there's very little to be gained by counting them under
> > > maxvnodes or cycling them "out of memory". The actual underlying
> > > objects can't go away and the vnode itself is just a fairly small
> > > administrative structure.

Maybe this is not the right place for it, but
procfs' objects can go away.

Ilia Zykov

May 1, 2014, 3:52:24 PM
to David Holland, J. Hannken-Illjes, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
>> > > maxvnodes or cycling them "out of memory". The actual underlying
>> > > objects can't go away and the vnode itself is just a fairly small
>> > > administrative structure.
>
> Maybe it is not to place.
> procfs' objects can go away.
>

They are not the actual objects, sorry.

J. Hannken-Illjes

May 2, 2014, 5:40:21 AM
to David Holland, Taylor R Campbell, Chuck Silvers, tech-kern@netbsd.org list
On 01 May 2014, at 20:02, David Holland <dholla...@netbsd.org> wrote:

> On Thu, May 01, 2014 at 06:49:41PM +0200, J. Hannken-Illjes wrote:
>>> ok, in that case I think the only issue with the patch I looked at
>>> Sunday (was it that long ago already?) is that it does duplicate the
>>> getnewvnode logic -- if you've since fixed that I don't think I have
>>> further substantive objections.
>>
>> What do you mean with "duplicate the getnewvnode logic" here?
>> Vfs_busy()/vfs_unbusy() and vfs_insmntque()?
>
> ultimately, what Taylor was objecting to -- he convinced me last
> weekend that it was a valid objection, but I don't have time to look
> at the details right now :-/
<snip>
>>> I don't think that's necessary. For vnodes belonging to virtual
>>> objects, there's very little to be gained by counting them under
>>> maxvnodes or cycling them "out of memory". The actual underlying
>>> objects can't go away and the vnode itself is just a fairly small
>>> administrative structure.
>>
>> Sounds reasonable -- but we are not at a point to make this decision.
>
> I think we are; we aren't at the point to *do* it, but knowing more or
> less where we want to go I think we can make the design decision.

For now it should be sufficient to know that these file systems will either
become users of the vnode cache or use their own vnodes permanently
attached to their objects.

>>>> Agreed. And we may need vcache_lookup to lookup but not load a
>>>> node too.
>>>
>>> When would one do that? I suppose there are probably reasons, but I
>>> can't think of any offhand. It's going to inherently be a race, also,
>>> since if it returns NULL or ENOENT or whatever it does if the vnode
>>> isn't loaded, the vnode could get loaded by the time the caller sees
>>> this. Consequently I suspect that most of the uses that someone might
>>> think they want would actually be wrong. :-/
>>
>> At least union will need something like this. The key of a union
>> node may change and there must be a way to "test" a key. The file
>> system will have to serialise to prevent races.
>
> Ew. It's been a long time since I looked at onionfs and I've blocked
> most of it out. How does this happen? It probably needs to be able to
> look up its own vnodes by the keys or identities of either the upper
> or lower vnodes under it, but I'd think it'd be better served to
> manufacture its own keys and have its own table for the layer
> translation. That or we could have it enter both and allow the vnode
> cache to point to multiple references of the same vnode, but that
> seems like a can of worms.

Currently the union cache gets looked up by (upper vnode, lower vnode),
where either but not both may be NULL. When both vnodes are != NULL it
has to look up (upper, NULL), (NULL, lower) and (upper, lower) and
possibly adjust the key.

> How does onionfs assign its inode numbers?

It takes them from the upper or lower layer.

> On the other hand, as Taylor noted, msdosfs needs to be able to change
> the key of a vnode, so maybe we should just expose such a function. It
> should be relatively straightforward as it doesn't need to do anything
> besides lock and shuffle the table.

If we had this and vcache_lookup, union could use it too. Another
approach is for union to manage its own table of
"(upper, lower) -> key" mappings and use this key for the vcache.

<snip>

Matthew Mondor

May 9, 2014, 7:01:18 PM
to tech...@netbsd.org
On Wed, 30 Apr 2014 17:15:16 +0200
"J. Hannken-Illjes" <han...@eis.cs.tu-bs.de> wrote:

> > vcache_get(mp, key, key_len, vpp) to lookup and possibly load a vnode.
> > vcache_lookup(mp, key, key_len, vpp) to lookup a vnode.
> > vcache_remove(mp, key, key_len) to remove a vnode from the cache.
> > VFS_LOAD_NODE(mp, vp, key, key_len, new_key) to initialise a vnode.
>
> Updated diff at http://www.netbsd.org/~hannken/vnode-pass6-4.diff

One small question:

Is it expected in vcache_common() for the interlock to remain held even
if returning an error?

Thanks,
--
Matt

Taylor R Campbell

May 9, 2014, 9:30:02 PM
to Matthew Mondor, tech...@netbsd.org
Date: Fri, 9 May 2014 19:01:04 -0400
From: Matthew Mondor <mm_l...@pulsar-zone.net>

Is it expected in vcache_common() for the interlock to remain held even
if returning an error?

vget unconditionally drops the interlock, so it will never remain
held, error or not.
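
I.e. the calling convention is roughly (from memory, with the flags
argument glossed over):

	mutex_enter(vp->v_interlock);
	error = vget(vp, 0);
	/* v_interlock has been released here whether vget succeeded or not. */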

Matthew Mondor

May 9, 2014, 9:42:31 PM
to tech...@netbsd.org
On Sat, 10 May 2014 01:29:47 +0000
Taylor R Campbell <campbell+net...@mumble.net> wrote:

> Is it expected in vcache_common() for the interlock to remain held even
> if returning an error?
>
> vget unconditionally drops the interlock, so it will never remain
> held, error or not.

Oh, thanks. I can now see that vget() must be called with it held, and
indeed drops it itself.
--
Matt

David Holland

May 21, 2014, 6:48:13 PM
to Taylor R Campbell, J. Hannken-Illjes, Chuck Silvers, tech-kern@netbsd.org list
On Sat, Apr 26, 2014 at 07:22:50PM +0000, David Holland wrote:
> - It seems to me that in the long run, all of this baloney should be
> hidden away inside the vfs layer; filesystems that use the vnode cache
> should only need to call vcache_get, and the only thing that should
> ever see a partially initialized vnode is VOP_LOAD and nothing outside
> the vnode cache should ever see a vnode with refcount zero. This in
> particular means that all the various forms of vget() should be hidden
> away, as should getnewvnode.

As I discovered while prodding lfs last weekend, this is too
optimistic, as there's another family of cases: allocating a new inode
and constructing a vnode for it.

It is *possible* to do this by allocating an inode number and then
using vcache_get to procure the vnode, such that when the vnode cache
calls VFS_LOADVNODE it will load a blank inode. This is what ffs does,
and it works fine there since unallocated inodes have fixed known
locations that are in a known on-disk state.

It doesn't work for lfs, and it won't work for any other filesystem
where inode locations aren't fixed, at least not without some ugly
hackery: you can update the fs state with a location for the new
inode, and write a buffer with the intended inode contents, and then
have VFS_LOADVNODE find and use this buffer, but this is inefficient
and gross. (This isn't itself adequate for lfs because lfs has some
additional self-inflicted issues.)

After sleeping on this a few times, ISTM that the best way to handle
this is as follows:

- add an additional call vcache_new to the API;
- have vcache_new take a type argument;
- have vcache_new assert that no vnode with the same key already
exists (FS-level locking should guarantee this);
- have vcache_new call a new op VFS_NEWVNODE instead of
VFS_LOADVNODE and pass the type in.

(Other things I don't like so much include passing flags or other
information through vcache_get to VFS_LOADVNODE, or giving vcache_new
a preconstructed new vnode.)
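
Concretely, I'm thinking of something like this (prototype only, to make
the proposal concrete; the exact argument list is of course up for
discussion):

int
vcache_new(struct mount *mp, const void *key, size_t key_len,
    enum vtype type, struct vnode **vpp)

where VFS_NEWVNODE would get the same arguments as VFS_LOADVNODE plus
the type.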

Thoughts? I'll try to make time to prepare a patch this weekend.


..note that LFS would still like to be able to pass info through to
VFS_LOADVNODE, because sometimes it already knows the physical disk
location of an inode when it's calling vcache_get and sometimes it
doesn't. But I think this is a different issue and should be addressed
separately.

J. Hannken-Illjes

Mar 9, 2015, 6:06:33 AM
to David Holland, tech-kern@netbsd.org list
On 22 May 2014, at 00:48, David Holland <dholla...@netbsd.org> wrote:

<snip>

> As I discovered while prodding lfs last weekend, this is too
> optimistic, as there's another family of cases: allocating a new inode
> and constructing a vnode for it.
>
> It is *possible* to do this by allocating an inode number and then
> using vcache_get to procure the vnode, such that when the vnode cache
> calls VFS_LOADVNODE it will load a blank inode. This is what ffs does,
> and it works fine there since unallocated inodes have fixed known
> locations that are in a known on-disk state.
>
> It doesn't work for lfs, and it won't work for any other filesystem
> where inode locations aren't fixed, at least not without some ugly
> hackery: you can update the fs state with a location for the new
> inode, and write a buffer with the intended inode contents, and then
> have VFS_LOADVNODE find and use this buffer, but this is inefficient
> and gross. (This isn't itself adequate for lfs because lfs has some
> additional self-inflicted issues.)
>
> After sleeping on this a few times, ISTM that the best way to handle
> this is as follows:
>
> - add an additional call vcache_new to the API;
> - have vcache_new take a type argument;
> - have vcache_new assert that no vnode with the same key already
> exists (FS-level locking should guarantee this);
> - have vcache_new call a new op VFS_NEWVNODE instead of
> VFS_LOADVNODE and pass the type in.

<snip>

Here comes the new operation "vcache_new()" to allocate and initialise
a new vnode/fsnode pair. Passing the type is not sufficient, so we
pass all the information we get from VOP_MKNOD:

int
vcache_new(struct mount *mp, struct vnode *dvp, struct vattr *vap,
kauth_cred_t cred, struct vnode **vpp)

where dvp is the (referenced) directory where we want to create the
new node, vap passes va_type, va_mode and possibly va_rdev, and cred
gives the credentials to set up uid/gid.

The node returned from vcache_new() is referenced, fully initialised
and has link count zero.
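
For illustration, a create-like operation then boils down to something
like this (hypothetical, simplified sketch; the real ffs changes are in
the second diff below):

static int
foofs_makenode(struct vnode *dvp, struct vattr *vap, kauth_cred_t cred,
    struct vnode **vpp)
{
	int error;

	/* Allocate and initialise the new vnode / fs node pair. */
	error = vcache_new(dvp->v_mount, dvp, vap, cred, vpp);
	if (error)
		return error;

	/* ... enter the name into dvp and bump the new node's link count ... */
	return 0;
}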

A diff showing the implementation is here:

http://www.netbsd.org/~hannken/vnode-pass7-1a.diff

and a diff showing its usage for ffs is here:

http://www.netbsd.org/~hannken/vnode-pass7-1b.diff

Comments or objections anyone?