some [big] changes to ZPL (ZFS <-> VFS)


Andriy Gapon

Aug 3, 2016, 10:26:50 AM

I would like to get more comments about the following review request:
Both about the code changes and about the general direction of the changes.

I've been asked to try to get this change into 11.0, so there is some
urgency to my request.
To be honest I've never wanted to rush in this change, but I probably
have been sitting on it for too long.

The ZFS POSIX Layer was originally written for the Solaris VFS, which is
very different from the FreeBSD VFS. Many things that the FreeBSD VFS
manages on behalf of all filesystems are also (re-)implemented in the
ZPL in a different way. As a result, the ZPL contains code that is
redundant on FreeBSD, duplicates VFS functionality or, in the worst
cases, badly interacts or interferes with the VFS.

I am talking mostly about the operations that work on a filesystem
namespace rather than on contents and properties of files.
The former operations typically work on 2 or more vnodes. The vnodes
are usually locked in a particular way or are expected to be locked in a
particular way by the filesystem.

ZPL on the other hand has quite an elaborate locking system of its own.
It includes a lock that protects a znode, a lock that protects the
znode's notion of its parent, locks that protect a name to znode
relationship for the znode's children.

So, we have two competing locking systems for the filesystem nodes and
they are not aware of each other.

The most prominent problem is a deadlock caused by the lock order
reversal of vnode locks that may happen with concurrent zfs_rename() and
lookup(). The deadlock is a result of zfs_rename() not observing the
vnode locking contract expected by the VFS. I should mention here that
VOP_RENAME is the weirdest method with respect to locking requirements.
Every filesystem implementation is expected to perform quite an
elaborate locking "dance" in VOP_RENAME. This is in contrast to all
other operations where all necessary locking is done by the VFS before
calling into the filesystem code.

So, I started by fixing zfs_rename(), but that required making some
changes to the "bottom half" of the ZPL (the "upper half" being the code
that implements the vnode and VFS operations). And that required making
changes in other VOPs anyway. And so the next step was to remove
redundant re-lookup of child nodes in methods like zfs_rmdir and
zfs_remove. Before I could stop I removed all ZPL internal locking that
was supposed to protect parent-child relationships of filesystem nodes.
Those relationships are now protected exclusively by the vnode locks and
the code is changed to take advantage of that fact and to properly
interact with VFS.

As a minor detail, the removal of the internal locking allowed all ZPL
dmu_tx_assign calls to use TXG_WAIT mode.

Another change that was not strictly required and which is probably too
intrusive is killing the support for case insensitive operations. My
thinking was that FreeBSD VFS does not provide support for those anyway.
But I'll probably restore the code, at least in the bottom half of the
ZPL, before committing the change.

On a more general level, I decided that the upper half of ZPL would
necessarily be quite different between different VFS models and so it
does not make much sense to keep huge chunks of ifdef-ed out code (or
compiled in, but never reached code) that's not useful in FreeBSD at
all. I think that keeping that code makes it harder to maintain the
FreeBSD code and to _correctly_ merge upstream changes. E.g., if a
change can be merged without conflicts into an ifdef-ed out block, that
does not mean that the merge is correct.

Last, but not least, the work was sponsored by HybridCluster / ClusterHQ
Inc back in 2013.

Andriy Gapon

Andriy Gapon

Aug 5, 2016, 2:38:01 AM
On 03/08/2016 17:25, Andriy Gapon wrote:
> Another change that was not strictly required and which is probably too
> intrusive is killing the support for case insensitive operations. My
> thinking was that FreeBSD VFS does not provide support for those anyway.
> But I'll probably restore the code, at least in the bottom half of the
> ZPL, before committing the change.

It turned out that most of the removed code was dead anyway and it took
just a few lines of code to restore support for case-insensitive
filesystems. Filesystems with mixed case sensitivity behave exactly the
same as case-sensitive filesystems, as has always been the case on FreeBSD.

Anyway, the big change has just been committed:
Please test away.

Another note: the filesystem name cache is now disabled for
case-insensitive filesystems and for filesystems with normalization
other than none. That may hurt lookup performance, but it should ensure
correctness of operations.

Alan Somers

Aug 5, 2016, 4:31:51 PM
I'm not certain it's related, but on a head build at r303767 I see a
LOR and a reproducible panic that involve the snapdir code.

First, the LOR:
$ zpool destroy foo

lock order reversal:
1st 0xfffff800404c8b78 zfs (zfs) @
2nd 0xfffff800404c85f0 zfs_gfs (zfs_gfs) @
stack backtrace:
#0 0xffffffff80aa90b0 at witness_debugger+0x70
#1 0xffffffff80aa8fa4 at witness_checkorder+0xe54
#2 0xffffffff80a22072 at __lockmgr_args+0x4c2
#3 0xffffffff80af8e7c at vop_stdlock+0x3c
#4 0xffffffff81018880 at VOP_LOCK1_APV+0xe0
#5 0xffffffff80b19f2a at _vn_lock+0x9a
#6 0xffffffff821b9c53 at gfs_file_create+0x73
#7 0xffffffff821b9cfd at gfs_dir_create+0x1d
#8 0xffffffff8228aa07 at zfsctl_mknode_snapdir+0x47
#9 0xffffffff821ba1a5 at gfs_dir_lookup+0x185
#10 0xffffffff821ba68d at gfs_vop_lookup+0x1d
#11 0xffffffff82289a42 at zfsctl_root_lookup+0xf2
#12 0xffffffff8228a8c3 at zfsctl_umount_snapshots+0x83
#13 0xffffffff822a1d2b at zfs_umount+0x7b
#14 0xffffffff80b02a14 at dounmount+0x6f4
#15 0xffffffff80b0228d at sys_unmount+0x35d
#16 0xffffffff80ebbb7b at amd64_syscall+0x2db
#17 0xffffffff80e9b72b at Xfast_syscall+0xfb

Here's the panic:
$ zpool create testpool da0
$ touch /testpool/testfile
$ zfs snapshot testpool@testsnap
$ cd /testpool/.zfs/snapshot

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80b19f1c
stack pointer = 0x28:0xfffffe0b54bf7430
frame pointer = 0x28:0xfffffe0b54bf74a0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 966 (bash)
trap number = 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0b54bf6fc0
vpanic() at vpanic+0x182/frame 0xfffffe0b54bf7040
panic() at panic+0x43/frame 0xfffffe0b54bf70a0
trap_fatal() at trap_fatal+0x351/frame 0xfffffe0b54bf7100
trap_pfault() at trap_pfault+0x1fd/frame 0xfffffe0b54bf7160
trap() at trap+0x284/frame 0xfffffe0b54bf7370
calltrap() at calltrap+0x8/frame 0xfffffe0b54bf7370
--- trap 0xc, rip = 0xffffffff80b19f1c, rsp = 0xfffffe0b54bf7440, rbp
= 0xfffffe0b54bf74a0 ---
_vn_lock() at _vn_lock+0x8c/frame 0xfffffe0b54bf74a0
zfs_lookup() at zfs_lookup+0x50d/frame 0xfffffe0b54bf7540
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x91/frame 0xfffffe0b54bf7680
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0xda/frame 0xfffffe0b54bf76b0
vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfffffe0b54bf7710
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0xda/frame 0xfffffe0b54bf7740
lookup() at lookup+0x5a2/frame 0xfffffe0b54bf77d0
namei() at namei+0x5b2/frame 0xfffffe0b54bf7890
kern_statat() at kern_statat+0xa8/frame 0xfffffe0b54bf7a40
sys_stat() at sys_stat+0x2d/frame 0xfffffe0b54bf7ae0
amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe0b54bf7bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0b54bf7bf0

I can provide core files, test scripts, whatever you need. Thanks for
tackling this difficult problem.
