Sent from my iPhone
On 5 June 2010, at 00:44, Brian Behlendorf <brianbeh...@gmail.com>
wrote:
Actually, in the long term I would love to support both a native in-
kernel posix layer and a fuse based posix layer. The way the code is
structured you actually build the same ZFS code once in the kernel as
a set of modules and a second time as a set of shared libraries. The
in-kernel version is used by Lustre, the ZVOL, and will eventually be
used by the native posix layer. Currently the shared libraries are
only used by ztest for regression testing. However, they could be
used to form the basis of a fuse based implementation. The major
missing bit is the glue to tie it to fuse which you guys have already
shown can be done. IMHO the real advantage of this would be a shared
About time to chime in.
Now I have hardly any kernel experience (just forget about that), but it
is pretty obvious that interfacing with fuse is much easier than
programming a ZPL from scratch.
My syllogism would be:
* if you want to be in-kernel,
* and you wish to share a codebase with the fuse implementation,
* then you should obviously look at the fuse kernel module.
Only in that way can you hope to reuse significant bits of zfs-fuse. In
a thought experiment you can easily show that it should be a well-defined
job to:
(a) fork the fuse module
(b) adapt the interface so you can have the same functional blocks BUT
don't need to cross over to user space at the fuse layer
(c) patch up where a switch to user space is still desired
So the real /job/ is in (c) only. I have a suspicion that it would be
feasible to have a 1:1 port into kernel space pretty soonish (unless there
are technical reasons why, e.g., the libzfs socket could not continue to
be a socket interface).
I have another suspicion that the important bits that now rely on being
in userspace are relatively few and probably not needed anyway
once you are in kernel (e.g. the mounting operations).
It can certainly be bundled with Linux distributions in much the same
way as other kernel-tainting modules. Ubuntu already bundles in a ton
of binary-only drivers (Broadcom WiFi/ NVidia/ fglrx) and this would
be no different. What makes you think it can't be included?
It's different because it's almost the complete opposite :P. This can
only be legally distributed in source form, to be compiled by the user
(who can't then distribute the resulting binary of course).
I believe what you're saying though is that it has the same legal
status as the shim layer between the binary blob and the kernel - it
just happens that rather than being a thin wrapper that has to be
built by the end user, it's the whole thing. I don't know if
module-assistant (or other distributions' equivalents) could be used
as-is, but I'd expect so. The real question is whether you can
persuade distributions that it is worth going to that effort. They
understandably don't like having to jump through hoops to work around
legal problems unless there is tremendous demand - probably no point
in even trying until the ZPL is usable.
Are you sure about this? BSD et al all distribute this as binary -
what evidence do you have that the CDDL denies this distribution? I
thought it was just a standard "incompatible with GPL" problem.
I don't mean that binary distribution is prohibited in general, just
in this case. I'll clarify: to generate the binary module you need to
link against both ZFS (CDDL) and Linux (GPL), hence the binary has no
terms under which it can be legally distributed. That's what the
incompatibility is: in source form it is possible to honour the terms
of both licenses - trivially, because there is no combined work yet so
the kernel's license doesn't apply. As soon as the module is compiled
and linked with both the GPL and CDDL code there are no longer any
terms under which the resultant binary can be distributed.
The userland tools would be redistributable as they aren't considered
derivative works of the kernel.
Right - of course this was (and still is) the big problem when
OpenSolaris/ZFS was first open-sourced. I have massive respect for
what the FSF/GNU project has achieved but I'm starting to like the
idea of BSD-style licensing more and more!
But if on-site compilation before install is required, that would be
just OK. AFAIK that's how the nvidia module is distributed under Ubuntu:
using dkms, which compiles the module after the user installs it.
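For the curious, the dkms side of that would only be a few lines. This is a rough sketch with made-up names, version, and paths - not the actual zfs packaging:

```shell
PACKAGE_NAME="zfs"
PACKAGE_VERSION="0.4.9"
BUILT_MODULE_NAME[0]="zfs"
DEST_MODULE_LOCATION[0]="/kernel/fs/zfs/"
MAKE[0]="make KERNELDIR=/lib/modules/${kernelver}/build"
CLEAN="make clean"
AUTOINSTALL="yes"
```

With a dkms.conf like this in the source tree, `dkms install` rebuilds the module on the user's machine for each installed kernel, which is exactly the "distribute source, compile locally" arrangement being discussed.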
1. will not scale (you can only have 1 snapshot, or you need to duplicate
allocation space for each snapshot)
2. will not scale (write performance degrades linearly with each snapshot)
3. must have a snapshot volume of matching size (at least as much as the
used blocks in the origin), see next
4. no rollback/restore mechanism (if you accidentally think you're smart
and rsync the snapshot back to the original, you will _by definition_
run out of free blocks on the snapshot; this corrupts your snapshot
(unrecoverably, not even extending it helps) and your origin will be
halfway through an rsync. Don't ask _who_ learned this the hard way).
5. in terms of disk access, it is hard to tune the disk layout so
that writes to origin+snapshot go to different spindles. In terms of write
performance, you can view an lvm snapshot as a limping mirror/funny
mirror: it needs to write to both volumes. If you let lvm do its default
block allocation, chances are that seek times go through the roof.
The only boon I remember is that lvcreate -s knows how to xfs_freeze
and xfs_unfreeze, which is really a gimmick, but still very nice. I use
that for some of my older cloud backups (where I don't want to use ZFS,
for reasons of variety).
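Points 2 and 4 can be demonstrated with a toy copy-on-write model in Python. Everything here is invented for illustration (it is not how LVM is implemented, just the accounting):

```python
# Toy model of an LVM-style copy-on-write snapshot. Names, sizes and the
# invalidation rule are made up; only the bookkeeping matches the argument.

class Origin:
    def __init__(self, nblocks):
        self.data = [0] * nblocks
        self.snapshots = []

    def snapshot(self, capacity):
        snap = {"saved": {}, "capacity": capacity, "valid": True}
        self.snapshots.append(snap)
        return snap

    def write(self, block, value):
        # Copy-before-write: every live snapshot that has not yet saved
        # this block must store the old contents first (point 2: one
        # extra copy per snapshot, per first-touch write).
        for snap in self.snapshots:
            if snap["valid"] and block not in snap["saved"]:
                if len(snap["saved"]) >= snap["capacity"]:
                    snap["valid"] = False   # snapshot overflows -> dead
                else:
                    snap["saved"][block] = self.data[block]
        self.data[block] = value

origin = Origin(nblocks=100)
snap = origin.snapshot(capacity=10)

# Point 4: "restoring" by writing every block back through the origin
# (rsync-style) generates 100 copy-before-writes into a 10-block
# snapshot, so the snapshot fills up and is invalidated mid-restore.
for blk in range(100):
    origin.write(blk, 1)

print(snap["valid"])   # False: the snapshot died before the restore finished
```

The restore-through-the-origin path is doomed by construction: every block it touches consumes snapshot space, so a full restore needs the snapshot to hold at least as many blocks as the origin - which is exactly point 3.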
> maybe possible?
IIRC you can't make a snapshot of a snapshot, but the sad thing is that
the underlying device mapper layer is able to do it.
> 2. will not scale (write performance degrades linearly with each snapshot)
That depends on which device you write to. If you write to the original
device, then yes, performance degrades, as writes there are going to be
translated to a read of the same spot first, then a write into the
snapshot. If you want good write performance, never write to the
original device. Sadly, LVM doesn't do that: when you create a
snapshot, the device in use stays the original one. Basically, LVM is
really bad at showing how powerful the device mapper is.
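The asymmetry described above can be sketched as a toy I/O count (an invented model, not the actual dm-snapshot code):

```python
# Invented I/O accounting for the two write paths: a write to the ORIGIN
# must first read the old block and copy it into every snapshot that
# lacks it; a write to the SNAPSHOT device just records a new exception
# in its copy-on-write store, with no read of the origin at all.

def write_origin(snapshots, block):
    ios = 1                      # the write itself
    for snap in snapshots:
        if block not in snap:
            ios += 2             # read old data + write the copy
            snap.add(block)
    return ios

def write_snapshot(snap, block):
    snap.add(block)
    return 1                     # just the write into the CoW store

snaps = [set(), set(), set()]
print(write_origin(snaps, 7))      # 7: 1 write + 3 snapshots * (read+copy)
print(write_snapshot(snaps[0], 8)) # 1
```

Under this model the origin path costs 1 + 2n I/Os for n snapshots that haven't seen the block yet, while the snapshot path is a flat 1 - which is why "never write to the original device" is the rule for good write performance.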
> 3. must have a snapshot volume of matching size (at least as much as
> used blocks in origin), see next
It's not a must: the snapshot volume only needs enough room for the
blocks that change while the snapshot exists.
> 4. no rollback/restore mechanism (if you accidentally think you're smart
> and rsync the snapshot back to the original, you will _by definition_
> run out of free blocks on the snapshot; this corrupts your snapshot
> (unrecoverable, not even by extending it) and your origin will be
> half-way an rsync. Don't ask _who_ learned this the hard way).
Actually, there is one now, at the device mapper level, though I don't
know if LVM uses it; google for snapshot-merge. The main difference from
zfs snapshots is that there is a merge phase with a lot of I/O here.
Quite like what happens with VMware's VMDKs when consolidating snapshots.
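To make that merge cost concrete, here is a toy sketch of the idea (again an invented model, not the dm merge target itself): rolling back means copying every saved block from the snapshot's CoW store back onto the origin, so the merge phase costs I/O proportional to the number of changed blocks, whereas a zfs rollback essentially just moves a pointer.

```python
# Toy snapshot-merge: restore the origin from the snapshot's saved
# (copy-on-write) blocks. Every changed block costs a read + a write.

def merge(origin, snap_saved):
    ios = 0
    for block, old_value in snap_saved.items():
        origin[block] = old_value   # one read from the CoW store + one write
        ios += 2
    snap_saved.clear()
    return ios

origin = {0: "new", 1: "new", 2: "unchanged"}
saved = {0: "old", 1: "old"}        # blocks changed since the snapshot

print(merge(origin, saved))  # 4 I/Os for 2 changed blocks
print(origin[0])             # "old" -- rolled back
```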
Thanks for another useful tip, I just _might_ use that in the future.
As part of a joint effort with Sun/Oracle to augment the Lustre file
system with ZFS support, we've been engaged in porting ZFS natively to
the Linux kernel. So far we have pretty much everything working
except the ZPL - this is because Lustre interfaces directly with the
DMU and the ZPL was not a priority for us. However, we connected with
folks at KQ Infotech who are also interested in a Linux kernel port
and they are working on the ZPL so it is on the way.
Anyway, the fruits of our labor are available here http://github.com/behlendorf/zfs/.
I don't know to what extent it's practical for the zfs-fuse community
and our project to collaborate. But since there is ZFS expertise on
both sides and a lot of common code, I wanted to propose that we at
least consider how we might help each other out.
Brian Behlendorf (no not the apache one, the other one)