I've recently been playing around with ZFS-FUSE a bit and managed to get
a machine to boot off a ZFS filesystem. The trick was, as mentioned
earlier on this mailing list, to have an initramfs with zfs-fuse on it.
I needed to patch zfs-fuse so that it mounted filesystems with the dev
and suid options (as opposed to the FUSE default of nodev,nosuid) -
otherwise device nodes and SUID binaries didn't work. (I also patched
zfs-fuse to daemonise itself when it started; later, I'd like to make
this controllable on the command line. I'd also like to make it log to
syslog rather than stderr, since at the moment stderr is redirected to
/dev/null when running as a daemon.)
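
To illustrate the option change described above, here is a minimal sketch in Python of how explicit dev/suid options could replace the FUSE defaults in a mount option string. It is not the actual zfs-fuse patch (zfs-fuse is written in C); the helper name `merge_mount_options` and the `OPPOSITES` table are hypothetical and exist only for this example.

```python
# Hypothetical helper for illustration only; this is not zfs-fuse code.
OPPOSITES = {"dev": "nodev", "nodev": "dev", "suid": "nosuid", "nosuid": "suid"}


def merge_mount_options(defaults, overrides):
    """Return a comma-separated option string in which every override
    replaces its duplicate or opposite among the defaults."""
    result = [opt for opt in defaults.split(",") if opt]
    for opt in overrides.split(","):
        if not opt:
            continue
        opposite = OPPOSITES.get(opt)
        # Drop any earlier copy of this option and its opposite, then append it.
        result = [o for o in result if o != opt and o != opposite]
        result.append(opt)
    return ",".join(result)


# FUSE defaults from the message above, overridden for a root filesystem.
print(merge_mount_options("nodev,nosuid", "dev,suid"))
```

Options that have no opposite in the table (for example `allow_other`) pass through unchanged.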
I put the /etc/zfs directory with the zfs socket on a tmpfs, and used
mount --move to make it available to the system once it had booted.
Unfortunately, I couldn't figure out how to also put any _other_
filesystem on the machine on ZFS. To start with, zfs-fuse needs the
fusermount binary in its path to mount filesystems, and all files on the
initramfs are deleted by run-init when the main system boots [1]. Even
ignoring this snag (which could be fixed by any of several evil hacks),
when zfs-fuse tries to mount new filesystems it'd mount them in the
initramfs's directory hierarchy, and so they wouldn't be available to
the real system. It seems that the 'real' root is not accessible from
within the initramfs after run-init, either.
Next, I tried to get zfs-fuse to chroot itself into /root, once it had
been mounted. This "worked for me" when running inside a chroot on the
real system, but the machine deadlocked when I tried this on an
initramfs. I'm not quite sure why, or how to debug this.
There are a few other gotchas to a ZFS root that I noticed, too:
- when init kills all processes during shutdown, this includes
  zfs-fuse. After killing zfs-fuse, the machine locks up since nothing
  can access / any more. Oops :-) I suppose I could get around this
  by patching my init scripts and/or killall5.
- apt-get assumes that it can do shared, writable mmap()s, which
  aren't possible on FUSE filesystems. There is a patch in the Debian
  BTS to fix this - http://bugs.debian.org/314334
- dbus crashes on startup (haven't tried debugging this one yet)
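
The mmap() limitation in the list above can be checked directly. The following is a minimal sketch, assuming a Unix system with Python 3, that probes whether files in a given directory can be mapped shared and writable. The function name `supports_shared_writable_mmap` is made up for this example, and the probe is not how apt-get itself detects the problem.

```python
import mmap
import os
import tempfile


def supports_shared_writable_mmap(directory):
    """Return True if a file created in `directory` can be mapped with
    MAP_SHARED and PROT_WRITE and written through the mapping."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # A mapping needs a non-empty file, so grow it to one page first.
        os.ftruncate(fd, mmap.PAGESIZE)
        try:
            mapping = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED,
                                prot=mmap.PROT_READ | mmap.PROT_WRITE)
        except OSError:
            return False  # the filesystem refused the shared writable mapping
        try:
            mapping[:5] = b"probe"
            mapping.flush()
        except OSError:
            return False
        finally:
            mapping.close()
        return True
    finally:
        os.close(fd)
        os.unlink(path)


# Probe a scratch directory; point this at a directory on the
# filesystem of interest instead.
probe_dir = tempfile.mkdtemp()
result = supports_shared_writable_mmap(probe_dir)
print(result)
```

The probe file is removed again whether or not the mapping succeeds.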
Cameron
[1] The boot process for recent versions of Debian/Ubuntu works
something like this: GRUB loads the kernel and initramfs image; the
kernel ungzips and un-cpio's the initramfs it's been given into a ramfs
filesystem mounted as /, and runs /init on it. /init is responsible for
loading drivers (nowadays by starting udev), setting up software RAID or
LVM where applicable, and mounting the root filesystem as /root. Virtual
filesystems like /dev, /sys and /proc are moved into /root. Then /init
runs run-init from the klibc package, which deletes all the files on the
initramfs (to save memory), chdir()s into /root and does a
mount --move /root /. Then it runs /sbin/init from the real root
filesystem and the machine boots normally.
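
The file-deletion step in [1] is what makes fusermount disappear. As a rough sketch of the idea (not the klibc run-init source, which is C), the Python below removes everything under a directory while staying on that directory's own device, so anything mounted from another filesystem is left alone. The names `nuke` and `nuke_contents` are chosen for this example, and the scratch tree stands in for an initramfs.

```python
import os
import stat
import tempfile


def nuke(path, root_dev):
    """Remove `path` recursively, but never cross onto another device."""
    st = os.lstat(path)
    if st.st_dev != root_dev:
        return  # a different filesystem is mounted here; leave it alone
    if stat.S_ISDIR(st.st_mode):
        for name in os.listdir(path):
            nuke(os.path.join(path, name), root_dev)
        try:
            os.rmdir(path)
        except OSError:
            pass  # not empty, e.g. it still contains a mount point
    else:
        os.unlink(path)  # regular files, symlinks, device nodes


def nuke_contents(root):
    """Delete everything inside `root` that lives on the same device as `root`."""
    root_dev = os.lstat(root).st_dev
    for name in os.listdir(root):
        nuke(os.path.join(root, name), root_dev)


# Build a tiny fake initramfs in a scratch directory and clear it.
scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "bin"))
open(os.path.join(scratch, "bin", "fusermount"), "w").close()
open(os.path.join(scratch, "init"), "w").close()
os.symlink("bin", os.path.join(scratch, "sbin"))
nuke_contents(scratch)
print(os.listdir(scratch))
```

After this step a zfs-fuse process started from the initramfs keeps running, but any helper binary it expects to find on its path, such as fusermount, is gone.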
Cameron Patrick wrote:
> (I also patched
> zfs-fuse to daemonise itself when it started; later, I'd like to make
> this controllable on the command line. I'd also like to make it log to
> syslog rather than stderr, since at the moment stderr is redirected to
> /dev/null when running as a daemon.)
This is the next thing on my to-do list (see the STATUS file) after
fixing the remaining zfs send/recv issues.
> Next, I tried to get zfs-fuse to chroot itself into /root, once it had
> been mounted. This "worked for me" when running inside a chroot on the
> real system, but the machine deadlocked when I tried this on an
> initramfs. I'm not quite sure why, or how to debug this.
Sorry, I don't know how to debug that either; I'm not very familiar with
the kernel boot mechanism.
I only figured it was possible since I've seen a LiveCD with HTTP-FUSE
as a root filesystem.
Anyway, good work! I envy your ZFS root Linux box :)
As a side note, it's interesting to know that Linux is the first
operating system that can boot from RAID-1+0, RAID-Z or RAID-Z2 ZFS
pools (Solaris can only boot from single-disk or RAID-1 pools) ;)
> Cameron Patrick wrote:
> > (I also patched
> > zfs-fuse to daemonise itself when it started; later, I'd like to make
> > this controllable on the command line. I'd also like to make it log to
> > syslog rather than stderr, since at the moment stderr is redirected to
> > /dev/null when running as a daemon.)
>
> This is the next thing on my to-do list (see the STATUS file) after
> fixing the remaining zfs send/recv issues.
Okay. I've put an `hg bundle` of the patches I've made so far up at
http://largestprime.net/cameron/zfs/patches_20070407.bundle
I'm not sure if that's the ideal way of sharing changes with hg, but it
didn't look particularly simple to set up my own network-accessible
repository.
I also started to have a look at getting mount options to work, although
with no real progress so far.
> Anyway, good work! I envy your ZFS root Linux box :)
>
> As a side note, it's interesting to know that Linux is the first
> operating system that can boot from RAID-1+0, RAID-Z or RAID-Z2 ZFS
> pools (Solaris can only boot from single-disk or RAID-1 pools)
Hehe, awesome - didn't realise that Solaris's ZFS-root support was
limited like that. In my case I was booting from ZFS-on-LVM-on-RAID5.
Interestingly, ZFS+LVM+RAID5 is a lot faster than ZFS+RAID-Z. According
to bonnie, I see 125 MB/s reads on ext3+RAID5, 65 MB/s on ZFS+RAID5
(using Linux's software RAID) and 20 MB/s on ZFS+RAID-Z (using the same
raw drives). Writes are also proportionally slower. The real
performance hit with ZFS-FUSE was random access to lots of small
files. The bonnie++ results showed something like 75 random seeks per
second for ZFS vs 470 for ext3, but it was also subjectively very sluggish when
dealing with hierarchies of small files (like kernel source trees or
Maildir mail spools) - sometimes the processes would seem to "freeze"
for a few seconds, presumably because requests were being queued up
somewhere and processed very slowly.
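
To put the figures above in proportion, the short Python sketch below expresses each one as a percentage of the ext3 baseline. The numbers are the ones quoted in this message; the dictionary keys and the helper `percent_of` are just labels for this example.

```python
# Sequential read throughput in MB/s, as quoted in the message above.
READ_MBPS = {"ext3+RAID5": 125.0, "ZFS+RAID5": 65.0, "ZFS+RAID-Z": 20.0}
# Random seek figures from the bonnie++ run, as quoted above.
RANDOM_SEEKS = {"ext3": 470.0, "ZFS": 75.0}


def percent_of(value, baseline):
    """Express `value` as a percentage of `baseline`."""
    return 100.0 * value / baseline


baseline = READ_MBPS["ext3+RAID5"]
for name, mbps in READ_MBPS.items():
    print(f"{name}: {percent_of(mbps, baseline):.0f}% of ext3+RAID5 reads")
print(f"ZFS random seeks: {percent_of(RANDOM_SEEKS['ZFS'], RANDOM_SEEKS['ext3']):.0f}% of ext3")
```

By these figures ZFS on RAID5 reads at roughly half the ext3 rate, while both RAID-Z reads and ZFS random seeks come out at roughly a sixth of the ext3 numbers.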
I notice that the multi-threading approach you're using is different
from the standard FUSE one. Are there likely to be any performance
considerations in that?
Cameron
Thanks, that's probably (for both of us) the most convenient way to
share patches.
I have pulled your changes into the main repositories and updated the
docs. Also note that your upstream merge was a little borked due to a
symlink change I made (Mercurial doesn't support symlinks yet), but
I've fixed that.
> I notice that the multi-threading approach you're using is different
> from the standard FUSE one. Are there likely to be any performance
> considerations in that?
I don't know if the performance problems are due to the way I have
implemented the multithreaded FUSE event handler since I haven't done
any serious performance profiling yet.
Unfortunately I can't just use the standard FUSE multithreaded event
loop since it doesn't handle multiple channels.
> I've recently been playing around with ZFS-FUSE a bit and managed to get
> a machine to boot off a ZFS filesystem.
When you say "boot off a ZFS filesystem," I presume that means that
/boot/ was not on ZFS, but the root filesystem / was? I just want to
make sure I really understand this. IIUC, the only way to have /boot on
ZFS will be to use Sun's patched GRUB2.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com