Creating ZFS datasets per VM (mini-howto)


candlerb

Apr 29, 2018, 12:42:09 PM
to ganeti
I thought I'd have a go at using image files on local ZFS with ganeti, à la sanoid.  Specifically I wanted ganeti to create a new zfs dataset for each VM, store the disks as image files within it, and remove the dataset when the VM is removed.

TL;DR: I got it to work using ganeti instance hooks, although I needed a small patch to ganeti to make removal work properly.  I'm documenting what I did here in case it's useful to anyone else.

1. CREATING THE ZFS POOL

I decided to make a ZFS pool on top of LVM, in order to use the LVM free space, and so that I could grow the ZFS area incrementally as I migrated instances from LVM to ZFS.  Boringly I called the zpool "zfs", and made a dataset "zfs/vm" to hold all the VM images on this node.

apt-get install zfsutils-linux
lvcreate -l 100%FREE --name zpool ganeti
zpool create -oashift=12 zfs /dev/ganeti/zpool
zfs set compression=lz4 zfs
zfs create zfs/vm

2. ENABLING LOCAL FILE STORAGE ON GANETI

This was very frustrating.  You have to do *three* separate cluster config changes:

gnt-cluster modify --enabled-disk-templates plain,drbd,file
gnt-cluster modify --ipolicy-disk-templates plain,drbd,file
gnt-cluster modify --file-storage-dir /zfs/vm     # was previously /srv/ganeti/file-storage

But that's not sufficient.

It turns out that you *also* have to edit /etc/ganeti/file-storage-paths and add `/zfs/vm` there too, and distribute to all nodes.  (I needed to add debug statements into the python code to find out why it wasn't working!)
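In case it helps, both steps can be scripted (assuming the stock file location; `gnt-cluster copyfile` distributes any file verbatim to every node):

```shell
# Allow /zfs/vm as a file storage path, then push the file to all nodes
grep -qx '/zfs/vm' /etc/ganeti/file-storage-paths || \
  echo '/zfs/vm' >> /etc/ganeti/file-storage-paths
gnt-cluster copyfile /etc/ganeti/file-storage-paths
```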

3. HOOKS

I wrote a small hook using 'printenv' to find out exactly what environment variables ganeti passes to instance add and remove hooks, then used that information to make the scripts:
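For reference, such a debug hook can be as simple as this sketch (the log path is arbitrary):

```shell
#!/bin/sh
# Debug hook: dump every GANETI_* variable the hook receives, so we can
# see what information is available in each phase (pre/post, add/remove)
printenv | grep '^GANETI_' | sort >> /tmp/ganeti-hook-env.log
```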

===> /etc/ganeti/hooks/instance-add-pre.d/zfs-create <===
#!/bin/sh -e
if [ "$GANETI_INSTANCE_DISK_TEMPLATE" = "file" -a "$GANETI_INSTANCE_PRIMARY" = "$(hostname)" ]; then
  zfs create "zfs/vm/$GANETI_INSTANCE_NAME"
fi

===> /etc/ganeti/hooks/instance-remove-post.d/zfs-destroy <===
#!/bin/sh -e
if [ "$GANETI_INSTANCE_DISK_TEMPLATE" = "file" -a "$GANETI_INSTANCE_PRIMARY" = "$(hostname)" ]; then
  zfs destroy "zfs/vm/$GANETI_INSTANCE_NAME"
fi

(Remember to chmod +x and to distribute to all nodes)

The remove-post script receives GANETI_INSTANCE_DISK0_ID with a full file path which could have been used for removing the dataset, but the add-pre script does not; my only option in the add script was to use the instance name.  Therefore for symmetry and simplicity I used the same logic in the removal script too.

Note: if you specify --file-storage-dir when adding an instance, sadly this does *not* appear in the environment variables for the instance add-pre hook, so I couldn't make that option work at all.

This nearly worked, except for one problem: on gnt-instance remove, ganeti tries to remove the enclosing directory /zfs/vm/<instance-name>, but that fails with EBUSY because it's a mounted dataset, and then doesn't run the post script.  I had to make a small patch to ganeti itself to allow this:

--- /usr/share/ganeti/2.15/ganeti/backend.py.orig 2016-03-26 07:28:51.000000000 +0000
+++ /usr/share/ganeti/2.15/ganeti/backend.py 2018-04-29 16:40:48.000000000 +0100
@@ -4551,7 +4551,10 @@
     try:
       os.rmdir(file_storage_dir)
     except OSError, err:
-      _Fail("Cannot remove file storage directory '%s': %s",
+      if isinstance(err, OSError) and err.errno == os.errno.EBUSY:
+        pass
+      else:
+        _Fail("Cannot remove file storage directory '%s': %s",
             file_storage_dir, err)


Now you can do the following:

gnt-instance add -n <nodename> -t file -o noop --no-start --no-install --no-name-check --no-ip-check --disk 0:size=1G --disk 1:size=2G foobar
...
gnt-instance remove foobar

This creates the zfs/vm/foobar dataset with the sparse image files inside it, and similarly removes the dataset when the image is destroyed.

4. MIGRATING INSTANCES FROM LVM TO ZFS

This is made awkward because:

a. ganeti does not yet support instance conversion from -t plain to -t file
b. ganeti -t file does not support adoption of existing files

So what I did was:

- stop the instance
- if drbd, change to plain (gnt-instance modify -t plain xxx)
- rename it from xxx to xxx-old
- create a new instance xxx, copying relevant backend/disk/NIC settings from existing instance, e.g.

# copy minmem/maxmem/vcpus from the existing instance
gnt-instance add -o <ostype> --no-install --no-start \
  -B minmem=NNN,maxmem=NNN,vcpus=NNN \
  -t file -s <disk size>G \
  -n <node> xxx

- copy the instance image disk(s) from LVM to overwrite the sparse file(s) (qemu-img is useful for this); get the old LVM path with UUID from gnt-instance info.

qemu-img convert -f raw -O raw /dev/ganeti/XXXX.disk0_data /zfs/vm/xxx/YYYY.file.disk0
# repeat if image has more than one disk

- start the new instance, check it's all working fine
- now you can release the old storage:

gnt-instance remove xxx-old

and on the node where the storage was used, you can hand it over to ZFS:

lvextend -l +100%FREE /dev/ganeti/zpool
zpool online -e zfs /dev/ganeti/zpool

Tedious, but doable.

5. SNAPSHOTTING AND REPLICATING ZFS DATASETS

This is left as an exercise for the reader.  My approach is to have the VM hosts with SSD, but they periodically replicate to a backup host with HDDs.

I tried the following tools which all do the job:

* sanoid and syncoid: I think this is the best option. sanoid makes periodic snapshots from a configured policy, and syncoid does the replication; it can replicate individual listed datasets or recursively under a parent dataset.  syncoid supports both "push" and "pull" replication, but needs to be able to ssh to the remote host as root.  It also supports nagios health-check flags (`--monitor-health`, `--monitor-snapshots`).

Limitation: it can't take snapshots more often than hourly.
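For illustration only (dataset name and retention counts are made up, not recommendations), a minimal sanoid policy might look like:

```ini
[zfs/vm]
  use_template = production
  recursive = yes

[template_production]
  hourly = 36
  daily = 14
  monthly = 3
  autosnap = yes
  autoprune = yes
```

Replication to the backup host is then a separate cron job, e.g. `syncoid -r root@vmhost:zfs/vm backuppool/vmhost` for "pull" mode.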

* zfsnap + simplesnap are in the default Ubuntu/Debian repos.  zfsnap creates the rotating snapshots, and simplesnap is the replicator.  simplesnap supports only "pull" replication, i.e. the target backuphost has to make an outbound ssh connection to the source host; but on the source host it only needs to be able to run a safe wrapper script.

Simplesnap does not copy any existing snapshots as part of its initial replication.  However, any snapshots created subsequently *are* copied - and if they are removed, they are not deleted from the receiving side, so this may leave manual cleanup to be done.

As the manpage says:

> simplesnap will transfer snapshots made by other tools, but will not
> destroy them on either end.
> ...
> Since simplesnap sends all snapshots, it is possible that locally-created
> snapshots made outside of your rotation scheme will also be sent to your
> backuphost

A single property `org.complete.simplesnap:exclude` can be used to exclude certain datasets from replication; but otherwise all datasets are replicated under `destination/hostname/...`
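i.e. something like this (dataset name hypothetical; note that zfs user properties are inherited by child datasets):

```shell
# Exclude a dataset from simplesnap replication
zfs set org.complete.simplesnap:exclude=on zfs/vm/scratch
```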

* zfs-auto-snapshot and znapzend.  I hated znapzend; it has bits in C and Perl, and all its configuration is stored in zfs properties, with a horribly complex tool to configure it.

The benefit of this one is that it supports pre- and post-snapshot hooks, so for example you can quiesce a VM before snapshotting it, and let it continue afterwards, to ensure the filesystem is in a consistent state inside each snapshot.  (But I don't think it would be too hard to modify sanoid and syncoid to do that.)

----------

TODO: I now need to check how ganeti handles instance moves with -t file, and probably need to add a new hook to support creating and removing datasets as part of the move.

To minimise downtime, what I'd actually rather do is:

- sync the zfs filesystem to the target node while the instance is running
- shut down the instance
- do a final resync to pick up just the latest changed blocks
- start the new instance

But I don't think ganeti will let me do that unless I treat the filesystem as -t sharedfile, and then it would be up to me to perform the copying outside of ganeti itself.

Anyway, I hope there's something useful in the above.  It looks like a pretty decent way to run VMs, although I think it would be a little better if ganeti could use qcow2 files rather than sparse raw image files.

Cheers,

Brian.

candlerb

Apr 29, 2018, 2:10:09 PM
to ganeti
Oh dear: instance rename is broken for the same reason (EBUSY when trying to rename a directory which is actually a zfs dataset), although by this point ganeti has already renamed the instance, so you can manually complete the job by doing "zfs rename" after the ganeti rename has failed.

I tried frigging it via pre/post scripts but that fails too:

===> /etc/ganeti/hooks/instance-rename-pre.d/zfs-rename <===
#!/bin/sh -e
if [ "$GANETI_INSTANCE_DISK_TEMPLATE" = "file" -a "$GANETI_INSTANCE_PRIMARY" = "$(hostname)" ]; then
  zfs rename "zfs/vm/$GANETI_INSTANCE_NAME" "zfs/vm/$GANETI_INSTANCE_NAME.rename"
  mkdir "/zfs/vm/$GANETI_INSTANCE_NAME"
fi

===> /etc/ganeti/hooks/instance-rename-post.d/zfs-rename <===
#!/bin/sh -e
if [ "$GANETI_INSTANCE_DISK_TEMPLATE" = "file" -a "$GANETI_INSTANCE_PRIMARY" = "$(hostname)" ]; then
  rmdir "/zfs/vm/$GANETI_INSTANCE_NEW_NAME"
  zfs rename "zfs/vm/$GANETI_INSTANCE_NAME.rename" "zfs/vm/$GANETI_INSTANCE_NEW_NAME"
fi

The result:

# gnt-instance rename --no-name-check --no-ip-check foobar foobar2
As you disabled the check of the DNS entry, please verify that
'foobar2' is a FQDN. Continue?
y/[n]/?: y
Sun Apr 29 19:04:05 2018  - WARNING: Could not prepare block device disk/0 on node nuc2.home.deploy2.net (is_primary=True, pass=2): Error while assembling disk: /zfs/vm/foobar2/9f52eb9d-13b9-4178-aabc-6ff91e28e080.file.disk0: No such file
Sun Apr 29 19:04:05 2018  - WARNING: Could not prepare block device disk/1 on node nuc2.home.deploy2.net (is_primary=True, pass=2): Error while assembling disk: /zfs/vm/foobar2/81806b71-b7d8-4e31-bb97-4a379d24dfe1.file.disk1: No such file
Failure: command execution error:
Disk consistency error

It seems that hooks are really the wrong level to do this - you'd really have to plumb into ganeti at the disk template level and make a completely new disk template type for zfs.  That is a lot of work :-(

candlerb

Apr 29, 2018, 3:27:50 PM
to ganeti
Or there is extstorage. I found from the wiki <https://github.com/ganeti/ganeti/wiki/External-Storage-Providers> an example which maps a local or shared file image as if it were a SAN:

https://code.grnet.gr/projects/extstorage/repository/revisions/master/show/shared-filer

AFAICS it runs using losetup, so is messier than having kvm talking directly to an image file; but it might be a base to work from.  (Can't see how to clone the repo, but the files can be pulled one by one)

Phil Regnauld

Apr 29, 2018, 4:14:05 PM
to gan...@googlegroups.com
candlerb (b.candler) writes:
> Or there is extstorage. I found from the wiki
> <https://github.com/ganeti/ganeti/wiki/External-Storage-Providers> an
> example which maps a local or shared file image as if it were a SAN:
>
> https://code.grnet.gr/projects/extstorage/repository/revisions/master/show/shared-filer
>
> AFAICS it runs using losetup, so is messier than having kvm talking
> directly to an image file; but it might be a base to work from. (Can't see
> how to clone the repo, but the files can be pulled one by one)

Hmm, I did some initial work on getting a Clariion FC SAN working with
Ganeti (https://github.com/regnauld/ganeti-ext-storage-clariion) - this
is based off of https://github.com/DSI-Universite-Rennes2/hpeva, and
it used the vendor provided tool to provision a LUN and make it appear
as a block device using multipath - no losetup required. You should be
able to make this work with ZFS + files, with the limitation that the ZFS
pool will be entirely local to each node, and not visible across all nodes
(don't remember if there are some assumptions about this in the code).


Phil Regnauld

Apr 29, 2018, 4:15:04 PM
to gan...@googlegroups.com
candlerb (b.candler) writes:
>
> It turns out that you *also* have to edit /etc/ganeti/file-storage-paths
> and add `/zfs/vm` there too, and distribute to all nodes. (I needed to add
> debug statements into the python code to find out why it wasn't working!)

It's actually listed in the admin docs at http://docs.ganeti.org/ganeti/master/html/admin.html,
but yeah, it's not super obvious.

candlerb

Apr 30, 2018, 9:32:46 AM
to ganeti
Cheers for that.

I worked out why zvol snapshots fail if there isn't plenty of free disk space.  I probably ought to write it up in a blog post, but in summary, ZFS does the "right thing" by setting a reservation equal to the zvol logical size, to guarantee that you'll never end up with an out-of-disk error which prevents writes to the zvol.  You can do sparse provisioning if you want by adding the "-s" flag when creating the zvol, or simply by setting its reservation to zero at any time later.
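Concretely, either of these does the trick (dataset names are hypothetical; for zvols the property in question is refreservation):

```shell
# Create a sparse zvol: -s skips the up-front space reservation
zfs create -s -V 10G zfs/vm/foobar.disk0
# Or drop the reservation from an existing (non-sparse) zvol later
zfs set refreservation=none zfs/vm/foobar.disk0
```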

This means that the ganeti+ZFS ZVOL option becomes interesting again.

I really don't like the idea of spoofing lvm commands, so I took the ganeti-extstorage-zfs provider and stripped all that out to give something extremely simple.

This seems to work just fine, since after all, the zvols are already exposed as block devices; you just have to tell ganeti where to find them.

It's slightly annoying that you can no longer create an instance with "-s 1G"; you have to use the longer version "-t ext --disk 0:size=1G,provider=zfs"

The main inconvenience though is mobility.  "gnt-instance move" refuses to run on any instance with -t ext.  So what you have to do to migrate an instance is:

1. Shut it down
2. gnt-instance failover -n (newnode); fortunately this doesn't check that the volume exists on the new node
3. Copy across the disk dataset, e.g. using syncoid
4. Start the instance on the new node
5. When everything looks OK, delete the disk dataset from the old node

Or you can minimise the downtime like this:

1. While the instance is still running, do a sync to the target using syncoid
2. Shutdown the instance
3. Do a final resync with syncoid to pick up any last changes
4. gnt-instance failover -n (newnode)
5. Start the instance on the new node
6. When everything looks OK, delete the disk dataset from the old node 
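As a rough sketch (instance name, node name and zvol dataset path are all hypothetical - the real path depends on how the extstorage provider names its volumes):

```shell
syncoid zfs/vm/UUID.disk0 root@newnode:zfs/vm/UUID.disk0  # initial sync, instance still running
gnt-instance shutdown xxx
syncoid zfs/vm/UUID.disk0 root@newnode:zfs/vm/UUID.disk0  # final resync, just the last delta
gnt-instance failover -n newnode xxx
gnt-instance startup xxx
# once satisfied, on the old node:
zfs destroy zfs/vm/UUID.disk0
```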

If you're not moving instances around very much then you can probably live with this, for the benefits of ZFS integrity checking, snapshots and replication.

I did look into whether libvirt+ZFS works better, but it's broken.

I note that the ganeti extstorage interface appends ".diskN" to volume names, even though they are already unique UUIDs.  This is likely to mess things up if you want to detach a disk from one instance and attach it to another at a different disk number.  Since moving disks between instances is not yet possible that doesn't really matter right now.

Final note: all this means that using the "file" hooks for ZFS is almost certainly the wrong way to go, so nobody should implement what I posted earlier :-)