gnt command to re-create instance disks of a LVM VG


John N.

Aug 1, 2018, 12:19:23 PM8/1/18
to ganeti
Hi,

I can't remember which command or parameter to gnt-cluster or gnt-node is supposed to re-create the missing disks (LVM logical volumes) of a redundant pair (DRBD) after one has re-created the LVM volume group from scratch on a node.

I am asking because I have a node where the LVM volume group is dying and I want to re-create the whole LVM volume group from scratch. This means that my LVM logical volumes will be gone, and I would like ganeti to re-create them automatically for me, but I can't remember which command does that (was it maybe "gnt-cluster verify-disks"?).

On that node only secondary instances are left; I have already live-migrated my primary instances to other nodes. Btw this is Debian 8 with ganeti 2.12 with DRBD, LVM and Xen.

Regards,
John


candlerb

Aug 8, 2018, 1:19:10 PM8/8/18
to ganeti
> I can't remember which command parameter to gnt-cluster or gnt-node is supposed to re-create the missing disks (LVM logical volumes) of a redundant pair (DRBD) after one has re-created the volume group (LVM) from scratch on a node?

You may be thinking of "gnt-instance replace-disks". But I don't think you can recreate the volume group from scratch.  Rather: you add the new disks into the existing volume group, then use replace-disks to move the instance(s) off the bad disks, then you can remove the bad disks from the volume group.

If you've already zapped the existing VG then maybe you'll have to convert the instances to plain and then back to drbd.

John N.

Aug 9, 2018, 4:07:59 PM8/9/18
to ganeti

On Wednesday, August 8, 2018 at 7:19:10 PM UTC+2, candlerb wrote:
 
You may be thinking of "gnt-instance replace-disks". But I don't think you can recreate the volume group from scratch.  Rather: you add the new disks into the existing volume group, then use replace-disks to move the instance(s) off the bad disks, then you can remove the bad disks from the volume group.

Thanks for the pointer, that must be it. If I run "gnt-instance replace-disks -s" for each instance which has its secondary on that specific node, it should re-create its logical volumes and sync the whole disk. The volume group I have to create manually, but that's the easy part as it is only one command (vgcreate ...); what I would not like is having to manually create all the logical volumes, as each instance has two of them (data + meta). So if you have 35 secondary instances on one node like me, it means re-creating 70 logical volumes manually.
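A loop like the one John describes can be sketched as a dry run first, so all the commands can be reviewed before anything touches the cluster. The node name and the query filter below are assumptions about the standard Ganeti CLI, not taken from this thread:

```shell
# Turn a list of instance names (one per line on stdin) into the
# replace-disks commands to run, instead of executing them directly.
gen_replace_cmds() {
  while read -r inst; do
    echo "gnt-instance replace-disks -s $inst"
  done
}

# On a real cluster the instance list would come from something like
# (hypothetical node name):
#   gnt-instance list --no-headers -o name \
#     --filter '"badnode.example.com" in snodes' | gen_replace_cmds
```

Once the printed commands look right, piping them into `sh` executes them.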
 
If you've already zapped the existing VG then maybe you'll have to convert the instances to plain and then back to drbd.

I am planning to zap that VG, yes, because of an error with the underlying disks, all of which I want to replace, but I don't think (or at least hope) I will need to convert the instances to plain and back to DRBD...

candlerb

Aug 10, 2018, 8:59:16 AM8/10/18
to ganeti
> I am planning to zap that VG yes because of an error with the underlying disks which I want to replace

I'm not sure you should zap the VG.  I think you should remove the faulty disk from the volume group, add a new good one, and do "replace-disks" for each instance, e.g. using the --auto option:

       The fourth form (when using --auto) will automatically determine which disks of an instance are faulty and replace them within the same node. The --auto option works only when an instance has only faulty disks on either the primary or secondary node; it doesn't work when both sides have faulty disks.
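For reference, the fourth form described above is invoked per instance; the instance name here is just a placeholder:

```shell
# Let Ganeti detect which of instance1's disks are faulty and rebuild
# them in place on the same node (only works when one side is faulty):
gnt-instance replace-disks --auto instance1
```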

The documentation in this area isn't great:

In particular, it's not clear to me how you tell ganeti that the physical volume has failed (if it hasn't been kicked out by the OS) - maybe you could pvremove --force it, or just physically unplug it.

But if you delete the VG and recreate the VG, it's going to be a different VG.  It *might* work (if Ganeti always references the VG by name, not by UUID).  But it might not.

John N.

Aug 10, 2018, 12:18:54 PM8/10/18
to ganeti


On Friday, August 10, 2018 at 2:59:16 PM UTC+2, candlerb wrote:
I'm not sure you should zap the VG.  I think you should remove the faulty disk from the volume group, add a new good one, and do "replace-disks" for each instance, e.g. using the --auto option:

       The fourth form (when using --auto) will automatically determine which disks of an instance are faulty and replace them within the same node. The --auto option works only when an instance has only faulty disks on either the primary or secondary node; it doesn't work when both sides have faulty disks.

Unfortunately my situation is a bit more complicated: I am using the ZFS external storage provider and my ZFS pool has a permanent error (all disks are fine though -> checked with smartctl long tests). This permanent error cannot be fixed, so my last resort is to destroy the ZFS pool and re-create a new one, which would then be all good again.
 
The documentation in this area isn't great:

In particular, it's not clear to me how you tell ganeti that the physical volume has failed (if it hasn't been kicked out by the OS) - maybe you could pvremove --force it, or just physically unplug it.

I totally agree; I was also searching the docs for this specific situation but could not find any clear procedure.

But if you delete the VG and recreate the VG, it's going to be a different VG.  It *might* work (if Ganeti always references the VG by name, not by UUID).  But it might not.

Good point, it would be nice if someone from Google here could step in and confirm what would happen when deleting the whole VG and re-creating it from scratch.

candlerb

Aug 14, 2018, 3:14:35 AM8/14/18
to ganeti
> I am using the ZFS external storage provider

Are you talking about this one?

You do realise that this replaces all the lvm binaries with frig shell scripts don't you?  In other words, you're not using LVM at all!    For example, the replacement "lvcreate" command creates a zvol, not an LVM logical volume. See:

Personally I wouldn't touch that with a barge pole.  Ignore everything I said about ganeti LVM physical disk replacement before, as it doesn't apply to this (because you're not using LVM).

You *are* using drbd on top of zvols though.  As to how to recreate the underlying zvols I don't know, but if your zpool is in a sufficiently unbroken state that you can *list* all the zvols and recreate them with identical names and sizes, that might be sufficient.  drbd itself might be sufficiently clever to realise that both the data and metadata volumes have been wiped, and just resync them automatically the next time the disks are activated.
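If the pool can still be read, the inventory step suggested above can be captured up front and replayed later. The pool name "tank" and the helper function are illustrative, not from this thread:

```shell
# Before destroying the pool, record every zvol with its exact byte size
# (-H: no headers, -p: exact numeric values), e.g.:
#   zfs list -r -t volume -Hp -o name,volsize tank > zvols.txt
# After the pool is rebuilt, turn that dump back into creation commands
# so they can be reviewed before running them:
gen_zvol_cmds() {
  while read -r name size; do
    echo "zfs create -V $size $name"
  done
}
```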

But if I were you, I'd take dumps of my instances first.

John N.

Aug 15, 2018, 8:47:13 AM8/15/18
to ganeti


Are you talking about this one?

Yes, but actually I am using a fork (https://github.com/brigriffin/ganeti-extstorage-zfs) which includes a few fixes.

 
You do realise that this replaces all the lvm binaries with frig shell scripts don't you?  In other words, you're not using LVM at all!    For example, the replacement "lvcreate" command creates a zvol, not an LVM logical volume. See:

Yes I do, and I don't mind having the LVM binaries replaced as I don't need LVM at all. Actually I did not want to explain all the gory details, as the principles of volume management stay the same whether it is LVM or ZFS, but I am totally aware that the lv* commands do not apply to my case and I know the respective ZFS commands.

Personally I wouldn't touch that with a barge pole.  Ignore everything I said about ganeti LVM physical disk replacement before, as it doesn't apply to this (because you're not using LVM).

I have been using this external storage provider for over 2 years in production on a small 3-node cluster and I must say I am very happy with it. Performance is amazing and you have the added benefit of snapshots (which you can send over for DR) as well as data integrity, among other benefits (lz4 compression, etc). I am still wondering why ganeti did not decide to go for ZFS natively, but that's another topic of discussion :-)
 
You *are* using drbd on top of zvols though.  As to how to recreate the underlying zvols I don't know, but if your zpool is in a sufficiently unbroken state that you can *list* all the zvols and recreate them with identical names and sizes, that might be sufficient.  drbd itself might be sufficiently clever to realise that both the data and metadata volumes have been wiped, and just resync them automatically the next time the disks are activated.

Unfortunately I already tried re-creating the problematic zvol, and while this removed the one permanent error on the zpool, it added a new one which I really cannot get rid of. Hence my last-resort solution of destroying the whole pool and starting again with a clean one. I opened an issue with ZFS on Linux but no answer yet. If you want more details, they are here: https://github.com/zfsonlinux/zfs/issues/7762

candlerb

Aug 16, 2018, 4:00:16 AM8/16/18
to ganeti
Sure - whatever works for you.  You did say originally "I have a node where the LVM volume group is dying", and "this is Debian 8 with ganeti 2.12 with DRBD, LVM and Xen", which rather strongly implied you were using LVM :-)

I do use zfs in my home cluster, but just using local zvols on SSD, which I then hourly replicate to a backup machine on HDD using syncoid.

John N.

Aug 16, 2018, 3:59:20 PM8/16/18
to ganeti


Sure - whatever works for you.  You did say originally "I have a node where the LVM volume group is dying", and "this is Debian 8 with ganeti 2.12 with DRBD, LVM and Xen", which rather strongly implied you were using LVM :-)

You are totally right; basically I abstracted away the ZFS part because, from the point of view of ganeti, it is as if it were still using LVM, and I was too lazy to explain the "gory" details :)
 
I do use zfs in my home cluster, but just using local zvols on SSD, which I then hourly replicate to a backup machine on HDD using syncoid.

Ah, so it's you who made this nice and simple extstorage provider for ZFS, well done! When you say "local zvols", do I understand correctly that you mean something like the plain type (no DRBD replication)?
You might also want to set "primarycache=metadata" on your zvols, as otherwise you end up with double caching (ZFS on the hypervisor + the guest OS page cache) without much benefit, and more context switches on your hypervisor.

candlerb

Aug 17, 2018, 7:16:49 AM8/17/18
to ganeti
On Thursday, 16 August 2018 20:59:20 UTC+1, John N. wrote:

Ah, so it's you who made this nice and simple extstorage provider for ZFS, well done! When you say "local zvols", do I understand correctly that you mean something like the plain type (no DRBD replication)?

The ext storage provider creates a zvol directly on the local host, and the instance directly talks to the zvol block device.  So it's just a local image.

The "plain" and "drbd" disk templates still exist, and still do what they normally do: create LVM, and DRBD on top of LVM, respectively.

So if you want to create a zfs instance you have to name the ext storage provider directly: e.g.

gnt-instance add -o snf-image+jessie -t ext --disk 0:size=10G,provider=zfs -n nuc1 jessie1.example.com

You do lose the real-time replication and live migration which DRBD gives you.  On my home setup I can live without those.  To move an instance, you stop it, do a final syncoid sync to the central storage server, then sync from the storage server back to the target.
 
You might also want to set "primarycache=metadata" as parameter on your zvols as else you end up having double-caching (ZFS on hypervisor + guest OS) without much benefits and higher context switches on your hypervisor.

I create all my zvols under a parent dataset ("zfs/vm"), so any options like that can be set there globally:

root@nuc1:~# zfs get -r primarycache zfs/vm -t volume
NAME                                                   PROPERTY      VALUE         SOURCE
zfs/vm/12700459-7dc2-454f-87f8-54eb1a8c090b.ext.disk0  primarycache  metadata      inherited from zfs/vm
zfs/vm/28289bb1-8e27-4417-af7c-3d2f5a3a50d1.ext.disk0  primarycache  metadata      inherited from zfs/vm
zfs/vm/e7b0fe8b-996d-44a9-950e-da5c257ab0df.ext.disk0  primarycache  metadata      inherited from zfs/vm
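For completeness, the "inherited from zfs/vm" in the output above comes from setting the property once on the parent dataset, so every zvol created under it picks it up automatically:

```shell
zfs set primarycache=metadata zfs/vm
```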

John N.

Aug 17, 2018, 2:40:18 PM8/17/18
to ganeti

The ext storage provider creates a zvol directly on the local host, and the instance directly talks to the zvol block device.  So it's just a local image.

The "plain" and "drbd" disk templates still exist, and still do what they normally do: create LVM, and DRBD on top of LVM, respectively.

So if you want to create a zfs instance you have to name the ext storage provider directly: e.g.

gnt-instance add -o snf-image+jessie -t ext --disk 0:size=10G,provider=zfs -n nuc1 jessie1.example.com

You do lose the real-time replication and live migration which DRBD gives you.  On my home setup I can live without those.  To move an instance, you stop it, do a final syncoid sync to the central storage server, then sync from the storage server back to the target.

Interesting, thanks for the example. Now that you mention this, I remember why the ZFS extstorage provider needs to emulate the LVM layer: the ganeti external storage interface does not work with DRBD, and hence does not support live migration. As I need live migrations I went for this "hacky" ZFS external storage provider, with the downside that I can't have any real LVM anymore, but for me live migrations and high availability were the priority.
 
I create all my zvols under a parent dataset ("zfs/vm"), so any options like that can be set there globally:

root@nuc1:~# zfs get -r primarycache zfs/vm -t volume
NAME                                                   PROPERTY      VALUE         SOURCE
zfs/vm/12700459-7dc2-454f-87f8-54eb1a8c090b.ext.disk0  primarycache  metadata      inherited from zfs/vm
zfs/vm/28289bb1-8e27-4417-af7c-3d2f5a3a50d1.ext.disk0  primarycache  metadata      inherited from zfs/vm
zfs/vm/e7b0fe8b-996d-44a9-950e-da5c257ab0df.ext.disk0  primarycache  metadata      inherited from zfs/vm

Makes sense, and I should do the same as you, as I currently set this parameter on each zvol independently when it gets created...

Nice to see someone else using ZFS with ganeti, I think it's a great match.

candlerb

Aug 18, 2018, 7:49:39 AM8/18/18
to ganeti
On Friday, 17 August 2018 19:40:18 UTC+1, John N. wrote:

As I need live migrations I went for this "hacky" ZFS external storage provider, with the downside that I can't have any real LVM anymore, but for me live migrations and high availability were the priority.
 

If you do that, I think you immediately negate the ability to safely use snapshots.  That is: if you magically change the data from underneath drbd, without going through the block write layer, drbd *will* become totally confused (most likely the two ends will be out of sync without drbd realising).  Even if you tried to take snapshots on both the primary and secondary sides, they wouldn't be simultaneous.

As far as I can see, the only way to safely restore to a snapshot would be:

1. Stop the instance and deactivate its disks
2. Revert to snapshot at primary side
3. Reactivate disks
4. Force a complete resync of drbd

Ganeti itself doesn't really have an option for step 4, apart from converting the instance to 'plain' and back to 'drbd'.  It might be possible to do it using drbdadm manually to invalidate the copy, or by recreating the secondary zvol (or wiping just the drbd metadata from it).  Either way, you're still forcing a full copy of the entire disk.
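The drbdadm route mentioned above might look like this (untested sketch; the resource name is a placeholder, and it must be run on the node whose copy is to be thrown away):

```shell
# Mark the local copy of the DRBD resource as inconsistent, forcing a
# full resync from the peer once the disks are re-activated:
drbdadm invalidate resource0
```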

As for live migration: I did have an idea how to approach this but I've not tested it.

1. Create two physical servers with large spare disk or disk partition.  Export this free space as nbd or iscsi volumes
2. Build an NFS server VM, which uses ZFS internally and attaches to the two nbd/iscsi volumes as mirrors.
3. Run VMs using sharedfile, with disk images accessed over NFS

You can live-migrate the VMs, and you can live-migrate the NFS server, whilst maintaining the full data integrity checking of ZFS.  The system area of the NFS server can be traditional DRBD, so you don't need it to be able to boot from zfs-over-nbd.

John N.

Aug 18, 2018, 1:07:30 PM8/18/18
to ganeti

As for live migration: I did have an idea how to approach this but I've not tested it.

1. Create two physical servers with large spare disk or disk partition.  Export this free space as nbd or iscsi volumes
2. Build an NFS server VM, which uses ZFS internally and attaches to the two nbd/iscsi volumes as mirrors.
3. Run VMs using sharedfile, with disk images accessed over NFS

You can live-migrate the VMs, and you can live-migrate the NFS server, whilst maintaining the full data integrity checking of ZFS.  The system area of the NFS server can be traditional DRBD, so you don't need it to be able to boot from zfs-over-nbd.

I was wondering: would it also be possible to get live migration with a simpler setup, i.e. a ZFS+NFS physical server and the ganeti instances running as sharedfile (the NFS share mounted on all ganeti nodes)? The ZFS+NFS server could be built out of two physical servers for HA (example project: https://github.com/ewwhite/zfs-ha/wiki).

Finally, back to my initial problem: I have now built a Vagrant box with Debian 8, ganeti, Xen, ZFS and the ZFS external storage provider, in order to test the process of destroying my ZFS pool and re-creating it on two virtualbox ganeti test nodes. The process so far looks like this:

0) document all zfs volumes on the problematic node (in order to manually re-create them after)
1) offline the problematic node from the ganeti master
2) reboot the problematic node (in order to quickly remove all DRBD devices and make the ZFS pool not busy anymore)
3) destroy the zpool
4) re-create zpool
5) set default settings on zpool
6) online the problematic node from the ganeti master
7) for each instance do:
7.1) re-create instance data zvol
7.2) re-create instance meta zvol
7.3) set zvol settings on data and meta zvol (required by ZFS external storage provider)
7.4) re-create symbolic links in /dev/ffzgvg pointing to the correct zvol /dev/zd* (required by ZFS external storage provider)
7.5) re-create DRBD meta (example:  drbdmeta /dev/drbd1 v08 /dev/ffzgvg/ec4da952-0421-49d2-9b24-dae3858a1dc9.disk0_meta flex-external create-md)
7.6) activate instance disks from master node (example: gnt-instance activate-disks ginst1)
7.7) observe full resync via /proc/drbd
8) proceed to next instance at 7.1) until all instances are done

Note here that my problematic node is currently only used as secondary for all my instances, but this is still quite a hell of a job, as I have 33 secondary instances on that node, meaning 66 zvols (data+meta)...
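Step 7 above can be sketched as a command generator, so the whole run over all instances can be reviewed as plain text before executing anything. The pool name "ffzgvg" and the zvol naming follow the examples in this thread; the 128M meta size, the DRBD minor number handling and the omission of steps 7.3/7.4 (provider-specific settings and symlinks) are my assumptions:

```shell
# Print the re-creation commands for one instance's secondary disks
# (steps 7.1, 7.2, 7.5, 7.6) instead of executing them.
gen_instance_cmds() {
  inst=$1 uuid=$2 size=$3 minor=$4
  echo "zfs create -V $size ffzgvg/$uuid.disk0"       # 7.1 data zvol
  echo "zfs create -V 128M ffzgvg/$uuid.disk0_meta"   # 7.2 meta zvol
  echo "drbdmeta /dev/drbd$minor v08 /dev/ffzgvg/$uuid.disk0_meta flex-external create-md"  # 7.5
  echo "gnt-instance activate-disks $inst"            # 7.6 (on the master)
}

# Example (placeholder UUID, size and minor):
# gen_instance_cmds ginst1 ec4da952-0421-49d2-9b24-dae3858a1dc9 10G 1
```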

As you already mentioned, ganeti does not seem to have any command to re-create the missing underlying volumes on a node in case they are gone, which is a bit annoying.