Linbit announces drbdd

Karsten Heymann

Feb 19, 2021, 3:15:31 AM
to gan...@googlegroups.com
Hi everyone,

This might be interesting: Linbit (the company that created drbd) has just announced a new drbdd service that shares a lot of similarities with what ganeti does to manage drbd.


It's a bit like the linstor controller without linstor (https://twitter.com/philipp_reisner/status/1362518109735841794).

Maybe someday drbdd could be used to replace the ganeti drbd logic, making the ganeti code smaller and easier to manage.

Best regards
Karsten

Brian Candler

Feb 19, 2021, 7:35:21 AM
to ganeti
> It's a bit like the linstor controller without linstor

Interesting development, but as far as I can see, it's rather lower level than that.

Linstor provisions volumes across a cluster: it configures drbd on all the nodes, and creates/destroys LVM or ZFS volumes for the underlying storage.  drbdd just watches for events from the drbd resources on the local node, and triggers local plugins when their state changes.
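To make that event-watching model concrete, here is a rough sketch in Python. The event line format is an assumption on my part, modelled on drbd9's `drbdsetup events2` key:value output, and the plugin interface is invented for illustration:

```python
# Minimal sketch of the drbdd idea: watch drbd events on the local node and
# hand each one to local plugins.  The line format assumed here is
# "action object key:value ...", as in drbd9's "drbdsetup events2" output.

def parse_event(line):
    """Split one events2-style line into (action, object, fields)."""
    action, obj, *rest = line.split()
    fields = dict(tok.split(":", 1) for tok in rest)
    return action, obj, fields

def dispatch(line, plugins):
    """Hand one parsed event to every registered local plugin."""
    action, obj, fields = parse_event(line)
    for plugin in plugins:
        plugin(action, obj, fields)

# toy plugin: react when a resource drops to the Secondary role
def log_demotion(action, obj, fields):
    if obj == "resource" and fields.get("role") == "Secondary":
        print(f"{fields['name']} lost the primary role")

dispatch("change resource name:r0 role:Secondary suspended:no", [log_demotion])
# prints "r0 lost the primary role"
```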

The "promoter" plugin triggers changes to systemd units.  Therefore, if you were running VMs using systemd units to start qemu (and that is certainly a low-level way to run VMs!), on failure you could have another node auto-promoted to primary and the VM restarted there.  The promoter plugin runs on *all* nodes, and there's a race: the first node to promote itself wins.  This means it doesn't need a distributed cluster manager (such as ganeti), since the drbd resource itself acts as the locking/coordination mechanism.
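For concreteness, here's the sort of systemd unit the promoter could be starting on the winning node. The unit name and qemu flags are invented; the relevant point is that with drbd9's auto-promote, simply opening the device read-write takes the primary role:

```ini
# /etc/systemd/system/vm-myguest.service -- hypothetical example
[Unit]
Description=qemu guest backed by a drbd volume

[Service]
# with drbd9 auto-promote, opening /dev/drbd0 read-write promotes this node
ExecStart=/usr/bin/qemu-system-x86_64 -enable-kvm -m 2048 -display none \
    -drive file=/dev/drbd0,format=raw,if=virtio
```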

I can see drbdd being useful to signal events to ganeti, but not this auto-promotion; ganeti has a central master architecture. It would want to pick up events on the master, make its own decision which node to promote, and perform it as a job.
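A hypothetical sketch of what such a master-side handler might look like: ganeti, not drbdd, decides which node takes over, then performs the promotion as a normal job. The `pick_target` policy is a toy stand-in; the `gnt-instance failover` command itself is real:

```python
# Hypothetical master-side handler: the ganeti master receives a node-failure
# event, makes its own placement decision, and submits a failover job.
import subprocess

def pick_target(candidates, failed_node):
    """Toy policy: first candidate node that isn't the failed one."""
    return next(n for n in candidates if n != failed_node)

def handle_node_failure(instance, candidates, failed_node, run=subprocess.run):
    target = pick_target(candidates, failed_node)
    # submit the failover as a ganeti job; the master stays in control
    run(["gnt-instance", "failover", "--ignore-consistency", instance],
        check=True)
    return target
```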

Also note: I believe it has been a conscious decision to date that ganeti *won't* auto-migrate an instance to another node when a node fails (*).  This drbdd mechanism *could* be used to implement or trigger such auto-healing, if it was so desired.  It would however be relying on drbd's own quorum and state management, rather than ganeti's.

IMO, the best way to simplify the ganeti drbd logic would be to get it to talk to the Linstor API.  Then you get all the benefits of drbd9: up to 32 replicas, LVM and ZFS underneath, thin volumes, and live migration to diskless nodes.  Linstor handles plain LVM volumes too.  With this, ganeti could remove *all* its LVM and DRBD logic!
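As a hedged sketch of what "talking to the Linstor API" could look like, using the python-linstor client. The method names are from memory and should be checked against the python-linstor documentation; nothing here is a verified ganeti design:

```python
# Hedged sketch: provision an instance disk via the Linstor controller API
# with the python-linstor client.  Linstor volume sizes are given in KiB.

def gib_to_kib(size_gib):
    """Convert GiB to KiB, the unit Linstor volume definitions use."""
    return size_gib * 1024 * 1024

def create_instance_disk(uri, name, size_gib, replicas=2):
    import linstor  # python-linstor client, assumed installed (not stdlib)
    with linstor.Linstor(uri) as conn:
        conn.resource_dfn_create(name)                      # define the resource
        conn.volume_dfn_create(name, gib_to_kib(size_gib))  # define its volume
        conn.resource_auto_place(name, replicas)            # linstor picks nodes
```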

Regards,

Brian.

(*) This is surprising to some people, because there is harep, which you enable by setting instance tags "ganeti:watcher:autorepair:<type>", e.g. "ganeti:watcher:autorepair:failover".  But as the manpage says:

> Harep doesn't do any hardware failure detection on its own, it relies on nodes being marked as offline by the administrator.

So if your server goes down at 3am, the VMs *won't* be automatically restarted on another node. Plus, it doesn't work with shared storage (ext), only drbd (**).

(**) or plain - but since the node with the plain volume has died, it can only reinstall a fresh instance for you.

Brian Candler

Feb 19, 2021, 8:30:18 AM
to ganeti
On Friday, 19 February 2021 at 08:15:31 UTC karsten...@gmail.com wrote:
> It's a bit like the linstor controller without linstor (https://twitter.com/philipp_reisner/status/1362518109735841794).


BTW, the way I read that tweet, it's not saying this is a replacement for the linstor controller; it's a replacement for the HA setup around the linstor controller.

The linstor controller itself is a single entity, not a cluster.  It doesn't have the concept of active and standby nodes and master-failover, like ganeti has.  So until now, the recommended way to deploy the linstor controller in a HA fashion is either to muck about with pacemaker, or rely on some other external cluster like proxmox or k8s to keep an instance of the controller running in a VM or container.

What's being proposed here (I think) is that you run the controller under a systemd service which mounts the drbd-backed storage and then starts the controller.  In the event that the running node fails:
- drbdd "promoter" signals all the other machines to try and grab the primary role
- one of them succeeds
- systemd restarts the controller on that node
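The race in those steps can be modelled with a toy sketch (all names invented); the point is that the drbd resource serializes promotion, so exactly one node can win:

```python
# Toy model of the promoter race: every surviving node tries to grab the
# primary role, but the resource only lets the first attempt succeed --
# no separate cluster decision maker is needed.
import threading

class ToyDrbdResource:
    """Stand-in for a drbd resource: promotion can succeed only once."""
    def __init__(self):
        self._lock = threading.Lock()
        self.primary = None

    def try_promote(self, node):
        with self._lock:
            if self.primary is None:
                self.primary = node
                return True
            return False

res = ToyDrbdResource()
winners = [n for n in ("node1", "node2", "node3") if res.try_promote(n)]
# exactly one node ends up in winners; the controller restarts there
```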

Basically you're leveraging the drbd quorum mechanisms to avoid a separate cluster decision maker.  With linstor, even if you make a drbd volume across only two nodes, it automatically adds a third diskless node as a quorum tiebreaker.

I think this should work in the case of a network partition as well.  If the running node becomes isolated it should lose quorum, and drbd will go into I/O blocking state on that node, so it should be safe for another node to take over.
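The quorum arithmetic behind that is simple majority voting; a toy sketch (function name invented):

```python
# Toy version of drbd's majority quorum rule: a node keeps serving I/O only
# while it can reach a majority of the configured voting nodes (counting
# itself).  With linstor's diskless tiebreaker, even a 2-replica resource
# spans 3 voting nodes.

def has_quorum(reachable, total):
    """True if `reachable` nodes (including this one) form a majority."""
    return reachable > total // 2

print(has_quorum(2, 3))  # surviving pair keeps quorum -> True
print(has_quorum(1, 3))  # isolated node loses quorum, I/O blocks -> False
```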

Cheers,

Brian.