Something strange here.
Short version: I migrated some VMs and rebooted the underlying hosts, and then DRBD was not getting configured at all on the master node.
Long version:
* Two nodes, wrn-vm1 (master) and wrn-vm2. Both are Debian wheezy. Both have identical sets of packages (output of "dpkg-query -l" is identical)
* I recently updated ganeti to 2.9.5 (from wheezy-backports; previously was on 2.8.4 from source)
* I also did a full "apt-get dist-upgrade" and found that a newer kernel was installed: 3.2.54-2, whereas 3.2.46-1 was running. So I decided to reboot the boxes in turn to activate the new kernel.
* There were only two important VMs, both running on wrn-vm1. I started another VM on wrn-vm2, migrated it back and forth to wrn-vm1 a few times to check all was OK, and left it on wrn-vm1.
So at this point there were three VMs running on wrn-vm1 (all using drbd disks), and nothing on wrn-vm2.
* I then rebooted wrn-vm2 to get the new kernel there. Fine.
* I then migrated all three VMs from wrn-vm1 to wrn-vm2. Fine.
* I then rebooted wrn-vm1. This is where the problems started.
When I tried to migrate a machine back from wrn-vm2 to wrn-vm1, it told me /proc/drbd was not present on wrn-vm1. Sure enough, lsmod showed that the drbd module hadn't been loaded.
So I did "modprobe drbd".
Now drbd is present, but completely unconfigured.
root@wrn-vm1:~# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: F937DCB2E5D83C6CCE4A6C9
root@wrn-vm1:~#
Ganeti has not configured any of the drbd devices needed for the three VMs which are primary on wrn-vm2 and secondary on wrn-vm1.
So on wrn-vm1 I did "/etc/init.d/ganeti restart". No difference:
root@wrn-vm1:~# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: F937DCB2E5D83C6CCE4A6C9
root@wrn-vm1:~#
root@wrn-vm1:~# gnt-cluster verify-disks
Submitted jobs 150105
Waiting for job 150105 ...
No disks need to be activated.
So, ganeti thinks everything is fine!
If I use "gnt-instance info" it clearly shows that the instance is using the drbd template, with nodeB being wrn-vm1.
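(As a quick cross-check, something like

gnt-instance list -o name,pnode,snodes

should show wrn-vm2 as primary and wrn-vm1 as secondary for each of the three instances, assuming I have the output field names right.)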
Looking at the wrn-vm2 side:
root@wrn-vm2:~# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: F937DCB2E5D83C6CCE4A6C9
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:120 nr:0 dw:5388 dr:135244 al:36 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:248
2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:0 nr:0 dw:0 dr:1072 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:0 nr:0 dw:56 dr:1072 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
root@wrn-vm2:~# drbd-overview
1:??not-found?? WFConnection Primary/Unknown UpToDate/DUnknown C r-----
2:??not-found?? WFConnection Primary/Unknown UpToDate/DUnknown C r-----
3:??not-found?? WFConnection Primary/Unknown UpToDate/DUnknown C r-----
root@wrn-vm2:~#
I can't see any other problems with the cluster:
root@wrn-vm1:~# gnt-cluster getmaster
root@wrn-vm1:~# gnt-cluster verify
(all looks OK, no errors or warnings)
Both nodes have
options drbd minor_count=128 usermode_helper=/bin/true
in /etc/modprobe.d/drbd.conf
I rebooted wrn-vm1 again; everything was exactly the same, and again I had to modprobe drbd by hand.
I added drbd to /etc/modules and rebooted wrn-vm1 again. This time drbd was present, but again no drbd devices were created.
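(For completeness, that change was just appending the module name to /etc/modules, i.e. roughly:

echo drbd >> /etc/modules

so that the module gets loaded at boot.)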
Presumably the drbd devices should be created at some point when the ganeti daemons start up, but I don't know at what point.
I found this in watcher.log:
2014-03-17 15:15:04,974: ganeti-watcher pid=4132 INFO Skipping disk activation for instance with not activated disks 'sftp'
2014-03-17 15:15:04,974: ganeti-watcher pid=4132 INFO Skipping disk activation for instance with not activated disks 'airfoo'
2014-03-17 15:15:04,974: ganeti-watcher pid=4132 INFO Skipping disk activation for instance with not activated disks 'netdot.example.com'
Those are the names of the three VMs. So the disks are not "activated"? Googling takes me to the manpage for gnt-instance activate-disks. And sure enough:
Mon Mar 17 15:17:49 2014 - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
wrn-vm2.example.com:disk/0:/dev/drbd1
And now I can migrate this machine back and forth happily.
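(For anyone else who hits this, the fix per instance was roughly:

gnt-instance activate-disks <instance>
gnt-instance migrate <instance>

i.e. activate the disks first, then migrate.)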
But this raises a number of questions for me:
(1) What does "activating" a disk mean?
(2) Why, after rebooting wrn-vm1 (which was the master node but secondary for those instances), were the disks not activated?
(3) Why did "gnt-cluster verify-disks" say that no disks needed to be activated?
I may be able to leave the other VMs in an unactivated state for a short time, if it helps debug the issue, but I'd rather get them mirrored again sooner rather than later.
Regards,
Brian.