Petter,
Thanks for the tip. lsof did indeed help me find the non-Ganeti process that was holding the DRBD device open. Here's the full sequence of events:
This all started when I tried to remove the instance named "inventory-a":
# gnt-instance remove inventory-a.intranet.psfc.coop
This will remove the volumes of the instance
inventory-a.intranet.psfc.coop (including mirrors), thus removing all
the data of the instance. Continue?
y/[n]/?: y
Thu Sep 1 12:21:51 2022 - WARNING: Could not remove disk 0 on node goji-a.intranet.psfc.coop, continuing anyway: drbd6: can't shutdown drbd device: resource6: State change failed: (-12) Device is held open by someone\nadditional info from kernel:\nfailed to demote\n; Can't lvremove: exited with exit code 5 - Logical volume ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_data is used by another device.\n; Can't lvremove: exited with exit code 5 - Logical volume ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_meta is used by another device.\n
Failure: command execution error:
Can't remove instance's disks
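For anyone hitting the same "(-12) Device is held open by someone" error: DRBD refuses to demote or detach a resource while any process still has the device node open, so the first step is to identify the holder. A sketch of the usual status checks (resource name taken from the error above; which command applies depends on your DRBD version):

```shell
# DRBD 8.x exposes per-resource state (and the open count in the
# "ro:"/"ds:" fields) in /proc/drbd:
cat /proc/drbd

# With drbd-utils 9, the equivalent overview is:
drbdadm status
```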
Next, I ran lsof piped through grep to see all the open handles on device /dev/drbd6:
# lsof | grep -E "COMMAND|drbd6"
COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
mms 1165543 1165678 mms root 202r BLK 147,6 0t0 411031582 /dev/drbd6
drbd6_sub 1956253 root cwd DIR 253,0 243 512 /
drbd6_sub 1956253 root rtd DIR 253,0 243 512 /
drbd6_sub 1956253 root txt unknown /proc/1956253/exe
drbd6_sub 2246791 root cwd DIR 253,0 243 512 /
drbd6_sub 2246791 root rtd DIR 253,0 243 512 /
drbd6_sub 2246791 root txt unknown /proc/2246791/exe
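As an aside, fuser (from psmisc) answers "who has this node open?" more directly than a full lsof scan. A self-contained sketch, demonstrated on a temp file so it runs anywhere; on the real system the same invocation works against /dev/drbd6:

```shell
# Hold a temp file open in the background, then ask fuser who has it.
# On the affected node you would run: fuser -v /dev/drbd6
tmpfile=$(mktemp)
tail -f "$tmpfile" &      # background process keeping the file open
holder=$!
sleep 1                   # give tail a moment to open the file
fuser -v "$tmpfile"       # lists PID, user, access mode, and command
kill "$holder"
rm -f "$tmpfile"
```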
Hmm. What's this "mms" process, with pid 1165543?
# ps auxwww | grep 1165543
root 1053047 0.0 0.0 12136 1148 pts/0 S+ 10:54 0:00 grep --color=auto 1165543
root 1165543 0.3 0.0 0 0 ? Zsl Mar23 802:24 [mms]
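The STAT column is the key detail here: "Zsl" decodes as Z = zombie (exited but not yet reaped by its parent), s = session leader, l = multi-threaded. A sketch of how to read a process's state directly, shown against the current shell's own PID ($$) so it is runnable; substitute 1165543 on the affected node:

```shell
# STAT decoding: Z = zombie, s = session leader, l = multi-threaded.
# $$ is this shell's PID; on the affected node use 1165543 instead.
ps -o pid,stat,comm -p $$
grep -E '^(State|Threads)' /proc/$$/status
```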
Turns out, it's the "Acronis Machine Management Service", part of the cloud backup solution we use. Attempting to stop it resulted in this nastiness:
# journalctl -u acronis_mms
-- Logs begin at Wed 2022-03-23 09:15:15 EDT, end at Fri 2022-09-02 11:11:15 EDT. --
Sep 01 21:42:37 goji-a.intranet.psfc.coop systemd[1]: Stopping Acronis machine management service...
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: State 'stop-sigterm' timed out. Killing.
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 1165543 (mms) with signal SIGKILL.
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 861846 (mount) with signal SIGKILL.
Sep 01 21:44:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Processes still around after SIGKILL. Ignoring.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: State 'stop-final-sigterm' timed out. Killing.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 1165543 (mms) with signal SIGKILL.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 861846 (mount) with signal SIGKILL.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Processes still around after final SIGKILL. Entering failed mode.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Failed with result 'timeout'.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: Stopped Acronis machine management service.
As you can see, the mms process cannot be killed, even with SIGKILL!
# systemctl | grep -i acronis
● acronis_mms.service loaded failed failed Acronis machine management service
Even worse, this mms process is locking ALL the DRBD devices!
# lsof | grep -E "COMMAND|mms.+drbd"
COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
mms 1165543 1165678 mms root 77r BLK 147,12 0t0 584468762 /dev/drbd12
mms 1165543 1165678 mms root 200r BLK 147,2 0t0 303346027 /dev/drbd2
mms 1165543 1165678 mms root 201r BLK 147,5 0t0 387574301 /dev/drbd5
mms 1165543 1165678 mms root 202r BLK 147,6 0t0 411031582 /dev/drbd6
mms 1165543 1165678 mms root 203r BLK 147,1 0t0 411518809 /dev/drbd1
mms 1165543 1165678 mms root 204r BLK 147,9 0t0 460792479 /dev/drbd9
mms 1165543 1165678 mms root 205r BLK 147,10 0t0 583993543 /dev/drbd10
mms 1165543 1165678 mms root 206r BLK 147,11 0t0 584065625 /dev/drbd11
mms 1165543 1165678 mms root 207r BLK 147,4 0t0 640055854 /dev/drbd4
mms 1165543 1165678 mms root 208r BLK 147,13 0t0 702959941 /dev/drbd13
mms 1165543 1165678 mms root 209r BLK 147,15 0t0 839105112 /dev/drbd15
mms 1165543 1165678 mms root 210r BLK 147,0 0t0 964966725 /dev/drbd0
mms 1165543 1165678 mms root 211r BLK 147,17 0t0 998594489 /dev/drbd17
How to resolve this unpleasant situation, I wondered. Here's what I did:
1. Use "gnt-instance migrate" to get the local instances over to the secondary node. This was successful, even though it threw a non-critical "can't flip DRBD state" error for each instance.
2. Stop and disable all the Acronis services, to prevent the MMS service from running again, and the unexpected device lock from recurring.
3. Reboot the system. Even though this was still the master node, I felt that attempting a "gnt-master failover" and then removing the node from the cluster, as I usually do before any reboot, could have made things worse.
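For the record, those three steps boil down to something like the following sketch (the instance name here is hypothetical; the list of Acronis units to disable is whatever systemctl reports on your node):

```shell
# 1. Evacuate each primary instance to its secondary node
#    (instance name hypothetical; repeat per instance on this node)
gnt-instance migrate -f some-instance.intranet.psfc.coop

# 2. Find and disable every Acronis unit so mms cannot come back
systemctl list-units 'acronis*'
systemctl disable --now acronis_mms.service

# 3. Reboot to clear the unkillable process and its device handles
systemctl reboot
```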
Fortunately, when the system came back up, all the DRBD devices were healthy within a few minutes, Ganeti cluster operations returned to normal, and I was able to remove the "inventory-a" instance as originally planned.
Whew!
BTW, I am impressed by the robustness of DRBD -- all 13 active DRBD devices recovered just fine after the reboot.
For future reference, does anyone know how I could have resolved this situation more gracefully, without rebooting a live master node?
Thanks for your patience in getting to the end of this shaggy dog story!
-jm