Petter,
Thanks for the tip. lsof did indeed help me find the non-Ganeti process that was holding the DRBD device open. Here's the full sequence of events:
This all started when I tried to remove the instance named "inventory-a":
# gnt-instance remove inventory-a.intranet.psfc.coop
This will remove the volumes of the instance
inventory-a.intranet.psfc.coop (including mirrors), thus removing all
the data of the instance. Continue?
y/[n]/?: y
Thu Sep 1 12:21:51 2022 - WARNING: Could not remove disk 0 on node goji-a.intranet.psfc.coop, continuing anyway: drbd6: can't shutdown drbd device: resource6: State change failed: (-12) Device is held open by someone\nadditional info from kernel:\nfailed to demote\n; Can't lvremove: exited with exit code 5 - Logical volume ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_data is used by another device.\n; Can't lvremove: exited with exit code 5 - Logical volume ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_meta is used by another device.\n
Failure: command execution error:
Can't remove instance's disks
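For anyone hitting the same "(-12) Device is held open by someone" error: DRBD refuses to demote or detach a resource while any process still has the device node open, so the first step is to identify the holder. A sketch of the usual status checks (resource name taken from the error above; which command applies depends on your DRBD version):

```shell
# DRBD 8.x exposes per-resource state (and the open count in the
# "ro:"/"ds:" fields) in /proc/drbd:
cat /proc/drbd

# With drbd-utils 9, the equivalent overview is:
drbdadm status
```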
Next, I ran lsof piped through grep to see all the open handles on device /dev/drbd6:
# lsof | grep -E "COMMAND|drbd6"
COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
mms 1165543 1165678 mms root 202r BLK 147,6 0t0 411031582 /dev/drbd6
drbd6_sub 1956253 root cwd DIR 253,0 243 512 /
drbd6_sub 1956253 root rtd DIR 253,0 243 512 /
drbd6_sub 1956253 root txt unknown /proc/1956253/exe
drbd6_sub 2246791 root cwd DIR 253,0 243 512 /
drbd6_sub 2246791 root rtd DIR 253,0 243 512 /
drbd6_sub 2246791 root txt unknown /proc/2246791/exe
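As an aside, fuser (from psmisc) answers "who has this node open?" more directly than a full lsof scan. A self-contained sketch, demonstrated on a temp file so it runs anywhere; on the real system the same invocation works against /dev/drbd6:

```shell
# Hold a temp file open in the background, then ask fuser who has it.
# On the affected node you would run: fuser -v /dev/drbd6
tmpfile=$(mktemp)
tail -f "$tmpfile" &      # background process keeping the file open
holder=$!
sleep 1                   # give tail a moment to open the file
fuser -v "$tmpfile"       # lists PID, user, access mode, and command
kill "$holder"
rm -f "$tmpfile"
```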
Hmm. What's this "mms" process, with pid 1165543?
# ps auxwww | grep 1165543
root 1053047 0.0 0.0 12136 1148 pts/0 S+ 10:54 0:00 grep --color=auto 1165543
root 1165543 0.3 0.0 0 0 ? Zsl Mar23 802:24 [mms]
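The STAT column is the key detail here: "Zsl" decodes as Z = zombie (exited but not yet reaped by its parent), s = session leader, l = multi-threaded. A sketch of how to read a process's state directly, shown against the current shell's own PID ($$) so it is runnable; substitute 1165543 on the affected node:

```shell
# STAT decoding: Z = zombie, s = session leader, l = multi-threaded.
# $$ is this shell's PID; on the affected node use 1165543 instead.
ps -o pid,stat,comm -p $$
grep -E '^(State|Threads)' /proc/$$/status
```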
Turns out, it's the "Acronis Machine Management Service", part of the cloud backup solution we use. Attempting to stop it resulted in this nastiness:
# journalctl -u acronis_mms
-- Logs begin at Wed 2022-03-23 09:15:15 EDT, end at Fri 2022-09-02 11:11:15 EDT. --
Sep 01 21:42:37 goji-a.intranet.psfc.coop systemd[1]: Stopping Acronis machine management service...
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: State 'stop-sigterm' timed out. Killing.
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 1165543 (mms) with signal SIGKILL.
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 861846 (mount) with signal SIGKILL.
Sep 01 21:44:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Processes still around after SIGKILL. Ignoring.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: State 'stop-final-sigterm' timed out. Killing.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 1165543 (mms) with signal SIGKILL.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 861846 (mount) with signal SIGKILL.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Processes still around after final SIGKILL. Entering failed mode.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Failed with result 'timeout'.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: Stopped Acronis machine management service.
As you can see, the mms process cannot be killed, even with SIGKILL!
# systemctl | grep -i acronis
● acronis_mms.service loaded failed failed Acronis machine management service
Even worse, this mms process is locking ALL the DRBD devices!
# lsof | grep -E "COMMAND|mms.+drbd"
COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
mms 1165543 1165678 mms root 77r BLK 147,12 0t0 584468762 /dev/drbd12
mms 1165543 1165678 mms root 200r BLK 147,2 0t0 303346027 /dev/drbd2
mms 1165543 1165678 mms root 201r BLK 147,5 0t0 387574301 /dev/drbd5
mms 1165543 1165678 mms root 202r BLK 147,6 0t0 411031582 /dev/drbd6
mms 1165543 1165678 mms root 203r BLK 147,1 0t0 411518809 /dev/drbd1
mms 1165543 1165678 mms root 204r BLK 147,9 0t0 460792479 /dev/drbd9
mms 1165543 1165678 mms root 205r BLK 147,10 0t0 583993543 /dev/drbd10
mms 1165543 1165678 mms root 206r BLK 147,11 0t0 584065625 /dev/drbd11
mms 1165543 1165678 mms root 207r BLK 147,4 0t0 640055854 /dev/drbd4
mms 1165543 1165678 mms root 208r BLK 147,13 0t0 702959941 /dev/drbd13
mms 1165543 1165678 mms root 209r BLK 147,15 0t0 839105112 /dev/drbd15
mms 1165543 1165678 mms root 210r BLK 147,0 0t0 964966725 /dev/drbd0
mms 1165543 1165678 mms root 211r BLK 147,17 0t0 998594489 /dev/drbd17
How to resolve this unpleasant situation, I wondered. Here's what I did:
1. Use "gnt-instance migrate" to get the local instances over to the secondary node. This was successful, even though it threw a non-critical "can't flip DRBD state" error for each instance.
2. Stop and disable all the Acronis services, to prevent the MMS service from running again, and the unexpected device lock from recurring.
3. Reboot the system. Even though this was still the master node, I felt that attempting a "gnt-master failover" and then removing the node from the cluster, as I usually do before any reboot, could have made things worse.
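For the record, those three steps boil down to something like the following sketch (the instance name here is hypothetical; the list of Acronis units to disable is whatever systemctl reports on your node):

```shell
# 1. Evacuate each primary instance to its secondary node
#    (instance name hypothetical; repeat per instance on this node)
gnt-instance migrate -f some-instance.intranet.psfc.coop

# 2. Find and disable every Acronis unit so mms cannot come back
systemctl list-units 'acronis*'
systemctl disable --now acronis_mms.service

# 3. Reboot to clear the unkillable process and its device handles
systemctl reboot
```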
Fortunately, when the system came back up, all the DRBD devices were healthy within a few minutes, Ganeti cluster operations returned to normal, and I was able to remove the "inventory-a" instance as originally planned.
Whew!
BTW, I am impressed by the robustness of DRBD -- all 13 active DRBD devices recovered just fine after the reboot.
For future reference, does anyone know how I could have resolved this situation more gracefully, without rebooting a live master node?
Thanks for your patience in getting to the end of this shaggy dog story!
-jm