Can't remove instance due to stuck DRBD device


bamblew...@gmail.com

Sep 1, 2022, 12:34:16 PM
to ganeti
Dear Ganeti experts,

I have an instance with some serious disk corruption, and I am trying to remove it and recreate it from a Ganeti backup. But Ganeti won't let me, due to contention on its DRBD device. Here's the error I get:

# gnt-instance remove inventory-a.intranet.psfc.coop
This will remove the volumes of the instance
inventory-a.intranet.psfc.coop (including mirrors), thus removing all
the data of the instance. Continue?
y/[n]/?: y
Thu Sep  1 12:21:51 2022  - WARNING: Could not remove disk 0 on node goji-a.intranet.psfc.coop, continuing anyway: drbd6: can't shutdown drbd device: resource6: State change failed: (-12) Device is held open by someone\nadditional info from kernel:\nfailed to demote\n; Can't lvremove: exited with exit code 5 -   Logical volume ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_data is used by another device.\n; Can't lvremove: exited with exit code 5 -   Logical volume ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_meta is used by another device.\n
Failure: command execution error:
Can't remove instance's disks


How do I free the lock on the DRBD device, so removal can succeed?

Thanks.

-jm

bamblew...@gmail.com

Sep 1, 2022, 12:50:32 PM
to ganeti
Oh, forgot to mention:

OS: Rocky 8.6
Hypervisor: KVM
Ganeti 3.0.2

-jm

bamblew...@gmail.com

Sep 1, 2022, 2:53:16 PM
to ganeti
Poking around on the Ganeti node (which is the master), I managed to see:

# gnt-instance list inventory-a
- Instance name: inventory-a.intranet.psfc.coop
    - disk/0: drbd, size 30.0G
      on secondary: /dev/drbd6 (147:6) in sync, status *DEGRADED*
          logical_id: ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_data
          logical_id: ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_meta

# dmsetup info -c | grep 38d8.disk0
ganeti-65eb82ed--aa8e--453f--920c--bd88443c38d8.disk0_data 253  15 L--w    1    1      0 LVM-CXLC7dGbd7ReuHUUrKmEvM2F6pOmX0O2N9i6XezB0tk0UGn2VoHxhqkXv5YPS4lT
ganeti-65eb82ed--aa8e--453f--920c--bd88443c38d8.disk0_meta 253  16 L--w    1    1      0 LVM-CXLC7dGbd7ReuHUUrKmEvM2F6pOmX0O2w8aksC51MEVCs6gmMtlaT2nHRZhq5K1M

# ls -al /sys/dev/block/253\:15/holders
total 0
drwxr-xr-x. 2 root root 0 May 19 12:26 .
drwxr-xr-x. 9 root root 0 Mar 28 16:40 ..
lrwxrwxrwx. 1 root root 0 Sep  1 14:33 drbd6 -> ../../drbd6

# cat /proc/drbd
version: 8.4.10 (api:1/proto:86-101)
srcversion: 4C7330B3F407CB1B288D73A
.
.
.
 6: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
    ns:308583881 nr:0 dw:308584036 dr:108128470 al:10099 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

So, /dev/drbd6 still exists. How do I remove it cleanly, along with its underlying LVs? I don't want to leave it there, broken. I want all the pieces removed cleanly, as I plan to use "gnt-backup import" to recreate the instance with the same name as before.

Thanks.

-jm

Petter Urkedal

Sep 2, 2022, 3:19:27 AM
to gan...@googlegroups.com
lsof can give a hint of what is hogging the device. Two possibilities
are that the partition table is exposed with kpartx (under /dev/mapper)
or that multipathd is running.
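
To make that concrete, a rough sketch of those checks (the device path and major:minor numbers are just the examples from this thread; substitute the stuck resource):

```shell
# Find what is holding a DRBD device open (paths are this thread's example).
DEV=/dev/drbd6
lsof "$DEV" 2>/dev/null | awk 'NR > 1 { print $2 }' | sort -u   # PIDs holding the device
ls /sys/dev/block/147:6/holders                                 # kernel-level holders
dmsetup ls | grep -i drbd                                       # kpartx partition maps, if any
systemctl is-active multipathd                                  # is multipathd running?
```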

Petter

bamblew...@gmail.com

Sep 6, 2022, 4:15:54 PM
to ganeti
Petter,

Thanks for the tip. lsof did help me find the non-Ganeti process that was locking the DRBD device. Here's the full sequence of events:

First, this all started when I tried to remove the instance named "inventory-a":

# gnt-instance remove inventory-a.intranet.psfc.coop
This will remove the volumes of the instance
inventory-a.intranet.psfc.coop (including mirrors), thus removing all
the data of the instance. Continue?
y/[n]/?: y
Thu Sep  1 12:21:51 2022  - WARNING: Could not remove disk 0 on node goji-a.intranet.psfc.coop, continuing anyway: drbd6: can't shutdown drbd device: resource6: State change failed: (-12) Device is held open by someone\nadditional info from kernel:\nfailed to demote\n; Can't lvremove: exited with exit code 5 -   Logical volume ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_data is used by another device.\n; Can't lvremove: exited with exit code 5 -   Logical volume ganeti/65eb82ed-aa8e-453f-920c-bd88443c38d8.disk0_meta is used by another device.\n
Failure: command execution error:
Can't remove instance's disks

Next, I ran lsof piped to grep to see all the locks on device /dev/drbd6:

# lsof | grep -E "COMMAND|drbd6"
COMMAND       PID     TID TASKCMD          USER   FD      TYPE             DEVICE     SIZE/OFF       NODE NAME
mms       1165543 1165678 mms              root  202r      BLK              147,6          0t0  411031582 /dev/drbd6
drbd6_sub 1956253                          root  cwd       DIR              253,0          243        512 /
drbd6_sub 1956253                          root  rtd       DIR              253,0          243        512 /
drbd6_sub 1956253                          root  txt   unknown                                            /proc/1956253/exe
drbd6_sub 2246791                          root  cwd       DIR              253,0          243        512 /
drbd6_sub 2246791                          root  rtd       DIR              253,0          243        512 /
drbd6_sub 2246791                          root  txt   unknown                                            /proc/2246791/exe

Hmm. What's this "mms" process, with pid 1165543?

# ps auxwww | grep 1165543
root     1053047  0.0  0.0  12136  1148 pts/0    S+   10:54   0:00 grep --color=auto 1165543
root     1165543  0.3  0.0      0     0 ?        Zsl  Mar23 802:24 [mms] 

Turns out, it's the "Acronis Machine Management Service", part of the cloud backup solution we use. Attempting to stop it results in this nastiness:

# journalctl -u acronis_mms
-- Logs begin at Wed 2022-03-23 09:15:15 EDT, end at Fri 2022-09-02 11:11:15 EDT. --
Sep 01 21:42:37 goji-a.intranet.psfc.coop systemd[1]: Stopping Acronis machine management service...
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: State 'stop-sigterm' timed out. Killing.
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 1165543 (mms) with signal SIGKILL.
Sep 01 21:43:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 861846 (mount) with signal SIGKILL.
Sep 01 21:44:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Processes still around after SIGKILL. Ignoring.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: State 'stop-final-sigterm' timed out. Killing.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 1165543 (mms) with signal SIGKILL.
Sep 01 21:45:37 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Killing process 861846 (mount) with signal SIGKILL.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Processes still around after final SIGKILL. Entering failed mode.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: acronis_mms.service: Failed with result 'timeout'.
Sep 01 21:46:38 goji-a.intranet.psfc.coop systemd[1]: Stopped Acronis machine management service.

As you can see, the mms process cannot be killed!

# systemctl | grep -i acronis
● acronis_mms.service                                                                                  loaded failed failed    Acronis machine management service                                           

Even worse, this mms process is locking ALL the DRBD devices!

# lsof | grep -E "COMMAND|mms.+drbd"
COMMAND       PID     TID TASKCMD          USER   FD      TYPE             DEVICE     SIZE/OFF       NODE NAME
mms       1165543 1165678 mms              root   77r      BLK             147,12          0t0  584468762 /dev/drbd12
mms       1165543 1165678 mms              root  200r      BLK              147,2          0t0  303346027 /dev/drbd2
mms       1165543 1165678 mms              root  201r      BLK              147,5          0t0  387574301 /dev/drbd5
mms       1165543 1165678 mms              root  202r      BLK              147,6          0t0  411031582 /dev/drbd6
mms       1165543 1165678 mms              root  203r      BLK              147,1          0t0  411518809 /dev/drbd1
mms       1165543 1165678 mms              root  204r      BLK              147,9          0t0  460792479 /dev/drbd9
mms       1165543 1165678 mms              root  205r      BLK             147,10          0t0  583993543 /dev/drbd10
mms       1165543 1165678 mms              root  206r      BLK             147,11          0t0  584065625 /dev/drbd11
mms       1165543 1165678 mms              root  207r      BLK              147,4          0t0  640055854 /dev/drbd4
mms       1165543 1165678 mms              root  208r      BLK             147,13          0t0  702959941 /dev/drbd13
mms       1165543 1165678 mms              root  209r      BLK             147,15          0t0  839105112 /dev/drbd15
mms       1165543 1165678 mms              root  210r      BLK              147,0          0t0  964966725 /dev/drbd0
mms       1165543 1165678 mms              root  211r      BLK             147,17          0t0  998594489 /dev/drbd17

How to resolve this unpleasant situation, I wondered? Here's what I did:

1. Use "gnt-instance migrate" to get the local instances over to the secondary node. This was successful, even though it threw a non-critical "can't flip DRBD state" error for each instance.

2. Stop and disable all the Acronis services, to prevent the MMS service from running again, and the unexpected device lock from recurring.

3. Reboot the system. Even though it was still the master node, I felt that trying to do a "gnt-master failover" and then remove the node from the cluster, as I usually do before any reboot, could have made things worse.
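
For what it's worth, the three steps above might be sketched roughly like this. The node and service names are taken from this thread, and the --filter expression is an assumption about how your instances are placed; adjust for your cluster:

```shell
# 1. Migrate instances whose primary is the affected node (node name from this thread).
for inst in $(gnt-instance list --no-headers -o name \
      --filter 'pnode == "goji-a.intranet.psfc.coop"'); do
  gnt-instance migrate -f "$inst"
done

# 2. Stop and disable the service holding the devices (may time out, as seen above).
systemctl stop acronis_mms || true
systemctl disable acronis_mms

# 3. Reboot to drop the unkillable holder.
systemctl reboot
```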

Fortunately, when the system came back up, all the DRBD devices were healthy within a few minutes. Also, ganeti cluster operations were normal again, and I was able to remove the "inventory-a" instance as originally planned.

Whew!

BTW, I am impressed by the robustness of DRBD: all 13 active DRBD devices recovered just fine after the reboot.

For future reference, does anyone know how I could have resolved this situation more gracefully, without rebooting a live master node?

Thanks for your patience in getting to the end of this shaggy dog story!

-jm

Brian Candler

Sep 7, 2022, 2:50:45 AM
to ganeti
I have no better idea on how to fix it.

I do note however that the process was a zombie (ps showed state Z), which explains why it couldn't be killed - it was already dead, just occupying a slot in the process table waiting to be reaped by its parent.  However, I didn't think a zombie process should be able to keep files open.

It would have been interesting to find out what its parent process was (ppid). If it was 1, then it's a bug in systemd (failing to reap) or the kernel (failing to signal).  If it was not 1, then killing its parent should have caused pid 1 to reap it.
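Checking the parent would have been a one-liner; a sketch, using the PID from this thread's example:

```shell
# Show the zombie's parent PID, state, and command (PID from this thread).
ps -o pid=,ppid=,stat=,comm= -p 1165543
# Or read field 4 of /proc/PID/stat, which is the PPID:
awk '{ print $4 }' /proc/1165543/stat
```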
