VM pause on IO error


Zang Li

Jun 14, 2021, 2:09:42 PM
to kubevirt-dev
Hi,

While simulating a storage access failure, I found that the VM is
automatically paused when storage access fails. I saw the following
event being generated on the VMI:

Warning  IOerror  3m29s (x9 over 9m34s)  virt-handler  VM Paused due to IO error at the volume: vm2-boot-dv

However, the VMI still shows as Running rather than Paused:
# k get vmi vm2 -o wide
NAME   AGE   PHASE     IP              NODENAME   LIVE-MIGRATABLE   PAUSED
vm2    18m   Running   192.168.1.128   nvm-w1     False

Moreover, the VM cannot recover after the storage access is restored.

Shouldn't the correct status be shown in the output above? Right now
this error is hard to detect unless we check the event log.
Also, if we cannot automatically recover from the storage access
failure, why do we put the VM into a paused state instead of stopping
it and reporting the failure?
What is the advantage of being in the paused state?

Thanks,
Zang

Luboslav Pivarc

Jun 14, 2021, 2:46:29 PM
to Zang Li, kubevirt-dev
Hi,

I was recently working on https://github.com/kubevirt/kubevirt/pull/5401, which should propagate the pause state as a pause condition. Can you share which version you are using?
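
Once that lands, the condition should show up in the VMI status; something along these lines should display it (I'm assuming here that the condition type ends up being called Paused):

kubectl get vmi vm2 -o jsonpath='{.status.conditions[?(@.type=="Paused")]}'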

I believe we should resume the VM in one of the next sync cycles. Can you share how you simulated the storage failure?

Thanks,
Lubo


Zang Li

Jun 14, 2021, 3:05:32 PM
to Luboslav Pivarc, kubevirt-dev
Hi Lubo,

Glad to know you have already fixed the status propagation! I tested
on KubeVirt version 0.40.0. Once the pause state is propagated, will
I be able to manually unpause the VMI after storage access is
restored?

Yes, I also think we should automatically recover once storage access
is restored. I used NFS as my shared storage. I simulated the storage
failure by adding an iptables rule on the NFS server machine to drop
all traffic from the particular k8s cluster node that hosts the VM.
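
Roughly like this, run on the NFS server (the node IP is a placeholder):

iptables -I INPUT -s <k8s-node-ip> -j DROP    # drop all traffic from the k8s node hosting the VM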

Zang

Zang Li

Jun 14, 2021, 3:21:58 PM
to Luboslav Pivarc, kubevirt-dev
Another question: once the VM falls into this state, how can I delete it from the local virt-handler? I tried to delete the VM by calling processVMShutdown or processVMDelete from virt-handler, but neither of them worked, and I got the error messages below:

{"component":"virt-handler","level":"info","msg":"##### Shutting down vmi vm2","pos":"vm.go:1203","timestamp":"2021-06-13T19:49:55.243461Z"}
{"component":"virt-handler","level":"info","msg":"##### Failed to shut down vmi vm2: server error. command Shutdown failed: \"LibvirtError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')\"","pos":"vm.go:1207","timestamp":"2021-06-13T19:49:55.247779Z"}
{"component":"virt-handler","level":"info","msg":"##### Try deleting VMI vm2","pos":"vm.go:1208","timestamp":"2021-06-13T19:49:55.247833Z"}
{"component":"virt-handler","kind":"VirtualMachineInstance","level":"info","msg":"Signaled deletion for vm2","name":"vm2","namespace":"default","pos":"vm.go:1954","timestamp":"2021-06-13T19:49:55.248222Z","uid":"d81c5e51-4bdc-48ac-bc8b-6a635a656d01"}
{"component":"virt-handler","level":"info","msg":"##### Failed to delete vmi vm2: server error. command Delete failed: \"LibvirtError(Code=55, Domain=10, Message='Requested operation is not valid: cannot undefine transient domain')\"","pos":"vm.go:1212","timestamp":"2021-06-13T19:49:55.249461Z"}

Thanks,
Zang

Luboslav Pivarc

Jun 15, 2021, 10:08:13 AM
to Zang Li, Vladik Romanovsky, kubevirt-dev
Hi,

Can you verify that the launcher pod can still access the filesystem? I will have a look at whether we can test this in CI and whether my assumption is correct.
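
For example, something along these lines (the mount path and container name are from memory and may differ in your setup):

kubectl exec <virt-launcher-pod> -c compute -- ls -l /var/run/kubevirt-private/vmi-disks/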

I wonder how the domain becomes transient. @Vladik Romanovsky any thoughts?

Thanks,
Lubo

Zang Li

Jun 15, 2021, 10:19:37 AM
to Luboslav Pivarc, Vladik Romanovsky, kubevirt-dev
Hi Lubo, 

Does the launcher pod need to access the storage for deletion? I hit the issue when the host lost access to the storage and the vm was put into PAUSE state due to IO error. I am looking for a way to force delete the launcher pod from the virt-handler on the same node. 

Zang

dvo...@redhat.com

Jun 15, 2021, 2:12:13 PM
to kubevirt-dev
On Tuesday, June 15, 2021 at 10:19:37 AM UTC-4 Zang Li wrote:
Hi Lubo, 

Does the launcher pod need to access the storage for deletion? I hit the issue when the host lost access to the storage and the vm was put into PAUSE state due to IO error. I am looking for a way to force delete the launcher pod from the virt-handler on the same node. 

Deleting the virt-launcher pod should work, does it not?

kubectl delete pod <virt launcher pod name>

Zang Li

Jun 15, 2021, 2:16:09 PM
to dvo...@redhat.com, kubevirt-dev
Hi David, 

Yea, that should work in normal cases, but I am looking for ways to delete the pod from virt-handler without going through the API server in case the connection to the API server is broken.

Thanks,
Zang 

dvo...@redhat.com

Jun 15, 2021, 2:41:00 PM
to kubevirt-dev
On Tuesday, June 15, 2021 at 2:16:09 PM UTC-4 Zang Li wrote:
Hi David, 

Yea, that should work in normal cases, but I am looking for ways to delete the pod from virt-handler without going through the API server in case the connection to the API server is broken.

If the API server is broken, how would you know the VMI needs to be shut down? The shutdown is coordinated via the API server.

We've never needed to implement this at the virt-handler level because we depend on the kubelet to do it for us. Like a Pod, the way to shut down a VMI is simply to delete the VMI. Virt-handler attempts to gracefully shut down the deleted VMI before SIGKILL is sent by the kubelet, but if virt-handler fails, the kubelet ultimately has the final say.

All that said, if you need virt-handler to have the final say for some reason, you'd need to replicate the kubelet's pod shutdown logic of sending SIGTERM and eventually SIGKILL to the virt-launcher process.
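
In shell terms that would be roughly the following (the PID lookup and the grace period are placeholders here, not an existing virt-handler feature):

kill -TERM <virt-launcher-pid>    # ask virt-launcher to shut the domain down gracefully
sleep <grace-period-seconds>      # wait out the termination grace period
kill -KILL <virt-launcher-pid>    # force kill if it is still running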

Zang Li

Jun 15, 2021, 3:02:00 PM
to dvo...@redhat.com, kubevirt-dev
Thanks David, I get what you mean. So even when virt-handler shuts down the VM, the final pod deletion is still done by the kubelet? From my test, once virt-handler shuts down the VM the pod is removed, but there might be some other mechanism at work here; let me look into it further. On the other hand, I was hoping to achieve the same effect as a shutdown when the VM is in the Paused state, i.e. stop the qemu process.


Alice Frosi

Jun 16, 2021, 5:49:42 AM
to Zang Li, dvo...@redhat.com, kubevirt-dev
Hi,

I don't think it is possible to achieve automatic deletion of the paused VM without some custom logic. Libvirt pauses the VM because of the IO error. It could happen that you are able to recover from the error and unpause the VM; libvirt/KubeVirt doesn't know how you want to handle IO errors.

It seems to me that you want a monitoring mechanism that watches for IO error events and eventually deletes the VMI if you are not able to recover. Have I interpreted your issue correctly? Please note also that the storage is not managed by KubeVirt directly, but by the CSI storage driver.
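
As a rough sketch, such a watcher could start from something like this (the reason string is taken from the event you posted, and the field selector may need adjusting):

kubectl get events --field-selector reason=IOerror,involvedObject.kind=VirtualMachineInstance --watch

and then delete the VMI (kubectl delete vmi <name>) once it decides recovery is not possible.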

Regards,
Alice

Zang Li

Jun 17, 2021, 12:45:46 AM
to Alice Frosi, dvo...@redhat.com, kubevirt-dev
Hi Alice, 

Yes, I am working on an issue similar to what you described, and I understand that I need to add some logic to the code to achieve this. I was hoping there was an existing function to call, but I didn't find one. Maybe I will just write one to force-kill the process.

Thanks,
Zang

Luboslav Pivarc

Jun 17, 2021, 8:39:12 AM
to Zang Li, Alice Frosi, dvo...@redhat.com, kubevirt-dev
Hi Zang,
I believe we have a bug here, and it might be the root cause of why you can't stop the VM or recover after the storage becomes accessible again.

https://github.com/kubevirt/kubevirt/issues/5859

Zang Li

Jun 17, 2021, 1:05:20 PM
to Luboslav Pivarc, Alice Frosi, dvo...@redhat.com, kubevirt-dev
Hi Lubo, 

Aha, great finding! Thank you very much for working on it!

Best,
Zang

Zang Li

Jul 20, 2021, 4:41:46 PM
to Luboslav Pivarc, Alice Frosi, dvo...@redhat.com, kubevirt-dev
Hi Lubo, 

I just tried the same test on KubeVirt v0.43.0. Now I don't see the IOerror event anymore when I add the iptables rule to block the VM's access to the storage;
instead, the VM shows as normal. After I connected to the VM through the console and tried some write operations (which hung, as expected),
I saw the following warning event on the VMI:

  Warning  SyncFailed        2s (x16 over 5m3s)  virt-handler               unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Did you change the behavior? In my opinion, this is not better than the old behavior: the VM fails silently, and it is now even harder to tell the reason from the event.

Thanks,
Zang


Luboslav Pivarc

Jul 21, 2021, 5:26:05 AM
to Zang Li, Alice Frosi, dvo...@redhat.com, kubevirt-dev
Hi,

There should not be a change in behavior; this should also be enforced by this test (https://github.com/kubevirt/kubevirt/pull/5864).
Can you open an issue with logs from the handler and the launcher?

Thanks,
Lubo

Zang Li

Jul 26, 2021, 8:58:04 PM
to Luboslav Pivarc, Alice Frosi, dvo...@redhat.com, kubevirt-dev
Hi Lubo, 

virt-launcher doesn't have any related error log. virt-handler just has the following errors repeatedly:

{"component":"virt-handler","level":"info","msg":"Processing event default/vm2","pos":"vm.go:1554","timestamp":"2021-07-27T00:46:38.090736Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm2","namespace":"default","pos":"vm.go:1556","timestamp":"2021-07-27T00:46:38.090779Z","uid":"22e0184e-c6c2-4b25-9b4f-3e1fd24994d1"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"vm2","namespace":"default","pos":"vm.go:1561","timestamp":"2021-07-27T00:46:38.090805Z","uid":""}
{"component":"virt-handler","kind":"","level":"error","msg":"Synchronizing the VirtualMachineInstance failed.","name":"vm2","namespace":"default","pos":"vm.go:1693","reason":"unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded","timestamp":"2021-07-27T00:46:58.094210Z","uid":"22e0184e-c6c2-4b25-9b4f-3e1fd24994d1"}
{"component":"virt-handler","level":"info","msg":"re-enqueuing VirtualMachineInstance default/vm2","pos":"vm.go:1350","reason":"unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded","timestamp":"2021-07-27T00:46:58.099108Z"}

I filed https://github.com/kubevirt/kubevirt/issues/6126 for this. Could you please try the same operation on your setup?

Thanks,
Zang



