--
You received this message because you are subscribed to the Google Groups "kubevirt-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubevirt-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubevirt-dev/CANwfQB-gwcZvQtzmn%2BLxg0BU4F5L6W%2BNFYr%2B-MXFfkqeuFu7Fw%40mail.gmail.com.
Hi Alexander and Vasiliy,

If live migration can insert the extra volume into a new virt-launcher pod, why not make that the default behavior whenever live migration is supported? I.e., every time you do a hotplug, migrate the old VM to a new one with the hotplugged disk. That way, we don't need to keep the extra pods around.
- We'd need the ability to migrate VMs on the same host. There are some limitations here that we hit within the virt stack, but beyond that it presents issues with available resources: we'd need to be able to schedule a pod with identical memory requests/limits on the same node, and that node may or may not have capacity for a second target pod, which would prevent the hotplug from occurring.
- Migrations can be slow (depending on the workload), reducing the responsiveness of the hotplug.
- Depending on the use case and how the network is set up, migrating a VM causes a new pod IP, which isn't something everyone can handle.
Hi David and Vasiliy,

Thanks for listing the limitations. Since we know on our side whether a VM is migratable or not, we should be able to tell when to do a migration and when not to. I believe live migration is a highly desired feature for every production workload, so it should be a more common use case than the non-migratable ones.

> We'd need the ability to migrate VMs on the same host. [...] that node may or may not have capacity for a second target pod, which would prevent the hotplug from occurring.

Agree that it would be best to migrate the VM to the same host, but is this a must-have? As long as we can do live migration, the migration should be transparent to users, with only a very short interruption. In most cases, users probably don't care which node the VM resides on, as long as it is within a particular node pool. Alternatively, we can block the operation if the VM has a node selector but the node doesn't have enough resources.

A side question on this: is it possible to increase the CPU/memory as well during a live migration? This falls into a more general question and use case: what changes can be made without restarting a VM?

> Migrations can be slow (depending on the workload), reducing the responsiveness of the hotplug.

Do we have any benchmark results on how long a migration takes to finish?

> Depending on the use case and how the network is set up, migrating a VM causes a new pod IP, which isn't something everyone can handle.

I believe a VM is not migratable if it uses the pod IP directly, because the pod IP cannot persist. Please correct me if I am wrong.

Thanks,
Zang
On Thu, Apr 8, 2021 at 2:53 PM Zang Li <zan...@google.com> wrote:

> Agree that it would be best to migrate the VM to the same host, but is this a must-have? As long as we can do live migration, the migration should be transparent to users, with only a very short interruption. In most cases, users probably don't care which node the VM resides on, as long as it is within a particular node pool. Alternatively, we can block the operation if the VM has a node selector but the node doesn't have enough resources.

Migrating to the same node is not necessarily a "must-have" for what you are describing to work. However, there are situations where we're limited if "same node" live migration isn't possible, for example if local storage PVCs are being hotplugged to a VMI. We support hotplugging local storage PVCs today, but that would get tricky if we had to depend on migration for hotplug.

> A side question on this: is it possible to increase the CPU/memory as well during a live migration? This falls into a more general question and use case: what changes can be made without restarting a VM?

I haven't put much thought into this, but theoretically we could give the target pod increased CPU/memory requests and limits, then hotplug those additional resources into the running qemu guest after the migration completes.

> Do we have any benchmark results on how long a migration takes to finish?

I've seen non-public benchmarks, and migration times vary depending on the cluster hardware, the guest workload, and the migration tunings KubeVirt is configured with, so it's hard to give accurate numbers here that make any sense.

If we're optimizing for getting the guest workload running on the target pod ASAP (to access a hotplugged disk), we do have the ability to perform post-copy migration, which transfers the workload to the target pod nearly immediately while we continue to stream contents from the source pod. This is similar in concept to what GCP is doing behind the scenes with their compute instance live migration. From a workload performance perspective, any memory-intensive workload that needs low-latency guarantees is definitely going to notice a performance hit while the contents continue to stream from source to target.

All that said, even with post-copy, things can get kind of weird if we use live migration for hotplug. Streaming all the contents of a VM across the network is bandwidth intensive. If we're doing this every time someone hotplugs or unplugs a disk, there's a limit to how many of those migrations can occur in parallel across the cluster. The idea with same-node migration is that it perhaps helps alleviate some of those bandwidth concerns as well.

> I believe a VM is not migratable if it uses the pod IP directly, because the pod IP cannot persist. Please correct me if I am wrong.

Yep, when we're using the "bridge" network binding with the pod network, you're correct: we don't allow migration for VMIs with pod network + bridge binding. However, when using the "masquerade" binding with the pod network, we set up NAT and give the guest an IP address that remains consistent throughout a live migration. So after a live migration occurs, the pod IP technically changes, but the IP the VM guest sees internally on eth0 remains consistent.

I was thinking about the "masquerade" binding when I wrote that bullet point earlier, but I don't think my argument is very strong. It's true that with the masquerade binding the VM's IP within the cluster will change, which might impact how other endpoints contact the VM, but we have exactly the same problem when a VM is stopped and started.
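To make the two bindings concrete, here is a minimal sketch of the relevant VMI spec fragment. The field names follow the KubeVirt VirtualMachineInstance API (interfaces with `masquerade`/`bridge`, a `pod` network); the VMI name and the rest of the spec are omitted, so treat this as an illustration rather than a complete manifest.

```shell
# Emit a minimal VMI spec fragment showing the interface binding
# discussed above. With masquerade + pod network, the guest-visible IP
# survives live migration; bridge + pod network blocks migration.
VMI_SPEC=$(mktemp)
cat > "$VMI_SPEC" <<'EOF'
spec:
  domain:
    devices:
      interfaces:
      - name: default
        masquerade: {}   # NAT; guest keeps its internal IP across migration
        # bridge: {}     # with the pod network, this disallows migration
  networks:
  - name: default
    pod: {}
EOF
cat "$VMI_SPEC"
```

Note that even with masquerade, the pod IP other cluster endpoints use still changes after migration, which is the caveat discussed above.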
Hey,

Trying to bring up this discussion again :)

I was looking into the topic of migrating VMs with hotplugged disks recently. Currently there is a draft PR [1] that implements the approach with a single attachment pod holding all the volumes. I tried to enable live migration using that PR as a base, and in the end I think I managed to shape it into a working state. I created a WIP PR [2] and would like to get some feedback from the community on it. There is still work to be done, but the approach in general is there. I hope that helps us move forward and bring the feature closer to release, as there is currently interest and a strong wish to leverage it in other projects.

Thanks,
Vasiliy
Hi Alexander,

Thank you for reviewing the PR. [1] works pretty well, so it does not slow me down or cause trouble :) Still, there are things in the TODO.

Apart from what is mentioned in the PR (code cleanup and testing), there is currently an issue that the attachment pod is not deleted after a successful migration. My assumption was that it should be destroyed along with the virt-launcher pod. In general that is the case, but after a migration virt-launcher stays around in the "Completed" state and only manual deletion removes it from the cluster. So currently I am not sure whether I need to explicitly delete the attachment pod. Since virt-launcher is not deleted after migration, I assume there is some reasoning behind that; maybe it then also makes sense to keep the attachment pod around as long as virt-launcher is there. At least that seems like the easiest and most straightforward approach. Any thoughts on that?
Yeah, right. Probably it is better to clean up the 'orphaned' attachment pods explicitly. It's not very difficult to implement anyway.

I actually faced one more issue, and I've been looking into it for some time already. It is reproducible with the functional test available in the PR and happens with block volumes: during preparation of the target virt-launcher pod, I write to the cgroup device controller file ('devices.allow') to allow access to the volume and log whether that succeeds. But when the preparation is done and the target virt-launcher tries to resume the VM, I get an 'Operation not permitted' error, and for some reason the rule allowing the block device is no longer present in 'devices.list'. So it looks like the rule is dropped by someone else in the meantime (the container runtime?). Initially I thought this happened because the initialization of one of the virt-launcher containers had not finished and the container runtime was overwriting the device rules, but that doesn't seem to be the case, as I tried waiting until the pod was running. Interestingly, the issue is not always reproducible. Usually it happens after several migrations, e.g. when I try to migrate the VM back to the original node (i.e. node02 -> node01 -> node02), but sometimes it does not happen at all. It seems like a floating issue. Has anyone seen something similar before?
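For readers unfamiliar with the mechanism, the preparation step described here can be sketched as below. This is a hedged illustration, not KubeVirt's actual code: a scratch directory stands in for the real per-container cgroup path (something like /sys/fs/cgroup/devices/kubepods/...), and the device numbers (major 8, minor 16) are made up.

```shell
# Sketch of the cgroup v1 devices-controller write described above.
# A temp directory stands in for the real cgroup path so this is safe
# to run anywhere; device numbers 8:16 are illustrative only.
CGROUP=$(mktemp -d)
touch "$CGROUP/devices.allow" "$CGROUP/devices.list"

# Grant read/write/mknod access to block device major 8, minor 16.
echo 'b 8:16 rwm' > "$CGROUP/devices.allow"

# On a real node the kernel reflects granted rules in devices.list;
# the bug discussed in this thread is that the rule later vanishes
# from there when something rewrites the container's cgroup.
cat "$CGROUP/devices.allow"
```

The symptom reported above is that a rule written this way is present immediately after the write but gone from devices.list by the time the VM resumes.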
Hey there,

I think I have an update on the issue with the cpumanager. I found this commit [1] that actually fixes the problem upstream at the runc level. In fact, the cpumanager updates only cgroup parameters related to CPU, memory, and block I/O; it does not explicitly reset the device rules. It seems it was the implementation of the `update` call on the runc side that was dropping all the rules (apparently it was just reapplying the config to match the existing container spec, and thereby dropping all the 'extra' device rules).

I rebuilt runc from the master branch with that fix included, and so far I do not observe any issues. I can confirm that the hotplug device rules are no longer dropped. But the fix is pretty new (a couple of weeks old) and has not even been released yet. So the problem has been fixed, but the fix is not yet *widely* available. I wonder how to tackle this on the KubeVirt side; I do not see any reliable workaround. Any thoughts?
On Wed, Jun 16, 2021 at 1:14 PM Vasiliy Ulyanov <vasil...@gmail.com> wrote:

> So it seems the problem has been fixed, but the fix is not yet *widely* available. I just wonder how to tackle this on the KubeVirt side. I do not see any reliable workaround for this. Any thoughts?

I am trying to decide whether it is worth trying to build a workaround until the actual fix is widely available. I guess the earliest we could see the fix is k8s 1.22? Maybe we can just document that live migration with a hotplugged block device doesn't work reliably until k8s 1.22. Having to build a workaround and then remove it later seems like a waste of time, but I do want to know what others think about this.