Recording: https://youtu.be/ne2vPgaqXV8
Attendees: Aseef Imran, fossedihelm, Orel Misan, Beñat Gartzia Arruabarrena, Edward Haas, Alan Caldelas, Igor Bezukh
Introductions: Welcome everyone to the KubeVirt weekly sig-compute meeting
Do we have any new members this week that would like to introduce themselves?
Quarantined tests: https://github.com/kubevirt/kubevirt/issues/17720
Agenda and Notes:
[aseef] Currently, AllowWorkloadDisruption must be set to true in order to use postcopy. We should change that.
Original motivation:
With postcopy, there is a chance of losing the workload if the network fails during the postcopy phase.
While active, postcopy also fetches memory over the network on page faults. This can have a negative impact on the workload.
Rebuttal:
What are the chances of losing the network in exactly the few hundred milliseconds during which postcopy is active?
Besides, QEMU provides a recovery mode for postcopy in case of a pathological network failure.
However, KubeVirt does not currently implement this recovery mode.
Academic literature shows that using postcopy instead of stop-and-copy for the few hundred milliseconds during which the VM would otherwise be paused actually reduces the impact on the workload. [Petter Svärd, Benoit Hudzia, Steve Walsh, Johan Tordsson, and Erik Elmroth. 2015. Principles and Performance Characteristics of Algorithms for Live VM Migration. SIGOPS Oper. Syst. Rev. 49, 1 (January 2015), 142–155. https://doi.org/10.1145/2723872.2723894]
Agreement: This should be fine if we add another option (exact API specs TBD) that expresses a middle ground between “allow disruption” and “don’t allow disruption”.
[bgartzia] VEP144: reservedOverhead
It was decided to push it back, change the API, and decouple pod sizing from memlock limits.
There is a draft VEP extension (#316) that covers these concerns.
Open Floor:
VEP Tracker:
Pull Requests that need attention: https://github.com/kubevirt/kubevirt/pulls?q=is%3Aopen+is%3Apr+label%3Asig%2Fcompute
Bug/issue scrub: https://github.com/kubevirt/kubevirt/issues?q=is%3Aopen%20is%3Aissue%20label%3Asig%2Fcompute%20-label%3Akind%2Fflake
(/triage {accepted | build-watcher | duplicate | needs-information | not-reproducible | unresolved})
Flake issue/PR scrub:
(/triage {accepted | build-watcher | duplicate | needs-information | not-reproducible | unresolved})
Zoom chat: