guestlog test instability within my instance type live update PR

Lee Yarwood

Jun 6, 2024, 2:49:10 PM
to kubevirt-dev
Hello all,

With FF looming I'd appreciate some help understanding why the
guestlog tests are so unstable within the following PR introducing
live update support for instance types:

instancetype: Support Live Updates
https://github.com/kubevirt/kubevirt/pull/11455

An example failure can be seen below:

https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute/1798767730631905280

I've been working with Felix for the last few days to identify a
reproducer and understand the issue, but it only appears to reproduce
in full CI runs with the full series present. An attempt to bisect the
series in CI didn't get us anywhere, and I feel like I'm back to square
one again.

My working assumption has been that the issue is caused by the series
moving hotplug defaulting into the VMI mutation webhook, as these
tests only use VMIs. I had assumed the VMIs end up under memory
pressure because we allocate the calculated max guest value, and that
this is somehow behind the instability in virtlogd etc., but I've not
been able to confirm that yet.
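To make that assumption a bit more concrete, here's a minimal sketch of
the kind of defaulting I have in mind; the helper name, ratio and values
are illustrative rather than taken from the series:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

// defaultMaxGuest is a hypothetical stand-in for the hotplug defaulting
// the series moves into the VMI mutation webhook: derive a max guest
// memory ceiling from the requested guest memory.
func defaultMaxGuest(guest resource.Quantity, ratio int64) *resource.Quantity {
    return resource.NewQuantity(guest.Value()*ratio, guest.Format)
}

func main() {
    // A small test VMI asking for 128Mi would end up with a much larger
    // maximum defined for the domain, which is where I suspect the
    // extra memory pressure comes from.
    guest := resource.MustParse("128Mi")
    fmt.Println(defaultMaxGuest(guest, 4).String()) // 512Mi
}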

I'd appreciate any and all feedback here as I'm very very confused by
this behaviour.

Regards,

Lee

Daniel Hiller

Jun 7, 2024, 4:34:42 AM
to Lee Yarwood, Simone Tiraboschi, kubevirt-dev
Hey Lee,

On Thu, Jun 6, 2024 at 8:49 PM Lee Yarwood <lyar...@redhat.com> wrote:
> Hello all,
>
> With FF looming I'd appreciate some help understanding why the
> guestlog tests are so unstable within the following PR introducing
> live update support for instance types:
>
> instancetype: Support Live Updates
> https://github.com/kubevirt/kubevirt/pull/11455
>
> An example failure can be seen below:
>
> https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute/1798767730631905280

IIRC we also saw this bug on CI as a flake - @Simone Tiraboschi verified
that it was caused by a libvirt bug - and I remember that it then
disappeared a while ago, I suspect through some update, but TBH we never
got to the bottom of it.

> I've been working with Felix for the last few days to identify a
> reproducer and understand the issue, but it only appears to reproduce
> in full CI runs with the full series present. An attempt to bisect the
> series in CI didn't get us anywhere, and I feel like I'm back to
> square one again.
>
> My working assumption has been that the issue is caused by the series
> moving hotplug defaulting into the VMI mutation webhook, as these
> tests only use VMIs. I had assumed the VMIs end up under memory
> pressure because we allocate the calculated max guest value, and that
> this is somehow behind the instability in virtlogd etc., but I've not
> been able to confirm that yet.
>
> I'd appreciate any and all feedback here as I'm very very confused by
> this behaviour.
>
> Regards,
>
> Lee

-- 
Best,
Daniel

Lee Yarwood

Jun 7, 2024, 5:50:59 AM
to Daniel Hiller, Simone Tiraboschi, kubevirt-dev
On Fri, 7 Jun 2024 at 09:34, Daniel Hiller <dhi...@redhat.com> wrote:
>
> Hey Lee,
>
> On Thu, Jun 6, 2024 at 8:49 PM Lee Yarwood <lyar...@redhat.com> wrote:
>>
>> Hello all,
>>
>> With FF looming I'd appreciate some help understanding why the
>> guestlog tests are so unstable within the following PR introducing
>> live update support for instance types:
>>
>> instancetype: Support Live Updates
>> https://github.com/kubevirt/kubevirt/pull/11455
>>
>> An example failure can be seen below:
>>
>> https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute/1798767730631905280
>
>
> IIRC we also saw this bug on CI as a flake - @Simone Tiraboschi verified that it was caused by a libvirt bug - and I remember that it then disappeared a while ago, I suspect through some update, but TBH we never got to the bottom of it.

Thanks Daniel,

Simone, Daniel, in that case would either of you be against
quarantining this test for now, while using a KubeVirt bug to track
reporting this to the libvirt folks and unquarantining the test again?

Cheers,

Lee

Lee Yarwood

Jun 7, 2024, 6:15:33 AM
to Daniel Hiller, Simone Tiraboschi, kubevirt-dev
On Fri, 7 Jun 2024 at 10:50, Lee Yarwood <lyar...@redhat.com> wrote:
> On Fri, 7 Jun 2024 at 09:34, Daniel Hiller <dhi...@redhat.com> wrote:
> >
> > Hey Lee,
> >
> > On Thu, Jun 6, 2024 at 8:49 PM Lee Yarwood <lyar...@redhat.com> wrote:
> >>
> >> Hello all,
> >>
> >> With FF looming I'd appreciate some help understanding why the
> >> guestlog tests are so unstable within the following PR introducing
> >> live update support for instance types:
> >>
> >> instancetype: Support Live Updates
> >> https://github.com/kubevirt/kubevirt/pull/11455
> >>
> >> An example failure can be seen below:
> >>
> >> https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/11455/pull-kubevirt-e2e-k8s-1.30-sig-compute/1798767730631905280
> >
> >
> > IIRC we also saw this bug on CI as a flake - @Simone Tiraboschi verified that it was caused by a libvirt bug - and I remember that it then disappeared a while ago, I suspect through some update, but TBH we never got to the bottom of it.
>
> Thanks Daniel,
>
> Simone, Daniel, in that case would either of you be against
> quarantining this test for now, while using a KubeVirt bug to track
> reporting this to the libvirt folks and unquarantining the test again?

I've jumped the gun slightly here and have added a commit to my series
moving the test into quarantine with a bug raised to track moving it
back out later:

https://github.com/kubevirt/kubevirt/pull/11455/commits/0c2c76a390e83b96f6187535bf801c3ff6bdd7ad

https://github.com/kubevirt/kubevirt/issues/12074
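For reference, the quarantine commit amounts to labelling the spec so
the blocking lanes skip it - roughly along these lines; the describe
text below is illustrative and the decorator usage is from memory, so
treat the actual commit as authoritative:

package guestlog

import (
    . "github.com/onsi/ginkgo/v2"

    "kubevirt.io/kubevirt/tests/decorators"
)

// Rough sketch only: adding the QUARANTINE label keeps the guestlog
// specs out of the required lanes until the flake is understood.
var _ = Describe("[sig-compute]Guest console log", decorators.Quarantine, func() {
    // existing guestlog specs stay unchanged
})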

Apologies for rushing ahead; feel free to say if you would rather not
take this approach and I'll revert.

Cheers,

Lee

Lee Yarwood

Jun 27, 2024, 10:07:28 AM
to Daniel Hiller, Simone Tiraboschi, kubevirt-dev
Just coming back to this after FF, DevConf and some sickness on my side.

I've posted https://github.com/kubevirt/kubevirt/pull/12234 to check
that the test remains flakey.

Simone, assuming it reproduces, would you mind taking a look at the
failure to see if it's the same libvirt issue you saw previously?
Happy to reach out to the libvirt team myself once we can confirm it's
a bug on their side.

Cheers,

Lee

Simone Tiraboschi

Jun 27, 2024, 4:42:29 PM
to Lee Yarwood, Daniel Hiller, kubevirt-dev
Yes, sure.
Waiting for the results.


Lee Yarwood

Jun 28, 2024, 4:46:42 AM
to Simone Tiraboschi, Daniel Hiller, kubevirt-dev
Thanks - however the flake now appears to be shy or fixed, likely by the
various live update memory hotplug fixes that landed recently?

I'll move the revert PR out of draft so we can unquarantine the test for now.

Cheers,

Lee
