Hello All,

With the growth of the kubevirt project, especially kubevirt/kubevirt, we are experiencing an increase in the number of jobs that the CI runs per pushed change. This is putting pressure on the CI workers, increasing the time until feedback is received and, in some cases, (randomly) destabilizing running workloads.
In this thread, I would like to focus on the increase in the number of different job types (we sometimes call them lanes). We seem to have significantly increased the number of these lanes to test various kubevirt deployments. This usually occurs when we need to test the same scenario with different setups, and the setup is cluster wide (therefore, we cannot check it in the same lane). There are probably multiple ways to try and manage this better, from optimizing the tests themselves, through increasing the HW pool, up to changing the CI workflow.
I would like to suggest that we focus our gateway (required) jobs on a default setup (which we can define) and check all the rest as periodics and as release gateways.

Here is an example of the current compute jobs we run, all required:
pull-kubevirt-e2e-k8s-1.21-sig-compute
pull-kubevirt-e2e-k8s-1.22-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.21-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.22-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot

I suggest to move the following ones to periodic-only:

pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot
* There are some others I see in the making (realtime, single-mode).
* If the default changes, the jobs will move accordingly (e.g. nonroot becoming the default).

I think this approach also scales for the future, where we already see many specialized setups appearing which we are interested in supporting. But we cannot test them all as gateways.

If we go this path, the main challenge I see is a feedback flow which makes sure that failures in periodics are well examined and immediate actions are taken to resolve real issues (e.g. to revert a change).

Another option is to ask the maintainers to make sure the specialized setups' e2e tests are passing, by asking to trigger them as the last step before approval.
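As a rough illustration of the proposed split (a sketch only, not an actual Prow job configuration; the suffix-based rule and the chosen suffixes are assumptions inferred from the lane names above):

# Sketch: classify lanes by whether the name ends with a suffix that marks a
# specialized, cluster-wide setup. Default-setup lanes stay required; the
# specialized ones would run as periodics / release gateways instead.
SPECIALIZED_SUFFIXES = ("-nonroot", "-cgroupsv2")  # assumed marker suffixes

def classify(lane: str) -> str:
    """Return the proposed role of a lane based on its name."""
    return "periodic-only" if lane.endswith(SPECIALIZED_SUFFIXES) else "required-presubmit"

for lane in (
    "pull-kubevirt-e2e-k8s-1.23-sig-compute",
    "pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations",
    "pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot",
    "pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2",
    "pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot",
):
    print(f"{lane}: {classify(lane)}")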
Thanks,
Edy.
Hi Edward,

Thanks for starting this discussion.
On Tue, Apr 5, 2022 at 8:56 AM Edward Haas <edw...@redhat.com> wrote:
Hello All,

With the growth of the kubevirt project, especially kubevirt/kubevirt, we are experiencing an increase in the number of jobs that the CI runs per pushed change. This is putting pressure on the CI workers, increasing the time until feedback is received and, in some cases, (randomly) destabilizing running workloads.
I disagree. The parallelization of tests in fact decreases the overall test lane runtime per PR. This can be seen, e.g., in the separation of the migration tests into a separate lane, and in the parallelization of a lot of tests that Roman et al. did recently, which reduced the overall lane runtime of the long-running compute lanes.

What I can agree with is that the test lanes only start later when the traffic on CI is high. What increases the feedback time the most is the number of retests required because of flaky tests. Every time we need to do a retest, we lose the time of another test lane run.

Given that things can always be improved (looking at kind not using our image proxy), I must say that the outages over the last weeks didn't put our CI in the best light. But, besides the workloads cluster outage, which was our fault, I don't see major issues in the setup of CI.
Also, can you give examples for your last point about "destabilizing (randomly) running workloads", please?
In this thread, I would like to focus on the increase in the number of different job types (we sometimes call them lanes). We seem to have significantly increased the number of these lanes to test various kubevirt deployments. This usually occurs when we need to test the same scenario with different setups, and the setup is cluster wide (therefore, we cannot check it in the same lane). There are probably multiple ways to try and manage this better, from optimizing the tests themselves, through increasing the HW pool, up to changing the CI workflow.

I would like to suggest that we focus our gateway (required) jobs on a default setup (which we can define) and check all the rest as periodics and as release gateways.
I think the current state of the `optional` status per lane pretty much reflects what we need. But that might be just me.
Here is an example of the current compute jobs we run, all required:
pull-kubevirt-e2e-k8s-1.21-sig-compute
pull-kubevirt-e2e-k8s-1.22-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.21-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.22-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot

I suggest to move the following ones to periodic-only:

pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot
Especially the nonroot lanes should _not_ be optional, as they gate our current efforts to get rid of the requirement to run as root. Rather, we want to get rid of the root lanes in the long term; IIRC this will take a couple more months. Others, please keep me honest here.
* There are some others I see in the making (realtime, single-mode).
* If the default changes, the jobs will move accordingly (e.g. nonroot becoming the default).

I think this approach also scales for the future, where we already see many specialized setups appearing which we are interested in supporting. But we cannot test them all as gateways.

If we go this path, the main challenge I see is a feedback flow which makes sure that failures in periodics are well examined and immediate actions are taken to resolve real issues (e.g. to revert a change).

Another option is to ask the maintainers to make sure the specialized setups' e2e tests are passing, by asking to trigger them as the last step before approval.
I see these steps as the opposite of what we want to achieve, i.e. let the automation gate what is merged and reduce the actual workload of maintainers, so that they step in only when it is really required.

I think the answer to our problem is not less, but more automation and fewer manual steps. And, as said, fewer flaky tests.
--
Kind regards,
Daniel Hiller
He / Him / His
Senior Software Engineer, OpenShift Virtualization
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
Hi,

On Tue, Apr 5, 2022 at 8:56 AM Edward Haas <edw...@redhat.com> wrote:

Hello All,

With the growth of the kubevirt project, especially kubevirt/kubevirt, we are experiencing an increase in the number of jobs that the CI runs per pushed change. This is putting pressure on the CI workers, increasing the time until feedback is received and, in some cases, (randomly) destabilizing running workloads.
I disagree. The parallelization of tests in fact decreases the overall test lane runtime per PR. This can be seen, e.g., in the separation of the migration tests into a separate lane, and in the parallelization of a lot of tests that Roman et al. did recently, which reduced the overall lane runtime of the long-running compute lanes.
What I can agree with is that the test lanes only start later when the traffic on CI is high. What increases the feedback time the most is the number of retests required because of flaky tests. Every time we need to do a retest, we lose the time of another test lane run.

Given that things can always be improved (looking at kind not using our image proxy), I must say that the outages over the last weeks didn't put our CI in the best light. But, besides the workloads cluster outage, which was our fault, I don't see major issues in the setup of CI.
Also, can you give examples for your last point about "destabilizing (randomly) running workloads", please?
I didn't experience any delays recently other than those caused by 3rd party. Do you have any pointers showing otherwise?
I must admit there was an increased number of jobs recently, and I am the one behind this change. This change was, of course, communicated to people who know our CI limits and, of course, was done via the proper process (PRs). The "-nonroot" lanes are necessary only temporarily, and my personal expectation is that they can go away after one release.
I noticed some association of "-nonroot" lanes with recent problems:
https://github.com/kubevirt/project-infra/pull/2011
https://github.com/kubevirt/project-infra/pull/2012

I would like to touch upon the sriov lane here. I don't think it has any potential to affect the infrastructure, because each sriov node is only capable of serving one job at a time. Therefore the new "-nonroot" lane is only limiting throughput, so developers may experience a slower turnaround for this lane. According to my calculations, we should be capable of covering 98 jobs per day / 49 PRs per day. This should be enough for the active PRs.

I would like to kindly ask to revert the revert of the revert if there is no evidence that it causes a problem.
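To make the arithmetic behind the "98 jobs per day / 49 PRs per day" estimate explicit, here is a rough back-of-the-envelope sketch; the node count and average lane duration are assumptions for illustration (they are not stated in this thread), so it only roughly reproduces the figures above.

# Assumed values for illustration only; replace with the real ones.
SRIOV_NODES = 4           # nodes, each able to run a single sriov job at a time
AVG_JOB_HOURS = 1.0       # average sriov lane duration, including provisioning
SRIOV_LANES_PER_PR = 2    # the existing sriov lane plus the new "-nonroot" one

jobs_per_day = SRIOV_NODES * 24 / AVG_JOB_HOURS
prs_per_day = jobs_per_day / SRIOV_LANES_PER_PR
print(f"~{jobs_per_day:.0f} sriov jobs/day -> ~{prs_per_day:.0f} PRs/day")
# With these assumptions: ~96 sriov jobs/day -> ~48 PRs/day, the same ballpark
# as the 98/49 figure quoted above.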
In this thread, I would like to focus on the increase in the number of different job types (we sometimes call them lanes). We seem to have significantly increased the number of these lanes to test various kubevirt deployments. This usually occurs when we need to test the same scenario with different setups, and the setup is cluster wide (therefore, we cannot check it in the same lane). There are probably multiple ways to try and manage this better, from optimizing the tests themselves, through increasing the HW pool, up to changing the CI workflow.

It is true that we have increased the number of lanes, but I don't think the current status is alarming. We would need to evaluate how to manage this if the trend of increasing jobs continues.
I would like to suggest that we focus our gateway (required) jobs on a default setup (which we can define) and check all the rest as periodics and as release gateways.
I think the current state of the `optional` status per lane pretty much reflects what we need. But that might be just me.
Here is an example of the current compute jobs we run, all required:
pull-kubevirt-e2e-k8s-1.21-sig-compute
pull-kubevirt-e2e-k8s-1.22-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.21-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.22-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot

I suggest to move the following ones to periodic-only:

pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot
Especially the nonroot lanes should _not_ be optional, as they gate our current efforts to get rid of the requirement to run as root. Rather, we want to get rid of the root lanes in the long term; IIRC this will take a couple more months.
I am against this suggestion, as the expected lifetime of these lanes is very low and we are handling this workload just fine (according to the data available to me).

* There are some others I see in the making (realtime, single-mode).
* If the default changes, the jobs will move accordingly (e.g. nonroot becoming the default).

I think this approach also scales for the future, where we already see many specialized setups appearing which we are interested in supporting. But we cannot test them all as gateways.

If we go this path, the main challenge I see is a feedback flow which makes sure that failures in periodics are well examined and immediate actions are taken to resolve real issues (e.g. to revert a change).

Another option is to ask the maintainers to make sure the specialized setups' e2e tests are passing, by asking to trigger them as the last step before approval.
I see these steps as the opposite of what we want to achieve, i.e. let the automation gate what is merged and reduce the actual workload of maintainers, in order to step in only when it is really required.
I think this would be a huge change in our workflow. The cost of discovering bugs increases as time goes on (this is no secret). In our case, this would increase the burden on maintainers or potentially delay a release (which, btw, happens very often == every month). Who would be responsible for identifying which PR breaks a lane, and who would be responsible for fixing the issue?
I noticed some association of "-nonroot" lanes with recent problems:
https://github.com/kubevirt/project-infra/pull/2011
https://github.com/kubevirt/project-infra/pull/2012

I would like to touch upon the sriov lane here. I don't think it has any potential to affect the infrastructure, because each sriov node is only capable of serving one job at a time. Therefore the new "-nonroot" lane is only limiting throughput, so developers may experience a slower turnaround for this lane. According to my calculations, we should be capable of covering 98 jobs per day / 49 PRs per day. This should be enough for the active PRs.

I would like to kindly ask to revert the revert of the revert if there is no evidence that it causes a problem.
These actions have been part of crisis handling and I prefer not to mix them into this thread. I suggest you open a separate thread about this, or just post a PR.

In this thread, I would like to focus on the increase in the number of different job types (we sometimes call them lanes). We seem to have significantly increased the number of these lanes to test various kubevirt deployments. This usually occurs when we need to test the same scenario with different setups, and the setup is cluster wide (therefore, we cannot check it in the same lane). There are probably multiple ways to try and manage this better, from optimizing the tests themselves, through increasing the HW pool, up to changing the CI workflow.

It is true that we have increased the number of lanes, but I don't think the current status is alarming. We would need to evaluate how to manage this if the trend of increasing jobs continues.

I initiated this thread because I do think it is alarming.
I think this would be a huge change in our workflow. The cost of discovering bugs increases as time goes on (this is no secret). In our case, this would increase the burden on maintainers or potentially delay a release (which, btw, happens very often == every month). Who would be responsible for identifying which PR breaks a lane, and who would be responsible for fixing the issue?

All good points; I am not arguing that there are no advantages to having gateways like we have today. What I am arguing is that it is not reasonable for me as a contributor to wait 3 hours or more for feedback on a change I posted.
And looking at the history, this is going to increase all the time (more developers, more tests, more feature jobs).

Thanks for raising your concerns, Lubo.

Thanks,
Edy.
It's a bit off-topic maybe, but I think we can also avoid running some lanes under certain circumstances.

For example, if a PR only changes a unit test, we can obviously skip the vast majority of the lanes. The same goes for a PR that only changes a functional test (we can avoid running all the irrelevant lanes), changes in documentation, etc.

It might sound negligible, but there are many such small PRs, so perhaps it could be valuable.
Just a thought.
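A minimal sketch of that idea: decide which lanes a PR actually needs based on the files it touches. In Prow this is normally expressed declaratively with run_if_changed / skip_if_only_changed regexes on the job definitions; the lane-to-path mapping below is an illustrative assumption, not kubevirt's real configuration.

import re

# Hypothetical mapping from lane name to the paths that make it relevant.
LANE_TRIGGERS = {
    "pull-kubevirt-e2e-k8s-1.23-sig-compute": r"^(pkg/|cmd/|tests/)",
    "pull-kubevirt-unit-test": r"^(pkg/|cmd/|staging/)",
}

def lanes_to_run(changed_files):
    """Return the lanes whose trigger pattern matches at least one changed file."""
    return [
        lane
        for lane, pattern in LANE_TRIGGERS.items()
        if any(re.search(pattern, path) for path in changed_files)
    ]

print(lanes_to_run(["docs/README.md"]))            # [] - a docs-only PR skips both lanes
print(lanes_to_run(["pkg/virt-launcher/foo.go"]))  # both lanes run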
Per the documentation, status contexts can be required unconditionally (as it is today) or conditionally (per changed files). It is unclear to me whether it is possible to require a status context for manually triggered jobs (that has the potential to let us build the flow as a pipeline).
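For reference, a small sketch of how one could list the status contexts a protected branch currently requires, using GitHub's documented branch-protection REST endpoint; the repo/branch values are placeholders and the call needs a token with admin read access. Whether a context that is only reported by a manually triggered job can be required here is exactly the open question above.

import os
import requests

OWNER, REPO, BRANCH = "kubevirt", "kubevirt", "main"  # placeholders

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}"
    "/protection/required_status_checks",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",  # admin read required
    },
)
resp.raise_for_status()
print(resp.json().get("contexts", []))  # names of the required status contexts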
Embedding Itamar's response here so I can answer it:

It's a bit off-topic maybe, but I think we can also avoid running some lanes under certain circumstances.

For example, if a PR only changes a unit test, we can obviously skip the vast majority of the lanes. The same goes for a PR that only changes a functional test (we can avoid running all the irrelevant lanes), changes in documentation, etc.

It might sound negligible, but there are many such small PRs, so perhaps it could be valuable.
Just a thought.

I agree, we should be able to do this (unless the branch protection feature gets in the way for some reason).

There are a lot of good ideas raised, and I am in favor of exploring all of them to improve our productivity. Let's start working on a proposal with all the ideas and suggestions listed. This can be part of a larger effort to improve our CI and its stability.
--
Best,
Daniel