Hello All,

With the growth of the kubevirt project, especially kubevirt/kubevirt, we are experiencing an increase in the number of jobs that the CI runs per pushed change. This is putting pressure on the CI workers, increasing the time until feedback is received and, in some cases, (randomly) destabilizing running workloads.
In this thread, I would like to focus on the increase in the number of different job types (we sometimes call them lanes). We seem to have significantly increased the number of these lanes to test various kubevirt deployments. This usually occurs when we need to test the same scenario with different setups, and the setup is cluster wide (therefore, we cannot check it in the same lane). There are probably multiple ways to try and manage this better, from optimizing the tests themselves, through increasing the HW pool, up to changing the CI workflow.
I would like to suggest that we focus our gateway (required) jobs on a default setup (which we can define) and check all the rest as periodics and as release gateways.

Here is an example of the current compute jobs we run, all required:
pull-kubevirt-e2e-k8s-1.21-sig-compute
pull-kubevirt-e2e-k8s-1.22-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.21-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.22-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot

I suggest to move the following ones to periodic-only:

pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot
* There are some others I see in the making (realtime, single-mode).
* If the default changes, the jobs will move accordingly (e.g. nonroot becoming the default).

I think this approach also scales for the future, where we already see many specialized setups appearing which we are interested in supporting. But we cannot test them all as gateways.

If we go this path, the main challenge I see is a feedback flow which makes sure that failures in periodics are well examined and immediate actions are taken to resolve real issues (e.g. to revert a change).

Another option is to ask the maintainers to make sure the specialized setups' e2e tests are passing, by asking to trigger them as the last step before approval.
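As a rough illustration of the proposed split (a sketch only, not an actual Prow job configuration; the suffix-based rule and the chosen suffixes are assumptions inferred from the lane names above):

# Sketch: classify lanes by whether the name ends with a suffix that marks a
# specialized, cluster-wide setup. Default-setup lanes stay required; the
# specialized ones would run as periodics / release gateways instead.
SPECIALIZED_SUFFIXES = ("-nonroot", "-cgroupsv2")  # assumed marker suffixes

def classify(lane: str) -> str:
    """Return the proposed role of a lane based on its name."""
    return "periodic-only" if lane.endswith(SPECIALIZED_SUFFIXES) else "required-presubmit"

for lane in (
    "pull-kubevirt-e2e-k8s-1.23-sig-compute",
    "pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations",
    "pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot",
    "pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2",
    "pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot",
):
    print(f"{lane}: {classify(lane)}")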
Thanks,
Edy.
Hi Edward,

Thanks for starting this discussion.
On Tue, Apr 5, 2022 at 8:56 AM Edward Haas <edw...@redhat.com> wrote:
Hello All,

With the growth of the kubevirt project, especially kubevirt/kubevirt, we are experiencing an increase in the number of jobs that the CI runs per pushed change. This is putting pressure on the CI workers, increasing the time until feedback is received and, in some cases, (randomly) destabilizing running workloads.
I disagree. The parallelization of tests in fact decreases the overall test lane runtime per PR. This can be seen, e.g., in the separation of the migration tests into a separate lane, and in the parallelization of a lot of tests that Roman et al. did recently, which reduced the overall lane runtime of the long-running compute lanes.

What I can agree with is that the test lanes only start later when the traffic on CI is high. What increases the feedback time the most is the number of retests required because of flaky tests. Every time we need to do a retest, we lose the time of another test lane run.

Given that things can always be improved (looking at kind not using our image proxy), I must say that the outages over the last weeks didn't put our CI in the best light. But, besides the workloads cluster outage, which was our fault, I don't see major issues in the setup of CI.
Also, can you give examples for your last point about "destabilizing (randomly) running workloads", please?
In this thread, I would like to focus on the increase in the number of different job types (we sometimes call them lanes). We seem to have significantly increased the number of these lanes to test various kubevirt deployments. This usually occurs when we need to test the same scenario with different setups, and the setup is cluster wide (therefore, we cannot check it in the same lane). There are probably multiple ways to try and manage this better, from optimizing the tests themselves, through increasing the HW pool, up to changing the CI workflow.

I would like to suggest that we focus our gateway (required) jobs on a default setup (which we can define) and check all the rest as periodics and as release gateways.
I think the current state of the `optional` status per lane pretty much reflects what we need. But that might be just me.
Here is an example of the current compute jobs we run, all required:
pull-kubevirt-e2e-k8s-1.21-sig-compute
pull-kubevirt-e2e-k8s-1.22-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.21-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.22-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot

I suggest to move the following ones to periodic-only:

pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot
Especially the nonroot lanes should _not_ be optional, as they gate our current efforts to get rid of the requirement to run as root. Rather, we want to get rid of the root lanes in the long term; IIRC this will take a couple more months. Others, please keep me honest here.
* There are some others I see in the making (realtime, single-mode).
* If the default changes, the jobs will move accordingly (e.g. nonroot becoming the default).

I think this approach also scales for the future, where we already see many specialized setups appearing which we are interested in supporting. But we cannot test them all as gateways.

If we go this path, the main challenge I see is a feedback flow which makes sure that failures in periodics are well examined and immediate actions are taken to resolve real issues (e.g. to revert a change).

Another option is to ask the maintainers to make sure the specialized setups' e2e tests are passing, by asking to trigger them as the last step before approval.
I see these steps as the opposite of what we want to achieve, i.e. let the automation gate what is merged and reduce the actual workload of maintainers, so that they step in only when it is really required.

I think the answer to our problem is not less, but more automation and fewer manual steps. And, as said, fewer flaky tests.
--
Kind regards,
Daniel Hiller
He / Him / His
Senior Software Engineer, OpenShift Virtualization
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
Hi,

On Tue, Apr 5, 2022 at 8:56 AM Edward Haas <edw...@redhat.com> wrote:

Hello All,

With the growth of the kubevirt project, especially kubevirt/kubevirt, we are experiencing an increase in the number of jobs that the CI runs per pushed change. This is putting pressure on the CI workers, increasing the time until feedback is received and, in some cases, (randomly) destabilizing running workloads.
I disagree. The parallelization of tests in fact decreases the overall test lane runtime per PR. This can be seen, e.g., in the separation of the migration tests into a separate lane, and in the parallelization of a lot of tests that Roman et al. did recently, which reduced the overall lane runtime of the long-running compute lanes.
What I can agree with is that the test lanes only start later when the traffic on CI is high. What increases the feedback time the most is the number of retests required because of flaky tests. Every time we need to do a retest, we lose the time of another test lane run.

Given that things can always be improved (looking at kind not using our image proxy), I must say that the outages over the last weeks didn't put our CI in the best light. But, besides the workloads cluster outage, which was our fault, I don't see major issues in the setup of CI.
Also, can you give examples for your last point about "destabilizing (randomly) running workloads", please?
I didn't experience any delays recently other than those caused by 3rd party. Do you have any pointers showing otherwise?
I must admit there was an increased number of jobs recently, and I am the one behind this change. This change was, of course, communicated to people who know our CI limits and, of course, was done via the proper process (PRs). The "-nonroot" lanes are necessary only temporarily, and my personal expectation is that they can go away after one release.
I noticed some association of "-nonroot" lanes with recent problems:
https://github.com/kubevirt/project-infra/pull/2011
https://github.com/kubevirt/project-infra/pull/2012

I would like to touch upon the sriov lane here. I don't think it has any potential to affect the infrastructure, because each sriov node is only capable of serving one job at a time. Therefore the new "-nonroot" lane is only limiting throughput, so developers may experience a slower turnaround for this lane. According to my calculations, we should be capable of covering 98 jobs per day / 49 PRs per day. This should be enough for the active PRs.

I would like to kindly ask to revert the revert of the revert if there is no evidence that it causes a problem.
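To make the arithmetic behind the "98 jobs per day / 49 PRs per day" estimate explicit, here is a rough back-of-the-envelope sketch; the node count and average lane duration are assumptions for illustration (they are not stated in this thread), so it only roughly reproduces the figures above.

# Assumed values for illustration only; replace with the real ones.
SRIOV_NODES = 4           # nodes, each able to run a single sriov job at a time
AVG_JOB_HOURS = 1.0       # average sriov lane duration, including provisioning
SRIOV_LANES_PER_PR = 2    # the existing sriov lane plus the new "-nonroot" one

jobs_per_day = SRIOV_NODES * 24 / AVG_JOB_HOURS
prs_per_day = jobs_per_day / SRIOV_LANES_PER_PR
print(f"~{jobs_per_day:.0f} sriov jobs/day -> ~{prs_per_day:.0f} PRs/day")
# With these assumptions: ~96 sriov jobs/day -> ~48 PRs/day, the same ballpark
# as the 98/49 figure quoted above.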
In this thread, I would like to focus on the increase in the number of different job types (we sometimes call them lanes). We seem to have significantly increased the number of these lanes to test various kubevirt deployments. This usually occurs when we need to test the same scenario with different setups, and the setup is cluster wide (therefore, we cannot check it in the same lane). There are probably multiple ways to try and manage this better, from optimizing the tests themselves, through increasing the HW pool, up to changing the CI workflow.

It is true that we have increased the number of lanes, but I don't think the current status is alarming. We would need to evaluate how to manage this if the trend of increasing jobs continues.
I would like to suggest that we focus our gateway (required) jobs on a default setup (which we can define) and check all the rest as periodics and as release gateways.
I think the current state of the `optional` status per lane pretty much reflects what we need. But that might be just me.
Here is an example of the current compute jobs we run, all required:
pull-kubevirt-e2e-k8s-1.21-sig-compute
pull-kubevirt-e2e-k8s-1.22-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute
pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.21-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.22-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot

I suggest to move the following ones to periodic-only:

pull-kubevirt-e2e-k8s-1.23-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.23-sig-compute-cgroupsv2
pull-kubevirt-e2e-k8s-1.23-sig-compute-migrations-nonroot
Especially the nonroot lanes should _not_ be optional, as they gate our current efforts to get rid of the requirement to run as root. Rather, we want to get rid of the root lanes in the long term; IIRC this will take a couple more months.
I am against this suggestion, as the expected lifetime of these lanes is very low and we are handling this workload just fine (according to the data available to me).

* There are some others I see in the making (realtime, single-mode).
* If the default changes, the jobs will move accordingly (e.g. nonroot becoming the default).

I think this approach also scales for the future, where we already see many specialized setups appearing which we are interested in supporting. But we cannot test them all as gateways.

If we go this path, the main challenge I see is a feedback flow which makes sure that failures in periodics are well examined and immediate actions are taken to resolve real issues (e.g. to revert a change).

Another option is to ask the maintainers to make sure the specialized setups' e2e tests are passing, by asking to trigger them as the last step before approval.
I see these steps as the opposite of what we want to achieve, i.e. let the automation gate what is merged and reduce the actual workload of maintainers, in order to step in only when it is really required.
I think this would be a huge change in our workflow. The cost of discovering bugs increases as time goes on (this is no secret). In our case, this would increase the burden on maintainers or potentially delay a release (which, btw, happens very often == every month). Who would be responsible for identifying which PR breaks a lane, and who would be responsible for fixing the issue?
I noticed some association of "-nonroot" lanes with recent problems:
https://github.com/kubevirt/project-infra/pull/2011
https://github.com/kubevirt/project-infra/pull/2012

I would like to touch upon the sriov lane here. I don't think it has any potential to affect the infrastructure, because each sriov node is only capable of serving one job at a time. Therefore the new "-nonroot" lane is only limiting throughput, so developers may experience a slower turnaround for this lane. According to my calculations, we should be capable of covering 98 jobs per day / 49 PRs per day. This should be enough for the active PRs.

I would like to kindly ask to revert the revert of the revert if there is no evidence that it causes a problem.
These actions have been part of crisis handling and I prefer not to mix them into this thread. I suggest you open a separate thread about this, or just post a PR.

In this thread, I would like to focus on the increase in the number of different job types (we sometimes call them lanes). We seem to have significantly increased the number of these lanes to test various kubevirt deployments. This usually occurs when we need to test the same scenario with different setups, and the setup is cluster wide (therefore, we cannot check it in the same lane). There are probably multiple ways to try and manage this better, from optimizing the tests themselves, through increasing the HW pool, up to changing the CI workflow.

It is true that we have increased the number of lanes, but I don't think the current status is alarming. We would need to evaluate how to manage this if the trend of increasing jobs continues.

I initiated this thread because I do think it is alarming.
I think this would be a huge change in our workflow. The cost of discovering bugs increases as time goes on (this is no secret). In our case, this would increase the burden on maintainers or potentially delay a release (which, btw, happens very often == every month). Who would be responsible for identifying which PR breaks a lane, and who would be responsible for fixing the issue?

All good points; I am not arguing that there are no advantages to having gateways like we have today. What I am arguing is that it is not reasonable for me as a contributor to wait 3 hours or more for feedback on a change I posted.
And looking at the history, this is going to increase all the time (more developers, more tests, more feature jobs).

Thanks for raising your concerns, Lubo.

Thanks,
Edy.
It's a bit off-topic maybe, but I think we can also avoid running some lanes under certain circumstances.

For example, if a PR only changes a unit test, we can obviously skip the vast majority of the lanes. The same goes for a PR that only changes a functional test (we can avoid running all the irrelevant lanes), changes in documentation, etc.

It might sound negligible, but there are many such small PRs, so perhaps it could be valuable.
Just a thought.
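A minimal sketch of that idea: decide which lanes a PR actually needs based on the files it touches. In Prow this is normally expressed declaratively with run_if_changed / skip_if_only_changed regexes on the job definitions; the lane-to-path mapping below is an illustrative assumption, not kubevirt's real configuration.

import re

# Hypothetical mapping from lane name to the paths that make it relevant.
LANE_TRIGGERS = {
    "pull-kubevirt-e2e-k8s-1.23-sig-compute": r"^(pkg/|cmd/|tests/)",
    "pull-kubevirt-unit-test": r"^(pkg/|cmd/|staging/)",
}

def lanes_to_run(changed_files):
    """Return the lanes whose trigger pattern matches at least one changed file."""
    return [
        lane
        for lane, pattern in LANE_TRIGGERS.items()
        if any(re.search(pattern, path) for path in changed_files)
    ]

print(lanes_to_run(["docs/README.md"]))            # [] - a docs-only PR skips both lanes
print(lanes_to_run(["pkg/virt-launcher/foo.go"]))  # both lanes run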
Per the documentation, status contexts can be required unconditionally (as it is today) or conditionally (per changed files). It is unclear to me whether it is possible to require a status context for manually triggered jobs (that has the potential to let us build the flow as a pipeline).
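For reference, a small sketch of how one could list the status contexts a protected branch currently requires, using GitHub's documented branch-protection REST endpoint; the repo/branch values are placeholders and the call needs a token with admin read access. Whether a context that is only reported by a manually triggered job can be required here is exactly the open question above.

import os
import requests

OWNER, REPO, BRANCH = "kubevirt", "kubevirt", "main"  # placeholders

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}"
    "/protection/required_status_checks",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",  # admin read required
    },
)
resp.raise_for_status()
print(resp.json().get("contexts", []))  # names of the required status contexts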
Embedding Itamar's response here so I can answer it:

It's a bit off-topic maybe, but I think we can also avoid running some lanes under certain circumstances.

For example, if a PR only changes a unit test, we can obviously skip the vast majority of the lanes. The same goes for a PR that only changes a functional test (we can avoid running all the irrelevant lanes), changes in documentation, etc.

It might sound negligible, but there are many such small PRs, so perhaps it could be valuable.
Just a thought.

I agree, we should be able to do this (unless the branch protection feature gets in the way for some reason).

There are a lot of good ideas raised, and I am in favor of exploring all of them to improve our productivity. Let's start working on a proposal with all the ideas and suggestions listed. This can be part of a larger effort to improve our CI and its stability.
--
Best,
Daniel