proposal to add a kubeadm PR blocking test job for k/k

Lubomir I. Ivanov

Nov 1, 2024, 10:09:16 AM
to kubernetes-sig-testing, sig-cluste...@kubernetes.io
hello,

a slightly more formal cross post from slack:
https://kubernetes.slack.com/archives/C09QZ4DQB/p1730462110713279

do we have any objections if we add the following kubeadm upgrade job
as PR merge blocking for k/k with run_if_changed: '^cmd\/kubeadm\/.*$'
?

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubeadm/3116/pull-kubeadm-kinder-upgrade-latest/1851600608943935488

(the above means that it will run only if kubeadm bits change)
(we may want to add test/e2e_kubeadm too)
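
A minimal sketch of how the `run_if_changed` pattern above would select PRs, assuming the job is triggered when at least one changed file in the PR matches the regex. Prow evaluates the pattern with Go's regexp package; Python's `re` is used here only for illustration, and this simple pattern behaves the same in both. The widened pattern and the file lists are hypothetical and not part of the proposal.

```python
import re

# Pattern quoted in the proposal.
KUBEADM_ONLY = re.compile(r'^cmd\/kubeadm\/.*$')

# Hypothetical widened pattern that would also cover test/e2e_kubeadm.
KUBEADM_AND_E2E = re.compile(r'^(cmd\/kubeadm|test\/e2e_kubeadm)\/.*$')


def job_would_trigger(pattern, changed_files):
    """Return True if at least one changed file path matches the pattern."""
    return any(pattern.search(path) for path in changed_files)


# Made-up lists of files changed by a PR, for illustration only.
kubeadm_pr = ["cmd/kubeadm/app/cmd/upgrade/apply.go", "pkg/kubelet/kubelet.go"]
kubelet_pr = ["pkg/kubelet/kubelet.go"]
e2e_pr = ["test/e2e_kubeadm/nodes_test.go"]

for name, files in [("kubeadm", kubeadm_pr), ("kubelet", kubelet_pr), ("e2e", e2e_pr)]:
    print(name,
          job_would_trigger(KUBEADM_ONLY, files),
          job_would_trigger(KUBEADM_AND_E2E, files))
```

With the pattern as proposed, a PR touching only test/e2e_kubeadm would not trigger the job, which is the gap the second parenthetical refers to.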

it runs for <20 minutes and verifies if something broke in kubeadm
upgrade by doing a latest -> latest version in place upgrade.

prow spec can be seen in the above run. kinder workflow (contains all
the tasks of the job)
https://github.com/kubernetes/kubeadm/blob/main/kinder/ci/workflows/presubmit-upgrade-latest.yaml

the argument for that is simple - often we break kubeadm because we
don't have 100% unit test coverage and that alerts the release CI team
as we have kubeadm upgrade jobs in the release-informing dashboards.
in such a case we have to fix ASAP, log issue, talk to CI team, send
another PR in k/k etc. testing locally with kinder is fine, but that
is quite slow on slower internet. that can all be avoided with a
simple pre-submit.

comments are appreciated.

thanks
lubomir
--

Davanum Srinivas

Nov 1, 2024, 10:23:43 AM
to Lubomir I. Ivanov, kubernetes-sig-testing, sig-cluste...@kubernetes.io
+1 from me. I am supportive of this effort

--
You received this message because you are subscribed to the Google Groups "kubernetes-sig-testing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-sig-te...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/kubernetes-sig-testing/CAGDbWi-9%3DAh9YO-RSWYZfWmAq2fjod3M8q3j65vKE4ZDxYkOnQ%40mail.gmail.com.


--
Davanum Srinivas :: https://twitter.com/dims

Antonio Ojea

Nov 1, 2024, 12:17:28 PM
to Davanum Srinivas, Lubomir I. Ivanov, kubernetes-sig-testing, sig-cluste...@kubernetes.io
Can we start reporting status only and evaluate after code freeze the
results to move it to blocking?

My observation is that people sending PRs do not differentiate
between required and non-required failing jobs, so that will serve
the goal of this effort, and by not making it blocking we avoid the
risk of blocking legit PRs, especially with code freeze so close.

Jordan Liggitt

Nov 1, 2024, 12:27:44 PM
to Antonio Ojea, Davanum Srinivas, Lubomir I. Ivanov, kubernetes-sig-testing, sig-cluste...@kubernetes.io
> as we have kubeadm upgrade jobs in the release-informing dashboards.

Should it be release-blocking first before making it PR-blocking?

How is the health of the existing periodic run? Is it flaky or stable?

> Can we start reporting status only and evaluate after code freeze the results to move it to blocking?

+1 to running in a non-blocking way first and getting visibility ahead of making it PR blocking





Lubomir I. Ivanov

Nov 1, 2024, 12:45:10 PM
to Jordan Liggitt, kubernetes-sig-release, Antonio Ojea, Davanum Srinivas, kubernetes-sig-testing, sig-cluste...@kubernetes.io
On Fri, 1 Nov 2024 at 18:27, Jordan Liggitt <lig...@google.com> wrote:
>
> > as we have kubeadm upgrade jobs in the release-informing dashboards.
>
> Should it be release-blocking first before making it PR-blocking?
>

are existing PR blocking jobs also on the release-blocking board?

in practice the kubeadm upgrade release informing jobs have been
release blocking already.
the line between informing and blocking has been historically blurred
by the release team and kubeadm maintainers, since nobody wants to
release a broken (e.g.) "kubeadm upgrade" command.

so that can be step 1 for me to PR if there is consensus.
cc @kubernetes-sig-release

> How is the health of the existing periodic run? Is it flaky or stable?

flakes do happen, but very rarely, and are either a product of infra
status or random unexplainable `docker run` (maybe also infra related)
bugs (it's a DIND setup).
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm
apart from the recent breakage that triggered today's discussion, the
entire kubeadm board is green.

>
> > Can we start reporting status only and evaluate after code freeze the results to move it to blocking?
>
> +1 to running in a non-blocking way first and getting visibility ahead of making it PR blocking
>

i think this is already a standard practice, so i was anticipating it.
fine by me to have the on demand `/test ...` trigger initially and
after some time make it merge blocking.
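
The staged rollout discussed above (on-demand trigger, then status reporting only, then merge blocking) can be sketched as a simplified model. This is not Prow's actual data structure or merge logic; the `Presubmit` class and both helper functions are hypothetical, and only the `run_if_changed` and `optional` field names mirror real presubmit settings.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Presubmit:
    """Simplified, hypothetical model of a presubmit job definition."""
    name: str
    run_if_changed: Optional[str]  # regex on changed paths; None = on demand only
    optional: bool                 # True = a failure does not block merge


def runs_automatically(job: Presubmit, changed_files: List[str]) -> bool:
    """The job starts by itself only if some changed file matches its regex."""
    if job.run_if_changed is None:
        return False
    pattern = re.compile(job.run_if_changed)
    return any(pattern.search(path) for path in changed_files)


def blocks_merge(job: Presubmit, changed_files: List[str], passed: bool) -> bool:
    """In this model a failure blocks merge only for a triggered, required job."""
    return runs_automatically(job, changed_files) and not job.optional and not passed


NAME = "pull-kubeadm-kinder-upgrade-latest"
PATTERN = r'^cmd\/kubeadm\/.*$'

on_demand = Presubmit(NAME, run_if_changed=None, optional=True)     # `/test ...` only
reporting = Presubmit(NAME, run_if_changed=PATTERN, optional=True)  # status only
blocking = Presubmit(NAME, run_if_changed=PATTERN, optional=False)  # merge blocking

changed = ["cmd/kubeadm/app/cmd/upgrade/apply.go"]  # made-up PR contents
for stage in (on_demand, reporting, blocking):
    print(runs_automatically(stage, changed), blocks_merge(stage, changed, passed=False))
```

Moving from one stage to the next is then a matter of adding the path filter first and flipping `optional` later, once the job has shown itself to be stable.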

lubomir
--

Benjamin Elder

Nov 1, 2024, 12:48:49 PM
to Lubomir I. Ivanov, Jordan Liggitt, kubernetes-sig-release, Antonio Ojea, Davanum Srinivas, kubernetes-sig-testing, sig-cluste...@kubernetes.io
> are existing PR blocking jobs also on the release-blocking board?

generally yes, if there's anywhere currently missing that, we'd probably consider it legacy tech-debt

PR blocking is a stricter subset of release blocking, we wrote some solid guidelines for release blocking and generally presubmit blocking is just release blocking + release-blocking is catching actual breakage often + we can run it even faster than the release blocking guidelines (which permit up to 2h, which is far too slow for a presubmit)
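
The criteria in the paragraph above can be read as a checklist. The sketch below is only an illustration: the 2h ceiling comes from the message, while `JobProfile`, `is_presubmit_candidate` and the `presubmit_max_runtime` parameter are hypothetical, since no concrete presubmit runtime limit is stated in this thread.

```python
from dataclasses import dataclass
from datetime import timedelta

# Upper bound permitted by the release-blocking guidelines, per the message above.
RELEASE_BLOCKING_MAX_RUNTIME = timedelta(hours=2)


@dataclass
class JobProfile:
    """Hypothetical summary of a CI job, for illustration only."""
    release_blocking: bool        # already on the release-blocking board
    catches_breakage_often: bool  # its failures usually point at real bugs
    runtime: timedelta            # typical wall-clock duration


def is_presubmit_candidate(job: JobProfile, presubmit_max_runtime: timedelta) -> bool:
    """Release blocking + catches real breakage often + fast enough for presubmit."""
    if presubmit_max_runtime >= RELEASE_BLOCKING_MAX_RUNTIME:
        raise ValueError("presubmit limit should be stricter than the 2h guideline")
    return (job.release_blocking
            and job.catches_breakage_often
            and job.runtime <= presubmit_max_runtime)


# The proposed job runs in under 20 minutes; the 1h limit here is an assumed value.
upgrade_job = JobProfile(release_blocking=True, catches_breakage_often=True,
                         runtime=timedelta(minutes=20))
print(is_presubmit_candidate(upgrade_job, presubmit_max_runtime=timedelta(hours=1)))
```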


> flakes do happen, but very rarely, and are either a product of infra
> status or random unexplainable `docker run` (maybe also infra related)
> bugs (it's a DIND setup).

The frequency is the important part, we need people to take presubmit failures seriously.

---

This sounds like a good thing to cover, but I also think we should get it into release blocking first.

Sometimes we introduce things into release blocking and find that they're not actually breaking often
and the tradeoff is not ideal to introduce to presubmit then, but we still guarantee we don't ship those bugs to release.


Lubomir I. Ivanov

Nov 1, 2024, 2:39:08 PM
to Benjamin Elder, Jordan Liggitt, kubernetes-sig-release, Antonio Ojea, Davanum Srinivas, kubernetes-sig-testing, sig-cluste...@kubernetes.io
On Fri, 1 Nov 2024 at 18:48, Benjamin Elder <benth...@google.com> wrote:
>
> > are existing PR blocking jobs also on the release-blocking board?
>
> generally yes, if there's anywhere currently missing that, we'd probably consider it legacy tech-debt
>
> PR blocking is a stricter subset of release blocking, we wrote some solid guidelines for release blocking and generally presubmit blocking is just release blocking + release-blocking is catching actual breakage often + we can run it even faster than the release blocking guidelines (which permit up to 2h, which is far too slow for a presubmit)
>

one caveat is that the job i'm proposing as PR blocking is dry-running
the conformance e2e suite, but that would overlap with the kind e2e
presubmit anyway.
this will be only tested in kinder upgrade periodics.

> > flakes do happen, but very rarely, and are either a product of infra
> status or random unexplainable `docker run` (maybe also infra related)
> bugs (it's a DIND setup).
>
> The frequency is the important part, we need people to take presubmit failures seriously.

there are certainly ways to resolve even some of these odd, rare,
"docker run" flakes with retries.
but they are so rare that we have not invested the effort yet.

>
> This sounds like a good thing to cover, but I also think we should get it into release blocking first.
>
> Sometimes we introduce things into release blocking and find that they're not actually breaking often
> and the tradeoff is not ideal to introduce to presubmit then, but we still guarantee we don't ship those bugs to release.
>

here are the PRs to promote the 3 kubeadm jobs to release-blocking:
https://github.com/kubernetes/kubeadm/pull/3119
https://github.com/kubernetes/test-infra/pull/33746

are there blockers to promote them now?
they have been in release-informing since the concept of
informing/blocking was incepted, pretty much.

Lubomir I. Ivanov

Nov 1, 2024, 5:55:40 PM
to Benjamin Elder, Jordan Liggitt, kubernetes-sig-release, Antonio Ojea, Davanum Srinivas, kubernetes-sig-testing, sig-cluste...@kubernetes.io
i will leave the test-infra PRs for a few days for folks to consider
the promotion of the 3 kubeadm jobs to release-blocking.
let's discuss further if needed.

plan B is to demote the 3 kubeadm periodics away from release-informing
since at the end of the day we get email notifications and triage the failures
ASAP anyhow (@pacoxu can confirm). also, we can set up an external CI
runner / commenter that checks a given k/k PR and reports failures,
which the maintainers can trigger on demand. github actions grant us
this flexibility and it would be the equivalent of the k/k `/test ...`
command.

lubomir
--

Antonio Ojea

Nov 2, 2024, 5:40:23 AM
to Lubomir I. Ivanov, Benjamin Elder, Jordan Liggitt, kubernetes-sig-release, Davanum Srinivas, kubernetes-sig-testing, sig-cluste...@kubernetes.io
I'm +1 on the periodics promotion if the requirements are met (I didn't check myself)

I think we can table the presubmit discussion for later 

Antonio Ojea

Nov 4, 2024, 7:08:52 AM
to Lubomir I. Ivanov, Benjamin Elder, Jordan Liggitt, kubernetes-sig-release, Davanum Srinivas, kubernetes-sig-testing, sig-cluste...@kubernetes.io
Submitted https://github.com/kubernetes/community/pull/8136 to try to
summarize my thoughts on the topic.
I also added what I think is a good strategy that allows us to scale and
makes development sustainable. Area or feature owners can implement
presubmit "soft-blocking" jobs by using the OWNERS feature and
non-blocking presubmit jobs that run only when the code of that
feature/area is modified.