There are probably a few more places where we need to follow up on this; I know https://github.com/kubernetes-sigs/kind/blob/69e0c92e79ef897307cb7ad3d7cc3b7c2886fd4f/hack/ci/e2e-k8s.sh#L144-L147 is one.
_brace for flakes_
Thanks for doing this Aaron :-)
On Tue, Dec 10, 2019 at 12:05 PM Caleb Miles <cmi...@pivotal.io> wrote:
This is absolutely amazing progress. Great work, and I'm so happy to see these investments in test health!
On Tue, Dec 10, 2019 at 10:22 AM Aaron Crickenberger <spi...@gmail.com> wrote:
tl;dr At the end of this week, some jobs may flake more often; see the resources at [1] for how to identify and deal with flakes.

Early on in the project's history, we modified ginkgo to automatically retry a test case N times if it failed; thus was born the --ginkgo.flakeAttempts=2 flag passed to many of our e2e jobs. This has allowed flakes to hide in some jobs, while causing merge-blocking or release-blocking pain in others.

Over the last few months we've slowly removed the flag's usage from all release-blocking jobs. Now that 1.17 has been released, it's time for us to rip the bandaid off and remove it from all remaining jobs that use it. If a test fails, the job will fail.

This will allow us to use the relatively quiet period of the next month to gather more data on which flakes are really impacting the project. For more discussion and details, see the issue [2]. The list of jobs impacted is in the issue [3]. The PR to remove this flag [4] will be merged by Friday, December 13th unless there are strong objections.

For those interested in helping out with flakes, or with questions about what to do with flakes, please see [1]. It's time for us as a community to figure out how to deal with flakes more effectively; this is a first step to help us ascertain the scope of what we're dealing with.

- aaron
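To make the mechanics concrete, here's a minimal sketch in Go of the retry semantics behind --ginkgo.flakeAttempts (illustrative only, not ginkgo's actual implementation): with flakeAttempts=2, a test that fails once and then passes still lets the job go green, which is exactly how flakes stayed hidden.

package main

import "fmt"

// runWithFlakeAttempts sketches the behavior behind --ginkgo.flakeAttempts=N:
// a test case is re-run until it passes or N attempts are exhausted. A pass on
// any attempt keeps the job green, so a flaky test never fails the job.
func runWithFlakeAttempts(attempts int, test func() error) (passed, flaked bool) {
    for i := 0; i < attempts; i++ {
        if err := test(); err == nil {
            return true, i > 0 // passed; "flaked" if it needed a retry
        }
    }
    return false, false // failed every attempt: the job fails
}

func main() {
    calls := 0
    flakyTest := func() error { // fails on the first call, passes on the second
        calls++
        if calls == 1 {
            return fmt.Errorf("transient failure")
        }
        return nil
    }
    passed, flaked := runWithFlakeAttempts(2, flakyTest)
    fmt.Printf("passed=%v flaked=%v\n", passed, flaked) // passed=true flaked=true
    // With flakeAttempts=1 (the behavior after this change), the same test
    // would fail the job on its first failure, surfacing the flake for triage.
}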
FYI, this merged at 5pm PT on Friday: https://github.com/kubernetes/test-infra/pull/15516
- aaron
While I'm 100% supportive of doing this (I was also flagging it as a problem a long time ago), I'm not entirely sure we're ready for it. Unless I was extremely unlucky, my PR required 17 retests to get merged (and I'm 100% sure it wasn't because of my PR). With numbers like this, we either need to prioritize fixing the flakes we're facing, or move the most flaky tests out of the blocking suite (and file P0 issues to fix them and move them back to release-blocking).
On Mon, Dec 16, 2019 at 2:54 PM 'Wojciech Tyczynski' via Kubernetes developer/contributor discussion <kuberne...@googlegroups.com> wrote:
> With numbers like this, we either need to prioritize fixing the flakes we're facing, or move the most flaky tests out of the blocking suite (and file P0 issues to fix them and move them back to release-blocking).

We are doing exactly that right now (marking the top flaking tests as flaky to make them non-blocking, filing critical release-blocking issues in the milestone, assigning them to SIGs, etc.).
We turned off flake retries on Friday, December 13th, 2019, and I wanted to update you on what’s happened since then.
Looking at the merge-blocking job-health dashboard, 3 weeks before Dec 13 vs. 3 weeks since Dec 13:
pull-kubernetes-e2e-gce’s daily failure rate went from an average of 23% to a max of 80%; we’ve since brought it down to an average of 61%
pull-kubernetes-e2e-gce’s daily flake rate went from an average of 7% to a max of 40%; we’ve since brought it down to an average of 27%
The equivalent CI job’s metrics have improved to a <20% failure rate and a <15% flake rate, and we’ve moved it back to release-blocking
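For the two metrics above, here's a minimal sketch of how they might be computed, assuming the usual definitions (failure rate: fraction of runs that fail; flake rate: fraction of revisions that both failed and passed, i.e. the same code failed once and then passed on retest). The Run type and its fields are hypothetical, and the definitions are a paraphrase rather than the dashboard's exact ones:

package main

import "fmt"

// Run is one execution of a CI job against a particular revision of the code.
type Run struct {
    Revision string // hypothetical field identifying the code under test
    Passed   bool
}

// rates sketches one plausible way to compute the two dashboard metrics:
// failure rate = failed runs / total runs
// flake rate   = revisions that both failed and passed / distinct revisions
func rates(runs []Run) (failureRate, flakeRate float64) {
    failedRuns := 0
    sawPass := map[string]bool{}
    sawFail := map[string]bool{}
    revisions := map[string]bool{}
    for _, r := range runs {
        revisions[r.Revision] = true
        if r.Passed {
            sawPass[r.Revision] = true
        } else {
            failedRuns++
            sawFail[r.Revision] = true
        }
    }
    flaky := 0
    for rev := range sawFail {
        if sawPass[rev] { // same code failed once and passed later: a flake
            flaky++
        }
    }
    return float64(failedRuns) / float64(len(runs)), float64(flaky) / float64(len(revisions))
}

func main() {
    runs := []Run{
        {"abc", false}, {"abc", true}, // failed, then passed on /retest: a flake
        {"def", true},
        {"ghi", false}, {"ghi", false}, // a consistent failure, not a flake
    }
    f, fl := rates(runs)
    fmt.Printf("failure rate %.0f%%, flake rate %.0f%%\n", f*100, fl*100)
    // 3 of 5 runs failed (60%); 1 of 3 revisions flaked (33%)
}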
16 new kind/flake issues have been created in kubernetes/kubernetes, bringing us to 35 open flake issues
We’ve uncovered some real bugs thanks to this effort (thanks @liggitt)
https://github.com/kubernetes/kubernetes/issues/86312 - we uncovered a race condition in runc
upstream issue https://github.com/opencontainers/runc/issues/2183
upstream fix https://github.com/opencontainers/runc/pull/2185 merged
waiting for a release candidate to be cut; the fix must then propagate downstream
https://github.com/kubernetes/kubernetes/issues/86417 - namespace deletion performance issues
https://github.com/kubernetes/kubernetes/pull/86320 - (fixed) incorrect pod status on recreated pods
We’ve also fixed a number of flakes:
https://github.com/kubernetes/kubernetes/issues/86179 (thanks @soltysh)
https://github.com/kubernetes/kubernetes/issues/86317 (thanks @gnufied)
And quarantined others while we root-cause:
https://github.com/kubernetes/kubernetes/issues/86181 (thanks @msau42)
https://github.com/kubernetes/kubernetes/issues/86068 (thanks @Huang-Wei)
Finally, we’ve tweaked tooling a bit to help with flake hunting:
go.k8s.io/triage now supports exclude regexes
See the top 10 (was 3) flakes over the last week for all PR jobs at http://storage.googleapis.com/k8s-metrics/flakes-latest.json (updated daily at ~6pm PST)
http://velodrome.k8s.io/ defaults to the “job-health (merge-blocking)” dashboard
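To illustrate what exclude regexes buy you when flake hunting, here's a minimal sketch of the concept in Go: filter already-triaged failure messages out of a report so that new flakes stand out. This is the idea only, not go.k8s.io/triage's actual code or query parameters:

package main

import (
    "fmt"
    "regexp"
)

// filterFailures sketches the idea behind triage's exclude regexes: drop
// failure messages matching a known/already-triaged pattern so that new
// flakes stand out. Illustrative only; not go.k8s.io/triage's code.
func filterFailures(failures []string, exclude *regexp.Regexp) []string {
    var kept []string
    for _, f := range failures {
        if !exclude.MatchString(f) {
            kept = append(kept, f)
        }
    }
    return kept
}

func main() {
    failures := []string{
        "timed out waiting for pod to be running",
        "namespace e2e-tests-foo not deleted within 5m", // say this one is already quarantined
    }
    exclude := regexp.MustCompile(`namespace .* not deleted`)
    for _, f := range filterFailures(failures, exclude) {
        fmt.Println(f) // only the un-triaged failure remains
    }
}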
This was just me writing things down as a human a few weeks out from a change we deployed. It’s not my intent to replace the release team’s CI Signal report, nor to continue with this format regularly. If you have suggestions on what you’d like to see here, or on where this info should live instead, let us know!
On Mon, Dec 16, 2019 at 8:59 PM Jordan Liggitt <lig...@google.com> wrote:
> We are doing exactly that right now (marking the top flaking tests as flaky to make them non-blocking, filing critical release-blocking issues in the milestone, assigning them to SIGs, etc.).

Awesome - thanks a lot Jordan! And we really need to do that ~now, otherwise it's going to be extremely painful and unproductive (e.g. I don't know if it's just me, but I'm not making even a final review pass until tests are passing).

Thanks,
wojtek