[NOTICE] No more ginkgo.flakeAttempts=2 for e2e tests as of 2019-12-13

Aaron Crickenberger

unread,

Dec 10, 2019, 1:22:13 PM12/10/19

to Kubernetes developer/contributor discussion, kubernetes-sig-leads

tl;dr At the end of this week, some jobs may flake more often, see resources at [1] for how to identify and deal with flakes.

Early on in the project's history, we modified ginkgo to automatically retry a test case N number of times if it failed; thus was born the --ginkgo.flakeAttempts=2 flag passed around to many of our e2e jobs. This has allowed flakes to hide in some jobs, while causing merge-blocking or release-blocking pain in others.

Over the last few months we've slowly removed the flag's usage from all release-blocking jobs. Now that 1.17 has been released, it's time for us to rip the bandaid off and remove it from all remaining jobs that use it. If a test fails, the job will fail.

This will allow us to use the relatively quiet period of the next month to gather more data on which flakes are really impacting the project. For more discussion and details, see the issue [2]

The list of jobs impacted is listed in the issue [3]

The PR to remove this flag [4], will be merged by Friday December 13th unless there are strong objections

For those interested in helping out with flakes, or questions about what to do with flakes, please see [1]. It's time for us as a community to figure out how to more effectively deal with flakes, this is a first step to help us ascertain the scope of what we're dealing with.

- aaron

[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md

[2] https://github.com/kubernetes/kubernetes/issues/68091

[3] https://github.com/kubernetes/kubernetes/issues/68091#issuecomment-562709620

[4] https://github.com/kubernetes/test-infra/pull/15516

Caleb Miles

unread,

Dec 10, 2019, 3:05:57 PM12/10/19

to Aaron Crickenberger, Kubernetes developer/contributor discussion, kubernetes-sig-leads

This is absolutely amazing progress. Great work, and I'm so happy to see these investments in test health!

--
You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/CALOex7sOwOhUYPk8uUF-q6YWNeivmBQ8HSJBRUr%3DAr4bbqDe0w%40mail.gmail.com.

Benjamin Elder

unread,

Dec 10, 2019, 4:24:16 PM12/10/19

to Caleb Miles, Aaron Crickenberger, Kubernetes developer/contributor discussion, kubernetes-sig-leads

There are probably a few more places we need to follow up on this, I know https://github.com/kubernetes-sigs/kind/blob/69e0c92e79ef897307cb7ad3d7cc3b7c2886fd4f/hack/ci/e2e-k8s.sh#L144-L147 is one.

_brace for flakes_

Thanks for doing this Aaron :-)

To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/CACLa4fUEksdjizU%3DA6UAS3JJo2bkLtVanqF-nTD9krVL0EfY2Q%40mail.gmail.com.

Aaron Crickenberger

unread,

Dec 16, 2019, 10:32:33 AM12/16/19

to Kubernetes developer/contributor discussion

FYI, this merged 5pm PT on Friday https://github.com/kubernetes/test-infra/pull/15516

- aaron

On Tuesday, December 10, 2019 at 4:24:16 PM UTC-5, Benjamin Elder wrote:

There are probably a few more places we need to follow up on this, I know https://github.com/kubernetes-sigs/kind/blob/69e0c92e79ef897307cb7ad3d7cc3b7c2886fd4f/hack/ci/e2e-k8s.sh#L144-L147 is one.

_brace for flakes_

Thanks for doing this Aaron :-)

On Tue, Dec 10, 2019 at 12:05 PM Caleb Miles <cmi...@pivotal.io> wrote:

This is absolutely amazing progress. Great work, and I'm so happy to see these investments in test health!

On Tue, Dec 10, 2019 at 10:22 AM Aaron Crickenberger <spi...@gmail.com> wrote:

tl;dr At the end of this week, some jobs may flake more often, see resources at [1] for how to identify and deal with flakes.

Early on in the project's history, we modified ginkgo to automatically retry a test case N number of times if it failed; thus was born the --ginkgo.flakeAttempts=2 flag passed around to many of our e2e jobs. This has allowed flakes to hide in some jobs, while causing merge-blocking or release-blocking pain in others.

Over the last few months we've slowly removed the flag's usage from all release-blocking jobs. Now that 1.17 has been released, it's time for us to rip the bandaid off and remove it from all remaining jobs that use it. If a test fails, the job will fail.

This will allow us to use the relatively quiet period of the next month to gather more data on which flakes are really impacting the project. For more discussion and details, see the issue [2]

The list of jobs impacted is listed in the issue [3]

The PR to remove this flag [4], will be merged by Friday December 13th unless there are strong objections

For those interested in helping out with flakes, or questions about what to do with flakes, please see [1]. It's time for us as a community to figure out how to more effectively deal with flakes, this is a first step to help us ascertain the scope of what we're dealing with.

- aaron

[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md
[2] https://github.com/kubernetes/kubernetes/issues/68091
[3] https://github.com/kubernetes/kubernetes/issues/68091#issuecomment-562709620
[4] https://github.com/kubernetes/test-infra/pull/15516

--
You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.

To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-dev+unsubscribe@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/CALOex7sOwOhUYPk8uUF-q6YWNeivmBQ8HSJBRUr%3DAr4bbqDe0w%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.

To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-dev+unsubscribe@googlegroups.com.

Wojciech Tyczynski

unread,

Dec 16, 2019, 2:54:29 PM12/16/19

to Aaron Crickenberger, Kubernetes developer/contributor discussion

While I'm 100% supportive for doing that (I was also flagging this as a problem long time ago),

I'm not entirely sure we're ready for doing that.

Unless I wasn't extremely unlucky, my PR required 17 retests to get merged (and I'm 100% sure it wasn't because of my PR):

https://github.com/kubernetes/kubernetes/pull/85891

With this number, we either need to prioritize fixing the flakes we're facing or move the most flaky tests out of the

blocking suite (and file P0 issues to fix them and move back and made them release blocking).

And we really need to do that ~now - otherwise that's going to be extremely painful and unproductive

(e.g. I don't know if it's just me, but I'm not making at least final pass of the review until tests are passing).

thanks

wojtek

On Mon, Dec 16, 2019 at 4:32 PM Aaron Crickenberger <spi...@gmail.com> wrote:

FYI, this merged 5pm PT on Friday https://github.com/kubernetes/test-infra/pull/15516

- aaron

On Tuesday, December 10, 2019 at 4:24:16 PM UTC-5, Benjamin Elder wrote:

There are probably a few more places we need to follow up on this, I know https://github.com/kubernetes-sigs/kind/blob/69e0c92e79ef897307cb7ad3d7cc3b7c2886fd4f/hack/ci/e2e-k8s.sh#L144-L147 is one.

_brace for flakes_

Thanks for doing this Aaron :-)

On Tue, Dec 10, 2019 at 12:05 PM Caleb Miles <cmi...@pivotal.io> wrote:

This is absolutely amazing progress. Great work, and I'm so happy to see these investments in test health!

On Tue, Dec 10, 2019 at 10:22 AM Aaron Crickenberger <spi...@gmail.com> wrote:

tl;dr At the end of this week, some jobs may flake more often, see resources at [1] for how to identify and deal with flakes.

Early on in the project's history, we modified ginkgo to automatically retry a test case N number of times if it failed; thus was born the --ginkgo.flakeAttempts=2 flag passed around to many of our e2e jobs. This has allowed flakes to hide in some jobs, while causing merge-blocking or release-blocking pain in others.

Over the last few months we've slowly removed the flag's usage from all release-blocking jobs. Now that 1.17 has been released, it's time for us to rip the bandaid off and remove it from all remaining jobs that use it. If a test fails, the job will fail.

This will allow us to use the relatively quiet period of the next month to gather more data on which flakes are really impacting the project. For more discussion and details, see the issue [2]

The list of jobs impacted is listed in the issue [3]

The PR to remove this flag [4], will be merged by Friday December 13th unless there are strong objections

For those interested in helping out with flakes, or questions about what to do with flakes, please see [1]. It's time for us as a community to figure out how to more effectively deal with flakes, this is a first step to help us ascertain the scope of what we're dealing with.

- aaron

[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md
[2] https://github.com/kubernetes/kubernetes/issues/68091
[3] https://github.com/kubernetes/kubernetes/issues/68091#issuecomment-562709620
[4] https://github.com/kubernetes/test-infra/pull/15516

--
You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.

To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/CALOex7sOwOhUYPk8uUF-q6YWNeivmBQ8HSJBRUr%3DAr4bbqDe0w%40mail.gmail.com.

--
You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.

To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/CACLa4fUEksdjizU%3DA6UAS3JJo2bkLtVanqF-nTD9krVL0EfY2Q%40mail.gmail.com.

--

You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.

To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/d7167b67-a294-43b9-b5db-6fc5538b4aa9%40googlegroups.com.

Jordan Liggitt

unread,

Dec 16, 2019, 2:59:20 PM12/16/19

to Wojciech Tyczynski, Aaron Crickenberger, Kubernetes developer/contributor discussion

On Mon, Dec 16, 2019 at 2:54 PM 'Wojciech Tyczynski' via Kubernetes developer/contributor discussion <kuberne...@googlegroups.com> wrote:

While I'm 100% supportive for doing that (I was also flagging this as a problem long time ago),
I'm not entirely sure we're ready for doing that.

Unless I wasn't extremely unlucky, my PR required 17 retests to get merged (and I'm 100% sure it wasn't because of my PR):
https://github.com/kubernetes/kubernetes/pull/85891
With this number, we either need to prioritize fixing the flakes we're facing or move the most flaky tests out of the
blocking suite (and file P0 issues to fix them and move back and made them release blocking).

We are doing exactly that right now (marking the top flaking tests as flaky to make them non-blocking, filing critical release-blocking issues in the milestone and assigning to sigs, etc).

Wojciech Tyczynski

unread,

Dec 16, 2019, 3:00:27 PM12/16/19

to Jordan Liggitt, Aaron Crickenberger, Kubernetes developer/contributor discussion

On Mon, Dec 16, 2019 at 8:59 PM Jordan Liggitt <lig...@google.com> wrote:

On Mon, Dec 16, 2019 at 2:54 PM 'Wojciech Tyczynski' via Kubernetes developer/contributor discussion <kuberne...@googlegroups.com> wrote:
While I'm 100% supportive for doing that (I was also flagging this as a problem long time ago),
I'm not entirely sure we're ready for doing that.

Unless I wasn't extremely unlucky, my PR required 17 retests to get merged (and I'm 100% sure it wasn't because of my PR):
https://github.com/kubernetes/kubernetes/pull/85891
With this number, we either need to prioritize fixing the flakes we're facing or move the most flaky tests out of the
blocking suite (and file P0 issues to fix them and move back and made them release blocking).

We are doing exactly that right now (marking the top flaking tests as flaky to make them non-blocking, filing critical release-blocking issues in the milestone and assigning to sigs, etc).

Awesome - thanks a lot Jordan!

Aaron Crickenberger

unread,

Jan 7, 2020, 5:57:09 PM1/7/20

to Kubernetes developer/contributor discussion

We turned off flake retries Friday December 13th 2019, and I wanted to update you on what’s happened since then

Looking at the merge-blocking job health dashboard 3 weeks before dec 13 to 3 weeks since dec 13

pull-kubernetes-e2e-gce’s daily failure rate went from an average of 23% to a max of 80%, we’ve since brought it down to an average of 61%
pull-kubernetes-e2e-gce’s daily flake rate went from an average of 7% to a max of 40%, we’ve since brought it down to an average of 27%

The equivalent CI job’s metrics have improved to <20% failure rate, <15% flake rate, and we’ve moved it back to release-blocking

16 new kind/flake issues have been created in kubernetes/kubernetes, bringing us to 35 open flake issues

We’ve uncovered some real bugs thanks to this effort (thanks @liggitt)

https://github.com/kubernetes/kubernetes/issues/86312 - we uncovered a race condition in runc

upstream issue https://github.com/opencontainers/runc/issues/2183
upstream fix https://github.com/opencontainers/runc/pull/2185 merged
waiting for release candidate to be cut, and must then propagate downstream

https://github.com/kubernetes/kubernetes/issues/86417 - namespace deletion performance issues
https://github.com/kubernetes/kubernetes/pull/86320 - (fixed) incorrect pod status on recreated pods

As well as fixed a number of flakes:

https://github.com/kubernetes/kubernetes/issues/86179 (thanks @soltysh)
https://github.com/kubernetes/kubernetes/issues/86317 (thanks @gnufied)

And quarantined others while we root-cause:

https://github.com/kubernetes/kubernetes/issues/86181 (thanks @msau42)
https://github.com/kubernetes/kubernetes/issues/86068 (thanks @Huang-Wei)

Finally, we’ve tweaked tooling a bit to help with flake hunting:

New issue template for flaking tests
go.k8s.io/triage now supports exclude regexes

e.g. Test failures across all jobs, including PR jobs, only jobs matching regex ‘gce’, excluding tests matching regex ‘sig-storage|^(Up|Test)$’

See top 10 (was 3) flakes over the last week for all PR jobs at http://storage.googleapis.com/k8s-metrics/flakes-latest.json (Updated daily at ~6pm PST)
http://velodrome.k8s.io/ defaults to the “job-health (merge-blocking)” dashboard

This was just me writing things down as a human a few weeks out from a change we deployed. It’s not my intent to replace the release team’s CI Signal report, nor continue with this format regularly. If you have suggestions on what you’d like to see here, or where this info should live instead, let us know!

- aaron

On Monday, December 16, 2019 at 12:00:27 PM UTC-8, Wojciech Tyczynski wrote:

On Mon, Dec 16, 2019 at 8:59 PM Jordan Liggitt <lig...@google.com> wrote:

Stephen Augustus

unread,

Jan 7, 2020, 6:16:14 PM1/7/20

to Aaron Crickenberger, Kubernetes developer/contributor discussion

Amazing work, everyone!!

On Mon, Dec 16, 2019 at 2:54 PM 'Wojciech Tyczynski' via Kubernetes developer/contributor discussion <kuberne...@googlegroups.com> wrote:
While I'm 100% supportive for doing that (I was also flagging this as a problem long time ago),
I'm not entirely sure we're ready for doing that.

Unless I wasn't extremely unlucky, my PR required 17 retests to get merged (and I'm 100% sure it wasn't because of my PR):
https://github.com/kubernetes/kubernetes/pull/85891
With this number, we either need to prioritize fixing the flakes we're facing or move the most flaky tests out of the
blocking suite (and file P0 issues to fix them and move back and made them release blocking).

We are doing exactly that right now (marking the top flaking tests as flaky to make them non-blocking, filing critical release-blocking issues in the milestone and assigning to sigs, etc).

Awesome - thanks a lot Jordan!

And we really need to do that ~now - otherwise that's going to be extremely painful and unproductive
(e.g. I don't know if it's just me, but I'm not making at least final pass of the review until tests are passing).

thanks
wojtek

--

You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/1d208c6a-6882-4dc2-90b4-5dee80569a91%40googlegroups.com.

Jago Macleod

unread,

Jan 7, 2020, 6:19:52 PM1/7/20

to Stephen Augustus, Aaron Crickenberger, Kubernetes developer/contributor discussion

So awesome - thanks for all the work all, and thanks for sharing.

To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/CAOqU-DTmREhjw5CPa4yEjVzMTSzBMM1ad_d3mcfxmpnYfDtH0w%40mail.gmail.com.

John Belamaric

unread,

Jan 7, 2020, 6:22:12 PM1/7/20

to Aaron Crickenberger, Kubernetes developer/contributor discussion

Awesome, this is really badly needed work and great to see it moving forward quickly (over the holidays too!).

On Mon, Dec 16, 2019 at 2:54 PM 'Wojciech Tyczynski' via Kubernetes developer/contributor discussion <kuberne...@googlegroups.com> wrote:
While I'm 100% supportive for doing that (I was also flagging this as a problem long time ago),
I'm not entirely sure we're ready for doing that.

Unless I wasn't extremely unlucky, my PR required 17 retests to get merged (and I'm 100% sure it wasn't because of my PR):
https://github.com/kubernetes/kubernetes/pull/85891
With this number, we either need to prioritize fixing the flakes we're facing or move the most flaky tests out of the
blocking suite (and file P0 issues to fix them and move back and made them release blocking).

We are doing exactly that right now (marking the top flaking tests as flaky to make them non-blocking, filing critical release-blocking issues in the milestone and assigning to sigs, etc).

Awesome - thanks a lot Jordan!

And we really need to do that ~now - otherwise that's going to be extremely painful and unproductive
(e.g. I don't know if it's just me, but I'm not making at least final pass of the review until tests are passing).

thanks
wojtek

--

You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/1d208c6a-6882-4dc2-90b4-5dee80569a91%40googlegroups.com.

Reply all

Reply to author

Forward