[NOTICE] No more ginkgo.flakeAttempts=2 for e2e tests as of 2019-12-13


Aaron Crickenberger

Dec 10, 2019, 1:22:13 PM
to Kubernetes developer/contributor discussion, kubernetes-sig-leads
tl;dr At the end of this week, some jobs may flake more often; see the resources at [1] for how to identify and deal with flakes.

Early on in the project's history, we modified ginkgo to automatically retry a failed test case up to N times; thus was born the --ginkgo.flakeAttempts=2 flag passed to many of our e2e jobs.  This has allowed flakes to hide in some jobs, while causing merge-blocking or release-blocking pain in others.
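
For anyone who hasn't looked at the flag closely, here's a minimal sketch of what flakeAttempts does, written as a standalone Ginkgo v1 suite rather than one of our actual e2e specs (the package, spec names, and counter are purely illustrative; setting the config field below is equivalent to passing --ginkgo.flakeAttempts=2 to a built test binary):

```go
package flakeexample_test

import (
	"testing"

	. "github.com/onsi/ginkgo"
	"github.com/onsi/ginkgo/config"
	. "github.com/onsi/gomega"
)

// attempts persists across retries of the same spec within a single run.
var attempts int

var _ = Describe("a flaky spec", func() {
	It("fails on the first attempt and passes on the retry", func() {
		attempts++
		// Fails when attempts == 1, passes on the second attempt.
		Expect(attempts).To(BeNumerically(">", 1))
	})
})

func TestFlakeExample(t *testing.T) {
	RegisterFailHandler(Fail)
	// Equivalent to passing --ginkgo.flakeAttempts=2 on the command line:
	// a failed spec is re-run, and if it then passes it is reported as
	// "Flaked" instead of failing the suite.
	config.GinkgoConfig.FlakeAttempts = 2
	RunSpecs(t, "flakeAttempts example")
}
```

With the default of one attempt this suite fails; with the setting above, the first failure is retried, the spec passes on the second attempt, and the run is reported as passing with one flaked spec - which is exactly how real flakes have been hiding.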

Over the last few months we've slowly removed the flag's usage from all release-blocking jobs.  Now that 1.17 has been released, it's time for us to rip the bandaid off and remove it from all remaining jobs that use it.  If a test fails, the job will fail.

This will allow us to use the relatively quiet period of the next month to gather more data on which flakes are really impacting the project.  For more discussion and details, see the issue [2].

The impacted jobs are listed in the issue [3].

The PR to remove this flag [4] will be merged by Friday, December 13th, unless there are strong objections.

For those interested in helping out with flakes, or with questions about what to do with them, please see [1].  It's time for us as a community to figure out how to deal with flakes more effectively; this is a first step to help us ascertain the scope of what we're dealing with.

- aaron

Caleb Miles

Dec 10, 2019, 3:05:57 PM
to Aaron Crickenberger, Kubernetes developer/contributor discussion, kubernetes-sig-leads
This is absolutely amazing progress. Great work, and I'm so happy to see these investments in test health!


Benjamin Elder

Dec 10, 2019, 4:24:16 PM
to Caleb Miles, Aaron Crickenberger, Kubernetes developer/contributor discussion, kubernetes-sig-leads
There are probably a few more places we need to follow up on this; I know https://github.com/kubernetes-sigs/kind/blob/69e0c92e79ef897307cb7ad3d7cc3b7c2886fd4f/hack/ci/e2e-k8s.sh#L144-L147 is one.

_brace for flakes_

Thanks for doing this Aaron :-)

Aaron Crickenberger

Dec 16, 2019, 10:32:33 AM
to Kubernetes developer/contributor discussion
FYI, this merged at 5pm PT on Friday: https://github.com/kubernetes/test-infra/pull/15516

- aaron



Wojciech Tyczynski

Dec 16, 2019, 2:54:29 PM
to Aaron Crickenberger, Kubernetes developer/contributor discussion
While I'm 100% supportive of doing this (I was also flagging it as a problem a long time ago),
I'm not entirely sure we're ready for it.

Unless I was just extremely unlucky, my PR required 17 retests to get merged (and I'm 100% sure it wasn't because of my PR):
With numbers like this, we either need to prioritize fixing the flakes we're facing or move the most flaky tests out of the
blocking suite (and file P0 issues to fix them, move them back, and make them release-blocking again).
And we really need to do that ~now - otherwise this is going to be extremely painful and unproductive
(e.g. I don't know if it's just me, but I don't do even a final pass of review until tests are passing).

 thanks
wojtek 


Jordan Liggitt

Dec 16, 2019, 2:59:20 PM
to Wojciech Tyczynski, Aaron Crickenberger, Kubernetes developer/contributor discussion

We are doing exactly that right now (marking the top flaking tests as flaky to make them non-blocking, filing critical release-blocking issues in the milestone, assigning them to SIGs, etc.).
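
For anyone wondering what "marking a test as flaky" means mechanically, here is a rough sketch assuming the usual e2e convention (the spec below is illustrative, not an actual test from the suite): the tag is just text in the spec's description, and blocking jobs exclude it with a skip regex while non-blocking jobs keep running it.

```go
package quarantine_test

import (
	. "github.com/onsi/ginkgo"
)

// The [Flaky] tag is plain text in the spec description. A blocking job can
// then exclude quarantined specs by running the suite with
// --ginkgo.skip=\[Flaky\], while a separate non-blocking job keeps running
// them so we still get signal while the root cause is investigated.
// (Suite bootstrap with RunSpecs omitted; see the earlier sketch.)
var _ = Describe("Some quarantined behaviour", func() {
	It("should eventually converge [Flaky]", func() {
		// test body omitted
	})
})
```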

Wojciech Tyczynski

Dec 16, 2019, 3:00:27 PM
to Jordan Liggitt, Aaron Crickenberger, Kubernetes developer/contributor discussion

Awesome - thanks a lot, Jordan!

Aaron Crickenberger

Jan 7, 2020, 5:57:09 PM
to Kubernetes developer/contributor discussion

We turned off flake retries on Friday, December 13th, 2019, and I wanted to update you on what's happened since then.


We’ve uncovered some real bugs thanks to this effort (thanks @liggitt)


As well as fixed a number of flakes:


And quarantined others while we root-cause:


Finally, we’ve tweaked tooling a bit to help with flake hunting:


This was just me writing things down as a human a few weeks out from a change we deployed.  It's not my intent to replace the release team's CI Signal report, nor to continue with this format regularly. If you have suggestions on what you'd like to see here, or on where this info should live instead, let us know!


- aaron


Stephen Augustus

Jan 7, 2020, 6:16:14 PM
to Aaron Crickenberger, Kubernetes developer/contributor discussion
Amazing work, everyone!!


Jago Macleod

Jan 7, 2020, 6:19:52 PM
to Stephen Augustus, Aaron Crickenberger, Kubernetes developer/contributor discussion
So awesome - thanks to everyone for all the work, and thanks for sharing.

John Belamaric

Jan 7, 2020, 6:22:12 PM
to Aaron Crickenberger, Kubernetes developer/contributor discussion
Awesome, this is really badly needed work and great to see it moving forward quickly (over the holidays too!).
