Namespace cleanup is owned by @kubernetes/sig-api-machinery-test-failures.
Is there a "reasonable" amount of time to wait for an NS? Default is 5 minutes.
For everything but pods (which can hang waiting for graceful deletion), cleanup should be very quick (limited only by load and API QPS throttling by the namespace controller). I'd expect on the order of single digit seconds. Do we know what non-pod resources still exist in these namespaces the test is complaining about?
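For concreteness, here is a minimal sketch (not the actual e2e framework code) of how a test can wait for a namespace to finish deleting, assuming a client-go clientset. The poll interval, namespace name, and timeout below are illustrative:

```go
// Sketch only: poll until a test namespace is fully deleted, assuming a
// client-go clientset. Names and timings are illustrative, not the values
// the e2e framework actually uses.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForNamespaceDeletion returns nil once the namespace is gone (NotFound),
// or an error if it still exists when the timeout expires.
func waitForNamespaceDeletion(cs kubernetes.Interface, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := cs.CoreV1().Namespaces().Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // namespace fully deleted
		}
		if err != nil {
			return false, err // unexpected API error: stop polling
		}
		return false, nil // still terminating: keep polling
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForNamespaceDeletion(cs, "e2e-tests-example", 5*time.Minute); err != nil {
		fmt.Println("namespace not cleaned up in time:", err)
	}
}
```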
This does not seem to be related to networking; the test is checking when the service endpoints are populated in the API server.
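As a rough illustration (again, not the real conformance test), polling for endpoint population against the API server could look like the following; the helper name, count, and timings are made up:

```go
// Sketch only -- not the actual test. Polls the Endpoints object for a
// Service until the expected number of ready addresses shows up in the
// API server.
package sketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForEndpointCount is a hypothetical helper: it succeeds once the
// Service's Endpoints object lists `want` ready addresses, and fails after
// `timeout`.
func waitForEndpointCount(cs kubernetes.Interface, ns, svc string, want int, timeout time.Duration) error {
	return wait.PollImmediate(time.Second, timeout, func() (bool, error) {
		ep, err := cs.CoreV1().Endpoints(ns).Get(context.TODO(), svc, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // Endpoints object not created yet; keep polling
		}
		if err != nil {
			return false, err
		}
		got := 0
		for _, subset := range ep.Subsets {
			got += len(subset.Addresses)
		}
		return got == want, nil
	})
}
```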
Expectations: if this is not the right assignee, please a) /assign it to a more appropriate person and b) /unassign yourself.
There seem to be two major sets of failures:
Can someone from @kubernetes/sig-api-machinery-test-failures take a look?
May be related: https://github.com/kubernetes/kubernetes/pull/45124/files
#45124 only affects the PV e2e tests, not this one
It looks like @bowei from sig-networking is attempting to hand off this issue to @kubernetes/sig-api-machinery-test-failures...
Bowei, when you do this, will you please /unassign any networking people:
/unassign @thockin @dcbw @caseydavenport
And /assign leads for the api-machinery group?
/assign @lavalamp @deads2k
Thanks!
Namespace deletion makes me think about #45304
@fejta -- looks like your comment did the trick
@deads2k can you elaborate? Are you suggesting that #45304, when merged, may resolve this? That you are looking into it? None of the above?
If the problem is slow namespace cleanup, then some of that slowness can be addressed by #45304, which ensures that the cleanup controller won't be throttled during its discovery phase, which is extremely chatty.
Some of the slowness may be graceful deletion timing, but we don't even get to graceful deletion timing without addressing the discovery rate limiting on the namespace controller.
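For background on what "throttled during discovery" means, here is a hedged sketch assuming client-go: the client-side QPS/Burst on the rest.Config governs how fast a discovery client can enumerate API groups, which is the chatty part of namespace cleanup. The numbers and kubeconfig path below are illustrative, and this is not the #45304 change itself.

```go
// Sketch: client-side throttling is governed by QPS/Burst on the rest.Config.
// A discovery client built from a low-QPS config gets rate limited while it
// enumerates every API group/version, which is what makes namespace-cleanup
// discovery "chatty". The values here are illustrative only.
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
	if err != nil {
		panic(err)
	}

	// Raise the client-side limits so discovery calls are not throttled.
	cfg.QPS = 50
	cfg.Burst = 100

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ServerPreferredResources walks every API group, issuing one request per
	// group/version -- the chatty part. It may return partial results plus an
	// error if some groups are unreachable.
	resources, err := dc.ServerPreferredResources()
	if err != nil {
		fmt.Println("partial discovery failure:", err)
	}
	fmt.Println("discovered resource lists:", len(resources))
}
```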
Thanks @deads2k. I propose leaving this issue as-is until #45304 merges. At that point, if the test is still flaky, see the related question in the comments above about a "reasonable time": it appears that test timeouts are a convention rather than an SLO commitment. Unless that interpretation is incorrect, I would propose increasing the timeout to resolve the test flakiness and opening a new issue to investigate graceful deletion timing. (Note: I understand this is sub-optimal. Trying to address flakiness as the top priority; please let me know if this goes against the goals or is a taboo suggestion!)
That's fine with me.
> I would propose increasing the timeout to resolve the test flakiness and opening a new issue to investigate graceful deletion timing. (Note: I understand this is sub-optimal. Trying to address flakiness as the top priority; please let me know if this goes against the goals or is a taboo suggestion!)
I don't have strong feelings on this issue, though I would be annoyed if the queue were closed instead of increasing a timeout.
Sweet! Also check out https://github.com/kubernetes/test-infra/tree/master/metrics.
@cjwagner wrote some infra to provide daily updates to the data used to power the Thursday flaky-test email. Notice from http://storage.googleapis.com/k8s-metrics/flakes/flakes-latest.json that this job is no longer flagged, which seems sufficient to me to close this issue.
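A quick, hedged way to spot-check that file for a given job: the job name below is hypothetical, and nothing is assumed about the JSON schema beyond flagged jobs appearing by name.

```go
// Sketch: fetch flakes-latest.json and search it for a job name. Substitute
// the actual failing job for the hypothetical name below.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	const url = "http://storage.googleapis.com/k8s-metrics/flakes/flakes-latest.json"
	const job = "ci-kubernetes-e2e-gci-gce" // hypothetical job name

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	if strings.Contains(string(body), job) {
		fmt.Println(job, "is still flagged as flaky")
	} else {
		fmt.Println(job, "is no longer flagged")
	}
}
```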
Closed #44791.