Namespace cleanup is owned by @kubernetes/sig-api-machinery-test-failures.
Is there a "reasonable" amount of time to wait for an NS? Default is 5 minutes.
For everything but pods (which can hang waiting for graceful deletion), cleanup should be very quick (limited only by load and API QPS throttling by the namespace controller). I'd expect on the order of single digit seconds. Do we know what non-pod resources still exist in these namespaces the test is complaining about?
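For concreteness, here is a minimal sketch (not the actual e2e framework code) of how a test can wait for a namespace to finish deleting, assuming a client-go clientset. The poll interval, namespace name, and timeout below are illustrative:

```go
// Sketch only: poll until a test namespace is fully deleted, assuming a
// client-go clientset. Names and timings are illustrative, not the values
// the e2e framework actually uses.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForNamespaceDeletion returns nil once the namespace is gone (NotFound),
// or an error if it still exists when the timeout expires.
func waitForNamespaceDeletion(cs kubernetes.Interface, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := cs.CoreV1().Namespaces().Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // namespace fully deleted
		}
		if err != nil {
			return false, err // unexpected API error: stop polling
		}
		return false, nil // still terminating: keep polling
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForNamespaceDeletion(cs, "e2e-tests-example", 5*time.Minute); err != nil {
		fmt.Println("namespace not cleaned up in time:", err)
	}
}
```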
This does not seem to be related to networking; the test is checking when the service endpoints are populated in the API server.
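As a rough illustration (again, not the real conformance test), polling for endpoint population against the API server could look like the following; the helper name, count, and timings are made up:

```go
// Sketch only -- not the actual test. Polls the Endpoints object for a
// Service until the expected number of ready addresses shows up in the
// API server.
package sketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForEndpointCount is a hypothetical helper: it succeeds once the
// Service's Endpoints object lists `want` ready addresses, and fails after
// `timeout`.
func waitForEndpointCount(cs kubernetes.Interface, ns, svc string, want int, timeout time.Duration) error {
	return wait.PollImmediate(time.Second, timeout, func() (bool, error) {
		ep, err := cs.CoreV1().Endpoints(ns).Get(context.TODO(), svc, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // Endpoints object not created yet; keep polling
		}
		if err != nil {
			return false, err
		}
		got := 0
		for _, subset := range ep.Subsets {
			got += len(subset.Addresses)
		}
		return got == want, nil
	})
}
```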
Expectations: if this is not the right assignee, please a) /assign it to a more appropriate person and b) /unassign yourself.
There seem to be two major sets of failures:
Can someone from @kubernetes/sig-api-machinery-test-failures take a look?
May be related: https://github.com/kubernetes/kubernetes/pull/45124/files
#45124 only affects the PV e2e tests, not this one
It looks like @bowei from sig-networking is attempting to hand off this issue to @kubernetes/sig-api-machinery-test-failures...
Bowei, when you do this, will you please /unassign any networking people:
/unassign @thockin @dcbw @caseydavenport
And /assign leads for the api-machinery group?
/assign @lavalamp @deads2k
Thanks!
Namespace deletion makes me think about #45304
@fejta -- looks like your comment did the trick
@deads2k can you elaborate? Are you suggesting that #45304, when merged, may resolve this? That you are looking into it? None of the above?
If the problem is slow namespace cleanup, then some of that slowness can be addressed by #45304, which ensures that the cleanup controller won't be throttled during its discovery phase, which is extremely chatty.
Some of the slowness may be graceful deletion timing, but we don't even get to graceful deletion timing without addressing the discovery rate limiting on the namespace controller.
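For background on what "throttled during discovery" means, here is a hedged sketch assuming client-go: the client-side QPS/Burst on the rest.Config governs how fast a discovery client can enumerate API groups, which is the chatty part of namespace cleanup. The numbers and kubeconfig path below are illustrative, and this is not the #45304 change itself.

```go
// Sketch: client-side throttling is governed by QPS/Burst on the rest.Config.
// A discovery client built from a low-QPS config gets rate limited while it
// enumerates every API group/version, which is what makes namespace-cleanup
// discovery "chatty". The values here are illustrative only.
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
	if err != nil {
		panic(err)
	}

	// Raise the client-side limits so discovery calls are not throttled.
	cfg.QPS = 50
	cfg.Burst = 100

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ServerPreferredResources walks every API group, issuing one request per
	// group/version -- the chatty part. It may return partial results plus an
	// error if some groups are unreachable.
	resources, err := dc.ServerPreferredResources()
	if err != nil {
		fmt.Println("partial discovery failure:", err)
	}
	fmt.Println("discovered resource lists:", len(resources))
}
```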
Thanks @deads2k. I propose leaving this issue as-is until #45304 merges. At that point, if the test is still flaky, see the related question in the comments above about a "reasonable time": it appears that test timeouts are a convention rather than an SLO commitment. Unless that interpretation is incorrect, I would propose increasing the timeout to resolve the test flakiness and opening a new issue to investigate graceful deletion timing. (Note: I understand this is sub-optimal. Trying to address flakiness as the top priority; please let me know if this goes against the goals or is a taboo suggestion!)
That's fine with me.
> I would propose increasing the timeout to resolve the test flakiness and opening a new issue to investigate graceful deletion timing. (Note: I understand this is sub-optimal. Trying to address flakiness as the top priority; please let me know if this goes against the goals or is a taboo suggestion!)
I don't have strong feelings on this issue, though I would be annoyed if the queue were closed instead of increasing a timeout.
Sweet! Also check out https://github.com/kubernetes/test-infra/tree/master/metrics.
@cjwagner wrote some infra to provide daily updates to the data used to power the Thursday flaky-test email. Notice from http://storage.googleapis.com/k8s-metrics/flakes/flakes-latest.json that this job is no longer flagged, which seems sufficient to me to close this issue.
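A quick, hedged way to spot-check that file for a given job: the job name below is hypothetical, and nothing is assumed about the JSON schema beyond flagged jobs appearing by name.

```go
// Sketch: fetch flakes-latest.json and search it for a job name. Substitute
// the actual failing job for the hypothetical name below.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	const url = "http://storage.googleapis.com/k8s-metrics/flakes/flakes-latest.json"
	const job = "ci-kubernetes-e2e-gci-gce" // hypothetical job name

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	if strings.Contains(string(body), job) {
		fmt.Println(job, "is still flagged as flaky")
	} else {
		fmt.Println(job, "is no longer flagged")
	}
}
```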
Closed #44791.