[sig-api-machinery] Aggregator Should be able to support the 1.7 Sample API Server using the current Aggregator
More nuanced logs and details can be found at https://storage.googleapis.com/k8s-gubernator/triage/index.html?sig=api-machinery&job=ci-kubernetes-e2e-gci-gce&test=Aggregator%20Should%20be%20able%20to%20support%20the%201.7%20Sample%20AP
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:76
gave up waiting for apiservice wardle to come up successfully
Expected error:
<*errors.errorString | 0xc420131150>: {
    s: "timed out waiting for the condition",
}
timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:337
/priority failing-test
/priority important-soon
/kind flake
/sig api-machinery
@kubernetes/sig-api-machinery-bugs
/assign @liggitt
@liggitt can you please help triage this issue?
flakes show steady 2-3/day over the past week: https://storage.googleapis.com/k8s-gubernator/triage/index.html?sig=api-machinery&job=ci-kubernetes-e2e-gci-gce&test=Aggregator%20Should%20be%20able%20to%20support%20the%201.7%20Sample%20AP
@lavalamp which changes are you referring to? the line that's failing isn't doing any discovery, restmapping, etc, at all... just a straight get to the API:
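For context, a rough reconstruction of the kind of call involved (not the verbatim snippet from aggregator.go): the restClient variable, the poll interval/timeout, and the wardle path here are assumptions, and it is written against the pre-context client-go of that era.

```go
package apimachinery

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/rest"
	"k8s.io/kubernetes/test/e2e/framework"
)

// waitForWardle polls a plain GET through the aggregator to the wardle API
// until it succeeds; no discovery or RESTMapping is involved.
func waitForWardle(restClient rest.Interface) {
	err := wait.Poll(100*time.Millisecond, 30*time.Second, func() (bool, error) {
		_, getErr := restClient.Get().
			AbsPath("/apis/wardle.k8s.io/v1alpha1/namespaces/default/flunders").
			DoRaw()
		return getErr == nil, nil
	})
	if err != nil {
		framework.Failf("gave up waiting for apiservice wardle to come up successfully: %v", err)
	}
}
```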
audit log from the test run shows the requests are returning 404s, not 503s
hmm, there's a second smaller set of failures that are discovery related:
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:76
getting server preferred namespaces resources for dynamic client
Expected error:
<*discovery.ErrGroupDiscoveryFailed | 0xc421caabe0>: {
    Groups: {
        {
            Group: "mygroup.example.com",
            Version: "v1beta1",
        }: {
            ErrStatus: {
                TypeMeta: {Kind: "", APIVersion: ""},
                ListMeta: {SelfLink: "", ResourceVersion: "", Continue: ""},
                Status: "Failure",
                Message: "the server could not find the requested resource",
                Reason: "NotFound",
                Details: {
                    Name: "",
                    Group: "",
                    Kind: "",
                    UID: "",
                    Causes: [
                        {
                            Type: "UnexpectedServerResponse",
                            Message: "404 page not found",
                            Field: "",
                        },
                    ],
                    RetryAfterSeconds: 0,
                },
                Code: 404,
            },
        },
    },
}
unable to retrieve the complete list of server APIs: mygroup.example.com/v1beta1: the server could not find the requested resource
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:386
looks like those started flaking after 74b7cec which went in on 5/4
looks like some discovery paths had internal retries that could have masked those 404 errors before. since the error that test is encountering is for an unrelated CRD API that is being created/deleted as part of another test, it makes more sense to switch that check to just ensure the extension API group under test is discovered, rather than assert there are no errors looking for other API groups that we know are being dynamically added/removed
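A hedged sketch of what that relaxed check could look like (identifiers are illustrative, and this is not the exact code from #63624): tolerate discovery failures for unrelated groups that other tests add and remove dynamically, and only fail if the extension API group under test itself could not be discovered.

```go
package apimachinery

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

// checkWardleDiscovered ignores ErrGroupDiscoveryFailed entries for other API
// groups and only asserts that wardle.k8s.io itself was discovered cleanly.
func checkWardleDiscovered(clientset kubernetes.Interface) {
	_, err := clientset.Discovery().ServerPreferredNamespacedResources()
	if discoveryErr, ok := err.(*discovery.ErrGroupDiscoveryFailed); ok {
		for gv, gvErr := range discoveryErr.Groups {
			if gv.Group == "wardle.k8s.io" {
				framework.Failf("discovery of %v failed: %v", gv, gvErr)
			}
		}
	} else if err != nil {
		framework.Failf("getting server preferred namespaced resources: %v", err)
	}
}
```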
opened #63624 to deflake the aggregator e2e check
/assign @cheftako
The kubelet log is reporting that the sample-apiserver exited with a return code of 2, and it also shows a high restart count for the sample-apiserver. When I extract the sample-apiserver log, I'm seeing multiple "panic: Get https://10.0.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.0.0.1:443: getsockopt: connection refused"
Both failures today are in ipvs clusters, which seem to have extensive networking problems.
/assign @rramkumar1
@cheftako We are looking into finding the root cause of the problems with the ipvs clusters. I am pretty confident they were triggered by a bad PR, just need to identify which one.
Initial hypothesis is that PR #63585 caused it.
We think #63840 should fix the IPVS issues.
Closed #63622.
I see #63840 merged and this test is now passing in both pull-kubernetes-e2e-gce and ci-kubernetes-e2e-gci-gce. Hence closing the issue. Thanks all.
Reopened #63622.
@liggitt can you please take a look and triage? Thanks
/reopen
The test is failing for the same error again in the following jobs
It is one of the top flakes from last week, failing 47 jobs.
@rramkumar1 @liggitt this flake is now blocking this test from being promoted to the Conformance suite in 1.11. Is it something we can investigate and resolve soon (today or tomorrow) to increase the chances of getting this into 1.11? If not, this test has to wait until 1.12 to be promoted.
all I see in the test run logs is 404 errors coming back from that check. @cheftako can you see anything useful in the container logs for the aggregated server?
Update:
I have been trying to reproduce this failure on my own test cluster (go run hack/e2e.go ...) and it has passed 128 times in a row so far.
Looking through recent results, this crops up as a flake in presubmit and CI across many configurations... I don't think how we run the test jobs should be affecting this in the slightest. In fact, anything marked conformance really shouldn't be affected by job config (!) Not sure what's wrong here, but it looks like a pretty typical flaky test (some race / buggy setup/teardown?)
NB: go run hack/e2e.go ... may not map trivially to the same way the CI jobs run the tests, though, and e2e.go itself is a wrapper over kubetest.
@jennybuckley do the API server logs throw any light on what's happening on the CI test cluster?
The teardown for the aggregator test is wrong: it tries to delete the deployment "sample-apiserver" when the deployment is actually named "sample-apiserver-deployment", but I don't think that's the issue. It's been like that since the test was added. I'll look for conflicts with other tests.
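For reference, a minimal sketch of what the cleanup presumably intends, using the deployment name the test actually creates; this is written against the pre-context client-go of that era, and the function and namespace parameters here are illustrative rather than the test's actual code.

```go
package apimachinery

import (
	apierrs "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

// cleanupSampleAPIServer deletes the deployment under the name the test
// actually created it with ("sample-apiserver-deployment", not "sample-apiserver").
func cleanupSampleAPIServer(client kubernetes.Interface, namespace string) {
	err := client.AppsV1().Deployments(namespace).Delete("sample-apiserver-deployment", &metav1.DeleteOptions{})
	if err != nil && !apierrs.IsNotFound(err) {
		framework.Failf("deleting sample-apiserver deployment: %v", err)
	}
}
```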
The namespace controller polls each of the api groups periodically, and it gets this error from the /apis/wardle.k8s.io/v1alpha1 endpoint a couple of times. Maybe related?
E0531 15:23:19.095075 1 namespace_controller.go:148] unable to retrieve the
complete list of server APIs: wardle.k8s.io/v1alpha1: an error on the server
("Internal Server Error: \"/apis/wardle.k8s.io/v1alpha1?timeout=32s\":
subjectaccessreviews.authorization.k8s.io is forbidden: User
\"system:serviceaccount:e2e-tests-aggregator-97cps:default\" cannot create
subjectaccessreviews.authorization.k8s.io at the cluster scope")
has prevented the request from succeeding
unclear... I opened #64587 to capture the state of the APIService, extension server pods, and those pod logs in failures cases
All I'm seeing is
webhook.go:185] Failed to make webhook authorizer request: subjectaccessreviews.authorization.k8s.io is forbidden: User "system:serviceaccount:e2e-tests-aggregator-cfck8:default" cannot create subjectaccessreviews.authorization.k8s.io at the cluster scope
I don't think that's related... I see 10-30 of those same rejections in apiserver logs of successful test runs (for example, search https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/64641/pull-kubernetes-e2e-gce/39140/artifacts/e2e-39140-674b9-master/kube-apiserver.log for 403 [[sample-apiserver/v0.0.0). I don't see anything in the test setup that grants the sample apiserver permission to run those checks, so it doesn't seem like that would be intermittent or flaky.
Reopened #63622.
@fedebongio @cheftako it looks like the 1.7 API Server image is causing more flakes than just this test - #64450 (comment)
The test has become a lot less flaky after increasing the timeout.
Since the official policy for GKE is 2 previous versions, and 1.11 is almost baked, this is the plan discussed with @fedebongio @cheftako offline:
@liggitt let us know what you think of this approach
it looks like the 1.7 API Server image is causing more flakes than just this test - #64450 (comment)
I think you misunderstood my comment... that only related to kubectl being used against a 1.7-level kube-apiserver serving discovery docs for the v1 API, not a generic 1.7-level extension API server
unrelatedly, moving the test to 1.9-level code once 1.11 is released seems reasonable to me
Thank you Jordan, that's the current plan! We will actually create a parallel 1.9-level test, to see if, between 1.7 and 1.9, the internals of the API Server fixed what's generating the flakiness, and decide next steps from there.
The test seems to have been green for the past two weeks. Wonder if it's fixed / mitigated?
pull-kubernetes-e2e-gce
ci-kubernetes-e2e-gci-gce
Pinging to see what the next steps are. Who is taking point to create a parallel test?
Last I chatted with @fedebongio, someone on his team will build the 1.9 API server image and push it to GCR. Once that image is available, @mgdevstack can help write the parallel test. I am not sure if the image step is done.
I'd like to know where we are on this, because the 1.7 thing has come up in the context of multi-arch image builds
Hi Aaron and Aish, sorry for the delay. The process to compile, create, and upload the sample API Server was not properly documented or maintained, and that, plus other more pressing things, is what made it take so long. We are trying to finally have it back (and documented) by the end of this week. Will keep you posted.
Image created with all the refactoring that happened in that area; currently trying to upgrade the test from 1.7 to 1.9.
Update: Upgraded the test, and now the test is not passing... working on this.
• Failure [38.067 seconds]
[sig-api-machinery] Aggregator
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/framework.go:22
Should be able to support the 1.9 Sample API Server using the current Aggregator [It]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:76
creating a new flunders resource
Expected error:
<*errors.StatusError | 0xc420c90cf0>: {
    ErrStatus: {
        TypeMeta: {Kind: "", APIVersion: ""},
        ListMeta: {SelfLink: "", ResourceVersion: "", Continue: ""},
        Status: "Failure",
        Message: "flunders.wardle.k8s.io \"rest-flunder-573651905\" is forbidden: not yet ready to handle request",
        Reason: "Forbidden",
        Details: {
            Name: "rest-flunder-573651905",
            Group: "wardle.k8s.io",
            Kind: "flunders",
            UID: "",
            Causes: nil,
            RetryAfterSeconds: 0,
        },
        Code: 403,
    },
}
flunders.wardle.k8s.io "rest-flunder-573651905" is forbidden: not yet ready to handle request
not to have occurred
not sure if it's related, but I don't think the test currently gives sufficient permissions to the aggregated server to perform delegated authn/authz checks. we might want to look at merging #64993
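For reference, a minimal sketch of the kind of grants delegated authn/authz needs (roughly the spirit of #64993, not its exact contents): bind the extension server's service account to system:auth-delegator and to the extension-apiserver-authentication-reader role in kube-system. This is written against the pre-context client-go of that era; the binding names and the assumption that the server runs as the namespace's default service account are illustrative.

```go
package apimachinery

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// grantDelegatedAuth lets the aggregated server create TokenReviews and
// SubjectAccessReviews, and read the extension-apiserver-authentication
// ConfigMap in kube-system.
func grantDelegatedAuth(client kubernetes.Interface, namespace string) error {
	sa := rbacv1.Subject{Kind: "ServiceAccount", Name: "default", Namespace: namespace}

	_, err := client.RbacV1().ClusterRoleBindings().Create(&rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "wardler:system:auth-delegator"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "system:auth-delegator",
		},
		Subjects: []rbacv1.Subject{sa},
	})
	if err != nil {
		return err
	}

	_, err = client.RbacV1().RoleBindings("kube-system").Create(&rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "wardler-auth-reader"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "Role",
			Name:     "extension-apiserver-authentication-reader",
		},
		Subjects: []rbacv1.Subject{sa},
	})
	return err
}
```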
@liggitt tried your fix, same result. The 1.9-based sample-apiserver image I've created is uploaded to gcr.io. If someone wants to give it a try, simply upgrade 1.0 to 1.1 here: https://github.com/kubernetes/kubernetes/blob/master/test/utils/image/manifest.go#L51
/assign @mgdevstack
Can you try to figure out why the test is failing w/ the new image?
/cc @yliaog
Now waiting on #69239
Is there an ETA on #69239? This test is flaking quite a bit of late - https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke
#68300 upgrades to 1.10. However, it requires more rbac permissions than before, which is not fully understood. We will either resolve the permission issue or leave it for later investigation. The upgrade to 1.10 should be merged in a couple of days, if not today.
Yes, it is relevant. It does not fix the issue by itself though.
As for #69239, it should be ready to be merged.
/reopen
I would like some manual confirmation this has been addressed
Reopened #63622.
@spiffxp: Reopening this issue.
In response to this:
/reopen
I would like some manual confirmation this has been addressed
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
after some in-depth investigations, it turns out the 1.10 sample-apiserver would need the following rbac permissions (in addition to the system:auth-delegator role). I'm updating the PR to add these.
=====================================================
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sample-apiserver
rules:
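The rules block above is cut off in the quoted comment. Purely for illustration (these are assumptions about what an aggregated API server typically needs to read, not necessarily the exact rules that PR added), here is the same idea expressed the way the e2e test builds RBAC objects in Go:

```go
package apimachinery

import rbacv1 "k8s.io/api/rbac/v1"

// sampleAPIServerRules is illustrative only: typical read access an aggregated
// API server needs (namespaces for the namespace-lifecycle admission plugin,
// plus admission webhook configurations). The exact rules for the 1.10
// sample-apiserver may differ.
var sampleAPIServerRules = []rbacv1.PolicyRule{
	{
		APIGroups: []string{""},
		Resources: []string{"namespaces"},
		Verbs:     []string{"get", "list", "watch"},
	},
	{
		APIGroups: []string{"admissionregistration.k8s.io"},
		Resources: []string{"mutatingwebhookconfigurations", "validatingwebhookconfigurations"},
		Verbs:     []string{"get", "list", "watch"},
	},
}
```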
This test is failing again.
Can someone explain why we're testing against a 1.7 API server? 1.7 has been unsupported for a year, and we don't promise backwards compatibility back that far AFAIK. Shouldn't we be testing against a 1.9 or 1.10 API server?
Hi Josh, TL;DR: we are upgrading to 1.10, and that's what people are talking about in the previous comments.
@AishSundar: Reopening this issue.
In response to this:
/reopen
I would like to see the test pass with the #68300 fix before closing this issue out. thanks
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Reopened #63622.
Hi @AishSundar, please notice the test was renamed to: [sig-api-machinery] Aggregator Should be able to support the 1.10 Sample API Server using the current Aggregator, and it has been passing since it merged:
https://k8s-testgrid.appspot.com/sig-api-machinery-gce-gke#gce
Ah cool, thanks @fedebongio, that explains it :) I was looking at this dashboard in release-master-blocking, and it looks like it hasn't picked up the renamed test since its 10/13 run.
Closing this issue now based on the api-machinery dashboard. Will reopen if the release job doesn't turn green on the next run. Thanks much.
Closed #63622.
@AishSundar @fedebongio Is it worth removing the copy of this test that lives in sig-release-master-blocking#gce-master-scale-correctness? Or is it still a release-blocking test?
@mariantalla I think the job will automatically pick up the new version of the test in its next run (which should be sometime today, 10/20?). As to whether the test should be in the scale-correctness job, that is a question for @fedebongio and @wojtek-t.