Hi Team,
We are using hnc-manager:v1.0.0 in our clusters, including production. Over the last two weeks the hnc-controller-manager pod has restarted repeatedly, and each time all roles and role bindings are recreated. This causes issues for our services, and we are seeing many errors related to roles and role bindings:
Failure","message":"endpoints is forbidden: User \"system:serviceaccount:prod-esl-fetcher:app-read-access\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"prod-esl-fetcher\": RBAC: role.rbac.authorization.k8s.io \"app-read-access\" not found\nAzure does not have opinion for this user.","reason":"Forbidden","details":{"kind":"endpoints"},"code":403},"requestReceivedTimestamp":"2023-05-09T00:26:18.736318Z","stageTimestamp":"2023-05-09T00:26:18.739488Z","annotations":{"authentication.k8s.io/stale-token":"subject: system:serviceaccount:prod-esl-fetcher:app-read-access, seconds after warning threshold: 414224","authorization.k8s.io/decision":"forbid","authorization.k8s.io/reason":"RBAC: role.rbac.authorization.k8s.io \"app-read-access\" not found\nAzure does not have opinion for this user."}}
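For reference, the missing-role condition from the error above can be confirmed with commands like the following (role, namespace, and service account names are taken from the error message; this assumes kubectl access to the affected cluster):

```shell
# Check whether the Role referenced by the binding still exists
# (names taken from the error message above).
kubectl get role app-read-access -n prod-esl-fetcher

# List all Roles and RoleBindings in the namespace to see what
# HNC has (re)created after the controller restart.
kubectl get roles,rolebindings -n prod-esl-fetcher

# Verify what the service account is currently allowed to do.
kubectl auth can-i list endpoints \
  --as=system:serviceaccount:prod-esl-fetcher:app-read-access \
  -n prod-esl-fetcher
```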
While checking the hnc-controller-manager pod, we can see a large number of restarts. Please see the notable errors below:
NAME READY STATUS RESTARTS AGE
hnc-controller-manager-77568694b9-b9rpf 1/1 Running 685 (7d23h ago) 44d
Errors:
2023-05-02 11:25:18 INFO Container manager failed liveness probe, will be restarted
2023-05-02 11:25:18 WARN Readiness probe failed: Get "http://172.19.129.89:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-05-02 11:25:18 WARN Liveness probe failed: Get "http://172.19.129.89:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-05-02 11:25:19 ERROR panic: runtime error: invalid memory address or nil pointer dereference
2023-05-02 11:25:17 ERROR {"level":"error","ts":1683051917.4406927,"msg":"http: TLS handshake error from 172.19.129.48:39036: EOF"}
We see these types of errors continuously, not only at the time of the incident but on an ongoing basis.
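To aid root-cause analysis, the full panic stack trace from the crashed container instance can be captured with something like the commands below (the "hnc-system" namespace is an assumption based on a typical HNC install; the pod name is taken from the listing above):

```shell
# Retrieve the logs of the previous (crashed) container instance
# to capture the complete panic stack trace.
kubectl logs hnc-controller-manager-77568694b9-b9rpf \
  -n hnc-system --previous

# Inspect restart events and liveness/readiness probe failures.
kubectl describe pod hnc-controller-manager-77568694b9-b9rpf \
  -n hnc-system
```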
The log is attached for your reference. Could you please take a look and help us identify the root cause of this issue? As this affects production, it is very critical for us.