HNC Controller Manager restarted and roles and role bindings got recreated


Geo P.C.

May 18, 2023, 4:38:19 PM
to kubernetes-wg-multitenancy, Adrian Ludwin, Yiqi Gao

Hi Team


We are using hnc-manager:v1.0.0 on our clusters, including production. Over the last two weeks we have seen the hnc-controller-manager pod restart, after which all roles and role bindings are recreated. This causes issues for our services, and we are seeing a lot of errors related to roles and role bindings:


Failure","message":"endpoints is forbidden: User \"system:serviceaccount:prod-esl-fetcher:app-read-access\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"prod-esl-fetcher\": RBAC: role.rbac.authorization.k8s.io \"app-read-access\" not found\nAzure does not have opinion for this user.","reason":"Forbidden","details":{"kind":"endpoints"},"code":403},"requestReceivedTimestamp":"2023-05-09T00:26:18.736318Z","stageTimestamp":"2023-05-09T00:26:18.739488Z","annotations":{"authentication.k8s.io/stale-token":"subject: system:serviceaccount:prod-esl-fetcher:app-read-access, seconds after warning threshold: 414224","authorization.k8s.io/decision":"forbid","authorization.k8s.io/reason":"RBAC: role.rbac.authorization.k8s.io \"app-read-access\" not found\nAzure does not have opinion for this user."}}


While checking the hnc-controller-manager pod, we can see a lot of restarts; the notable errors from its logs are below:



NAME                                      READY   STATUS    RESTARTS          AGE

hnc-controller-manager-77568694b9-b9rpf   1/1     Running   685 (7d23h ago)   44d


Errors:


2023-05-02 11:25:18 INFO Container manager failed liveness probe, will be restarted

2023-05-02 11:25:18 WARN Readiness probe failed: Get "http://172.19.129.89:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

2023-05-02 11:25:18 WARN Liveness probe failed: Get "http://172.19.129.89:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)



2023-05-02 11:25:19  ERROR panic: runtime error: invalid memory address or nil pointer dereference



2023-05-02 11:25:17  ERROR {"level":"error","ts":1683051917.4406927,"msg":"http: TLS handshake error from 172.19.129.48:39036: EOF"}


We see these errors continuously, not only at the time of the incident.



I have attached the log for your reference. Could you please take a look and help us find the root cause of this issue? As this is production, it is very critical for us.



Thanks & Regards
Geo P.C.


geo-logresults-2023-05-18-12_09_51.csv

Geo P.C.

May 19, 2023, 4:26:23 PM
to kubernetes-wg-multitenancy, Adrian Ludwin, Yiqi Gao, tasha...@gmail.com
Any help on this, team?




Thanks & Regards
Geo P.C.


Adrian Ludwin

May 26, 2023, 5:23:27 PM
to Geo P.C., kubernetes-wg-multitenancy, tasha...@gmail.com, Ryan Bezdicek
Hey, sorry, bit of maintainer burnout here but trying to catch up now.

The nil pointer exception seems like a red herring - it only happens after HNC has decided to shut down. But that log doesn't tell me why it's decided to shut down. What's at 172.19.129.48:39036 - is that HNC itself, or something else? Do you have any hints as to what might have changed in the last two weeks?

This is a bit of a shot in the dark, but have you tried increasing its RAM/CPU allocation to see if it's getting killed by K8s itself?

Thanks, A

Adrian Ludwin

May 26, 2023, 5:45:17 PM
to Geo P.C., kubernetes-wg-multitenancy, tasha...@gmail.com, Ryan Bezdicek

Geo P.C.

May 26, 2023, 7:37:19 PM
to Adrian Ludwin, kubernetes-wg-multitenancy, tasha...@gmail.com, Ryan Bezdicek
Thanks for the update. We already did this once we noticed the issue: we increased the limits and also changed minReplicas to 2. Even after that, the hnc-controller-manager in the cluster got restarted.


      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 150Mi


geopc@SFO ~ % kubectl get pods -n hnc-system

NAME                                      READY   STATUS    RESTARTS        AGE
hnc-controller-manager-84c666fdfb-d4vns   1/1     Running   1 (3d19h ago)   8d
hnc-controller-manager-84c666fdfb-l9djp   1/1     Running   0               8d


One thing we noted is that we have been facing this issue since the HNC upgrade. A few months ago we upgraded HNC from v0.9.0 to v1.0.0:

Changed image: gcr.io/k8s-staging-multitenancy/hnc-manager:v0.9.0 to image: gcr.io/k8s-staging-multitenancy/hnc-manager:v1.0.0
Upgraded controller-gen.kubebuilder.io/version: v0.7.0 to controller-gen.kubebuilder.io/version: v0.8.0
The new version also adds a readinessProbe and livenessProbe (a sketch is below). In one cluster we are still using v0.9.0 and we are not seeing any issue there, even though the resource usage on that cluster is comparatively high.
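
For reference, the added probes look roughly like this; the paths and port match the probe failures in the log above, but the timing values here are illustrative assumptions, not necessarily what the manifest ships:

      livenessProbe:
        httpGet:
          path: /healthz   # same endpoint as the failed liveness probe in the log
          port: 8081
        initialDelaySeconds: 15
        periodSeconds: 20
        timeoutSeconds: 1  # assumed; a probe timing out produces "context deadline exceeded"
      readinessProbe:
        httpGet:
          path: /readyz    # same endpoint as the failed readiness probe in the log
          port: 8081
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 1  # assumed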

Thanks & Regards
Geo P.C.


Adrian Ludwin

May 29, 2023, 9:50:07 AM
to Geo P.C., kubernetes-wg-multitenancy, Ryan Bezdicek
(Tasha to BCC)

Sorry, I'm not sure I understand. Are you still seeing the issue? If so, try raising both the request and the limit much higher - in the other case, they raised the request/limit to 1 CPU (not the 0.1-0.5 CPU you've set) and raised the RAM much higher as well. Try raising them to a very high value and see if the problems go away.
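
For example, something like this - the exact numbers are just a starting point to rule out resource starvation, not a recommendation:

      resources:
        limits:
          cpu: 1000m      # ~1 CPU, matching what was used in the other case
          memory: 1Gi     # illustrative; set well above current usage
        requests:
          cpu: 1000m
          memory: 512Mi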

Also, please leave minReplicas at 1 - each HNC pod tries to control the entire cluster, so having two of them will only make them conflict with each other and will not add any reliability. If you want a high-availability solution, try out the HA deployment instead, which puts the webhooks in a separate deployment but still runs only a single pod for the controller. However, this doesn't replace raising the requests/limits - you'll still need to do that as well.
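
For instance, something like this (deployment and namespace names as in your kubectl output above; if the replica count comes from an HPA, set its minReplicas back to 1 instead):

# Scale the controller back to a single pod so two managers don't fight over the same objects
kubectl scale deployment hnc-controller-manager -n hnc-system --replicas=1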

Thanks, A