golang operator leader election failures

Tim Kaczynski

Oct 5, 2021, 9:39:00 AM
to Operator Framework
Hi everyone,

We have noticed on our primary cloud, where we deploy OCP, that the controller-manager containers for recent versions of operator-sdk fail and restart frequently.  The failures and restarts coincide with a predictable, recurring period of resource shortage that we are working to resolve.  The failure is in leader election: during the resource shortage the leader election client can't reach the Kube API server, and that failure appears to cause the container to exit.  From the controller-manager pod log:

E1004 09:00:15.139088 1 leaderelection.go:325] error retrieving resource lock xxxxxx/fc7f2af9.xxx.com: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/xxxxx/leases/fc7f2af9.xxx.com": context deadline exceeded
I1004 09:00:15.139260 1 leaderelection.go:278] failed to renew lease xxxxx/fc7f2af9.xxx.com: timed out waiting for the condition
2021-10-04T09:00:15.139Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"xxxxx","name":"fc7f2af9.xxx.com","uid":"7b762290-1817-4de8-abb5-492648d5ab37","apiVersion":"v1","resourceVersion":"425945143"}, "reason": "LeaderElection", "message": "ir-lifecycle-operator-controller-manager-6cb887545-dlr9p_a16af305-0fc6-4788-aeac-39818642accd stopped leading"}
2021-10-04T09:00:15.139Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Lease","apiVersion":"coordination.k8s.io/v1"}, "reason": "LeaderElection", "message": "ir-lifecycle-operator-controller-manager-6cb887545-dlr9p_a16af305-0fc6-4788-aeac-39818642accd stopped leading"}
2021-10-04T09:00:15.139Z ERROR setup problem running manager {"error": "leader election lost"}

We're wondering if anyone else has noticed this behavior.  We'd like to change the way leader election works so that if a Kube API call fails, it is retried a certain number of times before an error is returned and the container fails.  There are some knobs and switches to set, but they seem to mostly control the interval between checks and how long to wait for a response, not what to do when a response returns an error.

Does anyone have a suggestion about how we can make this more robust?  The failed containers by themselves are not a big deal since they restart, but they make it difficult for us to test for other conditions or do any kind of long-run testing, since any other potential problems are masked by the leader election restarts.  We'd rather not turn off leader election, as that doesn't seem like the correct solution (it just introduces a different set of problems).

Thanks!
-Tim

Frederic Giloux

Oct 8, 2021, 3:49:51 AM
to Tim Kaczynski, Operator Framework
Hi Tim

It seems that this is the expected behaviour. When leader election fails, the controller cannot renew its lease, and by design the controller is restarted to ensure that only a single controller is active at a time.
The knobs you referenced, especially leaseDuration and renewDeadline (renewDeadline is how long the acting leader keeps retrying the renewal before giving up), are configurable in controller-runtime [1].
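
Something along these lines in the operator's main.go would adjust them. This is only a minimal sketch, assuming a recent controller-runtime where these settings are exposed as manager options; the durations are placeholders rather than recommendations:

package main

import (
	"log"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Placeholder values: longer durations tolerate temporary API server
	// slowness at the cost of a slower failover to a new leader.
	leaseDuration := 60 * time.Second // how long a lease is valid before candidates may take over
	renewDeadline := 45 * time.Second // how long the current leader keeps retrying the renewal
	retryPeriod := 10 * time.Second   // interval between individual renewal attempts

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "fc7f2af9.xxx.com", // the ID your operator already uses
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Register controllers as usual, then:
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}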

Another approach you may consider is to leverage API Priority & Fairness [2] to increase the chances that the API calls made by your controller succeed, provided the controller itself is not the origin of the API overload.


I hope this helps.

Regards,

Frédéric


--
Frédéric Giloux
OpenShift Engineering
Red Hat Germany

fgi...@redhat.com

Michael Hrivnak

Oct 8, 2021, 11:43:59 AM
to Frederic Giloux, Tim Kaczynski, Operator Framework
If the lease renewal process is giving up its lease just because it got an error making a request, I think that would be a bug. It should be retrying until the renewal deadline is reached.


If a non-responsive API service is a scenario you need to accommodate, there are two options:

1. You can adjust the configuration already cited to increase the lease duration and renewal deadline. Much longer durations are fine as long as you can accept the latency of recovery, discussed below.

2. You can use the other leader election implementation, "leader for life". With that implementation, once a pod becomes the leader, it is presumed to be the leader for the rest of its existence. There is no further action required to renew or maintain the lease. It was created in part for scenarios where actively renewing leases is undesirable.
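
For option 2, here is a rough sketch of the wiring with the operator-lib leader package, assuming the POD_NAME environment variable is set via the downward API (the package needs it to find its own pod) and using a placeholder lock name. The call blocks until this pod owns the lock, and the manager's lease-based election is then disabled:

package main

import (
	"log"

	"github.com/operator-framework/operator-lib/leader"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	ctx := ctrl.SetupSignalHandler()

	// Become the leader before starting the manager. The lock is held for
	// the lifetime of this pod, so there is no renewal loop to fail.
	if err := leader.Become(ctx, "xxxxx-operator-lock"); err != nil {
		log.Fatal(err)
	}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection: false, // leader-for-life replaces the lease-based election
	})
	if err != nil {
		log.Fatal(err)
	}

	if err := mgr.Start(ctx); err != nil {
		log.Fatal(err)
	}
}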


Either way, of course, the consideration is time-to-recovery. If your controller stops for any reason, such as its node failing (network failure, power failure, kernel panic, etc.), the election process determines how another instance of the controller becomes the new leader. With your current approach, a candidate waits until the existing lease exceeds the lease duration, and then a new election is forced. In the leader-for-life approach, a new election happens when the failed pod gets deleted. In the case of node failure, leader-for-life has logic to accelerate deletion of the pod and lease when the node reaches a non-recoverable state.



--

Michael Hrivnak

Senior Principal Software Engineer, RHCE

Red Hat
