Hi there,
I am using the Curator leader election recipe. The number of leader candidates in my system is low (from 2 to 30, depending on system size). One of the tests I am running is a rolling restart of the ZK ensemble.
The current approach in our implementation is to relinquish leadership when the connection is lost or even suspended (it is not clear to me what to do when the connection is suspended).
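To make that handling concrete, here is a minimal sketch of the decision applied on each connection state change. It is illustrative only and not our actual code: `LeadershipPolicy` and its local `State` enum are hypothetical stand-ins (a real listener would receive Curator's `ConnectionState`), and the sketch leaves out how the leadership method is actually interrupted.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper for illustration; this is not part of Curator.
class LeadershipPolicy {
    // Local mirror of the connection states mentioned in this mail.
    // Curator's own enum is org.apache.curator.framework.state.ConnectionState.
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // States on which the current leader gives up leadership.
    private static final Set<State> RELINQUISH_ON =
            EnumSet.of(State.SUSPENDED, State.LOST);

    // Returns true when the leader should relinquish on this state change.
    static boolean shouldRelinquish(State newState) {
        return RELINQUISH_ON.contains(newState);
    }

    public static void main(String[] args) {
        // Print the decision for every state.
        for (State s : State.values()) {
            System.out.println(s + " -> relinquish=" + shouldRelinquish(s));
        }
    }
}
```

With this policy a leader that only sees SUSPENDED (and later RECONNECTED) has already given up leadership, which is the behaviour I am unsure about.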
In some cases, all the leader candidates can be connected to the same ZK server of the ensemble, and the sequence that I observe when performing the rolling restart of the ZK ensemble is the following:
- When the server that the leader and all the candidates are connected to is shut down, the leader relinquishes leadership (I relinquish leadership when the connection is SUSPENDED).
- The leader candidates are RECONNECTED after some time.
- BUT leader election is not as fast as usual from this point. It takes around 6 secs (usually leader election takes around 1 sec).
I have several questions related to this issue:
- What is the proper error handling when the leader's connection is SUSPENDED? Right now I relinquish leadership, but if I decide not to exit from the takeLeadership method, what is the proper way to handle the situation when the connection becomes RECONNECTED? Is there a way in the ConnectionStateListener to know whether another leader has been elected?
- What could cause leader election to become slower in the case I described above? I suspect it is related to an issue with the expiration of the leader election ephemeral nodes (I am currently testing with sessionTimeout and connectionTimeout of 4,5 secs).
Thanks in advance,
Evaristo