Hi,
We've been using the LeaderLatch recipe for a couple of months, and we've noticed some abnormal behavior that leaves extra nodes behind in ZooKeeper. I think I've narrowed down the cause, but I wanted to check whether this is a real bug.
This looks different from the bug fixed a few weeks ago involving double calls to reset() while the connection is down; I've tried with that fix applied and I still see this behavior.
Here's how it happens:
1) I have two processes contending for a LeaderLatch, ProcessA and ProcessB. ProcessA is the leader.
2) ProcessA loses leadership somehow (it releases the latch, its connection goes down, etc.).
3) This fires ProcessB's watch. The watch handler checks that the latch state is still STARTED and, if so, re-evaluates whether the LeaderLatch is now the leader.
4) While the watch handler is running, close() is called on ProcessB's LeaderLatch. This sets the latch state to CLOSED, removes its znode from ZooKeeper, and shuts the latch down.
5) The watch handler has already checked that the state is STARTED, so it does a getChildren() on the latch path and finds that the latch's own znode is missing. It goes ahead and calls reset(), which creates a new znode in ZooKeeper.
Result: the LeaderLatch is closed, but an orphaned znode remains in ZooKeeper. It isn't associated with any LeaderLatch and won't go away until the session ends, so subsequent LeaderLatches at this path can never acquire leadership while that session is alive.
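In case it's useful, here's a minimal sketch of how I'd try to reproduce it from the public API (the connect string and latch path are placeholders, and LeaderLatchOrphanRepro is just a name I made up). It's timing-dependent: ProcessB's close() has to land between the watch handler's state check and its reset() call, so it may need to run in a loop to actually hit the race.

import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchOrphanRepro {
    public static void main(String[] args) throws Exception {
        String latchPath = "/test/leader-latch";   // placeholder path

        // Two clients to stand in for ProcessA and ProcessB (separate sessions).
        CuratorFramework clientA = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        CuratorFramework clientB = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        clientA.start();
        clientB.start();

        LeaderLatch latchA = new LeaderLatch(clientA, latchPath);
        LeaderLatch latchB = new LeaderLatch(clientB, latchPath);
        latchA.start();
        latchA.await();          // ProcessA becomes leader
        latchB.start();          // ProcessB joins and waits
        Thread.sleep(500);       // let ProcessB create its znode and set its watch

        latchA.close();          // step 2: fires ProcessB's watch
        latchB.close();          // step 4: races with ProcessB's watch handler

        // Give the watch handler time to run, then look for the orphan.
        Thread.sleep(1000);
        List<String> children = clientB.getChildren().forPath(latchPath);
        System.out.println("children after both latches closed: " + children);
        // Expected: []. When the race is lost: one znode owned by clientB's
        // session that no LeaderLatch knows about.

        clientB.close();
        clientA.close();
    }
}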
Thanks,
Jono