Thanks Tim, I agree that it's expected behavior. I didn't mean to imply that the Hazelcast map backup configuration was a Vert.x issue; we have been working to address the various issues in our specific type of environment. Over the last few weeks we have been learning and adjusting our Hazelcast configuration, our deployment process, and our deployment tools. We reached out to Hazelcast support regarding the exception above; their response took a few days (just one of the various Hazelcast issues we have seen):
"It appears to be caused by Vert.x calling back into Hazelcast on the Hazelcast event thread. When a node leaves the cluster, Vert.x is notified and on the same thread queries the Hazelcast IMap it uses to keep track of its cluster. This likely prevents the event from propagating properly, or some deadlock situation occurs where the retries are exhausted."
I am not sure if this is helpful to you or not, but we haven't seen the exception for some time now, due to some adjustments we have made. I added the ability for our verticles to auto-restart if they notice that the registered handler count stays low for a specified period of time. This effectively self-heals our cluster. We are anxious to try the ZooKeeper cluster manager once it's available; we think it will be better suited for our environment. We use Rancher to deploy and manage our cluster. Issues we are currently looking into in our environment:
1) During a verticle upgrade, we deploy a new Docker container, allow it to run for 20 seconds, and then take down the old verticle's Docker container. Once in a while we will still see a handler or two disappear from the map. In this scenario we signal the container to shut down gracefully, and we have ensured that the map is backed up on more than one other node. We suspect there may be a race condition in which Hazelcast is not given enough time to shut down properly.
2) We have seen Hazelcast members drop off when a particular host is temporarily strained by heavy use. Configuration changes can help in this scenario, but it's difficult to balance two cases: a node has truly failed and we want the member removed quickly (trying to avoid as many issues with the event bus as possible), versus the system will recover on its own and we just need to give it more time.
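For anyone experimenting with the same trade-offs, this is roughly the shape of the Hazelcast config we have been tuning. The specific values here are illustrative, not a recommendation; `__vertx.subs` is the subscription map name from the default Vert.x cluster.xml, and the heartbeat properties are standard Hazelcast system properties:

```xml
<hazelcast>
  <properties>
    <!-- How often members heartbeat each other -->
    <property name="hazelcast.heartbeat.interval.seconds">1</property>
    <!-- How long a member may miss heartbeats before being removed.
         Raising this tolerates temporary host strain (issue 2) at the
         cost of slower detection of genuinely failed nodes. -->
    <property name="hazelcast.max.no.heartbeat.seconds">60</property>
  </properties>
  <map name="__vertx.subs">
    <!-- Keep synchronous copies of the event-bus subscription map
         on more than one other node (issue 1) -->
    <backup-count>2</backup-count>
  </map>
</hazelcast>
```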
I am just sharing these in case others have seen similar issues and want to share what they did. It has also been interesting to think about how the Vert.x event bus could be made even more robust in some of these scenarios.
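In case it's useful, the self-heal check I mentioned looks roughly like this. This is a simplified sketch in plain Java with the Vert.x specifics abstracted behind a count supplier; in our actual verticle the supplier queries the handler registrations, and "restart requested" triggers a redeploy. All names are illustrative:

```java
import java.util.function.IntSupplier;

// Sketch of a watchdog that requests a restart when the observed handler
// count stays below the expected count for a sustained grace period.
public class HandlerWatchdog {
    private final IntSupplier handlerCount;  // e.g. query the registered handlers
    private final int expected;
    private final long gracePeriodMs;
    private long belowSince = -1;            // -1 means "count currently healthy"
    private volatile boolean restartRequested = false;

    public HandlerWatchdog(IntSupplier handlerCount, int expected, long gracePeriodMs) {
        this.handlerCount = handlerCount;
        this.expected = expected;
        this.gracePeriodMs = gracePeriodMs;
    }

    // Call periodically, e.g. from a Vert.x periodic timer.
    public void check(long nowMs) {
        if (handlerCount.getAsInt() < expected) {
            if (belowSince < 0) {
                belowSince = nowMs;          // count just dropped; start the clock
            } else if (nowMs - belowSince >= gracePeriodMs) {
                restartRequested = true;     // low for the whole grace period
            }
        } else {
            belowSince = -1;                 // count recovered; reset the clock
        }
    }

    public boolean restartRequested() {
        return restartRequested;
    }
}
```

The grace period matters: a handler count can dip briefly during a rolling upgrade, so restarting only after a sustained drop avoids restarting healthy verticles.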