Serious runtime issues with clustered Vert.x Eventbus & Docker/Swarm

David Hoffer

Jun 20, 2018, 7:07:36 PM
to vert.x
Hey folks, we are running into a serious issue with Vert.x in our Docker deployment environment and need some help finding a solution.

To start, I want to mention that I am not the person who has been doing this deployment testing, just the messenger; I learned of this issue today. I have been focused mostly on Vert.x verticle implementation issues, so if something in here does not quite make sense, I can get further clarification.

Our application consists of several verticles, each deployed as a Docker image. We use Zookeeper as the Vert.x cluster manager, and the Docker nodes are managed by Swarm.

The Deployment Issue:

The Vert.X event bus is a non-blocking peer-to-peer network communication capability, the so-called "nervous system" of Vert.X. The event bus can be clustered to allow verticles in different JVMs and/or hosts to communicate. To cluster the event bus, a cluster manager (e.g. Zookeeper) is informed at process startup of the host/port that the Vert.X instance (JVM) will use for event bus communication. When the Vert.X instance is no longer viable, the instance needs to be closed (Vertx.close()) to deregister the event bus host/port.


A question on the Docker forum, unanswered since Jan 29, 2018:

Is docker service rm performing a graceful shutdown on nodes?

https://forums.docker.com/t/is-docker-service-rm-performing-a-graceful-shutdown-on-nodes/45380


From testing, it appears that when Docker Swarm removes a service, the running container is not stopped gracefully. Docker is likely using kill -9 (SIGKILL), which can't be caught. A Java shutdown hook is not called when the Docker Swarm service is removed. There appears to be no way for Vert.X instances running in a Swarm to deregister when they are shut down. When a Swarm service is removed and recreated, or updated, the other Vert.X nodes still try to communicate with the old, no longer existing Vert.X instance.
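
[Editor's sketch, not part of the original report.] A JVM runs its shutdown hooks when it receives SIGTERM or SIGINT, but never on SIGKILL, so whether a hook fires depends on which signal actually reaches the JVM. The following minimal Java sketch registers a hook that waits, with a timeout, for an asynchronous close to finish. The class name GracefulShutdown, its method names, and the timeout are hypothetical; the asynchronous close is passed in as a callback so the example needs only the JDK. With Vert.x 3, the callback could plausibly wrap vertx.close(...), which takes a completion handler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: blocks JVM shutdown until an async close completes
// or a timeout expires. Not part of Vert.x or Docker.
final class GracefulShutdown {

    // Builds the hook thread. asyncClose receives a Runnable that it must
    // invoke once the close has finished.
    static Thread hookFor(Consumer<Runnable> asyncClose, long timeoutMillis) {
        return new Thread(() -> {
            CountDownLatch done = new CountDownLatch(1);
            asyncClose.accept(done::countDown);
            try {
                if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    System.err.println("Close did not finish within " + timeoutMillis + " ms");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "graceful-shutdown");
    }

    // Registers the hook with the JVM. It runs on SIGTERM/SIGINT or normal
    // exit, but not on SIGKILL.
    static void install(Consumer<Runnable> asyncClose, long timeoutMillis) {
        Runtime.getRuntime().addShutdownHook(hookFor(asyncClose, timeoutMillis));
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed Vert.x usage: install(onDone -> vertx.close(ar -> onDone.run()), 5000);
        install(onDone -> {
            System.out.println("Deregistering from the cluster...");
            onDone.run();
        }, 5000);
        System.out.println("Running; send SIGTERM (not SIGKILL) to see the hook fire.");
        Thread.sleep(60_000);
    }
}
```

Running this and sending the process a plain kill (SIGTERM) should print the deregistration message, while kill -9 should not. That makes it a quick way to check which signal a container runtime delivers.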


It appears our only option to use a clustered Vert.X event bus in Docker Swarm is to completely redeploy all application Vert.X instances when an update of any verticle/service is needed. This would clearly be a major step backwards, the antithesis of micro-services, and less robust than prior non-Vert.x versions of our application.


The major justification for using Vert.X centered on using the event bus to communicate notifications from the backend (gateway) to the JavaScript UI. This capability is meant to replace the unreliable WebSockets used in the first generation of our application. The default Vert.X event bus implementation seems to be designed for performance and not reliability or robustness. The Vert.X event bus provides best-effort delivery. The event bus does not provide for durable messages, i.e. messages can be lost. While peer-to-peer communication offers higher performance potential than a traditional broker-based messaging mechanism, the design of the Vert.X event bus seems to be an impediment to total system robustness, reliability and availability.

[My commentary here...this is a separate issue...unrelated to the deployment issue....however I would like to get feedback on this as well as I'm quite sure folks are using Vert.x quite successfully, right?  How are these issues mentioned handled with Vert.x?]


The Options

We have several options going forward:


Option 1. Use a broker-based solution for all inter-process messaging other than gateway-to-UI notifications.  This could be organized so that the notification gateway verticle and the UI verticle are in the same Vert.X instance. This organization would remove the need to cluster the event bus. We could develop a simple wrapper/factory that would mimic the activation of Vert.X event bus and the send/publish/subscribe/consume methods. This would minimize future changes if there was a need to revert to a complete Vert.X event bus in the future. This option adds the need for an ActiveMQ/Artemis container (https://hub.docker.com/r/apache/activemq-artemis) and removes the need for Zookeeper containers.
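
[Editor's sketch, not part of the original report.] The wrapper/factory idea in Option 1 could look roughly like the Java below: application code depends on a small interface, and the backing transport (Vert.X event bus, a broker client, or an in-memory stub) is chosen in one place. The names MessageBus and InMemoryMessageBus are hypothetical and are not Vert.X or Artemis APIs. The in-memory implementation only illustrates the assumed semantics: publish delivers to every consumer of an address, and send delivers to exactly one, chosen round-robin.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical abstraction that verticles would code against instead of a
// concrete transport.
interface MessageBus {
    void consumer(String address, Consumer<String> handler); // register a handler
    void publish(String address, String body);                // deliver to all handlers
    boolean send(String address, String body);                // deliver to one handler
}

// In-memory stand-in. A broker-backed or event-bus-backed implementation
// would implement the same interface.
final class InMemoryMessageBus implements MessageBus {
    private final Map<String, List<Consumer<String>>> handlers = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> cursors = new ConcurrentHashMap<>();

    @Override
    public void consumer(String address, Consumer<String> handler) {
        handlers.computeIfAbsent(address, a -> new CopyOnWriteArrayList<>()).add(handler);
    }

    @Override
    public void publish(String address, String body) {
        handlers.getOrDefault(address, Collections.emptyList()).forEach(h -> h.accept(body));
    }

    @Override
    public boolean send(String address, String body) {
        List<Consumer<String>> list = handlers.getOrDefault(address, Collections.emptyList());
        if (list.isEmpty()) {
            return false; // nobody is listening on this address
        }
        // Round-robin selection of a single handler.
        int next = cursors.computeIfAbsent(address, a -> new AtomicInteger()).getAndIncrement();
        list.get(Math.floorMod(next, list.size())).accept(body);
        return true;
    }
}
```

Keeping the interface this small is a deliberate choice in the sketch: the fewer transport-specific features leak into it, the easier it should be to swap the implementation later.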

[My commentary here...this solution abandons using the Vert.x eventbus for all but a very small portion of the application.]


Option 2. Use a Vert.X event bus to ActiveMQ/Artemis integration (https://activemq.apache.org/artemis/docs/1.0.0/vertx-integration.html) that might reduce or remove the need for a clustered Vert.X event bus. Like option 1, this option adds the requirement for an ActiveMQ/Artemis container and removes the need for Zookeeper containers. Unfortunately there appears to be limited documentation and few examples to support this approach.


Option 3. Use another Docker orchestration solution (e.g. Kubernetes) that provides for a graceful container shutdown. It appears that significant effort is required to deploy Kubernetes. Even if we are able to deploy and demonstrate this on Dockernet, it might prove to be an issue when deploying ODIN in other locked down environments.

[My commentary here...I think we would like this as a long term goal but likely do not have time to implement for the next application version. So using Kubernetes is probably not a viable option near term.]


Option 4. Use a Docker plugin capability to develop a graceful service shutdown. Possible issues with this option would relate to deploying the plugin in a locked down environment that might not allow Docker plugins.


My commentary here...Please help us with this.  What are we missing?  Is there an easy fix for this?  I really don't want to implement a solution that negates using Vert.x.

-Dave

Thomas SEGISMONT

Jun 21, 2018, 4:13:49 AM
to ve...@googlegroups.com
I don't know much about Docker Swarm, but I doubt it stops containers abruptly by default. Or that the behavior be tuned. Anyway, graceful shutdown are usually due to the same one issue: the JVM is not running as PID 1 in the container and thus is not getting the shutdown signal (hence later killed abruptly).

You need to make sure that either the JVM has PID 1 in the container or the process with PID 1 (usually a startup script) relays the signal. One way to do this is to add the "exec" command in your CMD.
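
[Editor's sketch, not part of the original reply.] One quick way to check the PID 1 condition from inside the application is to log the JVM's own process ID at startup. The Java below is a minimal sketch that assumes Java 9 or later (for ProcessHandle); the class name PidCheck and the message wording are hypothetical.

```java
// Hypothetical startup diagnostic: reports whether this JVM is PID 1,
// i.e. the process that receives the container's stop signal directly.
final class PidCheck {

    static boolean isPid1(long pid) {
        return pid == 1L;
    }

    static String describe(long pid) {
        if (isPid1(pid)) {
            return "JVM is PID 1: it should receive the container stop signal directly.";
        }
        return "JVM is PID " + pid + ": the stop signal goes to another process (for example "
                + "a shell or startup script), which must forward it or exec the JVM.";
    }

    public static void main(String[] args) {
        // ProcessHandle requires Java 9 or later.
        long pid = ProcessHandle.current().pid();
        System.out.println(describe(pid));
    }
}
```

Outside a container this will normally report a PID other than 1, so the check is only meaningful when run inside the image as it is actually deployed.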

Thomas SEGISMONT

Jun 21, 2018, 4:15:07 AM
to ve...@googlegroups.com
I meant "Or that the behavior CANNOT be tuned" and "graceful shutdown ISSUES are usually due to"