Vert.X with Cluster and Docker - works ok, but when one "node" fails...


lukjel

Sep 16, 2016, 1:46:16 PM
to vert.x
Hi!

I'm playing with Docker and Vert.X with cluster. It looks very good. You can check this repo for my vert-docker-test project: https://github.com/lukjel/vertx-cluster-docker-test

However... I discovered something very hmm... terrible. And I don't know how to deal with it...

When I add "nodes" or "run image" the cluster works ok.
When I kill one node, the hazelcast cluster does not refresh. It still expects the killed node to be working.

After some time (much more than a minute) it reorganizes itself, but it takes too much time... And a lot of messages can be sent to "outer space".
When I started this locally (some changes in the project are necessary - the httpServer port) there was no such problem.
Main question: what am I doing wrong and how can I avoid this problem (the reorganization shouldn't lose any messages from events...)?

My second question is about docker swarm. It does not support "multicast" so the hazelcast cluster is not working in this mode. 
Has anyone found a solution?

I'm looking for a good environment to start many "vertx-microservices" on two hosts with the possibility of auto-scaling, HA, easy updates, etc. Docker Swarm looks ok to me, but not with vert.x and the clustered eventBus.

Regards
Lukasz

Christian Vogel

Sep 17, 2016, 9:04:54 AM
to vert.x
To your second question: Vert.x also supports Zookeeper as a cluster manager; you could try using that.
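For reference, the cluster manager ships as a separate artifact. A minimal sketch of the Maven dependency, assuming the 3.3.3 version mentioned later in this thread:

```xml
<!-- sketch: replaces vertx-hazelcast on the classpath; version taken from this thread -->
<dependency>
  <groupId>io.vertx</groupId>
  <artifactId>vertx-zookeeper</artifactId>
  <version>3.3.3</version>
</dependency>
```

Note that, unlike Hazelcast's multicast discovery, this requires a ZooKeeper server (or ensemble) to already be running and reachable from every node.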

lukjel

Sep 17, 2016, 9:12:01 AM
to vert.x
Could you compare it with hazelcast?

lukjel

Sep 17, 2016, 9:19:18 AM
to vert.x
How can I use zookeeper with vert.x? There is no info about it in the documentation.

Reg.
Lukjel


On Saturday, September 17, 2016 at 3:04:54 PM UTC+2, Christian Vogel wrote:

Christian Vogel

Sep 17, 2016, 9:29:21 AM
to ve...@googlegroups.com

-- 
Christian Vogel
Sent with Airmail
--
You received this message because you are subscribed to a topic in the Google Groups "vert.x" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/vertx/3JTLkaGwXh0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to vertx+un...@googlegroups.com.
Visit this group at https://groups.google.com/group/vertx.
To view this discussion on the web, visit https://groups.google.com/d/msgid/vertx/c4b84b35-2a40-4bd7-8b69-c99039002508%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

lukjel

Sep 17, 2016, 9:39:27 AM
to vert.x
Thanks & sorry - I should have googled it myself.

One question though: when I just changed vertx-hazelcast to vertx-zookeeper (and added the slf4j libs to the dependencies...) the jar, or rather the cluster, does not start. It repeatedly logs something like this:
2016-09-17 15:31:28.191 | WARN | .0.1:2181) | rg.apache.zookeeper.ClientCnxn:1102) | Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_92]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_92]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[vertx-docker-test-fat.jar:?]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [vertx-docker-test-fat.jar:?]
2016-09-17 15:31:29.301 | INFO | .0.1:2181) | rg.apache.zookeeper.ClientCnxn::975) | Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2016-09-17 15:31:29.302 | WARN | .0.1:2181) | rg.apache.zookeeper.ClientCnxn:1102) | Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_92]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_92]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[vertx-docker-test-fat.jar:?]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [vertx-docker-test-fat.jar:?]
2016-09-17 15:31:30.077 | ERROR | r-thread-0 | apache.curator.ConnectionState::200) | Connection timed out for connection string (127.0.0.1) and timeout (3000) / elapsed (16339)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[vertx-docker-test-fat.jar:?]
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) ~[vertx-docker-test-fat.jar:?]




Do I need to start a zookeeper server?

Reg.
-l

Christian Vogel

Sep 17, 2016, 11:31:36 AM
to ve...@googlegroups.com
I’m not a zookeeper expert but I believe a zookeeper cluster needs at least one master registry, since it is not multicast.

-- 
Christian Vogel
Sent with Airmail

Julien Viet

Sep 17, 2016, 11:34:59 AM
to ve...@googlegroups.com
The Zookeeper cluster manager 3.3.3 was released to Maven central a couple of days ago.

it’s quite a new component, however it’s fully documented (in the project for now).

it has not yet been added to Vert.x stack nor the website.

it should be added in Vert.x stack for 3.4.0 and on the website soon.



lukjel

Sep 17, 2016, 11:58:29 AM
to vert.x
Thanks. I'll test it with zookeeper and also ignite.
 
Coming back to the main question: do you have any idea why killing one docker container makes the cluster "not stable"?

Reg.
-l.

lukjel

Sep 17, 2016, 1:09:38 PM
to vert.x
So...

Apache Ignite - I just changed pom.xml and it works great... there is no problem with removing a node - it works properly... however, it is more than 3 times slower than hazelcast. I mean: my test (https://github.com/lukjel/vertx-cluster-docker-test) is 3 times slower with apache ignite than with hazelcast.
Apache Ignite handles about 7-8 messages per ms.
Hazelcast 22-25 messages per ms.
(I simplified this test - now it sends about 127551 messages...).

But... Apache Ignite handles node removal properly, in contrast to Hazelcast...
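Swapping cluster managers really is just a pom.xml change. A hedged sketch of what that swap looks like, assuming the vertx-ignite artifact at the same version as the rest of the stack:

```xml
<!-- sketch: swap the Hazelcast cluster manager for the Ignite one -->
<dependency>
  <groupId>io.vertx</groupId>
  <artifactId>vertx-ignite</artifactId>
  <version>3.3.3</version>
</dependency>
```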

Zookeeper - I'm not able to run it. Is there anyone who can help me with it? Any suggestions on how to adapt my project to zookeeper are welcome...

The same probably goes for JGroups - there must be some changes in the configuration.

Still, the most important question is: why, with hazelcast, does it take so much time to "reconfigure" the topology when you remove one cluster node? It looks more like a cluster.xml config file problem, but I don't know what I should change... some kind of "timeout" parameter??? Which one?

Reg.
-l.

Jochen Mader

Sep 17, 2016, 1:24:25 PM
to ve...@googlegroups.com
I think it would be great to have a reproducer for the performance degradation you are seeing with ignite.
The cluster-manager shouldn't have such a big impact on message transfer, as it only syncs a map of cluster members and registered addresses between the nodes.
The actual sending is done directly between nodes.
Sounds like a contention problem in the Ignite-module.




--
Jochen Mader | Lead IT Consultant

codecentric AG | Elsenheimerstr. 55a | 80687 München | Deutschland
tel: +49 89 215486633 | fax: +49 89 215486699 | mobil: +49 152 51862390
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de | www.more4fi.de

Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz

lukjel

Sep 17, 2016, 2:00:29 PM
to vert.x
You can use my simple project as a reproducer: https://github.com/lukjel/vertx-cluster-docker-test (I know it's not as simple as it should be...)
You can test it in a "clean" environment. I described how to set up a new machine (google cloud) with docker and deploy a cluster with 4 "nodes".

Reg.
-l.

Thomas SEGISMONT

Sep 19, 2016, 7:14:26 AM
to ve...@googlegroups.com
Can you verify that the container is shut down gracefully? The cluster manager should call HazelcastInstance#shutdown when the node is leaving, which should send the signal to other members (unless an existing HazelcastInstance was provided externally).


lukjel

Sep 19, 2016, 8:16:27 AM
to vert.x
Hi.

I've checked this:

docker kill XX - is not graceful
docker stop XX - is graceful.

In both cases the problem is the same: after stopping one node the cluster is not reorganized.

Here is a sample from the logs AFTER stopping one node:
Sep 19, 2016 12:13:09 PM com.hazelcast.nio.tcp.TcpIpConnection
INFO: [172.17.0.2]:5701 [dev] [3.6.3] Connection [Address[172.17.0.5]:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
Sep 19, 2016 12:13:09 PM com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader
WARNING: [172.17.0.2]:5701 [dev] [3.6.3] hz._hzInstance_1_dev.IO.thread-in-0 Closing socket to endpoint Address[172.17.0.5]:5701, Cause:java.io.EOFException: Remote socket closed!
Sep 19, 2016 12:13:11 PM com.hazelcast.nio.tcp.InitConnectionTask
INFO: [172.17.0.2]:5701 [dev] [3.6.3] Connecting to /172.17.0.5:5701, timeout: 0, bind-any: true
Sep 19, 2016 12:13:56 PM com.hazelcast.cluster.impl.ClusterHeartbeatManager
WARNING: [172.17.0.2]:5701 [dev] [3.6.3] This node does not have a connection to Member [172.17.0.5]:5701

I expect that in both cases (graceful or hard kill) the cluster should keep working...

Regards
Lukasz.





Jochen Mader

Sep 19, 2016, 9:28:53 AM
to ve...@googlegroups.com
I don't know if I read the logs correctly but this doesn't look like a graceful shutdown.
Looks as if the network got cut off before vert.x had the chance to correctly deregister.


lukjel

Sep 19, 2016, 6:54:17 PM
to vert.x
Hi.

IMHO it's irrelevant.
The cluster should survive and stay consistent even in the case of an "unexpected" node failure.
Ignite can deal with it. Hazelcast can't... However, ignite is much slower.
My question: is the problem in the config file, or is it something else?
Regards
Lukasz.

Thomas SEGISMONT

Sep 20, 2016, 10:58:29 AM
to ve...@googlegroups.com
I had a look at your project and tried it on my box (no docker machine with googlecloud account, just local docker daemon). When you run "docker stop XXX" the application is *not* shut down gracefully. The reason is the way you build the docker image. When the container starts, it will execute the following:

/bin/sh -c 'java -Dvertx.logger-delegate-factory-class-name=io.vertx.core.logging.Log4j2LogDelegateFactory -jar $VERTICLE_FILE -cluster'

And this ends up with a shell process with PID 1 and a child JVM process (see docker exec XXX ps -ef). But when you run docker stop, docker sends the signal to the process with PID 1, which, in this case, doesn't forward the signal to the child JVM process. There are two ways to fix this: wrap your JVM exec with a script which traps the signal and forwards it, or simply run the JVM directly. For the latter, here's what should be in your Dockerfile (no ENTRYPOINT):

CMD ["/usr/bin/java", "-Dvertx.logger-delegate-factory-class-name=io.vertx.core.logging.Log4j2LogDelegateFactory", "-jar", "vertx-docker-test-fat.jar", "-cluster"]

As for running on the Google cloud platform, I suggest you follow the Hazelcast documentation:
http://docs.hazelcast.org/docs/3.6/manual/html-single/index.html#discovering-members-with-jclouds




lukjel

Sep 20, 2016, 11:36:41 AM
to vert.x
Thanks Thomas.

1. I created the Dockerfile the way it is described in the official documentation. IMHO there should be some update about vert.x, docker and fat-jars.
2. There is still a problem with killing one node in a not-so-graceful way. Any ideas?

Reg.
Lukasz.

PS. I updated Dockerfile in my test project but I'll test it later.

Thomas SEGISMONT

Sep 20, 2016, 11:55:47 AM
to ve...@googlegroups.com
2016-09-20 17:36 GMT+02:00 lukjel <l.zeli...@gmail.com>:
Thanks Thomas.

1. I created Dockerfile in the way how it is described in official documentation. IMHO there should be some update about vert.x and docker and fat-jar.


Or vertx-examples repo? Or both I guess?

 
2. There is still a problem with killing one node in not so graceful way. Any idea?


With my local testing I couldn't reproduce issues like yours. Well, the node couldn't signal its leaving, so it took some time for the other nodes to detect the failure, but this is pretty normal and the cluster rebalanced after a few seconds.

 

lukjel

Sep 20, 2016, 12:27:41 PM
to vert.x
Hi

On Tuesday, September 20, 2016 at 5:55:47 PM UTC+2, Thomas Segismont wrote:


2016-09-20 17:36 GMT+02:00 lukjel <l.zeli...@gmail.com>:
Thanks Thomas.

1. I created Dockerfile in the way how it is described in official documentation. IMHO there should be some update about vert.x and docker and fat-jar.


Or vertx-examples repo? Or both I guess?

Both :D 

 
2. There is still a problem with killing one node in not so graceful way. Any idea?


With my local testing I couldn't reproduce issues like yours. Well the node couldn't signal its leaving so it took some time for other nodes to detect the failure, but this is pretty normal and the cluster was rebalanced after a few seconds.

Here's the rub!

My cluster rebalanced after a very long period. I didn't measure it, but it was more like 1 minute. This is much too long.
I tested the scenario with Apache Ignite and there was no such problem.

reg.
Lukasz.


 

lukjel

Sep 20, 2016, 1:21:48 PM
to vert.x
So...
I can confirm - after the Dockerfile change - docker stop works properly and the cluster rebalances. However, when I run "docker kill" it takes much more than one minute to rebalance. That's not good.


And finally I found a solution! There is a hazelcast parameter: hazelcast.max.no.heartbeat.seconds
And it is exactly what it sounds like. The default value is 300, so it takes about 5 minutes to rebalance.
When I set this parameter to 5, it works as I expect.

To set this parameter you have to add this to your java command:
-Dhazelcast.max.no.heartbeat.seconds=5
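The same property can also go in the Hazelcast config file instead of the command line - a sketch of the relevant cluster.xml fragment, with the heartbeat interval added as an assumption (tightening the timeout usually goes together with a shorter interval):

```xml
<hazelcast>
  <properties>
    <!-- declare a node dead after 5s without a heartbeat (default: 300) -->
    <property name="hazelcast.max.no.heartbeat.seconds">5</property>
    <!-- assumption: send heartbeats every second so 5s is enough slack -->
    <property name="hazelcast.heartbeat.interval.seconds">1</property>
  </properties>
</hazelcast>
```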


Officially - problem solved. (The documentation and examples for docker and fat-jars have to be corrected - how do I report this? Should I open an issue?)

Thanks to all involved!

Lukas

PS. 
Here are some other VERY interesting parameters for hazelcast: 

Jochen Mader

Sep 20, 2016, 2:38:36 PM
to ve...@googlegroups.com
A dark memory returns.
We did the same thing in a different project, and we also spent a hell of a lot of time figuring that one out.
We should add that to the cluster documentation, as it is a very common situation, and I have to agree that 300 seconds is a very long time.


lukjel

Sep 20, 2016, 7:06:59 PM
to vert.x
+1

Hey, Vert.x team - could you add this (the hazelcast parameters) to the documentation? It could be very helpful :D

Reg.
Lukas

Clement Escoffier

Sep 20, 2016, 11:08:33 PM
to ve...@googlegroups.com

Jochen Mader

Sep 21, 2016, 6:27:08 AM
to ve...@googlegroups.com
I just opened the pull-request.


lukjel

Sep 21, 2016, 6:40:26 AM
to vert.x
Oops... I did it too... :D

-l.


Thomas SEGISMONT

Sep 21, 2016, 9:18:29 AM
to ve...@googlegroups.com
2016-09-20 19:21 GMT+02:00 lukjel <l.zeli...@gmail.com>:
So...
I can confirm - after Dockerfile change - docker stop works properly and cluster rebalanced. However when I run "docker kill" then it takes much more than one minute to rebalance. It's not good.


Not graceful kill is not graceful kill :)

 


And finnaly I found a solution! There is hazelcast parameter  hazelcast.max.no.heartbeat.seconds
And it is exactly what it sounds to be. Default value is 300 so it takes about 5 minutes to rebalance.
When I set this parameter to 5 then it works as I expect. 

To set this parameter you have to add this to your java command:
-Dhazelcast.max.no.heartbeat.seconds=5


Have you checked that it works fine in the long run? You might want to tune the heartbeat interval as well (to avoid being timed out while the node is still alive).

 

Officialy - problem solved. (documentation and examples of docker and fat-jar has to be corrected - how to report this? should I add an issue?)


I've added these to my notes and will take care of creating the issues and fixing them.
 

Thanks to all involved!

Anytime
 

Lukas

PS. 
Here are some other VERY interesting parameters for hazelcast: 


