What could be causing these event bus timeouts?

613 views
Skip to first unread message

Hoobajoob

unread,
May 13, 2018, 8:34:10 PM5/13/18
to vert.x

Hello,

I'm deploying 3 different vertx services into a cloud foundry environment that has support for container-to-container networking. I configure each vertx service to use the same value for vertx.options.clusterPort (5150), and I configure cloud foundry network policies for that port (as well as the hazelcast discovery port 5701) in each direction for all apps (so, tcp connections are allowed in all directions for both ports for all vertx services).

From the logs, I see that the cluster successfully forms, but when one of the services attempts to send an event bus message to an address that I know has an active registered handler:

1) the sending app does not throw any exceptions complaining that there are no registered handlers for the event bus address
2) the app that has the registered handler for that address shows no sign in the logs of ever having received any message
3) the sending app eventually times out waiting for a response

This happens every single time these services deploy - I have never seen it work, so I'm hoping there's something obvious in vertx/hazelcast/cloud-foundry configuration that I'm missing.

How can this behavior occur? Is there any way to get detailed logging from the event bus so I can learn what's happening to these messages when they timeout? Can I find out if a connection is getting dropped or rejected?

Here is the cloud foundry network policy:
source            destination       protocol   ports
service-01        service-03        tcp        5701
service-02        service-03        tcp        5701
service-03        service-03        tcp        5701
service-01        service-03        tcp        5150
service-02        service-03        tcp        5150
service-03        service-03        tcp        5150
service-01        service-02        tcp        5701
service-03        service-02        tcp        5701
service-02        service-02        tcp        5701
service-01        service-02        tcp        5150
service-03        service-02        tcp        5150
service-02        service-02        tcp        5150
service-02        service-01        tcp        5701
service-03        service-01        tcp        5701
service-01        service-01        tcp        5701
service-02        service-01        tcp        5150
service-03        service-01        tcp        5150
service-01        service-01        tcp        5150

Here is the cluster formation log from the startup sequence of the sending application (service-01 in the policy above. It is the last application to startup, and this occurs after service-02 has registered its handler for the event bus address in question):

2018-05-13T23:26:53.448+0000 [cached3] INFO  c.h.nio.tcp.InitConnectionTask - [10.255.68.119]:5701 [nimbus] [3.6.3] Connecting to /10.255.55.183:5701, timeout: 0, bind-any: true
2018-05-13T23:26:53.450+0000 [cached2] INFO  c.h.nio.tcp.InitConnectionTask - [10.255.68.119]:5701 [nimbus] [3.6.3] Connecting to /10.255.55.182:5701, timeout: 0, bind-any: true
2018-05-13T23:26:53.498+0000 [cached2] INFO  c.h.nio.tcp.TcpIpConnectionManager - [10.255.68.119]:5701 [nimbus] [3.6.3] Established socket connection between /10.255.68.119:59445 and /10.255.55.182:5701
2018-05-13T23:26:53.498+0000 [cached3] INFO  c.h.nio.tcp.TcpIpConnectionManager - [10.255.68.119]:5701 [nimbus] [3.6.3] Established socket connection between /10.255.68.119:46105 and /10.255.55.183:5701
2018-05-13T23:26:54.466+0000 [hz._hzInstance_1_nimbus.generic-operation.thread-1] WARN  c.h.p.InternalPartitionService - [10.255.68.119]:5701 [nimbus] [3.6.3] Ignoring received partition table, startup is not completed yet. Sender: Address[10.255.55.182]:5701
2018-05-13T23:26:54.478+0000 [hz._hzInstance_1_nimbus.generic-operation.thread-3] INFO  com.hazelcast.cluster.ClusterService - [10.255.68.119]:5701 [nimbus] [3.6.3] 

Members [3] {
Member [10.255.55.182]:5701 // this is the address for service-03
Member [10.255.55.183]:5701 // this is the address for service-02
Member [10.255.68.119]:5701 this
}

2018-05-13T23:26:56.507+0000 [vert.x-worker-thread-0] INFO  com.hazelcast.core.LifecycleService - [10.255.68.119]:5701 [nimbus] [3.6.3] Address[10.255.68.119]:5701 is STARTED
2018-05-13T23:26:56.861+0000 [vert.x-eventloop-thread-0] WARN  i.netty.util.internal.MacAddressUtil - Failed to find a usable hardware address from the network interfaces; using random bytes: 03:0a:af:95:7b:86:30:47



Here is service-01's log entry illustrating the timeout:

2018-05-13T23:26:57.565+0000 [vert.x-eventloop-thread-7] TRACE c.o.p.p.MyClass - Sending request to the provider at (my.eventbus.address)

...

2018-05-13T23:27:27.614+0000 [vert.x-eventloop-thread-7] DEBUG c.o.p.p.MyClass - Error attempting provider request
io.vertx.core.eventbus.ReplyException: Timed out after waiting 30000(ms) for a reply. address: fc60e114-4a40-48c0-befe-013c0bfe0b7b
at io.vertx.core.eventbus.impl.HandlerRegistration.sendAsyncResultFailure(HandlerRegistration.java:118)
at io.vertx.core.eventbus.impl.HandlerRegistration.lambda$new$0(HandlerRegistration.java:65)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:803)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:774)
at io.vertx.core.impl.ContextImpl.lambda$wrapTask$2(ContextImpl.java:337)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:445)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.lang.Thread.run(Thread.java:748)



Message has been deleted

Hoobajoob

unread,
May 14, 2018, 12:18:08 PM5/14/18
to vert.x
There is certainly something working just fine for hazelcast, but not so for the event bus. For example, these services are using the vert.x service discovery feature with the default backend. I believe the default backend is just an async map, which is implemented by hazelcast. Discovery works just fine - records registered by one service are found by a different service.

But, publishes to a common event bus address are not received by any other member of the cluster. I'm assuming this must come down to clusterHost/Port configuration, but I can't seem to find out what I'm getting wrong.

One thing that seems funky is that each service is writing out the log message found here, indicating it wasn't able to find clusterHost from the command line or system properties:

...yet I have supplied system properties on the command line at startup that include both clusterHost and clusterPort, like so:
-Dvertx.options.clusterPort=5150
-Dvertx.options.clusterHost=10.255.174.192

Any suggestions for how to troubleshoot this or alternatives to try are welcome...

Hoobajoob

unread,
May 14, 2018, 5:54:44 PM5/14/18
to vert.x
I think I may have uncovered the problem - I'm running with version vertx version 3.4.1, and in that version, the BareCommand class doesn't set the clusterHost/Port variables correctly when provided via system properties. They have to be provided as command line options using cluster-host and cluster-port.

So, even though the clusterHost address is correct, I believe I'm getting bonked by random port selection for the clusterPort, which won't match the fixed port config I'm set up with.

Jochen Mader

unread,
May 15, 2018, 2:19:39 AM5/15/18
to ve...@googlegroups.com
Hmm, I just checked the BareCommand and I can't see what's broken in there.
The following line exists in all versions:
configureFromSystemProperties(options, VERTX_OPTIONS_PROP_PREFIX);
This checks the system properties starting with "vertx.options.".
What I can't find is a hint that implies using -Dvertx.options.clusterPort should be used for configuring the cluster manager.
I only see the docs about using the commandline args.
Can you point me to where this is in the docs?

2018-05-14 23:54 GMT+02:00 Hoobajoob <chefho...@gmail.com>:
I think I may have uncovered the problem - I'm running with version vertx version 3.4.1, and in that version, the BareCommand class doesn't set the clusterHost/Port variables correctly when provided via system properties. They have to be provided as command line options using cluster-host and cluster-port.

So, even though the clusterHost address is correct, I believe I'm getting bonked by random port selection for the clusterPort, which won't match the fixed port config I'm set up with.

--
You received this message because you are subscribed to the Google Groups "vert.x" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vertx+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/vertx.
To view this discussion on the web, visit https://groups.google.com/d/msgid/vertx/f564e3aa-8ae5-418f-8195-a9d7f69e0acb%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Jochen Mader | Principal IT Consultant

codecentric AG | Elsenheimerstr. 55a | 80687 München | Deutschland
tel: +49 89 215486633 | fax: +49 89 215486699 | mobil: +49 152 51862390
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de

Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
Vorstand: Michael Hochgürtel . Rainer Vehns
Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz

Hoobajoob

unread,
May 15, 2018, 9:26:33 AM5/15/18
to vert.x
Yes, it makes that same call, but in vertx 3.4.1, it doesn't actually use the clusterHost/Port values from the options class that result from that call. Instead, BareCommand treats its own member variables as the authoritative values (which are only set if you pass in the command line args) and overrides whatever values were set from system properties:

Hoobajoob

unread,
May 15, 2018, 9:42:25 AM5/15/18
to vert.x
Yes, the documentation does not mention this anywhere, and the naming of the system properties to use has changed over time. It used to be that you had to use vertx.cluster.* to set the event bus cluster parameters, and it wasn't publicly documented back then either:

There have been github issues submitted requesting that the properties get added to the VertxOptions class and documented as to how they work, and that *has* changed, but for whatever reason the system property support was never part of the public documentation. I assume this is an oversight, since system property configuration has always been supported and carried forward into subsequent releases as the changes were made.

Perhaps there's a hint in the exchange here with Tim Fox, where back in 2015 he seemed to be unaware that there was any system property support for this configuration at all:

Hoobajoob

unread,
May 15, 2018, 9:49:15 AM5/15/18
to vert.x
The main point seems to be that system properties can only be used to configure these properties if you're using the command line or Launcher API for startup, but none of the public command line documentation makes any mention of system properties:

Hoobajoob

unread,
May 15, 2018, 10:05:38 AM5/15/18
to vert.x
Also, here's the issue reported for the run command ignoring system property settings:

Hoobajoob

unread,
May 15, 2018, 10:14:22 AM5/15/18
to vert.x
I added an issue for system property documentation here:

Reply all
Reply to author
Forward
0 new messages