Hi everyone,
I have been researching this issue for three months.
In short, here is the story.
There is an application that uses Hazelcast ITopic for pub/sub messaging.
The application is quite sensitive to message latency: even a 2-second delay disrupts operations.
The publisher:
ITopic<MyMessage> topic = hazelcastInstance.getTopic(...);
topic.publish(message);
The subscriber:
public void onMessage(Message<MyMessage> broadcastMessage) {
    // Here, ExecutorService.execute() is called to deliver the message and return control to Hazelcast
}
There are 2 nodes in the setup, but only 1 is active (i.e. all publishers and subscribers are local to that node); the second node is always a hot standby.
MyMessage is an envelope that carries its creation timestamp.
In onMessage, the 'Hazelcast internal broadcast latency' (arrival time minus that creation timestamp) is usually below 100 ms, but it sometimes reaches 10-15 seconds.
There can be up to 1500 subscribers to one ITopic, usually 200-300, but the issue seems to happen with e.g. 100 subscribers as well.
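To make the measurement concrete, here is a minimal sketch of how the latency could be recorded before handing the message off. The class and method names (BroadcastLatencyTracker, onArrival) are hypothetical and not part of Hazelcast, and it assumes the envelope timestamp comes from System.currentTimeMillis() on the same node, so clock skew is not a factor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper; names are illustrative and not part of Hazelcast.
class BroadcastLatencyTracker {
    private final ExecutorService worker;
    private final AtomicLong maxLatencyMillis = new AtomicLong();

    BroadcastLatencyTracker(ExecutorService worker) {
        this.worker = worker;
    }

    // Latency = arrival time in the listener minus the envelope's creation timestamp.
    static long latencyMillis(long createdAtMillis, long receivedAtMillis) {
        return Math.max(0L, receivedAtMillis - createdAtMillis);
    }

    // Meant to be called from onMessage(): record the latency, hand off, return quickly.
    long onArrival(long createdAtMillis, Runnable delivery) {
        long latency = latencyMillis(createdAtMillis, System.currentTimeMillis());
        maxLatencyMillis.accumulateAndGet(latency, Math::max);
        worker.execute(delivery);
        return latency;
    }

    long maxLatencyMillis() {
        return maxLatencyMillis.get();
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        BroadcastLatencyTracker tracker = new BroadcastLatencyTracker(worker);
        // Pretend the message was created 120 ms ago (made-up value for the demo).
        tracker.onArrival(System.currentTimeMillis() - 120, () -> { });
        System.out.println("max latency so far (ms): " + tracker.maxLatencyMillis());
        worker.shutdown();
    }
}
```

Because the hand-off to the worker happens after the timestamp comparison, the recorded value covers only the time between publish() and the listener call, not the application's own processing.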
The JVM thread graphs do not show any unusual spikes in blocked threads.
Excerpts from the Hazelcast configuration that could be relevant:
<properties>
<property name="hazelcast.icmp.enabled">true</property>
<property name="hazelcast.executor.event.thread.count">256</property>
</properties>
...
Here, the number of event threads was increased because it was thought to be the bottleneck.
It is no longer clear whether this has any effect.
...
<executor-service>
<core-pool-size>16</core-pool-size>
<max-pool-size>64</max-pool-size>
<keep-alive-seconds>60</keep-alive-seconds>
</executor-service>
In the Hazelcast code, I've tracked the issue down to the ParallelExecutor.execute(runnable, hash) method, which supposedly ensures that all messages are delivered in the order of the publish() calls.
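To illustrate why ordered delivery could produce this kind of delay, here is a simplified model of a hash-keyed executor. It is not Hazelcast's actual ParallelExecutor; the class StripedExecutorSketch, the stripe count, and the topic name are assumptions made up for the example. The point is only that if tasks with the same hash must run in order, one slow task holds up everything queued behind it under that hash.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified model only; NOT Hazelcast's actual ParallelExecutor.
class StripedExecutorSketch {
    private final ExecutorService[] stripes;

    StripedExecutorSketch(int stripeCount) {
        stripes = new ExecutorService[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Tasks with the same hash land on the same single-threaded stripe,
    // so they run strictly in submission order.
    void execute(Runnable task, int hash) {
        stripes[Math.floorMod(hash, stripes.length)].execute(task);
    }

    // Stops accepting tasks and waits for the queued ones to finish.
    void shutdownAndWait() {
        for (ExecutorService stripe : stripes) {
            stripe.shutdown();
        }
        try {
            for (ExecutorService stripe : stripes) {
                stripe.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Submits five tasks under one hash; the first is slow and delays the rest.
    static List<Integer> demo() {
        StripedExecutorSketch executor = new StripedExecutorSketch(4);
        List<Integer> completed = Collections.synchronizedList(new ArrayList<Integer>());
        int topicHash = "busy-topic".hashCode(); // hypothetical topic name
        for (int i = 0; i < 5; i++) {
            final int id = i;
            executor.execute(() -> {
                if (id == 0) {
                    sleepQuietly(200); // simulate one slow listener invocation
                }
                completed.add(id);
            }, topicHash);
        }
        executor.shutdownAndWait();
        return new ArrayList<>(completed);
    }

    public static void main(String[] args) {
        System.out.println("completion order: " + demo());
    }
}
```

In this model, tasks submitted under a different hash would proceed on another stripe without waiting, which would be consistent with busy topics lagging while quiet ones stay fast. Whether Hazelcast behaves this way internally is an assumption on my part.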
Other relevant observations:
- On our standard test, Hazelcast 3.1 performs similarly to or even worse than 2.6 in terms of measured broadcast latency.
- Specific ITopics seem to be affected, apparently those experiencing high traffic at that moment. At the same time, other ITopics exhibit almost zero latency, so one may conclude that the system as a whole is not affected.
- The 99.0th percentile stays below 5 seconds; the 99.9th percentile is up to 20 seconds.
Any advice is highly appreciated.
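For anyone who wants to compare numbers, here is a minimal sketch of how such percentiles can be computed from raw latency samples. It assumes the nearest-rank definition, which may differ slightly from what a given monitoring tool reports; the class name LatencyPercentiles and the sample values are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper for computing latency percentiles from raw samples.
class LatencyPercentiles {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up latencies in milliseconds, for illustration only.
        long[] latenciesMillis = {12, 35, 8, 90, 15000, 40, 22, 61, 75, 18};
        System.out.println("p50 = " + percentile(latenciesMillis, 50.0) + " ms");
        System.out.println("p99 = " + percentile(latenciesMillis, 99.0) + " ms");
    }
}
```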
Thanks in advance,
Timur