ITopic: high latency between publish() and onMessage() calls

Timur Evdokimov

Mar 15, 2014, 7:51:51 PM

to haze...@googlegroups.com

Hi everyone,

I have been researching this issue for three months now.

The story, in short:

There is an application that uses Hazelcast ITopic for pub/sub messaging.
The application is quite sensitive to message latency: even a 2-second delay disrupts operations.

The publisher:
        ITopic<MyMessage> topic = hazelcastInstance.getTopic(...);
        topic.publish(message);

The subscriber:
        public void onMessage(Message<MyMessage> broadcastMessage) {
            // Here, ExecutorService.execute() is called to deliver the message
            // and return control back to Hazelcast.
        }

There are 2 nodes in the setup, but only 1 is active (i.e. all publishers and subscribers are local to that node); the second node is always a hot standby.

MyMessage is an envelope that carries its creation timestamp.

In onMessage, the 'Hazelcast internal broadcast latency' is usually below 100 ms, but sometimes it gets as high as 10-15 seconds.
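For illustration, a latency measurement of this kind can be sketched as below. This is a minimal sketch, not the application's actual code: `TimestampedEnvelope`, `now` and `latencyMillis` are hypothetical names, the real MyMessage is not shown in this thread, and it assumes that publisher and subscriber run on the same node so that both read the same clock.

```java
import java.io.Serializable;

// Hypothetical envelope; stands in for the MyMessage class, which is not shown here.
final class TimestampedEnvelope<T> implements Serializable {
    private static final long serialVersionUID = 1L;

    private final T payload;
    private final long createdAtMillis;

    TimestampedEnvelope(T payload, long createdAtMillis) {
        this.payload = payload;
        this.createdAtMillis = createdAtMillis;
    }

    // Stamps the envelope with the wall-clock time at creation, i.e. just before publish().
    static <T> TimestampedEnvelope<T> now(T payload) {
        return new TimestampedEnvelope<>(payload, System.currentTimeMillis());
    }

    T payload() {
        return payload;
    }

    // Broadcast latency as seen by a receiver that got the message at receivedAtMillis.
    long latencyMillis(long receivedAtMillis) {
        return receivedAtMillis - createdAtMillis;
    }
}
```

Inside onMessage(), the subscriber would call `latencyMillis(System.currentTimeMillis())` on the received envelope before handing the message off to its own executor.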

There can be up to 1500 subscribers to one ITopic, usually 200-300, but the issue seems to happen with as few as 100 subscribers as well.

The JVM thread graphs do not show any unusual spikes in blocked threads.

Excerpts from the Hazelcast configuration that could be relevant:

    <properties>
        <property name="hazelcast.icmp.enabled">true</property>
        <property name="hazelcast.executor.event.thread.count">256</property>
    </properties>
...
Here, the number of event threads was increased because it was thought to be the bottleneck.
It is not clear whether this setting still has any effect.
...

    <executor-service>
        <core-pool-size>16</core-pool-size>
        <max-pool-size>64</max-pool-size>
        <keep-alive-seconds>60</keep-alive-seconds>
    </executor-service>

In the Hazelcast code, I've tracked the issue down to the ParallelExecutor.execute(runnable, hash) method, which supposedly makes sure that all messages are delivered in the order of the publish() calls.
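To make the ordering idea concrete, here is a generic sketch of hash-striped execution. It is not Hazelcast's ParallelExecutor source, which is not reproduced in this thread; `StripedExecutor`, `shutdownAndAwait` and `preservesOrder` are hypothetical names, and the sketch simply assumes that tasks with the same hash are routed to the same single-threaded stripe.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Generic illustration of hash-striped execution; NOT Hazelcast's ParallelExecutor.
final class StripedExecutor {
    private final ExecutorService[] stripes;

    StripedExecutor(int stripeCount) {
        if (stripeCount < 1) {
            throw new IllegalArgumentException("stripeCount must be >= 1");
        }
        stripes = new ExecutorService[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            // One thread per stripe, so each stripe runs its tasks in FIFO order.
            stripes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Same hash -> same stripe -> tasks run in submission order.
    void execute(Runnable task, int hash) {
        stripes[Math.floorMod(hash, stripes.length)].execute(task);
    }

    // Stops accepting tasks and waits up to the timeout per stripe for queued tasks.
    boolean shutdownAndAwait(long timeout, TimeUnit unit) {
        for (ExecutorService s : stripes) {
            s.shutdown();
        }
        try {
            for (ExecutorService s : stripes) {
                if (!s.awaitTermination(timeout, unit)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Demo: submits n tasks with the same hash and checks they ran in submission order.
    static boolean preservesOrder(int n) {
        StripedExecutor executor = new StripedExecutor(4);
        List<Integer> seen = Collections.synchronizedList(new ArrayList<>());
        for (int i = 0; i < n; i++) {
            final int seq = i;
            executor.execute(() -> seen.add(seq), 42);
        }
        if (!executor.shutdownAndAwait(5, TimeUnit.SECONDS)) {
            return false;
        }
        if (seen.size() != n) {
            return false;
        }
        for (int i = 0; i < n; i++) {
            if (seen.get(i) != i) {
                return false;
            }
        }
        return true;
    }
}
```

In a design like this, one slow task delays everything queued behind it on the same stripe, while other stripes keep running. If the real executor works along similar lines, that would be consistent with one busy topic lagging while others show almost no latency, but that is an assumption and not something confirmed here.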

Other relevant observations:
- On our standard test, Hazelcast 3.1 shows measured broadcast latency similar to, or even worse than, 2.6.
- Only specific ITopics seem to be affected, apparently those experiencing high traffic at that moment. At the same time, other ITopics show almost zero latency, so the system as a whole does not appear to be affected.
- The 99.0th percentile stays below 5 seconds; the 99.9th percentile is up to 20 seconds.
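For reference, percentile figures like these can be derived from raw latency samples along the following lines. This is a sketch using the nearest-rank definition; how the numbers above were actually computed is not stated, and `LatencyPercentiles` is a hypothetical helper.

```java
import java.util.Arrays;

// Hypothetical helper; uses the nearest-rank percentile definition.
final class LatencyPercentiles {
    private LatencyPercentiles() {
    }

    // Returns the smallest sample such that at least p percent of all samples are <= it.
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

For example, `LatencyPercentiles.percentile(samples, 99.9)` gives the 99.9th percentile of the collected latencies in milliseconds.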

Any advice is highly appreciated.

Thanks in advance,
Timur

Peter Veentjer

Mar 20, 2014, 2:42:39 AM

to haze...@googlegroups.com

Can you extract the relevant parts so we can have a look at it?

Currently it is a black box for me, so it could have all kinds of causes.

And unless you have many cores, creating 256 event threads is not going to make things better, only worse.

Peter Veentjer

Mar 20, 2014, 2:50:59 AM

to haze...@googlegroups.com

Can you set this property when running?

-Dhazelcast.health.monitoring.level=NOISY

This will log a lot of internal state every 30 seconds.

PS: All topics share a queue; topics don't have their own event queue. This will be changed in the near future, but until then one topic under stress could influence another topic.

Timur Evdokimov

Mar 20, 2014, 12:17:01 PM

to haze...@googlegroups.com

> All topics share a queue;

It seems like when one topic is affected, for other topics (there are hundreds at least) publishing/receiving goes just fine, other HZ activity is also quite OK.

Clearly it is not the main event thread that gets stuck, but whatever happens after that execute(Runnable command, int hash) call.

We will try lowering the number of event threads.

The server has 24 cores; what value should we set here?

> Can you extract the relevant parts so we can have a look at it?

Well, this is all that is relevant.

The publisher publishes:

        ITopic<MyMessage> topic = hazelcastInstance.getTopic(...);
        topic.publish(message);

And the subscriber receives:

        public void onMessage(Message<MyMessage> broadcastMessage) {
        }

Most of the time, and on lower loads, everything's just fine.

But sometimes, it takes 10+ seconds.

Timur Evdokimov

Mar 20, 2014, 4:43:03 PM

to haze...@googlegroups.com

Peter, I wonder whether, to achieve low-latency broadcasts, I should raise the thread priority of the event threads and/or lower it for the threads busy with further broadcast processing.

If both kinds of threads have the same priority and a message is broadcast to 500 listeners, then 500 tasks are submitted to the executor service in the onMessage() calls, and they have to compete with the event threads.

So it is no wonder the event threads are starved of CPU time, despite the 24-core server.
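On the application side, lowering the priority of the delivery threads could look roughly like the sketch below. `LowPriorityThreadFactory` and `newDeliveryPool` are hypothetical names, the sketch only covers the application's own executor and not Hazelcast's event threads, and thread priorities are only hints that the JVM or the OS scheduler may ignore, so any effect would have to be measured.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical factory for the application's own delivery pool (not Hazelcast's threads).
final class LowPriorityThreadFactory implements ThreadFactory {
    private final AtomicInteger counter = new AtomicInteger();
    private final String namePrefix;
    private final int priority;

    LowPriorityThreadFactory(String namePrefix, int priority) {
        if (priority < Thread.MIN_PRIORITY || priority > Thread.MAX_PRIORITY) {
            throw new IllegalArgumentException("priority out of range: " + priority);
        }
        this.namePrefix = namePrefix;
        this.priority = priority;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, namePrefix + "-" + counter.incrementAndGet());
        // Only a scheduling hint; the JVM or OS may not honour it.
        t.setPriority(priority);
        t.setDaemon(true);
        return t;
    }

    // Fixed-size pool whose workers run one step below normal priority.
    static ExecutorService newDeliveryPool(int threads) {
        return Executors.newFixedThreadPool(
                threads, new LowPriorityThreadFactory("delivery", Thread.NORM_PRIORITY - 1));
    }
}
```

The pool returned by `newDeliveryPool(...)` would replace the executor that onMessage() currently hands messages to. Bounding its size also limits how many delivery tasks compete for the CPU at once.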
