Low throughput in consumer & unregister scenario with two nodes


Mihai Stanescu

May 4, 2017, 8:23:28 AM
to ve...@googlegroups.com

Hi all,

I have two nodes in a cluster (Hazelcast cluster manager) with the latest vertx version.

I have a performance test that calls EventBus.consumer() and then, after some time, unregisters the consumer.
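In rough form, each iteration of the test does something like this (simplified sketch; the address naming and counters are illustrative, not the exact test code):

import io.vertx.core.Vertx;
import io.vertx.core.eventbus.MessageConsumer;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// One benchmark step: register a consumer, wait until the registration has
// propagated, then unregister it and count both operations.
static void registerUnregisterOnce(Vertx vertx, AtomicLong registered, AtomicLong unregistered) {
    String address = "bench." + UUID.randomUUID();
    MessageConsumer<String> consumer = vertx.eventBus().consumer(address, msg -> { });
    consumer.completionHandler(reg -> {
        registered.incrementAndGet();
        consumer.unregister(unreg -> unregistered.incrementAndGet());
    });
}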

The rate is about 3000 req/sec and the resources on my machine are barely used; CPU is very low.

The result is a bit surprising, as I was expecting higher latency but also bigger throughput.

Can someone explain why the number is so low before I dig into more details of the test?

Regards,
Mihai

Mihai Stanescu

May 4, 2017, 8:30:05 AM
to ve...@googlegroups.com
What I have tried so far:
 - increase the parallelism of the consumer & unregister calls => no change in rate
 - increase the number of threads in the internal blocking pool (the options I was setting are sketched below) => no change in rate
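For reference, these are the knobs I was changing (a sketch only; the values below are arbitrary):

import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;

public class ClusteredNode {
    public static void main(String[] args) {
        // Arbitrary example values; the defaults for both pools are 20.
        VertxOptions options = new VertxOptions()
            .setInternalBlockingPoolSize(100)   // pool used by Vert.x-internal executeBlocking calls
            .setWorkerPoolSize(40);
        Vertx.clusteredVertx(options, res -> {
            if (res.succeeded()) {
                // deploy the benchmark verticle on res.result()
            } else {
                res.cause().printStackTrace();
            }
        });
    }
}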

Is there anything else I can try?

Tim Fox

May 4, 2017, 3:33:21 PM
to vert.x
Very hard to say without seeing any code

yahim stnsc

May 9, 2017, 3:51:21 AM
to vert.x
Maybe it is related to this issue:


Also, in HazelcastAsyncMultiMap I noticed that the methods all share a single taskQueue, so practically all accesses are serialized.

Combine this with a network round trip and I guess it won't be so fast.

IMO it could use a pool of taskQueues (maybe equal to the Hazelcast partition count, or even bigger).

I guess accesses should be serialized per key in the map, not across all keys.

I am not sure, though, whether this code path actually goes through these methods; it is just something I noticed through code inspection.

I will have to run it with the debugger and see where the code goes.

Tim Fox

May 9, 2017, 4:00:29 AM
to vert.x
If you're just doing a simple ping-pong across the network, your performance is going to be limited by the network RTT, which will probably be in the order of milliseconds, not by the framework you're using.

Mihai Stanescu

May 9, 2017, 4:32:07 AM
to ve...@googlegroups.com
Of course the framework cannot improve latency, but it should not impede throughput.

Surely you agree that:

10000 ping-pongs launched at the same time but executed one after another will finish in 10000 x RTT, i.e. O(N)
10000 ping-pongs launched at the same time and executed in parallel will finish in roughly one RTT, i.e. O(1)

With a 1 ms RTT that is 10 seconds versus about 1 millisecond. The serial case is exactly what happens here.

When EventBus.consumer("address", handler) is called, it ends up in the following function


That in turn ends up here:


@Override
public void add(K k, V v, Handler<AsyncResult<Void>> completionHandler) {
  // Every call shares the same single taskQueue instance, so the blocking
  // map.put calls execute strictly one after another on the worker pool.
  vertx.getOrCreateContext().executeBlocking(fut -> {
    map.put(k, HazelcastClusterNodeInfo.convertClusterNodeInfo(v));
    fut.complete();
  }, taskQueue, completionHandler);
}

So, if there are 10000 simultaneous requests to add, they will all be executed one after another because of the taskQueue parameter, which is a single instance, one queue. However, these requests are for different keys (k); I see no reason for them to depend on each other.
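One possible shape for such a change (just a sketch; queueFor and QUEUE_COUNT are names I made up, not actual Vert.x code):

// Sketch only: a fixed pool of TaskQueues selected by key, so that adds for
// different keys can run in parallel while adds for the same key stay ordered.
private static final int QUEUE_COUNT = 271;          // e.g. Hazelcast's default partition count
private final TaskQueue[] taskQueues = new TaskQueue[QUEUE_COUNT];
{
    for (int i = 0; i < QUEUE_COUNT; i++) taskQueues[i] = new TaskQueue();
}

private TaskQueue queueFor(Object key) {
    return taskQueues[Math.floorMod(key.hashCode(), QUEUE_COUNT)];
}

@Override
public void add(K k, V v, Handler<AsyncResult<Void>> completionHandler) {
    vertx.getOrCreateContext().executeBlocking(fut -> {
        map.put(k, HazelcastClusterNodeInfo.convertClusterNodeInfo(v));
        fut.complete();
    }, queueFor(k), completionHandler);
}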

Regards,
Mihai



Mihai Stanescu

May 9, 2017, 4:43:07 AM
to ve...@googlegroups.com

Tim Fox

May 9, 2017, 4:49:29 AM
to vert.x
Like I said, it's very hard for us to be useful here without seeing any code :)

I assumed you were doing simple synchronous ping pong as that's all I could guess from the information I had seen so far.

Do you have a reproducer?

Julien Viet

May 9, 2017, 4:55:36 AM
to ve...@googlegroups.com
agreed,

One caveat of throughput tests is that the client must be implemented carefully.

I.e. it should not limit its own emitting rate, but it should also not overwhelm the server with too many events.

In most benchmarks I've seen claiming Vert.x has low throughput (whether it's HTTP or EventBus), the real cause was the client code itself, which was either not sending enough events or, on the contrary, sending too many (which is unrealistic) and thereby creating the bottleneck in the system.




Tim Fox

May 9, 2017, 4:58:40 AM
to vert.x
I'm not familiar with the reason behind the TaskQueue addition, perhaps Thomas could comment?



Tim Fox

May 9, 2017, 5:05:22 AM
to vert.x
Indeed, I am a bit confused about what this benchmark is doing. It seems to imply that event bus consumers are being created rapidly to test the event bus - but for an event bus ping-pong benchmark it shouldn't be necessary to create consumers rapidly.

Julien Viet

May 9, 2017, 5:11:32 AM
to ve...@googlegroups.com
I think we should implement a proper benchmark for the event bus to run :-)

I created one for HTTP/2 (and also HTTP/1) last year (https://github.com/vietj/http2-bench); it is not that difficult, but there are some parts that definitely take time to write properly.




Tim Fox

May 9, 2017, 5:18:54 AM
to vert.x
There used to be a very simple event bus throughput benchmark with some rudimentary flow control somewhere in the source tree. IIRC we used to get at least 100K msgs/sec on a couple of basic machines.

Mihai Stanescu

May 9, 2017, 5:40:43 AM
to ve...@googlegroups.com

I put in the instructions on how to run it.

Make sure to start two instances and check that they form a cluster.

Here's the output from the test before and after a node joined:

register-rate: 41458 req/s, unregister-rate: 41443 req/s
register-rate: 40944 req/s, unregister-rate: 40934 req/s

Members [2] {
Member [172.17.42.1]:5702 this
Member [172.17.42.1]:5703
}
 
register-rate: 10533 req/s, unregister-rate: 10542 req/s
register-rate: 3320 req/s, unregister-rate: 3321 req/s





Tim Fox

May 9, 2017, 5:53:05 AM
to vert.x
What I don't quite understand is why you're testing the performance of register/unregister - this doesn't tell you anything about event bus performance as registers/unregisters go through the cluster manager, not the event bus itself.

What's the use case for wanting to register/unregister a lot of handlers very quickly? Vert.x is not optimised for this case as it's quite unusual.

Mihai Stanescu

May 9, 2017, 6:08:35 AM
to ve...@googlegroups.com
There are several reasons why I am testing this:
1. I need to establish a baseline to see how well Vert.x works for our cases (I think the documentation could benefit from a performance guideline; it would really help when designing an application with more complex communication patterns than just web stuff).

2. Our application does need a certain amount of register/unregister because we work with TCP sockets and peer-to-peer communication between these sockets across machines. The rate of connections/disconnections is 100/sec, though the more the better of course :). I just wanted to know the maximum connection/disconnection rate we can support, and thus how many clients we can support.

3. I noticed this while looking through the code to understand why the performance of EventBus.send was low (see the multimap cache issues under load).

I think this case can easily be improved just by using a pool of those taskQueues. However, if you do not find this proposal appealing, it's fine; we will probably do it in our product.




Tim Fox

May 9, 2017, 6:16:07 AM
to vert.x


On Tuesday, 9 May 2017 11:08:35 UTC+1, mhstnsc wrote:
There are several reasons why I am testing this:
1. I need to establish a baseline to see how well Vert.x works for our cases (I think the documentation could benefit from a performance guideline; it would really help when designing an application with more complex communication patterns than just web stuff).

2. Our application does need a certain amount of register/unregister because we work with TCP sockets and peer-to-peer communication between these sockets across machines. The rate of connections/disconnections is 100/sec, though the more the better of course :). I just wanted to know the maximum connection/disconnection rate we can support, and thus how many clients we can support.

So it looks like performance is well within your requirements?
 

3. I noticed this while looking through the code to understand why the performance of EventBus.send was low (see the multimap cache issues under load).

I'm not sure I understand the link between registering/unregistering and eventbus.send... can you elaborate? EventBus shouldn't touch the multimap put/get code
 

I think this case can easily be improved just by using a pool of those taskQueues. However, if you do not find this proposal appealing, it's fine; we will probably do it in our product.

I'm not saying it's not valid, I just don't understand why it's important. Event bus send shouldn't use multimap anyway, and you already showed that consumer register/unregister was good enough for your requirements.

Mihai Stanescu

May 9, 2017, 6:48:13 AM
to ve...@googlegroups.com
So it looks like performance is well within your requirements?

For now yes
 
I'm not sure I understand the link between registering/unregistering and eventbus.send... can you elaborate? EventBus shouldn't touch the multimap put/get code

Not sure what you mean, but ClusteredEventBus does a HazelcastAsyncMultiMap.get() to find the consumers for the address.


If the address is not in the local cache of HazelcastAsyncMultiMap, then a round trip to Hazelcast is needed, which of course is also executed through the same taskQueue.

This happens on the first send to an uninitialized entry in the subscribers map.
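Schematically, the lookup works like this, as far as I can tell (an illustrative fragment; the field and method names are mine, not the actual Vert.x code):

// Illustrative sketch: the first lookup for an address misses the local cache
// and pays a Hazelcast round trip, serialized on the shared taskQueue;
// subsequent lookups are served from the local cache.
private final ConcurrentMap<String, Set<String>> localCache = new ConcurrentHashMap<>();

void subscribers(String address, Handler<AsyncResult<Set<String>>> handler) {
    Set<String> cached = localCache.get(address);
    if (cached != null) {
        handler.handle(Future.succeededFuture(cached));   // no network hop
        return;
    }
    vertx.getOrCreateContext().<Set<String>>executeBlocking(fut -> {
        Set<String> remote = new HashSet<>(hazelcastMultiMap.get(address));  // remote round trip
        localCache.put(address, remote);
        fut.complete(remote);
    }, taskQueue, handler);
}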




Tim Fox

May 9, 2017, 7:06:31 AM
to vert.x


On Tuesday, 9 May 2017 11:48:13 UTC+1, mhstnsc wrote:
So it looks like performance is well within your requirements?

For now yes
 
I'm not sure I understand the link between registering/unregistering and eventbus.send... can you elaborate? EventBus shouldn't touch the multimap put/get code

Not sure what you mean, but ClusteredEventBus does a HazelcastAsyncMultiMap.get() to find the consumers for the address.


If the address is not in the local cache of HazelcastAsyncMultiMap, then a round trip to Hazelcast is needed, which of course is also executed through the same taskQueue.

This happens on the first send to an uninitialized entry in the subscribers map.


Right, but this should be cached for subsequent sends. After the first send there shouldn't be any multimap access. Seems like you're testing the performance of the "setup" code (i.e. registering consumers, populating cache) rather than the general event bus throughput performance. 

Most use cases would send many messages once addresses have been setup, so the setup cost should become negligible (relatively). I think your use case of registering a consumer, sending one message, then unregistering is quite unusual.

Mihai Stanescu

May 9, 2017, 8:12:52 AM
to ve...@googlegroups.com
Seems like you're testing the performance of the "setup" code (i.e. registering consumers, populating cache) rather than the general event bus throughput performance. 

Actually, there's one twist to the story, which a colleague pointed out to me.

Sending an event bus message to a non-existent address actually stresses the "setup" code.

I made a reproducer here


The throughput is quite small compared to sending to an existing address (which, like you said, is probably even more than 100k/sec).

Here's the test output:


send-rate: 80666 req/s  

Members [2] {
Member [172.17.42.1]:5702 this
Member [172.17.42.1]:5703
}
 
[172.17.42.1]:5702 [dev] [3.6.3] Re-partitioning cluster data... Migration queue size: 135 
[172.17.42.1]:5702 [dev] [3.6.3] All migration tasks have been completed, queues are empty. 
send-rate: 30271 req/s
send-rate: 21822 req/s

This number should be divided in half, because I guess half of the random non-existent addresses are actually local. The rate will drop as I add more instances.

And this is a very common pattern in our application, and in general, I believe.

If the application logic dictates that the consumer might or might not be there, then the easiest thing to do is just to try to send a message to it.


In our application, consider two TCP connections on different machines. We want to send a message between them through the EventBus. The problem is that at the moment of the send there's no way to know whether the destination TCP connection exists or on which machine it is; we just send the message. It is a very nice way to find out whether another TCP connection exists, as Vert.x will reply with NO_HANDLERS, but this is where I realized it is too slow for our needs.
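For reference, the probe looks roughly like this (a sketch; the "conn." address scheme and the payload are illustrative, not our real naming):

import io.vertx.core.Vertx;
import io.vertx.core.eventbus.ReplyException;
import io.vertx.core.eventbus.ReplyFailure;

// Sketch: probe a peer connection by sending to its address; a NO_HANDLERS
// reply failure means no such consumer is registered anywhere in the cluster.
static void probePeer(Vertx vertx, String remoteConnectionId, Object payload) {
    vertx.eventBus().send("conn." + remoteConnectionId, payload, ar -> {
        if (ar.succeeded()) {
            // the peer connection exists and replied
        } else if (ar.cause() instanceof ReplyException
                && ((ReplyException) ar.cause()).failureType() == ReplyFailure.NO_HANDLERS) {
            // no consumer registered for that address -> the TCP connection is gone
        } else {
            // timeout or other failure
        }
    });
}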




 

Mihai Stanescu

May 9, 2017, 8:15:03 AM
to ve...@googlegroups.com
this is too slow for our needs. 

OK... I'll rephrase: we will see about that. Just take it as an observation.

Tim Fox

May 9, 2017, 9:37:45 AM
to vert.x
As an aside:

One thing that would be interesting to experiment with would be to see whether the hand-rolled cache we use in the Hazelcast cluster manager is still necessary. When I wrote that, the HZ one was slow, but that was a long time ago; maybe they have improved things since then...

Mihai Stanescu

May 9, 2017, 11:11:29 AM
to ve...@googlegroups.com
One thing that would be interesting to experiment with would be to see whether the hand-rolled cache we use in the Hazelcast cluster manager is still necessary. When I wrote that, the HZ one was slow, but that was a long time ago; maybe they have improved things since then...

I don't believe MultiMap has a near cache, but maybe a regular map would fit the bill:

https://github.com/hazelcast/hazelcast/issues/3960
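For a regular map, the near cache would be configured something like this (a sketch against the Hazelcast 3.x programmatic config; the map name and settings are illustrative only):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.NearCacheConfig;

// Sketch: a near cache on a plain IMap (MultiMap has no equivalent, hence the
// linked issue). The map name and settings are illustrative only.
static Config withNearCache() {
    Config config = new Config();
    config.addMapConfig(new MapConfig("subs-map")
        .setNearCacheConfig(new NearCacheConfig().setInvalidateOnChange(true)));
    return config;
}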

Mihai Stanescu

May 9, 2017, 11:45:18 AM
to ve...@googlegroups.com
I have changed HazelcastAsyncMultiMap a bit to use a pool of 271 task queues (the partition count), and the performance increased dramatically, even in the two-node case.

Here are some results:

                                  One taskQueue       With taskQueue pool

send-rate (inexistent address):   80666 req/s         190812 req/s
register-rate:                    41458 req/s         128329 req/s
unregister-rate:                  41443 req/s         128326 req/s

                                  One taskQueue       With taskQueue pool

send-rate (inexistent address):   30271 req/s         72143 req/s
register-rate:                    approx 6000 req/s   23628 req/s
unregister-rate:                  approx 6000 req/s   23627 req/s


I'll take it :)



Mihai Stanescu

May 9, 2017, 11:46:11 AM
to ve...@googlegroups.com
Sorry... a bit of a correction:

-=[ SINGLE NODE ]=-

                                  One taskQueue       With taskQueue pool

send-rate (inexistent address):   80666 req/s         190812 req/s
register-rate:                    41458 req/s         128329 req/s
unregister-rate:                  41443 req/s         128326 req/s

-=[ TWO NODE ]=-

                                  One taskQueue       With taskQueue pool

send-rate (inexistent address):   30271 req/s         72143 req/s
register-rate:                    approx 6000 req/s   23628 req/s
unregister-rate:                  approx 6000 req/s   23627 req/s

Minh Lee

Aug 16, 2017, 5:56:15 AM
to vert.x

@mhstnsc can you share some experience with taskQueue pool performance in a large Vert.x cluster? Does it work well? And if you don't mind, please share how you implemented it. Thank you!


Mihai Stanescu

Aug 19, 2017, 2:42:42 PM
to ve...@googlegroups.com
@Minh

1. I did not test beyond two nodes, but I assume it will improve, as this way you can take advantage of Hazelcast's distribution.

2. The idea of the implementation is to create an array of 271 taskQueues (the Hazelcast partition count) which are selected based on Hazelcast's key-to-partition computation (the selection is sketched below). I created this implementation as a proof of concept; no problem sharing it, provided I can still find it. We are not currently running it in production, but that is only because I did not have time to put it together properly. The tricky bit is the "removeXXX" bulk operations, as you have to somehow wait for all taskQueues to drain and then execute the action (this I have not implemented yet).

3. The Vert.x developers have changed the implementation of this class a bit recently; however, the concept of a single taskQueue is still there.
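The queue selection itself is the easy part; it looks roughly like this (proof-of-concept sketch, not the code that shipped):

import com.hazelcast.core.HazelcastInstance;
import io.vertx.core.impl.TaskQueue;

// Proof-of-concept sketch: one TaskQueue per Hazelcast partition, picked via the
// key's partition id, so keys in different partitions no longer serialize on a
// single queue. queues.length is the configured partition count (271 by default).
static TaskQueue queueFor(HazelcastInstance hazelcast, TaskQueue[] queues, Object key) {
    int partitionId = hazelcast.getPartitionService().getPartition(key).getPartitionId();
    return queues[partitionId];
}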

