Eventbus handlers stop sending/receiving messages at random times.....


Elad Yosifon

Mar 11, 2015, 7:57:40 PM3/11/15
to ve...@googlegroups.com
Scenario:
I'm running a clustered Java application (multiple RT servers that increment a counter on another server; the counter server sends the aggregated count back to the RT servers).

RT servers: 3 EC2 c3.8xlarge machines, 32 verticle instances/CPUs per machine.
Counter server: 1 EC2 c3.4xlarge machine, 16 instances/CPUs.
All servers are deployed from the same AMI (CentOS 7, paravirtual, JDK 8, Vert.x 2.0.0-final behind an nginx proxy_pass).

When I deploy the application, everything goes well:
Hazelcast detects the servers, and messages are passed from the RT servers to the counter server and vice versa.
After 1 minute, 2 minutes, 10 minutes, or 20 hours (a random time, as far as I know), something breaks and messages are no longer sent/received on the counter server.
Restarting the counter verticle solves the problem until the next unexpected failure.

Things I do know:
1. Each RT server has at least a 1 Gbit/s network interface.
2. Cluster communication is internal to AWS and uses private IP addresses.
3. A message is sent every 100 milliseconds, and not a huge one (1 KB at most), from each node to the counter and back.
4. I tried both point-to-point and publish/subscribe flows, with replies and without. At the moment I'm adopting a UDP-style synchronization (all servers publish data; the counter server listens on address A, the RT servers listen on address B).
5. I'm running the same flow/application in other contexts and everything works like a charm (same AMI, same code, bigger clusters). The only difference I can think of is that these specific servers are bombarded with traffic (approx. 300 Mbit/s and at least 30k HTTP requests per second on each server, most of which is not passed to Vert.x but handled externally by nginx). This is the main reason I suspect something fails in the clustering mechanism rather than in the application logic.
6. I managed to increase the time it takes to fail by separating the RT traffic across 6 servers (3 nginx servers and 3 Vert.x servers + 1 counter server), which leads me to think the issue is related to a network bottleneck of some sort.
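For context, the UDP-style flow in (4) looks roughly like this on the RT side, sketched against the Vert.x 2.x Java API (the addresses "address.A"/"address.B" and the field name are made up for illustration, not taken from the actual code):

```java
import org.vertx.java.core.Handler;
import org.vertx.java.core.eventbus.Message;
import org.vertx.java.core.json.JsonObject;
import org.vertx.java.platform.Verticle;

// RT-server side of the UDP-style flow: publish stats, listen for aggregates.
public class RtStatsVerticle extends Verticle {
    @Override
    public void start() {
        // every 100 ms, publish a small stats message for the counter server
        vertx.setPeriodic(100, new Handler<Long>() {
            public void handle(Long timerId) {
                vertx.eventBus().publish("address.A",
                        new JsonObject().putNumber("delta", 1));
            }
        });
        // listen for the aggregated count broadcast back by the counter server
        vertx.eventBus().registerHandler("address.B", new Handler<Message<JsonObject>>() {
            public void handle(Message<JsonObject> msg) {
                container.logger().info("aggregated count: " + msg.body());
            }
        });
    }
}
```

The counter side would mirror this: registerHandler on address A, aggregate, and publish the total to address B.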

I can't seem to make Vert.x or Hazelcast give me any hints/reasons for the failure... everything seems OK, and there are no exceptions or logs at all!

* I know that the Vert.x version is outdated, but it works in other contexts.

please advise!

Elad Yosifon

Mar 22, 2015, 7:49:06 PM3/22/15
to ve...@googlegroups.com
Any thoughts?

Updates:
1. The underlying issue is probably not Hazelcast. I had to replace the cluster management with a Redis server as the main server, and the issue still happens.
2. I found that this scenario is also different: the amount of data passed through the eventbus is far bigger than in the other contexts.

What is the eventbus throughput barrier?
Is there a known barrier?
Does it depend on the network stack?
Can I use a non-network eventbus (e.g. file system/memory/Unix sockets)?
Is there another eventbus implementation?

Jordan Halterman

Mar 22, 2015, 8:50:55 PM3/22/15
to ve...@googlegroups.com
I'm not sure about your issue, but I am curious...

First, there are no other event bus implementations yet. That has been discussed a bit (by me and others) but not much action has been taken in that direction yet AFAIK.

What is the size of the messages you're passing on the event bus?

Now for my curiosity:
Why did you have to replace the cluster management with Redis? How do you do cluster management on Redis? Do nodes discover each other through Redis keys? How do you handle a node leaving the cluster? How do you handle a Vert.x node crashing? How do you handle the Redis node crashing? (Store state in multiple Redis servers?)

Sorry to take over your thread, I'm just really interested to see how you used Redis for cluster management.
--
You received this message because you are subscribed to the Google Groups "vert.x" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vertx+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Elad Yosifon

unread,
Mar 22, 2015, 8:58:51 PM3/22/15
to ve...@googlegroups.com
Sorry to disappoint you; I think I misused the term "cluster management".
I simply store all data in a single Redis instance via TCP, reading/writing via the eventbus. Just an ad-hoc patch; the product had to go live and I was sure the issue was with Hazelcast.




Elad Yosifon

Mar 22, 2015, 9:17:02 PM3/22/15
to ve...@googlegroups.com
I'm sending 20k JSON objects per second, each with at least 30 fields (let's say each field is 20 chars long).

** in all contexts

In this particular scenario I'm also sending another JSON object with a few long strings (URLs, up to 1024 chars each), so it's probably an additional 20 megabytes per second, per server.

Let's round it up to 50 megabytes per second via the eventbus, which is 400 Mbit per second.

That is probably too much for one network interface (without even taking the actual traffic, which can go up to 500 Mbit per second, into consideration).
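Sanity-checking the napkin math above (a quick sketch; the message sizes are the rough figures from this thread, not measurements):

```java
// Back-of-the-envelope event bus throughput from the figures in this thread.
public final class EventBusThroughput {

    // rough payload of one base stats message: ~30 fields x ~20 chars
    static long baseBytesPerMsg(int fields, int charsPerField) {
        return (long) fields * charsPerField;
    }

    // megabytes per second for a given message rate and per-message size
    static double megabytesPerSec(long msgsPerSec, long bytesPerMsg) {
        return msgsPerSec * bytesPerMsg / 1_000_000.0;
    }

    static double mbitPerSec(double mbPerSec) {
        return mbPerSec * 8;
    }

    public static void main(String[] args) {
        double base  = megabytesPerSec(20_000, baseBytesPerMsg(30, 20)); // ~12 MB/s
        double extra = megabytesPerSec(20_000, 1_024);                   // ~20 MB/s (one ~1 KB URL blob per message)
        double total = base + extra;                                     // ~32 MB/s raw payload
        System.out.printf("~%.0f MB/s payload = ~%.0f Mbit/s%n", total, mbitPerSec(total));
    }
}
```

Raw payload alone comes to roughly 32 MB/s (~260 Mbit/s); JSON syntax, field names, and wire framing push that toward the 50 MB/s / 400 Mbit/s estimate above.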





Jordan Halterman

Mar 22, 2015, 9:55:03 PM3/22/15
to ve...@googlegroups.com
Disappointing indeed :-)


Tim Fox

Mar 23, 2015, 3:56:12 AM3/23/15
to ve...@googlegroups.com
I suggest you do some debugging, logging, and profiling to figure out what's really going on in your system.

Elad Yosifon

Mar 23, 2015, 4:11:44 AM3/23/15
to ve...@googlegroups.com
Well, I think I covered the debugging and logging part a while ago.
I've been chasing my own tail on this issue for at least a month.

I can't run a profiler in that environment since it's production, and performance is a key factor for $$$.

Which tools can I use to monitor Vert.x internals or the eventbus mechanism?
Can I enable logs for eventbus issues?

Have you ever benchmarked the eventbus throughput? Should I try to do that?
Can you suggest the best tool/way to run such a benchmark?

Tim Fox

Mar 23, 2015, 4:22:07 AM3/23/15
to ve...@googlegroups.com
Dusting off my crystal ball... ;)

Probably the first thing I would do is determine why the counter is stopping: is it running out of memory, deadlocked, or something else?

Then, if it's deadlocked, use the stack traces to figure out why. If it's running out of memory, do a heap dump and see what's in RAM.

Maybe you have event bus messages building up? That should be pretty easy to determine. Try adding some logging: have a periodic counter which, every few seconds, logs the number of messages sent, and do the same on your counter side. Is one bigger than the other? If so, messages are building up (they don't just disappear), and that could eventually lead to an OOM.

When you say your counter "verticle" (singular), do you have just a single instance in your cluster? Vert.x verticles are single-threaded, as you know, so it will only use one core even if you put it on a server with 1000 cores... ;) Pushing hundreds of thousands of msgs per sec through a single core seems like asking for trouble. Or perhaps you have many counter instances? If so, I assume the counter is atomic; how are you making it atomic? A lock? etc.
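The periodic send/receive comparison described above can be sketched with plain JDK scheduling (a hypothetical helper, not Vert.x API; inside a verticle you would drive the log line from vertx.setPeriodic instead of an executor):

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Counts messages on both ends of the event bus so a growing backlog shows up in the logs.
public class MessageFlowMonitor {
    final AtomicLong sent = new AtomicLong();
    final AtomicLong handled = new AtomicLong();

    // call right before every eventBus send/publish on the sending side
    void onSend() { sent.incrementAndGet(); }

    // call at the top of the counter's message handler
    void onHandle() { handled.incrementAndGet(); }

    // sent-but-not-yet-handled; a value that grows without bound means messages are piling up
    long backlog() { return sent.get() - handled.get(); }

    // log the three numbers every few seconds
    void startLogging(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(
            () -> System.out.println("sent=" + sent.get()
                    + " handled=" + handled.get() + " backlog=" + backlog()),
            5, 5, TimeUnit.SECONDS);
    }
}
```

In the clustered case the two counters live on different machines, so each side logs its own number and you compare the two logs over the same time window.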

Elad Yosifon

Mar 23, 2015, 5:05:39 AM3/23/15
to ve...@googlegroups.com
I tried to describe the architecture in the first post; I'll try to simplify it.

I deploy N event-loop verticles that handle incoming requests (N = number of CPUs).
On each request I pass stats about the specific request through the eventbus (to keep it async) and then release the response.
The eventbus messages are handled by a designated handler that persists the data to a CSV file.

There is another worker verticle that reads/writes data to the "cluster" (was Hazelcast to a remote verticle, now a remote Redis instance) in the background.
"Counters" are handled internally by Redis, which is also single-threaded and atomic.

I don't think that's the issue. I really think it is eventbus throughput and message transience.
My guess is that the CSV data passed through the eventbus is over-using the network stack, so messages to Redis are dropped.

BTW, I'm having no issues with the CSV data; it is all saved.
Once in a while (~2 weeks), one eventbus is dead and all its data is passed to other nodes in the cluster (another thing to investigate, but ATM it's not an issue for me).


I really appreciate the help here... thx