Scenario:
I'm running a clustered Java application: multiple RT servers increment a counter on another server, and the counter server sends the aggregated count back to the RT servers.
RT servers: 3 EC2 c3.8xlarge machines running the RT verticle, 32 instances/CPUs per machine.
Counter server: 1 EC2 c3.4xlarge machine running the counter verticle, 16 instances/CPUs.
All servers are deployed from the same AMI (CentOS 7, paravirtual, JDK 8, Vert.x 2.0.0-final via nginx proxy_pass).
When I deploy the application, everything goes well.
Hazelcast detects the servers, and messages are passed from the RT servers to the counter server and vice versa.
After 1 minute, 2 minutes, 10 minutes, or 20 hours (a random amount of time as far as I can tell), something breaks and messages are no longer sent/received on the counter server.
Restarting the counter verticle solves the problem until the next unexpected failure.
Things I do know:
1. Each RT server is on at least a 1 Gbit/s network interface.
2. Cluster communication is internal to AWS and uses private IP addresses.
3. A message is sent every 100 milliseconds, and not a huge one (1 KB at most), from each node to the counter and back.
4. I tried both point-to-point and publish/subscribe flows, with replies and without. At the moment I'm using a UDP style of synchronization: all servers publish data, the counter server listens on address A, and the RT servers listen on address B.
5. I'm running the same flow/application in other contexts and everything works like a charm (same AMI, same code, bigger clusters). The only difference I can think of is that these specific servers are bombarded with traffic (approx. 300 Mbit/s and at least 30k HTTP requests per second on each server, most of which is not passed to Vert.x but handled externally by nginx). This is the main reason I suspect that something fails in the clustering mechanism and not in the application logic.
6. I managed to increase the time it takes to fail by splitting the RT traffic across 6 servers: 3 nginx servers and 3 Vert.x servers, plus 1 counter server. This leads me to think that the issue is related to a network bottleneck of some sort.
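For reference, the flow described in item 4 can be sketched roughly as follows. This is a simplified in-memory model, not the real code and not the Vert.x event bus API: `LocalBus`, `CounterNode`, `RtNode` and the address strings are hypothetical names, and the 100 ms timer is replaced by explicit `flush()` and `broadcast()` calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// In-memory stand-in for the clustered event bus (an assumption, NOT the Vert.x API):
// publish() delivers the payload to every handler on an address; there are no replies.
class LocalBus {
    private final Map<String, List<Consumer<Long>>> handlers = new HashMap<>();

    void register(String address, Consumer<Long> handler) {
        handlers.computeIfAbsent(address, k -> new ArrayList<>()).add(handler);
    }

    void publish(String address, long payload) {
        for (Consumer<Long> h : handlers.getOrDefault(address, new ArrayList<>())) {
            h.accept(payload);
        }
    }
}

// Counter server: listens on address A, publishes the aggregate on address B.
class CounterNode {
    private final LocalBus bus;
    private long total = 0;

    CounterNode(LocalBus bus) {
        this.bus = bus;
        bus.register("address.A", delta -> total += delta);
    }

    // In the real setup this would run on a periodic timer.
    void broadcast() { bus.publish("address.B", total); }

    long total() { return total; }
}

// RT server: counts locally, publishes its delta on address A, listens on address B.
class RtNode {
    private final LocalBus bus;
    private long pending = 0;
    private long lastSeenTotal = -1; // -1 means no aggregate received yet

    RtNode(LocalBus bus) {
        this.bus = bus;
        bus.register("address.B", t -> lastSeenTotal = t);
    }

    void hit() { pending++; }

    // In the real setup this would run every 100 ms.
    void flush() {
        bus.publish("address.A", pending);
        pending = 0;
    }

    long lastSeenTotal() { return lastSeenTotal; }
}
```

When it breaks, the counter side behaves as if the handler on address A simply stopped being called, while the RT servers keep publishing.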
I can't seem to make Vert.x or Hazelcast give me any hints/reasons for the failure. Everything seems OK and there are no exceptions or log entries at all!
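Since nothing is logged, one thing that could at least timestamp the failure is a small silence watchdog on the counter side. The following is only a sketch under assumptions: `SilenceWatchdog`, the node ids and the threshold are hypothetical, it uses plain Java rather than any Vert.x or Hazelcast API, and it would still have to be wired into the message handler and a periodic timer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Records when each node was last heard from and reports nodes that have been
// silent for longer than a threshold. Time is passed in explicitly so the
// class does not depend on any particular clock or timer.
class SilenceWatchdog {
    private final long maxSilenceMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    SilenceWatchdog(long maxSilenceMillis) {
        this.maxSilenceMillis = maxSilenceMillis;
    }

    // Call from the message handler for every message received from nodeId.
    void onMessage(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    // Call periodically; returns the ids of nodes whose last message is older
    // than the threshold, sorted for stable log output.
    List<String> silentNodes(long nowMillis) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > maxSilenceMillis) {
                silent.add(e.getKey());
            }
        }
        Collections.sort(silent);
        return silent;
    }
}
```

With messages expected every 100 ms, a threshold of about one second would flag a node after roughly ten missed messages. Logging the result of `silentNodes(System.currentTimeMillis())` would show whether all RT servers go quiet at the same moment or one at a time.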
* I know that the Vert.x version is outdated, but it works in other contexts.
Please advise!