Hello,
Another round of updates:
- I ran this benchmark on a cluster at DigitalOcean (3 x 2GB, CoreOS, private networking). No problems
- I've reviewed my AWS VPC config and it looks fine. The issue may be triggered by packet loss. I've confirmed that each time the benchmark freezes the Iris cluster there are TX dropped packets on at least one of the 3 nodes. The packet loss isn't high (usually 20-30 packets), but it's non-zero each time there's a failure.
So I think that's significant. I'm not sure why this is causing the symptoms we're seeing. I would guess the root issue lies somewhere lower level than Iris (Go / OS / hypervisor). I'm going to do another round of tests with Vagrant and netem to see if I dropped packets causes issues there. I tried that before and everything was fine.
Again, if anyone has theories or suggestions of ways to narrow this I'd appreciate them. My preference is to use EC2 but these tests are not inspiring confidence.
Here's the uname output on one of the EC2 nodes (m3.medium)
Linux ip-172-31-33-20.us-west-2.compute.internal 3.16.2+ #2 SMP Thu Oct 16 01:11:04 UTC 2014 x86_64 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz GenuineIntel GNU/Linux
-- James