Iris cluster hangs - still working up repro


James Cooper

Oct 29, 2014, 11:04:20 AM
to projec...@googlegroups.com
Hi folks,

I've been doing more tests with the CoreOS/Iris/EC2 combination and I'm able to hang the cluster fairly consistently.  I thought I'd post a quick message now in case others have time to try to reproduce it, or have a theory about the cause.

Setup:

- EC2 / VPC
- 3 c3.large instances
- CoreOS / fleet / etcd / Iris (built from master w/coreos support included)

(summary: it's a simple test client/server with 'echo' and 'add' messages)
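(To give a rough feel for the shape of it, the echo side boils down to something like the sketch below.  This is re-sketched from memory against what I believe is the v1 iris-go binding, so the import path and method signatures are assumptions -- it's not the literal test code, just the general pattern of a service handler plus a batch of concurrent requests.)

  package main

  import (
      "fmt"
      "time"

      "gopkg.in/project-iris/iris-go.v1"
  )

  // echoHandler is the service side: every request is echoed straight back.
  type echoHandler struct{}

  func (e *echoHandler) Init(conn *iris.Connection) error        { return nil }
  func (e *echoHandler) HandleBroadcast(msg []byte)               {}
  func (e *echoHandler) HandleRequest(req []byte) ([]byte, error) { return req, nil }
  func (e *echoHandler) HandleTunnel(tun *iris.Tunnel)            { tun.Close() }
  func (e *echoHandler) HandleDrop(reason error)                  { fmt.Println("dropped:", reason) }

  func main() {
      // Register the echo service with the local Iris relay.
      svc, err := iris.Register(55555, "echo", new(echoHandler), nil)
      if err != nil {
          panic(err)
      }
      defer svc.Unregister()

      // Client side: fire a batch of concurrent requests, like the test runner does.
      conn, err := iris.Connect(55555)
      if err != nil {
          panic(err)
      }
      defer conn.Close()

      done := make(chan error, 30)
      for i := 0; i < 30; i++ {
          go func(i int) {
              _, err := conn.Request("echo", []byte(fmt.Sprintf("ping %d", i)), 5*time.Second)
              done <- err
          }(i)
      }
      for i := 0; i < 30; i++ {
          if err := <-done; err != nil {
              fmt.Println("request failed:", err)
          }
      }
  }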

When I use a concurrency of 10 per test runner with 6 test runners (the test runner itself is kicked off by an Iris message, so the test runs are distributed across the cluster), everything works fine.  I can run for several minutes with no message loss and no timeouts.

If I use a concurrency of 30 per test runner with 6 test runners, the cluster freezes.  The behavior appears to be that messages destined for services on localhost are delivered, but messages between hosts time out.

Iris logs on the hosts seem normal - no visible errors.  

Any suggestions on next steps to debug this further?  Would tcpdump output help?

-- James

Péter Szilágyi

Oct 30, 2014, 4:19:12 AM
to James Cooper, projec...@googlegroups.com
Hey James,

  I just looked over your code quite quickly, so it may very well be that I missed things -- correct me if I'm wrong.

  If I understand correctly, you are spawning the test runners by sending a benchmark request message to some "bench" services, which run the benchmarks themselves and return the gathered statistics in the reply. Since the last release of the client APIs, there is a limit on the inbound request concurrency per registered service (see resource capping). Although it *should* be 8 in your case (4 x 2 virtual cores), meaning that even if all 6 test runners are sent to the same machine by chance it should still run fine, it may be worth increasing the limit manually (or at least being aware of the cap):

  benchS, err := iris.Register(55555, "bench", NewFxHandler(benchSvr), &ServiceLimits{RequestThreads: 32})

  Other than this I see no apparent flaw in your code. While reworking the relay protocol I made quite a lot of modifications in there, and I haven't had time to run very extensive tests/benchmarks, so it's possible I've introduced a bug somewhere. I'll try to run your benchmark on a local cluster and see if I can lock it up. It would really help if I could debug in peace without being charged for it :))

Cheers,
  Peter

PS: tcpdump is unlikely to help as data traffic is encrypted on the network with ephemeral keys, which are themselves encrypted with ephemeral master keys negotiated during connection.


James Cooper

Oct 30, 2014, 1:01:01 PM
to projec...@googlegroups.com
Hi Peter,

Thanks for the detailed response.  I'll try again tonight or tomorrow with the ServiceLimits change.  I'm a little unsure why that would help -- if I'm reading the docs correctly the default memory buffer size is 64MB, so the extra messages should have been buffered.  With the concurrency levels I used (at most 30 x 6 = 180 requests in flight), I doubt there was more than a total of 1MB of messages in flight across the whole cluster.

But I'll try it again. 

It looks like a SIGQUIT will dump the stacks of all the goroutines in a process (similar to the JVM), but unlike the JVM it also terminates the process.  I might try sending that to the Iris relay process to see what state it's in.  If you're interested I can post the output in a gist.
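
(Side note, mostly relevant for my own benchmark binaries where I control the source: the same goroutine dump can be grabbed without killing the process by trapping a signal in the program itself.  A quick sketch -- nothing Iris-specific, and the choice of SIGUSR1 is arbitrary:)

  package main

  import (
      "os"
      "os/signal"
      "runtime"
      "syscall"
  )

  // dumpStacksOn writes all goroutine stacks to stderr whenever the given
  // signal arrives, without terminating the process (unlike SIGQUIT's default).
  func dumpStacksOn(sig os.Signal) {
      ch := make(chan os.Signal, 1)
      signal.Notify(ch, sig)
      go func() {
          for range ch {
              buf := make([]byte, 1<<20) // grow this if the dump comes out truncated
              n := runtime.Stack(buf, true)
              os.Stderr.Write(buf[:n])
          }
      }()
  }

  func main() {
      dumpStacksOn(syscall.SIGUSR1) // then: kill -USR1 <pid>
      select {}                     // stand-in for the real workload
  }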

thanks again

-- James


--

James Cooper
Principal Consultant - Bitmechanic LLC
http://www.bitmechanic.com/

Péter Szilágyi

Oct 30, 2014, 1:10:30 PM
to James Cooper, projec...@googlegroups.com
Hello,
 
  That could definitely help! The university cluster is currently full so I cannot try out your code on multiple machines, but I'll try to figure something out to track this down ASAP. I'm also in the middle of reinstalling my OSes, so that doesn't help either.

I'm almost certain that the caps are *not* the problem; I just thought I'd point out a possibly sensitive code segment in case enough test runners get deployed.

Cheers,
  Peter


James Cooper

Oct 30, 2014, 1:13:57 PM
to projec...@googlegroups.com
Hi Peter,

Sounds great.  I'm also trying to repro this on the Google Compute Cloud since I realize that's your normal cloud test environment.  But I'm new to that platform so I'm learning about the network/firewall rules there.  Once I get that sorted I'll let you know -- that may make it easier for you to try it out.

But I agree, having a local repro (with Vagrant perhaps) would be ideal.

-- James



James Cooper

Oct 31, 2014, 10:59:54 AM
to projec...@googlegroups.com
Hi folks,

I'm working on this again today.  Here's a gist of a stack dump from one of the hung Iris processes:


I took this in the middle of a (failing) benchmark run.

I don't know the Iris internals well enough to make much sense of it.  I'd like to try to narrow the failure mode so we could get a trace with fewer goroutines, but at the moment I don't have a great understanding of the issue.  After reproducing this 5+ times I can say that it seems to only happen at higher concurrency levels.

Would it help to get traces from the other Iris relays in the cluster at the same time?

-- James

James Cooper

Oct 31, 2014, 2:53:36 PM
to projec...@googlegroups.com
Hi folks,

Here's a stack dump of a hung Iris process - it doesn't mean much to me but perhaps Peter or others can scrutinize it.


But I attempted to reproduce this locally using a 3-node Vagrant cluster, and so far I haven't been able to, even under much higher levels of concurrency.

So this could turn out to be some issue with CoreOS or Docker's network stack, or it may still be an Iris bug that only shows up when there's some network latency -- not sure.

I'm going to re-run the test on EC2 later today using some plain Ubuntu 14.04 VMs without Docker and see if the situation changes.  I'll post a follow-up.

cheers

-- James

James Cooper

Oct 31, 2014, 4:47:33 PM
to projec...@googlegroups.com
Hello again,

Two more updates:

- I can repro this on EC2 with the stock Ubuntu 14.04 AMI - no Docker, no etcd, no CoreOS.  The same workload/concurrency triggers it.
  - latency between nodes in the cluster is very low - less than 1ms

- I still cannot repro this in Vagrant.  I tried using netem to introduce 2-10ms of latency and 0.1% packet loss on one host.  Everything still works fine.

So this is an odd one.  

I'll see if I can get things running on Google Compute Cloud.  I don't fully understand how to set up networks there -- the netmask on the default network is always 255.255.255.255, so Iris never converges.

-- James

James Cooper

Nov 1, 2014, 1:24:00 PM
to projec...@googlegroups.com
Hello,

Another round of updates:

- I ran this benchmark on a cluster at DigitalOcean (3 x 2GB, CoreOS, private networking).  No problems

- I've reviewed my AWS VPC config and it looks fine.  The issue may be triggered by packet loss.  I've confirmed that each time the benchmark freezes the Iris cluster there are TX dropped packets on at least one of the 3 nodes.  The packet loss isn't high (usually 20-30 packets), but it's non-zero each time there's a failure.

So I think that's significant.  I'm not sure why it causes the symptoms we're seeing; I would guess the root issue lies somewhere lower in the stack than Iris (Go / OS / hypervisor).  I'm going to do another round of tests with Vagrant and netem to see if dropped packets cause issues there too.  I tried that before and everything was fine.
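
(In case anyone else wants to watch for the same correlation: I'm just polling the per-interface counters that Linux exposes under /sys/class/net.  A rough sketch of the watcher I'm using -- the eth0 interface name is an assumption, adjust it to whatever the node actually has:)

  package main

  import (
      "fmt"
      "io/ioutil"
      "os"
      "strconv"
      "strings"
      "time"
  )

  // txDropped reads the TX dropped-packet counter for an interface from sysfs.
  func txDropped(iface string) (uint64, error) {
      raw, err := ioutil.ReadFile("/sys/class/net/" + iface + "/statistics/tx_dropped")
      if err != nil {
          return 0, err
      }
      return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
  }

  func main() {
      iface := "eth0" // adjust to the node's actual interface
      last, _ := txDropped(iface)
      for range time.Tick(5 * time.Second) {
          cur, err := txDropped(iface)
          if err != nil {
              fmt.Fprintln(os.Stderr, "read error:", err)
              continue
          }
          if cur != last {
              fmt.Printf("%s: tx_dropped %d -> %d\n", time.Now().Format(time.RFC3339), last, cur)
              last = cur
          }
      }
  }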

Again, if anyone has theories or suggestions for ways to narrow this down, I'd appreciate them.  My preference is to use EC2, but these tests are not inspiring confidence.

Here's the uname output on one of the EC2 nodes (m3.medium)

Linux ip-172-31-33-20.us-west-2.compute.internal 3.16.2+ #2 SMP Thu Oct 16 01:11:04 UTC 2014 x86_64 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz GenuineIntel GNU/Linux

-- James

Péter Szilágyi

Nov 4, 2014, 4:55:18 AM
to James Cooper, projec...@googlegroups.com
Hey James,

  For some reason Google decided that your previous two emails were spam and blocked them from the mailing list. It only just sent out a notification that there is something to moderate. Lame.

  I've looked through the first stack dump you sent, but couldn't really find anything that stands out. All the goroutines were waiting for some outside event to happen. I'll try to skim through the one you've just uploaded. Maybe I can find something more interesting there.

Cheers,
  Peter
