EC2 VPC - carrier: couldn't handle delivered balance event

James Cooper

Sep 3, 2013, 11:45:07 PM

to projec...@googlegroups.com

Hi,

I'm trying out Iris on an EC2 VPC. I'm using the Barrister RPC example from my post last week. The machines are 64-bit CentOS 6.4 on micro instances.

On the client machine I'm seeing:

[root@ip-10-0-0-186 example]# /root/iris -net test -rsa /root/iris_rsa

2013/09/04 03:41:32 main: booting carrier...

2013/09/04 03:41:42 main: carrier converged with 0 remote connections.

2013/09/04 03:41:42 main: booting relay service...

2013/09/04 03:41:42 main: iris successfully booted, listening on port 55555.

[root@ip-10-0-0-186 example]# go run iris-calc-client.go

2013/09/04 03:38:38 carrier: couldn't handle delivered balance event: &{{0xc200141870 [225 59 98 99 38 225 202 172 98 86 233 246 3 29 207 207] [177 94 82 216 19 87 12 135 50 78 36 182 176 119 98 69]} [78 78 122 231 2 169 206 139 130 151 224 217 168 0 243 73 26 41 130 31 2 45 149 33 253 221 86 222 223 189 20 173 156 40 238 7 119 80 36 249 222 128 116 17 248 254 64 59 121 146 220 234 90 95 187 54 151 156 135 86 9 80 7 65 148 19 151 135 134 26 122 226 103 175 236 143 151 81 178 80 100 119 57 23 92 2 163 81 43 234 109 152 119 10 200 151 42 107 61 30 13 48 115 108 253 195]}.

I don't get this error if I run the client and server on the same machine.

Is there any way to get more verbose logging from the relay? It's possible there's a network issue, but I've verified the following:

- Security group allows a very wide port range for both TCP and UDP

- SELinux is disabled

- iptables is disabled

- I was able to telnet to the TCP port from machine A to machine B

Any suggestions on how to debug this? It could be something with the network, but I'm not sure how to verify that.

thanks!

-- James

Péter Szilágyi

Sep 4, 2013, 5:13:47 AM

to James Cooper, projec...@googlegroups.com

Hi James!

Well, the error message states (not obviously, I agree, if you're not intimate with the internals) that a load-balanced delivery was requested, but the network failed to find an instance of the requested app group. The boot log also shows "carrier converged with 0 remote connections", so the node never found a peer.

My thought is that either a firewall is blocking it, or the IPs are very far apart and the nodes haven't found each other yet. I'm currently in the middle of something, but I'll try it out a bit later on a local cluster to see if it works locally, and then I'll try to sort out the EC2 VPC too.

Cheers,

Peter

PS: Is the code you're trying in the barrister-go GitHub repo (just to make sure I'll look at the same code)?


James Cooper

Sep 4, 2013, 9:31:11 AM

to projec...@googlegroups.com

Hi Peter,

Thanks for the quick reply. Yes, the client/server code is here: https://github.com/coopernurse/barrister-go/tree/master/example

Well, I just booted two new instances into the VPC this morning and tried again, and it worked. The output from the iris start sequence shows that it found its peer immediately:

[root@ip-10-0-0-241 ~]# 2013/09/04 13:24:37 main: carrier converged with 1 remote connections.

2013/09/04 13:24:37 main: booting relay service...

2013/09/04 13:24:37 main: iris successfully booted, listening on port 55555.
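
For a larger pool of nodes, reading the peer count off each boot log by hand gets tedious. The following is a minimal Go sketch that scans log text on standard input for the convergence line quoted in this thread and reports the count. The `remoteConnections` helper is a name made up for illustration, and the pattern is based only on the two log lines shown here, so it is an assumption that other Iris versions word the message the same way.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// convergedRe matches the boot line quoted in this thread, e.g.
// "main: carrier converged with 1 remote connections."
var convergedRe = regexp.MustCompile(`carrier converged with (\d+) remote connections`)

// remoteConnections (hypothetical helper) extracts the peer count from one
// log line. The boolean is false if the line is not a convergence message.
func remoteConnections(line string) (int, bool) {
	m := convergedRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		n, ok := remoteConnections(sc.Text())
		if !ok {
			continue
		}
		fmt.Printf("converged with %d remote connection(s)\n", n)
		if n == 0 {
			fmt.Println("warning: no peers found")
		}
	}
}
```

The captured iris output would be piped into the program; if iris writes its log to standard error, that stream has to be redirected into the pipe as well.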

My only thought is that I had made changes to the EC2 Security Group while those old nodes were running. The TCP change appeared to take effect immediately (I could telnet to a port that was previously blocked). But perhaps whatever magic Amazon does with their networking has to be applied when the VM boots. Not sure.

Anyhow, I'm happy to see it working. We're going to run some larger tests on Friday with a larger pool of nodes to see how quickly the peers converge and to ensure we don't lose messages. I'll start another thread with the results.

thanks again for your hard work

-- James

--

James Cooper
Principal Consultant - Bitmechanic LLC
http://www.bitmechanic.com/

Péter Szilágyi

Sep 5, 2013, 2:20:21 AM

to James Cooper, projec...@googlegroups.com

Glad it worked out :)

On the message loss front, Iris is a best-effort protocol (specifically because at the messaging level I cannot "do the right thing" if a message gets lost). Of course this doesn't mean that messages should randomly disappear, only that during overlay convergence they might end up at a dead end.

This is specifically the weak spot of the implementation as it stands: such corner cases are not handled, so a crash, for example, might drop more messages than is absolutely inevitable. I wanted to have the system feature complete (at least in a minimalist way) before trying to implement specialized handlers for specific corner cases.

I'm curious about your findings :)

Cheers,

Peter

PS: For the last few days I was also working on a little visual demo application of the Iris system; hopefully I can take the lid off it soon :)
