EC2 VPC test results - preliminary

James Cooper

Sep 7, 2013, 6:36:44 PM

to projec...@googlegroups.com

Hi,

I wrote a python script that uses boto to create a VPC cluster. One node runs the "server" process, the other nodes run the "client" process. The number of client nodes is configurable. The goal of the test was to determine:

(a) How long it takes the client nodes to find each other

(b) Messaging reliability

Much more work is needed to draw conclusions about (b). But I have some very preliminary answers for (a).

/24 EC2 VPC subnets appear to work pretty well even with small numbers of nodes. With 5 clients on m1.small instances, the clients reliably found the server after 30 seconds.

/21 subnets probably need more nodes than I wanted to pay for. With up to 10 clients waiting 2 minutes, they still hadn't found each other. I also got errors in syslog regarding the ARP cache, so you'll need to tune some sysctl settings if you plan to have a subnet this large.
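For anyone wondering which sysctl settings are meant: the post does not quote the syslog errors, so this is only a guess, but the usual suspects are the Linux neighbour-table thresholds (net.ipv4.neigh.default.gc_thresh1/2/3, which default to 128/512/1024). A /21 has 2048 addresses, more than the default hard limit. The sketch below prints candidate sysctl.conf lines for a given subnet. The sizing rule (half, equal to, and twice the subnet size) is an assumption chosen for illustration, not a value from the test.

```python
import ipaddress

# Linux defaults for the IPv4 neighbour (ARP) table thresholds.
DEFAULTS = {"gc_thresh1": 128, "gc_thresh2": 512, "gc_thresh3": 1024}


def neigh_sysctl_lines(cidr):
    """Return sysctl.conf lines sized for the given subnet.

    Sizing rule (an assumption, not from the original test):
    gc_thresh1 = half the subnet, gc_thresh2 = the subnet,
    gc_thresh3 = twice the subnet, never below the kernel defaults.
    """
    size = ipaddress.ip_network(cidr).num_addresses
    wanted = {
        "gc_thresh1": size // 2,
        "gc_thresh2": size,
        "gc_thresh3": size * 2,
    }
    lines = []
    for key in sorted(wanted):
        value = max(wanted[key], DEFAULTS[key])
        lines.append("net.ipv4.neigh.default.%s = %d" % (key, value))
    return lines


if __name__ == "__main__":
    # 10.0.0.0/21 is a placeholder subnet, not the one used in the test.
    for line in neigh_sysctl_lines("10.0.0.0/21"):
        print(line)
```

The printed lines can be appended to /etc/sysctl.conf and loaded with `sysctl -p`. For a /24 the function simply returns the kernel defaults.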

Also: micro instances didn't work well. Lots of weird random errors. m1.small worked much better.

_____

Again, I want to emphasize this is all preliminary. Here's the python script I used.

https://gist.github.com/coopernurse/6469383

Here's a screenshot of the example logs in Papertrail's web UI:

http://i.imgur.com/A5h8WhG.png

It downloads a .tgz file from my S3 bucket that contains the RSA key, Go distribution, and test client/server binaries. The AMI is CentOS 6.4 x86_64 with cloud-init support.

It uses papertrailapp.com as a central syslog dump. I just wanted a simple way to watch the machines run, and they offer a free account - no credit card required.

Each run of the script generates a unique random job id, so you can easily filter the papertrail logs by that ID.

If others want to run this script, please feel free. There are a few things to know:

1) the most recent version of boto in pypi doesn't fully support the VPC API. There's one critical patch you need to apply to your boto installation if you want this script to work:

https://github.com/boto/boto/commit/9db6101b46a90f23ffad6e337e53cc14393e96c9

Without this patch, the "AssociatePublicIpAddress" parameter to "run_instances" will not work, and your VMs will not be publicly addressable.

I personally just grabbed the "connection.py" and "networkinterface.py" changes and applied them to my local install.

2) you need a www.papertrailapp.com account

3) you need to set 4 environment variables:

AWS_ACCESS_KEY_ID

AWS_SECRET_ACCESS_KEY

AWS_SSH_KEY_NAME (name of ssh key pair registered w/your AWS account to use when booting the machines)

PAPERTRAIL_PORT (assigned to you when you create your account)
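A quick way to check the four variables before launching anything is sketched below. This is not taken from the gist: the function name `load_settings` and the choice to convert the port to an integer are assumptions made for illustration.

```python
import os

# The four variables listed above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SSH_KEY_NAME",
    "PAPERTRAIL_PORT",
)


def load_settings(environ=None):
    """Return the required settings as a dict, or raise ValueError.

    `environ` defaults to os.environ; passing a plain dict makes the
    check easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    settings = {name: environ[name] for name in REQUIRED_VARS}
    # The Papertrail port is numeric, so fail early if it is not.
    settings["PAPERTRAIL_PORT"] = int(settings["PAPERTRAIL_PORT"])
    return settings
```

Calling `load_settings()` at the top of a launch script fails fast with the names of any variables that are unset, before any instances are started.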

_____

Please let me know if you try this script out and learn more about how Iris operates within EC2.

cheers

-- James

Péter Szilágyi

Sep 8, 2013, 7:57:48 AM

to James Cooper, projec...@googlegroups.com

Hi James,

Thanks a lot for this thorough benchmark setup! I don't have the time right now to go through the whole thing and test it, but here are a few thoughts about the observations.

On Sun, Sep 8, 2013 at 1:36 AM, James Cooper <jamesp...@gmail.com> wrote:

> Hi,
>
> I wrote a python script that uses boto to create a VPC cluster. One node runs the "server" process, the other nodes run the "client" process. The number of client nodes is configurable. The goal of the test was to determine:
>
> (a) How long it takes the client nodes to find each other
> (b) Messaging reliability
>
> Much more work is needed to draw conclusions about (b). But I have some very preliminary answers for (a).
>
> /24 EC2 VPC subnets appear to work pretty well even with small numbers of nodes. With 5 clients on m1.small instances clients reliably found the server after 30 seconds.
> /21 subnets probably need more nodes than I wanted to pay for. With up to 10 clients waiting for 2 minutes they still didn't find each other. I also got errors in syslog regarding the ARP cache, so you'll need to tune some sysctl settings if you plan to have a subnet this large.

I'm not familiar with the IP address allocations done within an Amazon VPC, but maybe a dump of them could provide more of a basis for explaining these longer convergence times. In the current Iris configuration, each node probes 4 random IPs and 4 additional "up and down scanned" ones per second. A 2 minute scan should result in 480 scanned (240 up, 240 down) and 480 probed addresses. Given that a /21 subnet contains 2048 possible IPs, even if you distribute 10 nodes as evenly as possible, there will be only about 204 empty slots between them, which would be cross-scanned in 1.7 mins. Of course this is the theory, so if in practice you see a longer convergence time, then it's time to explore the possibilities :).
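The estimate above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, not Iris code: the function and key names are made up, and the 1.7 minute figure is matched by assuming a single node covers a whole gap in one direction at 2 addresses per second (half of the 4 scanned per second). Dividing the 2048 addresses by the 9 gaps between 10 nodes, rather than by 10, gives 227 slots.

```python
def scan_estimate(prefix_len=21, nodes=10, minutes=2.0,
                  random_per_sec=4, scan_per_sec=4):
    """Reproduce the probe counts and gap-crossing times from the text."""
    seconds = minutes * 60
    addresses = 2 ** (32 - prefix_len)        # 2048 for a /21
    probed = int(random_per_sec * seconds)    # random probes in the window
    scanned = int(scan_per_sec * seconds)     # up + down scans in the window
    per_direction = scan_per_sec / 2.0        # 2/s up and 2/s down

    gap_nodes = addresses // nodes            # dividing by the node count
    gap_between = addresses // (nodes - 1)    # dividing by the gaps between nodes

    return {
        "addresses": addresses,
        "probed": probed,
        "scanned": scanned,
        "gap_nodes": gap_nodes,
        "gap_between": gap_between,
        # Assumption: one node walks the whole gap in a single direction.
        "minutes_gap_nodes": gap_nodes / per_direction / 60,
        "minutes_gap_between": gap_between / per_direction / 60,
    }


if __name__ == "__main__":
    for key, value in sorted(scan_estimate().items()):
        print(key, value)
```

With the defaults this gives 480 probed and 480 scanned addresses, a 204-slot gap crossed in 1.7 minutes, and a 227-slot gap crossed in roughly 1.9 minutes, both within the 2 minute window James waited.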

> Also: micro instances didn't work well. Lots of weird random errors. m1.small worked much better.

Again I haven't done Amazon testing, so it's just a hunch, but it may have to do with the speed cap on the micro instances. Convergence is very CPU heavy due to the encrypted session setups, and the current implementation has an unhandled corner case if these cascading handshakes (shake, exchange state, find new peers, shake, etc) don't complete within an allotted time (the connections are detected as dead and get dropped). I'm guessing this is the reason why a small instance works (having a lot more CPU juice).

Cheers,

Peter


--
You received this message because you are subscribed to the Google Groups "Iris Decentralized Messaging" group.

Péter Szilágyi

Sep 8, 2013, 7:59:28 AM

to James Cooper, projec...@googlegroups.com

Minor correction, 204 empty slots -> 227 (div by 9, doh)
