Problems running Gatling with Amazon ELB


Marius

Oct 2, 2014, 8:28:05 AM
to gat...@googlegroups.com
Hi all,

For quite a few days I have been trying to get my Gatling test to run against Amazon's Elastic Load Balancer (ELB). My goal is to run 1000 HTTP requests/second distributed among two EC2 instances, which shouldn't be too hard.
With HAProxy, my tests run fine, but with the ELB I see interruptions when more than 650 requests/second are sent:
After a while, a high number (>4000) of connections are stuck in state SYN_SENT, until all of them are cleared with a ConnectException. Afterwards the connections succeed for a few seconds, until the issue occurs again.

First I tried changing the number of users (currently I am ramping up from 200 - 1000 users per second, each making one request only), and changing the keep-alive setting, both without success.
The known problems with the ELB have been ruled out (DNS refresh issues, ELB scaling / pre-warming) together with Amazon's support team.

Now I have switched to the Apache Benchmark tool (ab) instead of Gatling, and with a similar scenario (600 threads, 1 million requests) the test passes fine through the ELB.
Isn't that strange?

Gatling + HA Proxy = OK
Gatling + ELB = ConnectException
AB + ELB = OK

I have been running tcpdump and took a closer look at the TCP connections. For now I can only say that Apache's tool seems to act more in parallel (it is using >600 threads after all): at the beginning all connections are established with SYNs, whereas Gatling is, at least it seems to me, serializing the opening of connections.
Now I have a theory that something in Gatling might be stuck because of a missing SYN-ACK, which is why no further connections are handled. Is this reasonable?

Perhaps someone else has an idea what I can try to debug this issue. Any help is appreciated.

Thanks!

Marius

Stéphane Landelle

Oct 2, 2014, 8:40:20 AM
to gat...@googlegroups.com
Hi,

DNS issue.
Cheers,

Stéphane

--
You received this message because you are subscribed to the Google Groups "Gatling User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gatling+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Marius Kreis

Oct 2, 2014, 8:50:56 AM
to gat...@googlegroups.com, Stéphane Landelle
Hi,

it's unfortunately not that easy. As I wrote, DNS is not an issue. I disabled caching, even confirmed the lookups by tracing on port 53.
--
This message was sent from my Android phone with K-9 Mail.
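
For readers who want to repeat the DNS check: the thread does not say how caching was disabled or how the lookups were compared, so the following is only a sketch of one way to spot a changed address set between two lookups. The function name `new_addresses`, the hostname and the addresses are made up for illustration. On the JVM, DNS caching is commonly controlled through the `networkaddress.cache.ttl` security property, though whether that is what was used here is not stated.

```shell
#!/bin/sh
# Print addresses that appear in the new snapshot ($2) but not the old ($1).
# Each file holds one IP address per line; exit status is 1 if nothing is new.
new_addresses() {
    grep -vxF -f "$1" "$2"
}

# Live use (not run here; the hostname is a placeholder):
#   dig +short my-elb.example.com | sort > /tmp/elb.new
#   new_addresses /tmp/elb.old /tmp/elb.new && echo "ELB addresses changed"
#   sudo tcpdump -n port 53     # confirm lookups really leave the box

# Offline demonstration with made-up addresses:
dir=$(mktemp -d)
printf '10.0.0.1\n10.0.0.2\n' > "$dir/old"
printf '10.0.0.2\n10.0.0.3\n' > "$dir/new"
new_addresses "$dir/old" "$dir/new"    # prints 10.0.0.3
rm -r "$dir"
```

Running the snapshot comparison once a minute during a test gives a simple record of when the load balancer's address set changed, which can then be lined up with the moments the connections stall.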

Alex Bagehot

Oct 2, 2014, 9:03:25 AM
to gat...@googlegroups.com
ok so 2 things to start:
- use wrk over ab - or at least test with both.
- it's not a like for like test : you are mixing up "open" and "closed" workloads, or at least I don't have enough information to determine that the 2 workloads are the same.

open == gatling usersPerSecond(n) / tsung / iago, based on arrival rate input
closed == based on concurrent threads/connections/users/widgets/etc, wrk/ab, gatling atOnceUsers(n)+looping scenario, and all other tools ... in brief.
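
To make the distinction concrete: in an open workload the number of concurrent connections is a result, not an input. By Little's law it is roughly the arrival rate multiplied by the mean time each user spends in the system. A minimal sketch of that arithmetic follows; the function name `concurrency` and the response times are hypothetical illustrations, not measurements from this thread.

```shell
#!/bin/sh
# Little's law: concurrency = arrival_rate * mean_response_time.
# Usage: concurrency <arrivals_per_second> <mean_response_time_seconds>
concurrency() {
    awk -v rate="$1" -v rt="$2" 'BEGIN { printf "%.0f\n", rate * rt }'
}

# Hypothetical figures, for illustration only.
concurrency 1000 0.05   # fast responses            → 50
concurrency 1000 0.6    # slower responses          → 600
concurrency 1000 5      # connection attempts hang  → 5000
```

The last line shows why an open workload can pile up thousands of pending connections as soon as the target slows down, whereas a closed workload with 600 connections can never exceed 600.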

so for a like for like test try to start with:
wrk with 600 connections - note the requests per second and the connection stats (see the netstat one-liner in a previous post for counting concurrency in the TCP stack) interactively as the test proceeds.
gatling with atOnceUsers(600) with a script that loops forever

that would test the closed case.

you should record the rps (requests per second), response time (avg or percentile), and counts of tcp connections in each state (syn_sent, estab, etc)
why? because gatling could be creating far more (or fewer) concurrent connections than ab, causing a real/perceived issue/difference.

can you enable graphite and set up nc (netcat) with awk as per the previous post for realtime gatling monitoring? I'll add in the connection stats and will forward.
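
The previous post is not reproduced in this thread, so the following is only a guess at the general shape of such a setup, not the script referred to above. It assumes Gatling's Graphite writer is enabled and pointed at the local machine on port 2003, and relies on the Graphite plaintext format of one `<path> <value> <timestamp>` triple per line. The function name `format_graphite` and the metric paths in the demonstration are made up.

```shell
#!/bin/sh
# Format Graphite plaintext lines ("<path> <value> <epoch>") for the console.
# An optional first argument is a pattern the metric path must match.
# Lines that do not have exactly three fields are ignored.
format_graphite() {
    awk -v pat="${1:-.}" 'NF == 3 && $1 ~ pat { printf "%-45s %s\n", $1, $2 }'
}

# Live use (not run here). Listening flags differ between netcat
# variants: some want "nc -l 2003", others "nc -l -p 2003".
#   nc -l 2003 | format_graphite allRequests

# Offline demonstration with made-up metric names:
printf '%s\n' \
    'gatling.example.allRequests.ok.count 612 1412258885' \
    'gatling.example.allRequests.ko.count 3 1412258885' \
    'gatling.example.users.active 600 1412258885' |
    format_graphite allRequests
```

Watching the failed-request count next to the per-state connection counts makes it easier to see whether errors start at the same moment the sockets pile up in SYN_SENT.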


I can't say there isn't a problem with Gatling, but at the same time, I can't say there is either.



--

Stéphane Landelle

Oct 2, 2014, 9:05:21 AM
to gat...@googlegroups.com
I was about to explain this, but you shot faster, thanks Alex!

I'm going to explain this in the documentation, this kind of question happens way too often.

Stéphane Landelle

Oct 2, 2014, 10:03:59 AM
to gat...@googlegroups.com

Marius Kreis

Oct 6, 2014, 10:31:25 AM
to gat...@googlegroups.com
Hi Alex,

thank you very much for the detailed explanation.
I have been able to reproduce ab's (with parameter -k) and wrk's
behavior, but only when reusing connections. When recreating
connections, only ab succeeds.

But first of all, for reference in case someone else is searching for
this topic, this is what I did:

I have modified my scenario to have the users repeat endlessly with
.forever() { }

Set allowPoolingConnections = true (this is default) and use
.inject(atOnceUsers(600)) to create all users.

Then while running the test, I have been continuously printing
connection stats running command:

while true ; do sleep 1; date +"%T"; sudo netstat -na|grep tcp|awk '{print $NF}'|sort|uniq -c|sort -nr; echo ; done

This shows that there is a fixed number of 600 open connections.
Btw. How can I set a timeout when using atOnceUsers -- or do I have to
set the total number of requests instead?


Problems when not reusing connections:

As mentioned above, when I set allowPoolingConnections to false,
connections will be closed and opened for every request.
With ab (by default not using keep-alive) this is not a problem, I can
make 2 million requests (having something between 270-430 established
connections at a time).
With Gatling, netstat shows around 600 established connections and I run
into the same problems as before: after about 65400 requests (not a
coincidence that it is close to 65535?) all 600 users are stuck, and
netstat shows exactly 600 connections in state "SYN_SENT". Could this be
some kind of OS bottleneck to which only Gatling is susceptible?
So far I increased the number of open files and I did not get any other
error message which would point to e.g. the number of ports.


For now it seems as if I had to reuse connections for my load testing,
unless someone has a suggestion about what else to try.

Thanks
Marius

Stéphane Landelle

Oct 6, 2014, 10:52:56 AM
to gat...@googlegroups.com
It looks like after ~65400 requests, your target server closes all the connections (probably because the ELB changed IP), Gatling tries to open new connections but it somehow fails (server never acks the SYN packets, hence SYN_SENT).

No idea what causes this. Are you sure you disabled DNS caching? Do you get the same behavior when enabling keep-alive on ab?

Then, what I don't get is why you wouldn't run out of ephemeral ports with ab but you would with Gatling when not using keep-alive.


Stéphane Landelle

Oct 6, 2014, 10:56:34 AM
to gat...@googlegroups.com
Also, are you sure that the server that hosts Gatling doesn't end up being marked as an attacker and gets blocked?

Marius Kreis

Oct 6, 2014, 11:03:58 AM
to gat...@googlegroups.com
I've been working with Amazon's support on that topic and they confirmed
that the IPs are correct and ELB has scaled correctly.

Also I traced the DNS lookups to make sure the IPs are not stale and the
lookup takes place (yes, caching is disabled).

Could it be the ephemeral ports? Is there any way to check that?

I've also been wondering whether I might be blocked by the ELB... but
then I would assume Amazon's support would have noticed that. Also, I am
running the tests from an EC2 instance, so it might make a difference where the
requests come from.
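
On the question of how to check ephemeral ports: a back-of-envelope sketch, offered as a way to rule the theory in or out rather than as a diagnosis. It assumes a Linux load generator, where the range is exposed in `/proc/sys/net/ipv4/ip_local_port_range` and a locally closed connection lingers in TIME_WAIT for 60 seconds. The range used in the demonstration is only an example, and the function names are made up.

```shell
#!/bin/sh
# Size of an ephemeral port range given as "<low> <high>" on stdin,
# the format used by /proc/sys/net/ipv4/ip_local_port_range on Linux.
port_range_size() {
    awk '{ print $2 - $1 + 1 }'
}

# Rough ceiling on new outbound connections per second to ONE remote
# ip:port, if every closed connection lingers in TIME_WAIT locally.
# Usage: max_conn_rate <ports> <time_wait_seconds>
max_conn_rate() {
    awk -v ports="$1" -v tw="$2" 'BEGIN { printf "%d\n", ports / tw }'
}

# Example range, not necessarily the one on the machines in this thread.
ports=$(echo "32768 61000" | port_range_size)
echo "usable ports: $ports"                                   # 28233
echo "max new connections/s: $(max_conn_rate "$ports" 60)"    # 470

# On a live Linux box (not run here):
#   cat /proc/sys/net/ipv4/ip_local_port_range
#   netstat -ant | awk '$6 == "TIME_WAIT"' | wc -l
```

The ceiling applies per remote address and port, so a load balancer that answers on several addresses raises it accordingly, and it only bites if the client is the side that closes first. A connect that finds no free port would also typically fail immediately with an address-not-available error instead of waiting in SYN_SENT, so a low TIME_WAIT count during the stall would argue against this explanation.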




Stéphane Landelle

Oct 6, 2014, 11:59:53 AM
to gat...@googlegroups.com
Trying to find some info on the internet.
Seems like the exact same issue: http://forum.mikrotik.com/viewtopic.php?f=2&t=89625

Nadine Whitfield

Oct 9, 2014, 3:06:40 AM
to gat...@googlegroups.com
@Alex Would you by chance have a link to this "previous post" where the real-time monitoring of Gatling is covered? thanks

Stéphane Landelle

Oct 9, 2014, 3:14:56 AM
to gat...@googlegroups.com
@Nadine: did you search our documentation for "real-time monitoring"?


Alex Bagehot

Oct 9, 2014, 6:09:20 PM
to gat...@googlegroups.com
Is this still unresolved in terms of determining for sure where the issue is? I.e. it looks like the ELB, but Amazon won't accept that unless further proof is provided?
There are more options for diagnostics if so.

Stéphane Landelle

Oct 9, 2014, 6:14:24 PM
to gat...@googlegroups.com
FYI, I have someone on AsyncHttpClient whose issue is that ELB replies 503 when a region fails and never closes the connections, so they stay in the pool. ELB is quite a hairy beast...


Nadine Whitfield

Oct 9, 2014, 10:51:46 PM
to gat...@googlegroups.com
@stephane I just did and found 6 hits. I guess it's time to go splunking.

No problem. That's what I spend most of my day doing at work  :)


Alex Bagehot

Oct 10, 2014, 2:32:41 AM
to gat...@googlegroups.com

There's now an update to the doc
https://github.com/gatling/gatling/pull/2300/files

Marius Kreis

Oct 10, 2014, 3:37:34 AM
to gat...@googlegroups.com
Hi Alex,

yes, the issue is still not resolved. The high number of connections
being opened and closed is certainly the cause, but I couldn't find the
bottleneck. For now I have switched to a limited number of connections
instead.
Which diagnostics do you have in mind?

Nadine Whitfield

Oct 11, 2014, 3:08:59 AM
to gat...@googlegroups.com
Thank you! This looks like just what I was trying to find.

Nadine Whitfield

Nov 1, 2014, 3:36:06 PM
to gat...@googlegroups.com
This is great information! Thank you very much for sharing the link.




Mr Lewis

Nov 1, 2014, 3:44:12 PM
to gat...@googlegroups.com
Hi Nadine, 

I run my Gatling tests on EC2 and for real-time console metrics I've replaced the awk script with a python script. 


Aidy






Nadine Whitfield

Nov 1, 2014, 4:49:29 PM
to gat...@googlegroups.com
Nice. My Python skills are a little rusty, but it would be a welcome diversion to see what you've done. I glanced through a few of your Simulation classes and saw you have set up many of the scenarios with one request per user.

Have you been able to hit your target of 750 RPS with this kind of setup?

I'm in a similar kind of situation and am currently experimenting with adding more requests per user to see how well it runs.

Nadine Whitfield

Nov 12, 2014, 3:50:12 PM
to gat...@googlegroups.com
@Marius:
Were you ever able to get your script to run at 1000 RPS?

If so, what was your final resolution for this problem?

Marius Kreis

Nov 13, 2014, 3:02:19 AM
to gat...@googlegroups.com
Hi Nadine,

the only way I found was running each user in a loop. Having one user
per request did not work with ELB.

Stéphane Landelle

Nov 13, 2014, 4:52:15 AM
to gat...@googlegroups.com
Hi Marius,

I'd love to hear how you deal in your application with what look to me like very serious shortcomings in ELB.
I mean that having the ELB frequently change IP:
  • breaks DNS caching, so clients have to disable it, causing an overhead (all the more so as most DNS request implementations, such as the Java standard one, are blocking)
  • breaks connection pooling, so clients have to disable keep-alive
Cheers,

Stéphane


Nadine Whitfield

Dec 6, 2014, 7:00:25 PM
to gat...@googlegroups.com
Hi Marius:
I have an update. 

This week I made a change in my scripts that dramatically increased the traffic I can send to my service that sits behind an ELB. 
I don't have any special settings in my gatling.conf file, nothing has been configured in the Service Under Test, and my scenarios are designed to send 1 request per user. 

The only thing I changed was to add a line to my httpProtocol definition --- .shareConnection.

After doing this, I was able to routinely send 1000+ requests/sec to the service without any http errors whatsoever. 
I've found that around 1200 I might see some HTTP 500 or 503/504 errors, but there have not been any errors that indicate problems with network card saturation on the script client machine side (which for me is Jenkins).

The work load contains a few requests that use a .repeat or .pause construct, but the majority do not. 
It's basically lots of single-request users.