Spray performance issue: new TCP connections not being accepted fast enough


Richard Bradley

Jul 17, 2014, 9:54:38 AM
to spray...@googlegroups.com
Hi,

We have a Spray app which we are load testing, and we seem to have hit a performance bottleneck that we have been unable to diagnose for some time.

The app has Comet style polling as well as short-lived requests.
In our tests, at the point where the app becomes overloaded we have around 12K open connections, and around 150 requests per second.

As the load increases past this point, we are finding that the application takes a long time to respond to requests, but strangely it does not appear to be overloaded: CPU is not maxed out on the server (50%), and all the Actor queues are empty or near-empty (we have added diagnostics to measure queue lengths).

We have found that altering the test client to re-use TCP connections for requests gives us much greater capacity (unfortunately we cannot easily change the real clients to do the same ;-)).
We have also found that many of our slow responses take just over 3, 9 or 21 seconds, which appears to suggest TCP retransmission of the connect attempt: http://www.streppone.it/cosimo/blog/2011/07/how-to-detect-tcp-retransmit-timeouts-in-your-network/

We would like to understand what the bottleneck is around creating TCP connections.

We have tried increasing the setting of "akka.io.tcp.nr-of-selectors" from 1 to 16, but it hasn't made any noticeable difference.
We believe that we have ruled out file descriptor limits by checking "ulimit".
We believe that we have ruled out TCP TIME_WAIT port congestion by spreading the application load over 3 separate TCP ports (we bound our http service actor to IO(Http) three times). (We had a previous capacity bottleneck which was caused by TIME_WAIT and which we fixed in this way.)
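For reference, the multi-port binding was along these lines (a simplified sketch rather than our actual code; the actor name and port numbers here are placeholders):

  import akka.actor.{ActorSystem, Props}
  import akka.io.IO
  import spray.can.Http

  object MultiPortBoot extends App {
    implicit val system = ActorSystem("my-app")
    // MyServiceActor stands in for our existing spray service actor
    val service = system.actorOf(Props[MyServiceActor], "http-service")

    // Bind the same service actor on three separate ports so that TIME_WAIT
    // port pressure is spread across three listen sockets.
    Seq(8080, 8081, 8082) foreach { port =>
      IO(Http) ! Http.Bind(service, interface = "0.0.0.0", port = port)
    }
  }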

We do have a great many (20 / second) warnings in the log files that look like this:
  2014-07-17 06:20:06,252 05:20:06.251UTC WARN  spray.can.server.HttpServerConnection akka://NMSRestAPI/user/IO-HTTP/listener-0/1218176 - event pipeline: dropped Ack(0)
This does not quite match the reports in this previous Spray thread https://groups.google.com/forum/#!topic/spray-user/YN2ocRzwhY0 because there is no accompanying "Connection is already closed" log.

We have examined the Java stack traces of the application while it is apparently overloaded (i.e. while the response times are getting large):
There are 16 “pinned” io threads, all idle in “epollWait” inside “sun.nio.ch.SelectorImpl.select()”
There are many “default-dispatcher” Actor threads, all idle in “sun.misc.Unsafe.park” inside “scala.concurrent.forkjoin.ForkJoinPool.scan()”

My questions are:
1. How can I diagnose what bottleneck is causing our app to not accept new TCP connections fast enough? (and how can I relieve that bottleneck?)
2. What is the meaning of the "dropped Ack(0)" warnings? Do they need fixing?

Please let me know if I can provide any more information to help diagnose the issue.


Thanks very much,


Rich


Johannes Rudolph

Jul 17, 2014, 10:12:24 AM
to spray...@googlegroups.com
Hi Richard,

On Thu, Jul 17, 2014 at 3:54 PM, Richard Bradley
<richard.brad...@gmail.com> wrote:
> My questions are:
> 1. How can I diagnose what bottleneck is causing our app to not accept new
> TCP connections fast enough? (and how can I relieve that bottleneck?)

I would probably start with verifying what you are seeing directly on
the wire by running wireshark / tcpdump on the server and checking
where exactly the delays are introduced into the TCP conversation.

It can also make sense to look into the system logs and check for any
relevant messages. E.g. here's one issue I've observed when working
with many connections:

http://lserinol.blogspot.de/2009/02/iptables-and-connection-limit.html


> 2. What is the meaning of the "dropped Ack(0)" warnings? Do they need
> fixing?

It seems you have HTTP pipelining enabled. Is that on purpose? What
happens if you reset the spray.can.server.pipelining-limit to 1?


--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Richard Bradley

Jul 17, 2014, 10:22:01 AM
to spray...@googlegroups.com, johannes...@googlemail.com

Thanks, we'll look into those suggestions.

> > 2. What is the meaning of the "dropped Ack(0)" warnings? Do they
> > need fixing?
>
> It seems you have HTTP pipelining enabled. Is that on purpose?

Yes.

Does pipelining cause those warnings? Are they worth worrying about?

> What happens if you reset the spray.can.server.pipelining-limit to 1?

I think we had that set to 1 until recently and it hasn't affected the load testing (I think our test client might not support pipelining).

We'll do a test run with that set back to 1 and see if it makes any difference.

Best,

Rich

Richard Bradley

Jul 17, 2014, 10:31:28 AM
to spray...@googlegroups.com, johannes...@googlemail.com
On Thursday, July 17, 2014 3:22:01 PM UTC+1, Richard Bradley wrote:

Thanks, we'll look into those suggestions.


> I would probably start with verifying what you are seeing directly on 
> the wire by running wireshark / tcpdump on the server and checking 
> where exactly the delays are introduced into the TCP conversation. 

We have already done this: the client sends a SYN to the Spray application, but the app takes a long time to respond with the SYN-ACK.

This is while CPU is not maxed out on the machine, and we have idle Selector threads. 
What can we do to diagnose or address this delay?

> It can also make sense to look into the system logs to check if there 
> are any log messages.

Nothing that looks relevant, no.

Thanks,


Rich

Richard Bradley

Jul 22, 2014, 8:59:38 AM
to spray...@googlegroups.com, johannes...@googlemail.com
Did anyone have any further thoughts on this thread?

Since my email below, we have tried:

1. Running multiple instances of our app in parallel and spreading the connection load across those (this is similar to our previous experiment of opening multiple ports in the same Spray application). We found a very slight improvement, but no big effect.

2. Increasing the "backlog" setting on the socket (+ the "net.core.somaxconn" sysctl). No noticeable effect.

The overload point seems to be about 500 TCP connections per second, beyond which our Spray app cannot accept new connections fast enough to keep up with the load, even though CPU is not maxed out.

I notice that the Spray benchmark ( http://spray.io/blog/2013-05-24-benchmarking-spray/ ) uses long-lived connections to achieve 200k requests/sec, so it doesn't measure Spray's connections-per-second capacity. Are there any benchmarks which measure the latter?

We are going to try taking the app from the Spray benchmark above and seeing how many connections per second it can achieve, so we'll have a reproducible test case.

Thanks,


Rich

Johannes Rudolph

Jul 22, 2014, 9:03:43 AM
to Richard Bradley, spray...@googlegroups.com
Hi Rich,

On Tue, Jul 22, 2014 at 2:59 PM, Richard Bradley
<richard.brad...@gmail.com> wrote:
> We are going to try taking the app from the Spray benchmark above and seeing
> how many connections per second it can achieve, so we'll have a reproducible
> test case.

Yes, a more easily reproducible test-case would be very good. The
spray codebase contains similar benchmark code. Maybe you can start
from there:

https://github.com/spray/spray/tree/master/examples/spray-can/server-benchmark

Richard Bradley

Jul 22, 2014, 12:51:32 PM
to spray...@googlegroups.com, richard.brad...@gmail.com, johannes...@googlemail.com
On Tuesday, July 22, 2014 2:03:43 PM UTC+1, Johannes Rudolph wrote:
Yes, a more easily reproducible test-case would be very good. The
spray codebase contains similar benchmark code. Maybe you can start
from there:

https://github.com/spray/spray/tree/master/examples/spray-can/server-benchmark


  

We have reproduced the issue using the Spray “benchmark” app linked above.

It shows the exact same symptoms:

  • When load exceeds around 1200 connections/second, the app starts to take a long time to reply to new TCP connections
  • This is even though the CPU on the machine is only at 15-20%
  • So there is some bottleneck preventing Spray from accepting TCP connections fast enough
    • What is it?
    • How can we improve it?

We used Spray v 1.3.1 and Akka v 2.3.3


I can’t upload the test harness here, but it just makes lots of HTTP requests, using a new TCP connection for each request.

We have already eliminated the following possible bottlenecks:

  • TIME_WAIT issues including TCP port exhaustion
  • Limits on the maximum number of open file handles
  • Too small a socket “backlog” size (increasing it did not help)

I have included some relevant graphs below, as well as the diff of changes we had to make to the benchmark app (nothing significant).

If anyone has any ideas about what might be causing this bottleneck, or any suggestions on relevant stats we could gather, they would be gratefully received.

 

Thanks,

 

 

Rich

 

 

In these graphs, the test starts at 10:00, and takes about 20 mins to ramp up the load, reaching 1200 connections/sec at around 10:20.

 

CPU load (low throughout the test):


Response times (starts low, increases when the TCP connection issue manifests after around 10:20):

 

Number of TCP “listen socket overflow” events, as reported by “netstat -s” (“times the listen queue of a socket overflowed”).

I think this proves that the system isn’t accepting TCP connections fast enough: 

 

Number of open TCP connections:


We have graphs showing that the slow requests are taking 3 or 9 seconds to complete, which is consistent with TCP backoff times when a connection attempt fails.
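(For reference, the overflow counter comes from output along these lines; the exact counter wording varies between kernel versions, and 8099 is just the port the benchmark app listens on:)

  # cumulative counters since boot: listen-queue overflows and dropped SYNs
  netstat -s | grep -i -E "overflow|SYNs to LISTEN"

  # live view of the listen socket: for sockets in LISTEN state, Recv-Q is the
  # current accept-queue depth and Send-Q is the configured backlog
  watch -n 1 'ss -lnt sport = :8099'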

The diff of changes we made to the benchmark app:

 

diff --git a/spray/build.sbt b/spray/build.sbt
index 2a9b86f..b2ed753 100644
--- a/spray/build.sbt
+++ b/spray/build.sbt
@@ -15,9 +15,9 @@ resolvers ++= Seq(
 
 libraryDependencies ++= Seq(
   "io.spray" %% "spray-json" % "1.2.4",
-  "io.spray" % "spray-can" % "1.1-20130619",
-  "com.typesafe.akka" %%  "akka-actor" % "2.1.2",
-  "com.typesafe.akka" %%  "akka-slf4j" % "2.1.2",
+  "io.spray" % "spray-can" % "1.3.1",
+  "com.typesafe.akka" %%  "akka-actor" % "2.3.3",
+  "com.typesafe.akka" %%  "akka-slf4j" % "2.3.3",
   "ch.qos.logback"% "logback-classic" % "1.0.12" % "runtime"
)

diff --git a/spray/src/main/resources/application.conf b/spray/src/main/resources/application.conf
index ca36824..9e376ab 100644
--- a/spray/src/main/resources/application.conf
+++ b/spray/src/main/resources/application.conf
@@ -24,5 +24,5 @@ spray.can.server {
 
 app {
   interface = "0.0.0.0"
-  port = 8080
+  port = 8099
}

diff --git a/spray/src/main/scala/spray/examples/BenchmarkService.scala b/spray/src/main/scala/spray/examples/BenchmarkService.scala
index ca3e66b..49c4280 100644
--- a/spray/src/main/scala/spray/examples/BenchmarkService.scala
+++ b/spray/src/main/scala/spray/examples/BenchmarkService.scala
@@ -18,6 +18,8 @@ class BenchmarkService extends Actor {
   val unknownResource = HttpResponse(NotFound, entity = "Unknown resource!")
 
   def fastPath: Http.FastPath = {
+    case HttpRequest(_, Uri(_, _, Slash(Segment("myapp", _)), _, _), _, _, _) =>
+      HttpResponse(entity = "You hit myapp ")
     case HttpRequest(GET, Uri(_, _, Slash(Segment(x, Path.Empty)), _, _), _, _, _) =>
       x match {
         case "json" =>

Johannes Rudolph

Jul 23, 2014, 4:00:37 AM
to Richard Bradley, spray...@googlegroups.com
Thanks Richard, this is valuable information. I'll see if I can reproduce it locally.

Johannes

Johannes Rudolph

Jul 23, 2014, 6:16:03 AM
to Richard Bradley, spray...@googlegroups.com
On Thu, Jul 17, 2014 at 4:31 PM, Richard Bradley
<richard.brad...@gmail.com> wrote:
>> I would probably start with verifying what you are seeing directly on
>> the wire by running wireshark / tcpdump on the server and checking
>> where exactly the delays are introduced into the TCP conversation.
>
> We have already done this: the client sends a SYN to the Spray application,
> but the app takes a long time to respond with the SYN-ACK.

The SYN-ACK is sent by the kernel somewhat independently of the app. That
means that even if the app isn't accepting connections at all, the kernel
will still complete the handshake and queue the connection, up to the
configured listen backlog.

I tried reproducing your issue but everything works as far as I can
see. I set up a repository with the code used:

https://github.com/jrudolph/spray-benchmark

You can use `./build-and-run.sh` to build and run the server. Then in
another terminal `./benchmark.sh` to run an `ab` benchmark. It will
also show some of the important kernel parameters. This is what it
shows on my machine:

net.ipv4.tcp_max_syn_backlog = 50000
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_tw_recycle = 0
net.core.somaxconn = 50000
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
memory(kbytes) unlimited
locked memory(kbytes) 64
process 127171
nofiles 4096
vmemory(kbytes) unlimited
locks unlimited

I increased somaxconn and tcp_max_syn_backlog. The README contains some
diagnostic commands I use to watch what happens. In your diff you
didn't increase the listen backlog in the spray app. I can get about
17k RPS (= 17000 connections per second) with this setup which ran
for 15 minutes without problems (which may not be long enough?).

Can you try reproducing it with this app?

I wonder why your graphs show 20k open connections if you only open
1200 per second. Are you testing on localhost or from another machine?

Michael Hamrah

Jul 23, 2014, 8:39:44 AM
to spray...@googlegroups.com
I know little about how akka.io interacts with the OS's TCP infrastructure, but I'm curious whether you have compared a Netty-based app (say, the Play framework) or some other runtime (Go) to specifically measure the TCP SYN/ACK timing, if that's where you think the bottleneck is.

Mike

Richard Bradley

Jul 23, 2014, 8:42:35 AM
to spray...@googlegroups.com, richard.brad...@gmail.com, johannes...@googlemail.com
> Can you try reproducing it with this app? 

Thanks very much; initial results suggest that this isn't overflowing the listen socket, i.e. we see much higher connections / second with this test harness than with ours.
This suggests to me that the issue might be in our test harness. We'll look into that and report back here.


> I wonder why your graphs show 20k open connections if you only open 
> 1200 per second. Are you testing on localhost or from another machine? 

Good spot.
We've looked into it, and this graph was mislabelled / misinterpreted. It is from "netstat", and shows the number of new connections in the last 15 seconds, which is consistent with roughly 1200 connections / second.

Richard Bradley

Jul 28, 2014, 7:01:36 AM
to spray...@googlegroups.com, richard.brad...@gmail.com, johannes...@googlemail.com
We have found that a combination of two changes has fixed most of our issues with TCP connection creation:

1. "sysctl -w net.ipv4.tcp_max_syn_backlog=50000" (with a reboot)
2. Increase the "backlog" setting on the Http.Bind message from the default of 100 to 2048

We are still seeing some "times the listen queue of a socket overflowed" reports from netstat, but far fewer than before.
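In case it helps anyone else, change 2 looks roughly like this (a minimal sketch; the actor name and port are placeholders for our real ones):

  import akka.actor.{ActorSystem, Props}
  import akka.io.IO
  import spray.can.Http

  object Boot extends App {
    implicit val system = ActorSystem("my-app")
    // MyServiceActor stands in for our existing spray service actor
    val service = system.actorOf(Props[MyServiceActor], "http-service")

    // Http.Bind takes a "backlog" parameter (default 100); raising it lets the
    // kernel queue more completed-but-not-yet-accepted connections before it
    // starts dropping new connection attempts.
    IO(Http) ! Http.Bind(service, interface = "0.0.0.0", port = 8080, backlog = 2048)
  }

Note that the kernel silently caps the effective listen backlog at net.core.somaxconn, so both the sysctl and the Http.Bind backlog need to be large enough.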

Thanks,


Rich

Mathias Doenitz

Jul 29, 2014, 6:13:01 AM
to spray...@googlegroups.com
Great!
Thanks for sharing your solution!

Cheers,
Mathias

---
mat...@spray.io
http://spray.io

Johannes Rudolph

Jul 29, 2014, 8:56:03 AM
to spray...@googlegroups.com
Cool, thanks Richard for reporting back.

Johannes



Edwin Fuquen

Jul 30, 2014, 4:46:56 PM
to spray...@googlegroups.com, johannes...@googlemail.com
This is great info. It might be good to include a "Performance Tuning" section in the spray-can docs with suggestions on which config parameters to change in spray to get high connection throughput. Even some guidance on kernel parameter tuning would be nice, though it's not directly related to spray.

Richard Bradley

Jul 31, 2014, 8:33:12 AM
to spray...@googlegroups.com, johannes...@googlemail.com
On Wednesday, July 30, 2014 9:46:56 PM UTC+1, Edwin Fuquen wrote:
This is great info. It might be good to include a "Performance Tuning" section in the spray-can docs with suggestions on which config parameters to change in spray to get high connection throughput. Even some guidance on kernel parameter tuning would be nice, though it's not directly related to spray.

Sounds like a good idea to me. 
I don't think I'll be able to write such a section any time soon, but I would be happy to help review & contribute.
I have raised an issue to track this work: https://github.com/spray/spray/issues/917