[akka-http-2.4.2] connection timeouts under load


Giovanni Alberto Caporaletti

Mar 5, 2016, 7:03:57 AM
to Akka User List
Hi,
I'll try to explain what I'm experiencing in my akka-http app.
(I found this issue, but it hasn't been updated for almost a year and I'm not sure it's relevant: https://github.com/akka/akka/issues/17395)

I noticed that under load a lot of connections (~1-2%) were dropped or timed out. I started investigating, tuning OS and Akka parameters and trimming down my sample app, until I got to this:


// N.B.: this is a test
import scala.concurrent.{ Await, Future, Promise }
import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.Http.ServerBinding
import akka.http.scaladsl.model.{ HttpRequest, HttpResponse }
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{ Flow, Sink }

implicit val system = ActorSystem()
implicit val mat: ActorMaterializer = ActorMaterializer()
implicit val ec = system.dispatcher

val binding: Future[ServerBinding] = Http().bind("0.0.0.0", 1104).map { conn ⇒
  val promise = Promise[Unit]()
  // I don't even wait for the end of the flow
  val handler = Flow[HttpRequest].map { _ ⇒ promise.success(()); HttpResponse() }

  // to be sure it's not a mapAsync(1) problem, I use map and block here; same result
  val t0 = System.currentTimeMillis()
  println(s"${Thread.currentThread().getName} start")

  conn handleWith handler

  Await.result(promise.future, 10.seconds)
  println(s"${Thread.currentThread().getName} end ${System.currentTimeMillis() - t0}ms")
}.to(Sink.ignore).run()

Await.result(binding, 10.seconds)

Await.result(binding, 10.seconds)


When I run a small test using ab with something like "-c 1000" concurrent connections or more (even though I'm handling them one at a time here), some of the requests immediately start seeing unusual delays:

default-akka.actor.default-dispatcher-3 start
default-akka.actor.default-dispatcher-3 end 2015ms -> gets bigger

This keeps getting worse. After a while I can kill ab, wait a few minutes and make a single request: it either gets refused or times out. The server is basically dead.


In case you're wondering why I did all that blocking and printing above: I get the exact same result with this,

val handler = Flow[HttpRequest].map(_ ⇒ HttpResponse()).alsoToMat(Sink.ignore)(Keep.right)

val binding: Future[ServerBinding] = Http().bind("0.0.0.0", 1104).mapAsync(1) { conn ⇒
  conn handleWith handler
}.to(Sink.ignore).run()
The same happens if I use bindAndHandle with a simple route.
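For completeness, the bindAndHandle variant I mean is just the obvious minimal one; this is a sketch from memory, and the route/port are placeholders, not my real app:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object MinimalServer extends App {
  implicit val system = ActorSystem()
  implicit val mat    = ActorMaterializer()

  // trivial route that completes immediately, so the handler itself
  // can't be the bottleneck
  val route = path("ping") { complete("pong") }

  Http().bindAndHandle(route, "0.0.0.0", 1104)
}
```

Even with a handler this trivial I see the same timeouts under load.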


In my standard setup (bindAndHandle, any number of concurrent connections (I tried 1k to 10k) and keep-alive for the requests) I see between 1% and 3% of connections failing.
This is what I get calling a simple route with bindAndHandle, max-connections = 10000 and connection keep-alive enabled on the client: lots of timeouts after just 10k calls:

Concurrency Level:      4000
Time taken for tests:   60.605 seconds
Complete requests:      10000
Failed requests:        261
   (Connect: 0, Receive: 87, Length: 87, Exceptions: 87)
Keep-Alive requests:    9913
...

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    7  31.3      0     191
Processing:     0  241 2780.8      5   60396
Waiting:        0   92 1270.8      5   60396
Total:          0  248 2783.5      5   60459

Percentage of the requests served within a certain time (ms)
...
  90%     13
  95%    255
  98%   2061
  99%   3911
 100%  60459 (longest request) 

It looks like it does the same on my local machine (a Mac), but I'm not 100% sure. I'm running the tests on an 8-core, 24 GB RAM Ubuntu VM.
I really don't know what to do: I'm trying every possible combination of system parameters and Akka config, but I keep getting the same result.
Basically everything I tried (changing /etc/security/limits.conf, changing sysctl params, changing Akka max connections, backlog, dispatchers, etc.) led to the same result: connections doing nothing and timing out, as if the execution were queued somehow.
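For reference, these are the kinds of knobs I've been touching; the values here are illustrative, not a recommendation:

```
# application.conf
akka.http.server {
  max-connections = 10000
  backlog         = 1024
}

# sysctl (Linux)
net.core.somaxconn           = 4096
net.ipv4.tcp_max_syn_backlog = 4096
```

None of these combinations changed the behaviour in any meaningful way.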


Is there something I'm missing? Some tuning parameter/config/something else? 
It looks like the piece of code that times out is conn handleWith handler, even though 'handler' does nothing, and it keeps timing out even after the load stops. I.e. the connection is established correctly, but the processing is stuck.


this is my ulimit -a:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 96360
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32768
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

vm.swappiness = 0


Cheers


Kyrylo Stokoz

Mar 5, 2016, 8:25:25 AM
to Akka User List
Hi,

I have similar observations: after a while the server keeps accepting requests, but they all time out and nothing is returned in response.
I'm using akka-http 2.4.2 and streams to create a simple server that handles requests and returns files from S3.

In my case I don't even need high load: making one request after another is enough to hang the server. I played with the max-connections parameter; increasing it makes the app process more requests, but eventually it gets stuck anyway.

From my observations, the issue is in the HTTP connection pool not properly releasing connections: when a new request comes in, a runnable graph is created, but the sinks and sources are not properly connected, so the flow never starts.
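In case it helps others reproduce: the pool settings I've been playing with are along these lines (the values are illustrative, not what I'd recommend):

```
akka.http.host-connection-pool {
  max-connections   = 32
  max-open-requests = 64
  idle-timeout      = 30 s
}
```

Raising max-connections lets more requests through before the hang, but doesn't prevent it.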

During my tests I see the following (not sure this is related, though):

ERROR [datastore-rest-api-akka.actor.default-dispatcher-5] - Error in stage [One2OneBidi]: Inner stream finished before inputs completed. Outputs might have been truncated.
akka.http.impl.util.One2OneBidiFlow$OutputTruncationException$: Inner stream finished before inputs completed. Outputs might have been truncated.
ERROR [datastore-rest-api-akka.actor.default-dispatcher-5] - Error in stage [One2OneBidi]: Inner stream finished before inputs completed. Outputs might have been truncated.
akka.http.impl.util.One2OneBidiFlow$OutputTruncationException$: Inner stream finished before inputs completed. Outputs might have been truncated.
ERROR [datastore-rest-api-akka.actor.default-dispatcher-5] - Error in stage [One2OneBidi]: Inner stream finished before inputs completed. Outputs might have been truncated.
akka.http.impl.util.One2OneBidiFlow$OutputTruncationException$: Inner stream finished before inputs completed. Outputs might have been truncated.
ERROR [datastore-rest-api-akka.actor.default-dispatcher-5] - Error in stage [One2OneBidi]: Inner stream finished before inputs completed. Outputs might have been truncated.
akka.http.impl.util.One2OneBidiFlow$OutputTruncationException$: Inner stream finished before inputs completed. Outputs might have been truncated.
 INFO [datastore-rest-api-akka.actor.default-dispatcher-5] - Message [akka.io.Tcp$ResumeReading$] from Actor[akka://datastore-rest-api/user/StreamSupervisor-0/$$a#1262265379] to Actor[akka://datastore-rest-api/system/IO-TCP/selectors/$a/3#-1262857800] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
 INFO [datastore-rest-api-akka.actor.default-dispatcher-5] - Message [akka.io.Tcp$ResumeReading$] from Actor[akka://datastore-rest-api/user/StreamSupervisor-0/$$e#585879533] to Actor[akka://datastore-rest-api/system/IO-TCP/selectors/$a/7#1750981790] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

Giovanni Alberto Caporaletti

Mar 5, 2016, 10:38:11 AM
to Akka User List
Let me add that using Finch on the same project with default settings (the exact same code apart from the very first layer, HTTP request/response parsing and routing) I get 0 timeouts/drops and more or less the same performance.
Performance is not an issue (it's getting better with every release; it's basically already on par with spray/finagle), but dropping connections and timing out is.

Luckily this won't go into production for a while, so I have some time to solve this issue. I don't want to change HTTP frontend: I really hate the Twitter APIs, and my business layer is completely written using futures and streams.

I don't know exactly what to look into; I'm open to suggestions.

thank you!

David Knapp

Mar 8, 2016, 7:28:57 PM
to Akka User List
For what it's worth, I've been occasionally seeing something similar, but it's extremely intermittent and hard to reproduce, since my application is quite large and complex and relies almost entirely on an upstream REST endpoint.

Giovanni Alberto Caporaletti

Mar 9, 2016, 5:16:07 AM
to Akka User List
I have very little time, but I'll try to create a simple reproducer app and open an issue (if it turns out to be one) ASAP.

lisp pm

Mar 10, 2016, 4:21:26 AM
to Akka User List
My use case is similar to yours. After downgrading from 2.4.2 to 2.0.3, the error is gone and everything seems to be working fine.

Giovanni Alberto Caporaletti

Mar 10, 2016, 4:27:04 AM
to Akka User List
I opened an issue with a simple reproducer: https://github.com/akka/akka/issues/19996

Kyrylo Stokoz

Mar 10, 2016, 9:44:44 AM
to Akka User List
I think I have a related issue, but on the client side (i.e. making requests using akka-http). The issue, with a reproducer, is: https://github.com/akka/akka/issues/19953

Brian

Mar 10, 2016, 12:48:56 PM
to Akka User List
I'm seeing (what looks to be) the same issue under load testing, where connection timeouts cause the application to become non-responsive to HTTP requests. I also eventually get an OOM due to "GC overhead limit exceeded", although that may be a symptom of a cascading failure. Load-test runs without connection timeouts are clean, but as soon as I start to see timeouts, it's crash and burn. Similar to Giovanni's case with Finch above, when the same business layer is used behind Liftweb there are no such issues.

Thread dumps show a lot of busy threads in deep stacks, but not much obvious in the akka-http layer. Interestingly, during lock-up I need to use the -F option with jstack just to get a thread dump.

I do have logback configured, so hopefully I can add a bit more information around the issue. Some of the interesting error messages that appear as things go south include:

// Several of these indicate the issue is imminent:
ERROR a.a.RepointableActorRef - Error in stage [recover]: No elements passed in the last 1 minute.
akka.http.impl.engine.HttpConnectionTimeoutException: No elements passed in the last 1 minute.

// Not sure if this is Metallica related; only one set of these seen:
ERROR a.a.ActorSystemImpl - exception on LARS' timer thread
INFO  a.a.ActorSystemImpl - starting new LARS thread


// These tend to be the last errors to happen before the JVM crashes, and there are many:

ERROR a.a.RepointableActorRef - Error in stage [akka.http.impl.engine.server.HttpServerBluePrint$RequestTimeoutSupport@xxxxxxx]: requirement failed: Cannot push port (responseOut) twice



Hope this helps; I'll keep an eye on #19996.
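One observation: the "No elements passed in the last 1 minute" message matches what I believe is the default server idle-timeout (60 s), so while debugging it can at least be turned up to separate the idle-timeout symptom from the underlying hang (value illustrative):

```
akka.http.server.idle-timeout = 5 minutes
```

That doesn't fix anything, of course, it just delays when the connections get reaped.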

Giovanni Alberto Caporaletti

Mar 10, 2016, 5:59:08 PM
to Akka User List
I get this in my tests if I enable logging:

[DEBUG] [03/10/2016 23:43:36.674] [default-akka.actor.default-dispatcher-10] [akka://default/system/IO-TCP/selectors/$a/0] Unbinding endpoint /127.0.0.1:1234
[DEBUG] [03/10/2016 23:43:36.677] [default-akka.actor.default-dispatcher-10] [akka://default/system/IO-TCP/selectors/$a/0] Unbound endpoint /127.0.0.1:1234, stopping listener