Spray->Akka-Http Migration - seeing high 99th percentile latencies post-migration

Gary Malouf

Oct 12, 2017, 4:31:14 PM
to Akka User List
We have a web service that we just finished migrating from spray 1.3 to Akka-Http 10.0.9.  While in most cases it is performing well, we are seeing terrible 99th percentile latencies (in the 300-450 ms range) starting from a very low request rate (10/second) on an EC2 m3.large.

Our service does not do anything complicated: it does a few Map lookups and returns a response to a request.  In spray, even 99th percentile latencies were on the order of 1-3 ms, so we are definitely concerned.  Connections, as with many pixel-type servers, are short-lived; we intentionally send the Connection: close header in our responses.

Is there any obvious tuning that should be done on the server configuration that others have found?

Gary Malouf

Oct 12, 2017, 4:36:06 PM
to Akka User List
To be clear, the 95th percentile and below are as low as before, so I am wondering whether this is a new connection-closing penalty being paid or whether the actor system needs to be tuned differently now...

Konrad Malawski

Oct 12, 2017, 8:44:55 PM
to Akka User List, Gary Malouf
When asking about performance and benchmarks, always include specific numbers, code, and benchmark methodology; otherwise it’s just guessing and inventing numbers and reasons.

Thanks

-- 
Konrad Malawski
--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

Gary Malouf

Oct 17, 2017, 10:48:15 AM
to Akka User List
Hi Konrad,

I understand your point, but it is not really possible to share code from a closed-source project.  I'm more asking whether akka-http does not yet handle short-lived connections as well as spray did.  In the meantime I will be profiling to try to get to the bottom of the issue.

Gary

Konrad “ktoso” Malawski

Oct 17, 2017, 11:26:13 AM
to akka...@googlegroups.com, Gary Malouf
Short-lived connections are slightly more costly in Akka HTTP than in Spray, due to the streaming infrastructure.

-- 
Cheers,
Konrad 'ktoso' Malawski

Gary Malouf

Oct 17, 2017, 12:02:35 PM
to Akka User List
Thanks Konrad. Given the huge cost changes we are seeing, is there any tuning you would recommend in terms of dispatchers, etc. to smooth this out, or, given the streaming infrastructure, should I consider a different server?

Konrad “ktoso” Malawski

Oct 17, 2017, 12:07:51 PM
to akka...@googlegroups.com, Gary Malouf
Step 1 – don’t panic ;-)
Step 2 – as I already asked for, please share actual details of the benchmarks. It is not good to discuss benchmarks without any insight into what / how exactly you’re measuring.

-- 
Cheers,
Konrad 'ktoso' Malawski

Gary Malouf

Oct 23, 2017, 4:11:48 PM
to Akka User List
Hi Konrad,

Our real issue is that we cannot reproduce the results.  The web server we are having latency issues with is under a peak load of 10-15 requests/second, obviously not much to deal with.

When we use load tests (https://github.com/apigee/apib), it's easy for us to throw a few thousand requests/second at it and get latencies in the ~3 ms range.  We use Kamon to track internal metrics; what we see is that our 95th and 99th percentiles only look bad under production traffic, not under load tests.

I've since used Kamon to print out the actual requests, trying to find any pattern in them that hints at what's wrong in my own code, but they seem to be completely random.  What we do know is that downgrading to spray gets us 99.9th percentile latencies under 2 ms, so something related to the upgrade is causing this.

Thanks,

Gary

Viktor Klang

Oct 23, 2017, 4:21:08 PM
to Akka User List
What definition of latency are you using? (i.e. how is it derived)

Gary Malouf

Oct 23, 2017, 4:22:49 PM
to akka...@googlegroups.com
We are using percentiles computed via Kamon 0.6.8.  In a very low request rate environment like this, it takes roughly 1 super slow request/second to throw off the percentiles (which is what I think is happening).  
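
To make the arithmetic behind that concrete, here is a minimal sketch using only the Scala standard library. It does not reproduce how Kamon computes percentiles; it uses a plain nearest-rank percentile, and the 2 ms and 400 ms latencies, the 60-second window and the helper names are assumptions chosen to resemble the numbers in this thread.

```scala
object PercentileDemo {
  // Nearest-rank percentile: the smallest sample such that at least
  // p percent of all samples are less than or equal to it.
  def percentile(samplesMs: Seq[Long], p: Double): Long = {
    require(samplesMs.nonEmpty && p > 0 && p <= 100)
    val sorted = samplesMs.sorted
    val rank = math.ceil(p / 100.0 * sorted.size).toInt
    sorted(rank - 1)
  }

  // Hypothetical measurement window: `total` requests, of which `slow`
  // take slowMs and the rest take fastMs (both values are assumptions).
  def window(total: Int, slow: Int, fastMs: Long = 2L, slowMs: Long = 400L): Seq[Long] =
    Seq.fill(total - slow)(fastMs) ++ Seq.fill(slow)(slowMs)

  def main(args: Array[String]): Unit = {
    // Both windows contain one slow request per second for 60 seconds.
    val lowTraffic = window(total = 600, slow = 60)    // 10 requests/second
    val loadTest   = window(total = 60000, slow = 60)  // 1000 requests/second

    println(s"10 req/s:   p95=${percentile(lowTraffic, 95)} ms, p99=${percentile(lowTraffic, 99)} ms")
    println(s"1000 req/s: p95=${percentile(loadTest, 95)} ms, p99=${percentile(loadTest, 99)} ms")
  }
}
```

With these assumed numbers the same 60 slow requests make up 10% of the low-traffic window, so they set its 99th percentile to 400 ms, while in the load-test window they are 0.1% of the samples and the 99th percentile stays at 2 ms.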



Viktor Klang

Oct 23, 2017, 4:30:33 PM
to Akka User List
No, I mean, is it from first-byte-received to last-byte-sent or what?

Gary Malouf

Oct 23, 2017, 4:35:31 PM
to akka...@googlegroups.com
It is measured from when I start the Kamon trace (just inside my path("myawesomepath") declaration) until (theoretically) a 'complete' call is made.

path("myawesomepath") {
  traceName("CoolStory") {
    // do some stuff
    complete("This is great")
  }
}

For what it's worth, this route is a 'POST' call.

Viktor Klang

Oct 23, 2017, 4:42:53 PM
to Akka User List
And you consume the entityBytes I presume?

Gary Malouf

Oct 23, 2017, 4:50:07 PM
to akka...@googlegroups.com
Yes, it gets parsed using entity(as[]) with spray-json support.  Under a load test of, say, 1000 requests/second these latencies are not visible in the percentiles; they are easy to see in production because this web server is currently getting only 10-20 requests/second.  I am trying to work out whether a dispatcher needs to be tuned or something of that sort, but have yet to see evidence supporting that.

path("foos") {
  traceName("FooSelection") {
    entity(as[ExternalPageRequest]) { pr =>
      val spr = toSelectionPageRequest(pr)
      shouldTracePageId(spr.pageId).fold(
        Tracer.currentContext.withNewSegment(s"Page-${pr.pageId}", "PageTrace", "kamon") {
          processPageRequestAndComplete(pr, spr)
        },
        processPageRequestAndComplete(pr, spr)
      )
    }
  }
}

Roland Kuhn

Oct 24, 2017, 2:23:07 AM
to akka...@googlegroups.com
You could try to decrease your thread pool size to 1 to exclude wakeup latencies when things (like CPU cores) have gone to sleep.

Regards, Roland 

Sent from my iPhone

Gary Malouf

Oct 30, 2017, 4:27:13 PM
to Akka User List
Hi Roland - thank you for the tip.  We shrunk the thread pool size down to 1, but were disheartened to still see the latency spikes.  Using Kamon's tracing library (which we validated with various tests to ensure its own numbers are most likely correct), we could not find anything in our code within the route that was causing the latency: the time was all attributed to the route itself, not to any code segment within it.

As mentioned earlier, running loads of 100-1000 requests/second completely hides the issue (save for the max latency) as everything through 99th percentiles is under a few milliseconds.



Gary Malouf

Nov 1, 2017, 3:56:50 PM
to Akka User List
So the only way I was able to successfully identify the suspicious code was to route a percentage of my production traffic to a stubbed route into which I incrementally added back pieces of our implementation.  What I found was that we started getting spikes when the entity(as[CaseClassFromJson]) directive was added back into the stub.  To figure out whether it was the JSON parsing or the 'POST' entity consumption itself, I replaced that class with a String; it turns out we experience the latency spikes with that as well (on low traffic, as noted earlier in this thread).

I by no means have a deep understanding of streams, but it makes me wonder if the way I have our code consuming the entity is not correct.

johannes...@lightbend.com

Nov 16, 2017, 6:57:42 AM
to Akka User List
Hi Gary,

Did you find out what's going on by now? If I understand correctly, you get latency spikes as soon as you use the `entity(as[String])` directive? Could you narrow down whether there's anything special about those requests? I guess you monitor your GC times?

Johannes

Gary Malouf

Nov 16, 2017, 7:28:23 AM
to akka...@googlegroups.com
Hi Johannes,

Yes; we are seeing 2-3 requests/second (only in production) with the latency spikes.  We found no correlation between the GC times and these request latencies, nor between the size/type of requests and the latencies.

We had to pause the migration effort for 2 weeks because of the time it was taking, but just jumped back on it the other day.

Our current strategy is to implement this with the low-level API to see if we get the same results.

Gary

johannes...@lightbend.com

Nov 16, 2017, 8:00:35 AM
to Akka User List
I wonder if you could start a timer when you enter the trace block and then e.g. after 200ms trigger one or multiple stack dumps (using JMX or just by printing out the result of `Thread.getAllStackTraces`). It's not super likely that something will turn up but it seems like a simple enough thing to try.
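
A minimal sketch of that idea, using only the JDK and the Scala standard library. The object name, the `arm` helper and its parameters are hypothetical names made up for this illustration, not part of akka-http or Kamon; the 200 ms threshold is the one mentioned above.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

object SlowRequestWatchdog {
  // Single daemon thread, so the watchdog never keeps the JVM alive.
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "slow-request-watchdog")
      t.setDaemon(true)
      t
    }
  })

  // Render the stack of every live thread via Thread.getAllStackTraces.
  def dumpAllStacks(): String = {
    val sb = new StringBuilder
    val it = Thread.getAllStackTraces.entrySet().iterator()
    while (it.hasNext) {
      val entry = it.next()
      sb.append(s"${entry.getKey.getName} (${entry.getKey.getState})\n")
      entry.getValue.foreach(frame => sb.append(s"    at $frame\n"))
      sb.append('\n')
    }
    sb.toString
  }

  // Schedule a dump `thresholdMs` from now and return a cancel function.
  // Call the returned function when the request completes; it returns true
  // if the dump was cancelled before it fired.
  def arm(label: String, thresholdMs: Long, onDump: String => Unit): () => Boolean = {
    val task = scheduler.schedule(new Runnable {
      def run(): Unit =
        onDump(s"[$label] still running after $thresholdMs ms\n" + dumpAllStacks())
    }, thresholdMs, TimeUnit.MILLISECONDS)
    () => task.cancel(false)
  }

  def main(args: Array[String]): Unit = {
    val cancel = arm("demo-request", 200, msg => println(msg))
    Thread.sleep(300) // simulate a request that overruns the threshold
    cancel()          // too late here, the dump has already been printed
  }
}
```

In a route, `arm` would be called on entering the trace block and the returned function when the response is completed, so only requests that exceed the threshold produce a dump.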

Johannes


Avshalom Manevich

Oct 7, 2018, 11:18:48 AM
to Akka User List
Hi Gary,

Did you end up finding a solution to this?

We're hitting a similar issue with Akka HTTP (10.0.11) and a low-load server.

Average latency is great but 99th percentile is horrible (~200ms).

Appreciate your input.

Regards,
Avshalom 

Gary Malouf

Oct 8, 2018, 4:41:59 PM
to akka...@googlegroups.com
We ultimately decided to roll out despite this glitch.  We are not happy about it, and are hoping that whatever is causing this gets resolved in a future release.  My hunch is that it's a fixed price being paid, one that would become unnoticeable if thousands more requests/second were sent to the app.



--
*****************************************************************************************************
** New discussion forum: https://discuss.akka.io/ replacing akka-user google-group soon.
** This group will soon be put into read-only mode, and replaced by discuss.akka.io
** More details: https://akka.io/blog/news/2018/03/13/discuss.akka.io-announced
*****************************************************************************************************

Johannes Rudolph

Oct 9, 2018, 5:18:17 AM
to akka...@googlegroups.com
That the entity directive is part of the picture could be a hint that streaming requests are indeed the cause of this. In spray, request streaming was not enabled by default: the engine collected the complete request into a buffer and dispatched it to the app only after everything was received. This has changed in akka-http, where streaming is on by default if the complete request wasn't received in one go from the network. The streaming case is actually more likely to happen on low-traffic servers with a real network, where network packets are not aggregated at lower levels but are processed immediately as they are received.

The question is still whether the 200ms are really added latency in akka-http or just an artifact of how request processing time is measured. There's definitely *some* overhead of processing a request in streaming fashion but it's not 200ms. I haven't checked seriously, but it seems that Kamon might be measuring something other than what you think in akka-http: it seems to start measuring the time from when the request is dispatched to your app, but at this point the request body might not have been received fully. That means that whenever the HTTP client is slow with sending a request for whatever reason, it will show in your request processing times.
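
As a worked illustration of that measurement effect, the sketch below models one request as three timestamps. It is a toy model and not akka-http or Kamon code; the type, its field names and all the millisecond values are assumptions invented for the example.

```scala
object RequestTimingModel {
  // All timestamps are milliseconds on one hypothetical server-side clock.
  final case class RequestTimeline(
    dispatchedAt: Long,   // headers parsed, request handed to the route
    bodyCompleteAt: Long, // last chunk of the request body received
    completedAt: Long     // response produced by the handler
  ) {
    // What a timer started at dispatch would report.
    def dispatchToComplete: Long = completedAt - dispatchedAt
    // Portion of that time spent waiting for the client to finish sending.
    def waitingForBody: Long = math.max(0L, bodyCompleteAt - dispatchedAt)
    // Time spent in the application once the whole body was available.
    def handlerTime: Long = completedAt - math.max(dispatchedAt, bodyCompleteAt)
  }

  def main(args: Array[String]): Unit = {
    // Body arrived together with the headers: both views agree.
    val oneGo = RequestTimeline(dispatchedAt = 0, bodyCompleteAt = 0, completedAt = 2)
    // Client delivered the rest of the body 200 ms after the headers.
    val slowClient = RequestTimeline(dispatchedAt = 0, bodyCompleteAt = 200, completedAt = 202)

    for (t <- Seq(oneGo, slowClient))
      println(s"measured=${t.dispatchToComplete} ms, waiting=${t.waitingForBody} ms, handler=${t.handlerTime} ms")
  }
}
```

In the second timeline a timer started at dispatch reports 202 ms even though the handler itself accounts for only 2 ms of it; the remaining 200 ms is time spent waiting for the client.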

Johannes

Avshalom Manevich

Oct 9, 2018, 10:33:43 AM
to Akka User List
Thanks Gary, Johannes.

Our application is pretty simple from the calling side perspective and involves no entity directive, just a String and an Integer segment.

We use an actor per request pattern to complete the request, adapted from https://markatta.com/codemonkey/posts/actor-per-request-with-akka-http (the second option).

Our workload is around 300 reqs per second.

We tried dumping the threads during the latency spike, as you suggested. Also tried tuning the thread pool sizes.

I'll update if we come up with any findings.

Avshalom