Different latency inside and outside


Alexander Petrovsky

Mar 18, 2017, 3:52:21 PM
to golang-nuts
Hello!

Colleagues, I need your help!

I have an application that accepts dynamic JSON over HTTP (fasthttp), unmarshals it into a map[string]interface{} using ffjson, reads some fields into a struct, makes some calculations using that struct, writes the struct fields back into the map[string]interface{}, writes the map to Kafka (asynchronously), and finally replies to the client over HTTP. I also have 2 caches, one with 100 million items and a second with 20 million; these caches are built with freecache to avoid slooooow GC pauses. The incoming rate is 4k rps per server (5 servers in total), and total CPU utilisation is about 15% per server.
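As a side note, a minimal sketch of the freecache pattern mentioned above: entries live in large pre-allocated byte slices, which is what keeps millions of items off the GC's scan list. The 512 MB size and the key/value below are made up for illustration.

package main

import (
	"fmt"

	"github.com/coocood/freecache"
)

func main() {
	// freecache keeps entries in pre-allocated memory segments, so the GC
	// does not have to scan millions of individual pointers.
	cache := freecache.NewCache(512 * 1024 * 1024) // 512 MB

	// expireSeconds <= 0 means the entry does not expire.
	if err := cache.Set([]byte("user:42"), []byte(`{"segment":7}`), 0); err != nil {
		fmt.Println("set failed:", err)
	}
	if v, err := cache.Get([]byte("user:42")); err == nil {
		fmt.Println("hit:", string(v))
	}
}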

The problem: my latency measurements show that the latency inside the application is significantly less than the latency outside.
1. How do I measure latency?
    - I've added timings inside the HTTP handler functions, and I build graphs from them.
2. How did I find out that the latency inside the application is significantly less than outside?
    - I installed nginx in front of my application and log $request_time and $upstream_response_time, and graph those as well.

These graphs show that the latency inside the application is about 500 microseconds at the 99th percentile, but about 10-15 milliseconds outside (nginx). Nginx and my app run on the same server. My graphs also show that GC occurs every 30-40 seconds and takes less than 3 milliseconds.
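For illustration, the per-handler timing described in point 1 might look roughly like this (a sketch assuming a fasthttp handler; the record callback is just a placeholder for whatever feeds the percentile graphs):

package main

import (
	"time"

	"github.com/valyala/fasthttp"
)

// timed wraps a handler and reports how long the handler body took.
// Note: request parsing, scheduling and network time are not visible here.
func timed(next fasthttp.RequestHandler, record func(time.Duration)) fasthttp.RequestHandler {
	return func(ctx *fasthttp.RequestCtx) {
		start := time.Now()
		next(ctx)
		record(time.Since(start))
	}
}

func main() {
	handler := func(ctx *fasthttp.RequestCtx) {
		ctx.SetBodyString(`{"ok":true}`)
	}
	record := func(d time.Duration) { /* feed the percentile graphs here */ }
	fasthttp.ListenAndServe(":8080", timed(handler, record))
}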



Could someone help me find the problem and profile my application?

David Collier-Brown

Mar 18, 2017, 8:28:13 PM
to golang-nuts
Are you seeing the average response time / latency of the cache from outside? 

If so, you should see lots of really quick responses, and a few slow ones that average out to what you're seeing.

--dave

Konstantin Shaposhnikov

Mar 19, 2017, 7:19:36 AM
to golang-nuts
Hi,

External measurements probably show a more accurate picture.

First of all, internal latency numbers only include time spent doing the actual work; they don't include HTTP parsing (by net/http) or network overhead.

Secondly, latency measured internally always looks better because it doesn't include application stalls that happened outside of the measured code. Imagine that it takes 10ms for net/http to parse the request (e.g. due to a STW pause) and 1ms to run the handler. The real request latency is 11ms in this case, but if measured internally it is only 1ms. This is known as coordinated omission.

I recommend watching this video for lots of useful information about latency measurement: https://www.youtube.com/watch?v=lJ8ydIuPFeU

Konstantin

Alexander Petrovsky

Mar 19, 2017, 10:47:53 AM
to golang-nuts
Hello, Dave!

On Sunday, 19 March 2017 at 3:28:13 UTC+3, David Collier-Brown wrote:
Are you seeing the average response time / latency of the cache from outside? 

I don't calculate an average, I'm using percentiles! It looks like the "cache" doesn't affect things at all, otherwise I would see that on my graphs, since I'm calling my cache inside the HTTP handler, between the timings.
 
If so, you should see lots of really quick responses, and a few slow ones that average out to what you're seeing.

No, as I said, I'm using only percentiles, not averages.

Jesper Louis Andersen

Mar 19, 2017, 11:09:17 AM
to Alexander Petrovsky, golang-nuts
My approach is usually this:

When a problem like this occurs, I very quickly switch from random guessing at what the problem can be into a mode where I try to verify the mental model I have of the system. Your mental model is likely wrong, and thus it is leading you astray in what the problem might be. So I start devising metrics that can support the mental model I have. Often, when your model is corrected, you start understanding the pathology of the system. I tend to start from the bottom and work up through the layers, trying to verify in each layer that I'm seeing behavior that isn't out of the ordinary from the mental model I have.

* At 4000 req/s, we are implicitly assuming that each request looks the same; otherwise that is a weak metric as an indicator of system behavior. Are they the same and do they take the same amount of work? If we log the slowest request every 5 seconds, what does it look like compared to a typical one?
* The 99th percentile ignores the 40 slowest queries. What do the 99.9th, 99.99th, ... and max percentiles look like?
* What lies between the external measurement and the internal measurement? Can we inject a metric for each of those?
* The operating system and environment are only doing work for us, and not for someone else (which could happen if the machine is virtualized or some other workload is running).
* There is enough bandwidth.
* Caches have hit/miss rates that look about right.
* The cache also caches negative responses. That is, if an element is not present in the backing store, a lookup in the cache will not fail on repeated requests and go to the backing store each time.
* 15% CPU load means we are spending ample amounts of time waiting. What are we waiting on? Start measuring foreign support systems further down the chain. Don't trust your external partners. Especially if they are a network connection away. What are the latencies for the waiting down the line?
* Are we measuring the right thing in the internal measurements? If the window between external/internal is narrow, then chances are we are doing the wrong thing on the internal side.


Google's SRE handbook mentions the 4 "golden" metrics (latency, traffic, errors, saturation). If nothing else, measuring those on a system can often tell you if it is behaving or not.


Alexander Petrovsky

Mar 19, 2017, 11:30:13 AM
to golang-nuts
Hello, Konstantin!

On Sunday, 19 March 2017 at 14:19:36 UTC+3, Konstantin Shaposhnikov wrote:
Hi,

External measurements probably show a more accurate picture.

Of course!
 

First of all internal latency numbers only include time spent doing actual work but don't include HTTP parsing (by net/http) and network overhead.
 
Yep, I absolutely agree with you, but I don't use net/http, I'm using fasthttp (jfyi). I don't believe that HTTP parsing can take more than microseconds, and the network overhead on the local machine is insignificantly small!
 
Secondly latency measured internally always looks better because it doesn't include application stalls that happened outside of the measured code.

Agree!
 
Imagine that it takes 10ms for net/http to parse the request (e.g. due to a STW pause) and 1ms to run the handler. The real request latency is 11ms in this case, but if measured internally it is only 1ms. This is known as coordinated omission.

As I said earlier, I don't believe that HTTP parsing and other "run-time" stuff can take 10ms; that would be unacceptable! For example, let's suppose this situation does take place: why don't I see similar spikes in both graphs (nginx latency, myapp latency), just with a different order of magnitude? Here is part of my nginx log sampling:

# cat access.log-20170318 | grep "17/Mar/2017:03:42:17" | awk '{ print $15,$16 }' | sort | uniq -c
   2056 0.000 0.000
    200 0.001 0.000
   1313 0.001 0.001
      3 0.002 0.001
      9 0.002 0.002
      5 0.003 0.003
      3 0.004 0.004
      4 0.005 0.005
      5 0.006 0.006
      4 0.007 0.007
      2 0.008 0.007
      5 0.008 0.008
      1 0.009 0.009

As you can see, your hypothesis is not true: more than 99 percent of requests are really fast and take less than 1 millisecond! And I'm trying to find out what happens in that 1 percent!
 
I recommend watching this video for lots of useful information about latency measurement: https://www.youtube.com/watch?v=lJ8ydIuPFeU

I've started watching this video, thanks. One thing I want to share: I agree that measuring latency only inside my handler function is not right, and the main question is how I can measure the latency in other parts of my application. That is the main question in this topic!

Alexander Petrovsky

Mar 19, 2017, 11:58:38 AM
to golang-nuts, askj...@gmail.com
Hello, Jesper! Nice to see you outside the Erlang community too!


On Sunday, 19 March 2017 at 18:09:17 UTC+3, Jesper Louis Andersen wrote:
My approach is usually this:

When a problem like this occurs, I very quickly switch from random guessing at what the problem can be into a mode where I try to verify the mental model I have of the system. Your mental model is likely wrong, and thus it is leading you astray in what the problem might be. So I start devising metrics that can support the mental model I have. Often, when your model is corrected, you start understanding the pathology of the system. I tend to start from the bottom and work up through the layers, trying to verify in each layer that I'm seeing behavior that isn't out of the ordinary from the mental model I have.

I absolutely agree with you on that. First put forward a hypothesis, then try to confirm or disprove it! The problem is, I have no more hypotheses!
 

* At 4000 req/s, we are implicitly assuming that each request looks the same; otherwise that is a weak metric as an indicator of system behavior. Are they the same and do they take the same amount of work? If we log the slowest request every 5 seconds, what does it look like compared to a typical one?

All the requests are the same and have the same behavior! I log all requests and they are all similar.
 
* The 99th percentile ignores the 40 slowest queries. What do the 99.9th, 99.99th, ... and max percentiles look like?

I have no answer to this question. And I don't know how it can help me.
 
* What lies between the external measurement and the internal measurement? Can we inject a metric for each of those?

Yep, that's also the main question! I log and graph nginx $request_time, and log and graph the internal function time. What lies in between, I can't log; it's:
 - the local network (TCP);
 - work in kernel/user space;
 - golang GC and other runtime work;
 - golang fasthttp machinery before my HTTP handler is called.
 
* The operating system and environment is only doing work for us, and not for someone else because it is virtualized, or some other operation is running.

Only for us! There is no other application that can impact my application's performance!
 
* There is enough bandwidth.

It looks like there is enough bandwidth; my graphs show me that. And as far as I know, the local network inside a single server can't affect application performance that much.
 
* Caches have hit/miss rates that looks about right.

In my application these are not true caches; really they are dictionaries loaded from the database and used in the calculations.
 
* The cache also caches negative responses. That is, if an element is not present in the backing store, a lookup in the cache will not fail on repeated requests and go the said backing store.

(see my answer earlier)
 
* 15% CPU load means we are spending ample amounts of time waiting. What are we waiting on?

Maybe, or maybe 32 cores can simply process 4k rps easily. How can I find out what my app is waiting on?
 
Start measuring foreign support systems further down the chain. Don't trust your external partners. Especially if they are a network connection away. What are the latencies for the waiting down the line?

Yep, I measure latency on my side using nginx: I log $request_time and graph it after that.
 
* Are we measuring the right thing in the internal measurements? If the window between external/internal is narrow, then chances are we are doing the wrong thing on the internal side.

Could you explain this?

Tamás Gulácsi

Mar 19, 2017, 1:00:25 PM
to golang-nuts
As fasthttp does not even follow the specs, you cannot assume that all requests are parsed the same way until you prove it.

Alexander Petrovsky

Mar 19, 2017, 1:44:46 PM
to golang-nuts
As far as I know it doesn't matter: I used net/http before and the situation didn't change, except for the number of allocations, which went down.

Could you please point out where fasthttp doesn't follow the specs?

On Sunday, 19 March 2017 at 20:00:25 UTC+3, Tamás Gulácsi wrote:

Jesper Louis Andersen

Mar 19, 2017, 1:46:16 PM
to Alexander Petrovsky, golang-nuts
On Sun, Mar 19, 2017 at 4:58 PM Alexander Petrovsky <askj...@gmail.com> wrote:
 
* The 99th percentile ignores the 40 slowest queries. What do the 99.9th, 99.99th, ... and max percentiles look like?

I have no answer to this question. And I don't know how it can help me.

Usually, the maximum latency is a better indicator of trouble than a 99th percentile in my experience. If you improve the worst case, then surely the other cases are likely to follow. However, there are situations where this will hurt the median 50th percentile latency. Usually this trade-off is okay, but there are a few situations where it might not be.

Yep, that's also the main question! I log and graph nginx $request_time, and log and graph the internal function time. What lies in between, I can't log; it's:
 - the local network (TCP);
 - work in kernel/user space;
 - golang GC and other runtime work;
 - golang fasthttp machinery before my HTTP handler is called.

The kernel and GC can be dynamically inspected. I'd seriously consider profiling as well in a laboratory environment. Your hypothesis is that none of these have a discrepancy, but they may have.
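For example, a minimal sketch of polling the GC pause history from inside the process (running the binary with GODEBUG=gctrace=1 prints similar information on stderr without any code changes):

package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

func main() {
	for range time.Tick(10 * time.Second) {
		var stats debug.GCStats
		debug.ReadGCStats(&stats)
		// stats.Pause holds recent pause durations, most recent first.
		recent := stats.Pause
		if len(recent) > 3 {
			recent = recent[:3]
		}
		fmt.Printf("numGC=%d totalPause=%v recentPauses=%v\n",
			stats.NumGC, stats.PauseTotal, recent)
	}
}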
 
* Caches have hit/miss rates that looks about right.

In my application these are not true caches; really they are dictionaries loaded from the database and used in the calculations.

Perhaps the code in https://godoc.org/golang.org/x/text is of use for this? It tends to be faster than maps because it utilizes compact string representations and tries. Of course, it requires you show that the problem is with the caching sublayer first.
 
* 15% CPU load means we are spending ample amounts of time waiting. What are we waiting on?

Maybe, or maybe 32 cores can simply process 4k rps easily. How can I find out what my app is waiting on?

blockprofile is my guess at what I would grab first. Perhaps the tracing functionality as well. You can also add metrics at each blocking point in order to get an idea of where the system is going off track. Functionality like dtrace would be nice, but I'm not sure Go has it, unfortunately.
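For reference, a minimal sketch of enabling the block profile and exposing it via net/http/pprof (the sampling rate and port below are only examples):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers, including /debug/pprof/block
	"runtime"
)

func main() {
	// A rate of 1 records every blocking event; use a larger value in
	// production to reduce overhead.
	runtime.SetBlockProfileRate(1)

	// Serve pprof on a side port, then inspect with:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}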
 
 
* Are we measuring the right thing in the internal measurements? If the window between external/internal is narrow, then chances are we are doing the wrong thing on the internal side.

Could you explain this?

There may be a bug in the measurement code, so you should probably go over it again. One common fault of mine is to place the measurement around the wrong functions, so they are detecting more than I think they are. A single regular expression that is only hit in corner cases can be enough to mess with a performance profile. Another common mistake is to not have an appropriate decay parameter on your latency measurements, so that older requests eventually get removed from the latency graph [0].
 
In general, as the amount of work a system processes goes up, it gets more sensitive to fluctuations in latency. So even at a fairly low CPU load, you may still have some spiky behavior hidden by a smoothing of the CPU load measure, and this can contribute to added congestion.

[0] A decaying Vitter's algorithm R implementation, or Tene's HdrHistogram is preferable. HdrHistogram is interesting in that it uses a floating-point representation for its counters: one array for exponents, one array for mantissa. It allows very fast accounting (nanoseconds) and provides precise measurements around 0 at the expense of precision at, say, 1 hour. It is usually okay because if you waited 1 hour, you don't care if it was really 1 hour and 3 seconds. But at 1us, you really care about being precise.
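For illustration, a minimal sketch of recording request latencies into the Go port of HdrHistogram (assuming the github.com/HdrHistogram/hdrhistogram-go package; the 1µs-1min bounds and 3 significant figures are just example settings):

package main

import (
	"fmt"
	"time"

	hdrhistogram "github.com/HdrHistogram/hdrhistogram-go"
)

func main() {
	// Track values between 1µs and 1 minute with 3 significant figures.
	h := hdrhistogram.New(1, int64(time.Minute/time.Microsecond), 3)

	record := func(d time.Duration) {
		_ = h.RecordValue(int64(d / time.Microsecond))
	}

	// Simulated samples; in the real app this would be called per request.
	record(450 * time.Microsecond)
	record(12 * time.Millisecond)

	fmt.Printf("p99=%dµs p99.9=%dµs max=%dµs\n",
		h.ValueAtQuantile(99), h.ValueAtQuantile(99.9), h.Max())
}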

Konstantin Shaposhnikov

Mar 19, 2017, 2:15:25 PM
to golang-nuts

As you can see, your hypothesis is not true: more than 99 percent of requests are really fast and take less than 1 millisecond! And I'm trying to find out what happens in that 1 percent!
 

I was probably not clear enough in my explanation. In 99% of cases net/http (or fasthttp) parsing will be very fast (a few micros) and won't add much to the internally measured latency. However, in 1% of cases there could be a GC stop-the-world pause, or the Go runtime may decide to use the request goroutine to assist the GC, or there is a sub-optimal scheduling decision or I/O, and the request will take longer, but this will never be reflected in the measured time.

https://golang.org/cmd/trace/ can be used to find out what is happening inside a running Go application. If you capture a trace during an interval where requests take more time than usual, then you will be able to find out what exactly takes so long (syscalls, scheduler, GC, etc.).
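For reference, a minimal sketch of capturing an execution trace programmatically; the resulting file is viewed with "go tool trace trace.out". (With net/http/pprof enabled, the same data is also available at /debug/pprof/trace?seconds=5 without code changes.)

package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Trace for a short window while the application serves traffic.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	time.Sleep(5 * time.Second)
	trace.Stop()
}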

Also note that there are still a few latency related bugs in the Go runtime. E.g. https://github.com/golang/go/issues/14812, https://github.com/golang/go/issues/18155, https://github.com/golang/go/issues/18534

Tamás Gulácsi

Mar 19, 2017, 2:48:45 PM
to golang-nuts
The FAQ in https://github.com/valyala/fasthttp/blob/master/README.md says "net/http handles more HTTP corner cases".
For me, that means not following the specs.

a.pet...@propellerads.net

Mar 20, 2017, 12:27:46 AM
to golang-nuts, askj...@gmail.com


On Sunday, 19 March 2017 at 20:46:16 UTC+3, Jesper Louis Andersen wrote:
On Sun, Mar 19, 2017 at 4:58 PM Alexander Petrovsky <askj...@gmail.com> wrote:
 
* The 99th percentile ignores the 40 slowest queries. What do the 99.9th, 99.99th, ... and max percentiles look like?

I have no answer to this question. And I don't know how it can help me.

Usually, the maximum latency is a better indicator of trouble than a 99th percentile in my experience. If you improve the worst case, then surely the other cases are likely to follow. However, there are situations where this will hurt the median 50th percentile latency. Usually this trade-off is okay, but there are a few situations where it might not be.

Got it!
 

Yep, that's also the main question! I log and graph nginx $request_time, and log and graph the internal function time. What lies in between, I can't log; it's:
 - the local network (TCP);
 - work in kernel/user space;
 - golang GC and other runtime work;
 - golang fasthttp machinery before my HTTP handler is called.

The kernel and GC can be dynamically inspected. I'd seriously consider profiling as well in a laboratory environment. Your hypothesis is that none of these have a discrepancy, but they may have.

If I understood you correctly, I think the problem is somewhere there...
 
 
* Caches have hit/miss rates that looks about right.

In my application these are not true caches; really they are dictionaries loaded from the database and used in the calculations.

Perhaps the code in https://godoc.org/golang.org/x/text is of use for this? It tends to be faster than maps because it utilizes compact string representations and tries. Of course, it requires you show that the problem is with the caching sublayer first.

The dictionary is not a dictionary of words; it's a dictionary in the database sense: some keys (ids) mapped to one or more values.
 
* 15% CPU load means we are spending ample amounts of time waiting. What are we waiting on?

Maybe, or maybe 32 cores can simply process 4k rps easily. How can I find out what my app is waiting on?

blockprofile is my guess at what I would grab first. Perhaps the tracing functionality as well. You can also add metrics at each blocking point in order to get an idea of where the system is going off track. Functionality like dtrace would be nice, but I'm not sure Go has it, unfortunately.

Thanks a lot, I will!
 
 
 
* Are we measuring the right thing in the internal measurements? If the window between external/internal is narrow, then chances are we are doing the wrong thing on the internal side.

Could you explain this?

There may be a bug in the measurement code, so you should probably go over it again. One common fault of mine is to place the measurement around the wrong functions, so they are detecting more than I think they are. A single regular expression that is only hit in corner cases can be enough to mess with a performance profile. Another common mistake is to not have an appropriate decay parameter on your latency measurements, so that older requests eventually get removed from the latency graph [0].
 
In general, as the amount of work a system processes goes up, it gets more sensitive to fluctuations in latency. So even at a fairly low CPU load, you may still have some spiky behavior hidden by a smoothing of the CPU load measure, and this can contribute to added congestion.

[0] A decaying Vitter's algorithm R implementation, or Tene's HdrHistogram is preferable. HdrHistogram is interesting in that it uses a floating-point representation for its counters: one array for exponents, one array for mantissa. It allows very fast accounting (nanoseconds) and provides precise measurements around 0 at the expense of precision at, say, 1 hour. It is usually okay because if you waited 1 hour, you don't care if it was really 1 hour and 3 seconds. But at 1us, you really care about being precise.

Could you please explain what you mean when you say "Another common mistake is to not have an appropriate decay parameter on your latency measurements, so that older requests eventually get removed from the latency graph [0]"? Why should older requests be removed from the latency graph? As far as I know, HdrHistogram is a good fit for high-precision measurements and graphs spanning different orders of magnitude. How can it help me?
 

a.pet...@propellerads.net

Mar 20, 2017, 12:27:46 AM
to golang-nuts


On Sunday, 19 March 2017 at 21:15:25 UTC+3, Konstantin Shaposhnikov wrote:

As you can see, your hypothesis is not true: more than 99 percent of requests are really fast and take less than 1 millisecond! And I'm trying to find out what happens in that 1 percent!
 

I was probably not clear enough in my explanation. In 99% of cases net/http (or fasthttp) parsing will be very fast (a few micros) and won't add much to the internally measured latency. However, in 1% of cases there could be a GC stop-the-world pause, or the Go runtime may decide to use the request goroutine to assist the GC, or there is a sub-optimal scheduling decision or I/O, and the request will take longer, but this will never be reflected in the measured time.

Ack!
 

https://golang.org/cmd/trace/ can be used to find out what is happening inside a running Go application. If you capture a trace during an interval where requests take more time than usual, then you will be able to find out what exactly takes so long (syscalls, scheduler, GC, etc.).

Thanks, I'll try to use it for my purposes!
 

Also note that there are still a few latency related bugs in the Go runtime. E.g. https://github.com/golang/go/issues/14812, https://github.com/golang/go/issues/18155, https://github.com/golang/go/issues/18534

Thanks, I will investigate them too!