Does concurrency have some effect over HTTP client?


JuanPablo AJ

Oct 24, 2020, 1:31:21 PM
to golang-nuts
Hi,

I have some doubts related to the HTTP client.

I have an HTTP service whose handler makes a synchronous call to another HTTP service (serviceA). I will call this case 1.

func withoutGoroutine(w http.ResponseWriter, r *http.Request) {

  httpGetA()

  _, err := w.Write([]byte(``))
  if err != nil {
    log.Printf("%v", err)
  }
}

The same service has another handler that does the same as case 1, plus an additional asynchronous HTTP call to a different service (serviceB) using a different HTTP client. I will call this case 2.

func withGoroutine(w http.ResponseWriter, r *http.Request) {

  httpGetA()

  _, err := w.Write([]byte(``))
  if err != nil {
    log.Printf("%v", err)
  }

  go httpGetB()
}
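
For reference, httpGetA and httpGetB are thin wrappers around two separate HTTP clients. The exact code is in the repo linked below; a minimal sketch of what such a helper might look like is the following (the client value, URL, timeout, and the usual net/http, io, log and time imports are assumptions, not the repo's actual code):

var clientA = &http.Client{Timeout: 5 * time.Second}

// httpGetA performs a GET against serviceA and drains the body so the
// underlying connection can be reused by the client's Transport.
func httpGetA() {

  resp, err := clientA.Get("http://localhost:8081/")
  if err != nil {
    log.Printf("%v", err)
    return
  }
  defer resp.Body.Close()

  _, _ = io.Copy(io.Discard, resp.Body)
}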

I ran some load tests against these handlers, with the following results:

Case 1

    ab -n 500000 -c 50 0.0.0.0:8080/withoutgoroutine
    ...
    Percentage of the requests served within a certain time (ms)
      50%      3
      66%      4
      75%      4
      80%      4
      90%      5
      95%      7
      98%     10
      99%     11
     100%     38 (longest request)

Case 2

    ab -n 500000 -c 50 0.0.0.0:8080/withgoroutine
    ...
    Percentage of the requests served within a certain time (ms)
      50%      5
      66%      5
      75%      6
      80%      6
      90%      8
      95%     10
      98%     14
      99%     16
     100%     42 (longest request)

Why does case 2 have slower responses?
If in case 2 I add more asynchronous calls (two or three), the response times get worse and worse.
I would expect the same response times.

To be sure the problem is not the creation of goroutines itself, I created a third case: case 1 plus the creation of a goroutine, but without an HTTP call inside it.

func withSleepyGoroutine(w http.ResponseWriter, r *http.Request) {

  httpGetA()

  _, err := w.Write([]byte(``))
  if err != nil {
    log.Printf("%v", err)
  }

  go func() {
    time.Sleep(1 * time.Millisecond)
  }()
}

In cases 2 and 3, the goroutines are created after the HTTP response is sent.

I ran the load tests on different machines to rule out a network issue.

The full source code is available in this GitHub repo.


Thanks a lot for your help.

Regards

Jesper Louis Andersen

Oct 25, 2020, 7:07:58 AM
to JuanPablo AJ, golang-nuts
On Sat, Oct 24, 2020 at 7:30 PM JuanPablo AJ <jpab...@gmail.com> wrote:
 
I have some doubts related to the HTTP client.

First, if you have unexplained efficiency concerns in a program, you should profile and instrument. Make the system tell you what is happening rather than making guesses as to why. With that said, I have some hunches and experiments you might want to try out.
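
One concrete way to get that data in Go is the standard library's net/http/pprof package; importing it for its side effects is enough when the server registers its handlers on http.DefaultServeMux. A minimal sketch (the routes and address mirror the ab commands above; whether the repo actually wires things up this way is an assumption):

import _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux

func main() {
  http.HandleFunc("/withoutgoroutine", withoutGoroutine)
  http.HandleFunc("/withgoroutine", withGoroutine)
  log.Fatal(http.ListenAndServe("0.0.0.0:8080", nil))
}

A CPU profile can then be pulled while the load test runs with `go tool pprof http://0.0.0.0:8080/debug/pprof/profile`.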

When you perform a load test, you have a SUT, or system-under-test. That is the whole system, including the infrastructure around it. It can be a single program, or a cluster of machines. You also have a load generator, which generates load on your SUT in order to test different aspects of the SUT: bandwidth usage, latency in response, capacity limits, resource limits, etc.[1] Your goal is to figure out whether the data you are seeing are within an acceptable range for your use case, or whether you have to work more on the system to make it fall within the acceptable window.

Your test is about RTT latency of requests. This will become important.

One particular problem in your test is that the load generator and the SUT run in the same environment. If the test is simple and you are trying to stress the system maximally, chances are that the load generator impacts the SUT. That means the latency will rise due to time sharing in the operating system.

Second, when measuring latency you should look out for the problem Gil Tene coined "coordinated omission". In CO, the problem is that the load generator and the SUT cooperate to deliver the wrong latency counts. This is especially true if you just fire as many requests as possible on 50 connections. Under an overload situation, the system will suffer in latency since that is the only way the system can alleviate pressure. The problem with CO is that a server can decide to park a couple of requests and handle the other requests as fast as possible. This can lead to a high number of requests on the active connections, while the stalled connections become noise in the statistics. You can look up Tene's `wrk2` project, but I think the ideas were baked back into Will Glozer's wrk at a later point in time (memory eludes me).
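
Unlike the ab runs above, which fire as fast as they can, wrk2 has you fix the request rate up front, which is what sidesteps coordinated omission. A typical invocation looks something like this (the thread count, rate, and duration are placeholders, not recommendations):

    wrk -t2 -c50 -d60s -R1000 --latency http://0.0.0.0:8080/withoutgoroutine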

The third point is about the sensitivity of your tests: when you measure things at the millisecond, microsecond or nanosecond range, your test becomes far more susceptible to foreign impact. You can generally use statistical bootstrapping to measure the impact this has on test variance, which I've done in the past. You start finding all kinds of interesting corner cases that perturb your benchmarks. Among the more surprising ones:

* CPU Scaling governors
* Turbo boosting: one core can be run at a higher clock frequency than a cluster. GC in Go is multicore, so even for a single-core program, this might have an effect
* CPU heat. Laptop CPUs have miserable thermal cooling compared to a server or desktop. They can run fast in small bursts, but not for longer stretches
* Someone using the computer while doing the benchmark
* An open browser window which runs some Javascript in the background
* An open electron app with a rendering of a .gif or .webm file
* Playing music while performing the benchmark, yielding CPU power to the MP3, Vorbis or AAC decoder
* Amount of incoming network traffic to process for a benchmark that has nothing to do with the network

Finally, asynchronous goroutines are still work the program needs to execute. They aren't free. So as the system is stressed with a higher load, you run up against the capacity limit sooner, thus incurring slower response times. In the case where you perform requests in the background to another HTTP server, you are taking a slice of the available resources. You are also generating as much work internally as is coming in externally. In a real-world server, this is usually a bad idea and you must put a resource limit in place. Otherwise an aggressive client can overwhelm your server. The trick is to slow the caller down by *not* responding right away if you are overloaded internally.
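
A common way to put such a limit in place in Go is a semaphore built from a buffered channel; a rough sketch (the limit of 100 and the drop-on-full policy are arbitrary choices for illustration, not a recommendation):

// sem limits the number of in-flight background calls to serviceB.
var sem = make(chan struct{}, 100)

func withBoundedGoroutine(w http.ResponseWriter, r *http.Request) {

  httpGetA()

  _, err := w.Write([]byte(``))
  if err != nil {
    log.Printf("%v", err)
  }

  select {
  case sem <- struct{}{}:
    go func() {
      defer func() { <-sem }()
      httpGetB()
    }()
  default:
    // Over the limit: shed the background call, or block here instead to push back on the caller.
    log.Printf("dropping call to serviceB: too many in flight")
  }
}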

You should check your kernel. When you perform a large number of requests on the same machine, you can run into limits on the number of TCP source ports if they are rotated too fast. It is a common problem when the load generator and the SUT are on the same host.
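
On Linux, the available source-port range (and therefore the rough ceiling on concurrent outgoing connections to a single destination) can be inspected with something like:

    sysctl net.ipv4.ip_local_port_range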

You should check your HTTP client configuration as well. One way to avoid the above problem is to maximize connection reuse, but then you risk head-of-line blocking on the connections, even (or perhaps even more so) in the HTTP/2 case.
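
In Go, connection reuse is mostly governed by the Transport of your http.Client; a hedged sketch of dialing it up (the numbers are illustrative only, and response bodies must still be read and closed for reuse to happen at all):

var clientB = &http.Client{
  Timeout: 5 * time.Second,
  Transport: &http.Transport{
    MaxIdleConns:        100,              // idle connections kept across all hosts
    MaxIdleConnsPerHost: 50,               // the default of 2 forces connection churn under 50 concurrent callers
    IdleConnTimeout:     90 * time.Second, // how long an idle connection is kept around
  },
}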

But above all: instrument, profile, observe. Nothing beats data and plots.

[1] SLI, SLOs etc. A good starting point is https://landing.google.com/sre/sre-book/chapters/service-level-objectives/ but that book is worth it for a full read. https://landing.google.com/sre/books/ too!


JuanPablo AJ

Oct 26, 2020, 3:21:42 PM
to golang-nuts
Jesper, 
thanks a lot for your email; your answer was a helping hand in the dark forest of doubts.

I will start trying the load generator wrk2.

About "instrument, profile, observe", yes, I added the gops agent but until now I don't have any conclusion related to that information.

Regards.


Jesper Louis Andersen

Oct 27, 2020, 5:59:22 AM
to JuanPablo AJ, golang-nuts
On Mon, Oct 26, 2020 at 8:21 PM JuanPablo AJ <jpab...@gmail.com> wrote:
Jesper, 
thanks a lot for your email; your answer was a helping hand in the dark forest of doubts.

I will start trying the load generator wrk2.

About "instrument, profile, observe", yes, I added the gops agent but until now I don't have any conclusion related to that information.


I'm a proponent of adding metrics to the code running in your production systems. If the system has low load, you can certainly pay the overhead of such metrics. If the system has high load, you can always sample and only record a fraction of requests (say 1%). I'm happy to pay a cost of 5-10% on my production systems if that sacrifice means I know what is going on. Observability is formally defined as a way to determine the state of a system based on its outputs[0]. If you start having metrics alongside your genuine program output, you stand a far better chance of figuring out what is going on inside the system. Also, metrics tend to be proactive: problems can show themselves in metrics long before the critical threshold of system failure is hit.
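
A hedged sketch of that kind of sampling as Go middleware (the 1% rate is just an example, and observeLatency is a hypothetical stand-in for whatever your metrics package provides):

// sampledLatency wraps a handler and records latency for roughly 1% of requests,
// keeping the measurement overhead negligible under high load.
func sampledLatency(next http.HandlerFunc) http.HandlerFunc {
  return func(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    next(w, r)
    if rand.Float64() < 0.01 {
      observeLatency(time.Since(start)) // hypothetical helper; plug in your metrics library here
    }
  }
}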

Good algorithms and data structures that your metrics package could use, either directly or as a variant thereof:

* Vitter's algorithm R. It is related to a Fisher-Yates shuffle in a peculiar and interesting way, though you may have to drop or decay the reservoir unless you are measuring the whole window. (A small sketch follows after this list.)
* Gil Tene's HdrHistogram. This essentially tracks a histogram based on the observation of floating point numbers: if we regard the exponent as buckets, each containing a set of mantissa buckets, we can quickly increment a bucket (a few nanoseconds). And the exponent nature means we have high resolution close to 0 and less resolution away from 0. But this is often what one wants: if something takes 5 minutes, you often don't care if it was 5 minutes and 34 microseconds, so the approximation is sound. HdrHistogram also supports some nice algebraic properties such as merging (it forms a commutative monoid with the empty histogram as the neutral element and merging as the composition operation).
* HyperLogLog-based data structure ideas: accept approximate values in exchange for much smaller data storage needs.
* Decay ideas: if you keep a pair of (value, timestamp), you can decay the value over time according to some curve you decide on. Keep an array of these and you can track the most popular items efficiently. Periodically go through the array and weed out any value that has decayed below a noise floor, to keep the array small.
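
A minimal sketch of the first of these, Vitter's algorithm R, keeping a fixed-size uniform sample of observed latencies (the reservoir size, the use of math/rand, and the time.Duration element type are assumptions; a real metrics package would also decay or reset the reservoir over time and handle concurrent writers):

// Reservoir keeps a uniform random sample of at most k observations.
type Reservoir struct {
  k    int
  n    int // observations seen so far
  vals []time.Duration
}

func NewReservoir(k int) *Reservoir {
  return &Reservoir{k: k, vals: make([]time.Duration, 0, k)}
}

// Observe implements algorithm R: the first k values fill the reservoir; after
// that, the i-th value replaces a random slot with probability k/i.
func (r *Reservoir) Observe(d time.Duration) {
  r.n++
  if len(r.vals) < r.k {
    r.vals = append(r.vals, d)
    return
  }
  if j := rand.Intn(r.n); j < r.k {
    r.vals[j] = d
  }
}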

I'm not saying you should implement these things yourself. I'm saying that a good metrics package will do that for you, and you should use it. The key is to figure out which metrics your application needs, and then add those. The SRE handbooks I linked earlier have some good starting points on what to measure. But nothing beats having knowledge of the internals of a system so you can add the better metrics yourself. At-a-glance blackbox metrics are nice; however, they often simply tell you that something is wrong, but not what.

In general, descriptive statistics is the tool you need to understand system behavior in the modern world. Infrastructures are simply too complex nowadays. For more pinpoint understanding, a profiler might work really well, but the more concurrency a system has, the harder it is to glean anything meaningful from a profile[1].



[0] Hat tip to Charity Majors for recognizing this from control theory.
[1] This is the same reason debuggers can have a hard time in a distributed setting. Your program is halted, but half of the program lives behind an API not under your control. And the timeout is lurking.

