First, if you have unexplained efficiency concerns in a program, you should profile and instrument. Make the system tell you what is happening rather than making guesses as to why. With that said, I have some hunches and experiments you might want to try out.
When you perform a load test, you have a SUT, or system-under-test. That is the whole system, including the infrastructure around it. It can be a single program, or a cluster of machines. You also have a load generator, which generates load on your SUT in order to test different aspects of it: bandwidth usage, response latency, capacity limits, resource limits, etc[1]. Your goal is to figure out if the data you are seeing are within an acceptable range for your use case, or if you have to work more on the system to make it fall within the acceptable window.
Your test is about RTT latency of requests. This will become important.
One particular problem in your test is that the load generator and the SUT run in the same environment. If the test is simple and you are trying to stress the system maximally, chances are that the load generator impacts the SUT. That means latency will rise due to time sharing in the operating system.
Second, when measuring latency you should look out for the problem Gil Tene coined "coordinated omission". In CO, the problem is that the load generator and the SUT cooperate to report misleading latency numbers. This is especially true if you just fire as many requests as possible over 50 connections. Under an overload situation, the system will suffer in latency, since that is the only way it can alleviate pressure. The problem with CO is that a server can decide to park a couple of requests and handle the other requests as fast as possible. This can lead to a high number of requests on the active connections, while the stalled connections become noise in the statistics. You can look up Tene's `wrk2` project, but I think the ideas were baked back into Will Glozer's wrk at a later point in time (memory eludes me).
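To make the point concrete, here is a minimal sketch of a rate-paced load generator in Go that avoids coordinated omission by measuring each request against its *intended* send time rather than the moment it actually went out. The target URL, request rate and duration are assumptions for illustration, not part of your setup:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const (
		target   = "http://localhost:8080/" // hypothetical SUT endpoint
		rate     = 100                      // intended requests per second
		duration = 10 * time.Second
	)

	interval := time.Second / rate
	start := time.Now()
	latencies := make([]time.Duration, 0, rate*int(duration/time.Second))

	client := &http.Client{Timeout: 5 * time.Second}

	for i := 0; ; i++ {
		intended := start.Add(time.Duration(i) * interval)
		if intended.Sub(start) >= duration {
			break
		}
		// Wait until the intended send time; if we are already running
		// late, send immediately but still measure from the schedule.
		time.Sleep(time.Until(intended))

		resp, err := client.Get(target)
		if err == nil {
			resp.Body.Close()
		}
		// Latency includes any backlog delay, not just server time.
		latencies = append(latencies, time.Since(intended))
	}

	fmt.Printf("collected %d samples\n", len(latencies))
}
```

If the SUT stalls, the loop falls behind schedule and every late request is charged the full backlog delay, which is exactly the penalty a fire-as-fast-as-you-can generator hides.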
The third point is about the sensitivity of your tests: when you measure things in the millisecond, microsecond or nanosecond range, your test becomes far more susceptible to outside interference. You can generally use statistical bootstrapping to measure the impact this has on test variance, which I've done in the past (there is a sketch of it after the list below). You start finding all kinds of interesting corner cases that perturb your benchmarks. Among the more surprising ones:
* CPU Scaling governors
* Turbo boosting: a single core can be clocked higher than the whole cluster of cores can sustain. Go's GC runs on multiple cores, so even for a program that otherwise uses one core, this might have an effect
* CPU heat. Laptop CPUs have miserable thermal cooling compared to a server or desktop. They can run fast in small bursts, but not for longer stretches
* Someone using the computer while doing the benchmark
* An open browser window which runs some Javascript in the background
* An open electron app with a rendering of a .gif or .webm file
* Playing music while performing the benchmark, yielding CPU power to the MP3, Vorbis or AAC decoder
* Amount of incoming network traffic to process for a benchmark that has nothing to do with the network
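As for the bootstrapping mentioned above, here is a minimal sketch of the idea in Go: resample the latency measurements with replacement, recompute the mean for each resample, and read a rough confidence interval off the resulting distribution. The sample values below are made up for illustration:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// bootstrapMean resamples the latency measurements with replacement and
// returns a rough 95% confidence interval for the mean.
func bootstrapMean(samples []float64, iterations int) (lo, hi float64) {
	means := make([]float64, iterations)
	for i := range means {
		sum := 0.0
		for j := 0; j < len(samples); j++ {
			sum += samples[rand.Intn(len(samples))]
		}
		means[i] = sum / float64(len(samples))
	}
	sort.Float64s(means)
	return means[int(0.025*float64(iterations))], means[int(0.975*float64(iterations))]
}

func main() {
	// Made-up latency samples in milliseconds, with one outlier.
	latencies := []float64{1.2, 1.3, 1.1, 1.4, 9.8, 1.2, 1.3, 1.5, 1.2, 1.1}
	lo, hi := bootstrapMean(latencies, 10000)
	fmt.Printf("95%% CI for mean latency: [%.2f ms, %.2f ms]\n", lo, hi)
}
```

If the interval comes out wide relative to the effect you are trying to measure, one of the perturbations above is probably in play.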
Finally, asynchronous goroutines are still work the program needs to execute. They aren't free. So as the system is stressed with a higher load, you push closer to the capacity limit and thus incur slower response times. In the case where you perform requests in the background to another HTTP server, you are taking a slice of the available resources. You are also generating as much work internally as is coming in externally. In a real-world server this is usually a bad idea, and you must put a resource limit in place. Otherwise an aggressive client can overwhelm your server. The trick is to slow the caller down by *not* responding right away if you are overloaded internally.
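One way to get that behaviour in Go is a semaphore built from a buffered channel. Here is a minimal sketch, where the limit of 64 slots, the 200 ms patience and the `doBackgroundCall` helper are all made-up placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// backgroundSlots bounds the number of in-flight background calls.
var backgroundSlots = make(chan struct{}, 64)

// doBackgroundCall stands in for the outbound HTTP request; stubbed here.
func doBackgroundCall() {
	time.Sleep(50 * time.Millisecond)
}

func handler(w http.ResponseWriter, r *http.Request) {
	select {
	case backgroundSlots <- struct{}{}:
		go func() {
			defer func() { <-backgroundSlots }()
			doBackgroundCall()
		}()
		fmt.Fprintln(w, "accepted")
	case <-time.After(200 * time.Millisecond):
		// All slots busy: push back on the caller rather than queueing
		// unbounded work internally.
		http.Error(w, "overloaded", http.StatusServiceUnavailable)
	}
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```

Because the handler waits for a slot before answering, and sheds load when none frees up, pressure is pushed back onto the caller instead of accumulating as an unbounded pile of goroutines.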
You should check your kernel. When you perform a large number of requests on the same machine, you can run out of ephemeral TCP source ports if connections are churned too quickly. It is a common problem when the load generator and SUT are on the same host.
You should check your HTTP client configuration as well. One way to avoid the above problem is to maximize connection reuse, but then you risk head-of-line blocking on the connections, even (or perhaps even more so) in the HTTP/2 case.
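In Go, the relevant knobs live on `http.Transport`. Here is a minimal sketch of a client tuned for connection reuse; the concrete numbers and the localhost URL are assumptions for illustration, not recommendations:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// newClient returns a client that keeps connections in an idle pool instead
// of churning through ephemeral source ports.
func newClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        100,              // total idle connections kept around
		MaxIdleConnsPerHost: 100,              // crucial when all load hits a single host
		IdleConnTimeout:     90 * time.Second, // keep connections warm between bursts
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}

func main() {
	client := newClient()
	resp, err := client.Get("http://localhost:8080/") // hypothetical SUT endpoint
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	// Drain the body so the connection can return to the idle pool.
	io.Copy(io.Discard, resp.Body)
	fmt.Println("status:", resp.Status)
}
```

Remember that a response body must be drained and closed before the connection can go back into the idle pool; forgetting that is a classic way to accidentally open a new connection per request.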
But above all: instrument, profile, observe. Nothing beats data and plots.