Go is designed to maximize throughput rather than to minimize latency.
Where does latency come from?
0) serialization delay as the packet comes in off the wire (packet length divided by network speed)
1) when a packet arrives, the NIC DMAs its contents into memory
2) the NIC then waits to see if more packets arrive (about 60us) - this can be turned off using ethtool's interrupt moderation (coalescing) settings.
3) The NIC raises an interrupt
4) The interrupt handler puts the packet onto the lower part of the device driver's queue
5) eventually the device driver gets scheduled, and passes the packet into the kernel's TCP or UDP networking code
6) this code will check sequence numbers, send ACKs, figure out which process will read the packet (i.e. your Go executable), and move it from the list of sleeping processes to the list of runnable processes
7) at some point later the kernel will resume your now runnable process
8) The netpoller in your executable will read the data, figure out which Goroutine is sleeping on this socket, and make it runnable
9) Go's scheduler will, at some point later, schedule that Goroutine so that it resumes running and consumes your packet.
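Here is a minimal sketch of what that path looks like from the Go side - a plain goroutine-per-connection echo server (the port and buffer size here are arbitrary). The blocking Read is where steps 8 and 9 play out: the goroutine parks in the netpoller, and is made runnable and rescheduled when data arrives.

    package main

    import (
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            go func(c net.Conn) {
                defer c.Close()
                buf := make([]byte, 4096)
                for {
                    // Read parks this goroutine until the netpoller
                    // wakes it (steps 8 and 9 above).
                    n, err := c.Read(buf)
                    if err != nil {
                        return
                    }
                    // Echo it back (the send path, run in reverse).
                    if _, err := c.Write(buf[:n]); err != nil {
                        return
                    }
                }
            }(conn)
        }
    }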
So that is what happens when you read a packet. Sending a packet is similar, but in reverse and with plenty of steps glossed over here.
So where does latency come from?
The biggest one is 9 - waiting for the Go scheduler to schedule your Goroutine - which can take a long time, depending on the load on your system.
That is the cost of Go's netpoller, which can efficiently multiplex I/O over 100,000s of Goroutines.
The next is 7 (the OS scheduler).
And then 2 (interrupt moderation) - which accounts for up to 60us but is very easy to fix, e.g. ethtool -C eth0 rx-usecs 0 (assuming your interface is eth0).
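To get a rough feel for the step 9 cost, you can measure how long a parked goroutine takes to run again after it becomes runnable, with and without competing CPU-bound work. This is an unscientific sketch, not a proper benchmark - the numbers will vary wildly by machine and Go version.

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    // wakeupLatency measures roughly how long a parked goroutine takes
    // to run again once it becomes runnable (the step 9 cost above).
    func wakeupLatency() time.Duration {
        ch := make(chan time.Time)
        done := make(chan time.Duration)
        go func() {
            t := <-ch // becomes runnable the moment the send happens
            done <- time.Since(t)
        }()
        time.Sleep(time.Millisecond) // let the goroutine park first
        ch <- time.Now()
        return <-done
    }

    func main() {
        fmt.Println("idle:  ", wakeupLatency())

        // Saturate the scheduler with CPU-bound goroutines, then remeasure.
        for i := 0; i < runtime.NumCPU()*4; i++ {
            go func() {
                x := 0
                for {
                    x++
                }
            }()
        }
        fmt.Println("loaded:", wakeupLatency())
    }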
If you are in the HFT business and lose vast amounts of money for every microsecond of delay, then you can do crazy and ugly stuff with
kernel bypass and code locked to CPU cores, busy-waiting on the device's ring buffer.
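Short of true kernel bypass (DPDK and friends, which never enter the kernel at all), the closest you can get in plain Go looks something like this: pin a goroutine to an OS thread and spin on a nonblocking socket, skipping the netpoller entirely. A Linux-only sketch with an arbitrary port, shown as an illustration rather than a recommendation:

    package main

    import (
        "fmt"
        "runtime"
        "syscall"
    )

    func main() {
        // Pin this goroutine to one OS thread so the spin loop owns a core.
        runtime.LockOSThread()

        fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, 0)
        if err != nil {
            panic(err)
        }
        if err := syscall.Bind(fd, &syscall.SockaddrInet4{Port: 9000}); err != nil {
            panic(err)
        }
        // Nonblocking, so Recvfrom returns EAGAIN instead of sleeping.
        if err := syscall.SetNonblock(fd, true); err != nil {
            panic(err)
        }

        buf := make([]byte, 65536)
        for {
            n, _, err := syscall.Recvfrom(fd, buf, 0)
            if err == syscall.EAGAIN {
                continue // spin: burn the core instead of parking in the netpoller
            }
            if err != nil {
                panic(err)
            }
            fmt.Printf("read %d bytes\n", n)
        }
    }

The trade is stark: you dedicate a whole core to shave the netpoller and scheduler wakeups off your receive path.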
But for sane people, Go gives you a great compromise: superb scalability (avoiding the thundering herd problem) while maintaining acceptable latency.