Minimizing net latency?

TH

Jul 29, 2022, 8:08:39 PM
to golang-nuts
Hey,

Bit confused on how stdlib net is implemented, but I'm noticing round-trip latencies of >150µs on idle connections (loopback). Round-trip latency drops to <20µs if packets are sent constantly.

I assume this latency is caused by the kernel/syscall wakeup that signals new data has arrived. Are there any methods to minimize this wakeup latency?

Thanks

robert engels

Jul 29, 2022, 8:16:09 PM
to TH, golang-nuts
Since the net IO is abstracted away from you, the answer is ‘not usually’.

The usual solution is dedicated threads that constantly poll the sockets, or hardware support + real-time threads, etc.

BUT, typically this is not what you are experiencing. More likely, the CPU cache gets cold - so the operations can take 10x longer - especially on a multi-use/multi-user system - and so it is actually your processing code that is taking longer.

You can probably use ‘perf’ on the process to monitor the cache misses.

To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/87deab65-b441-42ce-b51b-663651ecfccbn%40googlegroups.com.

TH

Jul 29, 2022, 9:11:05 PM
to golang-nuts
In case of a cache miss, I assume there's no way to tell the CPU to keep the data in the cache? The structs involved in processing are all below cache-line size. (The test code is very crude; there isn't much processing done.)

On the dedicated thread polling sockets... could I move away from net and use syscall & poll to implement low latency myself? Then again, I'm seeing similar behaviour in my C code too.

Thanks

robert engels

Jul 29, 2022, 10:34:45 PM
to TH, golang-nuts
It is probably not the data in the cache - it is the code - whether yours or the kernel's.

Did you try “hot threading” with polling in the C code? Did you see the same results?

Typically a context switch on modern Linux is 6 usecs - so I doubt that is the source of your latency.

Amnon

Jul 30, 2022, 1:18:27 AM
to golang-nuts

Go is designed to maximize throughput, rather than minimize latency. 

Where does latency come from?

0) serialization delay reading the NIC (packet length/network speed)
1) when a packet arrives, the NIC DMAs its contents into memory
2) the NIC then waits to see if more packets arrive (about 60us) - this can be turned off using ethtool's interrupt moderation settings. 
3) The NIC raises an interrupt
4) The interrupt handler puts the packet onto the lower part of the device driver's queue
5) eventually the device driver gets scheduled, and passes the packet into the kernel's TCP or UDP networking code
6) this code will check sequence numbers, send ACKs, figure out which process will read the packet (i.e. your Go executable), and move it from the list of sleeping processes to the list of runnable processes
7) at some point later the kernel will resume your now runnable process
8) The netpoller in your executable will read the data, figure out which Goroutine is sleeping on this socket, and make it runnable
9) Go's scheduler will, at some point later, schedule that Goroutine so that it resumes running and consumes your packet.

So this is what happens when you read a packet. Sending a packet is similar but in reverse, with many steps omitted.

So where does latency come from?

The biggest one is 9 - waiting for the Go scheduler to schedule your Goroutine - this could take a long time, depending on the load on your system.
That is the cost of Go's netpoller, which can efficiently multiplex IO over 100,000s of Goroutines.

The next is 7 (the OS scheduler).

And then 2 (interrupt moderation) - which accounts for up to 60us but is very easy to fix.

If you are in the HFT business and lose vast amounts of money for each us of delay, then you can do crazy and ugly stuff with
kernel bypass and code locked to CPU cores busy-waiting on the device ring buffer.

But for sane people Go gives you a great compromise of superb scalability (avoiding the thundering herd problem) while maintaining acceptable latency.

robert engels

Jul 30, 2022, 1:36:20 AM
to Amnon, golang-nuts
100% (the argument is simpler though if you focus on UDP rather than TCP).

If you think usecs matter that much, you need to be hardware-based or you’ve already lost. Most people who are concerned about usecs have other latency killers in the 100s of milliseconds - but they never realize this because the exchange itself has latency pauses of 2x that at times.

You are in a losing (random outcome) battle when one side is trying to optimize for latency and the other side doesn’t care - because it doesn’t matter to their business model and they’re in control.

There was a time when major market participants were above the inherent latency of the exchange - but those orgs are long gone (or have fixed their infrastructure) - so the remaining players are fighting for worthless microseconds.
