Maximising UDP server performance on Linux


Jerome Touffe-Blin

Mar 4, 2016, 8:33:36 AM3/4/16
to golang-nuts
Hi,

I'm trying to maximise read performance from a socket for a UDP server, but despite using multiple goroutines to read from the socket I'm only able to max out one of the CPUs, and therefore cannot read fast enough from the socket to avoid dropping a lot of packets.

Context: AWS c4.4xlarge with 16 CPU cores on Ubuntu 14.04. The number of workers equals the number of cores. Load testing from another similar instance is able to send almost 300k packets per second.

func ListenAndReceive(nbWorkers int) error {
    addr := ":8181"
    c, err := net.ListenPacket("udp", addr)
    if err != nil {
        return err
    }
    for i := 0; i < nbWorkers; i++ {
        go func() {
            receive(c)
        }()
    }
    return nil
}

func receive(c net.PacketConn) {
    defer c.Close()

    msg := make([]byte, UDPPacketSize)
    for {
        nbytes, addr, err := c.ReadFrom(msg)
        if err != nil {
            log.Errorf("%s", err)
            continue
        }
        buf := make([]byte, nbytes)
        copy(buf, msg[:nbytes])
        handleMessage(addr, buf)
    }
}

func handleMessage(addr net.Addr, msg []byte) {
    // Do nothing
}

I am only able to read ~100k packets per second from the socket (so roughly two-thirds loss) while almost maxing out one CPU. The other 15 CPUs are literally idle. I would have thought that with this method I could read from the socket using all workers and therefore max out all CPUs (net.PacketConn is documented in net.go as safe for use by multiple goroutines). On Linux it should be possible to use SO_REUSEPORT to share the socket between workers, but that doesn't seem to be implemented in the standard library yet.

However, on my MBP, with a lower number of packets and workers (8), I am able to pretty much max out all CPUs (which makes troubleshooting even harder).

Am I doing something wrong? Any idea on how this can be improved? I would expect to be able to read much more than 100k packets per second on a c4.4xlarge.

Thanks
--jtblin






Roberto Zanotto

Mar 4, 2016, 8:52:52 AM3/4/16
to golang-nuts
Which version of Go are you using? Are you setting GOMAXPROCS?

Jesper Louis Andersen

Mar 4, 2016, 9:06:06 AM3/4/16
to Jerome Touffe-Blin, golang-nuts

On Fri, Mar 4, 2016 at 11:56 AM, Jerome Touffe-Blin <jtb...@gmail.com> wrote:
I'm trying to maximise the performance reading from a socket for an UDP server but despite using multiple goroutines to read from the socket, I'm only able to max one of the CPUs, and therefore cannot read fast enough from the socket to avoid dropping a lot of packets.

Forgive my possible ignorance, but have you done any tuning of the operating system's receive buffer and how many packets it wants to keep before dropping them?

If you have a very fast sender, and enough bandwidth, then you may be able to fill a small buffer before the Go program can empty it. This leads to drops, but your system may be able to keep up were it allowed to queue and then transfer larger chunks of packets to the application layer. Your Mac might have a different default setting for the "backlog" the kernel is willing to accept.

Usually, the kernel can tell you the default size, and `netstat -s` tends to give you a hint that the kernel is dropping packets because the buffer is full.

The reason to introduce queues in systems is exactly for having such absorption holding places. They "smooth out" latencies in the system as long as the system has the bandwidth to digest the flow of packets.
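A minimal, untested sketch of bumping the socket receive buffer from the Go side; note the effective size is still capped by the kernel (net.core.rmem_max on Linux), which you may have to raise separately:

import (
    "log"
    "net"
)

func listenWithBigBuffer(addr string) (net.PacketConn, error) {
    c, err := net.ListenPacket("udp", addr)
    if err != nil {
        return nil, err
    }
    // For "udp", ListenPacket returns a *net.UDPConn, which exposes SetReadBuffer.
    if uc, ok := c.(*net.UDPConn); ok {
        // 8 MiB is only an example value, not a recommendation.
        if err := uc.SetReadBuffer(8 * 1024 * 1024); err != nil {
            log.Printf("SetReadBuffer: %v", err)
        }
    }
    return c, nil
}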

--
J.

C Banning

Mar 4, 2016, 9:58:46 AM3/4/16
to golang-nuts
Did you try - 

func receive(c net.PacketConn) {
    defer c.Close()

    for {
        msg := make([]byte, UDPPacketSize)
        nbytes, addr, err := c.ReadFrom(msg)
        if err != nil {
            log.Errorf("%s", err)
            continue
        }
        go handleMessage(addr, msg[:nbytes])
    }
}

John McKown

Mar 4, 2016, 10:51:11 AM3/4/16
to Jerome Touffe-Blin, golang-nuts
Some observations / questions from a new Go learner.

On Fri, Mar 4, 2016 at 4:56 AM, Jerome Touffe-Blin <jtb...@gmail.com> wrote:
Hi,

I'm trying to maximise the performance reading from a socket for an UDP server but despite using multiple goroutines to read from the socket, I'm only able to max one of the CPUs, and therefore cannot read fast enough from the socket to avoid dropping a lot of packets.

Context: AWS c4.4xlarge with 16 CPU cores on Ubunutu 14.04. Number of workers equal to the number of cores. Load testing from another similar instance able to send almost 300k packets per second.

func ListenAndReceive(nbWorkers int) error {
    addr := ":8181"
    c, err := net.ListenPacket("udp", addr)
    if err != nil {
        return err
    }
    for i := 0; i < nbWorkers; i++ {
        go func() {
            receive(c)
        }()
    }
    return nil
}

func receive(c net.PacketConn) {
    defer c.Close()

    msg := make([]byte, UDPPacketSize)
    for {
        nbytes, addr, err := c.ReadFrom(msg)
        if err != nil {
            log.Errorf("%s", err)

Hmm, this flags because log.Errorf is not defined. Not a big deal, I replaced it with fmt.Fprintln(os.Stderr, err).

            continue
        }
        buf := make([]byte, nbytes)

Why do you remake buf in the loop? Why not make it UDPPacketSize before the loop, like you did with msg?

        copy(buf, msg[:nbytes])
        handleMessage(addr, buf)

To go with the above, why are you sending buf to handleMessage instead of &buf? Doesn't passing a slice make a copy of the slice and pass that?

    }
}

func handleMessage(addr net.Addr, msg []byte) {
    // Do nothing
}

My changes to receive() and handleMessage() might look something like:

func receive(c net.PacketConn) {
    defer c.Close()

    var msgArray [UDPPacketSize]byte
    var bufArray [UDPPacketSize]byte
    var buf []byte

    // msg := make([]byte, UDPPacketSize)
    msg := msgArray[0:UDPPacketSize]
    for {
        nbytes, addr, err := c.ReadFrom(msg)
        if err != nil {
            // log.Errorf("%s", err)
            fmt.Fprintln(os.Stderr, err)
            continue
        }
        // buf := make([]byte, nbytes)
        buf = bufArray[:nbytes]
        copy(buf, msg[:nbytes])
        handleMessage(addr, &buf)
    }
}

func handleMessage(addr net.Addr, msg *[]byte) {
    // Do nothing
}

Please forgive my presumption if the above is useless. It likely does _not_ address the "why is only one CPU being used?" question. Perhaps your code should, somewhere during start up, have something like:

fmt.Printf("The value of GOMAXPROCS is %d.\n", runtime.GOMAXPROCS(0))



--
A fail-safe circuit will destroy others. -- Klipstein

Maranatha! <><
John McKown

Shawn Milochik

Mar 4, 2016, 10:53:52 AM3/4/16
to golang-nuts
Limiting the number of goroutines to GOMAXPROCS seems silly. Why not bump that up significantly? One CPU can easily handle the tiny amount you're creating, so it probably has no need to create OS threads. Experiment. Try 50, or 1,000.

Egon

Mar 4, 2016, 11:46:45 AM3/4/16
to golang-nuts
On Friday, 4 March 2016 15:33:36 UTC+2, Jerome Touffe-Blin wrote:
Hi,

I'm trying to maximise the performance reading from a socket for an UDP server but despite using multiple goroutines to read from the socket, I'm only able to max one of the CPUs, and therefore cannot read fast enough from the socket to avoid dropping a lot of packets.

What is your reader behavior, is it predictable?
What should happen if the readers cannot keep up?

I would probably try implementing the UDP reader loop in C, and implement something like https://github.com/egonelbre/exp/tree/master/ring on both the Go and C sides for communication, to put as little pressure as possible on memory allocation and to make sure I'm not hitting the syscall overhead.

You can try whether just using that ring buffer helps; on the UDP side, lock it to a thread and read in a single goroutine. Of course, the code is completely untested; use at your own peril.

+ Egon

Jerome Touffe-Blin

Mar 4, 2016, 4:33:39 PM3/4/16
to golang-nuts
Thanks everyone for all the advice!

Which version of Go are you using? Are you setting GOMAXPROCS?

@Roberto Zanotto. Forgot to mention, I'm using 1.6 and I still have the boilerplate code that sets GOMAXPROCS to the number of CPUs anyway.

Forgive me for eventual ignorance, but have you made any tuning of the operating systems accept buffer and how many packets it want to keep before dropping them?

@Jesper Louis Andersen. No, I haven't done any tuning of the OS and I have to admit my ignorance about this. What kind of tuning do you suggest I make? Thanks for the pointer, I'll look into this.

@C Banning, I've just tried it now and it seems more packets are dropped with this approach of re-creating the large slice on each loop.


why are you sending buf to handleMessage instead of &buf? Doesn't passing a slice make a copy of the slice and pass that?

@John McKown, a slice is already a reference to an array, see https://golang.org/doc/effective_go.html#slices

Why do you remake buf in the loop? Why not make it UDPPacketSize before the loop, like you did msg?

@John McKown, to have a clean buffer. It should be possible to optimise this part a little by not creating the second buffer and always reusing the first one, though. I'll give that a go but I doubt it will change the end results dramatically.

Limiting the number of goroutines to GOMAXPROCS seems silly. Why not bump that up significantly? One CPU can easily handle the tiny amount you're creating, so it probably has no need to create OS threads. Experiment. Try 50, or 1,000.

@Shawn Milochik, old habits from when having too many threads was bad for context switching. However, I've tried bumping the number of goroutines to 1,000 and even 10,000, and although I'm getting significantly better throughput, it still drops around 50% of the packets and only uses one core.

What is your reader behavior, is it predictable?
What should happen if the readers cannot keep up?

@Egon what do you mean? The reader is the code I posted above. I've ported it to a gist for convenience btw: https://gist.github.com/jtblin/18df559cf14438223f93

I would probably try implementing the UDP reader loop inside C, and implement something like https://github.com/egonelbre/exp/tree/master/ring in both Go and C side for communication, to ensure as little pressure on the memory allocation and make sure I'm not hitting the syscall overhead.

@Egon I'd rather not have to mix C and Go. I'd see it as an issue with Go, tbh, if one cannot write something that reads fast enough from a UDP socket, but at the moment I think it's more a problem with my specific implementation or some other issue.

So far, based on the comments above, it seems that increasing the number of goroutines had the most significant impact, with around a 30% increase in packets processed, but it's still 50%+ of packets dropped, which is massive.

Regarding the difference in CPU usage on my MBP, I think it's just a red herring as I run the loader and the receiver on the same machine whereas on Linux it's on 2 different boxes.

I forgot to mention I ran the program through pprof and got the following result (showing top 10 below):

Showing top 50 nodes out of 93 (cum >= 0.13s)
      flat  flat%   sum%        cum   cum%
     3.13s 32.74% 32.74%      3.19s 33.37%  syscall.Syscall6
     1.70s 17.78% 50.52%      1.70s 17.78%  runtime.usleep
     1.23s 12.87% 63.39%      1.23s 12.87%  runtime.kevent
     0.73s  7.64% 71.03%      0.73s  7.64%  runtime.mach_semaphore_wait
     0.54s  5.65% 76.67%      0.54s  5.65%  runtime.mach_semaphore_signal
     0.38s  3.97% 80.65%      3.61s 37.76%  syscall.Recvfrom
     0.26s  2.72% 83.37%      0.26s  2.72%  runtime.mach_semaphore_timedwait
     0.12s  1.26% 84.62%      0.17s  1.78%  fmt.(*pp).doPrintf
     0.12s  1.26% 85.88%      0.12s  1.26%  runtime.memmove
     0.09s  0.94% 86.82%      0.09s  0.94%  runtime/internal/atomic.Cas64


My assumption so far is that most of the processing is done at the socket syscall level, and the reason I'm not maxing out CPUs is that Go doesn't support sharing the socket between goroutines yet (SO_REUSEPORT), but I'd love to be proven wrong.

Does anyone know if using SO_REUSEPORT is in the roadmap for Go? I had a quick look at https://github.com/kavu/go_reuseport but haven't finished exploring this option yet.
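From what I can tell, the idea would be to drop down to the syscall package, set the option before bind, and wrap the fd back into a net.PacketConn; each worker would then open its own socket on the same port. A rough, untested, Linux-only sketch (0xf is SO_REUSEPORT on Linux, syscall.SO_REUSEPORT where available; error handling trimmed):

import (
    "net"
    "os"
    "syscall"
)

func listenReusePort(port int) (net.PacketConn, error) {
    fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, syscall.IPPROTO_UDP)
    if err != nil {
        return nil, err
    }
    // 0xf is SO_REUSEPORT on Linux.
    if err := syscall.SetsockoptInt(fd, syscall.SOL_SOCKET, 0xf, 1); err != nil {
        syscall.Close(fd)
        return nil, err
    }
    if err := syscall.Bind(fd, &syscall.SockaddrInet4{Port: port}); err != nil {
        syscall.Close(fd)
        return nil, err
    }
    f := os.NewFile(uintptr(fd), "udp")
    defer f.Close() // net.FilePacketConn dups the fd, so the original file can be closed
    return net.FilePacketConn(f)
}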

Manlio Perillo

Mar 4, 2016, 4:45:01 PM3/4/16
to golang-nuts
How is performance if you use only one goroutine to read packets?


Manlio


 

Dave Cheney

Mar 4, 2016, 5:04:05 PM3/4/16
to golang-nuts
You're only getting one CPU's worth of performance because there is only one socket to listen on. All of those goroutines are going to logjam behind the epoll/kqueue/whatever; one will wake up when you get another packet, and the rest will stall waiting on the next packet. I think your idea of multiple listening sockets will perform better.

Also, I'm not 100% sure if net.Conn.Read is defined to be safe for concurrent readers, check the race detector.
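An untested sketch of what I mean, assuming an SO_REUSEPORT-style helper like the listenReusePort function sketched earlier in the thread (that helper is hypothetical, not part of the standard library):

// One socket and one reading goroutine per worker, instead of many
// goroutines sharing a single socket.
func listenAndReceiveMulti(nbWorkers, port int) error {
    for i := 0; i < nbWorkers; i++ {
        c, err := listenReusePort(port)
        if err != nil {
            return err
        }
        go receive(c)
    }
    return nil
}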

Jerome Touffe-Blin

Mar 4, 2016, 5:08:26 PM3/4/16
to golang-nuts
How are performance if you use only one goroutine to read packets?

Good question. CPU usage is similar but I can only read around 35k packets out of 280k so around 90% drop :(

If I use a huge number of goroutines i.e. 100,000 I am able to get more read throughput and CPU usage but performance seems to become more variable. Even more if I use 1,000,000. I wonder what is a sane limit for the number of goroutines someone can use. 

Jerome Touffe-Blin

Mar 4, 2016, 5:17:46 PM3/4/16
to golang-nuts
Thanks Dave. Yes makes sense, I'm going to investigate this path more.

Also, I'm not 100% sure if net.Conn.Read is defined to be safe for concurrent readers, check the race detector.

I did run it with -race and the race detector didn't complain. Also ReadFrom is from the net.PacketConn interface which specifically says that it should be safe IIUC

// PacketConn is a generic packet-oriented network connection.
//
// Multiple goroutines may invoke methods on a PacketConn simultaneously.

Manlio Perillo

Mar 4, 2016, 5:19:11 PM3/4/16
to golang-nuts
On Friday, 4 March 2016 at 23:08:26 UTC+1, Jerome Touffe-Blin wrote:
How are performance if you use only one goroutine to read packets?

Good question. CPU usage is similar but I can only read around 35k packets out of 280k so around 90% drop :(

This is wrong. It should not be that slow.
Can you post the updated code?

You should optimize all the code that does allocations.
Since you are using only one goroutine, preallocate the msg buffer and reuse it for all the packets.

Do another performance test by removing the code that copies msg to buf and the call to handleMessage.
By the way, the copy is not necessary, since msg is local to the goroutine; you can pass msg to handleMessage.
What is the value of UDPPacketSize?

If it is still slow, then the cause may be the net poller.


Manlio

C Banning

Mar 4, 2016, 5:22:46 PM3/4/16
to golang-nuts
Did you couple using a single receive() with -

for {
    msg := make([]byte, UDPPacketSize)
    nbytes, addr, err := c.ReadFrom(msg)
    if err != nil {
        log.Errorf("%s", err)
        continue
    }
    go handleMessage(addr, msg[:nbytes])
}
?

I've found that approach to be highly performant on a net.TCPConn.

Manlio Perillo

Mar 4, 2016, 5:42:13 PM3/4/16
to golang-nuts
On Friday, 4 March 2016 at 23:22:46 UTC+1, C Banning wrote:
Did you couple using a single receive() with -

for {
    msg := make([]byte, UDPPacketSize)
    nbytes, addr, err := c.ReadFrom(msg)
    if err != nil {
        log.Errorf("%s", err)
        continue
    }
    go handleMessage(addr, msg[:nbytes])
}
?

I've found that approach to be highly performant on a net.TCPConn.

This is still not optimized.
You are allocating memory for every packet.

Try with sync.Pool, or, probably better, with a ring buffer as suggested by Egon.
IMHO, in https://github.com/egonelbre/exp/blob/master/ring/buffer.go you don't need the busy wait if the ring buffer size is the same as the number of goroutines used to handle connections.
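A rough, untested sketch of the sync.Pool variant (using log.Printf, since log.Errorf is not in the standard library; the buffer is returned to the pool once handleMessage is done with it):

var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, UDPPacketSize) },
}

func receive(c net.PacketConn) {
    defer c.Close()
    for {
        buf := bufPool.Get().([]byte)
        nbytes, addr, err := c.ReadFrom(buf)
        if err != nil {
            log.Printf("%s", err)
            bufPool.Put(buf)
            continue
        }
        go func() {
            handleMessage(addr, buf[:nbytes])
            bufPool.Put(buf)
        }()
    }
}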

> [...]

Manlio

Egon

Mar 4, 2016, 6:31:27 PM3/4/16
to golang-nuts


On Friday, 4 March 2016 23:33:39 UTC+2, Jerome Touffe-Blin wrote:
[...]

What is your reader behavior, is it predictable?
What should happen if the readers cannot keep up?

@Egon what do you mean? The reader is the code I posted above. I've ported it to a gist for convenience btw: https://gist.github.com/jtblin/18df559cf14438223f93

Sorry, I meant the "handler", I used "reader" in my code. Handlers dictate the queue design between the packet reading and workload distribution.
 
I would probably try implementing the UDP reader loop inside C, and implement something like https://github.com/egonelbre/exp/tree/master/ring in both Go and C side for communication, to ensure as little pressure on the memory allocation and make sure I'm not hitting the syscall overhead.

@Egon I'd rather not have to mix C and Go. I'd see that as an issue with Go tbh if one cannot develop something that is going to read fast enough from an UDP socket, but at the moment, I think it's more a problem with my specific implementation or other issue.

There's a constant overhead with syscall and cgo, due to the runtime differences... in this case we are doing tons of those calls, hence you will get a high total overhead. I haven't measured them lately, so I'm not sure what the exact numbers are. You also have Go runtime trying to schedule and coordinate syscalls and goroutines -> more complexity.

There are two ways of reducing it: either avoid it altogether, e.g. by writing the "heavy" loops in C, or make fewer calls, e.g. recvmmsg or some other call that can receive multiple packets at a time.

Of course, SO_REUSEPORT should also help with performance + multiple readers.

I'm not sure which part will have the biggest effect, but for high-perf systems I tend to prefer building them from the ground up and trying to understand where my cycles are spent. From pprof, it looks like you aren't hitting any allocation and GC issues, yet.

Also, I would test using syscalls directly instead of net.UDP*; I'm not sure what the overhead in the net package is and whether it does something extra for some reason. From a quick skim it doesn't look like there is much, but it's probably worth testing.

Jerome Touffe-Blin

Mar 4, 2016, 6:41:47 PM3/4/16
to golang-nuts
Thanks all. The updated code is at https://gist.github.com/jtblin/18df559cf14438223f93.

I'm now reusing the same buffer but it didn't change results significantly.

defer c.Close()

msg := make([]byte, UDPPacketSize)
for {
    nbytes, addr, err := c.ReadFrom(msg[0:])
    if err != nil {
        log.Errorf("%s", err)
        continue
    }
    handleMessage(addr, msg[0:nbytes])
}

Even with the implementation above, which reuses the same buffer and does nothing with it in handleMessage apart from counting the number of packets for now, I am still not able to get over 150k packets/s with 1024 goroutines, and I'm still only really using one CPU.

I'll have a look at sync.Pool and the ring buffer though because when I actually start doing something with the message I'll need to optimise the buffer management indeed. Thanks for pointing this out!

James Aguilar

Mar 4, 2016, 7:06:22 PM3/4/16
to golang-nuts
I don't know what the solution is, but the problem is pretty straightforward. netFD has a mutex around readFrom, which is what is ultimately called to support this API.


Ctrl-F "readFrom"

I'm not sure if you have to open multiple listeners or do some other type of magic, but you will definitely never use more than one core with a single listener.

Jerome Touffe-Blin

Mar 4, 2016, 7:13:42 PM3/4/16
to golang-nuts
Thanks a lot for the detailed response Egon.

What is your reader behavior, is it predictable? 
Sorry, I meant the "handler", I used "reader" in my code. Handlers dictate the queue design between the packet reading and workload distribution.

At the moment, I have disabled pretty much all handling to try to reduce the potential issues and bottlenecks. The handler basically sends the received-packet count to a buffered channel, which aggregates the packets received and flushes the total to stdout every second so that I can compare sent vs received. The buffered channel itself is processed by the same number of workers. This way I believe I avoid interfering with the socket handling itself so it doesn't impact performance.
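Roughly like this (a simplified, illustrative sketch, not the exact code from the gist, and with a single aggregator instead of several workers):

var counts = make(chan int, 1024)

func handleMessage(addr net.Addr, msg []byte) {
    counts <- 1
}

func reportLoop() {
    total := 0
    tick := time.NewTicker(time.Second)
    for {
        select {
        case n := <-counts:
            total += n
        case <-tick.C:
            fmt.Printf("packets received: %d/s\n", total)
            total = 0
        }
    }
}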

There's a constant overhead with syscall and cgo, due to the runtime differences... in this case we are doing tons of those calls, hence you will get a high total overhead. I haven't measured them lately, so I'm not sure what the exact numbers are. You also have Go runtime trying to schedule and coordinate syscalls and goroutines -> more complexity.
There are two ways of reducing it, either avoid it altogether, e.g. by writing the "heavy" loops in C, or do less calls, e.g. recvmmsg or use some other call that can receive multiple packets at a time.

Makes sense, yes, that's pretty much where I am now in terms of thinking. I thought about using recvmmsg as well to receive multiple packets at once, but I'm not sure exactly what the benefits would be in a real-life scenario where packets come from multiple clients, and I'm not sure how to implement that yet. I was hoping I could avoid writing the socket handling in C, but what you say totally makes sense.

Of course, SO_REUSEPORT should also help with performance + multiple readers.
 
I'm not sure which part will have the biggest effect, but for high-perf systems I tend to prefer building them from ground-up and trying to understand where my cycles are spent. From pprof, it looks like you aren't hitting any allocation and GC issues, yet.
 
Also, I would test using directly syscall-s instead of net.UDP*; not sure what the overhead in net package is and whether they do something extra for some reason. From quick skim, it doesn't look like there is much, but probably worth testing. 

Not having too much luck with SO_REUSEPORT e.g. https://github.com/kavu/go_reuseport/pull/4 yet but it looks like I'm going to have to use syscalls directly if I want to go that route.

Jerome Touffe-Blin

Mar 4, 2016, 7:28:35 PM3/4/16
to golang-nuts
Interesting, I guess that explains why net.PacketConn is advertised as safe to use with multiple goroutines. I'd have imagined the mutex just synchronises reads and writes between goroutines, but I'm not sure it explains why it only uses one core, as goroutines could span multiple cores while still being synchronised, no? Anyhow, I agree with your conclusion, and I am more and more convinced that to fully utilise all the cores I need to be able to share the socket, which is what SO_REUSEPORT is for.

James Aguilar

Mar 4, 2016, 7:51:21 PM3/4/16
to golang-nuts
On Friday, March 4, 2016 at 4:28:35 PM UTC-8, Jerome Touffe-Blin wrote:
as goroutines could span across multiple cores while still being synchronised no?

Not if the critical section represents a dominant fraction of packet processing time. Since your packet handler is a noop, this is probably the case in your test.

Manlio Perillo

Mar 5, 2016, 4:55:53 AM3/5/16
to golang-nuts
On Saturday, 5 March 2016 at 00:41:47 UTC+1, Jerome Touffe-Blin wrote:
Thanks all. The updated code is at https://gist.github.com/jtblin/18df559cf14438223f93.


What is the reason for the message queue?
Since you are already reading a packet inside a separate goroutine, IMHO there is no need to use an additional goroutine.
Just call handleMessage from the receive function.
 
I'm now reusing the same buffer but it didn't change results significantly.


Handled packets per second are 150k vs ~100k; that's a 50% performance increase...

defer c.Close()

msg := make([]byte, UDPPacketSize)
for {
    nbytes, addr, err := c.ReadFrom(msg[0:])
    if err != nil {
        log.Errorf("%s", err)
        continue
    }
    handleMessage(addr, msg[0:nbytes])
}

Even with the implementation above which reuses the same buffer and doing nothing with it in handleMessage apart counting the number of packets for now, I am still not able to get over 150k packets/s with 1024 goroutines and I'm still only really using one CPU.

How are you using 1024 goroutines?
The code above is supposed to run inside *one* goroutine. Counting packets is a simple n = 0 and n += 1.

Finally, I suggest you create a reference implementation in C (the same code as the Go version, reading packets from a single thread using recvfrom and counting them).
With the reference implementation you can compare the performance of your Go versions against it.

A possible optimization for the Go version is to use syscall.Recvfrom, to avoid the mutex.
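Something like this untested sketch, where fd would come from syscall.Socket/syscall.Bind rather than from the net package (handlePacket is just a placeholder for your parsing/counting):

func receiveRaw(fd int) {
    buf := make([]byte, UDPPacketSize)
    for {
        // Blocking read straight from the socket, bypassing the net package's fd mutex.
        n, _, err := syscall.Recvfrom(fd, buf, 0)
        if err != nil {
            continue
        }
        handlePacket(buf[:n])
    }
}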


Manlio

Jerome Touffe-Blin

Mar 6, 2016, 3:16:23 AM3/6/16
to golang-nuts
as goroutines could span across multiple cores while still being synchronised no?
 
Not if the critical section represents a dominant fraction of packet processing time. Since your packet handler is a noop, this is probably the case in your test..

But it wasn't a noop originally when I discovered that only one core was used. I just made it a noop to isolate the problem.

Anyhow, I have now found why it can only use one CPU core, and it is due to how the Linux kernel hashes packets to distribute them across the RX queues of the NIC, each queue being pinned to a specific CPU core. The hash is basically over the tuple (src IP, dst IP, src port, dst port), so that explains why all the packets go to the same core deterministically, as I load test using only one source. This article was great for explaining the issue and confirming the hunch I had from the beginning, i.e. that the socket is not shared across CPUs: https://blog.cloudflare.com/how-to-receive-a-million-packets/. In a real-life scenario I expect the source IPs and ports to be different, so it should be able to use more than one core and it won't be as much of a problem. That said, I expect some clients will send more data than others, so it may still be a problem to a certain extent and may create hot-core problems. The article also confirms that to spread the load across all CPUs, I'd have to use SO_REUSEPORT, which is not available in the Go standard library at the moment :(

What is the reason for the message queue?
Since you are already reading a packet inside a separate goroutine, IMHO there is no need to use an additional goroutine.
Just call handleMessage from the receive function.

Yes, in this dummy example a message queue may not be necessary, but I will need it in my server as there will be some parsing and processing of the message. If I don't have it, the goroutines are busy doing processing and cannot read from the socket as fast as possible.

Handled packets per seconds are 150k vs ~100k; it's a 50% performance increase...

Yes, you're right, I meant it didn't change the CPU core usage, but I now understand why, cf. the explanation above. It was indeed a good performance improvement. I'm now using a sync.Pool following your advice, as reusing the buffer was not thread-safe and didn't work beyond this dummy example, and with this and other performance improvements I'm now able to get around ~230k packets a second with my server :)
I've updated the reference implementation at https://gist.github.com/jtblin/18df559cf14438223f93#file-udp-server-go. The message queue and the buffer pool were the big drivers of the performance improvements; running more goroutines than the number of cores does not have any positive performance impact AFAICT.

A possible optimization for the Go version is to use syscall.Recvfrom, to avoid the use of mutex.

Yeah, although I am not sure it would be safe to read from the socket from different goroutines without a mutex. I may try this out anyway but next step is to look into getting SO_REUSEPORT to work in Go first.

Thanks a ton everyone for all the advice!

Krzysztof Kowalczyk

Mar 6, 2016, 11:48:05 PM3/6/16
to golang-nuts
Untested optimization. Full: https://play.golang.org/p/En0ksz8qF_

Relevant part:

type message struct {
    addr net.Addr
    msg  []byte
}

func listenAndReceive(maxWorkers int) error {
    c, err := net.ListenPacket("udp", address)
    if err != nil {
        return err
    }
    for i := 0; i < maxWorkers; i++ {
        go dequeue(mq)
    }
    go receive(c)
    return nil
}

func receive(c net.PacketConn) {
    defer c.Close() // TODO: closed multiple times

    var buf []byte
    for {
        if len(buf) < UDPPacketSize {
            buf = make([]byte, packetBufSize, packetBufSize)
        }

        nbytes, addr, err := c.ReadFrom(buf)
        if err != nil {
            log.Printf("Error %s", err)
            continue
        }
        msg := buf[:nbytes]
        buf = buf[nbytes:]
        mq <- message{addr, msg}
    }
}

func dequeue(mq chan message) {
    for m := range mq {
        handleMessage(m.addr, m.msg)
    }
}

func handleMessage(addr net.Addr, msg []byte) {
    // Do something with the message
    atomic.AddUint64(&ops, 1)
}


The important part is that instead of using a buffer pool it allocates a 1 MB array, reads a packet directly into that buffer and sends out a slice backed by that buffer. When the buffer is full, it allocates a new one; the old one will be dropped by the GC when all packets (slices) that reference it are processed. This should be cheaper than sync.Pool (and contention free).

I only used a single worker for reads because I doubt having more helps.

Also, calling c.Close() in receive() is a (harmless) bug, as it closes the same connection multiple times.

-- kjk

Roger Pack

Mar 7, 2016, 1:50:45 PM3/7/16
to golang-nuts


On Friday, March 4, 2016 at 6:33:36 AM UTC-7, Jerome Touffe-Blin wrote:
Hi,

I'm trying to maximise the performance reading from a socket for an UDP server but despite using multiple goroutines to read from the socket, I'm only able to max one of the CPUs, and therefore cannot read fast enough from the socket to avoid dropping a lot of packets.

Context: AWS c4.4xlarge with 16 CPU cores on Ubunutu 14.04. Number of workers equal to the number of cores. Load testing from another similar instance able to send almost 300k packets per second.

Are you sure the packets are being dropped at the receiving side? ... 

magnus....@gmail.com

Apr 10, 2016, 8:29:01 AM4/10/16
to golang-nuts
I noticed that when using IPv6 the ReadFromUDP function will call zoneToString to map the interface
index to an interface name. This call will wind up calling syscall.NetlinkRIB for every packet. I.e. for every packet
received it will load and parse the routing table of the host.

The zone map was probably intended to be cached but isn't.

Unfortunately I haven't been able to try this with an IPv4 peer yet but that should be significantly faster.



Manlio Perillo

Apr 10, 2016, 11:17:46 AM4/10/16
to golang-nuts, magnus....@gmail.com
Can you file an issue at https://github.com/golang/go/issues/?

Thanks,
Manlio