Hi, I'm trying to maximise the read performance of a UDP server, but despite using multiple goroutines to read from the socket, I'm only able to max out one of the CPUs and therefore cannot read fast enough from the socket to avoid dropping a lot of packets.

Context: AWS c4.4xlarge with 16 CPU cores on Ubuntu 14.04. Number of workers equal to the number of cores. Load testing from another similar instance able to send almost 300k packets per second.

func ListenAndReceive(nbWorkers int) error {
    addr := ":8181"
    c, err := net.ListenPacket("udp", addr)
    if err != nil {
        return err
    }
    for i := 0; i < nbWorkers; i++ {
        go func() {
            receive(c)
        }()
    }
    return nil
}

func receive(c net.PacketConn) {
    defer c.Close()
    msg := make([]byte, UDPPacketSize)
    for {
        nbytes, addr, err := c.ReadFrom(msg)
        if err != nil {
            log.Errorf("%s", err)
            continue
        }
        buf := make([]byte, nbytes)
        copy(buf, msg[:nbytes])
        handleMessage(addr, buf)
    }
}

func handleMessage(addr net.Addr, msg []byte) {
    // Do nothing
}

I am only able to read ~100k packets from the socket (so about 2/3 loss) while almost maxing out one CPU. The other 15 CPUs are literally idle. I would have thought that with this method I could read from the socket using all workers and therefore max out all CPUs (net.PacketConn is documented in net.go as safe for use by multiple goroutines). On Linux it should be possible to use SO_REUSEPORT to share the socket between workers, but that doesn't seem to be implemented in the standard library yet.

However on my MBP, with a lower number of packets and workers (8), I am able to pretty much max out all CPUs (which makes troubleshooting even harder).

Am I doing something wrong? Any idea how this can be improved? I would expect to be able to read much more than 100k packets per second on a c4.4xlarge.

Thanks
--
jtblin
Hi, I'm trying to maximise the read performance of a UDP server, but despite using multiple goroutines to read from the socket, I'm only able to max out one of the CPUs and therefore cannot read fast enough from the socket to avoid dropping a lot of packets.
Which version of Go are you using? Are you setting GOMAXPROCS?
Forgive my possible ignorance, but have you done any tuning of the operating system's receive buffer and of how many packets it will keep before dropping them?
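For illustration, a minimal sketch of bumping the socket receive buffer from Go, assuming the connection returned by net.ListenPacket is a *net.UDPConn; the 4 MB value is only an example, and on Linux the effective size is capped by the net.core.rmem_max sysctl:

// Sketch only: c comes from net.ListenPacket("udp", addr) as in the original code.
if uc, ok := c.(*net.UDPConn); ok {
    // Ask the kernel for a larger SO_RCVBUF so bursts are buffered instead of dropped.
    // On Linux the request is capped by net.core.rmem_max, so raise that sysctl too,
    // e.g. sysctl -w net.core.rmem_max=4194304.
    if err := uc.SetReadBuffer(4 * 1024 * 1024); err != nil {
        log.Errorf("SetReadBuffer: %s", err)
    }
}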
Why are you sending buf to handleMessage instead of &buf? Doesn't passing a slice make a copy of the slice and pass that?
Why do you remake buf in the loop? Why not make it UDPPacketSize before the loop, like you did msg?
Limiting the number of goroutines to GOMAXPROCS seems silly. Why not bump that up significantly? One CPU can easily handle the tiny amount you're creating, so it probably has no need to create OS threads. Experiment. Try 50, or 1,000.
What is your reader behavior, is it predictable?
What should happen if the readers cannot keep up?
I would probably try implementing the UDP reader loop in C, and implement something like https://github.com/egonelbre/exp/tree/master/ring on both the Go and C sides for communication, to put as little pressure as possible on memory allocation and to make sure I'm not hitting the syscall overhead.
Showing top 50 nodes out of 93 (cum >= 0.13s)
flat flat% sum% cum cum%
3.13s 32.74% 32.74% 3.19s 33.37% syscall.Syscall6
1.70s 17.78% 50.52% 1.70s 17.78% runtime.usleep
1.23s 12.87% 63.39% 1.23s 12.87% runtime.kevent
0.73s 7.64% 71.03% 0.73s 7.64% runtime.mach_semaphore_wait
0.54s 5.65% 76.67% 0.54s 5.65% runtime.mach_semaphore_signal
0.38s 3.97% 80.65% 3.61s 37.76% syscall.Recvfrom
0.26s 2.72% 83.37% 0.26s 2.72% runtime.mach_semaphore_timedwait
0.12s 1.26% 84.62% 0.17s 1.78% fmt.(*pp).doPrintf
0.12s 1.26% 85.88% 0.12s 1.26% runtime.memmove
0.09s 0.94% 86.82% 0.09s 0.94% runtime/internal/atomic.Cas64
How is performance if you use only one goroutine to read packets?
Also, I'm not 100% sure whether net.Conn.Read is defined to be safe for concurrent readers; check with the race detector.
// PacketConn is a generic packet-oriented network connection.
//
// Multiple goroutines may invoke methods on a PacketConn simultaneously.
How is performance if you use only one goroutine to read packets?

Good question. CPU usage is similar, but I can only read around 35k packets out of 280k, so around a 90% drop :(
Did you couple using a single receive() with

for {
    msg := make([]byte, UDPPacketSize)
    nbytes, addr, err := c.ReadFrom(msg)
    if err != nil {
        log.Errorf("%s", err)
        continue
    }
    go handleMessage(addr, msg[:nbytes])
}

? I've found that approach to be highly performant on a net.TCPConn.
Thanks everyone for all the advice!

Which version of Go are you using? Are you setting GOMAXPROCS?

@Roberto Zanotto: forgot to mention, I'm using 1.6, and I still have the boilerplate code that sets GOMAXPROCS to the number of CPUs anyway.

Forgive my possible ignorance, but have you done any tuning of the operating system's receive buffer and of how many packets it will keep before dropping them?

@Jesper Louis Andersen: no, I haven't done any tuning of the OS and I have to admit my ignorance about this. What kind of tuning do you suggest I make? Thanks for the pointer, I'll look into this.

@C Banning: I've just tried now and it seems more packets are dropped with this approach of re-creating the large slice on each loop iteration.

Why are you sending buf to handleMessage instead of &buf? Doesn't passing a slice make a copy of the slice and pass that?

@John McKown: a slice is already a reference to an array, see https://golang.org/doc/effective_go.html#slices

Why do you remake buf in the loop? Why not make it UDPPacketSize before the loop, like you did msg?

@John McKown: to have a clean buffer. It should be possible to optimise this part a little by not creating the second buffer and always reusing the first one, though. I'll give that a go, but I doubt it will change the end result dramatically.

Limiting the number of goroutines to GOMAXPROCS seems silly. Why not bump that up significantly? One CPU can easily handle the tiny amount you're creating, so it probably has no need to create OS threads. Experiment. Try 50, or 1,000.

@Shawn Milochik: old habits, where having too many threads is bad for context switching. However, I've tried bumping the number of goroutines to 1,000 and even 10,000, and although I'm getting significantly better throughput, it still drops around 50% of the packets and only uses one core.
What is your reader behavior, is it predictable?
What should happen if the readers cannot keep up?

@Egon: what do you mean? The reader is the code I posted above. I've put it in a gist for convenience btw: https://gist.github.com/jtblin/18df559cf14438223f93
I would probably try implementing the UDP reader loop in C, and implement something like https://github.com/egonelbre/exp/tree/master/ring on both the Go and C sides for communication, to put as little pressure as possible on memory allocation and to make sure I'm not hitting the syscall overhead.

@Egon: I'd rather not have to mix C and Go. I'd see it as an issue with Go, tbh, if one cannot develop something that reads fast enough from a UDP socket, but at the moment I think it's more a problem with my specific implementation or some other issue.
What is your reader behavior, is it predictable?
Sorry, I meant the "handler"; I used "reader" in my code. Handlers dictate the queue design between packet reading and workload distribution.
There's a constant overhead with syscalls and cgo, due to the runtime differences... In this case we are making tons of those calls, hence you get a high total overhead. I haven't measured them lately, so I'm not sure what the exact numbers are. You also have the Go runtime trying to schedule and coordinate syscalls and goroutines -> more complexity.
There are two ways of reducing it: either avoid it altogether, e.g. by writing the "heavy" loops in C, or make fewer calls, e.g. recvmmsg or some other call that can receive multiple packets at a time.
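As an illustration of the "fewer calls" route, here is a minimal sketch using ReadBatch from golang.org/x/net/ipv4, which uses recvmmsg on Linux so a single syscall can return many packets; the batch size is an arbitrary example and c is assumed to be the net.PacketConn from the original ListenAndReceive:

// Sketch only: wrap the existing connection; ReadBatch needs a *net.UDPConn underneath.
pc := ipv4.NewPacketConn(c)

const batchSize = 32 // illustrative value
msgs := make([]ipv4.Message, batchSize)
for i := range msgs {
    msgs[i].Buffers = [][]byte{make([]byte, UDPPacketSize)}
}

for {
    // One recvmmsg call on Linux can fill up to batchSize messages.
    n, err := pc.ReadBatch(msgs, 0)
    if err != nil {
        log.Errorf("%s", err)
        continue
    }
    for i := 0; i < n; i++ {
        handleMessage(msgs[i].Addr, msgs[i].Buffers[0][:msgs[i].N])
    }
}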
Of course, SO_REUSEPORT should also help with performance + multiple readers.
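Since the standard library doesn't expose SO_REUSEPORT here, one assumed way to get multiple independent readers is to create each worker's socket with golang.org/x/sys/unix, set the option before binding, and hand the descriptor back to the net package; the kernel then spreads incoming packets across the sockets. A Linux-only sketch (listenReusePort is a made-up helper name):

// Sketch only: each worker calls listenReusePort(8181) and runs its own receive loop.
func listenReusePort(port int) (net.PacketConn, error) {
    fd, err := unix.Socket(unix.AF_INET, unix.SOCK_DGRAM, 0)
    if err != nil {
        return nil, err
    }
    // SO_REUSEPORT lets several sockets bind the same addr:port; the kernel
    // load-balances incoming UDP packets between them.
    if err := unix.SetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_REUSEPORT, 1); err != nil {
        unix.Close(fd)
        return nil, err
    }
    if err := unix.Bind(fd, &unix.SockaddrInet4{Port: port}); err != nil { // binds 0.0.0.0:port
        unix.Close(fd)
        return nil, err
    }
    f := os.NewFile(uintptr(fd), "udp-reuseport")
    defer f.Close() // FilePacketConn dups the descriptor, so this copy can be closed
    return net.FilePacketConn(f)
}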
I'm not sure which part will have the biggest effect, but for high-perf systems I tend to prefer building them from the ground up and trying to understand where my cycles are spent. From pprof, it looks like you aren't hitting any allocation or GC issues, yet.
Also, I would test using syscalls directly instead of net.UDP*; I'm not sure what the overhead in the net package is and whether it does something extra for some reason. From a quick skim, it doesn't look like there is much, but it's probably worth testing.
as goroutines could span across multiple cores while still being synchronised no?
Thanks all. The updated code is at https://gist.github.com/jtblin/18df559cf14438223f93.
I'm now reusing the same buffer but it didn't change results significantly.
defer c.Close()
msg := make([]byte, UDPPacketSize)
for {
    nbytes, addr, err := c.ReadFrom(msg[0:])
    if err != nil {
        log.Errorf("%s", err)
        continue
    }
    handleMessage(addr, msg[0:nbytes])
}

Even with the implementation above, which reuses the same buffer and does nothing in handleMessage apart from counting the number of packets for now, I am still not able to get over 150k packets/s with 1024 goroutines, and I'm still only really using one CPU.
as goroutines could span across multiple cores while still being synchronised no?
Not if the critical section represents a dominant fraction of packet processing time. Since your packet handler is a noop, this is probably the case in your test.
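To make that concrete with an assumed handler (the thread only says it counts packets): if the counter is protected by a single mutex, every goroutine serialises on that lock and the work collapses onto one core, whereas an atomic add removes the shared critical section. packetCount and the mutex variant below are hypothetical, not from the gist:

// Assumed handler: the real one only counts packets, per the thread.
var packetCount uint64

// A mutex-guarded version (mu.Lock(); packetCount++; mu.Unlock()) makes the lock
// the dominant cost when the handler does nothing else. The atomic version below
// lets handlers running on different cores proceed in parallel.
func handleMessage(addr net.Addr, msg []byte) {
    atomic.AddUint64(&packetCount, 1)
}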
What is the reason for the message queue?
Since you are already reading a packet inside a separate goroutine, IMHO there is no need to use an additional goroutine.
Just call handleMessage from the receive function.
Handled packets per second are 150k vs ~100k; that's a 50% performance increase...
A possible optimization for the Go version is to use syscall.Recvfrom directly, to avoid the use of a mutex.
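A minimal sketch of that idea, with the wiring (File(), ignoring the sender address) being assumptions rather than code from the thread; note that File() returns a blocking duplicate of the socket, so each such reader loop pins an OS thread:

// Sketch only: c is the *net.UDPConn from net.ListenPacket("udp", addr).
f, err := c.(*net.UDPConn).File() // blocking duplicate of the descriptor
if err != nil {
    log.Errorf("%s", err)
    return
}
defer f.Close()
fd := int(f.Fd())

buf := make([]byte, UDPPacketSize)
for {
    n, _, err := syscall.Recvfrom(fd, buf, 0)
    if err != nil {
        log.Errorf("%s", err)
        continue
    }
    handleMessage(nil, buf[:n]) // sender address ignored in this sketch
}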