I changed the code to write 100 bytes per call instead of 2 bytes, to simulate my real use case.
100 bytes * 200000 times
$ ./server_cpp # CPU(97% real:0.19s user:0.01s sys:0.18s) Mem(max:4kB avg:0kB) pf:0 ./client_cpp
recv system call times: 6557
recv bytes: 20000000
$ ./server_cpp # CPU(99% real:0.36s user:0.15s sys:0.21s) Mem(max:4kB avg:0kB) pf:0 ./client_go
recv system call times: 10571
recv bytes: 20000000
100 bytes * 10000000 times
$ ./server_cpp # CPU(99% real:6.62s user:0.47s sys:6.12s) Mem(max:4kB avg:0kB) pf:0 ./client_cpp
recv system call times: 308790
recv bytes: 1000000000
$ ./server_cpp # CPU(99% real:15.69s user:5.70s sys:9.96s) Mem(max:4kB avg:0kB) pf:0 ./client_go
recv system call times: 530584
recv bytes: 1000000000
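To make the numbers above easier to compare, here is a small sketch that divides the received bytes by the number of recv calls for each run. The `run` type and `bytesPerRecv` helper are names made up for this illustration; the figures are copied from the output above. By my arithmetic this comes to roughly 3050 bytes per recv with client_cpp versus roughly 1892 with client_go in the small run, and roughly 3238 versus 1885 in the large run.

```go
package main

import "fmt"

// run holds one measurement copied from the benchmark output above.
// The type and field names are hypothetical, chosen for this sketch only.
type run struct {
	label     string
	recvCalls int64
	recvBytes int64
}

// bytesPerRecv returns the average number of bytes delivered per recv call.
func bytesPerRecv(r run) float64 {
	return float64(r.recvBytes) / float64(r.recvCalls)
}

func main() {
	runs := []run{
		{"client_cpp, 200000 writes", 6557, 20000000},
		{"client_go,  200000 writes", 10571, 20000000},
		{"client_cpp, 10000000 writes", 308790, 1000000000},
		{"client_go,  10000000 writes", 530584, 1000000000},
	}
	for _, r := range runs {
		fmt.Printf("%-30s %10.1f bytes per recv\n", r.label, bytesPerRecv(r))
	}
}
```

So in both runs the server needs noticeably more recv calls to drain the same amount of data when the Go client is sending.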
The difference shrinks as the number of system calls goes down. My guess is that the Go version is slower because:
1. Its network I/O is based on epoll, which requires more system calls.
2. Each system call has higher overhead in Go than in C. (It requires converting parameter types.)
3. Go's runtime still has room for optimization.
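If the per-call overhead really is the main cost, one thing that might help on the Go side is to issue fewer writes. The sketch below is not my actual client; it assumes a client that writes one 100-byte message at a time and uses a hypothetical `countingWriter` in place of the socket, so it only shows how many times the underlying `Write` would be called when the messages go through a `bufio.Writer`. The 4096-byte buffer size is an arbitrary choice for the illustration.

```go
package main

import (
	"bufio"
	"fmt"
)

// countingWriter is a stand-in for the socket (hypothetical, for this sketch).
// On a real connection each Write call would correspond to a write system call.
type countingWriter struct {
	calls int
	bytes int
}

func (w *countingWriter) Write(p []byte) (int, error) {
	w.calls++
	w.bytes += len(p)
	return len(p), nil
}

// sendBuffered writes count messages of size bytes each through a
// bufio.Writer with the given buffer size and returns the counters.
func sendBuffered(count, size, bufSize int) (countingWriter, error) {
	var sink countingWriter
	bw := bufio.NewWriterSize(&sink, bufSize)
	msg := make([]byte, size)
	for i := 0; i < count; i++ {
		if _, err := bw.Write(msg); err != nil {
			return sink, err
		}
	}
	// Flush pushes out whatever is still sitting in the buffer.
	if err := bw.Flush(); err != nil {
		return sink, err
	}
	return sink, nil
}

func main() {
	sink, err := sendBuffered(200000, 100, 4096)
	if err != nil {
		panic(err)
	}
	fmt.Printf("messages: 200000, underlying writes: %d, bytes: %d\n",
		sink.calls, sink.bytes)
}
```

Whether this closes the gap on a real socket is something I have not measured; it only illustrates that batching small writes reduces the number of calls that reach the connection.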