Here are the profiles for GOMAXPROCS=1 and GOMAXPROCS=#ofCPUs, respectively, along with the profiling graph.
The previously heavy usage of "syscall.Syscall" is much lower now, which I consider a very good change. However, the overall performance stays almost the same. The bottleneck now seems to be "runtime.futex".
1) GOMAXPROCS = 1
(pprof) top
Total: 3216 samples
1406 43.7% 43.7% 1406 43.7% runtime.futex
1216 37.8% 81.5% 1216 37.8% syscall.RawSyscall
324 10.1% 91.6% 327 10.2% syscall.Syscall
152 4.7% 96.3% 152 4.7% bytes.IndexByte
30 0.9% 97.3% 30 0.9% scanblock
6 0.2% 97.5% 7 0.2% sweepspan
5 0.2% 97.6% 6 0.2% syscall.Syscall6
3 0.1% 97.7% 4 0.1% MCentral_Alloc
3 0.1% 97.8% 7 0.2% net/textproto.(*Reader).ReadMIMEHeader
3 0.1% 97.9% 9 0.3% runtime.MCache_Alloc
(pprof) top --cum
Total: 3216 samples
2 0.1% 0.1% 3197 99.4% schedunlock
1 0.0% 0.1% 2331 72.5% net/http.(*conn).serve
1406 43.7% 43.8% 1406 43.7% runtime.futex
1 0.0% 43.8% 1401 43.6% runtime.entersyscall
0 0.0% 43.8% 1400 43.5% type..eq.[32]string
0 0.0% 43.8% 1398 43.5% runtime.futexwakeup
0 0.0% 43.8% 1398 43.5% runtime.notewakeup
0 0.0% 43.8% 1217 37.8% bufio.(*Writer).Flush
0 0.0% 43.8% 1217 37.8% net.(*conn).Write
0 0.0% 43.8% 1217 37.8% net.(*netFD).Write
2) GOMAXPROCS = number of CPUs
(pprof) top
Total: 4550 samples
1351 29.7% 29.7% 1351 29.7% runtime.futex
1181 26.0% 55.6% 1181 26.0% syscall.RawSyscall
588 12.9% 68.6% 588 12.9% runtime.usleep
439 9.6% 78.2% 443 9.7% syscall.Syscall
320 7.0% 85.3% 320 7.0% syscall.Syscall6
39 0.9% 86.1% 84 1.8% sweepspan
37 0.8% 86.9% 37 0.8% syscall.RawSyscall6
30 0.7% 87.6% 30 0.7% bytes.IndexByte
27 0.6% 88.2% 179 3.9% scanblock
25 0.5% 88.7% 25 0.5% runtime.memmove
(pprof) top --cum
Total: 4550 samples
0 0.0% 0.0% 3518 77.3% schedunlock
5 0.1% 0.1% 1910 42.0% net/http.(*conn).serve
1351 29.7% 29.8% 1351 29.7% runtime.futex
0 0.0% 29.8% 1188 26.1% net/http.(*response).finishRequest
0 0.0% 29.8% 1187 26.1% bufio.(*Writer).Flush
3 0.1% 29.9% 1187 26.1% net.(*conn).Write
1 0.0% 29.9% 1184 26.0% net.(*netFD).Write
1181 26.0% 55.8% 1181 26.0% syscall.RawSyscall
2 0.0% 55.9% 1180 25.9% syscall.WriteNB
0 0.0% 55.9% 1178 25.9% syscall.writeNB
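In the cumulative listing above, bufio.(*Writer).Flush, net.(*conn).Write and syscall.RawSyscall have nearly identical sample counts, which is consistent with each flush ending in a single write on the connection. The sketch below only illustrates that buffering behaviour with a counting writer; it is not net/http's code, and countingWriter and writeResponse are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter records how often Write is called and how many bytes it
// receives. It is a hypothetical stand-in for a network connection.
type countingWriter struct {
	calls int
	bytes int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	c.bytes += len(p)
	return len(p), nil
}

// writeResponse sends the pieces through a bufio.Writer and flushes once.
// It returns the number of Write calls and bytes seen by the underlying
// writer.
func writeResponse(pieces []string) (calls, n int, err error) {
	dst := &countingWriter{}
	w := bufio.NewWriter(dst) // default buffer size
	for _, p := range pieces {
		if _, err := io.WriteString(w, p); err != nil {
			return dst.calls, dst.bytes, err
		}
	}
	if err := w.Flush(); err != nil {
		return dst.calls, dst.bytes, err
	}
	return dst.calls, dst.bytes, nil
}

func main() {
	pieces := []string{
		"HTTP/1.1 200 OK\r\n",
		"Content-Length: 5\r\n",
		"\r\n",
		"hello",
	}
	calls, n, err := writeResponse(pieces)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d pieces, %d underlying Write call(s), %d bytes\n",
		len(pieces), calls, n)
}
```

As long as the response fits in the buffer, the small writes are coalesced and the underlying writer sees one call per flush, so the remaining write cost would scale with the number of responses rather than with the number of small writes.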