As Remy judiciously noted, actually using the WriteNB call might help in
getting useful results.
I shouldn't run benchmarks in the early morning, sorry for the noise.
Here's an updated run; I've double-checked the assembly to make sure the
correct RawSyscall was made:
benchmark                         old ns/op new ns/op     delta
BenchmarkTCPOneShot                  173746    142280   -18.11%
BenchmarkTCPOneShot-2                 87490     96902   +10.76%
BenchmarkTCPOneShot-4                 45649     47590    +4.25%
BenchmarkTCPOneShot-8                 32219     31521    -2.17%
BenchmarkTCPOneShot-10               302213    301983    -0.08%
BenchmarkTCPOneShot-12               304442    317153    +4.18%
BenchmarkTCPOneShot-16               309886    312512    +0.85%
BenchmarkTCPOneShotTimeout           198172    214151    +8.06%
BenchmarkTCPOneShotTimeout-2          92151    101076    +9.69%
BenchmarkTCPOneShotTimeout-4          45978     48018    +4.44%
BenchmarkTCPOneShotTimeout-8          32311     31608    -2.18%
BenchmarkTCPOneShotTimeout-10        322381    320882    -0.46%
BenchmarkTCPOneShotTimeout-12        302204    312676    +3.47%
BenchmarkTCPOneShotTimeout-16        315046    305763    -2.95%
BenchmarkTCPPersistent                55874     56794    +1.65%
BenchmarkTCPPersistent-2              28429     32224   +13.35%
BenchmarkTCPPersistent-4              17330     16939    -2.26%
BenchmarkTCPPersistent-8              18793     14062   -25.17%
BenchmarkTCPPersistent-10            303544    302887    -0.22%
BenchmarkTCPPersistent-12            306567    304997    -0.51%
BenchmarkTCPPersistent-16            310066    308193    -0.60%
BenchmarkTCPPersistentTimeout         59690     56994    -4.52%
BenchmarkTCPPersistentTimeout-2       28978     30653    +5.78%
BenchmarkTCPPersistentTimeout-4       17356     17080    -1.59%
BenchmarkTCPPersistentTimeout-8       18847     14125   -25.05%
BenchmarkTCPPersistentTimeout-10     303528    302658    -0.29%
BenchmarkTCPPersistentTimeout-12     306600    304994    -0.52%
BenchmarkTCPPersistentTimeout-16     309747    307902    -0.60%
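
For anyone re-deriving the last column: it is the relative change between
the two timing columns. A minimal sketch of that arithmetic, assuming the
first number is the timing before the change and the second the timing
after it (the `delta' helper is hypothetical and not part of any tool):

```go
package main

import "fmt"

// delta returns the relative change from before to after as a signed
// percentage with two decimals. A negative value means "after" is faster.
// Hypothetical helper, written only to illustrate the arithmetic.
func delta(before, after float64) string {
	return fmt.Sprintf("%+.2f%%", (after-before)/before*100)
}

func main() {
	fmt.Println(delta(173746, 142280)) // BenchmarkTCPOneShot: -18.11%
	fmt.Println(delta(18793, 14062))   // BenchmarkTCPPersistent-8: -25.17%
}
```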
The results are a bit more interesting, provided you're using exactly 8 cores
(beyond eight the behavior becomes spurious) and are restricted to the
`TCPPersistent' use case, where the timings drop by about 25%.
As a side note, whilst some syscalls are non-blocking, they're still not
free. The performance deterioration might just come from not "cheating"
by artificially increasing the thread count.
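
To make the distinction concrete: in Go's syscall package, Syscall tells
the runtime scheduler that the goroutine is entering a call that may block
(which can lead to another thread being used to keep goroutines running),
whereas RawSyscall skips that bookkeeping and is only suitable for calls
that cannot block. A minimal sketch, assuming Linux; `rawWrite' and
`roundTrip' are hypothetical helpers for illustration only and are not the
WriteNB from the CL:

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// rawWrite issues write(2) through RawSyscall, i.e. without notifying the
// scheduler. The caller must ensure fd is non-blocking. Hypothetical helper.
func rawWrite(fd int, p []byte) (int, error) {
	if len(p) == 0 {
		return 0, nil
	}
	n, _, errno := syscall.RawSyscall(syscall.SYS_WRITE,
		uintptr(fd), uintptr(unsafe.Pointer(&p[0])), uintptr(len(p)))
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}

// roundTrip writes msg into a pipe with rawWrite and reads it back.
func roundTrip(msg string) (string, error) {
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return "", err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	// Make the write end non-blocking so the raw call cannot stall a thread.
	if err := syscall.SetNonblock(fds[1], true); err != nil {
		return "", err
	}
	n, err := rawWrite(fds[1], []byte(msg))
	if err != nil {
		return "", err
	}
	buf := make([]byte, n)
	m, err := syscall.Read(fds[0], buf)
	if err != nil {
		return "", err
	}
	return string(buf[:m]), nil
}

func main() {
	got, err := roundTrip("ping")
	fmt.Println(got, err)
}
```

Even in this form the write still crosses into the kernel, so it is cheaper
than the scheduler-aware path but not free.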
Sebastien
https://codereview.appspot.com/7126043/