Poor multi-core performance for simple TCP server


j...@tsp.io

unread,
Jun 15, 2015, 5:03:36 PM6/15/15
to golan...@googlegroups.com
I recently decided to benchmark simple TCP server implementations written in different languages against one another[1], and am seeing some strange results for the Go server. In particular, as the number of cores increases, the performance of the Go server *decreases*. Profiling shows that the majority of time is spent in the system call epollwait, but I don't have a good explanation for why that would be the case, given that the benchmark client replies immediately to all messages. GOMAXPROCS is set appropriately[3], and the problem also occurs on go tip[4]. It might be relevant that the performance drop starts once multiple NUMA nodes are in use (the server running the benchmarks has NUMA nodes with 10 cores each), though this should not matter as there is no communication between the cores for the benchmarked paths. This is also evident from the fact that the Rust and C servers don't see this performance drop.

As far as I can tell, I haven't made any obvious mistakes in my server code, but I'd be very grateful if someone else could take a look[2] and see if there's something fundamentally wrong with it.
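
For reference, the server in [2] is roughly shaped like the sketch below (simplified here, and the 4-byte echo loop is only an approximation of the actual protocol): GOMAXPROCS is set to the number of CPUs, and each accepted connection gets its own goroutine running a tight read/reply loop.

package main

import (
	"io"
	"net"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())

	ln, err := net.Listen("tcp", ":2222") // port chosen arbitrarily for this sketch
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		// One goroutine per connection: read a 4-byte message, echo it back, repeat.
		go func(c net.Conn) {
			defer c.Close()
			buf := make([]byte, 4)
			for {
				if _, err := io.ReadFull(c, buf); err != nil {
					return
				}
				if _, err := c.Write(buf); err != nil {
					return
				}
			}
		}(conn)
	}
}
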
Cheers,

Jeffrey 'jf' Lim

unread,
Jun 15, 2015, 5:25:46 PM6/15/15
to golang-nuts
On Tue, Jun 16, 2015 at 5:03 AM, <j...@tsp.io> wrote:
I recently decided to benchmark simple TCP server implementations written in different languages against one another[1], and am seeing some strange results for the Go server. In particular, as the number of cores increases, the performance of the Go server *decreases*. Profiling shows that the majority of time is spent in the system call epollwait, but I don't have a good explanation for why that would be the case, given that the benchmark client replies immediately to all messages. GOMAXPROCS is set appropriately[3],


You mean https://github.com/jonhoo/volley/pull/1#issuecomment-112177356 ("In #1, @tsenart pointed out that increasing GOMAXPROCS might improve the performance of the Go server significantly. This issue tracks that claim.")? Unless you're familiar (and even then!) with Go's scheduling, you shouldn't be manually changing GOMAXPROCS.

Goroutines aren't os threads. When you expand the number of GOMAXPROCS willy-nilly, you're actually losing the point and advantages of the Go runtime/scheduler. http://morsmachine.dk/go-scheduler

-jf


 
and the problem also occurs on go tip[4]. It might be relevant that the performance drop starts once multiple NUMA nodes are in use (the server running the benchmarks has NUMA nodes with 10 cores each), though this should not matter as there is no communication between the cores for the benchmarked paths. This is also evident from the fact that the Rust and C servers don't see this performance drop.

As far as I can tell, I haven't made any obvious mistakes in my server code, but I'd be very grateful if someone else could take a look[2] and see if there's something fundamentally wrong with it.
Cheers,


Jon Gjengset

unread,
Jun 15, 2015, 5:48:04 PM6/15/15
to golang-nuts
Jeffrey 'jf' Lim wrote:
> Unless you're familiar (and even then!) with Go's scheduling, you
> shouldn't be manually changing GOMAXPROCS.

I am familiar with Go scheduling, and I am aware that setting GOMAXPROCS
to anything above 1 (the default) incurs additional overhead compared
to leaving it as it is. However, keeping it at one significantly reduces
the amount of parallelism you can extract from the system.

A simple benchmark[1] shows that by not setting GOMAXPROCS, performance
remains constant (as you would expect) as the number of cores increases.

> Goroutines aren't os threads. When you expand the number of GOMAXPROCS
> willy-nilly, you're actually losing the point and advantages of the Go
> runtime/scheduler. http://morsmachine.dk/go-scheduler

That is a wildly misleading statement. It's not as though increasing
GOMAXPROCS will always make your application slower -- if this were
true, Go would be doomed as a systems language, as it would be unable to
parallelize compute-heavy operations. The Go FAQ specifically states[2]:

Programs that perform parallel computation should benefit from
an increase in GOMAXPROCS.

It is true that the scheduler doesn't handle many OS threads as well as
it could:

Go's goroutine scheduler is not as good as it needs to be. In
the future, it should recognize such cases and optimize its use
of OS threads.

But this doesn't mean that you should never use more than one core. For
this benchmark for example, the threads do in fact perform computation,
as everything is happening over loopback, and the socket operations
should thus be fast. There is also no communication between goroutines,
so few context switches should be necessary.

For completeness, [1] also gives the performance for "gohalfmax", in
which GOMAXPROCS is set to half of the number of CPUs. As you can see,
performance does not improve.

[1]: https://jon.thesquareplanet.com/share/volley-go-nomaxprocs.png
[2]: https://golang.org/doc/faq#Why_no_multi_CPU

Jeffrey 'jf' Lim

unread,
Jun 15, 2015, 6:42:17 PM6/15/15
to golang-nuts
On Tue, Jun 16, 2015 at 5:47 AM, Jon Gjengset <j...@thesquareplanet.com> wrote:
Jeffrey 'jf' Lim wrote:
> Unless you're familiar (and even then!) with Go's scheduling, you
> shouldn't be manually changing GOMAXPROCS.

I am familiar with Go scheduling, and I am aware that setting GOMAXPROCS
to anything above 1 (the default) incurs additional overhead compared
to leaving it as it is. However, keeping it at one significantly reduces
the amount of parallelism you can extract from the system.

A simple benchmark[1] shows that by not setting GOMAXPROCS, performance
remains constant (as you would expect) as the number of cores increases.

> Goroutines aren't os threads. When you expand the number of GOMAXPROCS
> willy-nilly, you're actually losing the point and advantages of the Go
> runtime/scheduler. http://morsmachine.dk/go-scheduler

That is a wildly misleading statement. It's not as though increasing
GOMAXPROCS will always make your application slower --


Note the "willy-nilly" qualifier in there. That's the impression that I get when I read that (https://github.com/jonhoo/volley/issues/2#issue-88531006) "I'm currently running experiments with GOMAXPROCS = 4x#CPUs to see whether this makes a difference or not.". Note that in your introduction post, you referred to this url, saying that "GOMAXPROCS is set appropriately".

 
if this were
true, Go would be doomed as a systems language,


I don't know about you, but I've stopped thinking of Go as a systems language because my own definition, vs what Go presents, is different. I could elaborate more, but I think the discussion at https://www.youtube.com/watch?v=BBbv1ej0fFo#t=2m50s explains things better than I can.


 
as it would be unable to
parallelize compute-heavy operations. The Go FAQ specifically states[2]:

        Programs that perform parallel computation should benefit from
        an increase in GOMAXPROCS.

It is true that the scheduler doesn't handle many OS threads as well as
it could:

        Go's goroutine scheduler is not as good as it needs to be. In
        the future, it should recognize such cases and optimize its use
        of OS threads.

But this doesn't mean that you should never use more than one core. For
this benchmark for example, the threads do in fact perform computation,
as everything is happening over loopback, and the socket operations
should thus be fast. There is also no communication between goroutines,
so few context switches should be necessary.


They might not (?) be "necessary", but beyond a certain length of time, context switches (or goroutines "swops") will happen, just because it's a scheduler decision.

-jf


 
For completeness, [1] also gives the performance for "gohalfmax", in
which GOMAXPROCS is set to half of the number of CPUs. As you can see,
performance does not improve.

  [1]: https://jon.thesquareplanet.com/share/volley-go-nomaxprocs.png
  [2]: https://golang.org/doc/faq#Why_no_multi_CPU

andrewc...@gmail.com

unread,
Jun 15, 2015, 6:46:03 PM6/15/15
to golan...@googlegroups.com, j...@tsp.io
Test with the Go tip version from git; it performs dramatically better with multiple cores. In fact, the 1.5 release is making the default GOMAXPROCS equal to the number of cores.

Jon Gjengset

unread,
Jun 15, 2015, 6:47:20 PM6/15/15
to golan...@googlegroups.com
andrewc...@gmail.com wrote:
> Test with the Go tip version from git; it performs dramatically better with
> multiple cores. In fact, the 1.5 release is making the default GOMAXPROCS
> equal to the number of cores.

Unfortunately, as mentioned in the initial e-mail, the problem also
occurs on tip[1].

[1]: https://github.com/jonhoo/volley/issues/3

andrewc...@gmail.com

unread,
Jun 15, 2015, 6:49:13 PM6/15/15
to golan...@googlegroups.com, j...@thesquareplanet.com
My mistake; I skim-read looking for 1.4/1.5 and didn't see it.

Jon Gjengset

unread,
Jun 15, 2015, 6:55:37 PM6/15/15
to golang-nuts
> Note the "willy-nilly" qualifier in there. That's the impression that
> I get when I read that
> (https://github.com/jonhoo/volley/issues/2#issue-88531006) "I'm
> currently running experiments with GOMAXPROCS = 4x#CPUs to see whether
> this makes a difference or not.". Note that in your introduction post,
> you referred to this url, saying that "GOMAXPROCS is set
> appropriately".

When I say that GOMAXPROCS is set appropriately, I'm referring to the
fact that GOMAXPROCS is set to the number of CPUs[1]. I tried increasing
this to see how it would affect performance, and increasing it by a
small factor seemed like the right thing to do to get a feel for how it
would affect performance. If you have a better proposal for what values
I should try, I'm all ears.

> I don't know about you, but I've stopped thinking of Go as a systems
> language because my own definition, vs what Go presents, is different. I
> could elaborate more, but I think the discussion at
> https://www.youtube.com/watch?v=BBbv1ej0fFo#t=2m50s explains things better
> than I can.

Haven't seen that before, but thanks, I'll take a look.

I'm not sure I'm ready to write Go off as a systems language just yet,
as it's not clear that having a runtime excludes you from being usable
for systems programming. For example, having a garbage collector can
significantly help scalability by reducing the inter-CPU communication
required to do reference counting.

> They might not (?) be "necessary", but beyond a certain length of
> time, context switches (or goroutines "swops") will happen, just
> because it's a scheduler decision.

Oh, sure, but "accidental" context switches like these should be rare,
and should not impact performance that significantly. If goroutines were
moving between threads and cores regularly, this would completely trash
CPU caches, killing performance. My _guess_ is that the Go scheduler,
much like the OS scheduler, doesn't move computation around without a
good reason. It tries to keep things where they are unless cores are
sitting idle. As long as this is true, this kind of context switching
shouldn't be a major concern as long as there is enough work to do.

Jon

[1]: https://github.com/jonhoo/volley/blob/master/servers/go/main.go#L26

Jeffrey 'jf' Lim

unread,
Jun 15, 2015, 7:17:34 PM6/15/15
to golang-nuts
On Tue, Jun 16, 2015 at 6:55 AM, Jon Gjengset <j...@thesquareplanet.com> wrote:
> Note the "willy-nilly" qualifier in there. That's the impression that
> I get when I read that
> (https://github.com/jonhoo/volley/issues/2#issue-88531006) "I'm
> currently running experiments with GOMAXPROCS = 4x#CPUs to see whether
> this makes a difference or not.". Note that in your introduction post,
> you referred to this url, saying that "GOMAXPROCS is set
> appropriately".

When I say that GOMAXPROCS is set appropriately, I'm referring to the
fact that GOMAXPROCS is set to the number of CPUs[1].


Ok. Sorry about that. As explained, that was not the impression I got. Setting GOMAXPROCS to the number of CPUs is a fair choice.

 
I tried increasing
this to see how it would affect performance, and increasing it by a
small factor seemed like the right thing to do to get a feel for how it
would affect performance. If you have a better proposal for what values
I should try, I'm all ears.


Sure, for experimentation's sake, try anything you want. There are some logical values to try... as well as some I would just stay away from. I would try anything from 1 up to the number of CPUs (but that's just what seems logical). Having said this, a lot of tuning really comes down to your workload.

 

> I don't know about you, but I've stopped thinking of Go as a systems
> language because my own definition, vs what Go presents, is different. I
> could elaborate more, but I think the discussion at
> https://www.youtube.com/watch?v=BBbv1ej0fFo#t=2m50s explains things better
> than I can.

Haven't seen that before, but thanks, I'll take a look.

I'm not sure I'm ready to write Go off as a systems language just yet,
as it's not clear that having a runtime excludes you from being usable
for systems programming. For example, having a garbage collector can
significantly help scalability by reducing the inter-CPU communication
required to do reference counting.

> They might not (?) be "necessary", but beyond a certain length of
> time, context switches (or goroutines "swops") will happen, just
> because it's a scheduler decision.

Oh, sure, but "accidental" context switches like these should be rare,
and should not impact performance that significantly. If goroutines were
moving between threads and cores regularly, this would completely trash
CPU caches, killing performance. My _guess_ is that the Go scheduler,
much like the OS scheduler, doesn't move computation around without a
good reason. It tries to keep things where they are unless cores are
sitting idle. As long as this is true, this kind of context switching
shouldn't be a major concern as long as there is enough work to do.


And I'm sure you would be right. As per http://morsmachine.dk/go-scheduler, the Go runtime schedules goroutines on top of operating system threads, and really doesn't move them unless necessary.


-jf



Jon

  [1]: https://github.com/jonhoo/volley/blob/master/servers/go/main.go#L26

Jon Gjengset

unread,
Jun 15, 2015, 7:23:59 PM6/15/15
to golang-nuts
> > When I say that GOMAXPROCS is set appropriately, I'm referring to the
> > fact that GOMAXPROCS is set to the number of CPUs[1].
>
> Ok. Sorry about that. As explained, that was not what I got. Setting
> GOMAXPROCS to the number of CPUs is a fair number.

No worries. It wasn't as clear as it should have been.

> > I tried increasing this to see how it would affect performance, and
> > increasing it by a small factor seemed like the right thing to do to
> > get a feel for how it would affect performance. If you have a better
> > proposal for what values I should try, I'm all ears.
> >
>
> Sure, for experimentation's sake, try anything you want. There are
> some logical good values to try... as well as some I would just stay
> away from. I would try anything from 1 upwards to the number of cpus
> (but that's just logical). Having said this, a lot of tuning really
> has to do with your workload.

In this case I was specifically trying to see whether increasing
GOMAXPROCS would (counterintuitively) improve the performance of the
application after one GitHub user claimed that it might. Investigating
lower values (1 < x < #CPUs) might also be worthwhile, though I doubt
that that change alone will somehow make the performance curve suddenly
match that of Rust and C, unfortunately.

Jon

Tamás Gulácsi

unread,
Jun 16, 2015, 1:02:52 AM6/16/15
to golan...@googlegroups.com
Have you pprof'd that Go server? I see nothing obvious; maybe that buf could be sync.Pool'd or constant-allocated (var arr [4]byte; buf = arr[:]).
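
Roughly, those two suggestions would look like this (hypothetical handler names, assuming the 4-byte echo loop; the real code in the linked server may differ):

package server

import (
	"io"
	"net"
	"sync"
)

// Option 1: a fixed-size array, so buf never needs its own heap allocation.
func handleFixed(c net.Conn) {
	defer c.Close()
	var arr [4]byte
	buf := arr[:]
	for {
		if _, err := io.ReadFull(c, buf); err != nil {
			return
		}
		if _, err := c.Write(buf); err != nil {
			return
		}
	}
}

// Option 2: a shared sync.Pool, amortizing buffer allocations across connections.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 4) },
}

func handlePooled(c net.Conn) {
	defer c.Close()
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	for {
		if _, err := io.ReadFull(c, buf); err != nil {
			return
		}
		if _, err := c.Write(buf); err != nil {
			return
		}
	}
}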

Jon Gjengset

unread,
Jun 16, 2015, 10:56:13 AM6/16/15
to golan...@googlegroups.com
Tamás Gulácsi wrote:
> Have you pprof'd that Go server?

Yes. See the results at
https://github.com/jonhoo/volley/blob/master/servers/go/profile.svg
You probably need to right-click and "open image in new tab" to see it
properly. Essentially, most of the time is spent in epollwait, which is
strange given that the server should pretty much never be waiting for
input (the clients reply immediately, there are a lot of them, and all
the traffic is going over localhost).

> I see nothing obvious; maybe that buf could be sync.Pool'd or
> constant-allocated (var arr [4]byte; buf = arr[:]).

buf should (in theory) be automatically stack-allocated since it does
not escape, but unfortunately it seems as though this is not the case in
go 1.4.1..

$ go build -o go -gcflags=-m main.go
...
./main.go:51: make([]byte, 4) escapes to heap
...

That said, I don't think that should cause much of a performance penalty. The
allocation only happens once per *client*, not once per request, so it is
pretty rare, and it is also not counted in the performance benchmark.

Jon

James Bardin

unread,
Jun 16, 2015, 10:56:46 AM6/16/15
to golan...@googlegroups.com, j...@tsp.io

The Go server here is doing a lot more than the other two (well I think so, I'm not really familiar with rust yet).

The C server is using blocking reads and writes directly on each socket, each in their own thread, eschewing any sort of poll/epoll/select overhead. The Go server however needs to wait for epoll to be triggered and notify the network Read. This overhead goes up dramatically when the epoll object is running in another thread, on another numa core. 

There's not really a good way to manage this in Go for now. My own preference is to use multiple Go processes, each on their own CPU, treating the host as a small shared memory cluster. The new option for SO_REUSEPORT on Linux makes this possible without any sort of load balancing layer. IIRC there's a couple issues currently open to make that easier to use in Go (SO_REUSEPORT is not in /x/net or /x/sys/unix either). 
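
For illustration, a rough sketch of the SO_REUSEPORT approach (names made up; the option value is hard-coded because the syscall package does not export SO_REUSEPORT): each process builds its own listener on the same port, and the kernel spreads incoming connections across them.

package server

import (
	"net"
	"os"
	"syscall"
)

// newReusePortListener creates a TCP listener with SO_REUSEPORT set, so
// several independent processes can bind the same address and let the kernel
// spread incoming connections across them. Linux-only; 15 is SO_REUSEPORT.
func newReusePortListener(addr string) (net.Listener, error) {
	laddr, err := net.ResolveTCPAddr("tcp4", addr)
	if err != nil {
		return nil, err
	}
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, syscall.IPPROTO_TCP)
	if err != nil {
		return nil, err
	}
	if err := syscall.SetsockoptInt(fd, syscall.SOL_SOCKET, 15, 1); err != nil {
		syscall.Close(fd)
		return nil, err
	}
	sa := &syscall.SockaddrInet4{Port: laddr.Port}
	copy(sa.Addr[:], laddr.IP.To4())
	if err := syscall.Bind(fd, sa); err != nil {
		syscall.Close(fd)
		return nil, err
	}
	if err := syscall.Listen(fd, syscall.SOMAXCONN); err != nil {
		syscall.Close(fd)
		return nil, err
	}
	f := os.NewFile(uintptr(fd), "reuseport-listener")
	defer f.Close() // net.FileListener dups the descriptor
	return net.FileListener(f)
}

Each Go process would then call something like this to get its own listener, be pinned to its own CPUs (e.g. with taskset), and accept connections independently.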

Jon Gjengset

unread,
Jun 16, 2015, 11:10:25 AM6/16/15
to golan...@googlegroups.com
James Bardin wrote:
> The Go server here is doing a lot more than the other two (well I
> think so, I'm not really familiar with rust yet).

The Rust implementation should behave fairly similarly to the C one;
Rust no longer has M:N green threads.

> The C server is using blocking reads and writes directly on each
> socket, each in their own thread, eschewing any sort of
> poll/epoll/select overhead. The Go server however needs to wait for
> epoll to be triggered and notify the network Read. This overhead goes
> up dramatically when the epoll object is running in another thread, on
> another numa core.

Ah, yes, this makes a lot of sense.
Could this cost be amortized to some degree by having more threads do
the polling, or is this a fundamental scalability limitation of polling
compared to blocking reads? If it is, it seems like applications with
strict latency requirements would either need a different way of
handling sockets, or would have to abandon Go for the core application
loops :/

It would be interesting to compare against a C/Rust/... server that also
used epoll, but I'm unfortunately a bit tight on time at the moment. If
anyone wants to give it a shot, I'd be open to accepting a pull request!

> There's not really a good way to manage this in Go for now.

Do you know if there are any plans to improve this situation?

> My own preference is to use multiple Go processes, each on their own
> CPU, treating the host as a small shared memory cluster. The new
> option for SO_REUSEPORT on Linux makes this possible without any sort
> of load balancing layer. IIRC there's a couple issues currently open
> to make that easier to use in Go (SO_REUSEPORT is not in /x/net or
> /x/sys/unix either).

Mmm, that could work. In a sense, it is similar to a pre-forking server,
except that you do the "forking" before the server even starts. I can
probably try to cook up a modification of the Go server for the
benchmark that just spawns many copies of itself and see how that
performs. It might take some time before I get the time to sit down and
do it though, so if anyone has an abundance of free time, PRs are welcome
as always.

Cheers,
Jon

Naoki INADA

unread,
Jun 16, 2015, 2:08:49 PM6/16/15
to golan...@googlegroups.com, j...@thesquareplanet.com

> There's not really a good way to manage this in Go for now.

Do you know if there are any plans to improve this situation?


Go's file I/O is blocking I/O, and Go can use OS threads to handle blocking I/O.
I've sent a pull request that uses file I/O.
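
A rough sketch of that approach (hypothetical name; see the pull request for the real change). In Go 1.4, (*net.TCPConn).File() returns a dup'ed descriptor in blocking mode, so reads and writes on it are plain blocking syscalls that bypass the netpoller; the runtime hands over an OS thread for the duration:

package server

import (
	"io"
	"net"
)

// handleBlocking reads and writes through the *os.File returned by
// (*net.TCPConn).File(), so every read/write is a plain blocking syscall
// rather than going through the netpoller.
func handleBlocking(conn *net.TCPConn) {
	f, err := conn.File()
	if err != nil {
		conn.Close()
		return
	}
	conn.Close() // f is a dup; the original descriptor is no longer needed
	defer f.Close()

	var buf [4]byte
	for {
		if _, err := io.ReadFull(f, buf[:]); err != nil {
			return
		}
		if _, err := f.Write(buf[:]); err != nil {
			return
		}
	}
}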

silviun...@gmail.com

unread,
Jun 17, 2015, 1:22:42 PM6/17/15
to golan...@googlegroups.com, j...@thesquareplanet.com
Hi Naoki and Jon

Thanks a lot to you for doing the PR, and to Jon for merging it and posting the new benchmarks today with the clearly improved performance.

I have a question for Naoki, particularly since I know he collaborated on some of the TechEmpower Framework Benchmarks: do you think the original issues in the Volley tests are fundamentally the same ones affecting HTTP server performance (where it does not scale past 4-8 cores) in TechEmpower's recent tests with 40 cores?

Cheers
s.

di3go.b...@gmail.com

unread,
Jun 17, 2015, 1:40:31 PM6/17/15
to golan...@googlegroups.com, silviun...@gmail.com, j...@thesquareplanet.com
Nice question!
Another thing would be to test again with Go 1.5, which has better goroutine performance.
Too bad Go is still falling behind C and Rust.

di3go.b...@gmail.com

unread,
Jun 17, 2015, 4:05:47 PM6/17/15
to golan...@googlegroups.com, j...@thesquareplanet.com, silviun...@gmail.com, di3go.b...@gmail.com
Jon updated the benchmark with Go tip and blocking I/O; now Go has the same kind of performance as Rust and C.

di3go.b...@gmail.com

unread,
Jun 17, 2015, 8:32:43 PM6/17/15
to golan...@googlegroups.com, di3go.b...@gmail.com, j...@thesquareplanet.com, silviun...@gmail.com
Can anyone help with the last question about the performance of Go? https://github.com/jonhoo/volley/issues/3
Like Naoki said, Go was slow because of the non-blocking I/O; with blocking I/O, Go is as fast as C and Rust.
The problem now is the latency, which is very high.
I think the problem happens because Go starts with GOMAXPROCS = numcpu, while C and Rust start a new thread for each connection.
So in the end, Go has fewer threads, and those threads are blocked, increasing the latency.
In the raw data from the tests, Go, C and Rust have almost the same performance when the number of connections equals the number of threads:

go-blocking-tip 40 40 39us 5.89us 1000000
rust            40 40 41us 6.68us 1000000
c-threaded      40 40 40us 7.91us 1000000
Does what I said make sense, or am I wrong?

James Bardin

unread,
Jun 17, 2015, 8:54:24 PM6/17/15
to di3go.b...@gmail.com, golan...@googlegroups.com, j...@thesquareplanet.com, silviun...@gmail.com
Yes, a dedicated thread per connection would probably help. Use runtime.LockOSThread in each goroutine. 
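
A minimal sketch of that (hypothetical name; the 4-byte echo loop is an assumption), spawned per accepted connection with go handleLocked(conn):

package server

import (
	"io"
	"net"
	"runtime"
)

// handleLocked pins the connection's goroutine to its own OS thread for the
// lifetime of the connection, mirroring the thread-per-connection C server.
func handleLocked(conn net.Conn) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	defer conn.Close()

	var buf [4]byte
	for {
		if _, err := io.ReadFull(conn, buf[:]); err != nil {
			return
		}
		if _, err := conn.Write(buf[:]); err != nil {
			return
		}
	}
}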

I'm really not seeing the point of this benchmark, though. These servers do so little that, if you perfectly tune each of these implementations, you're basically comparing the same OS threads making syscalls.

Naoki INADA

unread,
Jun 17, 2015, 11:22:22 PM6/17/15
to golan...@googlegroups.com, di3go.b...@gmail.com, silviun...@gmail.com, j...@thesquareplanet.com

On Thursday, June 18, 2015 at 9:32:43 AM UTC+9, di3go.b...@gmail.com wrote:
Can anyone help with the last question about the performance of Go? https://github.com/jonhoo/volley/issues/3
Like Naoki said, Go was slow because of the non-blocking I/O; with blocking I/O, Go is as fast as C and Rust.
The problem now is the latency, which is very high.
I think the problem happens because Go starts with GOMAXPROCS = numcpu, while C and Rust start a new thread for each connection.
So in the end, Go has fewer threads, and those threads are blocked, increasing the latency.

Go's net API is tightly coupled with the scheduler.
I think Go's scheduler doesn't scale to such a high rate of goroutine switching, for some reason.

I hope Go 1.5's tracer will show us where the bottleneck is.

Naoki INADA

unread,
Jun 17, 2015, 11:28:20 PM6/17/15
to golan...@googlegroups.com, silviun...@gmail.com, j...@thesquareplanet.com
On Thursday, June 18, 2015 at 2:22:42 AM UTC+9, silviun...@gmail.com wrote:
Hi Naoki and Jon

Thanks a lot to you for doing the PR, and to Jon for merging it and posting the new benchmarks today with the clearly improved performance.

I have a question for Naoki, particularly since I know he collaborated on some of the TechEmpower Framework Benchmarks: do you think the original issues in the Volley tests are fundamentally the same ones affecting HTTP server performance (where it does not scale past 4-8 cores) in TechEmpower's recent tests with 40 cores?

I don't know what happened in the PEAK environment.
I tested on an EC2 c4.8xlarge (36 cores), and Go scales further there.

When I tested, Go 1.5's performance was poor and unstable.
But Go 1.5's concurrent GC has improved and stabilized a lot recently.
In particular, rsc's commits yesterday make it more stable.

So I'll try to investigate and optimize Go 1.5 again.

Dave Cheney

unread,
Jun 18, 2015, 1:44:16 AM6/18/15
to golan...@googlegroups.com
Thanks for continuing to work on this. Please continue to report your results, and especially your bug reports.

Wojciech S. Czarnecki

unread,
Jun 18, 2015, 4:54:14 AM6/18/15
to James Bardin, di3go.b...@gmail.com, golan...@googlegroups.com, j...@thesquareplanet.com, silviun...@gmail.com
On 2015-06-17, at 20:54:13,
James Bardin <j.ba...@gmail.com> wrote:

> Yes, a dedicated thread per connection would probably help. Use
> runtime.LockOSThread in each goroutine.
>
> I'm really not seeing the point of this benchmark, though. These servers do
> so little that, if you perfectly tune each of these implementations, you're
> basically comparing the same OS threads making syscalls.

(For me) this benchmark shows whether a language's compiler/runtime adds
significant overhead or not. We already know that the Go 1.4.x implementation
is far behind the others, but the 'go-blocking-tip' one is on par with C's
and Rust's. That's good.

--
Wojciech S. Czarnecki
^oo^ OHIR-RIPE

James Bardin

unread,
Jun 18, 2015, 10:22:33 AM6/18/15
to Wojciech S. Czarnecki, di3go.b...@gmail.com, golan...@googlegroups.com, j...@thesquareplanet.com, silviun...@gmail.com

On Thu, Jun 18, 2015 at 4:53 AM, Wojciech S. Czarnecki <oh...@fairbe.org> wrote:

(For me) this benchmark shows whether a language's compiler/runtime adds
significant overhead or not. We already know that the Go 1.4.x implementation
is far behind the others, but the 'go-blocking-tip' one is on par with C's
and Rust's. That's good.

This is good, it's just that the end result *comparison* between languages doesn't show anything useful. I already know that a dedicated thread per socket can make a lot of syscalls very quickly. The go-blocking server for example gets as much of the runtime out of the way as it can and just calls read and write on the socket file descriptor. The way to make it faster is to do less, and once all the benchmarks have removed as much as possible, they are all basically the same. (Why not take this further and let threads use sched_setaffinity to locate themselves closer to the interface and avoid interrupts (which would of course be pinned to their own cores), or bypass the kernel with user space packet handling and RDMA? Ok, the parentheses probably won't shield this from the logical fallacy arguments, but where exactly do you draw the line for benchmark-specific code?) 

Don't get me wrong, I think the optimization process *within* the language benchmark is of interest, especially since this produces some nice graphs that highlight the problem with a single scheduler and poller on a NUMA system. This is a well known problem in general, and Dmitry (+dvyukov) wrote up a proposal last year for a NUMA aware scheduler. 

Go works very well within a single NUMA node, and within the limits of a single poller, and I would love to have a way to scale up further. Having a smooth method to use SO_REUSEPORT is one way to do that for networking. A more general solution of a runtime that can scale up to make use of current large[ish] systems would be even better. 


Jon Gjengset

unread,
Jun 18, 2015, 11:01:44 AM6/18/15
to golan...@googlegroups.com
James Bardin wrote:
> Wojciech S. Czarnecki wrote:
> > (For me) this benchmark shows whether a language's compiler/runtime adds
> > significant overhead or not. We already know that the Go 1.4.x
> > implementation is far behind the others, but the 'go-blocking-tip' one is
> > on par with C's and Rust's. That's good.
>
> This is good, it's just that the end result *comparison* between languages
> doesn't show anything useful. I already know that a dedicated thread per
> socket can make a lot of syscalls very quickly. The go-blocking server for
> example gets as much of the runtime out of the way as it can and just calls
> read and write on the socket file descriptor. The way to make it faster is
> to do less, and once all the benchmarks have removed as much as possible,
> they are all basically the same.

I envision that there will be at least two implementations for every
language represented in Volley: one that is idiomatic, and one that is
optimized. Hopefully you'll agree that comparing the idiomatic solutions
in each language is fairly interesting. As for the optimized ones,
you're probably right that they will be fairly similar in terms of
performance. However, the code might look (radically) different.

In a sense, the reason to include optimized implementations is (a) to
show how one would squeeze the most network performance out of a
particular language, and (b) to show how easy/hard it is to properly
optimize a network server in that language.

I say *at least* two, because there might be other interesting designs
to explore. For example, does the conventional wisdom that having a
worker pool improves performance actually still hold? What is the
performance difference between a forking and a threaded server?

> Don't get me wrong, I think the optimization process *within* the
> language benchmark is of interest, especially since this produces some
> nice graphs that highlight the problem with a single scheduler and
> poller on a NUMA system. This is a well known problem in general, and
> Dmitry (+dvyukov) wrote up a proposal last year for a NUMA aware
> scheduler.

This is where the idiomatic solutions are of interest. I suspect that,
as more implementations are added, we'll see performance bottlenecks in
several languages' idiomatic implementations, and I believe that that
alone is useful. It is to encourage these discoveries that I did not
merge `go-blocking` into `go`, but instead kept them separate.

> Go works very well within a single NUMA node, and within the limits of
> a single poller, and I would love to have a way to scale up further.
> Having a smooth method to use SO_REUSEPORT is one way to do that for
> networking. A more general solution of a runtime that can scale up to
> make use of current large[ish] systems would be even better.

Yeah, this has been a gripe of mine for quite a while with Go. I know
Dmitry has done a bunch of work in this field, but as far as I'm aware,
none of it has been merged yet.

Jon

Naoki INADA

unread,
Jun 19, 2015, 12:14:16 AM6/19/15
to golan...@googlegroups.com
I tuned the scheduler. Here is my report.

The largest machine I can use is a c4.8xlarge (36 cores).
I'll create another benchmark program that avoids wrk, to reduce the number of threads wrk uses.
But I hope someone will try this patch on a larger (64+ core) machine.

Naoki INADA

unread,
Jun 19, 2015, 9:08:58 AM6/19/15
to golan...@googlegroups.com
The previous patch caused significant performance degradation in some cases.
I've created another patch; I haven't found any performance degradation with it so far.

diff --git a/src/runtime/proc1.go b/src/runtime/proc1.go
index fa6c2e1..a3a2b94 100644
--- a/src/runtime/proc1.go
+++ b/src/runtime/proc1.go
@@ -1395,6 +1395,16 @@ top:
 			}
 			return gp, false
 		}
+
+		// try global runq again.
+		if sched.runqsize != 0 {
+			lock(&sched.lock)
+			gp := globrunqget(_g_.m.p.ptr(), 0)
+			unlock(&sched.lock)
+			if gp != nil {
+				return gp, false
+			}
+		}
 	}
 
 	// If number of spinning M's >= number of busy P's, block.




Naoki INADA

unread,
Jun 19, 2015, 9:32:15 AM6/19/15
to golan...@googlegroups.com
I updated my gist with the new patch: https://gist.github.com/methane/7868758d7d6438f6c02a

Could someone try this?

Jon Gjengset

unread,
Jun 19, 2015, 11:12:01 AM6/19/15
to golan...@googlegroups.com
Naoki INADA wrote:
> I updated my gist with new patch.
> https://gist.github.com/methane/7868758d7d6438f6c02a
>
> Could someone try this?

I might be able to test this on an 80-core machine.
How should I run the client and server? And how many of each?

Jon

pfre...@gmail.com

unread,
Jun 20, 2015, 10:40:00 PM6/20/15
to golan...@googlegroups.com, j...@thesquareplanet.com
Hey gophers, any magic left? Rust got a large speed-up in the last test: https://github.com/jonhoo/volley
I really want to see whether Go can match Rust's speed.

Milan P. Stanic

unread,
Jun 21, 2015, 6:20:05 AM6/21/15
to golan...@googlegroups.com
I doubt that Go can ever match Rust in speed, because Rust is designed
to be more low-level while Go is designed to be 'higher' level. But that
is not a bad thing about Go, because it is more of a server language
(as Rob Pike pointed out somewhere, IIRC) than a bare-metal language.

And I would be pleasantly surprised if I'm wrong, because I don't like
Rust's syntax, whereas I do like Go's simple and clear syntax.

Naoki INADA

unread,
Jun 21, 2015, 12:01:18 PM6/21/15
to golan...@googlegroups.com, pfre...@gmail.com, j...@thesquareplanet.com
I don't think rust-multiplex is a realistic program.
It relies on the client sending data on all streams; when some streams are idle, the server is blocked.

But, just for fun, I've ported it.