I recently decided to benchmark simple TCP server implementations written in different languages against one another[1], and am seeing some strange results for the Go server. In particular, as the number of cores increases, the performance of the Go server *decreases*. Profiling shows that the majority of the time is spent in the system call epollwait, but I don't have a good explanation for why that would be the case, given that the benchmark client replies immediately to all messages. GOMAXPROCS is set appropriately[3], and the problem also occurs on go tip[4].

It might be relevant that the performance drop starts once multiple NUMA nodes are in use (the server running the benchmarks has NUMA nodes with 10 cores each), though this should not matter, as there is no communication between the cores for the benchmarked paths. This is also evident from the fact that the Rust and C servers don't see this performance drop.

As far as I can tell, I haven't made any obvious mistakes in my server code, but I'd be very grateful if someone else could take a look[2] and see if there's something fundamentally wrong with it.

Cheers,
Jeffrey 'jf' Lim wrote:
> Unless you're familiar (and even then!) with Go's scheduling, you
> shouldn't be manually changing GOMAXPROCS.
I am familiar with Go scheduling, and I am aware that setting GOMAXPROCS
to anything above 1 (the default) incurs additional overhead compared
to leaving it as it is. However, keeping it at one significantly reduces
the amount of parallelism you can extract from the system.
A simple benchmark[1] shows that by not setting GOMAXPROCS, performance
remains constant (as you would expect) as the number of cores increases.
> Goroutines aren't os threads. When you expand the number of GOMAXPROCS
> willy-nilly, you're actually losing the point and advantages of the Go
> runtime/scheduler. http://morsmachine.dk/go-scheduler
That is a wildly misleading statement. It's not as though increasing
GOMAXPROCS will always make your application slower -- if this were
true, Go would be doomed as a systems language, as it would be unable to
parallelize compute-heavy operations. The Go FAQ specifically states[2]:

    Programs that perform parallel computation should benefit from
    an increase in GOMAXPROCS.
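As a toy illustration of the kind of workload the FAQ is talking about
(my own example, not part of the benchmark): a parallel sum over a large
slice is stuck on one core with GOMAXPROCS=1, but can spread across
cores once GOMAXPROCS is raised.

    // parallelSum splits xs across `workers` goroutines. With GOMAXPROCS=1
    // the goroutines are time-sliced on a single core; with GOMAXPROCS=N
    // they can run in parallel, so the computation should speed up.
    func parallelSum(xs []int, workers int) int {
        results := make(chan int, workers)
        chunk := (len(xs) + workers - 1) / workers
        for w := 0; w < workers; w++ {
            lo, hi := w*chunk, (w+1)*chunk
            if lo > len(xs) {
                lo = len(xs)
            }
            if hi > len(xs) {
                hi = len(xs)
            }
            go func(part []int) {
                sum := 0
                for _, v := range part {
                    sum += v
                }
                results <- sum
            }(xs[lo:hi])
        }
        total := 0
        for w := 0; w < workers; w++ {
            total += <-results
        }
        return total
    }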
It is true that the scheduler doesn't handle many OS threads as well as
it could:

    Go's goroutine scheduler is not as good as it needs to be. In
    the future, it should recognize such cases and optimize its use
    of OS threads.
But this doesn't mean that you should never use more than one core. For
this benchmark, for example, the threads do in fact perform computation,
as everything is happening over loopback, and the socket operations
should thus be fast. There is also no communication between goroutines,
so few context switches should be necessary.
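To make that concrete, the per-connection work is essentially a small
echo loop with no shared state. A rough sketch of that shape (placeholder
port and message size; not the actual benchmark code) looks like:

    package main

    import (
        "io"
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":2222") // placeholder port
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            go serve(conn) // one goroutine per connection, no shared state
        }
    }

    func serve(conn net.Conn) {
        defer conn.Close()
        buf := make([]byte, 4) // placeholder message size
        for {
            // Blocking read and write; over loopback these should be fast.
            if _, err := io.ReadFull(conn, buf); err != nil {
                return
            }
            if _, err := conn.Write(buf); err != nil {
                return
            }
        }
    }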
For completeness, [1] also gives the performance for "gohalfmax", in
which GOMAXPROCS is set to half of the number of CPUs. As you can see,
performance does not improve.
[1]: https://jon.thesquareplanet.com/share/volley-go-nomaxprocs.png
[2]: https://golang.org/doc/faq#Why_no_multi_CPU
> Note the "willy-nilly" qualifier in there. That's the impression that
> I get when I read that
> (https://github.com/jonhoo/volley/issues/2#issue-88531006) "I'm
> currently running experiments with GOMAXPROCS = 4x#CPUs to see whether
> this makes a difference or not.". Note that in your introduction post,
> you referred to this url, saying that "GOMAXPROCS is set
> appropriately".
When I say that GOMAXPROCS is set appropriately, I mean that it is set
to the number of CPUs[1]. I then tried increasing it by a small factor,
which seemed like a reasonable way to get a feel for how that would
affect performance. If you have a better proposal for what values I
should try, I'm all ears.
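"Set to the number of CPUs" means the usual idiom -- roughly what the
line linked in [1] does:

    import "runtime"

    func init() {
        // Let the runtime run goroutines on every available core.
        runtime.GOMAXPROCS(runtime.NumCPU())
    }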
> I don't know about you, but I've stopped thinking of Go as a systems
> language because my own definition, vs what Go presents, is different. I
> could elaborate more, but I think the discussion at
> https://www.youtube.com/watch?v=BBbv1ej0fFo#t=2m50s explains things better
> than I can.
Haven't seen that before, but thanks, I'll take a look.
I'm not sure I'm ready to write Go off as a systems language just yet,
as it's not clear that having a runtime makes a language unusable for
systems programming. For example, having a garbage collector can
significantly help scalability by reducing the inter-CPU communication
required to do reference counting.
> They might not (?) be "necessary", but beyond a certain length of
> time, context switches (or goroutines "swops") will happen, just
> because it's a scheduler decision.
Oh, sure, but "accidental" context switches like these should be rare,
and should not impact performance that significantly. If goroutines were
moving between threads and cores regularly, this would completely trash
CPU caches, killing performance. My _guess_ is that the Go scheduler,
much like the OS scheduler, doesn't move computation around without a
good reason. It tries to keep things where they are unless cores are
sitting idle. If that is true, this kind of context switching shouldn't
be a major concern as long as there is enough work to do.
Jon
[1]: https://github.com/jonhoo/volley/blob/master/servers/go/main.go#L26
> There's not really a good way to manage this in Go for now.
Do you know if there are any plans to improve this situation?
go-blocking-tip 40 40 39us 5.89us 1000000
rust 40 40 41us 6.68us 1000000
c-threaded 40 40 40us 7.91us 1000000
Can anyone help with the last question about the performance of Go?
https://github.com/jonhoo/volley/issues/3

As Naoki said, Go was slow because of the non-blocking I/O; with
blocking I/O, Go is as fast as C and Rust. The remaining problem is that
the latency is very high. I think this happens because Go starts with
GOMAXPROCS = numcpu, while C and Rust start a new thread for each
connection, so in the end Go has fewer threads and those threads get
blocked, increasing the latency.
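If you want to test that hypothesis, one crude way to approximate the
thread-per-connection model of the C and Rust servers is to lock each
connection's goroutine to its own OS thread (purely an experiment, not
what the benchmark server does; the echo loop and message size below are
placeholders, and it assumes the usual "io", "net", and "runtime"
imports):

    // Experiment: give every connection a dedicated OS thread, as the
    // C and Rust servers do. LockOSThread wires this goroutine to its
    // thread, so no other goroutine runs there and the runtime creates
    // additional threads for the remaining connections as needed.
    go func(conn net.Conn) {
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        defer conn.Close()
        buf := make([]byte, 4) // placeholder message size
        for {
            if _, err := io.ReadFull(conn, buf); err != nil {
                return
            }
            if _, err := conn.Write(buf); err != nil {
                return
            }
        }
    }(conn)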
Hi Naoki and Jon,

Thanks a lot to Naoki for doing the PR, and to Jon for merging it and
posting the new benchmarks today with the clearly improved performance.

I have a question for Naoki, since I know he has collaborated on some of
the TechEmpower Framework Benchmarks: do you think the original issues
in the Volley tests are fundamentally the same ones affecting HTTP
server performance (where it does not scale past 4-8 cores) in the
recent TechEmpower tests with 40 cores?
(For me) this benchmark shows whether a language's compiler/runtime adds
significant overhead or not. We already know that a certain
implementation in Go 1.4.x is far behind the others, but the
'go-blocking-tip' one is on par with the C and Rust ones. That's good.
diff --git a/src/runtime/proc1.go b/src/runtime/proc1.go
index fa6c2e1..a3a2b94 100644
--- a/src/runtime/proc1.go
+++ b/src/runtime/proc1.go
@@ -1395,6 +1395,16 @@ top:
 			}
 			return gp, false
 		}
+
+		// try global runq again.
+		if sched.runqsize != 0 {
+			lock(&sched.lock)
+			gp := globrunqget(_g_.m.p.ptr(), 0)
+			unlock(&sched.lock)
+			if gp != nil {
+				return gp, false
+			}
+		}
 	}
 	// If number of spinning M's >= number of busy P's, block.