This program has a store that holds the data, and a sync.Mutex that guards concurrent access on reads and writes. This is a snippet of the locking-based implementation:

type Store struct {
    durations map[string]*Distribution
    counters  map[string]int64
    samples   map[string]*Distribution
    lock      *sync.Mutex
}
func (store *Store) addSample(key string, value int64) {
    store.addToStore(store.samples, key, value)
}

func (store *Store) addDuration(key string, value int64) {
    store.addToStore(store.durations, key, value)
}

// addToStore locks the store, lazily creates the Distribution for the
// key, and records the value.
func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    distribution, exists := destination[key]
    if !exists {
        distribution = NewDistribution()
        destination[key] = distribution
    }
    distribution.addSample(value)
}
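(For completeness: since lock is a *sync.Mutex it has to be initialized somewhere. A NewStore() is referenced later in the thread; a minimal sketch of what it presumably looks like, inferred from the struct above rather than taken from go-patan itself:

import "sync"

// Hypothetical constructor; the real NewStore() may differ.
func NewStore() *Store {
    return &Store{
        durations: make(map[string]*Distribution),
        counters:  make(map[string]int64),
        samples:   make(map[string]*Distribution),
        lock:      &sync.Mutex{},
    }
}
)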
Now, when I benchmark this Go code, I get the following results (see gist: benchmark code):
10 threads with 20000 items took 133 millis
100 threads with 20000 items took 1809 millis
1000 threads with 20000 items took 17576 millis
10 threads with 200000 items took 1228 millis
When I benchmark the Java code, the results are much better (see gist: java benchmark code):
10 threads with 20000 items takes 89 millis
100 threads with 20000 items takes 265 millis
1000 threads with 20000 items takes 2888 millis
10 threads with 200000 items takes 311 millis
I have profiled the Go code and created a call graph. I interpret it as follows: Go spends 0.31 and 0.25 seconds in my methods, and pretty much all of the rest in sync.(*Mutex).Lock() and sync.(*Mutex).Unlock().
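(How the profile was captured isn't shown; a CPU profile like the one below can be gathered with runtime/pprof along these lines — the Benchmrk arguments here are just an example:

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    Benchmrk(1000, 20000) // the workload under test
    pprof.StopCPUProfile()
    f.Close()
    // Then inspect with: go tool pprof <binary> cpu.prof
    // and type "top20" at the (pprof) prompt.
}
)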
The top20 output of the profiler:
(pprof) top20
59110ms of 73890ms total (80.00%)
Dropped 22 nodes (cum <= 369.45ms)
Showing top 20 nodes out of 65 (cum >= 50220ms)
flat flat% sum% cum cum%
8900ms 12.04% 12.04% 8900ms 12.04% runtime.futex
7270ms 9.84% 21.88% 7270ms 9.84% runtime/internal/atomic.Xchg
7020ms 9.50% 31.38% 7020ms 9.50% runtime.procyield
4560ms 6.17% 37.56% 4560ms 6.17% sync/atomic.CompareAndSwapUint32
4400ms 5.95% 43.51% 4400ms 5.95% runtime/internal/atomic.Xadd
4210ms 5.70% 49.21% 22040ms 29.83% runtime.lock
3650ms 4.94% 54.15% 3650ms 4.94% runtime/internal/atomic.Cas
3260ms 4.41% 58.56% 3260ms 4.41% runtime/internal/atomic.Load
2220ms 3.00% 61.56% 22810ms 30.87% sync.(*Mutex).Lock
1870ms 2.53% 64.10% 1870ms 2.53% runtime.osyield
1540ms 2.08% 66.18% 16740ms 22.66% runtime.findrunnable
1430ms 1.94% 68.11% 1430ms 1.94% runtime.freedefer
1400ms 1.89% 70.01% 1400ms 1.89% sync/atomic.AddUint32
1250ms 1.69% 71.70% 1250ms 1.69% github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
1240ms 1.68% 73.38% 3140ms 4.25% runtime.deferreturn
1070ms 1.45% 74.83% 6520ms 8.82% runtime.systemstack
1010ms 1.37% 76.19% 1010ms 1.37% runtime.newdefer
1000ms 1.35% 77.55% 1000ms 1.35% runtime.mapaccess1_faststr
950ms 1.29% 78.83% 15660ms 21.19% runtime.semacquire
860ms 1.16% 80.00% 50220ms 67.97% main.Benchmrk.func1
I would really like to understand why locking in Go is so much slower than in Java. I initially wrote this program using channels, but that was much slower than locking. Can somebody please help me out?
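(The channel version isn't shown here; a sketch of that style — a single owner goroutine serializing access to the map, with names like ChannelStore made up — would look roughly like this. Every sample costs a channel send/receive, which is one reason it can lose to a plain mutex:

type sampleMsg struct {
    key   string
    value int64
}

type ChannelStore struct {
    ch chan sampleMsg
}

func NewChannelStore() *ChannelStore {
    cs := &ChannelStore{ch: make(chan sampleMsg, 1024)}
    go func() {
        // This goroutine is the only one that touches the map,
        // so no lock is needed.
        samples := make(map[string]*Distribution)
        for msg := range cs.ch {
            d, exists := samples[msg.key]
            if !exists {
                d = NewDistribution()
                samples[msg.key] = d
            }
            d.addSample(msg.value)
        }
    }()
    return cs
}

func (cs *ChannelStore) addSample(key string, value int64) {
    cs.ch <- sampleMsg{key, value}
}
)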
One way to reduce the contention is to partition the store: hash the key to one of several independent Stores, so goroutines touching different keys rarely share a lock. For example:

import (
    "hash/fnv"
    "io"
)

type PartitionedStore struct {
    stores []Store
}

func NewPartitionedStore(sz int) *PartitionedStore {
    p := &PartitionedStore{stores: make([]Store, sz)}
    for i := 0; i < sz; i++ {
        p.stores[i] = *(NewStore())
    }
    return p
}

func (p *PartitionedStore) addSample(key string, value int64) {
    p.getStore(key).addSample(key, value)
}

func (p *PartitionedStore) getStore(key string) *Store {
    h := fnv.New32a() // might want to use a faster algorithm here
    io.WriteString(h, key)
    idx := h.Sum32() % uint32(len(p.stores))
    return &p.stores[idx]
}

func main() {
    p := NewPartitionedStore(32)
    p.addSample("test", 15)
}
Here is the change removing the defers:

diff --git a/gopatanbench/benchmark.go b/gopatanbench/benchmark.go
index 23503a9..e92ed88 100644
--- a/gopatanbench/benchmark.go
+++ b/gopatanbench/benchmark.go
@@ -37,13 +37,13 @@ func Benchmrk(threads int64, itemsPerThread int64) {
 	for i := int64(0); i < threads; i++ {
 		wg.Add(1)
 		go func() {
-			defer wg.Done()
 			sw := subject.StartStopwatch()
-			defer subject.RecordElapsedTime("goroutine.duration", sw)
 			for i := int64(0); i < itemsPerThread; i++ {
 				subject.IncrementCounter("concurrency.counter")
 				subject.AddSample("concurrency.sample", i)
 			}
+			subject.RecordElapsedTime("goroutine.duration", sw)
+			wg.Done()
 		}()
 	}
 	wg.Wait()
2016/10/03 10:21:06 [STATISTICS] created new lockbased store
10 threads with 20000 items took 227
2016/10/03 10:21:06 [STATISTICS] created new lockbased store
100 threads with 20000 items took 2416
2016/10/03 10:21:09 [STATISTICS] created new lockbased store
1000 threads with 20000 items took 23095
2016/10/03 10:21:32 [STATISTICS] created new lockbased store
10 threads with 200000 items took 2088
2016/10/03 10:21:34 [STATISTICS] created new lockbased store
100 threads with 200000 items took 24436

2016/10/03 10:19:37 [STATISTICS] created new lockbased store
10 threads with 20000 items took 212
2016/10/03 10:19:37 [STATISTICS] created new lockbased store
100 threads with 20000 items took 2295
2016/10/03 10:19:39 [STATISTICS] created new lockbased store
1000 threads with 20000 items took 22677
2016/10/03 10:20:02 [STATISTICS] created new lockbased store
10 threads with 200000 items took 2011
2016/10/03 10:20:04 [STATISTICS] created new lockbased store
100 threads with 200000 items took 23322
I don't think defer is the main bottleneck here. I tested the benchmark with Go tip and with the defers removed; while it becomes faster, it is still slower than Java.

I believe the reason the Java version is faster is that its lock implementation behaves better under contention. While I am not 100% sure, here are a few observations that support this statement:

If the benchmark is executed with GOMAXPROCS=1 it runs much faster (as fast as Java or faster). This shows that lock contention plays the main role in this benchmark.

Starting 1000 goroutines is not the same as starting 1000 threads. The benchmark shows better results when started with GOMAXPROCS=1000. I think this happens because the time between different threads acquiring the lock is longer, so each thread has more time to do useful work instead of waiting for the lock.
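(To reproduce the GOMAXPROCS=1 observation — a sketch; the arguments are just one of the configurations above:

import "runtime"

func main() {
    // Pin the scheduler to a single P; goroutines then never contend
    // for the mutex across CPUs, so the futex/spin overhead largely
    // vanishes. Equivalently, run the existing binary with the
    // GOMAXPROCS=1 environment variable set.
    runtime.GOMAXPROCS(1)
    Benchmrk(1000, 20000)
}
)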
An alternative that only holds the lock around the map accesses and allocates outside it:

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    distribution, exists := destination[key]
    if !exists {
        // Drop the lock while allocating the new Distribution.
        store.lock.Unlock()
        distribution = NewDistribution()
        distribution.addSample(value)
        // Re-check under the lock: another goroutine may have created
        // the entry in the meantime.
        store.lock.Lock()
        distr, ex := destination[key]
        if !ex {
            destination[key] = distribution
            store.lock.Unlock()
            return
        }
        // Lost the race: discard our allocation and use the winner's.
        distribution = distr
    }
    distribution.addSample(value)
    store.lock.Unlock()
}
So the lock around the allocation is eliminated (and no defer :-) ).
But do these types of spin locks provide the same memory effects as standard locks? I get that only one goroutine at a time can run the given block, but won't assigning to shared vars inside the block still need to use the methods from sync/atomic?
I'm not entirely sure, but my gut tells me there's probably strict ordering across threads there. More info can be found here: https://github.com/golang/go/issues/5045
As I understand it, Go’s mutex lock will spin for a while (good if everyone using the mutex holds it only for very short periods), but will back off to a less compute-intensive method after that. This avoids tying up a CPU, at the cost of some latency in seeing the other guy’s unlock.
John
John Souvestre - New Orleans LA
> … state that one measly atomic load has the same memory effects as a sync/lock which seems like it might work on some platforms (maybe) but surely not for all?
I believe that any of the atomic operations in sync/atomic acts as a memory barrier, just as a mutex does, and this holds on all platforms.
> Don't I at least have to load the shared vars using atomic load (atomic.Value for example) or something similar?
Not if everyone accessing them is using a mutex to synchronize the access.
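(A small sketch of that point — variable and function names here are made up. Plain reads and writes of a shared variable are fine as long as every access happens under the same mutex:

import "sync"

var (
    mu     sync.Mutex
    shared int64
)

// Writer: a plain assignment is safe under the lock.
func setShared(v int64) {
    mu.Lock()
    shared = v
    mu.Unlock()
}

// Reader: same mutex, so no sync/atomic needed for the load.
func getShared() int64 {
    mu.Lock()
    defer mu.Unlock()
    return shared
}
)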
John
John Souvestre - New Orleans LA
Sure, that's my question: does a SpinLock as given in several examples above provide the same semantics as a proper mutex?
I looked at pi/goal. It uses a sync/atomic CAS. Thus, yes, it provides a memory barrier.
As someone else already recommended, calling Gosched() on every loop iteration will probably tie up the runtime quite a bit. It would probably be better to drop it entirely (if the spin isn’t going to last long, worst case) or only do it every so often (perhaps every 1,000 or more loops).
Depending on the amount of contention and what your latency goal is, you might find that a regular sync.Mutex does as well or better. The fast path (when there’s little contention) isn’t much more than a CAS.
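(Putting those suggestions together, a hypothetical CAS-based spin lock along the lines being discussed might look like this — a sketch, not the code from the earlier examples:

import (
    "runtime"
    "sync/atomic"
)

// SpinLock is a minimal CAS-based lock. The CompareAndSwap/Store pair
// from sync/atomic is what provides the memory-barrier semantics
// discussed above.
type SpinLock struct {
    state uint32
}

func (s *SpinLock) Lock() {
    for i := 1; !atomic.CompareAndSwapUint32(&s.state, 0, 1); i++ {
        if i%1000 == 0 {
            runtime.Gosched() // yield only occasionally, per the advice above
        }
    }
}

func (s *SpinLock) Unlock() {
    atomic.StoreUint32(&s.state, 0)
}
)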
John
John Souvestre - New Orleans LA
> I am sorry if I am dense, but what Russ said in that thread, "and that you shouldn't mix atomic and non-atomic accesses for a given memory word", seems to indicate otherwise.
I’m not sure which thread you are referring to. In general it is best to avoid the sync/atomic stuff unless you *really* need it for performance and you take the time to understand it well. A mutex lock would not prevent another goroutine from doing an atomic operation, for example, so mixing the two could be disastrous. But there are some cases where it can be done.
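(A contrived sketch of that hazard — names are made up. If one writer increments a counter under a mutex with a plain ++ while another bypasses the lock with sync/atomic, the atomic write can land between the locked writer's load and store and be lost:

import (
    "sync"
    "sync/atomic"
)

var (
    mu      sync.Mutex
    counter int64
)

// Writer A: read-modify-write under the mutex. The ++ is a separate
// load and store, not an atomic operation.
func incLocked() {
    mu.Lock()
    counter++
    mu.Unlock()
}

// Writer B: ignores the mutex entirely. Its update can be overwritten
// by Writer A's store, because the mutex never excludes it.
func incAtomic() {
    atomic.AddInt64(&counter, 1)
}
)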
Interesting. I didn’t realize that thread was live again. I thought that this one put it to rest. https://groups.google.com/forum/#!msg/golang-nuts/7EnEhM3U7B8/nKCZ17yAtZwJ
I don’t know for sure, but I imagine that Russ’ statement about atomics was mainly concerning synchronization – which Go’s sync/atomic operations provide. And I would certainly agree.
// goroutine 1
data = 42
atomic.Store(&ready, 1)

// goroutine 2
if atomic.Load(&ready) {
    if data != 42 {
        panic("broken")
    }
}
So mixing atomic reads/writes of one variable makes non-atomic reads/writes of another variable safe as well, as far as the memory model goes? The caveat being that it's not formalized.
I am sure great care is needed when attempting such code regardless...
> The only reason I hesitate to go further is because that isn't formalized as part of the spec, I don't believe, hence the issue.
I believe it is. From the Go Memory Model:

“To serialize access, protect the data with channel operations or other synchronization primitives such as those in the sync and sync/atomic packages.”
John
John Souvestre - New Orleans LA
Thanks for clarifying that, Ian!