This program has a store that holds the data, and a sync.Mutex that guards concurrent access on reads and writes. This is a snippet of the locking-based implementation:

type Store struct {
    durations map[string]*Distribution
    counters  map[string]int64
    samples   map[string]*Distribution
    lock      *sync.Mutex
}
func (store *Store) addSample(key string, value int64) {
    store.addToStore(store.samples, key, value)
}

func (store *Store) addDuration(key string, value int64) {
    store.addToStore(store.durations, key, value)
}

// addToStore locks the store, lazily creates the Distribution for the
// key, and records the value.
func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    distribution, exists := destination[key]
    if !exists {
        distribution = NewDistribution()
        destination[key] = distribution
    }
    distribution.addSample(value)
}
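(For completeness: since lock is a *sync.Mutex it has to be initialized somewhere. A NewStore() is referenced later in the thread; a minimal sketch of what it presumably looks like, inferred from the struct above rather than taken from go-patan itself:

import "sync"

// Hypothetical constructor; the real NewStore() may differ.
func NewStore() *Store {
    return &Store{
        durations: make(map[string]*Distribution),
        counters:  make(map[string]int64),
        samples:   make(map[string]*Distribution),
        lock:      &sync.Mutex{},
    }
}
)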
Now, when I benchmark this Go code, I get the following results (see gist: benchmark code):
10 threads with 20000 items took 133 millis
100 threads with 20000 items took 1809 millis
1000 threads with 20000 items took 17576 millis
10 threads with 200000 items took 1228 millis
When I benchmark the Java code, the results are much better (see gist: java benchmark code):
10 threads with 20000 items takes 89 millis
100 threads with 20000 items takes 265 millis
1000 threads with 20000 items takes 2888 millis
10 threads with 200000 items takes 311 millis
I have profiled the Go code and created a call graph. I interpret it as follows: Go spends 0.31 and 0.25 seconds in my methods, and pretty much all of the rest in sync.(*Mutex).Lock() and sync.(*Mutex).Unlock().
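(How the profile was captured isn't shown; a CPU profile like the one below can be gathered with runtime/pprof along these lines — the Benchmrk arguments here are just an example:

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    Benchmrk(1000, 20000) // the workload under test
    pprof.StopCPUProfile()
    f.Close()
    // Then inspect with: go tool pprof <binary> cpu.prof
    // and type "top20" at the (pprof) prompt.
}
)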
The top20 output of the profiler:
(pprof) top20
59110ms of 73890ms total (80.00%)
Dropped 22 nodes (cum <= 369.45ms)
Showing top 20 nodes out of 65 (cum >= 50220ms)
flat flat% sum% cum cum%
8900ms 12.04% 12.04% 8900ms 12.04% runtime.futex
7270ms 9.84% 21.88% 7270ms 9.84% runtime/internal/atomic.Xchg
7020ms 9.50% 31.38% 7020ms 9.50% runtime.procyield
4560ms 6.17% 37.56% 4560ms 6.17% sync/atomic.CompareAndSwapUint32
4400ms 5.95% 43.51% 4400ms 5.95% runtime/internal/atomic.Xadd
4210ms 5.70% 49.21% 22040ms 29.83% runtime.lock
3650ms 4.94% 54.15% 3650ms 4.94% runtime/internal/atomic.Cas
3260ms 4.41% 58.56% 3260ms 4.41% runtime/internal/atomic.Load
2220ms 3.00% 61.56% 22810ms 30.87% sync.(*Mutex).Lock
1870ms 2.53% 64.10% 1870ms 2.53% runtime.osyield
1540ms 2.08% 66.18% 16740ms 22.66% runtime.findrunnable
1430ms 1.94% 68.11% 1430ms 1.94% runtime.freedefer
1400ms 1.89% 70.01% 1400ms 1.89% sync/atomic.AddUint32
1250ms 1.69% 71.70% 1250ms 1.69% github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
1240ms 1.68% 73.38% 3140ms 4.25% runtime.deferreturn
1070ms 1.45% 74.83% 6520ms 8.82% runtime.systemstack
1010ms 1.37% 76.19% 1010ms 1.37% runtime.newdefer
1000ms 1.35% 77.55% 1000ms 1.35% runtime.mapaccess1_faststr
950ms 1.29% 78.83% 15660ms 21.19% runtime.semacquire
860ms 1.16% 80.00% 50220ms 67.97% main.Benchmrk.func1
I would really like to understand why locking in Go is so much slower than in Java. I initially wrote this program using channels, but that was much slower than locking. Can somebody please help me out?
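(The channel version isn't shown here; a sketch of that style — a single owner goroutine serializing access to the map, with names like ChannelStore made up — would look roughly like this. Every sample costs a channel send/receive, which is one reason it can lose to a plain mutex:

type sampleMsg struct {
    key   string
    value int64
}

type ChannelStore struct {
    ch chan sampleMsg
}

func NewChannelStore() *ChannelStore {
    cs := &ChannelStore{ch: make(chan sampleMsg, 1024)}
    go func() {
        // This goroutine is the only one that touches the map,
        // so no lock is needed.
        samples := make(map[string]*Distribution)
        for msg := range cs.ch {
            d, exists := samples[msg.key]
            if !exists {
                d = NewDistribution()
                samples[msg.key] = d
            }
            d.addSample(msg.value)
        }
    }()
    return cs
}

func (cs *ChannelStore) addSample(key string, value int64) {
    cs.ch <- sampleMsg{key, value}
}
)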
One way to reduce the contention is to partition the store: hash the key to one of several independent Stores, so goroutines touching different keys rarely share a lock. For example:

import (
    "hash/fnv"
    "io"
)

type PartitionedStore struct {
    stores []Store
}

func NewPartitionedStore(sz int) *PartitionedStore {
    p := &PartitionedStore{stores: make([]Store, sz)}
    for i := 0; i < sz; i++ {
        p.stores[i] = *(NewStore())
    }
    return p
}

func (p *PartitionedStore) addSample(key string, value int64) {
    p.getStore(key).addSample(key, value)
}

func (p *PartitionedStore) getStore(key string) *Store {
    h := fnv.New32a() // might want to use a faster algorithm here
    io.WriteString(h, key)
    idx := h.Sum32() % uint32(len(p.stores))
    return &p.stores[idx]
}

func main() {
    p := NewPartitionedStore(32)
    p.addSample("test", 15)
}
Here is the change removing the defers:

diff --git a/gopatanbench/benchmark.go b/gopatanbench/benchmark.go
index 23503a9..e92ed88 100644
--- a/gopatanbench/benchmark.go
+++ b/gopatanbench/benchmark.go
@@ -37,13 +37,13 @@ func Benchmrk(threads int64, itemsPerThread int64) {
 	for i := int64(0); i < threads; i++ {
 		wg.Add(1)
 		go func() {
-			defer wg.Done()
 			sw := subject.StartStopwatch()
-			defer subject.RecordElapsedTime("goroutine.duration", sw)
 			for i := int64(0); i < itemsPerThread; i++ {
 				subject.IncrementCounter("concurrency.counter")
 				subject.AddSample("concurrency.sample", i)
 			}
+			subject.RecordElapsedTime("goroutine.duration", sw)
+			wg.Done()
 		}()
 	}
 	wg.Wait()
2016/10/03 10:21:06 [STATISTICS] created new lockbased store
10 threads with 20000 items took 227
2016/10/03 10:21:06 [STATISTICS] created new lockbased store
100 threads with 20000 items took 2416
2016/10/03 10:21:09 [STATISTICS] created new lockbased store
1000 threads with 20000 items took 23095
2016/10/03 10:21:32 [STATISTICS] created new lockbased store
10 threads with 200000 items took 2088
2016/10/03 10:21:34 [STATISTICS] created new lockbased store
100 threads with 200000 items took 24436

2016/10/03 10:19:37 [STATISTICS] created new lockbased store
10 threads with 20000 items took 212
2016/10/03 10:19:37 [STATISTICS] created new lockbased store
100 threads with 20000 items took 2295
2016/10/03 10:19:39 [STATISTICS] created new lockbased store
1000 threads with 20000 items took 22677
2016/10/03 10:20:02 [STATISTICS] created new lockbased store
10 threads with 200000 items took 2011
2016/10/03 10:20:04 [STATISTICS] created new lockbased store
100 threads with 200000 items took 23322
I don't think defer is the main bottleneck here. I tested the benchmark with Go tip and with the defers removed; while it becomes faster, it is still slower than Java.

I believe the reason the Java version is faster is that its lock implementation behaves better under contention. While I am not 100% sure, here are a few observations that support this statement:

If the benchmark is executed with GOMAXPROCS=1 it runs much faster (as fast as Java or faster). This shows that lock contention plays the main role in this benchmark.

Starting 1000 goroutines is not the same as starting 1000 threads. The benchmark shows better results when started with GOMAXPROCS=1000. I think this happens because the time between different threads acquiring the lock is longer, so each thread has more time to do useful work instead of waiting for the lock.
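(To reproduce the GOMAXPROCS=1 observation — a sketch; the arguments are just one of the configurations above:

import "runtime"

func main() {
    // Pin the scheduler to a single P; goroutines then never contend
    // for the mutex across CPUs, so the futex/spin overhead largely
    // vanishes. Equivalently, run the existing binary with the
    // GOMAXPROCS=1 environment variable set.
    runtime.GOMAXPROCS(1)
    Benchmrk(1000, 20000)
}
)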
An alternative that only holds the lock around the map accesses and allocates outside it:

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    distribution, exists := destination[key]
    if !exists {
        // Drop the lock while allocating the new Distribution.
        store.lock.Unlock()
        distribution = NewDistribution()
        distribution.addSample(value)
        // Re-check under the lock: another goroutine may have created
        // the entry in the meantime.
        store.lock.Lock()
        distr, ex := destination[key]
        if !ex {
            destination[key] = distribution
            store.lock.Unlock()
            return
        }
        // Lost the race: discard our allocation and use the winner's.
        distribution = distr
    }
    distribution.addSample(value)
    store.lock.Unlock()
}
So the lock around the allocation is eliminated (and no defer :-) ).
But do these types of spin locks provide the same memory effects as standard locks? I get that only one goroutine at a time can run the given block, but won't assigning to shared vars inside the block still need to use the methods from sync/atomic?
I'm not entirely sure, but my gut tells me there's probably strict ordering across threads there. More info can be found here: https://github.com/golang/go/issues/5045
As I understand it, Go’s mutex lock will spin for a while (good if everyone using the mutex holds it only for very short periods), but will back off to a less compute-intensive method after that. This avoids tying up a CPU, at the cost of some latency in seeing the other guy’s unlock.
John
John Souvestre - New Orleans LA
> … state that one measly atomic load has the same memory effects as a sync/lock which seems like it might work on some platforms (maybe) but surely not for all?
I believe that any of the atomic operations in sync/atomic acts as a memory barrier, just as a mutex does, and this holds on all platforms.
> Don't I at least have to load the shared vars using atomic load (atomic.Value for example) or something similar?
Not if everyone accessing them is using a mutex to synchronize the access.
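(A small sketch of that point — variable and function names here are made up. Plain reads and writes of a shared variable are fine as long as every access happens under the same mutex:

import "sync"

var (
    mu     sync.Mutex
    shared int64
)

// Writer: a plain assignment is safe under the lock.
func setShared(v int64) {
    mu.Lock()
    shared = v
    mu.Unlock()
}

// Reader: same mutex, so no sync/atomic needed for the load.
func getShared() int64 {
    mu.Lock()
    defer mu.Unlock()
    return shared
}
)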
John
John Souvestre - New Orleans LA
Sure, that's my question: does a SpinLock as given in several examples above provide the same semantics as a proper mutex?
I looked at pi/goal. It uses a sync/atomic CAS. Thus, yes, it provides a memory barrier.
As someone else already recommended, calling Gosched() on every loop iteration will probably tie up the runtime quite a bit. It would probably be better to drop it entirely (if the spin isn’t going to last long, worst case) or only do it every so often (perhaps every 1,000 or more loops).
Depending on the amount of contention and what your latency goal is, you might find that a regular sync.Mutex does as well or better. The fast path (when there’s little contention) isn’t much more than a CAS.
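(Putting those suggestions together, a hypothetical CAS-based spin lock along the lines being discussed might look like this — a sketch, not the code from the earlier examples:

import (
    "runtime"
    "sync/atomic"
)

// SpinLock is a minimal CAS-based lock. The CompareAndSwap/Store pair
// from sync/atomic is what provides the memory-barrier semantics
// discussed above.
type SpinLock struct {
    state uint32
}

func (s *SpinLock) Lock() {
    for i := 1; !atomic.CompareAndSwapUint32(&s.state, 0, 1); i++ {
        if i%1000 == 0 {
            runtime.Gosched() // yield only occasionally, per the advice above
        }
    }
}

func (s *SpinLock) Unlock() {
    atomic.StoreUint32(&s.state, 0)
}
)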
John
John Souvestre - New Orleans LA
> I am sorry if I am dense, but what Russ said in that thread, "and that you shouldn't mix atomic and non-atomic accesses for a given memory word", seems to indicate otherwise.
I’m not sure which thread you are referring to. In general it is best to avoid the sync/atomic stuff unless you *really* need it for performance and you take the time to understand it well. A mutex lock would not prevent another goroutine from doing an atomic operation, for example, so mixing the two could be disastrous. But there are some cases where it can be done.
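(A contrived sketch of that hazard — names are made up. If one writer increments a counter under a mutex with a plain ++ while another bypasses the lock with sync/atomic, the atomic write can land between the locked writer's load and store and be lost:

import (
    "sync"
    "sync/atomic"
)

var (
    mu      sync.Mutex
    counter int64
)

// Writer A: read-modify-write under the mutex. The ++ is a separate
// load and store, not an atomic operation.
func incLocked() {
    mu.Lock()
    counter++
    mu.Unlock()
}

// Writer B: ignores the mutex entirely. Its update can be overwritten
// by Writer A's store, because the mutex never excludes it.
func incAtomic() {
    atomic.AddInt64(&counter, 1)
}
)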
Interesting. I didn’t realize that thread was live again. I thought that this one put it to rest. https://groups.google.com/forum/#!msg/golang-nuts/7EnEhM3U7B8/nKCZ17yAtZwJ
I don’t know for sure, but I imagine that Russ’ statement about atomics was mainly concerning synchronization – which Go’s sync/atomic operations provide. And I would certainly agree.
// goroutine 1
data = 42
atomic.Store(&ready, 1)

// goroutine 2
if atomic.Load(&ready) {
    if data != 42 {
        panic("broken")
    }
}
So mixing atomic reads/writes of one variable makes non-atomic reads/writes of another variable safe as well, as far as the memory model goes? The caveat being that it's not formalized.
I am sure great care is needed when attempting such code regardless...
> The only reason I hesitate to go further is because that isn't formalized as part of the spec, I don't believe, hence the issue.
I believe it is. From the Go Memory Model:

“To serialize access, protect the data with channel operations or other synchronization primitives such as those in the sync and sync/atomic packages.”
John
John Souvestre - New Orleans LA
Thanks for clarifying that, Ian!