Surprising performance result with sync/atomic


Joubin Houshyar

May 1, 2015, 3:22:26 PM
to golan...@googlegroups.com
Hi Gophers,

I'm getting a surprising result with a basic comparative test of 2 goroutines performing +/- on an int pointer. I had expected atomic.AddUint64 to flat-out outperform explicit Load & CAS based functions, but the CAS variant consistently outperforms the atomic adders. Not sure what to make of it.
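
For reference, a minimal sketch of the two mutation strategies under test (the function names here are mine, not the actual test code; only the '+' side is shown):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// atomicAdd bumps the counter with a single atomic add.
func atomicAdd(p *uint64, n int) {
	for i := 0; i < n; i++ {
		atomic.AddUint64(p, 1)
	}
}

// casAdd bumps the counter with an explicit Load + CompareAndSwap retry loop.
func casAdd(p *uint64, n int) {
	for i := 0; i < n; i++ {
		for {
			old := atomic.LoadUint64(p)
			if atomic.CompareAndSwapUint64(p, old, old+1) {
				break
			}
		}
	}
}

func main() {
	var counter uint64
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); atomicAdd(&counter, 1000000) }()
	go func() { defer wg.Done(); casAdd(&counter, 1000000) }()
	wg.Wait()
	fmt.Println("counter:", counter) // 2000000
}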

Typical output on my laptop: 

--- access-with-CAS
ack (CASAdder           (104491827))
ack (CASSubtracter      (113229174))

delta: 5261853801 v.data[0]:1 [access-with-CAS]

--- access-with-Atomic
ack (AtomicAdder        (100000000))
ack (AtomicSubtracter   (100000000))

delta: 6699827238 v.data[0]:1 [access-with-Atomic]

 ------------------------------
report: CAS access faster by 1437973437 nsecs (14 nsec/mutation-op)

Joubin Houshyar

May 2, 2015, 9:45:04 AM
to golan...@googlegroups.com
Enhancing the test with a few knobs was informative. The underlying cause is in the scheduler and/or the channel implementation in Go 1.4.

Various other issues cropped up:

- With certain sets of input params, the CPUs churn at 100% without making any progress (at all).

- The CAS based test (at least on Go 1.4, OS X 10.10.1, Intel i7) only uses 2 cores at 100%. The rest of the machine sits idle.

The options here are the number of +/- workers (-a & -s) and the iterations per worker (-n). The channel depth (-q) defaults to the number of workers.
--quiet suppresses individual worker reports.
--cpu-load simulates additional load.
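
A rough reconstruction of those knobs with the standard flag package (flag names and defaults here are guesses; atomic_number.go may differ):

package main

import "flag"

var (
	adders  = flag.Int("a", 1, "number of '+' workers")
	subbers = flag.Int("s", 1, "number of '-' workers")
	iters   = flag.Int("n", 1000000, "iterations per worker")
	qlen    = flag.Int("q", 0, "result channel depth (0: use number of workers)")
	quiet   = flag.Bool("quiet", false, "suppress individual worker reports")
	cpuLoad = flag.Bool("cpu-load", false, "simulate additional orthogonal load")
)

func main() {
	flag.Parse()
	if *qlen == 0 {
		*qlen = *adders + *subbers // default channel depth: number of workers
	}
	_, _ = quiet, cpuLoad
}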
 
The surprising result was due to the fact that the times were end-to-end totals, as observed by the task runner. Reporting the times observed by the goroutines themselves showed the expected superior performance of atomic.AddXxx over the CAS based implementation.
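
The two measurements, roughly (a sketch with hypothetical names: "reported" is what the workers see, "observed" is wall-clock time at the runner):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func run(nworkers, iters int, mutate func()) (reported, observed time.Duration) {
	done := make(chan time.Duration, nworkers)
	start := time.Now()
	for i := 0; i < nworkers; i++ {
		go func() {
			t0 := time.Now()
			for j := 0; j < iters; j++ {
				mutate()
			}
			done <- time.Since(t0) // the goroutine's own view of its runtime
		}()
	}
	for i := 0; i < nworkers; i++ {
		if d := <-done; d > reported {
			reported = d // slowest individual worker
		}
	}
	observed = time.Since(start) // end to end, as seen by the task runner
	return
}

func main() {
	var n uint64
	reported, observed := run(8, 1000000, func() { atomic.AddUint64(&n, 1) })
	fmt.Printf("reported: %v observed: %v\n", reported, observed)
}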

Fairly typical numbers are below; they indicate that the scheduler and/or the channel is having difficulty dealing with the 'thundering herd' effect of the fast-completing atomic.Adders. The CAS workers, which are /individually slower by orders of magnitude/, actually give better end-to-end task completion times. This is despite the noted fact that 1/2 of the machine is sitting idle.

± go run atomic_number.go -a 10000 -s 10000 -n 10000 --quiet
Salaam!
comparative test of concurrent counter mutators using explicit CAS and atomic Adders
with channel len 20000
--- access-with-CAS

delta:[reported:  3995467622 observed:  5016057141] e2e-overhead:[1020589519 (nsec) 20.346 (%)] [access-with-CAS]
--- access-with-Atomic

delta:[reported:     1328514 observed:  6424952596] e2e-overhead:[6423624082 (nsec) 99.979 (%)] [access-with-Atomic]

---------------------
reported: Atomic access faster by  3994139108 nsecs (39 nsec/mutation-op)
observed:    CAS access faster by  1408895455 nsecs (14 nsec/mutation-op)

For (-a 10000 -s 10000 -n 100000 --quiet), the atomic.Adder spends 90% of its time -- nearly 1 *minute* -- in the channel (57,088,073,955 (nsec) 90.291 (%)).
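
One way to attribute a worker's time to the result-channel send, per the 90% figure above (a sketch with assumed names, not the actual instrumentation):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const workers = 4
	results := make(chan time.Duration, 1) // deliberately shallow buffer
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			t0 := time.Now()
			// ... the worker's n mutation ops would run here ...
			work := time.Since(t0)

			t1 := time.Now()
			results <- work            // report the result
			sendTime := time.Since(t1) // time spent blocked in the channel send
			fmt.Printf("worker %d: work=%v in-channel=%v\n", id, work, sendTime)
		}(i)
	}
	go func() { wg.Wait(); close(results) }()
	for range results { // drain so the senders can complete
	}
}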

Adding simulated orthogonal load doesn't help the atomic.Adders. 
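
For context, a plausible busy-spin sketch of what --cpu-load does (an assumption; the actual implementation isn't shown here):

package main

import (
	"runtime"
	"time"
)

// simulateCPULoad spins one busy goroutine per core until stop is closed.
func simulateCPULoad(stop <-chan struct{}) {
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			for {
				select {
				case <-stop:
					return
				default:
					// busy-spin to keep a core occupied
				}
			}
		}()
	}
}

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU()) // Go 1.4 defaults GOMAXPROCS to 1
	stop := make(chan struct{})
	simulateCPULoad(stop)
	time.Sleep(100 * time.Millisecond)
	close(stop)
}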

± go run atomic_number.go -a 10000 -s 10000 -n 10000 --quiet --cpu-load
Salaam!
comparative test of concurrent counter mutators using explicit CAS and atomic Adders
with simulated cpu-load
with channel len 20000
--- access-with-CAS

delta:[reported:  3121890668 observed:  4715098119] e2e-overhead:[1593207451 (nsec) 33.789 (%)] [access-with-CAS]
--- access-with-Atomic

delta:[reported:     1733888 observed:  8550969928] e2e-overhead:[8549236040 (nsec) 99.980 (%)] [access-with-Atomic]

---------------------
reported: Atomic access faster by  3120156780 nsecs (31 nsec/mutation-op)
observed:    CAS access faster by  3835871809 nsecs (38 nsec/mutation-op)
 
Tweaking the channel depth did not affect the general results above.

potential bugs:

- With the --cpu-load option, however, the other 2 cores of the machine finally get to see some action from the CAS workers. The exception to the 50% machine utilization for the CAS workers is when their total number equals the number of cores, e.g. -a 2 -s 2 on the 4-core i7. Why using CAS vs. AddUint64 would make this difference is entirely unclear.

- The following typically -- but not always; cold starts seem to work fine -- has to be aborted with ctrl-c, and appears to be thrashing without making any progress:

go run atomic_number.go -a 10000 -s 10000 -n 1 --quiet

A todo is swapping the channel for a shared-memory approach to task-result reporting.
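Something along these lines: each worker writes its elapsed time into its own slot of a preallocated slice, and a sync.WaitGroup replaces the result channel (a sketch, not the final design):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const workers = 4
	elapsed := make([]time.Duration, workers) // one private slot per worker
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			t0 := time.Now()
			// ... mutation ops ...
			elapsed[id] = time.Since(t0) // write only to this worker's slot
		}(i)
	}
	wg.Wait() // happens-before edge: all slots are visible after Wait
	for id, d := range elapsed {
		fmt.Printf("worker %d: %v\n", id, d)
	}
}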