Enhancing the test with a few knobs was informative. The underlying cause lies in the scheduler and/or the channel implementation in Go 1.4.
Various other issues cropped up:
- with certain sets of input parameters, the CPUs churn at 100% without making any progress (at all).
- the CAS-based test (at least on Go 1.4, OS X 10.10.1, Intel i7) only uses 2 cores at 100%. The rest of the machine sits idle.
The options here are the number of +/- workers (-a and -s) and the iterations per worker (-n). Channel depth (-q) defaults to the number of workers.
--quiet suppresses individual worker reports.
--cpu-load simulates additional load.
The surprising result was due to the fact that the times being compared were end-to-end totals, as observed by the task runner. Reporting the times observed by the goroutines themselves showed the expected superior performance of atomic.AddXxx over the CAS-based implementation.
Fairly typical numbers below, which indicate that the scheduler and/or channel is having difficulty dealing with the 'running herd' effect of the fast-completing atomic.Adders. The CAS workers, which are /individually slower by orders of magnitude/, actually give better end-to-end task completion times. This is given the noted fact that 1/2 of the machine is sitting idle.
± go run atomic_number.go -a 10000 -s 10000 -n 10000 --quiet master
Salaam!
comparative test of concurrent counter mutators using explicit CAS and atomic Addders
delta:[reported: 3995467622 observed: 5016057141] e2e-overhead:[1020589519 (nsec) 20.346 (%)] [access-with-CAS]
delta:[reported: 1328514 observed: 6424952596] e2e-overhead:[6423624082 (nsec) 99.979 (%)] [access-with-Atomic]
reported: Atomic access faster by 3994139108 nsecs (39 nsec/mutation-op)
observed: CAS access faster by 1408895455 nsecs (14 nsec/mutation-op)
For (-a 10000 -s 10000 -n 100000 --quiet), the atomic.Adder spends 90% of its time -- nearly 1 *minute* -- in the channel (57,088,073,955 (nsec) 90.291 (%)).
Adding simulated orthogonal load doesn't help the atomic.Adders.
± go run atomic_number.go -a 10000 -s 10000 -n 10000 --quiet --cpu-load master
comparative test of concurrent counter mutators using explicit CAS and atomic Addders
delta:[reported: 3121890668 observed: 4715098119] e2e-overhead:[1593207451 (nsec) 33.789 (%)] [access-with-CAS]
delta:[reported: 1733888 observed: 8550969928] e2e-overhead:[8549236040 (nsec) 99.980 (%)] [access-with-Atomic]
reported: Atomic access faster by 3120156780 nsecs (31 nsec/mutation-op)
observed: CAS access faster by 3835871809 nsecs (38 nsec/mutation-op)
Tweaking the channel depth did not affect the general results above.
potential bugs:
- With the --cpu-load option, the other 2 cores of the machine finally get to see some action with the CAS workers. The exception to the 50% machine utilization for CASers is when their total number equals the number of cores, e.g. -a 2 -s 2 on the 4-core i7. Why using CAS vs. AddUint64 would make this difference is entirely unclear.
- The following typically (but not always; cold starts seem to work fine) has to be ctrl-c aborted, and appears to be thrashing without making any progress:
go run atomic_number.go -a 10000 -s 10000 -n 1 --quiet
TODO: swap the channel for a shared-memory approach to task result reporting.