In this particular case, it is of course a toy example, and for this toy example the performance of the synchronization primitive simply does not matter. (If I followed correctly, the intent is to watch a long-running task, the example has a 500ms sleep, etc.)
That said, sometimes performance does matter.
If instead this was a DIFFERENT example sitting in some tight, performance-critical inner loop, the performance of the synchronization primitive for a stop flag can start to be meaningful (which of course should first be demonstrated by benchmarking and/or profiling your particular program -- "premature optimization is the root of all evil" and all that).
Empirically speaking, that is the situation where I've seen people get into discussions along the lines of "well, on amd64 you can rely on X and Y, so we can get away with a stop flag that doesn't use a mutex or an atomic", and then people start talking about benign data races, and then other people counter that "there are no benign data races", and then someone brings up Dmitry Vyukov's article on benign data races, etc.
To be honest, I've seen it take up a fair amount of energy just to discuss, and, again empirically speaking, I've seen very smart people make mistakes here.
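For concreteness, the shape of stop flag I have in mind is roughly the following (a minimal sketch; the variable names, the counter standing in for "real work", and the 10ms of run time are just made up for illustration), where the flag is only ever touched through sync/atomic so the race detector stays happy:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var stop uint64 // 0 = keep running, 1 = stop; only ever accessed via sync/atomic
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		iterations := 0
		for atomic.LoadUint64(&stop) == 0 { // poll the flag in the hot loop
			iterations++ // stand-in for a small unit of real work
		}
		fmt.Println("worker stopped after", iterations, "iterations")
	}()

	time.Sleep(10 * time.Millisecond) // let the worker run for a bit
	atomic.StoreUint64(&stop, 1)      // request stop
	wg.Wait()
}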
One question I have regarding using an atomic for a stop flag is whether or not there is material performance overhead compared to a simple unprotected/non-atomic stop flag.
I looked at that question a bit circa Go 1.7, so possibly stale info here, and I haven't gone back to look at my more detailed notes, but if I recall correctly my conclusions at the time were:
1. If your use case is a stop flag that will be checked many times (say, in a tight inner loop) and the flag only gets set rarely, then the performance of the set doesn't matter much, and the assembly emitted for an atomic load (such as atomic.LoadUint64) seems to be identical to the assembly emitted for a simple load (say, *myUint64Ptr), based on some spot checking a while ago (with a sample of assembly from 1.9 pasted at the bottom of this post for comparison purposes).
2. Basic micro benchmarks didn't show a difference between an atomic load and an unprotected load for a stop flag (a rough sketch of that style of benchmark is below, just after this list).
3. There might be some theoretically possible overhead due to the fact that the Go compiler will do instruction reordering in general but won't reorder across function calls, so that could be one difference that might or might not matter depending on your exact code (though likely with modest impact). Regarding this last point, I'm just an interested spectator when it comes to the Go compiler, so to be clear, I don't really know this definitively.
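To be clear, this is not the exact benchmark code from back then, just a rough sketch of the style of comparison I mean (in a hypothetical atomic_vs_normal_load_test.go, run with 'go test -bench=.'), with the usual micro-benchmark caveats -- for example, the compiler is in principle free to hoist the plain load out of the loop:

package main

import (
	"sync/atomic"
	"testing"
)

var stopFlag uint64 // package-level so each loop iteration is a real memory load

var sink uint64 // written at the end so the loop results aren't dead code

func BenchmarkSimpleLoad(b *testing.B) {
	var total uint64
	for i := 0; i < b.N; i++ {
		total += stopFlag // plain, unsynchronized load
	}
	sink = total
}

func BenchmarkAtomicLoad(b *testing.B) {
	var total uint64
	for i := 0; i < b.N; i++ {
		total += atomic.LoadUint64(&stopFlag) // atomic load
	}
	sink = total
}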
At least for us at the time, the quick test program & comparison of assembly was enough to move us past the whole "Well, on amd64 you can get away with X" discussion, so I didn't delve much deeper than that.
In any event, sharing this sample assembly with the community in case anyone is interested and/or has additional commentary on the performance impact in Go of using an atomic load vs. an unprotected stop flag for the (admittedly somewhat rare) cases when the nanoseconds do indeed matter. (And for me, a clean report from the race detector trumps the performance arguments, but that doesn't mean I'm not curious about the performance...)
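(For completeness, by "race detector" I just mean the standard -race tooling run against whatever real program or test actually exercises the stop flag, e.g.:

go test -race ./...
go run -race yourprogram.go

where yourprogram.go is a placeholder name. An unprotected flag that really is written and read from different goroutines will generally get reported as soon as those accesses overlap during the run.)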
Here is a simple test program:
package main

import "sync/atomic"

func simpleLoadUint64(inputPtr *uint64) uint64 {
	// normal load of *inputPtr
	return *inputPtr
}

func atomicLoadUint64(inputPtr *uint64) uint64 {
	// atomic.LoadUint64 atomically loads *inputPtr
	return atomic.LoadUint64(inputPtr)
}

func main() {} // empty main so "go build" has something to build
And here are the corresponding assembly snippets (from Go 1.9):
go build -gcflags -S atomic_vs_normal_load.go
// trivial function with an unprotected load
"".simpleLoadUint64 STEXT nosplit size=14 args=0x10 locals=0x0
0x0000 00000 (atomic_vs_normal_load.go:8) TEXT "".simpleLoadUint64(SB), NOSPLIT, $0-16
0x0000 00000 (atomic_vs_normal_load.go:8) FUNCDATA $0, gclocals·aef1f7ba6e2630c93a51843d99f5a28a(SB)
0x0000 00000 (atomic_vs_normal_load.go:8) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (atomic_vs_normal_load.go:8) MOVQ "".inputPtr+8(SP), AX
0x0005 00005 (atomic_vs_normal_load.go:10) MOVQ (AX), AX
0x0008 00008 (atomic_vs_normal_load.go:10) MOVQ AX, "".~r1+16(SP)
0x000d 00013 (atomic_vs_normal_load.go:10) RET
0x0000 48 8b 44 24 08 48 8b 00 48 89 44 24 10 c3 H.D$.H..H.D$..
// trivial function with a sync/atomic LoadUint64:
"".atomicLoadUint64 STEXT nosplit size=14 args=0x10 locals=0x0
0x0000 00000 (atomic_vs_normal_load.go:13) TEXT "".atomicLoadUint64(SB), NOSPLIT, $0-16
0x0000 00000 (atomic_vs_normal_load.go:13) FUNCDATA $0, gclocals·aef1f7ba6e2630c93a51843d99f5a28a(SB)
0x0000 00000 (atomic_vs_normal_load.go:13) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (atomic_vs_normal_load.go:13) MOVQ "".inputPtr+8(SP), AX
0x0005 00005 (atomic_vs_normal_load.go:15) MOVQ (AX), AX
0x0008 00008 (atomic_vs_normal_load.go:15) MOVQ AX, "".~r1+16(SP)
0x000d 00013 (atomic_vs_normal_load.go:15) RET
0x0000 48 8b 44 24 08 48 8b 00 48 89 44 24 10 c3 H.D$.H..H.D$..
--thepudds