The main tip I give people who try to understand fences, barriers, and memory ordering rules is to completely (and very intentionally) ignore and avoid thinking about any CPU's internal details (buffers, queues, flushes, and such). The simple rule to remember is this: before a CPU ever sees the instructions you think you are running, a compiler (static, JIT, whatever) will have a chance to completely shuffle those instructions around and hand them to the CPU in any order it sees fit. So what a CPU actually does with barriers has no functional/correctness implications, as it is the compiler's job to make the CPU do what it needs to. Barriers/fences/memory models are first and foremost compiler concerns, and have both correctness and (potentially profound) performance implications. The CPU concerns are only secondary, and only affect performance (never correctness). This is critical to understand and get your head around before trying to reason about what fences/barriers/etc. *mean* to program code.
For example, the sort of code mutation we are talking about is not just a simple change in order. It is also a combination of reordering with other optimizations. E.g. a repeated load from the same location in a loop can (and will) float "up above" the loop, execute only once (instead of once per loop iteration), and cache the loaded result in a register, unless barriers/fences/ordering rules prevent it from doing so. Similarly, redundant stores to the same location can be folded into a single store, unless such rules prevent it.
With that in mind, you can start looking at the various barrier and fence semantics available in various languages (that actually define them) and in various ad-hoc tooling (like the libraries and conventions used in C).
I like to use the simplistic LoadLoad, LoadStore, StoreStore, StoreLoad mental model as a starting point for thinking about barriers, as it can be used to model most (not all) barrier semantics. Those are simple to understand:
- LoadStore means "all loads that precede this fence will appear (from the point of view of any other thread) to execute before the next store appears to execute."
- LoadLoad means "all loads that precede this fence will appear (from the point of view of any other thread) to execute before the next load appears to execute."
etc. etc.
The name tells you what the rule is, clear and simple. Nothing more is promised, and nothing less.
I like to think of Acquire and Release (which are somewhat more restrictive) in terms of "what would a lock need to do?" (even if no lock is involved). Thinking in terms of "in a critical section protected by a lock, what sort of code motion would the acquire/release fences associated with the lock acquire and release actions allow?" lets me answer questions about them without looking up a definition or thinking about the more abstract notions that define them. E.g. a lock acquire can allow preceding loads and stores that (in the program) appear "before" the lock acquire to "float forward past the acquire and into the locked region". But it cannot allow loads and stores that (in the program) appear "after" the lock acquire to "float backwards past the acquire and out of the locked region". So an acquire fence needs to enforce those rules. Similarly, release can allow loads and stores that follow the release to "float back past the release and into the locked region", but cannot allow loads and stores that precede it to "float forward past the release and out of the locked region".
Since an acquire doesn't (necessarily) involve a load, Acquire doesn't quite equate to LoadLoad|LoadStore. It can be equivalent when acquiring the lock involves a load followed by LoadLoad|LoadStore, but there are lock implementations where that doesn't actually happen, and uses of Acquire fences with no locking or loads involved. Similarly, since a release doesn't (necessarily) involve a store, it is not quite equivalent to LoadStore|StoreStore (but can be when releasing involves a LoadStore|StoreStore followed by a store).
HTH