On 2/3/25 12:06, Peter Veentjer wrote:
> Imagine the following code:
>
> ... lots of writes to the buffer
> buffer.putInt(a_offset, a_value)     (1)
> buffer.putRelease(b_offset, b_value) (2)
> releaseFence()                       (3)
> buffer.putInt(c_offset, c_value)     (4)
>
> Buffer is a chunk of memory that is shared with another process and the writes need to be seen in
> order. So when 'b' is seen, 'a' should be seen. And when 'c' is seen, 'b' should be seen. There is
> no other synchronization.
>
> All offsets are guaranteed to be naturally aligned. All the putInts are plain puts (using Unsafe).
>
> The putRelease (2) will ensure that 'a' is seen before 'b', and it will ensure atomicity and
> visibility of 'b' (using the appropriate compiler and memory fences where needed).
>
> The releaseFence (3) will ensure that b is seen before c.
Looks to me like this fence can be replaced with a releasing store of "c":

buffer.putInt(a_offset, a_value)
buffer.putRelease(b_offset, b_value)
buffer.putRelease(c_offset, c_value)
My preference is almost always to avoid the explicit fences if you can control the memory ordering
of the actual accesses. Using putRelease instead of an explicit fence also forces you to think about
the symmetries: should all loads of "c" be performed with getAcquire to match the putRelease?
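A minimal sketch of that symmetric shape (VarHandles stand in for the Unsafe calls, a plain int[]
stands in for the shared region, and all names and indices here are made up):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class SharedBuffer {
    static final VarHandle INTS =
            MethodHandles.arrayElementVarHandle(int[].class);

    static final int A_IDX = 0, B_IDX = 1, C_IDX = 2;

    final int[] mem = new int[1024]; // stand-in for the shared mapping

    void produce(int a, int b, int c) {
        INTS.set(mem, A_IDX, a);        // plain store of 'a'
        INTS.setRelease(mem, B_IDX, b); // 'a' visible before 'b'
        INTS.setRelease(mem, C_IDX, c); // 'b' visible before 'c'
    }

    int consumeC() {
        // The matching acquire load: once it observes the released 'c',
        // everything stored before that release is visible too.
        return (int) INTS.getAcquire(mem, C_IDX);
    }
}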
> My question is about (4). Since it is a plain store, the compiler can do a ton of trickery, including
> delaying the visibility of (4). Is my understanding correct, and is there anything else that could go
> wrong?
The common wisdom is indeed "let's use a non-plain memory access mode, so the access is hopefully
more prompt", but I have not seen any of these effects thoroughly quantified beyond "let's forbid
the compiler to yank our access out of the loop". Maybe I have not looked hard enough.
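For reference, that loop case is the one effect that is well understood. A small illustration
(names are made up, again with VarHandles standing in for Unsafe):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class SpinFlag {
    static final VarHandle STOP;
    static {
        try {
            STOP = MethodHandles.lookup()
                    .findVarHandle(SpinFlag.class, "stop", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    boolean stop;

    void spinPlain() {
        // Plain read: the compiler is free to hoist it out of the loop,
        // potentially turning this into an infinite loop.
        while (!stop) { Thread.onSpinWait(); }
    }

    void spinOpaque() {
        // Opaque read: performed on every iteration, so a store by
        // another thread is eventually observed.
        while (!(boolean) STOP.getOpaque(this)) { Thread.onSpinWait(); }
    }
}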
I suspect the delays introduced by the compiler moving code around within a sequential code stream
are on a scale where they do not matter all that much for end-to-end latency. The only (?) place
where the impact of code movement could be multiplied into a macro-effect is when memory ops shift
into/out of/around loops. I would not be overly concerned about the latency impact of reordering
within a short straight-line code stream.
You can try to measure it with producer-consumer / ping-pong style benchmarks: put more memory ops
around (4), turn on the instruction scheduler randomizers (-XX:+StressLCM should be useful here,
maybe also -XX:+StressGCM), and see if there is an impact. I suspect the effect is too fine-grained
to be accurately measured with direct timing, so you'll need to get creative about how to measure
"promptness".
> What would be the lowest memory access mode that would resolve this problem? My guess is that the
> last putInt, should be a putIntOpaque.
Yes, in current Hotspot, opaque would effectively pin the access in place, so it would be exposed to
the hardware in an order close to the original source code order. Then it is up to the hardware to
decide when to perform the store. But as I said above, I'd be surprised if it actually matters.
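In the SharedBuffer sketch above, the original sequence with (4) upgraded to opaque would look
something like this (still assuming the fence at (3) is kept):

void produceWithOpaqueTail(int a, int b, int c) {
    INTS.set(mem, A_IDX, a);        // (1) plain store
    INTS.setRelease(mem, B_IDX, b); // (2) 'a' visible before 'b'
    VarHandle.releaseFence();       // (3) 'b' ordered before 'c'
    INTS.setOpaque(mem, C_IDX, c);  // (4) opaque: the compiler keeps the
                                    //     store in place; the hardware
                                    //     decides when it becomes visible
}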
Thanks,
-Aleksey