
Another memory order opaque question


Peter Veentjer

Feb 3, 2025, 6:07:22 AM
to mechanica...@googlegroups.com
Imagine the following code:

... lots of writes to the buffer
buffer.putInt(a_offset,a_value)  (1)
buffer.putRelease(b_offset,b_value) (2)
releaseFence() (3)
buffer.putInt(c_offset,c_value) (4)

The buffer is a chunk of memory shared with another process, and the writes need to be seen in order: when 'b' is seen, 'a' should be seen, and when 'c' is seen, 'b' should be seen. There is no other synchronization.

All offsets are guaranteed to be naturally aligned. All the putInts are plain puts (using Unsafe).

The putRelease (2) will ensure that 'a' is seen before 'b', and it will ensure atomicity and visibility of 'b' (so the appropriate compiler and memory fences are emitted where needed).

The releaseFence (3) will ensure that b is seen before c.

My question is about (4). Since it is a plain store, the compiler can do a ton of trickery, including delaying the visibility of (4). Is my understanding correct, and is there anything else that could go wrong?

What would be the lowest memory access mode that would resolve this problem? My guess is that the last putInt should be a putIntOpaque.

Probably it would be better to replace (4) with a putIntRelease and get rid of the releaseFence, since that would solve the above problem as well.
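
For concreteness, here is the same sequence written against a direct ByteBuffer with a view VarHandle (Java 9+) instead of Unsafe; the class and value names are just placeholders:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class WriteSequence {
    static final VarHandle INT =
        MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    static void publish(ByteBuffer buffer, int a_offset, int b_offset, int c_offset) {
        buffer.putInt(a_offset, 1);           // (1) plain store
        INT.setRelease(buffer, b_offset, 2);  // (2) release store: 'a' ordered before 'b'
        VarHandle.releaseFence();             // (3) orders 'b' before the store below
        buffer.putInt(c_offset, 3);           // (4) plain store: the one in question
    }
}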

Aleksey Shipilev

Feb 5, 2025, 5:31:07 AM
to mechanica...@googlegroups.com, Peter Veentjer
On 2/3/25 12:06, Peter Veentjer wrote:
> Imagine the following code:
>
> ... lots of writes to the buffer
> buffer.putInt(a_offset,a_value)  (1)
> buffer.putRelease(b_offset,b_value) (2)
> releaseFence() (3)
> buffer.putInt(c_offset,c_value) (4)
>
> Buffer is a chunk of memory that is shared with another process and the writes need to be seen in
> order. So when 'b' is seen, 'a' should be seen. And when 'c' is seen, 'b' should be seen. There is
> no other synchronization.
>
> All offsets are guaranteed to be naturally aligned. All the putInts are plain puts (using Unsafe).
>
> The putRelease (2) will ensure that 'a' is seen before 'b', and it will ensure atomicity and
> visibility of 'b' (so the appropriate compiler and memory fences where needed).
>
> The releaseFence (3) will ensure that b is seen before c.

Looks to me like this fence can be replaced with a releasing store of "c":

buffer.putInt(a_offset,a_value)
buffer.putRelease(b_offset,b_value)
buffer.putRelease(c_offset,c_value)

My preference is almost always to avoid the explicit fences if you can control the memory ordering
of the actual accesses. Using putRelease instead of an explicit fence also forces you to think about the
symmetries: should all loads of "c" be performed with getAcquire to match the putRelease?
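
E.g., a sketch of what the symmetric reader could look like (reusing a byte-buffer view VarHandle over the same buffer; names are illustrative):

static final VarHandle INT =
    MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

static boolean readAll(ByteBuffer buffer, int a_offset, int b_offset, int c_offset) {
    int c = (int) INT.getAcquire(buffer, c_offset); // pairs with the releasing store of 'c'
    if (c == 0) return false;                       // not published yet
    int b = buffer.getInt(b_offset);                // plain loads: cannot float above the
    int a = buffer.getInt(a_offset);                // acquire, so they see 'b' and 'a'
    return true;
}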

> My question is about (4). Since it is a plain store, the compiler can do a ton of trickery including
> the delay of visibility of (4). Is my understanding correct and is there anything else that could go
> wrong?

The common wisdom is indeed "let's use a non-plain memory access mode, so the access is hopefully more
prompt", but I have not seen any of these effects thoroughly quantified beyond "let's forbid the
compiler from yanking our access out of the loop". Maybe I have not looked hard enough.

I suspect the delays introduced by the compiler moving code around in sequential code streams are on a
scale where they do not matter all that much for end-to-end latency. The only (?) place where code
movement could be multiplied into a macro-effect is when memory ops shift in/out/around loops. I would
not be overly concerned about the latency impact of reordering within a short straight-line code stream.

You can try to measure it with producer-consumer / ping-pong style benchmarks: put more memory ops
around (4), turn on the instruction scheduler randomizers (-XX:+StressLCM should be useful here, maybe
-XX:+StressGCM), and see if there is an impact. I suspect the effect is too fine-grained to be
accurately measured with direct timing measurements, so you'll need to get creative about how to measure
"promptness".

> What would be the lowest memory access mode that would resolve this problem? My guess is that the
> last putInt should be a putIntOpaque.

Yes, in current Hotspot, opaque would effectively pin the access in place, so it would be exposed to
hardware in an order closer to the original source code order. Then it is up to the hardware to decide
when to perform the store. But as I said above, I'd be surprised if it actually matters.
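
In VarHandle terms, the pinned variant would be something like (a sketch, with the same view handle as in your snippet):

buffer.putInt(a_offset, a_value);           // plain
INT.setRelease(buffer, b_offset, b_value);  // release: 'a' before 'b'
VarHandle.releaseFence();                   // 'b' before 'c'
INT.setOpaque(buffer, c_offset, c_value);   // opaque: compiler keeps it in place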

Thanks,
-Aleksey

Peter Veentjer

Feb 11, 2025, 6:34:30 AM
to Aleksey Shipilev, mechanica...@googlegroups.com
Thanks a lot for your answer and for the confirmation that my understanding is correct.

Daniel Marques

Feb 12, 2025, 10:55:39 AM
to mechanica...@googlegroups.com, Aleksey Shipilev
I'm very new to both off-heap allocations and the JMM, etc., so forgive the perhaps naive question.

Introductory material on the JMM typically presents the following example of a correct program, assuming the two methods are executed concurrently in different threads.

class ExampleOne {
     volatile int ready;
     int value;

     void threadOne() {
          value = 100;
          ready = 1;
     }

     void threadTwo() {
          while (ready == 0)
                ;
          assert value == 100;
     }
}

Is the following semantically equivalent when the two methods are run in different processes, or are there additional operations necessary to 'coordinate' between two processes sharing memory (assuming JDK >= 9)?

class ExampleTwo {
      MappedByteBuffer dataBuffer;
      long dataBufferAddr;
      int valueOffset = 0;
      int readyOffset = 4;

      ExampleTwo() {
            File file = new File("foo.dat");
            FileChannel fc = new ...
            dataBuffer = fc.map(READ_WRITE, 0, 2 * Integer.BYTES);
            dataBufferAddr = Unsafe.magic(dataBuffer); // I'm actually using Agrona's UnsafeBuffer to do all the magic for me
      }

      void threadOne() {
           dataBuffer.putInt(valueOffset, 100);
           Unsafe.putIntVolatile(null, dataBufferAddr + readyOffset, 1);
      }

      void threadTwo() {
           while (Unsafe.getIntVolatile(null, dataBufferAddr + readyOffset) == 0)
                 ;
           assert dataBuffer.getInt(valueOffset) == 100;
      }
}

Thanks in advance, 

Dan

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/mechanical-sympathy/CAGuAWdAsWprk9BK46iJdZ_w1wPBcM4OCkDgCLTAP98B4VCPscw%40mail.gmail.com.

Peter Veentjer

Feb 12, 2025, 2:05:16 PM
to mechanica...@googlegroups.com
Yes, it is the same.

You could even go for:

class ExampleTwo {

     void threadOne() {
          dataBuffer.putInt(valueOffset, 100);
          Unsafe.putIntRelease(null, dataBufferAddr + readyOffset, 1);
     }

     void threadTwo() {
          while (Unsafe.getIntAcquire(null, dataBufferAddr + readyOffset) == 0)
                ;
          assert dataBuffer.getInt(valueOffset) == 100;
     }
}
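
(Note that putIntRelease/getIntAcquire live on jdk.internal.misc.Unsafe rather than the public sun.misc.Unsafe. With standard APIs only, a sketch of the same thing using a view VarHandle over the mapped buffer, assuming naturally aligned offsets, could look like:)

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;

class ExampleTwoVarHandle {
    static final VarHandle INT =
        MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    MappedByteBuffer dataBuffer; // mapped as in your constructor
    final int valueOffset = 0;
    final int readyOffset = 4;

    void threadOne() {
        dataBuffer.putInt(valueOffset, 100);        // plain store
        INT.setRelease(dataBuffer, readyOffset, 1); // release: 'value' visible before 'ready'
    }

    void threadTwo() {
        while ((int) INT.getAcquire(dataBuffer, readyOffset) == 0)
            ; // spin until the release store becomes visible
        assert dataBuffer.getInt(valueOffset) == 100; // ordered after the acquire
    }
}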


Daniel Marques

Feb 17, 2025, 2:34:48 AM
to mechanica...@googlegroups.com
Thanks for the response. I hope you don't mind a few follow-ups:

Is there a "for dummies" which describes the difference between Release/Acquire vs Volatile?  For Hotspot and x86-64, are there actual differences in implementation, and measurable performance using Release/Acquire vs Volatile?

Again, thanks in advance.

Dan


Peter Veentjer

Feb 17, 2025, 3:10:42 AM
to mechanica...@googlegroups.com
For Hotspot on x86, release/acquire vs volatile can make a difference.

Imagine you have:

A=10
r1=B

So we have a store to A and a load of B. 

On x86, every store is a release store and every load is an acquire load.

On x86, a store can be reordered with a later load due to the store buffer. So A=10 and r1=B could be reordered.

If A and B were volatile, this reordering wouldn't be allowed. The reason is that a program without data races must only have sequentially consistent executions, and for an execution to be sequentially consistent, it needs to have the same effect as if all the threads ran their operations in program order (so no reordering).

To prevent the store and the load from being reordered, a [StoreLoad] barrier needs to be inserted (e.g. in the form of an MFENCE or a LOCK-prefixed instruction):

A=10
[StoreLoad]
r1=B

This [StoreLoad] effectively stalls the execution of the load (r1=B) until the store (A=10) in the store buffer has been drained to the coherent cache, and this could take some time. There could be many queued stores in the store buffer, all waiting for their cache lines to be returned in the right state.

Without the [StoreLoad], the load can be performed even while the store is still in the store buffer.
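
To make the contrast concrete, here is a sketch with field VarHandles; on x86-64, Hotspot compiles setVolatile to a mov followed by a lock-prefixed add (the [StoreLoad]), while setRelease and getAcquire compile to plain movs:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Modes {
    int A, B;

    static final VarHandle VA, VB;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            VA = l.findVarHandle(Modes.class, "A", int.class);
            VB = l.findVarHandle(Modes.class, "B", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    int volatileVersion() {
        VA.setVolatile(this, 10);          // mov + lock add: waits for the store buffer to drain
        return (int) VB.getVolatile(this);
    }

    int acquireReleaseVersion() {
        VA.setRelease(this, 10);           // plain mov on x86
        return (int) VB.getAcquire(this);  // plain mov; may complete while the store
                                           // is still sitting in the store buffer
    }
}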





Peter Veentjer

Feb 17, 2025, 3:14:34 AM
to mechanica...@googlegroups.com
And due to the out-of-order execution of modern microarchitectures, with acquire/release the load can even be executed before the store, since there is no RAW (read-after-write) dependency between them.

