Hey Konstantinos,
Sure, but I'll begin by noting that the implementation of memory synchronization instructions must necessarily touch the full memory instruction path, so there's a lot of detail that is best gleaned from the code.
Bar instructions execute implicit membars; specifically, bar.sync implicitly performs a membar.cta fence, as specified in the PTX spec. The warp barrier/blocking part of the bar instruction is handled entirely within GPGPU-Sim by the m_barriers variable of warp_inst_t and is enforced in gpgpu-sim/gpgpu-sim/shader.cc. That enforces only the thread issue progress guarantee, not the memory fence itself, which is a memory ordering guarantee.
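To make the barrier/fence split concrete, here is a minimal sketch of the warp-blocking half only, in the spirit of the m_barriers bookkeeping: a per-CTA bitmask of arrived warps. The names and structure are illustrative, not GPGPU-Sim's actual code, and it says nothing about memory ordering.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the thread-progress half of bar.sync: a per-CTA
// bitmask of warps that have arrived at the barrier. Illustrative only;
// the real bookkeeping lives in warp_inst_t / shader.cc.
struct CtaBarrier {
    uint64_t expected;     // bit set for every warp in the CTA
    uint64_t arrived = 0;  // warps currently blocked at the barrier

    explicit CtaBarrier(unsigned num_warps)
        : expected((num_warps >= 64) ? ~0ULL : ((1ULL << num_warps) - 1)) {}

    // A warp reaches the barrier; returns true when the last warp arrives
    // and the barrier releases all of them.
    bool arrive(unsigned warp_id) {
        arrived |= (1ULL << warp_id);
        if (arrived == expected) {
            arrived = 0;   // release: every warp may be issued again
            return true;
        }
        return false;      // this warp stays blocked from issuing
    }

    bool is_blocked(unsigned warp_id) const {
        return (arrived >> warp_id) & 1ULL;
    }
};
```

Note that releasing the barrier here says only that warps may issue again; the memory-ordering side of bar.sync (the implicit membar.cta) has to be enforced separately, down the memory path described below.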
So, you're interested in the consistency/coherence implications. The path of the membar instructions is as follows: warp instructions with the op designation MEMORY_BARRIER_OP (or BARRIER_OP for bar.sync) are issued to the memory instruction dispatch register m_mem_out in scheduler_unit::cycle() (gpgpu-sim/gpgpu-sim/shader.cc). This is the same path followed by load, store, and atomic instructions to get to the gem5-gpu side of the code: the ldst_unit passes the instruction to CudaCore::executeMemOp.
The CudaCore interprets the instruction's configuration and passes a memory request across each of the lsqPorts (one per lane) to the ShaderLSQ (ShaderLSQ::recvTimingReq()). In this exchange, the fence memory request is passed to the ShaderLSQ only for the shader lanes that are active when the membar instruction executes.
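The per-lane fan-out can be sketched roughly like this; the names (kWarpSize, FenceRequest, sendFence) are hypothetical and the real interface is CudaCore/ShaderLSQ's, but the idea is the same: one request per active lane, inactive lanes skipped.

```cpp
#include <bitset>
#include <vector>

// Hypothetical sketch of fanning a fence out per active lane, in the spirit
// of CudaCore::executeMemOp handing requests to the lsqPorts. Names and
// types are illustrative, not gem5-gpu's actual interfaces.
constexpr unsigned kWarpSize = 32;

struct FenceRequest { unsigned lane; };

std::vector<FenceRequest> sendFence(const std::bitset<kWarpSize>& active) {
    std::vector<FenceRequest> sent;
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        // Only lanes active in the membar instruction get a fence request.
        if (active.test(lane)) sent.push_back({lane});
    }
    return sent;
}
```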
The ShaderLSQ maintains ordered queues of memory requests for each warp executing on the core, and it enforces ordering around fences. Before going into more detail, it is important to note that, like many real cache hierarchies, Ruby does not enforce memory access ordering on behalf of the LSQ, so the LSQ must ensure the consistency model is enforced appropriately. To do this with the current gem5-gpu coherence protocols, the LSQ must know the scope ("level" in the PTX spec) of the requested fence, which can be .cta, .gl, or .sys, and it must enforce the ordering of memory accesses sent to the Ruby cache hierarchy. For the existing protocols, the LSQ simply waits for all pending memory accesses to be completed by Ruby, clears the fence, and then continues issuing memory accesses from warp instructions queued after the fence.
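A minimal sketch of that per-warp drain-then-clear behavior might look like the following. This is not the ShaderLSQ's actual code; the queue, the outstanding counter, and the method names are all illustrative.

```cpp
#include <cassert>
#include <deque>

// Hypothetical sketch of per-warp ordering around a fence: a fence at the
// queue head blocks further issue until every access already sent to the
// cache hierarchy has been acked. The real logic lives in gem5-gpu's
// ShaderLSQ; names here are illustrative.
struct WarpQueue {
    enum class Op { Load, Store, Fence };
    std::deque<Op> queue;        // program-ordered memory ops for one warp
    unsigned outstanding = 0;    // accesses sent to Ruby, not yet acked

    // Try to issue the op at the queue head; returns true if one issued.
    bool tryIssue() {
        if (queue.empty()) return false;
        if (queue.front() == Op::Fence) {
            // With the existing gem5-gpu protocols, a fence of any scope
            // clears once all prior accesses have been acked by Ruby.
            if (outstanding == 0) {
                queue.pop_front();   // fence clears; issue may resume
                return true;
            }
            return false;            // fence blocks further issue
        }
        ++outstanding;               // send the access to the cache hierarchy
        queue.pop_front();
        return true;
    }

    void recvAck() { assert(outstanding > 0); --outstanding; }
};
```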
More notes on consistency: VI_hammer is the only publicly available coherence protocol that allows stale data in the GPU L1s. However, VI_hammer invalidates any cache line in the GPU L1 that is being written to, and writes the data back to the GPU L2, which is always coherent with the rest of the cache hierarchy besides the GPU L1s. So, as with the other coherence protocols available with gem5-gpu, the ShaderLSQ can be assured that a memory fence of any scope can be cleared once it has received responses/acks for all prior store accesses sent to Ruby.
However, a coherence protocol that does not guarantee that stores are pushed at least to the scope of the memory fence may need to do more to handle fences correctly. For example, if you allow stale GPU L1 data to be updated without writing through to a globally visible GPU L2, a membar.gl fence (all prior stores must be globally visible) would require that the stale, dirty data in the GPU L1 be flushed back to the GPU L2 before the fence can be cleared. If you need functionality like this, you will probably have to tweak the ShaderLSQ to do these checks and send the appropriate messages to the Ruby cache hierarchy.
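To illustrate the scope-dependent check you'd be adding, here is a hedged sketch. FenceScope, tryClear, and recvFlushAck are all hypothetical names, and the flush request to Ruby is elided; this is the shape of the extra logic, not existing gem5-gpu code.

```cpp
// Hypothetical sketch of scope-dependent fence handling for a protocol that
// lets dirty data sit in the GPU L1: a .gl or .sys fence must also see the
// dirty L1 lines flushed to the globally visible GPU L2 before it clears.
// Not existing gem5-gpu code; all names are illustrative.
enum class FenceScope { Cta, Gl, Sys };

struct FenceState {
    unsigned outstanding;  // prior accesses still pending in Ruby
    bool l1_has_dirty;     // protocol has left dirty lines in the GPU L1

    // Returns true once a fence of the given scope may clear.
    bool tryClear(FenceScope scope) const {
        if (outstanding > 0) return false;          // always drain prior accesses
        if (scope == FenceScope::Cta) return true;  // L1 visibility suffices
        // .gl/.sys: dirty L1 data must be flushed to the L2 first. The flush
        // request to the Ruby hierarchy is elided here.
        return !l1_has_dirty;
    }

    // Ruby has acknowledged the L1 flush.
    void recvFlushAck() { l1_has_dirty = false; }
};
```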
Hope this helps,
Joel