How to call cudaDeviceSynchronize() from within a CUDA kernel.


Konstantinos Koukos

Mar 12, 2015, 2:36:58 PM
to gem5-g...@googlegroups.com
Hello,

Is there any way to call cudaDeviceSynchronize(); from within a CUDA kernel?

Best Regards,
Konstantinos.

Jason Power

Mar 12, 2015, 3:27:28 PM
to Konstantinos Koukos, gem5-g...@googlegroups.com
Hi Konstantinos,

As far as I know, CUDA doesn't support calling device synchronize from within a CUDA kernel. I could be wrong though. Is there some new PTX instruction that does this?

You could probably add your own instruction that calls into the simulator and then syncs the CUDA stream, though.

Cheers,
Jason

Konstantinos Koukos

Mar 12, 2015, 3:39:51 PM
to gem5-g...@googlegroups.com, koukos.ko...@gmail.com
Hello Jason and thanks for the reply,

I don't know if it's correct either. The reason I tried to call it is that I know it
implements the functionality I want, and it is also included in the .cc file
generated from the .cu file.

I was wondering if there is any way to flush the GPU CU from inside a CUDA kernel.
Maybe I could use something like __syncthreads() with different semantics, for example either of:

__threadfence();             and
__threadfence_system();

According to CUDA semantics, if I add support for a flush, the CUDA core should work fine.

Any idea how to do this?
I cannot find where the wrappers for __syncthreads(), __threadfence(), or __threadfence_system()
are implemented in the simulator. Do I have to look for opcodes?

Any other suggestions on how to achieve my final goal are welcome.

Best Regards,
Konstantinos.

Joel Hestness

Mar 12, 2015, 4:48:05 PM
to Konstantinos Koukos, gem5-g...@googlegroups.com
Hi Konstantinos,
  These are CUDA built-in GPU functions that are mapped to PTX instructions by the compiler. Specifically, I believe __syncthreads() goes to bar.sync, and __threadfence() goes to membar.gl. These both work correctly with the coherence protocols available in gem5-gpu. I'm not sure whether the system-level threadfence is available in a CUDA version that works with gem5-gpu, but it should map to membar.sys. I'd recommend looking at the PTX docs for more details.
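
  For reference, here is a minimal kernel exercising all three builtins (a sketch assuming an nvcc toolchain; compiling with nvcc -ptx lets you inspect the lowering described above):

__global__ void fence_demo(int *data)
{
    data[threadIdx.x] = threadIdx.x;

    __syncthreads();          // lowers to PTX: bar.sync 0  (CTA-wide barrier)
    __threadfence();          // lowers to PTX: membar.gl   (device-scope fence)
    __threadfence_system();   // lowers to PTX: membar.sys  (system-scope fence)

    if (threadIdx.x == 0)
        data[0] = data[blockDim.x - 1];
}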

  Joel

--
 Joel Hestness
 PhD Student, Computer Architecture
 Dept. of Computer Science, University of Wisconsin - Madison

Jason Power

Mar 12, 2015, 5:20:36 PM
to Joel Hestness, Konstantinos Koukos, gem5-g...@googlegroups.com

Hi Konstantinos,

I believe bar.sync takes an integer as a parameter, so you may be able to leverage that. Also, I believe there is a function in the CudaCore or the ShaderLSQ which will forward a flush to the cache. And I think this flush is implemented in VI_hammer. If you can't find that support, it may be on our review board. Let me know if you can't find it, and I can dig it up.
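
A sketch of how you might leverage that integer parameter, assuming nvcc's inline PTX support (the barrier id is an arbitrary example value, not anything gem5-gpu requires):

__device__ void named_barrier(unsigned id)
{
    // sm_20+ PTX allows the barrier resource id (0-15) to come from a register.
    asm volatile("bar.sync %0;" :: "r"(id) : "memory");
}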

Jason

Konstantinos Koukos

Mar 12, 2015, 5:34:01 PM
to gem5-g...@googlegroups.com, jthes...@gmail.com, koukos.ko...@gmail.com
Thanks a lot,

I think I found something (I don't want to trouble you too much with this).
Apparently the situation is the following:

I couldn't find either membar.sys or membar.gl in the simulator's opcodes.def.
What I found instead is: OP_DEF(MEMBAR_OP,membar_impl,"membar",1,3)

which is apparently unimplemented:

void membar_impl( const ptx_instruction *pI, ptx_thread_info *thread )
{
   // handled by timing simulator
}

Any idea if this maps to both, or to something else?

Any suggestions here? Could I add both membar.sys and membar.gl to the simulator and
add the functionality I require?

Best Regards,
Konstantinos.

Jason Power

Mar 12, 2015, 6:04:44 PM
to Konstantinos Koukos, gem5-g...@googlegroups.com, jthes...@gmail.com
That sounds totally reasonable to me. Hopefully we won't be providing real implementations anytime soon ;).

Jason

Joel Hestness

Mar 12, 2015, 6:15:52 PM
to Konstantinos Koukos, gem5-gpu developers
Hi guys,
  gem5-gpu already implements the three flavors of membar instructions (.cta, .gl, and .sys) and the .sync flavor of the bar instruction, and these *should* all work correctly with the coherence protocols we have publicly available. (NOTE: GPGPU-Sim parses instruction flags like '.gl' and '.sync' as instruction variants rather than as separate instructions - they perform the same instruction, but possibly with different effects). The details of how these instruction flavors are supposed to function are described in the PTX documentation (and may be worth reviewing/testing/validating if you've extended existing coherence protocols or have added new protocols).
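
  To make the "variants" point concrete, here is a purely illustrative sketch (none of these names come from GPGPU-Sim - check ptx_ir.h and opcodes.def for the real interfaces): a single membar opcode whose PTX level suffix is carried as a flag that the timing model dispatches on, rather than three separate opcodes.

// Hypothetical names throughout; illustrative only.
enum MembarLevel { MEMBAR_CTA, MEMBAR_GL, MEMBAR_SYS };

struct FenceInst {
    MembarLevel level;   // parsed from the '.cta'/'.gl'/'.sys' suffix
};

const char *fence_scope(const FenceInst &f)
{
    switch (f.level) {
    case MEMBAR_CTA: return "prior writes visible within the CTA";
    case MEMBAR_GL:  return "prior writes visible device-wide";
    case MEMBAR_SYS: return "prior writes visible system-wide";
    }
    return "unknown";
}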

  There is a microbenchmark in our benchmarks/unittests directory called fence_group_consistency, which can be compiled to test each of the different flavors of synchronization instructions (bar and membar variants). It appears that I also checked in my latest version, which allows you to test coherence/consistency with possibly multiple CPU threads in a cache-coherent unified memory. I'd encourage you to check out and use the microbenchmark to learn more about how these instructions operate.

  Joel

--
  Joel Hestness
  PhD Candidate, Computer Architecture

  Dept. of Computer Science, University of Wisconsin - Madison

Konstantinos Koukos

Mar 12, 2015, 6:28:36 PM
to gem5-g...@googlegroups.com, koukos.ko...@gmail.com
Hello Joel,

Could you provide a few more details on where to look?
Of course VI should work, but I am implementing my own protocol with different memory consistency constraints, so I would like to check the implementation details. Which files should I look at?

Thanks a lot,
Konstantinos.

Joel Hestness

Mar 12, 2015, 9:31:05 PM
to Konstantinos Koukos, gem5-gpu developers
Hey Konstantinos,
  Sure, but I'll begin by noting that the implementation of memory synchronization instructions must necessarily touch the full memory instruction path, so there's a lot of detail that is best gleaned from the code.

  Bar instructions execute implicit membars, and specifically, bar.sync implicitly performs the membar.cta fence as specified in the PTX spec. The warp instruction barrier/blocking part of the bar instruction is handled completely within GPGPU-Sim by the m_barriers variable of warp_inst_t, and is enforced in the file gpgpu-sim/gpgpu-sim/shader.cc. This only enforces thread issue progress guarantees, but not the memory fence itself, which is a memory ordering guarantee.

  So, you're interested in the consistency/coherence implications. The path of the membar instructions is as follows: Warp instructions with the op designation MEMORY_BARRIER_OP (or BARRIER_OP for bar.sync) are issued to the memory instruction dispatch register m_mem_out in scheduler_unit::cycle() (gpgpu-sim/gpgpu-sim/shader.cc). This is the same path followed by load, store, and atomic instructions to get over to the gem5-gpu side of the code, specifically, by the ldst_unit passing the instruction to CudaCore::executeMemOp.

  The CudaCore interprets the instruction's configuration and passes a memory request across each of the lsqPorts (lanes) to the ShaderLSQ (ShaderLSQ::recvTimingReq()). In this exchange, the fence memory request is only passed to the ShaderLSQ for the shader lanes that are active in calling the membar instruction.

  The ShaderLSQ maintains ordered queues of memory requests for each warp executing on the core, and enforces ordering around the fences. Before more detail, it is important to note that, like many real cache hierarchies, Ruby does not enforce memory access ordering from the LSQ, so the LSQ must make sure that the consistency model is enforced appropriately. To get this with current gem5-gpu coherence protocols, the LSQ must know the appropriate scope ("level" in the PTX spec) of the requested fence, which can be .cta, .gl, or .sys, and the LSQ must enforce the ordering of memory accesses sent to the Ruby cache hierarchy. For the existing protocols in gem5-gpu, the LSQ just assumes this means waiting for all pending memory accesses to be completed by Ruby, clearing the fence, and then continuing to issue memory accesses from warp instructions queued after the fence.
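
  As a purely illustrative toy model of that ordering discipline (not the actual ShaderLSQ code; all names here are made up), consider a per-warp FIFO in which a fence at the head blocks issue until every previously issued access has been acknowledged:

#include <deque>

struct MemOp { bool is_fence; /* plus address, size, etc. */ };

struct WarpQueue {
    std::deque<MemOp> pending;   // memory ops in program order
    int outstanding = 0;         // accesses issued to the caches, not yet acked

    // Called once per cycle; returns true if an op was issued or retired.
    bool tryIssue() {
        if (pending.empty()) return false;
        if (pending.front().is_fence) {
            if (outstanding > 0) return false;  // fence blocks until drained
            pending.pop_front();                // fence clears; issue resumes
            return true;
        }
        ++outstanding;                          // hand the access to the caches
        pending.pop_front();
        return true;
    }

    void recvResponse() { --outstanding; }      // ack from the cache hierarchy
};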

  More notes on consistency: VI_hammer is the only coherence protocol publicly available that allows stale data in the GPU L1s. However, VI_hammer invalidates any cache line in the GPU L1 that is being written to, and writes the data back to the GPU L2, which is always coherent with the rest of the cache hierarchy besides GPU L1s. So, like other coherence protocols available with gem5-gpu, the ShaderLSQ can be assured that a memory fence of any scope can be cleared after it has received responses/acks for all prior store memory accesses sent to Ruby.

  However, a coherence protocol that does not guarantee that stores are pushed at least to the scope of the memory fence may need to do more to ensure fences are handled correctly. For example, if you allow stale GPU L1 data to be updated without writing through to a globally-visible GPU L2 for coherence, a membar.gl fence (all prior stores are globally visible) would require that the stale+dirty data in the GPU L1 be flushed back to the GPU L2 before the fence can be cleared. You may need to do some tweaking on the ShaderLSQ to do these checks and send appropriate messages to the Ruby cache hierarchy if you need functionality like this.
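
  In terms of the toy model above, a scope-aware clearing condition might look like the following (again hypothetical, just to make the requirement explicit):

enum FenceScope { SCOPE_CTA, SCOPE_GL, SCOPE_SYS };

// A .gl/.sys fence over a write-back GPU L1 cannot clear while dirty
// lines remain that are not yet visible at the globally coherent level.
bool fenceMayClear(FenceScope scope, int outstandingAccesses, int dirtyL1Lines)
{
    if (outstandingAccesses > 0)
        return false;                    // drain all prior accesses first
    if (scope != SCOPE_CTA && dirtyL1Lines > 0)
        return false;                    // flush dirty L1 data to the L2 first
    return true;
}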

  Hope this helps,
  Joel


Konstantinos Koukos

Mar 13, 2015, 6:01:36 AM
to gem5-g...@googlegroups.com, koukos.ko...@gmail.com
Hello Joel,

Thanks a lot for the very detailed explanation of the flow.
I think I have an idea of where I need to tweak the simulator.

Best Regards,
Konstantinos.