Re: Question Regarding cudaMemcpy and cache coherence in gem5-gpu

84 views
Skip to first unread message

Joel Hestness

unread,
Jul 9, 2013, 5:16:26 PM7/9/13
to jy...@cs.umn.edu, gem5-gpu developers
Hi Jieming,

1. (Correct me if I were wrong) If the CPU initiates an data array, and copy the array to GPU with cudaMemcpy(src, dst, size, cudaMemcpyHostToDevice), the following actions will happen in sequence: 
First, CE captures the cuda call and generates read requests reading data from the physical addresses of "src"; 
Second, these requests are sent to the directory, and the directory broadcasts the requests to all components; 
Third, both the CPU (which has the updated version of the data) and the directory forward the data to the CE; 
Fourth, after getting the latest data, the CE writes it to the physical address of "dst" into the dir.

This is mostly a correct interpretation.  Note that the directory doesn't cache any data either, so when you say that the CE writes the data to the physical address, the data will be pushed out to the memory, and the directory will only appropriately update the state associated with the cache line.

As I get into more detail below, you may want to snoop through the L1CacheCE controller SLICC file for details: gem5-gpu/src/mem/protocol/VI-ce.sm


2. Since the CE also has caches, does it mean that the copied data are first written into the cache, then flush to the dir when replacement happens?

Actually, these "caches" are an artifact of the design of Ruby's sequencers, and these CE caches don't ever hold data during the simulation: In order for us to control the way that the copy engine moves data, we had to write a gem5 component that had a sequencer connected to a Ruby controller.  Sequencers require a connected cache for holding data (the Ruby controller holds the state of the data), since most Ruby controllers are for traditional caches.  This is why we instantiate a cache object in the *I_hammer_fusion.py and *I_hammer_split.py configuration files, but these caches are never actually used.


3. When CPU forwards the updated data, do the coherence states of the cache lines change? It seems the CE cache is a potential sharer.

First, since the copy engine doesn't have a "cache" per se, it is NOT possible that the CE will be caching data in a cache-like data array.

That said, the copy engine IS a potential sharer of the data: If the CE has previously read a piece of data that it will be writing out to another location during the copy, the data is only "live" while the TBE (MSHR) holding the request is live.  TBEs are only "live" for a cache line while a read or write request is pending to the directory/memory.  So, what does this mean about potential race conditions on a cache line?  If the CE tries to write data to the directory, but another controller has already requested exclusive access to the line, the directory may stall the CE write request, and when the line is eventually freed for the CE update, the line will be written (check out DMA_WRITE transitions in gem5-gpu/src/mem/protocol/VI_hammer-dir.sm).

You might consider the situation above to be "incorrect" in terms of coherence, because data that was written by a sharer might be overwritten by the CE's write request.  However, the CUDA runtime assumes that the programmer writing CUDA code will NOT overwrite regions of memory that are the source or destination of pending memory copies.  As such, we assume that no other cache controller will end up in a race to modify data pointed to by the copy engine.

In the case that the CE tries to read data that is held in a CPU cache, the directory forwards the read request to the CPU sharers and the line is either downgraded to shared (S state) or invalidated (I state).  Check out the DMA_READ transitions in gem5-gpu/src/mem/protocol/VI_hammer-dir.sm, and the Other_GETS (get shared privileges) transitions in gem5-gpu/src/mem/protocol/VI_hammer-CPUCache.sm.


4. When copying data from Device to Host, the directory will invalidate the CPU cache lines, am I correct?

If necessary, yes: The important thing to note is that this is ONLY the case if a CPU cache is holding the data that the CE is trying to write to.  To see how this works, check out gem5-gpu/src/mem/protocol/VI_hammer-dir.sm around line 1456:

  transition({O, S, NO, NX}, DMA_WRITE, NO_DW_B_W) {PfRead} {
    vd_allocateDmaRequestInTBE;
    f_forwardWriteFromDma;
    p_popDmaRequestQueue;
  }

This transition indicates that if the data is shared with another cache (S state), the directory will call f_forwardWriteFromDma, the action defined near line 1124 in VI_hammer-dir.sm:

  action(f_forwardWriteFromDma, "fw", desc="Forward requests") {
    assert(is_valid(tbe));
    if (tbe.NumPendingMsgs > 0) {
      peek(dmaRequestQueue_in, DMARequestMsg) {
        enqueue(forwardNetwork_out, RequestMsg, latency=memory_controller_latency) {
          out_msg.Address := address;
          out_msg.Type := CoherenceRequestType:GETX;
          //
          // Send to all L1 caches, since the requestor is the memory controller
          // itself
          //
          out_msg.Requestor := machineID;
          out_msg.Destination.broadcast(MachineType:L1Cache);
          out_msg.Destination.broadcast(MachineType:L2Cache); // Send to all L2 caches
          out_msg.MessageSize := MessageSizeType:Broadcast_Control;
        }
      }
    }
  }

In this action, you'll note that the directory forwards a GETX (get exclusive privileges) to the caches that are potential sharers defined by the out_msg.Desination.broadcast lines.  This will have the effect of invalidating all cache lines on-chip before updating the line in memory (see Other_GETX transitions in VI_hammer-CPUCache.sm).


  Hope this clarifies!  Let me know if you have further questions,
  Joel
 

--
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/
Reply all
Reply to author
Forward
0 new messages