queries on memory models in gem5-gpu sim


Jithu Joseph

Apr 5, 2014, 1:28:04 AM
to gem5-g...@googlegroups.com, Jintack Lim, Jubi Taneja, Muktesh Khole
As part of our class project, we are looking into new memory models for GPGPUs. We have started running some Rodinia benchmarks on the gem5-gpu simulator. Since we are beginners here, we have a few questions:

  • When we run this command (as given on the webpage): build/VI_hammer/gem5.opt ../gem5-gpu/configs/se_fusion.py -c ../benchmarks/rodinia/backprop/gem5_fusion_backprop -o "16", which memory model are we using? I assume that in this case the CPU and GPU are using the same DRAM.
  • Within the benchmark we have cudaMemcpy calls. I assume that here the CPU and GPU, though they use the same DRAM, have separate dedicated areas, and that cudaMemcpy moves data from the CPU's area to the GPU's and vice versa (when using the above command). Is this assumption correct? Are these intra-DRAM copies done using DMA?
  • How do we simulate a traditional GPU configuration (which command), in which cudaMemcpy involves a PCIe DMA transfer?
  • In addition to the DMA transfer, I assume we would also need to configure the dedicated GPU memory as GDDR5 to accurately simulate the traditional scenario. How do we do that?
  • I see that the benchmark suite includes a no-copy version of the benchmarks. Is this conceptually equivalent to the pinned host memory model (in which the GPU ignores its dedicated memory and accesses data directly from DRAM)?
  • Where do we get the timing details to compare the two models? (We couldn't see timing details in the console output.)
Thanks
Jithu

Joel Hestness

Apr 6, 2014, 1:52:54 PM
to Jithu Joseph, gem5-gpu developers, Jintack Lim, Jubi Taneja, Muktesh Khole
Hi Jithu,
  • When we run this command (as given on the webpage): build/VI_hammer/gem5.opt ../gem5-gpu/configs/se_fusion.py -c ../benchmarks/rodinia/backprop/gem5_fusion_backprop -o "16", which memory model are we using? I assume that in this case the CPU and GPU are using the same DRAM.

That's correct, but there's a bit more to this. The *_fusion.py config files, by default, model a unified virtual address space shared between the CPU and GPU. This means that they not only share a DRAM, but can also use the same virtual-to-physical address translations to access common physical memory in that DRAM. The Rodinia benchmarks use memory copies from CPU to GPU pointers, but with the default *_fusion.py memory config these copies are superfluous, since the cores could just access each other's memory. To split the physical address space among the cores and force the CPU and GPU to use their respective portions, pass '--split' to the config file. Note that with this parameter, the simulator models a split physical address space, but the cores still share a common DRAM. This is the common case for existing heterogeneous processors.

  • Within the benchmark we have cudaMemcpy calls. I assume that here the CPU and GPU, though they use the same DRAM, have separate dedicated areas, and that cudaMemcpy moves data from the CPU's area to the GPU's and vice versa (when using the above command). Is this assumption correct? Are these intra-DRAM copies done using DMA?
The cudaMalloc and cudaMemcpy calls are required if you use the split memory hierarchy ('--split'). They are not necessarily required in the unified memory space, which is why we provide the Rodinia no-copy benchmarks (benchmarks/rodinia-nocopy/), which eliminate these calls. Note that to use the no-copy benchmarks in a unified address space setting, you will need to pass --access-host-pagetable to your simulations to ensure that the GPU uses the correct address translations set up by the CPU side.
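
To make the three configurations discussed so far concrete, here is a minimal Python sketch that only assembles the command lines (it does not run the simulator). The gem5 binary, config script, backprop path, and the '--split' and --access-host-pagetable parameters are taken from this thread. The helper name fusion_command is made up for illustration, and the no-copy binary path is an assumption: only the benchmarks/rodinia-nocopy/ directory is mentioned above, so check the actual binary name in your checkout.

```python
import shlex

# Paths taken from the command quoted in this thread.
GEM5 = "build/VI_hammer/gem5.opt"
CONFIG = "../gem5-gpu/configs/se_fusion.py"
BACKPROP = "../benchmarks/rodinia/backprop/gem5_fusion_backprop"

# ASSUMPTION: hypothetical binary path inside benchmarks/rodinia-nocopy/.
BACKPROP_NOCOPY = "../benchmarks/rodinia-nocopy/backprop/gem5_fusion_backprop"


def fusion_command(benchmark, options, split=False, access_host_pagetable=False):
    """Build a gem5-gpu command line as a list of arguments (hypothetical helper)."""
    cmd = [GEM5, CONFIG]
    if split:
        # Split physical address space; CPU and GPU still share one DRAM.
        cmd.append("--split")
    if access_host_pagetable:
        # Needed for the no-copy benchmarks in the unified address space.
        cmd.append("--access-host-pagetable")
    cmd += ["-c", benchmark, "-o", options]
    return cmd


# Default: unified address space, cudaMemcpy calls are superfluous.
unified = fusion_command(BACKPROP, "16")
# Split address space: cudaMalloc/cudaMemcpy calls are required.
split = fusion_command(BACKPROP, "16", split=True)
# No-copy benchmark in the unified address space.
nocopy = fusion_command(BACKPROP_NOCOPY, "16", access_host_pagetable=True)

for name, cmd in [("unified", unified), ("split", split), ("nocopy", nocopy)]:
    print(f"{name}: {shlex.join(cmd)}")
```

Each list can be passed to subprocess.run from the gem5 directory if you want to launch the runs from a script.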

  • How do we simulate a traditional GPU configuration (which command), in which cudaMemcpy involves a PCIe DMA transfer?
Use the '--split' command line parameter to *_fusion.py to get a common heterogeneous processor. You'll need to tune the parameters of the copy engine to get copies comparable to those of a PCIe transfer to/from something like a discrete GPU. Note that this is somewhat hacky, because in this case the CPU and GPU still share a common DRAM (though you can modify their memory access characteristics separately; see below).
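
When tuning the copy engine, it can help to have a target copy time in mind. The sketch below is a back-of-the-envelope estimate only, using a simple latency-plus-bandwidth model. Everything in it is an assumption for illustration: the function name, the 8 GB/s figure (the nominal per-direction bandwidth of a PCIe 2.0 x16 link, which real transfers do not fully reach), and the 10 microsecond setup latency. It does not touch any gem5-gpu parameter.

```python
def transfer_time_s(num_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Estimate copy time as fixed latency plus bytes divided by bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


# ASSUMPTION: nominal PCIe 2.0 x16 bandwidth per direction (5 GT/s x 16 lanes, 8b/10b).
PCIE2_X16_BYTES_PER_S = 8e9
# ASSUMPTION: illustrative fixed setup latency, not a measured value.
SETUP_LATENCY_S = 10e-6

for size in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    t = transfer_time_s(size, PCIE2_X16_BYTES_PER_S, SETUP_LATENCY_S)
    print(f"{size:>10d} bytes -> {t * 1e6:.1f} us")
```

You can then compare these targets against the copy times reported by a '--split' simulation and adjust the copy engine until they are in the same range.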

  • In addition to the DMA transfer, I assume we would also need to configure the dedicated GPU memory as GDDR5 to accurately simulate the traditional scenario. How do we do that?
First, this only makes sense with the split memory hierarchy, since otherwise, the cache hierarchy wouldn't know which memory accesses to send to which DRAMs, so you'll need to pass '--split'. You'll need to configure the separate DRAM controllers with appropriate timing/frequency parameters for each memory. To configure the CPU-side directories/memory controllers, check out the CPU memory parameters ('--mem*') in gem5-gpu/configs/GPUMemConfig.py, and to configure the device-side directories/memory controller, check out the GPU parameters ('--gpu_mem*') in gem5-gpu/configs/GPUConfig.py.

  • I see that the benchmark suite includes a no-copy version of the benchmarks. Is this conceptually equivalent to the pinned host memory model (in which the GPU ignores its dedicated memory and accesses data directly from DRAM)?
It depends on the underlying system that you're modeling. First, note that the rodinia-nocopy benchmarks only work when using a unified memory address space, since they elide the cudaMalloc/cudaMemcpy calls. Pinned memory in existing discrete GPU systems typically requires either the CPU or the GPU to send accesses to pinned memory across the PCIe bus, which we do not model in gem5-gpu. In that sense, you won't be able to model something like a discrete GPU using pinned memory. If instead you are comparing to pinned memory in a heterogeneous processor, then yes, these are largely the same, since the separate cores of the heterogeneous processor share a common DRAM and their memory accesses do not need to cross a PCIe bus.
 
  • Where do we get the timing details to compare the two models? (We couldn't see timing details in the console output.)
The stats.txt output files contain all of the stats collected by the simulator, including many for timing.
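
As a rough sketch of how two runs could be compared, the Python below parses the "name value # description" lines of a stats.txt dump and computes the ratio of one stat between two runs. The helper names are made up for illustration, the sample dumps contain made-up values (not measured results), and it assumes your build reports the sim_seconds and sim_ticks stats.

```python
def parse_stats(text):
    """Parse the first statistics dump of a stats.txt into {name: float}."""
    stats = {}
    for line in text.splitlines():
        if line.startswith("---------- End"):
            break  # only read the first dump
        fields = line.split("#", 1)[0].split()  # drop the trailing description
        if len(fields) < 2:
            continue
        try:
            stats[fields[0]] = float(fields[1])
        except ValueError:
            continue  # header lines and non-numeric entries
    return stats


def stat_ratio(name, stats_a, stats_b):
    """Return stats_b[name] / stats_a[name]."""
    return stats_b[name] / stats_a[name]


# Made-up sample dumps for illustration only, not measured results.
UNIFIED_SAMPLE = """
---------- Begin Simulation Statistics ----------
sim_seconds                                  0.002000                       # Number of seconds simulated
sim_ticks                                  2000000000                       # Number of ticks simulated
---------- End Simulation Statistics   ----------
"""
SPLIT_SAMPLE = """
---------- Begin Simulation Statistics ----------
sim_seconds                                  0.003000                       # Number of seconds simulated
sim_ticks                                  3000000000                       # Number of ticks simulated
---------- End Simulation Statistics   ----------
"""

unified_stats = parse_stats(UNIFIED_SAMPLE)
split_stats = parse_stats(SPLIT_SAMPLE)
ratio = stat_ratio("sim_ticks", unified_stats, split_stats)
print(f"split / unified sim_ticks: {ratio:.2f}")
```

For real runs, read each file with open(path).read() instead of using the sample strings. By default gem5 writes stats.txt into the m5out directory, so give each run its own output directory (for example with gem5's --outdir option) to avoid overwriting the previous run's stats.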

  Hope this helps,
  Joel


--
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/