Understanding the Gem5-GPU results

Mateus Grellert

Jun 14, 2014, 4:49:59 PM

to gem5-g...@googlegroups.com

I know this information is probably documented somewhere else, but please understand that I'm asking here because I need to write a report, due on Monday, on a project that relies on gem5-gpu.

I need to extract the cycle count a particular kernel takes to run. I'm using two approaches:

(1) Run the code with m5_begin/end right before and after the kernel call (this separates the results for each checkpoint in the stats.txt file)

(2) Run the code with a loop that calls the kernel 10k times

In either case, I'd like to be sure which value in stats.txt contains the cycle count for the kernel.

In traditional gem5 mode I use cpu.numCycles, but with GPUs I don't think that is the right stat, so could anyone enlighten me?

Jason Power

Jun 14, 2014, 8:45:08 PM

to Mateus Grellert, gem5-g...@googlegroups.com

The file gpu_stats.txt likely has the information you need. Also, you can read through all of the stats.txt file and find a lot more interesting statistics.

Jason

Christian Pinto

Aug 7, 2014, 11:14:14 AM

to gem5-g...@googlegroups.com

Hello everybody,

I am profiling some applications using gem5-gpu and comparing the results with what I get from execution on a Tesla C2070 (Fermi).

I have configured the system to have 14 SMs, matching what is available on the Tesla card, with the same warp size, the same L2 cache size, and so on.

Even so, the numbers in stats.txt are quite strange:

- The number of instructions executed on the simulator is much higher than what I measure with nvprof on the GPU. In particular, I get 8960 instructions executed on the GPU (measured with nvprof by reading the instr_executed metric), while I get 540928 instructions executed on the simulator (reading instInstances in stats.txt). To isolate the problem I am running a reduced version of my kernel that spawns only one thread block. Since the warp size matches that of the real GPU, I expect to see the same number of instructions executed.

- The number of memory accesses behaves the same way: the profiler reports 14336 shared load transactions and 896 shared store transactions, but on the simulator I get 65536 shared loads and 4096 shared stores.

Of course the very same code is running on the two environments.

Can anyone help me understand the results coming out of the simulator?

Thanks,

Christian

Joel Hestness

Aug 7, 2014, 11:11:52 PM

to Christian Pinto, gem5-gpu developers

Hi Christian,

I've validated many of gem5-gpu's performance results against both a Tesla C2050 and a GTX580, both of which are the Fermi architecture, so it's likely that the questions you're running into are related to interpreting the stats rather than their correctness. It's hard to say exactly what might be going on, but I can recommend a few things:

1) I would recommend running gem5-gpu using the --kernel_stats parameter, which will dump and reset statistics at the beginning and end of GPU kernels. This should allow you to isolate the stats from different segments of the run time

2) It is possible (probable?) that the stats you are collecting from the hardware profile are actually sampled numbers rather than exact numbers. I'd recommend re-running the profiler tests a few times to see if you get the same results and note that collected values may be inexact. If the performance counters are using sampling, you'll need to run fairly long tests to ensure you get statistically significant results

3) I suspect that the hardware instruction count stat is easily misinterpreted. Specifically, there are 14k shared loads, but surely those come from more than 9k instructions. It is more likely that the instruction count numbers are either invalid (e.g. the performance counter doesn't work), or they are a count of warp or thread block instructions executed rather than thread instructions. Note that there are about 60x more instructions in the sim - perhaps the hardware stat is thread block instructions, and your thread blocks have 64 threads? I'd recommend digging through some specs to see if you can find the appropriate interpretation of the hardware counters. In the gem5-gpu stats.txt output, the inst_counts stats count thread instructions (i.e. total instructions across all threads that executed).

4) Regarding memory accesses, you'll need to establish whether you're counting memory instructions or memory accesses. GPU load-store queues/caches do access coalescing, which groups many memory instruction requests to be sent as a single memory access. In addition to the complexity of counting these, hardware may also count separate port accesses as separate accesses, even if they came from a single coalesced access to a cache line. This is possible for both shared and global L1 cache accesses, which must schedule the use of the ports. Given that you're seeing larger shared memory access counts from the simulator, I suspect that hardware may be counting accesses instead of instructions. In the gem5-gpu stats.txt file, the stats 'shader_cores?.*_loads' and 'shader_cores?.*_stores' count the number of load and store instructions (i.e. across all threads), respectively, to the different memory spaces. These counts are prior to coalescing. The number of coalesced cache accesses is given by the 'l1_cntrl_sp??.demand_accesses' stat. When an application has good coalescing capability, the number of memory instructions is 8-32x greater than the number of actual accesses.
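
As a quick sanity check on the counts quoted earlier in this thread, the simulator-to-profiler ratios can be computed directly. This is only a sketch in Python; the dictionary keys and the helper name `sim_to_hw_ratios` are labels chosen for illustration, not stat names from gem5-gpu or nvprof.

```python
# Counts quoted earlier in this thread: (simulator, hardware profiler).
counts = {
    "instructions": (540928, 8960),
    "shared_loads": (65536, 14336),
    "shared_stores": (4096, 896),
}

def sim_to_hw_ratios(pairs):
    """Return the simulator/hardware ratio for each labelled pair of counts."""
    return {name: sim / hw for name, (sim, hw) in pairs.items()}

ratios = sim_to_hw_ratios(counts)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The two shared-memory ratios come out identical (about 4.57x) and the instruction ratio is about 60x, which is at least consistent with the two tools counting at different granularities rather than disagreeing at random.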

Hope this helps,

Joel

--
Joel Hestness
PhD Student, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/

Christian Pinto

Aug 8, 2014, 10:26:14 AM

to Joel Hestness, gem5-gpu developers

Hello Joel, thanks for the reply.

I have managed to match the number of memory accesses. With nvprof, the number of shared load/store instructions is counted per warp. Thus, if I divide the number of memory accesses obtained from gem5-gpu by the warp size, I obtain exactly the same number given by the profiler.

However, I have not yet found a way to match the number of executed instructions. The value read from the profiler is the number of warp instructions executed on the GPU, which is calculated as the total number of instructions executed on each of the two pipelines divided by the warp size. I would expect that dividing the number of instructions simulated by GPGPU-Sim by the warp size gives the same number of instructions as the profiler, but unfortunately that is not what happens.
What I have noticed is that the ratio between the number of simulated instructions and the real number of instructions is always ~64.

Christian

Christian Pinto

Aug 8, 2014, 10:37:48 AM

to Joel Hestness, gem5-gpu developers

Another question came to my mind...

From the online documentation it seems that in the fusion architecture the CPU and GPU have separate L2 caches, but in many discussions I have read that they share the L2 cache. In fact, when looking at results from a simulation of the fusion model, I see that there are no requests to the shared L2 cache. Can you elaborate a bit on that? In my experiments I am interested in studying performance variations when changing the memory subsystem configuration (e.g. bandwidth of the interconnect, shared/private L2 cache configuration and so on). Which part of the simulator handles the memory system, so that I can modify its configuration?

Christian

Jason Power

Aug 8, 2014, 10:55:22 AM

to Christian Pinto, Joel Hestness, gem5-gpu developers

Hi Christian,

The instruction count difference is probably due to the way gem5-gpu and nvprof count instructions. I can't comment on exactly what nvprof counts as an instruction, but in gem5-gpu, we are counting "instruction instances." An instruction instance is a single instance of an instruction executed on a single lane (PE, CUDA core, etc). Thus, one static instruction could be counted as up to 32 dynamic instruction instances in a cycle. It may be less than 32 due to branch divergence, however.
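
To make the distinction concrete, here is a small Python sketch that counts both warp-level instructions and per-lane instruction instances from a list of active-lane masks. The trace format and the helper name `count_instructions` are made up for illustration; they are not gem5-gpu or nvprof output.

```python
WARP_SIZE = 32  # lanes per warp, as discussed above

def count_instructions(active_masks):
    """Count warp instructions and per-lane instruction instances.

    Each entry in active_masks is an integer bitmask with one bit set per
    lane that executed the instruction (hypothetical trace format).
    """
    lane_bits = (1 << WARP_SIZE) - 1
    warp_insts = len(active_masks)
    instances = sum(bin(mask & lane_bits).count("1") for mask in active_masks)
    return warp_insts, instances

full = (1 << WARP_SIZE) - 1         # all 32 lanes active
half = (1 << (WARP_SIZE // 2)) - 1  # 16 lanes active on a diverged path

# Three instructions with all lanes active, then two on a diverged path.
trace = [full, full, full, half, half]
warp_insts, instances = count_instructions(trace)
print(warp_insts, instances)  # 5 warp instructions, 128 instruction instances
```

With no divergence the two counts differ by exactly the warp size; divergence makes the instance count smaller than 32 times the warp-instruction count.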

For the caches, by default gem5-gpu simulates an APU-like system. Within the GPU there is an L2 cache that is shared by all of the compute units (SMs). For the CPU, each core has a private L1 and L2 cache. I believe I have heard of some integrated GPUs sharing an L3 (LLC) cache with the CPU (like Intel parts), but I am not familiar with any systems in which the L2 is shared between the CPU and GPU.

I'm not sure why you are seeing no requests to the shared L2 (which cache are you referring to?). To make minor modifications to the cache configuration you should be able to change the configuration files in gem5-gpu/config. To change the caches significantly (e.g. make them private instead of shared) you will either need to use the gem5 classic memory system or modify the Ruby coherence protocol. There is information related to these on the gem5 website.

Jason

Christian Pinto

Aug 11, 2014, 12:24:58 PM

to Jason Power, Joel Hestness, gem5-gpu developers

Hello Jason,

The L2 stat I was looking at is system.gpu.shader_mmu.l2hits, so I was certainly looking in the wrong place. Anyway, I still have trouble understanding how to validate the results from the simulator against those coming from the profiler.

I ran the Rodinia backprop benchmark with an input of 1024, and I also executed the same benchmark on my Tesla C2070. What I want to validate is:

- number of memory accesses
    In backprop there is apparently no relation between the number of shared memory accesses reported by the simulator and the number reported by the profiler. The profiler counts a higher number of memory accesses.

- number of instructions executed
    Same situation as for the number of memory accesses.

- bandwidth towards the GPU global memory
    I read in the paper "gem5-gpu: A Heterogeneous CPU-GPU Simulator" that the values of off-chip memory latency and bandwidth closely match those of the real hardware, a GTX 580 in that specific case.

I have set the GPU clock to 575MHz (--gpu-core-clock=575MHz); the clock in Tesla cards is 1150MHz for the cores. I have set the memory clock (--gpu_mem_freq=1.5GHz) as in a Tesla card, and the number of SMs is 14, also as in the Tesla card. The sizes of the L1 and L2 caches are configured correctly. Still, the numbers that I read are not even close to the real ones.

For instance, the write memory bandwidth is 4 GB/sec (read from system.gpu_physmem.bw_write::total; is that the correct entry?), while the profiler says ~10 GB/sec (dram_write_throughput).

So my question is: do you think I am doing something wrong? How did you configure the simulator to obtain results that were comparable with the GTX580?

Christian

ichenh...@gmail.com

Jul 22, 2017, 7:39:45 AM

to gem5-gpu Developers List, jthes...@gmail.com

Hi, Jason

There are five "resetstats / dumpstats" calls in my script (full system mode), but six "Begin / End Simulation Statistics" blocks in the stats file. Why?

Jason Lowe-Power

Jul 24, 2017, 8:24:42 AM

to ichenh...@gmail.com, gem5-gpu Developers List

The first statistics chunk is for the period from the beginning of simulation to the first dump stats. The last stats chunk is dumped at the end of simulation and is for the period from the last dump stats to the end.
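
One way to inspect the chunks is to split stats.txt on the "Begin / End Simulation Statistics" markers and look at one stat per chunk. The following is a minimal Python sketch, assuming the usual `name  value  # description` line layout; the helper name `split_stat_dumps` and the sample values are made up for illustration.

```python
import re

BEGIN = "Begin Simulation Statistics"
END = "End Simulation Statistics"

def split_stat_dumps(text):
    """Split stats.txt content into one {name: value-string} dict per dump."""
    dumps, current = [], None
    for line in text.splitlines():
        if BEGIN in line:
            current = {}
        elif END in line:
            if current is not None:
                dumps.append(current)
            current = None
        elif current is not None:
            # Stat lines look like: "<name>  <value>  # description"
            match = re.match(r"(\S+)\s+(\S+)", line)
            if match:
                current[match.group(1)] = match.group(2)
    return dumps

# Made-up sample with two dumps; real files are much longer.
sample = """
---------- Begin Simulation Statistics ----------
sim_ticks      1000   # Number of ticks simulated
---------- End Simulation Statistics   ----------

---------- Begin Simulation Statistics ----------
sim_ticks      2500   # Number of ticks simulated
---------- End Simulation Statistics   ----------
"""

dumps = split_stat_dumps(sample)
print(len(dumps), [d["sim_ticks"] for d in dumps])
```

Applied to the run described above, `len(dumps)` would be expected to be six: one chunk ending at each of the five dumps, plus the final chunk written at the end of simulation.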

Jason
