Question about simulation statistics

Kyu Hyun Choi

Mar 26, 2014, 5:05:33 AM

to gem5-g...@googlegroups.com

Hi.

In the stats.txt file, there are results for mshrHitQueued, mshrsFullCycles, mshrsFullCount, and warpInstBufActive.

I don't understand what these actually mean.

As far as I know, MSHR means Miss Status Handling Register, and when a cache miss occurs, it stores the request and forwards it to the next level of the memory hierarchy (L2, L3, main memory).

But I found these variables in the LSQ, which sits between each GPU core and its L1 cache.

So my questions are:

1. Why is the MSHR in the LSQ, and what does it actually do?

2. What is mshrsFullCycles? The description says it is the number of cycles stalled waiting for an MSHR, but that is a little ambiguous.

3. What does warpInstBufActive mean? I don't understand what "active warp inst buffers" means.

Thank you for your cooperation.

Regards.

Kyu Hyun Choi

Mar 26, 2014, 5:09:36 AM

to gem5-g...@googlegroups.com

Oh I forgot a question.

4. How many requests can the shared (CPU and GPU) L2 cache service simultaneously in the MESI_CMP_directory protocol by default? And how can I change it?

I cannot find the number of ports in the parameters.

Jason Power

Mar 26, 2014, 9:53:53 AM

to Kyu Hyun Choi, gem5-g...@googlegroups.com

Hi Kyu,

For 4), by default, the L2 cache services as many requests as it can. Changing this is rather complicated. There are many different knobs, most of which don't really do what you want them to. If you want to use a very simplistic port model for the L2 cache, you can pass the parameter "resourceStalls = True" to the CacheMemory when it is created in the python configuration file (gem5/configs/ruby/MESI_CMP_directory.py and gem5-gpu/configs/gpu_protocol/MESI_CMP_directory_fusion.py). Using this model you can also change the number of ports and port latencies with "data(tag)ArrayBanks" and "data(tag)AccessLatency".

The other option for limiting the number of requests is to change "transitions_per_cycle" on the ruby controller (in the same files). However, this is very imprecise, as often the controller needs to go through multiple transitions to perform what would be a single action in hardware. Said another way, the SLICC transitions are not a good proxy for cache ports.
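As a rough sketch of the parameters named above: the snippet below only builds a plain dictionary of keyword arguments, so it runs without gem5. Reading "data(tag)ArrayBanks" as shorthand for a data-array and a tag-array variant of each parameter is an assumption, the helper `expand_shorthand` is hypothetical, and the numeric values are made up for illustration. Check the names against the CacheMemory definition in your own tree before relying on them.

```python
def expand_shorthand(suffix):
    """Expand the 'data(tag)<suffix>' shorthand into its two assumed names."""
    return ["data" + suffix, "tag" + suffix]


# Hypothetical values, chosen only for illustration.
l2_cache_params = {"resourceStalls": True}
for name in expand_shorthand("ArrayBanks"):
    l2_cache_params[name] = 4      # assumed number of banks (ports)
for name in expand_shorthand("AccessLatency"):
    l2_cache_params[name] = 2      # assumed access latency in cycles

# In the protocol config file, these would be passed as keyword arguments
# where the L2 CacheMemory object is created (not shown: it needs gem5).
for name in sorted(l2_cache_params):
    print(name, "=", l2_cache_params[name])
```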

For your other questions, I can give a high-level answer, but Joel can probably provide more details.

1) The MSHR is used to track the outstanding requests. It's used the same way the cache MSHRs are.

2) This is the number of cycles that there are no MSHRs available, even though we need to put a new request into an MSHR. It is meant to give you an idea of whether or not the number of buffers for outstanding requests at a single compute unit is constraining the system.

3) For each warp instruction (e.g. ld/st for a whole warp), a warp inst buffer is allocated. This buffer is held through the entire time the warp instruction has outstanding memory requests. It is used to track the requests and data until all requests complete since each warp instruction could generate many memory requests (i.e. it has memory divergence). The warp inst buffer collects all of the results, and finally sends them back to the "shader core" after they all complete.
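To make answers 2) and 3) concrete, here is a toy model, not gem5-gpu's actual code. It assumes one request is issued per cycle, a fixed miss latency, and distinct cache lines per warp instruction. The names `WarpInstBuffer` and `simulate` are hypothetical, and `mshrs_full_cycles` only echoes the stat name.

```python
from dataclasses import dataclass, field


@dataclass
class WarpInstBuffer:
    """Toy buffer held while one warp instruction has outstanding requests."""
    pending: set = field(default_factory=set)    # lines still in flight
    results: dict = field(default_factory=dict)  # line -> returned data

    def complete(self, line, data):
        self.pending.discard(line)
        self.results[line] = data

    @property
    def done(self):
        return not self.pending


def simulate(lines, num_mshrs, miss_latency):
    """Issue at most one request per cycle; count cycles with no free MSHR.

    Assumes every entry in `lines` is a distinct cache line.
    """
    buf = WarpInstBuffer(pending=set(lines))
    to_issue = list(lines)
    in_flight = {}            # line -> cycle at which its data returns
    mshrs_full_cycles = 0
    cycle = 0
    while not buf.done:
        # Retire misses whose data has returned, freeing their MSHRs.
        for line in [l for l, t in in_flight.items() if t <= cycle]:
            del in_flight[line]
            buf.complete(line, f"data@{line:#x}")
        if to_issue:
            if len(in_flight) < num_mshrs:
                in_flight[to_issue.pop(0)] = cycle + miss_latency
            else:
                # A request is waiting, but every MSHR is occupied.
                mshrs_full_cycles += 1
        cycle += 1
    return cycle, mshrs_full_cycles, buf.results


divergent_lines = [0x0, 0x80, 0x100, 0x180]   # four lines touched by one warp
few = simulate(divergent_lines, num_mshrs=2, miss_latency=4)
many = simulate(divergent_lines, num_mshrs=4, miss_latency=4)
print("2 MSHRs: cycles =", few[0], "full cycles =", few[1])
print("4 MSHRs: cycles =", many[0], "full cycles =", many[1])
```

In this toy, the buffer only reports done once every line has returned, and the full-cycle count drops to zero once there are as many MSHRs as divergent lines.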

Jason

Joel Hestness

Mar 26, 2014, 12:09:11 PM

to Jason Power, Kyu Hyun Choi, gem5-g...@googlegroups.com

Hi guys,

Looks like Jason covered most of your questions, but here is a bit more info that may also be useful:

RE: MSHRs: Perhaps it may clear up some confusion to note that in contrast to CPU architectures, GPU LSQs and L1 caches are extremely tightly coupled. Specifically, our tests on NVIDIA hardware indicate that the L1 tag array is serially accessed before the data array, and the LSQ is notified whether there is an L1 miss - whether the access must occupy an MSHR - before any data array access. This allows the GPU core that notes the miss to act on it, for example, by descheduling the warp waiting on the miss. In order to model this tight coupling accurately while constrained by the way the Ruby cache hierarchies work, we take liberties to track MSHR statuses in the LSQ. GPGPU-Sim models an analogous architecture (e.g. check out the use of m_L1D in gpgpu-sim/gpgpu-sim/shader.cc).

To Kyu's first question, specifically, gem5-gpu models the MSHRs as physically located with the GPU L1 caches, but the LSQ also contains deep ordered buffering of accesses that will hit in lines with an outstanding access to the caches (i.e. queues for lines in MSHRs). This is one of many potential design points (a fairly aggressive one), and you can check out more options in the recent paper by Wenhao Jia et al. (http://www.princeton.edu/~wjia/mrpb.pdf).

RE: L2 porting: I'd strongly recommend being careful with how you modify the default settings in gem5-gpu. Real GPU L2 caches are architected in a highly parallel way (e.g. numerous banks) in order to preclude much per-port contention. Between this and the extremely aggressive interconnect buffering in most hardware, it is unlikely that per-bank L2 porting in real hardware will have much effect on cache performance. Instead, aggregate L2 bandwidth is the primary performance factor. We model bandwidth limitations between the caches and the memory controllers using the gem5 Ruby interconnect (more below).

A while ago, I tried testing the resourceStalls setting as a way to limit cache bandwidth, but the results didn't make sense, and I'm still not clear why. Since then, we have updated the version of gem5 that we're using, and I've fixed bugs in the Ruby interconnect, so something may have changed, but we haven't validated the resourceStalls functionality. Even if resourceStalls are now working as intended, they'd probably only be useful for more precise per-bank contention rather than the bigger performance factor (bandwidth).

All that said, we have heavily validated that the cache bandwidth results make sense. The interconnect width is the primary constraining factor on L2 bandwidth and is defined in the VI_hammer config files. For example, in gem5-gpu/configs/gpu_protocol/VI_hammer_fusion.py the line "l2_cluster = Cluster(intBW = 32, extBW = 32)" specifies a 32B-wide interconnect, and if Ruby is clocked at 2GHz (default), the cluster will include 32B * 2GHz = ~60GB/s links, and the L2 will not be able to serve more bandwidth than this. You can scale out the aggregate effective GPU L2 bandwidth by using multiple L2 caches by passing the command line parameter --num-l2caches=#. If you want to model something like a GTX580 (~276GB/s L2), using 4 gem5-gpu GPU L2s will get you pretty close (~240GB/s).
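The arithmetic above can be checked with a short sketch. Reading "GB" as 2^30 bytes is an assumption made here because it reproduces the ~60 and ~240 figures. In decimal units the same link is 64 GB/s, or 256 GB/s across four L2s. The function name is hypothetical.

```python
GIB = 2 ** 30  # assumption: "GB" in the figures above means 2^30 bytes


def link_bandwidth(width_bytes, clock_hz, unit=GIB):
    """Peak bandwidth of a link moving `width_bytes` every clock cycle."""
    return width_bytes * clock_hz / unit


per_l2 = link_bandwidth(32, 2e9)   # 32B-wide link at 2GHz
aggregate = 4 * per_l2             # four L2 caches (--num-l2caches=4)

print(f"per L2:    {per_l2:.1f} GiB/s")
print(f"aggregate: {aggregate:.1f} GiB/s")
print(f"decimal:   {link_bandwidth(32, 2e9, unit=1e9):.0f} GB/s per L2")
```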

Hope this helps! Let me know if you have further questions,

Joel

--
Joel Hestness
PhD Student, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/

Xiaoyu Zheng

Apr 13, 2014, 2:09:10 PM

to gem5-g...@googlegroups.com, Jason Power

Hello Joel,

As you mentioned, if I want to model something like a GTX580, I should use 4 GPU L2 caches. Should I also increase the number of directories? When I used 4 directories, a deadlock occurred. Is this normal?

Thanks,

Xiaoyu

Joel Hestness

Apr 15, 2014, 11:52:13 AM

to Xiaoyu Zheng, gem5-gpu developers, Jason Power

Hi Xiaoyu,

Yes, you should probably also use 4 directories.

It's hard to say what could be causing the deadlock without more information about your configuration. You'll want to make sure that you're supplying sufficient bandwidth off-chip, since otherwise the memory access latencies will increase very quickly. I believe the default bandwidth per directory/memory controller is ~23GB/s, so 4 of them would be less than 100GB/s. This skew compared to the aggregate L2 bandwidth (>250GB/s) could cause extremely long off-chip access latency, which could be causing the deadlock check to trigger even though there may not actually be deadlock.

You could try increasing the deadlock latency threshold or increasing the aggregate off-chip bandwidth (e.g. GTX580 has 179GB/s). If you still run into this issue, let us know and send over your config.ini output file.
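For a sense of scale, the figures quoted in this reply can be compared directly. This is only arithmetic on those numbers (~23GB/s per controller is the "I believe" default above, 179GB/s is the GTX580 figure), and the variable names are hypothetical.

```python
default_per_controller = 23.0   # GB/s, approximate default quoted above
num_controllers = 4
gtx580_off_chip = 179.0         # GB/s, quoted above

default_aggregate = default_per_controller * num_controllers
needed_per_controller = gtx580_off_chip / num_controllers

print("default aggregate off-chip:", default_aggregate, "GB/s")
print("per-controller bandwidth to match GTX580:", needed_per_controller, "GB/s")
print("scale-up factor:", round(gtx580_off_chip / default_aggregate, 2))
```

With four controllers, each would need roughly twice the quoted default bandwidth to match the GTX580's off-chip figure.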