Looks like Jason covered most of your questions, but here is a bit more info that may also be useful:
RE: MSHRs: Perhaps it may clear up some confusion to note that in contrast to CPU architectures, GPU LSQs and L1 caches are extremely tightly coupled. Specifically, our tests on NVIDIA hardware indicate that the L1 tag array is serially accessed before the data array, and the LSQ is notified whether there is an L1 miss - whether the access must occupy an MSHR - before any data array access. This allows the GPU core that notes the miss to act on it, for example, by descheduling the warp waiting on the miss. In order to model this tight coupling accurately while constrained by the way the Ruby cache hierarchies work, we take liberties to track MSHR statuses in the LSQ. GPGPU-Sim models an analogous architecture (e.g. check out the use of m_L1D in gpgpu-sim/gpgpu-sim/shader.cc).
To Kyu's first question, specifically, gem5-gpu models the MSHRs as physically located with the GPU L1 caches, but the LSQ also contains deep ordered buffering of accesses that will hit in lines with an outstanding access to the caches (i.e. queues for lines in MSHRs). This is one of many potential design points (a fairly aggressive one), and you can check out more options in the recent paper by Wenhao Jia et al. (http://www.princeton.edu/~wjia/mrpb.pdf).
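In case it helps to picture the LSQ-side buffering described above, here is a toy Python sketch. It is not gem5-gpu or GPGPU-Sim code: the class name, method names, and the 128B line size are all made up for illustration, and it ignores timing, replacement, and stores entirely. It only shows the idea that a miss takes an MSHR, later accesses to the same line queue up in order behind it, and a fill releases them in that order.

```python
from collections import OrderedDict, deque


class ToyLSQ:
    """Hypothetical toy model: LSQ tracking MSHR status per cache line."""

    def __init__(self, num_mshrs, line_bytes=128):
        self.num_mshrs = num_mshrs      # assumed MSHR count, not a real default
        self.line_bytes = line_bytes    # assumed line size
        self.cached_lines = set()       # stand-in for the L1 tag array
        self.pending = OrderedDict()    # line -> deque of waiting access ids

    def access(self, access_id, addr):
        line = addr // self.line_bytes
        if line in self.pending:
            # Line already has an outstanding miss: queue behind it, in order.
            self.pending[line].append(access_id)
            return "queued"
        if line in self.cached_lines:
            return "hit"
        if len(self.pending) < self.num_mshrs:
            # Tag check missed and an MSHR is free: allocate it.
            self.pending[line] = deque([access_id])
            return "miss"
        # No MSHR available: the access has to wait and retry.
        return "stall"

    def fill(self, line):
        # Data returned for `line`: free the MSHR and release waiters in order.
        waiters = list(self.pending.pop(line))
        self.cached_lines.add(line)
        return waiters


if __name__ == "__main__":
    lsq = ToyLSQ(num_mshrs=1)
    print(lsq.access(0, 0))     # first touch of line 0
    print(lsq.access(1, 64))    # same line, still outstanding
    print(lsq.access(2, 128))   # different line, no MSHR left
    print(lsq.fill(0))          # accesses released by the fill
```

The "miss" and "stall" return values are the kind of early notification the core could use to deschedule a warp before any data array access happens.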
RE: L2 porting: I'd strongly recommend being careful with how you modify the default settings in gem5-gpu. Real GPU L2 caches are architected in a highly parallel way (e.g. numerous banks) in order to preclude much per-port contention. Between this and the extremely aggressive interconnect buffering in most hardware, it is unlikely that per-bank L2 porting in real hardware will have much effect on cache performance. Instead, aggregate L2 bandwidth is the primary performance factor. We model bandwidth limitations between the caches and the memory controllers using the gem5 Ruby interconnect (more below).
A while ago, I tried testing the resourceStalls setting as a way to limit cache bandwidth, but the results didn't make sense, and I'm still not clear why. Since then, we have updated the version of gem5 that we're using, and I've fixed bugs in the Ruby interconnect, so something may have changed, but we haven't validated the resourceStalls functionality. Even if resourceStalls are now working as intended, they'd probably only be useful for modeling per-bank contention more precisely, rather than for capturing the bigger performance factor (bandwidth).
All that said, we have heavily validated that the cache bandwidth results make sense. The interconnect width is the primary constraining factor on L2 bandwidth and is defined in the VI_hammer config files. For example, in gem5-gpu/configs/gpu_protocol/VI_hammer_fusion.py, the line "l2_cluster = Cluster(intBW = 32, extBW = 32)" specifies a 32B-wide interconnect, and if Ruby is clocked at 2GHz (the default), the cluster will include 32B * 2GHz = ~60GB/s links, and the L2 will not be able to serve more bandwidth than this. You can scale out the aggregate effective GPU L2 bandwidth by using multiple L2 caches, which you set with the command line parameter --num-l2caches=#. If you want to model something like a GTX580 (~276GB/s L2), using 4 gem5-gpu GPU L2s will get you pretty close (~240GB/s).
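If you want to play with the numbers, here is a minimal Python sketch of the arithmetic above. The function names are made up for illustration and are not part of gem5-gpu. It computes only the raw width-times-clock product, so it gives 64GB/s per link and 256GB/s for four L2s, a bit above the rounded ~60GB/s and ~240GB/s figures quoted above.

```python
def link_bandwidth_gb_per_s(width_bytes, clock_ghz):
    """Raw peak link bandwidth: bytes per cycle times cycles per nanosecond."""
    return width_bytes * clock_ghz


def aggregate_l2_bandwidth_gb_per_s(num_l2caches, width_bytes=32, clock_ghz=2.0):
    """Raw peak aggregate bandwidth, assuming one such link per L2 cache."""
    return num_l2caches * link_bandwidth_gb_per_s(width_bytes, clock_ghz)


if __name__ == "__main__":
    # 32B-wide interconnect (intBW = 32) with Ruby clocked at 2GHz
    print(link_bandwidth_gb_per_s(32, 2.0))
    # Scaling out with --num-l2caches=4
    print(aggregate_l2_bandwidth_gb_per_s(4))
```

This treats the L2s as perfectly independent, which is an assumption; real traffic is rarely spread that evenly, so take it as an upper bound.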
Hope this helps! Let me know if you have further questions,
Joel