Configuring memory controllers


Christian Pinto

Aug 19, 2014, 1:29:09 PM
to gem5-gpu developers
Hi All,

I am doing a few experiments with split and fusion architectures. If I
run the same benchmark on the two architectures (removing all the
cudaMemcpy calls for fusion), I expect to see a global speedup of the
application (measured from the host) due to the absence of
host-device memory copies. At the same time I expect the kernel to last
longer on the fusion architecture, because there is a single DRAM memory
controller and it is shared with the CPU, while in the split architecture,
from what I see in gpgpusim.config, the GPU should have 6 memory
controllers providing much higher memory bandwidth.

I had a look at the simulation results and the kernel execution time
does not change considerably between split and fusion. I noticed that
in config.ini for the split architecture there is only one memory
controller for the CPU (system.ruby.dir_cntrl0) and one for the GPU
(system.ruby.dev_dir_cntrl0), both configured with exactly the same
parameters. So the parameter in gpgpusim.config is ignored.

So two questions:

- How are memory controllers handled in gem5-gpu?

- Is it possible to instantiate the memory hierarchy of a Fermi GPU?


Thanks,

Christian

Joel Hestness

Aug 20, 2014, 11:34:57 PM
to Christian Pinto, gem5-gpu developers
Hi Christian,
  Configuring the memory controllers in gem5-gpu is actually quite different from GPGPU-Sim, and the configurations in the gpgpusim.config file are ignored.  To set up the modeling that you'd like to do, I'd recommend reviewing this previous post and this thread on this list.  You'll need to use the gem5-gpu command line parameters --num-dirs and --num-dev-dirs to set the number of directories - and thus, memory controllers - that the simulated system contains.  Note that device directories (--num-dev-dirs) are used exclusively by the GPU.  I think you'll need to use power-of-2 directory counts.

  I'd also recommend investigating the gem5-gpu/configs/GPUMemConfig.py and gem5-gpu/configs/GPUConfig.py options, which allow you to adjust the capacity, frequency, and latencies of the shared and GPU-only memory controllers separately.  By setting these appropriately, you can get reasonably close to a DDR- or GDDR-like memory that uses a first-ready, first-come first-served (FR-FCFS) scheduling policy.
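
  For example (the build path and script name below are just from my own setup and may differ in yours), a run along these lines would give the simulated system 4 shared directories/memory controllers plus 4 GPU-only ones:

    ./build/X86_VI_hammer_GPU/gem5.opt ../gem5-gpu/configs/se_fusion.py \
        --num-dirs=4 --num-dev-dirs=4 <benchmark and other options>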

  Hope this helps,
  Joel

--
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/

Konstantinos Koukos

Aug 21, 2014, 4:54:46 AM
to gem5-g...@googlegroups.com
Hello, this is a nice discussion brought up here.

Could someone provide an example of how to configure a GDDR5
memory controller? For example, for a 256-bit memory interface
with 150GB/s of bandwidth, how many directories and which
settings should be used?

Best Regards,
Konstantinos.

Joel Hestness

Aug 21, 2014, 10:03:40 AM
to Konstantinos Koukos, gem5-gpu developers
Hi guys,

  Sure.  I've used the following command line parameters to get in the ballpark of an NVIDIA GTX580:

--total-mem-size=3GB --num-l2caches=4 --num-dirs=4 --mem_ctl_latency=61 --mem_freq=3006MHz --membus_busy_cycles=8 --membank_busy_time=32ns --sc_l1_size=24kB --sc_l2_size=192kB

  This configures the total memory size as 3GB striped across 4 memory controllers with latencies comparable to what you'd find with GDDR5 (i.e. including the deep 8n-prefetch buffering and bank latency for a close-page policy). It also sets up the caches to be close to the GTX580's partitioning with small L1 caches: the hardware has a 16kB data cache, but VI_hammer doesn't split the L1s, so I added 8kB to cover instructions. The memory bandwidth is 3006MHz * 8B/channel * dual data rate (2) * 4 memory controllers = 192.4GB/s, or 179.17GiB/s (i.e. the same bandwidth as the GTX580, but with 2 fewer memory controllers, so the frequency needs to be 1.5x higher).

  In practice, the memory scheduling policy of the GTX580 is better than FR-FCFS, so for memory-intensive applications the effective bandwidth in simulation can be about 5% less than on actual NVIDIA hardware, but it tends to be very close.

  Disclaimer: I always strongly recommend running short validation tests for L2 and memory bandwidth when choosing a new system configuration. It's pretty easy to miss a command line parameter, which can cause the memory performance to be way off. Though it's a little tricky to set up initially, the microbenchmark in benchmarks/unittests/global_reads is pretty handy for this testing.

  Hope this helps,
  Joel

Christian Pinto

Aug 25, 2014, 4:11:20 PM
to Joel Hestness, Andrea Bartolini, gem5-gpu developers
Hi Joel,

Thank you very much for your help; after your last email the situation is much clearer.

I have now started doing a few experiments, and I ran into a result that I do not understand.

Essentially, I run the same benchmark (DGEMM - linear algebra) in the split and fusion configurations, with the shared memory hierarchy of the fusion architecture configured in exactly the same way as the GPU memory of the split architecture.

So in the split I have:
 GPU:    4 x L2 caches (192KB each); 4 x GDDR5 memory controllers (configuration suggested by Joel in the previous email)
 CPU:    4 x DDR3 memory controllers

(--total-mem-size=6GB --mem-size=6GB --clusters=14 --num-dirs=4  --num-l2caches=4 --sc_l2_size=192kB --sc_l1_size=24kB  --mem_freq=533MHz --membus_busy_cycles=8 --membank_busy_time=39.645ns --num-dev-dirs=4 --gpu-mem-size=4GB --gpu_mem_freq=2.25GHz --gpu_membus_busy_cycles=8 --gpu_mem_ctl_latency=61  --gpu_membank_busy_time=32ns)

while for fusion:
 GPU:                   4 x L2 caches (192KB each)
 SHARED MEM:    4 x GDDR5 memory controllers

( --total-mem-size=6GB --mem-size=6GB --clusters=14 --num-l2caches=4 --sc_l2_size=192kB --sc_l1_size=24kB  --num-dirs=4 --mem_ctl_latency=61 --mem_freq=2.25GHz --membank_busy_time=32ns --membus_busy_cycles=8 --kernel_stats --access-host-pagetable)

What I expect from this experiment is to measure very much the same performance, with at most the kernel executed on the fusion architecture being slower than the one on the split architecture, because of page walks or contention on the shared memory. I also expect the two simulations to attain the same memory bandwidth towards external memory.

What I have actually observed is:

- the kernel has almost the same duration: 75801652 (split) vs 75777355 (fusion)

- but a different bandwidth: 661.8 MB/sec (split) vs 941.8 MB/sec (fusion)

- the same number of global memory reads and writes in both experiments

Since the higher bandwidth of fusion sounds strange, I started digging into the stats file and noticed a few strange things:

- the amount of data written by each shader to external memory is the same in the two experiments, while the amount of data read by each shader increases. In particular I measure 15% more data read from external memory, even though the load/store instruction counts match between fusion and split.

    I had 40960 bytes read in split and 47104 bytes in fusion (bytes_read::gpu.shader_cores01.data in stats), a difference of ~6kB.

I then started thinking that this extra traffic could be generated by the GPU MMU, since it only reads from memory. I have 32 page walks in total; can these read 6kB from memory?

As far as I understood, in fusion mode the GPU shaders translate virtual addresses through a combination of per-shader TLBs and a per-GPU MMU. From the configuration files I saw that the TLBs are connected to the shared MMU, which in turn is connected to the page walk cache. The page walk cache is connected to the L2(s), and finally to the DRAM of the system.

Is the translation traffic also passing through the L1 cache of each shader before going to the MMU?

- the number of accesses to the L1 cache of the shader increases in fusion, and this is reflected in increased traffic on the L2 cache. But the strange thing is that only 2 of those "extra" accesses are misses.

If the extra traffic I see is generated by address translations, and that traffic goes through the L2, I would also expect to see more than 2 extra L2 cache misses.

To summarize, I can't figure out how, just by switching from split to fusion, a shader can read 6KB more data from external memory without this traffic passing through the L2 cache. Is there a way for a shader to read external memory while bypassing the L2 cache?

Address translations are the only guess I could make, but 32 page walks, I think, don't move 6KB of data (am I right?). And again I see in total only 2 extra L2 cache misses in fusion, which is only 256 bytes.

Do you have a clue about what could cause this behavior? The increase in data read reaches 25% when I execute bigger instances of the benchmark.

Thanks, and sorry for the very long email.


Christian

Joel Hestness

Aug 25, 2014, 8:08:39 PM
to Christian Pinto, Andrea Bartolini, gem5-gpu developers
Hi Christian,
  In the stats.txt file, bytes_read::<master_id>.<port_name> is a stat that measures the total bytes read from the backing store (PhysicalMemory) by the given master ID. In VI_hammer, all GPU memory accesses touch the backing store (their cache RubySequencers have access_phys_mem = True in the config files), so I believe this count includes bytes read even on cache hits. The stat you are referencing may therefore be counting multiple accesses to the same couple of cache lines that missed into the L2 cache but were not necessarily fetched off-chip.
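
  (A quick way to check this in your own runs, assuming the default m5out output directory, is something like: grep access_phys_mem m5out/config.ini, which should list the setting for each Ruby sequencer.)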

  I don't know for sure, but I suspect you are correct that the page walkers are adding these extra accesses. I think their access sizes are the width of a cache line (128B), so 6kB = 48 accesses. If you're simulating 16 GPU cores, that's only 3 accesses per core, or roughly a 12kB working set for each core during the kernel. If that is the case, then it is likely that the full memory footprint can fit in your L2 caches (12kB * 16 cores = 192kB, which would be roughly the max footprint). It would only take a few cache lines from the page table to map this footprint, so all the stats you cite seem reasonable to me.

  Hope this helps,

  Joel


Christian Pinto

Aug 27, 2014, 6:23:55 AM
to Joel Hestness, Andrea Bartolini, gem5-gpu developers
Hi Joel,

I chose a small example precisely to make it easy to play with the numbers...

It sounds weird to me that cache hits are also counted in the total number of bytes accessed. To demonstrate this, I go back to my example.

In the experiment described, only one shader core out of 14 is involved in the computation. In the switch from split to fusion I measure exactly 6KB of extra data read by that core, and a total of 32 page walks.

- I cannot correlate this amount of data with the number of page walks: 6KB / 32 = 192 bytes, which in my opinion is a bit too much for a single page walk.

- I measure in the L1 cache a total of 257 extra accesses, all misses. In the L2 this translates into 255 extra hits and 2 extra misses. If I divide 6KB by the size of a line (128B, the same size for both L1 and L2), I get 48. So I cannot even correlate this extra amount of data with the extra cache accesses, and especially not with only 2 cache misses.


Three questions:

1 - Why is the page walk cache always reporting zero? Its size is not zero, so it should not be bypassed. This component would help a lot in understanding whether the extra data is due to page walks or not.

2 - Is the translation traffic also passing through the L1 cache of each shader?

3 - Where in the code can I look to check what is happening with the memory?



Thanks again


Christian

Joel Hestness

Aug 29, 2014, 9:51:12 AM
to Christian Pinto, Andrea Bartolini, gem5-gpu developers, Jason Power
Hi Christian,
  Perhaps Jason Power (cc'd) could speak more to the pagewalk cache, since I haven't used it outside of small tests previously.

  To debug this, though, I'd recommend turning on some debugging flags in your simulation. Specifically, there are four sets of flags that could be useful: (1) pass the ShaderMMU and ShaderTLB flags to --debug-flags= when you run with gem5.opt or gem5.debug to get a trace of the translation activity; (2) the ShaderLSQ and CudaCoreFetch flags give a trace of the memory accesses that the CudaCores are making; (3) the ProtocolTrace flag gives a detailed trace of what data is flowing through the cache coherence protocol and how; and (4) the MemoryAccess flag gives a trace of the accesses as they hit the backing store (in gem5/src/mem/abstract_memory.cc). I would also recommend restricting your simulation to fewer cores, and a smaller application input set if possible, since 257 accesses is a lot to track down.
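
  For example (the build path and script name here are from my own setup and may not match yours exactly), something like:

    ./build/X86_VI_hammer_GPU/gem5.opt --debug-flags=ShaderMMU,ShaderTLB --debug-file=mmu_trace.txt \
        ../gem5-gpu/configs/se_fusion.py <benchmark options>

  will write the translation trace into the simulation output directory (m5out by default).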

  Hope this helps,
  Joel


Jason Power

Aug 29, 2014, 10:21:42 AM
to Joel Hestness, Christian Pinto, Andrea Bartolini, gem5-gpu developers
Hi Christian,

I believe the reason the PWC is reporting 0 accesses is that you have the "bypass_l1" option set to true. When using the PWC, this should automatically be set to false, but for some reason it may not be in your configuration.

Each page walk makes 4 requests to completely different cache lines, so I would expect a single page walk to require 4 * 128B = 512B of data (as an upper bound, 32 walks would then be 16KB if nothing were reused). You are likely seeing much less than this due to caching and reuse of some of the cache lines that contain the page table.

Hopefully this helps. I haven't read through all of the previous emails, so let me know if you have any other questions about address translation in gem5-gpu.

Jason


------
Jason Power
University of Wisconsin-Madison,
Department of Computer Sciences

Christian Pinto

Aug 29, 2014, 12:15:14 PM
to Jason Power, Joel Hestness, gem5-gpu developers
Dear all,

We have finally found what was causing the extra memory accesses. It was not related to memory access translation; it was due to data misalignment.

When applications are executed on the split architecture, alignment of data to the cache line size on the GPU side is ensured by cudaMalloc, while on the fusion architecture data allocation is handled by the CPU. What was happening is that the data allocated by the CPU was not aligned to 256 bytes, and this misalignment was triggering extra memory accesses. I switched all the malloc calls to posix_memalign, specifying an alignment of 256 bytes, and after that all the extra memory accesses disappeared.
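
For reference, the change was essentially the following (a minimal sketch; the helper name and matrix shape are just illustrative, not the actual benchmark code):

    #include <stdlib.h>

    /* Allocate an n x n matrix of doubles aligned to 256 bytes, so that
     * warp-wide accesses map onto whole cache lines, matching what
     * cudaMalloc already guarantees on the split architecture. */
    static double *alloc_matrix(size_t n)
    {
        void *p = NULL;
        /* was: return malloc(n * n * sizeof(double)); */
        if (posix_memalign(&p, 256, n * n * sizeof(double)) != 0)
            return NULL;
        return p;
    }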

This effect was not easy to discover; the ShaderLSQ traces helped figure it out. I saw that for some warp memory accesses the LSQ was issuing 2 x 32-byte + 2 x 128-byte accesses, instead of the 2 x 256-byte accesses needed in the aligned case.


Thank you all for the support.


Christian

Christian Pinto

Aug 29, 2014, 12:25:26 PM
to Jason Power, Joel Hestness, gem5-gpu developers
I forgot to mention that the bypass_l1 flag is set to false in my configuration.


Christian