Hi Christian,
I've validated many of gem5-gpu's performance results against both a Tesla C2050 and a GTX580, both of which use the Fermi architecture, so it's likely that the questions you're running into are related to interpreting the stats rather than to their correctness. It's hard to say exactly what might be going on, but I can recommend a few things:
1) I would recommend running gem5-gpu with the --kernel_stats parameter, which dumps and resets statistics at the beginning and end of each GPU kernel. This should allow you to isolate the stats for different segments of the run time.
2) It is possible (probable?) that the stats you are collecting from the hardware profiler are sampled numbers rather than exact numbers. I'd recommend re-running the profiler tests a few times to see whether you get the same results, and keep in mind that the collected values may be inexact. If the performance counters use sampling, you'll need to run fairly long tests to get statistically significant results.
3) I suspect that the hardware instruction count stat is easily misinterpreted. Specifically, there are 14k shared loads, but surely those come from more than 9k instructions. It is more likely that the instruction count numbers are either invalid (e.g. the performance counter doesn't work), or that they count warp or thread block instructions executed rather than thread instructions. Note that there are about 60x more instructions in the sim. Perhaps the hardware stat counts thread block instructions, and your thread blocks have 64 threads? I'd recommend digging through some specs to see if you can find the appropriate interpretation of the hardware counters. In the gem5-gpu stats.txt output, the inst_counts stats count thread instructions (i.e. total instructions across all threads that executed).
4) Regarding memory accesses, you'll need to establish whether you're counting memory instructions or memory accesses. GPU load-store queues/caches perform access coalescing, which groups the requests from many memory instructions into a single memory access. On top of the complexity of counting these, hardware may also count separate port accesses as separate accesses, even if they came from a single coalesced access to a cache line. This is possible for both shared and global L1 cache accesses, which must schedule the use of the ports. Given that you're seeing larger shared memory access counts from the simulator, I suspect that hardware may be counting accesses instead of instructions. In the gem5-gpu stats.txt file, the stats 'shader_cores?.*_loads' and 'shader_cores?.*_stores' count the number of load and store instructions (i.e. across all threads), respectively, to the different memory spaces. These counts are taken prior to coalescing. The number of coalesced cache accesses is given by the 'l1_cntrl_sp??.demand_acceses' stat. When an application coalesces well, the number of memory instructions is 8-32x greater than the number of actual accesses.
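If it helps, the checks in points 3 and 4 can be scripted roughly as follows. This is only a sketch under some assumptions: it relies on the usual gem5 stats.txt layout (one "Begin Simulation Statistics" block per dump, one "name value" pair per line), the helper names (parse_dumps, total, coalescing_ratio, nearest_granularity) are made up for illustration, and the stat names and values in SAMPLE are invented, so substitute the names and numbers from your own stats.txt and profiler output.

```python
import re

BEGIN = "---------- Begin Simulation Statistics ----------"

def parse_dumps(text):
    """Split stats.txt text into one {stat_name: value} dict per dump."""
    dumps = []
    for chunk in text.split(BEGIN)[1:]:
        stats = {}
        for line in chunk.splitlines():
            parts = line.split()
            # Skip blank lines and the "---------- End ..." marker line.
            if len(parts) < 2 or parts[0].startswith("-"):
                continue
            try:
                stats[parts[0]] = float(parts[1])
            except ValueError:
                continue  # second column is not a plain number
        dumps.append(stats)
    return dumps

def total(stats, pattern):
    """Sum every stat in one dump whose name matches the regex."""
    rx = re.compile(pattern)
    return sum(v for k, v in stats.items() if rx.search(k))

def coalescing_ratio(stats, instr_pattern, access_pattern):
    """Memory instructions (pre-coalescing) per coalesced cache access."""
    accesses = total(stats, access_pattern)
    return total(stats, instr_pattern) / accesses if accesses else None

def nearest_granularity(sim_thread_insts, hw_insts, candidates=(1, 32, 64)):
    """Return sim/hardware ratio and the closest candidate group size.

    The candidates are assumptions: 1 = thread, 32 = warp,
    64 = a thread block of 64 threads.
    """
    ratio = sim_thread_insts / hw_insts
    return ratio, min(candidates, key=lambda c: abs(c - ratio))

# Invented stat names and values, for illustration only.
SAMPLE = """
---------- Begin Simulation Statistics ----------
shader_cores0.global_loads      3200   # made-up value
shader_cores1.global_loads      3200   # made-up value
l1_cntrl_sp00.demand_accesses    100   # made-up value
l1_cntrl_sp01.demand_accesses    100   # made-up value
---------- End Simulation Statistics   ----------
"""

if __name__ == "__main__":
    for i, dump in enumerate(parse_dumps(SAMPLE)):
        ratio = coalescing_ratio(dump,
                                 r"shader_cores\d*\..*_loads",
                                 r"l1_cntrl_sp\d\d\.demand_acc")
        print("dump", i, "load instructions per access:", ratio)
    # Made-up counts: 540000 simulated thread instructions vs. 9000 in hardware.
    print(nearest_granularity(540000, 9000))
```

With --kernel_stats each kernel produces its own dump, so the loop reports one ratio per dump. The access pattern deliberately stops at "demand_acc" so that it matches whichever spelling of the stat name your stats.txt uses.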
Hope this helps,
Joel