About the SIMD core pin ticks

daniel.tian

Dec 31, 2011, 3:18:34 PM
to MV5sim
Hi Jiayuan,
Happy New Year!
I have a quick question about pin ticks.
Do the pin ticks collected in api/tests/filter/filter.cpp include the CPU scheduling overhead, or only the SIMD kernel running time?
Also, from api/tests/filter/filter.cpp I only see "pin_stats_reset()" and "pin_stats_pause()"; there is no "pin_stats_resume()" call in main(). Is that call included in launch()?

Thanks
Daniel

daniel tian

Dec 31, 2011, 3:54:34 PM
to MV5sim
Hi Jiayuan,
I just emptied the kernel function and collected the pin ticks. Since the kernel now does nothing, is what remains the CPU scheduling overhead?
Xiaonan

Jiayuan Meng

Jan 1, 2012, 9:17:57 PM
to mv5...@googlegroups.com
Hi Xiaonan,

Happy new year!

You are right: the pin_stats calls are made implicitly within the fractal API. Timing is resumed right before the threads are spawned and paused right after all threads are joined, so pin_ticks includes the scheduling overhead. If the kernel does nothing, the measurement is purely scheduling overhead. However, this overhead is unrealistically small, because system calls are emulated as a single instruction, so don't rely on it if you are trying to measure "real overheads".
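Put differently, launch() behaves roughly like the sketch below. This is only an illustration of the ordering, not the actual MV5 source: launch_sketch(), spawn_threads() and join_threads() are hypothetical stand-ins, and only the pin_stats_* names come from filter.cpp and this thread.

// Illustration only -- hypothetical wrapper, not the real fractal API code.
void pin_stats_reset();    // assumed no-argument signatures; check the api headers
void pin_stats_resume();
void pin_stats_pause();
void spawn_threads(void (*kernel)(void *), void *args, int n);   // stand-in
void join_threads();                                             // stand-in

void launch_sketch(void (*kernel)(void *), void *args, int num_threads)
{
    pin_stats_resume();                        // counting starts just before the spawn
    spawn_threads(kernel, args, num_threads);  // worker threads run the kernel
    join_threads();                            // wait for every thread to finish
    pin_stats_pause();                         // counting stops right after the join,
                                               // so pin_ticks includes scheduling overhead
}

With this ordering, the explicit pin_stats_reset() (and the extra pin_stats_pause()) in filter.cpp's main() simply bracket the implicit calls; no explicit pin_stats_resume() is needed.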

Jiayuan

daniel tian

Jan 2, 2012, 12:10:32 AM
to mv5...@googlegroups.com
Hi Jiayuan,
I got the statistics for the CPU scheduling overhead. It is tiny, less than 1%, so I think it can be ignored.
Thanks,
Xiaonan

daniel tian

Jan 2, 2012, 4:45:46 AM
to mv5...@googlegroups.com
Hi Jiayuan,
One more question about ticks: if the CPU and the SIMD cores run at different frequencies, are the collected ticks based on the CPU frequency?

Thanks
Xiaonan

daniel tian

Jan 2, 2012, 5:25:17 AM
to mv5...@googlegroups.com
Hi Jiayuan,
Another question: I just checked the MV5 source code (gem5 may be the same), and there seems to be no option for changing the memory frequency, right? However, the latency and the ratio of CPU to memory-bus speed can be changed.
I want to try different CPU and memory frequencies and see how they affect SIMD core performance.
Do you have any suggestions? Any advice is appreciated.
Thank you very much,
Xiaonan

Jiayuan Meng

Jan 4, 2012, 11:31:23 AM
to mv5...@googlegroups.com
Actually, instead of being in units of CPU cycles, a tick is an absolute amount of time (1 tick = 0.001 nanoseconds, i.e. 1 picosecond), independent of the CPU frequency.
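As a quick sanity check on the units (using a figure that comes up later in this thread), converting ticks to time and to cycles is just arithmetic. The convert_ticks() helper below is a hypothetical convenience for this thread, not part of MV5:

#include <cstdio>

// 1 MV5 tick = 0.001 ns = 1 ps, independent of the core clock.
void convert_ticks(double ticks, double clock_hz)
{
    double seconds = ticks * 1e-12;          // 1 tick = 1e-12 s
    double cycles  = seconds * clock_hz;     // equivalent cycles at the given clock
    std::printf("%.0f ticks = %.3f us = %.0f cycles at %.2f GHz\n",
                ticks, seconds * 1e6, cycles, clock_hz / 1e9);
}

int main()
{
    convert_ticks(135314000.0, 1e9);   // e.g. 135314000 ticks = 135.314 us
    return 0;                          //      = 135314 cycles at 1 GHz
}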

Jiayuan Meng

Jan 4, 2012, 11:43:42 AM
to mv5...@googlegroups.com
You can set the memory bus frequency in the python script for system configurations. For example, in configs/fractal/fractal_smp.py, you will find:

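# clock: memory network frequency (set via --MemNetFrequency, e.g. 533MHz)
# bandwidth_Mbps: memory network bandwidth (set via --MemNetBandWidthMbps)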
memnet = BusConn(clock = Frequency(options.MemNetFrequency),
                 bandwidth_Mbps = options.MemNetBandWidthMbps)...

That's where you set the bus frequency.

But do you mean the physical memory? In MV5, the physical memory is not cycle-accurately simulated: we simply apply a fixed delay to every request going to physical memory (that latency can also be changed in the same Python script). I know that someone has integrated DramSim into M5; that may be more cycle-accurate.

Jiayuan

daniel tian

Jan 4, 2012, 11:52:14 AM
to mv5...@googlegroups.com
Jiayuan,
Thank you for the information.
I ran my code (a memory copy in which every thread copies one portion) on the SIMD cores: 8 SIMD cores with 16 threads each. However, when copying 512 bytes the speedup is only about 3x, so I am wondering where my performance bottleneck is.
I will try your suggestion.
Thanks,
Xiaonan

Jiayuan Meng

Jan 4, 2012, 12:59:01 PM
to mv5...@googlegroups.com
Usually, and especially for memory-intensive apps, the bottleneck is L1 cache contention. In your case, since you seem to be streaming data, memory bandwidth and latency may be the bottleneck instead.

daniel tian

Jan 4, 2012, 1:31:27 PM
to mv5...@googlegroups.com
Thanks.
Actually, I don't care about the accuracy of the memory ticks; I just want to make sure that memory and the L1/L2 caches are not the bottleneck.
I also have a question about cache access in the SIMD cores. With, say, 16 threads per core, how do those threads get data from the L1 cache? Does only one thread access the cache at a time while the others stall? Is there any configuration that supports a multiported / non-blocking cache?

Thanks.

daniel tian

Jan 4, 2012, 10:01:34 PM
to mv5...@googlegroups.com
Hi Jiayuan and folks,
I am still confused by the performance of the SIMD cores, so I have attached two versions of the memcpy/memset code, my configuration, and my run script. I am looking forward to any suggestions.
Here is a brief description of the experiment.
I reduced each SIMD core to 1 thread, with a warp size of 1.
The workload is: memset 16 KB and then memcpy it to another location.
For the O3 CPU, the pin ticks are 135314000.

For 8 SIMD cores with 1 thread per core, the pin ticks are 2351355000, roughly 17x slower than the O3 CPU.
What confuses me is why the multithreaded model does not accelerate performance but slows it down. With 1 thread per core there is no competition for the L1 cache, and I don't think the cause is memory bandwidth, because increasing the threads per core and the warp size does speed things up:
with 8 SIMD cores and 16 threads per core, the pin ticks are 369606000, which is still about 2.7x slower than the O3 CPU.

Do you have any suggestions?
Any advice is appreciated.

Many thanks.
Xiaonan

Run script:
#!/bin/sh
#$1: isSIMD
#$2: if $1 is true, $2 means num of SIMD Cores.
#$3: if $1 is true, $3 means num of threads for each SIMD core
#$4: warpsize
NetPortQueueLength=16
MemoryNetBandWidth=456000
MemNetFrequ=533MHz
MemNetRouterBufferSize=2048
PhyMemLatency=25ns
L2CacheSize=4096kB

echo "MV5 Benchmarks"
now=$(date +"%Y_%m_%d_%H:%M")
pwd
if [ "$1" = "true" ]
then
build/ALPHA_SE/m5.fast configs/fractal/hetero.py --rootdir=. \
--bindir=./api/binsimd_blocktask/ --simd=True --CpuFrequency=1.00GHz \
--GpuFrequency=1.00GHz --DcacheAssoc=4 --DcacheBanks=4 \
--DcacheBlkSize=32 --DcacheHWPFdegree=1 --DcacheHWPFpolicy=none \
--DcacheLookupLatency=2ns --DcachePropagateLatency=2ns \
--DcacheRepl=LRU --DcacheSize=16kB --IcacheAssoc=4 --IcacheBlkSize=32 \
--IcacheHWPFdegree=1 --IcacheHWPFpolicy=none --IcacheSize=16kB \
--IcacheUseCacti=False --L2NetBandWidthMbps=456000 \
--L2NetFrequency=300MHz --L2NetPortOutQueueLength=4 \
--L2NetRouterBufferSize=256 --L2NetRoutingLatency=1ns \
--L2NetTimeOfFlight=13t --L2NetType=FullyNoC --L2NetWormHole=True \
--MemNetBandWidthMbps=$MemoryNetBandWidth \
--MemNetFrequency=$MemNetFrequ \
--MemNetPortOutQueueLength=$NetPortQueueLength \
--MemNetRouterBufferSize=$MemNetRouterBufferSize \
--MemNetTimeOfFlight=130t --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 \
--l2HWPFDataOnly=False --l2HWPFdegree=1 --l2HWPFpolicy=none \
--l2MSHRs=64 --l2Repl=LRU --l2Size=$L2CacheSize --l2TgtsPerMSHR=32 \
--l2lookupLatency=2ns --l2propagateLatency=12ns --l2tol1ratio=2 \
--localAddrPolicy=1 --maxThreadBlockSize=0 --numSWTCs=2 \
--physmemLatency=$PhyMemLatency --physmemSize=1024MB --portLookup=0 \
--protocol=mesi --randStackOffset=True --restoreContextDelay=0 \
--retryDcacheDelay=10 --stackAlloc=3 --switchOnDataAcc=True \
--numcpus=$2 --warpSize=$4 --numHWTCs=$3 --benchmark=MEMGPU
else
build/ALPHA_SE/m5.fast configs/fractal/hetero.py --rootdir=. \
--bindir=./api/binsimd_blocktask/ --simd=False --CpuFrequency=1.00GHz \
--DcacheAssoc=4 --DcacheBanks=4 --DcacheBlkSize=32 \
--DcacheHWPFdegree=1 --DcacheHWPFpolicy=none --DcacheLookupLatency=2ns \
--DcachePropagateLatency=2ns --DcacheRepl=LRU --DcacheSize=16kB \
--IcacheAssoc=4 --IcacheBlkSize=32 --IcacheHWPFdegree=1 \
--IcacheHWPFpolicy=none --IcacheSize=16kB --IcacheUseCacti=False \
--L2NetBandWidthMbps=456000 --L2NetFrequency=300MHz \
--L2NetPortOutQueueLength=4 --L2NetRouterBufferSize=256 \
--L2NetRoutingLatency=1ns --L2NetTimeOfFlight=13t --L2NetType=FullyNoC \
--L2NetWormHole=True --MemNetBandWidthMbps=$MemoryNetBandWidth \
--MemNetFrequency=$MemNetFrequ \
--MemNetPortOutQueueLength=$NetPortQueueLength \
--MemNetRouterBufferSize=$MemNetRouterBufferSize \
--MemNetTimeOfFlight=130t --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 \
--l2HWPFDataOnly=False --l2HWPFdegree=1 --l2HWPFpolicy=none \
--l2MSHRs=64 --l2Repl=LRU --l2Size=$L2CacheSize --l2TgtsPerMSHR=32 \
--l2lookupLatency=2ns --l2propagateLatency=12ns --l2tol1ratio=2 \
--localAddrPolicy=1 --maxThreadBlockSize=0 --numSWTCs=2 \
--physmemLatency=$PhyMemLatency --physmemSize=1024MB --portLookup=0 \
--protocol=mesi --randStackOffset=True --restoreContextDelay=0 \
--retryDcacheDelay=10 --stackAlloc=3 --switchOnDataAcc=True \
--benchmark=MEMCPU
fi

memcpu.cpp
memgpu.cpp
hetero.py

Jiayuan Meng

Jan 4, 2012, 10:43:11 PM
to mv5...@googlegroups.com
When the threads in a warp access memory, all of them stall until all of their requests are fulfilled. In the meantime, the core switches to another warp to overlap the memory access.

Requests from threads in the same warp are sent to the L1 cache at the same time (as a multiported cache). The cache is non-blocking because it is MSHR-capable, so that the cache can accept more requests from other warps if the current warp is experiencing cache misses.
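If it helps to picture that, the toy model below (plain C++, invented names, not MV5 source) round-robins over warps: a warp that issues its memory accesses stalls for a fixed latency, but because the requests have already been accepted (as they would be by MSHR entries), another ready warp can issue on a later cycle and overlap the latency.

#include <algorithm>
#include <cstdio>
#include <vector>

// Toy illustration of the warp switching described above; not MV5 code.
struct Warp {
    int id;
    int stall_cycles;   // > 0 while the warp waits for its memory requests
};

int main()
{
    const int memory_latency = 4;   // arbitrary miss latency, in cycles
    std::vector<Warp> warps = { {0, 0}, {1, 0}, {2, 0} };

    for (int cycle = 0; cycle < 12; ++cycle) {
        // Outstanding requests make progress every cycle.
        for (Warp &w : warps)
            if (w.stall_cycles > 0) --w.stall_cycles;

        // The core switches to any warp that is not stalled; when that warp's
        // threads issue their loads together, the warp stalls, but the
        // non-blocking cache has already accepted the requests.
        auto ready = std::find_if(warps.begin(), warps.end(),
                                  [](const Warp &w) { return w.stall_cycles == 0; });
        if (ready != warps.end()) {
            ready->stall_cycles = memory_latency;
            std::printf("cycle %2d: warp %d issues memory accesses and stalls\n",
                        cycle, ready->id);
        } else {
            std::printf("cycle %2d: all warps stalled on memory\n", cycle);
        }
    }
    return 0;
}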

daniel tian

Jan 4, 2012, 11:10:02 PM
to mv5...@googlegroups.com
Thanks, Jiayuan.
This is important information for me.
I guess the CPU performs better than the SIMD cores because the O3 core has 32 load/store queue entries.
Anyway, I would like to try more benchmarks.

Many thanks.

Jiayuan Meng

Jan 4, 2012, 11:14:56 PM
to mv5...@googlegroups.com
Hi Xiaonan,

What is numcpus when you use the O3 CPU?

Also, did you check that memcpu and memgpu have the same workload and produce the same results? It seems to me that for memcpy, you loop imemsize/sizeof(int) times in memcpu.cpp, while in memgpu.cpp you iterate imemsize times.

It could also be that the kernel in memgpu is too small, so the overhead is significant. Did you try increasing STRIDE there?
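If the loop bounds really do differ, the two benchmarks are not doing the same amount of work. A hypothetical side-by-side of the two patterns (the actual memcpu.cpp/memgpu.cpp bodies are in the attachments and may differ):

#include <cstddef>

// Hypothetical sketch of the loop-bound mismatch described above.
void copy_as_in_memcpu(int *dst, const int *src, std::size_t imemsize /* bytes */)
{
    // imemsize / sizeof(int) iterations: one per 4-byte word.
    for (std::size_t i = 0; i < imemsize / sizeof(int); ++i)
        dst[i] = src[i];
}

void copy_as_in_memgpu(int *dst, const int *src, std::size_t imemsize /* bytes */)
{
    // imemsize iterations: 4x as many as above if imemsize counts bytes,
    // so the SIMD version would be doing several times the work.
    for (std::size_t i = 0; i < imemsize; ++i)
        dst[i] = src[i];
}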

Yes, the ld/st queue is a good point.

Jiayuan

daniel tian

Jan 4, 2012, 11:29:32 PM
to mv5...@googlegroups.com
Actually, the numcpus option is left empty (the default is one) when the MEMCPU benchmark is chosen.
In any case, memcpu.cpp runs in a single thread, so extra cores would not improve its performance.

I also tried strides of 32, 64, 128, 256, 512, and 1024 bytes (8 SIMD cores, 16 threads per core, warp size 8, running at 1 GHz).
Here are the ticks for each stride:
stride (bytes):   32         64         128        256        512        1024
ticks:            414029000  335184000  309951000  240737000  249396000  382537000
None of them reaches the performance of the CPU run.

Jiayuan Meng

Jan 5, 2012, 9:50:15 PM
to mv5...@googlegroups.com
Did you try using 1 SIMD core? I'm thinking it could also be false sharing. Did you check the issue with memcpu about the imemsize/sizeof(int) iterations?
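For reference, false sharing happens when threads on different cores write to distinct words that happen to share a cache block, so the block keeps bouncing between L1 caches under the coherence protocol (MESI, with 32-byte D-cache blocks, in the run script above). A generic illustration, not taken from the attached benchmarks, of padding each thread's data out to its own block:

#include <cstddef>

// Generic false-sharing illustration (not from memgpu.cpp).
constexpr std::size_t kCacheBlockBytes = 32;   // --DcacheBlkSize=32 in the run script

// Prone to false sharing: adjacent per-thread counters share a cache block,
// so writes from different cores keep invalidating each other's copy.
struct CountersSharingBlocks {
    long counter[8];
};

// Padded: each thread's counter occupies its own 32-byte block.
struct alignas(kCacheBlockBytes) PaddedCounter {
    long value;
    char pad[kCacheBlockBytes - sizeof(long)];
};

PaddedCounter padded_counters[8];   // one block per thread, no ping-ponging

The analogous fix in the memcpy benchmark would be to give each thread a contiguous chunk whose size and starting offset are multiples of the 32-byte block, so two threads never write into the same block.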