Thanks.
And for 8 SIMD cores with 1 thread per core, the pin ticks are 2351355000.
What confuses me is why the multithreaded model slows execution down
instead of accelerating it. With 1 thread per core there is no
contention for the L1 cache, and I don't think memory bandwidth is the
problem either, because increasing the threads per core and the warp
size does improve performance.
For 8 SIMD cores with 16 threads per core, the pin ticks are 369606000,
which is still slower than the O3 CPU run.
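For what it's worth, going from 1 to 16 threads per core does scale: the two pin-tick numbers above differ by about 6.4x, so it is only the absolute performance that still trails the O3 CPU. A throwaway check of the ratio:

```shell
# Ratio of the two reported pin-tick counts (fewer ticks = faster run)
awk 'BEGIN { printf "%.2f\n", 2351355000 / 369606000 }'
```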
Do you have any suggestions? Any advice is appreciated.
Many thanks.
Xiaonan
Run script:
#!/bin/sh
#$1: isSIMD (true/false)
#$2: if $1 is true, the number of SIMD cores
#$3: if $1 is true, the number of threads per SIMD core
#$4: warpsize
NetPortQueueLength=16
MemoryNetBandWidth=456000
MemNetFrequ=533MHz
MemNetRouterBufferSize=2048
PhyMemLatency=25ns
L2CacheSize=4096kB
echo "MV5 Benchmarks"
now=$(date +"%Y_%m_%d_%H:%M")
pwd
if [ "$1" = true ]
then
build/ALPHA_SE/m5.fast configs/fractal/hetero.py --rootdir=. \
    --bindir=./api/binsimd_blocktask/ --simd=True --CpuFrequency=1.00GHz \
    --GpuFrequency=1.00GHz --DcacheAssoc=4 --DcacheBanks=4 \
    --DcacheBlkSize=32 --DcacheHWPFdegree=1 --DcacheHWPFpolicy=none \
    --DcacheLookupLatency=2ns --DcachePropagateLatency=2ns \
    --DcacheRepl=LRU --DcacheSize=16kB --IcacheAssoc=4 --IcacheBlkSize=32 \
    --IcacheHWPFdegree=1 --IcacheHWPFpolicy=none --IcacheSize=16kB \
    --IcacheUseCacti=False --L2NetBandWidthMbps=456000 \
    --L2NetFrequency=300MHz --L2NetPortOutQueueLength=4 \
    --L2NetRouterBufferSize=256 --L2NetRoutingLatency=1ns \
    --L2NetTimeOfFlight=13t --L2NetType=FullyNoC --L2NetWormHole=True \
    --MemNetBandWidthMbps=$MemoryNetBandWidth \
    --MemNetFrequency=$MemNetFrequ \
    --MemNetPortOutQueueLength=$NetPortQueueLength \
    --MemNetRouterBufferSize=$MemNetRouterBufferSize \
    --MemNetTimeOfFlight=130t --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 \
    --l2HWPFDataOnly=False --l2HWPFdegree=1 --l2HWPFpolicy=none \
    --l2MSHRs=64 --l2Repl=LRU --l2Size=$L2CacheSize --l2TgtsPerMSHR=32 \
    --l2lookupLatency=2ns --l2propagateLatency=12ns --l2tol1ratio=2 \
    --localAddrPolicy=1 --maxThreadBlockSize=0 --numSWTCs=2 \
    --physmemLatency=$PhyMemLatency --physmemSize=1024MB --portLookup=0 \
    --protocol=mesi --randStackOffset=True --restoreContextDelay=0 \
    --retryDcacheDelay=10 --stackAlloc=3 --switchOnDataAcc=True \
    --numcpus=$2 --warpSize=$4 --numHWTCs=$3 --benchmark=MEMGPU
else
build/ALPHA_SE/m5.fast configs/fractal/hetero.py --rootdir=. \
    --bindir=./api/binsimd_blocktask/ --simd=False --CpuFrequency=1.00GHz \
    --DcacheAssoc=4 --DcacheBanks=4 --DcacheBlkSize=32 \
    --DcacheHWPFdegree=1 --DcacheHWPFpolicy=none --DcacheLookupLatency=2ns \
    --DcachePropagateLatency=2ns --DcacheRepl=LRU --DcacheSize=16kB \
    --IcacheAssoc=4 --IcacheBlkSize=32 --IcacheHWPFdegree=1 \
    --IcacheHWPFpolicy=none --IcacheSize=16kB --IcacheUseCacti=False \
    --L2NetBandWidthMbps=456000 --L2NetFrequency=300MHz \
    --L2NetPortOutQueueLength=4 --L2NetRouterBufferSize=256 \
    --L2NetRoutingLatency=1ns --L2NetTimeOfFlight=13t --L2NetType=FullyNoC \
    --L2NetWormHole=True --MemNetBandWidthMbps=$MemoryNetBandWidth \
    --MemNetFrequency=$MemNetFrequ \
    --MemNetPortOutQueueLength=$NetPortQueueLength \
    --MemNetRouterBufferSize=$MemNetRouterBufferSize \
    --MemNetTimeOfFlight=130t --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 \
    --l2HWPFDataOnly=False --l2HWPFdegree=1 --l2HWPFpolicy=none \
    --l2MSHRs=64 --l2Repl=LRU --l2Size=$L2CacheSize --l2TgtsPerMSHR=32 \
    --l2lookupLatency=2ns --l2propagateLatency=12ns --l2tol1ratio=2 \
    --localAddrPolicy=1 --maxThreadBlockSize=0 --numSWTCs=2 \
    --physmemLatency=$PhyMemLatency --physmemSize=1024MB --portLookup=0 \
    --protocol=mesi --randStackOffset=True --restoreContextDelay=0 \
    --retryDcacheDelay=10 --stackAlloc=3 --switchOnDataAcc=True \
    --benchmark=MEMCPU
fi
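To sanity-check the argument dispatch without launching m5.fast, here is a minimal POSIX-sh sketch of the same branch logic (the function name and echoed messages are my own, not part of MV5):

```shell
#!/bin/sh
# run_mode: echo which configuration the positional arguments select,
# mirroring the if/else branch in the run script above.
run_mode() {
    if [ "$1" = true ]; then
        # SIMD path: $2 = cores, $3 = threads per core, $4 = warp size
        echo "SIMD run: cores=$2 threads_per_core=$3 warpsize=$4"
    else
        echo "O3 CPU baseline run"
    fi
}
```

For example, `run_mode true 8 16 8` selects the SIMD configuration used for the numbers quoted above.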
Many thanks.
Actually, I tried strides of 32, 64, 128, 256, 512, and 1024 bytes
(8 SIMD cores, 16 threads per core, warp size 8, running at 1GHz).
Here are the ticks for each stride:
32B: 414029000, 64B: 335184000, 128B: 309951000,
256B: 240737000, 512B: 249396000, 1024B: 382537000
None of them reaches performance equivalent to the CPU's.
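For reference, picking the fastest point out of that sweep (just a convenience one-liner over the numbers reported above; fewer ticks means a faster run):

```shell
# Pair each stride (bytes) with its reported tick count and keep the
# row with the minimum tick count.
best_stride() {
    printf '%s\n' \
        "32 414029000" "64 335184000" "128 309951000" \
        "256 240737000" "512 249396000" "1024 382537000" |
    awk 'NR == 1 || $2 < min { best = $1; min = $2 } END { print best, min }'
}
```

So the 256-byte stride (240737000 ticks) is the best of the six, with performance falling off again at larger strides.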