Finished memset in size 16384 bytes.
Thread: 16
Last tick: 826087000
Exiting @ tick 9223372036854775807 because simulate() limit reached
My configuration is two master O3 CPUs with one additional SIMD core
to run the memset/memcpy functions.
Thanks
Xiaonan
On Fri, Jan 6, 2012 at 1:50 AM, daniel tian <daniel...@gmail.com> wrote:
> Hi, Jiayuan:
> Sorry for the late reply, and thank you for your meticulous
> inspection of memcpy.cpp. That loop iteration is indeed a bug.
> The O3 CPU is the most efficient one. However, SIMDTimingCPU is based
> on TimingSimpleCPU, which has much lower performance than the O3 CPU.
> In my benchmark, the memset/memcpy size is 16 KB. Performance (in ticks):
> O3: 218780000
> SIMDTimingCPU (sequential, 1 thread): 619019000
>
> You can see that the O3 is almost 3 times faster than the SIMDTimingCPU
> (619019000 / 218780000 ≈ 2.8).
>
>
> For SIMD multithreading:
> I tried several kinds of configurations,
> WarpSize:ThreadsPerCore =
> 1:1, 2:2, 4:4, 8:8, 16:16,
> 1:2, 2:4, 4:8, 8:16
>
> and different granularities (32, 64, 128, 256, 512, 1024, 2048, 4096,
> 8192, 16384) for each config. The performance data is attached; please
> check.
>
> I will try more tomorrow morning.
> There may be some bug in the cache component.
> Thank you again.
> Best regards.
> Xiaonan
>
>
> On Fri, Jan 6, 2012 at 12:31 AM, Jiayuan Meng <meng.j...@gmail.com> wrote:
>> I also suspect that the problem is in the cache. Please let me know what
>> you find out with 1 SIMD core (try both 1 thread/core and 8 threads/core).
>>
>>
>> On Thu, Jan 5, 2012 at 4:14 PM, daniel tian <daniel...@gmail.com> wrote:
>>>
>>> Sorry, this mail failed to send to the MV5 group. The attachments are
>>> larger than 4 MB.
>>>
>>> ---------- Forwarded message ----------
>>> From: daniel tian <daniel...@gmail.com>
>>> Date: Thu, Jan 5, 2012 at 4:10 PM
>>> Subject: Could you please help check the bottleneck?
>>> To: MV5sim <mv5...@googlegroups.com>
>>>
>>>
>>> Hi, Jiayuan:
>>> I just want to simplify the problem.
>>> I tried two different configurations with 8 SIMD cores (STRIDE is
>>> 32 bytes): warpsize=1, ThreadsPerCore=1 vs. warpsize=8, ThreadsPerCore=8.
>>> Here are the ticks for the 16 KB memset and memcpy: 2363267000 vs.
>>> 374988000 (the 8:8 configuration is about 6.3 times faster).
>>> Here are the scheduling overhead ticks for each task: 6993000 vs. 12919000.
>>> Both are much slower than the O3 version of the 16 KB memset/memcpy
>>> on the CPU side.
>>>
>>> I guess the problem is in the cache (cache misses, I am not sure). I
>>> just cannot figure out why eight SIMD cores, each taking care of part
>>> of the memory task, slow down the performance.
>>> The statistics files are attached to the email.
>>> Thanks
>>> Xiaonan
>>
>>
Many thanks.
Xiaonan
#
import frCommonsParameter
# --------------------------
# system specific parameters
frCommonsParameter.parser.add_option("--physmemLatency",
                                     default = '50ns')
frCommonsParameter.parser.add_option("--threadIterLatency",
                                     default = '10ns')
frCommonsParameter.parser.add_option("--maxThreadBlockSize", type="int",
                                     default = 0)
frCommonsParameter.parser.add_option("--enableBlockThreads",
                                     default = True)
frCommonsParameter.parser.add_option("--enableStealThrdBlock",
                                     default = True)
frCommonsParameter.parser.add_option("--enbSubdivThrdBlock",
                                     default = "False")
from frCommons import *
from StaticOptions import *
import sys  # needed for sys.exit() below (may also be pulled in via frCommons)

if options.enbSubdivThrdBlock == "True":
    _subdivThrdBlock = True
elif options.enbSubdivThrdBlock == "False":
    _subdivThrdBlock = False
else:
    print "enbSubdivThrdBlock is either True or False"
    sys.exit(1)
globals.enableSubdivThreadBlock = _subdivThrdBlock
globals.hwThreadIterLatency = options.threadIterLatency
globals.enableBlockThreads = options.enableBlockThreads
globals.enableStealThrdBlock = options.enableStealThrdBlock
# ----------------------
# Define the cores
# ----------------------
# busFrequency = Frequency(options.frequency)
if options.simd == "True":
    # regroup policy
    if options.SIMTRegroupPolicy == "None":
        _regroupPolicy = 0
    elif options.SIMTRegroupPolicy == "LatAwareAdapt":
        _regroupPolicy = 5
    elif options.SIMTRegroupPolicy == "SlipAdapt":
        _regroupPolicy = 6
    elif options.SIMTRegroupPolicy == "StallSplit":
        _regroupPolicy = 2
    elif options.SIMTRegroupPolicy == "AggressSplit":
        _regroupPolicy = 1
    elif options.SIMTRegroupPolicy == "AdaptSplit":
        _regroupPolicy = 7
    else:
        assert False
    # schedule policy
    if options.SIMTSchedPolicy == "RR":
        _schedPolicy = 1
    elif options.SIMTSchedPolicy == "Wider":
        _schedPolicy = 2
    elif options.SIMTSchedPolicy == "Shallower":
        _schedPolicy = 4
    elif options.SIMTSchedPolicy == "ShallowerWider":
        _schedPolicy = 5
    elif options.SIMTSchedPolicy == "WiderShallower":
        _schedPolicy = 6
    elif options.SIMTSchedPolicy == "SyncWarps":
        _schedPolicy = 7
    else:
        assert False
    # score strategy
    if options.SIMTScoreStrategy == "NoWait":
        _scoreStrategy = 0
    elif options.SIMTScoreStrategy == "WaitMemOK":
        _scoreStrategy = 1
    elif options.SIMTScoreStrategy == "AllValid":
        _scoreStrategy = 2
    else:
        assert False
    # merge strategy
    if options.SIMTMergeStrategy == "TowardsSlowest":
        _mergeStrategy = 0
    elif options.SIMTMergeStrategy == "TowardsClosest":
        _mergeStrategy = 1
    else:
        assert False
    # split strategy
    if options.SIMTSplitStrategy == "SplitFIFO":
        _splitStrategy = 0
    elif options.SIMTSplitStrategy == "SplitShallowest":
        _splitStrategy = 1
    else:
        assert False
    # adapt mode
    if options.SIMTadaptMode == "NoAdapt":
        _adaptMode = 0
    elif options.SIMTadaptMode == "AdaptWidth":
        _adaptMode = 1
    elif options.SIMTadaptMode == "AdaptDepth":
        _adaptMode = 2
    elif options.SIMTadaptMode == "AdaptDFSBFS":
        _adaptMode = 3
    elif options.SIMTadaptMode == "AdaptHillClimb":
        _adaptMode = 4
    elif options.SIMTadaptMode == "AdaptBFSDFS":
        _adaptMode = 5
    else:
        assert False
    # adapt step depth
    if options.SIMTadaptStepDepth == "Step1":
        _adaptStepDepth = 1
    elif options.SIMTadaptStepDepth == "Log2":
        _adaptStepDepth = 2
    else:
        assert False
    # adapt step width
    if options.SIMTadaptStepWidth == "Step1":
        _adaptStepWidth = 1
    elif options.SIMTadaptStepWidth == "Log2":
        _adaptStepWidth = 2
    else:
        assert False
    # prophet mode
    if options.prophetMode == "NoProphecies":
        _prophetMode = 0
    elif options.prophetMode == "PropheciesNorm":
        _prophetMode = 1
    elif options.prophetMode == "PropheciesLocGlb":
        _prophetMode = 2
    else:
        assert False
    simd_cpus = [SIMDTimingCPU(cpu_id = i,
                               loopBypass = options.SIMTLoopBypass,
                               schedPolicy = _schedPolicy,
                               preemptSched = options.SIMTPreemptSched,
                               regroupPolicy = _regroupPolicy,
                               scoreStrategy = _scoreStrategy,
                               mergeStrategy = _mergeStrategy,
                               splitStrategy = _splitStrategy,
                               alpha = options.SIMTProfileFactor,
                               adaptMode = _adaptMode,
                               adaptStepDepth = _adaptStepDepth,
                               adaptStepWidth = _adaptStepWidth,
                               adaptSampleInterval = options.SIMTadaptSampleIntv,
                               tasksPerIntvPerThrd = options.SIMTtasksPerIntvPerThrd,
                               sampleIntervalTicks = options.SIMTSampleInterval,
                               numberOfHWThreads = options.numHWTCs,
                               numberOfSWThreads = options.numSWTCs,
                               switchOnDataAcc = options.switchOnDataAcc,
                               prophetMode = _prophetMode,
                               prophDepth = options.prophDepth,
                               DcacheLineSize = options.DcacheBlkSize,
                               restoreContextDelay = options.restoreContextDelay,
                               retryDcacheDelay = options.retryDcacheDelay,
                               clock = options.GpuFrequency,
                               maxSlips = options.SIMTMaxSlips,
                               minReadyWarps = options.SIMTMinReadyWarps,
                               maxBatches = options.SIMTMaxBatches,
                               lowUtlThreshold = options.SIMTLowUtlThreshold,
                               highUtlThreshold = options.SIMTHighUtlThreshold,
                               numLanes = options.warpSize)
                 for i in xrange(options.numcpus)]
elif options.simd == "False":
    simd_cpus = [DerivO3CPU(cpu_id = i,
                            clock = options.CpuFrequency)
                 for i in xrange(options.numcpus)]
else:
    # raising a bare string is not allowed in newer Python 2 releases
    raise Exception("simd is not specified")

oo_cpus = [DerivO3CPU(cpu_id = i,
                      clock = options.CpuFrequency)
           for i in xrange(2)]
for i in xrange(2):
    oo_cpus[i].LQEntries = 32
    oo_cpus[i].SQEntries = 32
# ----------------------
# Create a system, and add system wide objects
# ----------------------
# memory = DRAM(range=options.physmemSize, latency=options.physmemLatency),
system = System(cpus_smp = simd_cpus,
                cpus_master = oo_cpus,
                physmem = PhysicalMemory(range = options.physmemSize,
                                         latency = options.physmemLatency),
                memnet = BusConn(clock = Frequency(options.MemNetFrequency),
                                 bandwidth_Mbps = options.MemNetBandWidthMbps)
                #memnet = FullyNoC(clock = Frequency(options.MemNetFrequency),
                #                  wireModel = 1,
                #                  maxPortOutQueueLen = options.MemNetPortOutQueueLength,
                #                  timeOfFlight = options.MemNetTimeOfFlight,
                #                  maxPortInBufferSize = options.MemNetRouterBufferSize,
                #                  linkQueueSize = options.MemNetRouterBufferSize,
                #                  bandwidth_Mbps = options.MemNetBandWidthMbps)
                )
# ----------------------
# Connect the L2 cache and memory together
# ----------------------
system.l2 = L2()
if options.L2NetType == 'FullyNoC':
    system.toL2net = FullyNoC(clock = Frequency(options.L2NetFrequency),
                              wireModel = 1,
                              maxPortOutQueueLen = options.L2NetPortOutQueueLength,
                              timeOfFlight = options.L2NetTimeOfFlight,
                              maxPortInBufferSize = options.L2NetRouterBufferSize,
                              linkQueueSize = options.L2NetRouterBufferSize,
                              bandwidth_Mbps = options.L2NetBandWidthMbps)
    system.l2.cpu_side = system.toL2net.port
elif options.L2NetType == 'Mesh2D':
    layoutFile = options.rootdir + "/configs/fractal/temp/tol2mesh-cpus-" \
                 + str(options.numcpus) + '.layout'
    system.toL2net = Mesh2D(clock = Frequency(options.L2NetFrequency),
                            wireModel = 1,
                            routingLatency = options.L2NetRoutingLatency,
                            maxPortOutQueueLen = options.L2NetPortOutQueueLength,
                            timeOfFlight = options.L2NetTimeOfFlight,
                            maxPortInBufferSize = options.L2NetRouterBufferSize,
                            linkQueueSize = options.L2NetRouterBufferSize,
                            bandwidth_Mbps = options.L2NetBandWidthMbps,
                            wormhole = options.L2NetWormHole,
                            geography = layoutFile)
    staticOptionsFile = options.rootdir + "/configs/fractal/temp/tol2mesh-cpus-" \
                        + str(options.numcpus) + '.options'
    staticOptions = StaticOptions(staticOptionsFile)
    for n in range(staticOptions.get('numL2CpuSidePorts')):
        system.l2.cpu_side = system.toL2net.port
#system.toL2bus = Bus(clock = busFrequency, width = L2BusWidth)
system.physmem.port = system.memnet.port
system.l2.mem_side = system.memnet.port
# ----------------------
# Connect the L2 cache and clusters together
# and the load balancer
# ----------------------
for cpu in simd_cpus:
    cpu.addPrivateSplitL1Caches(L1_ICache(), L1_DCache())
    cpu.mem = cpu.dcache
    cpu.connectMemPorts(system.toL2net)
    #cpu.connectMemPorts(system.membus)
    cpu.dtb.size = options.DTBsize
    cpu.itb.size = options.ITBsize
for cpu in oo_cpus:
    cpu.addPrivateSplitL1Caches(L1_ICache(), L1_DCache())
    cpu.mem = cpu.dcache
    cpu.connectMemPorts(system.toL2net)
    #cpu.connectMemPorts(system.membus)
    cpu.dtb.size = options.DTBsize
    cpu.itb.size = options.ITBsize
# ----------------------
# Define the root
# ----------------------
root = Root(system = system, globals = globals)
# --------------------
# Pick the correct Splash2 Benchmarks
# ====================
if options.benchmark == 'FFT':
    root.workload = FFT()
    perThreadFootprint = 8
elif options.benchmark == 'FILTER':
    root.workload = FILTER()
    perThreadFootprint = 8
elif options.benchmark == 'MEMGPUOVERHEAD':
    root.workload = MEMGPUOVERHEAD()
    perThreadFootprint = 8
elif options.benchmark == 'MEMGPU':
    root.workload = MEMGPU()
    perThreadFootprint = 8
elif options.benchmark == 'MEMCPU':
    root.workload = MEMCPU()
    perThreadFootprint = 8
elif options.benchmark == 'MERGESORT':
    root.workload = MERGESORT()
    # changes from 2x4 to Nx4
    perThreadFootprint = 64
elif options.benchmark == 'VARIANCE':
    root.workload = VARIANCE()
    perThreadFootprint = 20
elif options.benchmark == 'SHORTESTPATH':
    root.workload = SHORTESTPATH()
    perThreadFootprint = 12
elif options.benchmark == 'LU':
    root.workload = LU()
    perThreadFootprint = 8
elif options.benchmark == 'NEEDLE':
    root.workload = NEEDLE()
    perThreadFootprint = 12
elif options.benchmark == 'HOTSPOT':
    root.workload = HOTSPOT()
    perThreadFootprint = 8
elif options.benchmark == 'KMEANS':
    root.workload = KMEANS()
    # assuming a dimensionality of 20
    # = dim*4+4
    perThreadFootprint = 84
elif options.benchmark == 'SVM':
    root.workload = SVM()
    # assuming a dimension of 20
    # = dim*2*4
    perThreadFootprint = 160
elif options.benchmark == 'MTSMP':
    root.workload = MTSMP()
    perThreadFootprint = 4
elif options.benchmark == 'MISC':
    root.workload = MISC()
    perThreadFootprint = 8
else:
    panic("The --benchmark environment variable was set to something" \
          + " improper.\nUse FOO\n")
if options.maxThreadBlockSize == -1:
    dcacheSize = options.DcacheSize
    dcacheSize = int(dcacheSize[:-2]) * 1024
    options.maxThreadBlockSize = dcacheSize / perThreadFootprint
    if options.maxThreadBlockSize <= 0:
        options.maxThreadBlockSize = 1
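# Worked example (illustrative, not part of the original config): with the
# script's --DcacheSize=16kB and a MEMCPU/MEMGPU footprint of 8 bytes per
# thread, a value of -1 here would give maxThreadBlockSize = 16384 / 8 = 2048.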
root.globals.maxThreadBlockSize = options.maxThreadBlockSize
# --------------------
# Assign the workload to the cpus
# ====================
# workload
for cpu in simd_cpus:
    cpu.workload = root.workload
for cpu in oo_cpus:
    cpu.workload = root.workload
# ----------------------
# Run the simulation
# ----------------------
if options.timing or options.detailed:
    root.system.mem_mode = 'timing'
# instantiate configuration
m5.instantiate(root)
# simulate until program terminates
try:
    if options.maxtick:
        exit_event = m5.simulate(options.maxtick)
    else:
        exit_event = m5.simulate(m5.MaxTick)
    print 'Exiting @ tick', m5.curTick(), 'because', exit_event.getCause()
except:
    print 'Failed @ tick', m5.curTick()

if not options.footprint == '':
    os.system('touch ' + options.footprint)
echo "MV5 Benchmarks"
now=$(date +"%Y_%m_%d_%H:%M")
pwd
if [ "$1" = "true" ]
then
    build/ALPHA_SE/m5.fast configs/fractal/hetero.py --rootdir=. \
        --bindir=./api/binsimd_blocktask/ --simd=True --CpuFrequency=1.00GHz \
        --GpuFrequency=1.00GHz --DcacheAssoc=4 --SIMDDcacheAssoc=512 --DcacheBanks=4 \
        --DcacheBlkSize=32 --SIMDDcacheBlkSize=32 --DcacheHWPFdegree=1 \
        --DcacheHWPFpolicy=none --DcacheLookupLatency=2ns \
        --DcachePropagateLatency=2ns --DcacheRepl=LRU --DcacheSize=16kB \
        --SIMDDcacheSize=16kB --IcacheAssoc=4 --IcacheBlkSize=32 --IcacheHWPFdegree=1 \
        --IcacheHWPFpolicy=none --IcacheSize=16kB --IcacheUseCacti=False \
        --L2NetBandWidthMbps=456000 --L2NetFrequency=300MHz \
        --L2NetPortOutQueueLength=4 --L2NetRouterBufferSize=256 --L2NetRoutingLatency=1ns \
        --L2NetTimeOfFlight=13t --L2NetType=FullyNoC --L2NetWormHole=True \
        --MemNetBandWidthMbps=$MemoryNetBandWidth \
        --MemNetFrequency=$MemNetFrequ \
        --MemNetPortOutQueueLength=$NetPortQueueLength \
        --MemNetRouterBufferSize=$MemNetRouterBufferSize \
        --MemNetTimeOfFlight=130t --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 \
        --l2HWPFDataOnly=False --l2HWPFdegree=1 --l2HWPFpolicy=none \
        --l2MSHRs=64 --l2Repl=LRU \
        --l2Size=$L2CacheSize --l2TgtsPerMSHR=32 --l2lookupLatency=2ns \
        --l2propagateLatency=12ns --l2tol1ratio=2 --localAddrPolicy=1 \
        --maxThreadBlockSize=0 --numSWTCs=2 --physmemLatency=$PhyMemLatency \
        --physmemSize=1024MB \
        --portLookup=0 --protocol=mesi --randStackOffset=True \
        --restoreContextDelay=0 --retryDcacheDelay=10 --stackAlloc=3 \
        --switchOnDataAcc=True --numcpus=$2 --warpSize=$4 --numHWTCs=$3 --benchmark=$5
else
    build/ALPHA_SE/m5.fast configs/fractal/fractal_smp.py --rootdir=. \
        --bindir=./api/binsimd_blocktask/ --simd=False --CpuFrequency=1.00GHz \
        --DcacheAssoc=4 --DcacheBanks=4 --DcacheBlkSize=32 --DcacheHWPFdegree=1 \
        --DcacheHWPFpolicy=none --DcacheLookupLatency=2ns \
        --DcachePropagateLatency=2ns --DcacheRepl=LRU --DcacheSize=16kB \
        --IcacheAssoc=4 --IcacheBlkSize=32 --IcacheHWPFdegree=1 \
        --IcacheHWPFpolicy=none --IcacheSize=16kB --IcacheUseCacti=False \
        --L2NetBandWidthMbps=456000 --L2NetFrequency=300MHz \
        --L2NetPortOutQueueLength=4 --L2NetRouterBufferSize=256 \
        --L2NetRoutingLatency=1ns --L2NetTimeOfFlight=13t \
        --L2NetType=FullyNoC --L2NetWormHole=True \
        --MemNetBandWidthMbps=$MemoryNetBandWidth \
        --MemNetFrequency=$MemNetFrequ \
        --MemNetPortOutQueueLength=$NetPortQueueLength \
        --MemNetRouterBufferSize=$MemNetRouterBufferSize \
        --MemNetTimeOfFlight=130t --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 \
        --l2HWPFDataOnly=False --l2HWPFdegree=1 --l2HWPFpolicy=none \
        --l2MSHRs=64 --l2Repl=LRU --l2Size=$L2CacheSize --l2TgtsPerMSHR=32 \
        --l2lookupLatency=2ns --l2propagateLatency=12ns \
        --l2tol1ratio=2 --localAddrPolicy=1 --maxThreadBlockSize=0 \
        --numSWTCs=2 --physmemLatency=$PhyMemLatency --physmemSize=1024MB \
        --portLookup=0 --protocol=mesi --randStackOffset=True --restoreContextDelay=0 \
        --retryDcacheDelay=10 --stackAlloc=3 --switchOnDataAcc=True \
        --benchmark=MEMCPU
fi
Typically, I only change numHWTCs, numcpus, and warpSize.
Master CPU: 2 O3 CPUs.
Slave: one SIMD core with 16 HW threads, numSWTCs=2.
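For instance, an invocation matching that setup might look like the line
below (the script name is only illustrative; the positional parameters are
read off the command lines above as $1 = simd flag, $2 = numcpus,
$3 = numHWTCs, $4 = warpSize, $5 = benchmark):
./run_mv5.sh true 1 16 16 MEMGPU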
I found that if the number of SIMD cores is 2, this kind of abnormal
exit goes away.
2. Dcache competition: in the memcpy operation, two memory references are
unavoidable. So if one thread loads a cache line and another thread (in
another warp) evicts this line, the efficiency will also be impaired.
In my case, I don't want the cache or memory performance to be a
bottleneck, so I wonder whether a fully associative L1 cache is allowed
in MV5 and how to configure it.
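(For what it's worth: with a 16 kB D-cache and 32-byte lines there are
16384 / 32 = 512 lines, so if MV5 accepts an associativity equal to the
number of lines, --DcacheAssoc=512 (the same value the script above already
passes as --SIMDDcacheAssoc=512) should behave as a fully associative cache.
I am not sure MV5 supports this, so treat it as a guess.)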
3. Cache coherence: the cache block is 128 bytes in the shared L2 (32 bytes
in the L1 Dcache). If 4 different threads touch the same L2 cache line,
does this situation make the cache inefficient because of the MSI
coherence protocol? How could I confirm this kind of information in
MV5?
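(A concrete case of what I mean: with 32-byte L1 lines inside a 128-byte L2
line, the chunks handled by threads 0-3 under a 32-byte stride all map to
one L2 line, so if refills or coherence are handled at the 128-byte
granularity, writes from different cores to those chunks could make that
line bounce between them. This is only my reasoning, not something I have
confirmed in MV5.)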
Of course, I don't think the memory bandwidth is the bottleneck right
now, because performance still improves with a different stride.
There are quite a few parameters and terms I cannot figure out.
Here they are:
--DcacheLookupLatency: this seems to be the Dcache tag lookup, but I
don't know how this parameter contributes to the L1 latency.
--DcachePropagateLatency: I don't understand this one.
# Number of cycles this cpu bares with n batches waiting for Dcache misses:
system.cpus_smp0.batches_waitDcache       278887000.000000000000
system.cpus_smp0.batches_waitDcache_0     192790000.000000000000
system.cpus_smp0.batches_waitDcache_1      82871000.000000000000
system.cpus_smp0.batches_waitDcache_2       3226000.000000000000
system.cpus_smp0.batches_waitDcache_3             0.000000000000
system.cpus_smp0.batches_waitDcache_4             0.000000000000
system.cpus_smp0.batches_waitDcache_5             0.000000000000
system.cpus_smp0.batches_waitDcache_6             0.000000000000
system.cpus_smp0.batches_waitDcache_7             0.000000000000
system.cpus_smp0.batches_waitDcache_8             0.000000000000
system.cpus_smp0.batches_waitDcache_9             0.000000000000
system.cpus_smp0.batches_waitDcache_10            0.000000000000
system.cpus_smp0.batches_waitDcache_11            0.000000000000
system.cpus_smp0.batches_waitDcache_12            0.000000000000
system.cpus_smp0.batches_waitDcache_13            0.000000000000
system.cpus_smp0.batches_waitDcache_14            0.000000000000
system.cpus_smp0.batches_waitDcache_15            0.000000000000
system.cpus_smp0.batches_waitDcache_16            0.000000000000
What does "waitDcache_n" the n mean? I can not figure out what the n
batches are. Because in this config, 2 OOO master cores, and 4 SIMD
Cores, 8 threads per core, and warpsize is 4.
Thank you for your great help. Jiayuan.
Best regards.
Xiaonan
Here is my puzzle:
With 8 threads per warp, if one thread (maybe more than one) encounters
a Dcache miss, all the other threads stay inactive until all the cache
lines are ready. If so, a larger warp size means a longer wait for the
cache lines to become ready. Is this correct in MV5?
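(A rough way to see the effect, assuming the all-lanes-wait behavior above:
if each lane misses independently with probability p, a warp of W lanes
stalls on at least one miss with probability 1 - (1 - p)^W. For p = 0.1
that is 0.1 for W = 1, but 1 - 0.9^8 ≈ 0.57 for W = 8, so wider warps wait
much more often.)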
Sorry, I found it in another mail.
Thanks, and good night.
Xiaonan
Thanks.
I am also trying to enlarge the memory bus width. I wonder how much
performance can be gained through multithreading and multiple SIMD
cores.
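(A back-of-envelope check, treating the memory network as the only limiter:
a 16 KB memcpy moves at least 32 KB = 262144 bits across it, once as reads
and once as writes, so at a bandwidth of B Mbps the transfer alone needs at
least 262144 / (B * 10^6) seconds; at the 456000 Mbps the script uses for
the L2 network that is roughly 0.6 microseconds, a lower bound to compare
against the measured runtime.)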
Do you have any suggestions?
Many thanks,
Xiaonan
On Thu, Jan 12, 2012 at 9:27 PM, Jiayuan Meng <meng.j...@gmail.com> wrote:
> My guess is that: 1) the benchmark is memory-intensive 2) the cache
> hierarchy somehow becomes the bottleneck.
>
> In fact, it may not be surprising---the memcpy benchmark is basically
> streaming data from the memory to the CPU and back to memory. Because there
Another question: do the threads in the same warp take 8*32 bytes
consecutively?
I mean, thread0 takes the first 32 bytes, thread1 takes the next 32 bytes, and so on.
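(If they do, then with warpSize=8 and a 32-byte stride one warp covers
8 * 32 = 256 consecutive bytes, i.e. eight L1 lines but only two 128-byte
L2 lines, which ties back to the coherence question above.)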
Thanks
Xiaonan
Thanks,
Jiayuan
You can also do "Ctrl-C" and then look at the stats, to see if there are
a lot of retries or cancellations of cache transitions; that usually
indicates that memory is the bottleneck.
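For example, something along these lines might do (assuming the stats end
up in m5out/stats.txt as in a stock m5 run; the exact stat names may
differ):
grep -iE 'retr|cancel' m5out/stats.txt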
You can further debug it by simplifying the cache hierarchy, or maybe
even by just using the physical memory, to make sure whether the problem
is there.
Another possibility is to use the example benchmarks, like FILTER, as
a starting point, which I've tried personally up to 256 threads/core.
Then, see if you can run it on your system configuration.
It's true that more threads/SIMD core does not always lead to better
performance...
Yes, you can do that. The smp_thread_create() API call triggers the
emulated system calls, which pick a tCPU and a thread context to start
that thread. You can insert scheduling policies there.
The exact code in the simulator can be found at
src/smp/smp_syscall_emul.cc, see SmpOS::Smp_sys_ConsumeFreeTC()
Jiayuan