Re: Could u please help checking the bottleneck?


daniel tian

Jan 7, 2012, 3:20:03 AM
to Jiayuan Meng, MV5sim
Hi, Jiayuan:
What does this mean? MV5 seems to exit in an unfinished state.

Finished memset in size 16384 bytes.
Thread: 16
Last tick: 826087000
Exiting @ tick 9223372036854775807 because simulate() limit reached

My configuration is two master O3 CPUs plus one additional SIMD core
to run the memset & memcpy functions.
Thanks
Xiaonan

On Fri, Jan 6, 2012 at 1:50 AM, daniel tian <daniel...@gmail.com> wrote:
> Hi, Jiayuan:
>     Sorry for the late reply, and thank you for your meticulous
> inspection of memcpy.cpp. The loop iteration is indeed a bug.
> The O3 CPU is the most efficient one. However, the SIMDTimingCPU is based
> on TimingSimpleCPU, which has much lower performance than the O3CPU.
> In my benchmark, the memset/memcpy size is 16 KB. Performance (in ticks):
> O3: 218780000
> SIMDTimingCPU (sequential, 1 thread): 619019000
>
> You see, the O3 is almost 3x faster than the SIMDTimingCPU.
>
>
> For SIMD multithreads:
> I tried several kinds of configurations:
> Warpsize:ThreadsPerCore =
> 1:1, 2:2, 4:4, 8:8, 16:16,
> 1:2, 2:4, 4:8  8:16
>
> And different granularities (32, 64, 128, 256, 512, 1024, 2048, 4096,
> 8192, 16384) for each config. The performance data is attached.
> Please check.
>
> I will try more tomorrow morning.
> There may be some bug in the cache component.
> Thank you again.
> Best regards.
> Xiaonan
>
>
> On Fri, Jan 6, 2012 at 12:31 AM, Jiayuan Meng <meng.j...@gmail.com> wrote:
>> I have the same doubt that the problem is in cache. Please let me know what
>> you find out with 1 SIMD core (try both 1 thread/core and 8 threads/core).
>>
>>
>> On Thu, Jan 5, 2012 at 4:14 PM, daniel tian <daniel...@gmail.com> wrote:
>>>
>>> Sorry, it failed to send this mail to the MV5 group. The attachments are
>>> larger than 4M.
>>>
>>> ---------- Forwarded message ----------
>>> From: daniel tian <daniel...@gmail.com>
>>> Date: Thu, Jan 5, 2012 at 4:10 PM
>>> Subject: Could u please help checking the bottleneck?
>>> To: MV5sim <mv5...@googlegroups.com>
>>>
>>>
>>> hi, Jiayuan:
>>>     I just want to simplify the problem.
>>>     I tried two different configurations with 8 SIMD cores (STRIDE is
>>> 32 bytes): warpsize=1,
>>> ThreadsPerCore=1 vs. warpsize=8, ThreadsPerCore=8.
>>>     Here are the ticks for 16 KB memset and memcpy: 2363267000 vs.
>>> 374988000.
>>>     Here are the schedule overhead ticks for each task: 6993000 vs. 12919000.
>>>     Both are much slower than the O3 version's 16 KB
>>> memset/memcpy on the CPU side.
>>>
>>>     I guess the problem is in the cache (cache misses, I am not sure). I
>>> just can't figure out why
>>> eight SIMD cores, each taking care of part of the memory task, slow down
>>> the performance.
>>>     The statistics files are attached in the email.
>>> Thanks
>>> Xiaonan
>>
>>

Jiayuan Meng

Jan 7, 2012, 8:46:11 PM
to mv5...@googlegroups.com
Hmm, that means the event-driven simulator has no more events to run, but the program has not finished yet... If you tell me your configuration I can take a look. 

daniel tian

Jan 7, 2012, 8:56:21 PM
to mv5...@googlegroups.com
Hi, here is my configuration.
Two O3 CPUs are the masters, and the additional SIMD cores are the slaves.
I just copied the configuration "fractal_smp.py" and made tiny changes.

Many thanks.
Xiaonan

#

import frCommonsParameter
# --------------------------
# system specific parameters
frCommonsParameter.parser.add_option("--physmemLatency", default='50ns')
frCommonsParameter.parser.add_option("--threadIterLatency", default='10ns')
frCommonsParameter.parser.add_option("--maxThreadBlockSize", type="int", default=0)
frCommonsParameter.parser.add_option("--enableBlockThreads", default=True)
frCommonsParameter.parser.add_option("--enableStealThrdBlock", default=True)
frCommonsParameter.parser.add_option("--enbSubdivThrdBlock", default="False")

from frCommons import *
from StaticOptions import *

if options.enbSubdivThrdBlock == "True":
    _subdivThrdBlock = True
elif options.enbSubdivThrdBlock == "False":
    _subdivThrdBlock = False
else:
    print "enbSubdivThrdBlock is either True or False"
    sys.exit(1)

globals.enableSubdivThreadBlock = _subdivThrdBlock
globals.hwThreadIterLatency = options.threadIterLatency
globals.enableBlockThreads = options.enableBlockThreads
globals.enableStealThrdBlock = options.enableStealThrdBlock

# ----------------------
# Define the cores
# ----------------------
# busFrequency = Frequency(options.frequency)

if options.simd == "True":
    # regroup policy
    if options.SIMTRegroupPolicy == "None":
        _regroupPolicy = 0
    elif options.SIMTRegroupPolicy == "LatAwareAdapt":
        _regroupPolicy = 5
    elif options.SIMTRegroupPolicy == "SlipAdapt":
        _regroupPolicy = 6
    elif options.SIMTRegroupPolicy == "StallSplit":
        _regroupPolicy = 2
    elif options.SIMTRegroupPolicy == "AggressSplit":
        _regroupPolicy = 1
    elif options.SIMTRegroupPolicy == "AdaptSplit":
        _regroupPolicy = 7
    else:
        assert False
    # schedule policy
    if options.SIMTSchedPolicy == "RR":
        _schedPolicy = 1
    elif options.SIMTSchedPolicy == "Wider":
        _schedPolicy = 2
    elif options.SIMTSchedPolicy == "Shallower":
        _schedPolicy = 4
    elif options.SIMTSchedPolicy == "ShallowerWider":
        _schedPolicy = 5
    elif options.SIMTSchedPolicy == "WiderShallower":
        _schedPolicy = 6
    elif options.SIMTSchedPolicy == "SyncWarps":
        _schedPolicy = 7
    else:
        assert False
    # score strategy
    if options.SIMTScoreStrategy == "NoWait":
        _scoreStrategy = 0
    elif options.SIMTScoreStrategy == "WaitMemOK":
        _scoreStrategy = 1
    elif options.SIMTScoreStrategy == "AllValid":
        _scoreStrategy = 2
    else:
        assert False
    # merge strategy
    if options.SIMTMergeStrategy == "TowardsSlowest":
        _mergeStrategy = 0
    elif options.SIMTMergeStrategy == "TowardsClosest":
        _mergeStrategy = 1
    else:
        assert False
    # split strategy
    if options.SIMTSplitStrategy == "SplitFIFO":
        _splitStrategy = 0
    elif options.SIMTSplitStrategy == "SplitShallowest":
        _splitStrategy = 1
    else:
        assert False

    # adapt mode
    if options.SIMTadaptMode == "NoAdapt":
        _adaptMode = 0
    elif options.SIMTadaptMode == "AdaptWidth":
        _adaptMode = 1
    elif options.SIMTadaptMode == "AdaptDepth":
        _adaptMode = 2
    elif options.SIMTadaptMode == "AdaptDFSBFS":
        _adaptMode = 3
    elif options.SIMTadaptMode == "AdaptHillClimb":
        _adaptMode = 4
    elif options.SIMTadaptMode == "AdaptBFSDFS":
        _adaptMode = 5
    else:
        assert False

    # adapt step depth
    if options.SIMTadaptStepDepth == "Step1":
        _adaptStepDepth = 1
    elif options.SIMTadaptStepDepth == "Log2":
        _adaptStepDepth = 2
    else:
        assert False

    # adapt step width
    if options.SIMTadaptStepWidth == "Step1":
        _adaptStepWidth = 1
    elif options.SIMTadaptStepWidth == "Log2":
        _adaptStepWidth = 2
    else:
        assert False

    # prophet mode
    if options.prophetMode == "NoProphecies":
        _prophetMode = 0
    elif options.prophetMode == "PropheciesNorm":
        _prophetMode = 1
    elif options.prophetMode == "PropheciesLocGlb":
        _prophetMode = 2
    else:
        assert False

    simd_cpus = [SIMDTimingCPU(cpu_id = i,
                               loopBypass = options.SIMTLoopBypass,
                               schedPolicy = _schedPolicy,
                               preemptSched = options.SIMTPreemptSched,
                               regroupPolicy = _regroupPolicy,
                               scoreStrategy = _scoreStrategy,
                               mergeStrategy = _mergeStrategy,
                               splitStrategy = _splitStrategy,
                               alpha = options.SIMTProfileFactor,
                               adaptMode = _adaptMode,
                               adaptStepDepth = _adaptStepDepth,
                               adaptStepWidth = _adaptStepWidth,
                               adaptSampleInterval = options.SIMTadaptSampleIntv,
                               tasksPerIntvPerThrd = options.SIMTtasksPerIntvPerThrd,
                               sampleIntervalTicks = options.SIMTSampleInterval,
                               numberOfHWThreads = options.numHWTCs,
                               numberOfSWThreads = options.numSWTCs,
                               switchOnDataAcc = options.switchOnDataAcc,
                               prophetMode = _prophetMode,
                               prophDepth = options.prophDepth,
                               DcacheLineSize = options.DcacheBlkSize,
                               restoreContextDelay = options.restoreContextDelay,
                               retryDcacheDelay = options.retryDcacheDelay,
                               clock = options.GpuFrequency,
                               maxSlips = options.SIMTMaxSlips,
                               minReadyWarps = options.SIMTMinReadyWarps,
                               maxBatches = options.SIMTMaxBatches,
                               lowUtlThreshold = options.SIMTLowUtlThreshold,
                               highUtlThreshold = options.SIMTHighUtlThreshold,
                               numLanes = options.warpSize)
                 for i in xrange(options.numcpus)]

elif options.simd == "False":
    simd_cpus = [DerivO3CPU(cpu_id = i,
                            clock = options.CpuFrequency)
                 for i in xrange(options.numcpus)]
else:
    raise Exception("simd is not specified")

oo_cpus = [DerivO3CPU(cpu_id = i,
                      clock = options.CpuFrequency)
           for i in xrange(2)]
for i in xrange(2):
    oo_cpus[i].LQEntries = 32
    oo_cpus[i].SQEntries = 32

# ----------------------
# Create a system, and add system wide objects
# ----------------------

# memory = DRAM(range=options.physmemSize, latency=options.physmemLatency),

system = System(cpus_smp = simd_cpus,
                cpus_master = oo_cpus,
                physmem = PhysicalMemory(range = options.physmemSize,
                                         latency = options.physmemLatency),
                memnet = BusConn(clock = Frequency(options.MemNetFrequency),
                                 bandwidth_Mbps = options.MemNetBandWidthMbps)
                #memnet = FullyNoC(clock = Frequency(options.MemNetFrequency),
                #                  wireModel = 1,
                #                  maxPortOutQueueLen = options.MemNetPortOutQueueLength,
                #                  timeOfFlight = options.MemNetTimeOfFlight,
                #                  maxPortInBufferSize = options.MemNetRouterBufferSize,
                #                  linkQueueSize = options.MemNetRouterBufferSize,
                #                  bandwidth_Mbps = options.MemNetBandWidthMbps)
                )

# ----------------------
# Connect the L2 cache and memory together
# ----------------------

system.l2 = L2()

if options.L2NetType == 'FullyNoC':
    system.toL2net = FullyNoC(clock = Frequency(options.L2NetFrequency),
                              wireModel = 1,
                              maxPortOutQueueLen = options.L2NetPortOutQueueLength,
                              timeOfFlight = options.L2NetTimeOfFlight,
                              maxPortInBufferSize = options.L2NetRouterBufferSize,
                              linkQueueSize = options.L2NetRouterBufferSize,
                              bandwidth_Mbps = options.L2NetBandWidthMbps)
    system.l2.cpu_side = system.toL2net.port

elif options.L2NetType == 'Mesh2D':
    layoutFile = options.rootdir + "/configs/fractal/temp/tol2mesh-cpus-" \
                 + str(options.numcpus) + '.layout'
    system.toL2net = Mesh2D(clock = Frequency(options.L2NetFrequency),
                            wireModel = 1,
                            routingLatency = options.L2NetRoutingLatency,
                            maxPortOutQueueLen = options.L2NetPortOutQueueLength,
                            timeOfFlight = options.L2NetTimeOfFlight,
                            maxPortInBufferSize = options.L2NetRouterBufferSize,
                            linkQueueSize = options.L2NetRouterBufferSize,
                            bandwidth_Mbps = options.L2NetBandWidthMbps,
                            wormhole = options.L2NetWormHole,
                            geography = layoutFile)
    staticOptionsFile = options.rootdir + "/configs/fractal/temp/tol2mesh-cpus-" \
                        + str(options.numcpus) + '.options'
    staticOptions = StaticOptions(staticOptionsFile)
    for n in range(staticOptions.get('numL2CpuSidePorts')):
        system.l2.cpu_side = system.toL2net.port

#system.toL2bus = Bus(clock = busFrequency, width = L2BusWidth)


system.physmem.port = system.memnet.port

system.l2.mem_side = system.memnet.port


# ----------------------
# Connect the L2 cache and clusters together
# and the load balancer
# ----------------------
for cpu in simd_cpus:
    cpu.addPrivateSplitL1Caches(L1_ICache(), L1_DCache())
    cpu.mem = cpu.dcache
    cpu.connectMemPorts(system.toL2net)
    #cpu.connectMemPorts(system.membus)
    cpu.dtb.size = options.DTBsize
    cpu.itb.size = options.ITBsize
for cpu in oo_cpus:
    cpu.addPrivateSplitL1Caches(L1_ICache(), L1_DCache())
    cpu.mem = cpu.dcache
    cpu.connectMemPorts(system.toL2net)
    #cpu.connectMemPorts(system.membus)
    cpu.dtb.size = options.DTBsize
    cpu.itb.size = options.ITBsize

# ----------------------
# Define the root
# ----------------------

root = Root(system = system, globals = globals)

# --------------------
# Pick the correct Splash2 Benchmarks
# ====================
if options.benchmark == 'FFT':
    root.workload = FFT()
    perThreadFootprint = 8
elif options.benchmark == 'FILTER':
    root.workload = FILTER()
    perThreadFootprint = 8
elif options.benchmark == 'MEMGPUOVERHEAD':
    root.workload = MEMGPUOVERHEAD()
    perThreadFootprint = 8
elif options.benchmark == 'MEMGPU':
    root.workload = MEMGPU()
    perThreadFootprint = 8
elif options.benchmark == 'MEMCPU':
    root.workload = MEMCPU()
    perThreadFootprint = 8
elif options.benchmark == 'MERGESORT':
    root.workload = MERGESORT()
    # changes from 2x4 to Nx4
    perThreadFootprint = 64
elif options.benchmark == 'VARIANCE':
    root.workload = VARIANCE()
    perThreadFootprint = 20
elif options.benchmark == 'SHORTESTPATH':
    root.workload = SHORTESTPATH()
    perThreadFootprint = 12
elif options.benchmark == 'LU':
    root.workload = LU()
    perThreadFootprint = 8
elif options.benchmark == 'NEEDLE':
    root.workload = NEEDLE()
    perThreadFootprint = 12
elif options.benchmark == 'HOTSPOT':
    root.workload = HOTSPOT()
    perThreadFootprint = 8
elif options.benchmark == 'KMEANS':
    root.workload = KMEANS()
    # assuming a dimensionality of 20
    # = dim*4+4
    perThreadFootprint = 84
elif options.benchmark == 'SVM':
    root.workload = SVM()
    # assuming a dimensionality of 20
    # = dim*2*4
    perThreadFootprint = 160
elif options.benchmark == 'MTSMP':
    root.workload = MTSMP()
    perThreadFootprint = 4
elif options.benchmark == 'MISC':
    root.workload = MISC()
    perThreadFootprint = 8
else:
    panic("The --benchmark environment variable was set to something" \
          + " improper.\nUse FOO\n")

if options.maxThreadBlockSize == -1:
    dcacheSize = options.DcacheSize
    dcacheSize = int(dcacheSize[:-2]) * 1024
    options.maxThreadBlockSize = dcacheSize / perThreadFootprint
    if options.maxThreadBlockSize <= 0:
        options.maxThreadBlockSize = 1

root.globals.maxThreadBlockSize = options.maxThreadBlockSize

# --------------------
# Assign the workload to the cpus
# ====================

# workload
for cpu in simd_cpus:
    cpu.workload = root.workload

for cpu in oo_cpus:
    cpu.workload = root.workload

# ----------------------
# Run the simulation
# ----------------------

if options.timing or options.detailed:
    root.system.mem_mode = 'timing'

# instantiate configuration
m5.instantiate(root)

# simulate until program terminates
try:
    if options.maxtick:
        exit_event = m5.simulate(options.maxtick)
    else:
        exit_event = m5.simulate(m5.MaxTick)
    print 'Exiting @ tick', m5.curTick(), 'because', exit_event.getCause()
except:
    print 'Failed @ tick', m5.curTick()

if not options.footprint == '':
    os.system('touch ' + options.footprint)

Jiayuan Meng

Jan 8, 2012, 12:04:18 AM
to mv5...@googlegroups.com
What's your commandline?

Two common usages will cause this to happen:
1. there are fewer than two software thread contexts in the whole system (any thread context in an OO core is a software TC; for SIMD cores, you can set it with numSWTCs). We need one software thread context for the main thread, and another for the threading library. Of course, if you create more software threads, you'll need more software thread contexts.

2. there is no hardware thread context but you are using SIMD threads. SIMD threads need hardware thread contexts (numHWTCs). 

As you know, one way to quickly check is to look at your config.ini after you run the python script, and see whether the system is built the way you expect. Can you also check whether the simulator stops before or after it creates the SIMD threads? To see that, use the trace flag "SIMD".
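That check can be scripted. The sketch below sums thread contexts over cpu sections of a config.ini; the section and key names are assumptions based on the stats shown elsewhere in this thread, not a documented MV5 schema, so adjust them to what your generated file actually contains.

```python
# Hypothetical sanity check: count hardware/software thread contexts in a
# generated config.ini. Section/key names are illustrative assumptions.
import configparser

SAMPLE = """
[system.cpus_smp0]
numberOfHWThreads = 16
numberOfSWThreads = 2

[system.cpus_master0]
numberOfHWThreads = 0
numberOfSWThreads = 1
"""

def count_thread_contexts(text):
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    hw = sw = 0
    for section in cfg.sections():
        if section.startswith("system.cpus"):
            hw += cfg.getint(section, "numberOfHWThreads")
            sw += cfg.getint(section, "numberOfSWThreads")
    return hw, sw

hw, sw = count_thread_contexts(SAMPLE)
print(hw, sw)  # -> 16 3
# Per the rule above: at least two software TCs are needed
# (one for the main thread, one for the threading library).
assert sw >= 2
```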

Thanks,

Jiayuan

daniel tian

Jan 8, 2012, 1:08:39 AM
to mv5...@googlegroups.com
Hi, Jiayuan:
Thank you for the information.
Here is the command:
#!/bin/sh
#$1: isSIMD
#$2: if $1 is true, $2 means num of SIMD Cores.
#$3: if $1 is true, $3 means num of threads for each SIMD core
#$4: warpsize
#$5: benchmark: MEMGPU/MEMGPUOVERHEAD
NetPortQueueLength=16
MemoryNetBandWidth=456000
MemNetFrequ=533MHz
MemNetRouterBufferSize=2048
PhyMemLatency=25ns
L2CacheSize=4096kB

echo "MV5 Benchmarks"
now=$(date +"%Y_%m_%d_%H:%M")
pwd
if [ "$1" = true ]
then
    build/ALPHA_SE/m5.fast configs/fractal/hetero.py --rootdir=. \
        --bindir=./api/binsimd_blocktask/ --simd=True \
        --CpuFrequency=1.00GHz --GpuFrequency=1.00GHz \
        --DcacheAssoc=4 --SIMDDcacheAssoc=512 --DcacheBanks=4 \
        --DcacheBlkSize=32 --SIMDDcacheBlkSize=32 --DcacheHWPFdegree=1 \
        --DcacheHWPFpolicy=none --DcacheLookupLatency=2ns \
        --DcachePropagateLatency=2ns --DcacheRepl=LRU --DcacheSize=16kB \
        --SIMDDcacheSize=16kB \
        --IcacheAssoc=4 --IcacheBlkSize=32 --IcacheHWPFdegree=1 \
        --IcacheHWPFpolicy=none --IcacheSize=16kB --IcacheUseCacti=False \
        --L2NetBandWidthMbps=456000 --L2NetFrequency=300MHz \
        --L2NetPortOutQueueLength=4 --L2NetRouterBufferSize=256 \
        --L2NetRoutingLatency=1ns --L2NetTimeOfFlight=13t \
        --L2NetType=FullyNoC --L2NetWormHole=True \
        --MemNetBandWidthMbps=$MemoryNetBandWidth \
        --MemNetFrequency=$MemNetFrequ \
        --MemNetPortOutQueueLength=$NetPortQueueLength \
        --MemNetRouterBufferSize=$MemNetRouterBufferSize \
        --MemNetTimeOfFlight=130t \
        --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 --l2HWPFDataOnly=False \
        --l2HWPFdegree=1 --l2HWPFpolicy=none --l2MSHRs=64 --l2Repl=LRU \
        --l2Size=$L2CacheSize --l2TgtsPerMSHR=32 --l2lookupLatency=2ns \
        --l2propagateLatency=12ns --l2tol1ratio=2 --localAddrPolicy=1 \
        --maxThreadBlockSize=0 --numSWTCs=2 \
        --physmemLatency=$PhyMemLatency --physmemSize=1024MB \
        --portLookup=0 --protocol=mesi --randStackOffset=True \
        --restoreContextDelay=0 --retryDcacheDelay=10 --stackAlloc=3 \
        --switchOnDataAcc=True \
        --numcpus=$2 --warpSize=$4 --numHWTCs=$3 --benchmark=$5
else
    build/ALPHA_SE/m5.fast configs/fractal/fractal_smp.py --rootdir=. \
        --bindir=./api/binsimd_blocktask/ --simd=False \
        --CpuFrequency=1.00GHz \
        --DcacheAssoc=4 --DcacheBanks=4 --DcacheBlkSize=32 \
        --DcacheHWPFdegree=1 --DcacheHWPFpolicy=none \
        --DcacheLookupLatency=2ns --DcachePropagateLatency=2ns \
        --DcacheRepl=LRU --DcacheSize=16kB \
        --IcacheAssoc=4 --IcacheBlkSize=32 --IcacheHWPFdegree=1 \
        --IcacheHWPFpolicy=none --IcacheSize=16kB --IcacheUseCacti=False \
        --L2NetBandWidthMbps=456000 --L2NetFrequency=300MHz \
        --L2NetPortOutQueueLength=4 --L2NetRouterBufferSize=256 \
        --L2NetRoutingLatency=1ns --L2NetTimeOfFlight=13t \
        --L2NetType=FullyNoC --L2NetWormHole=True \
        --MemNetBandWidthMbps=$MemoryNetBandWidth \
        --MemNetFrequency=$MemNetFrequ \
        --MemNetPortOutQueueLength=$NetPortQueueLength \
        --MemNetRouterBufferSize=$MemNetRouterBufferSize \
        --MemNetTimeOfFlight=130t \
        --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 --l2HWPFDataOnly=False \
        --l2HWPFdegree=1 --l2HWPFpolicy=none --l2MSHRs=64 --l2Repl=LRU \
        --l2Size=$L2CacheSize --l2TgtsPerMSHR=32 --l2lookupLatency=2ns \
        --l2propagateLatency=12ns --l2tol1ratio=2 --localAddrPolicy=1 \
        --maxThreadBlockSize=0 --numSWTCs=2 \
        --physmemLatency=$PhyMemLatency --physmemSize=1024MB \
        --portLookup=0 --protocol=mesi --randStackOffset=True \
        --restoreContextDelay=0 --retryDcacheDelay=10 --stackAlloc=3 \
        --switchOnDataAcc=True --benchmark=MEMCPU
fi

Typically, I only change numHWTCs, numcpus, warpSize.

daniel tian

Jan 8, 2012, 12:15:45 PM
to mv5...@googlegroups.com
Just ignore the SIMD parameters; right now those parameters are not processed.

Jiayuan Meng

Jan 8, 2012, 10:59:46 PM
to mv5...@googlegroups.com
Hmm, I don't see much from your commandline. Can you check using the method described below? Let me know what you find. 

Thanks,

Jiayuan

daniel tian

Jan 9, 2012, 3:51:17 AM
to mv5...@googlegroups.com
Here is the config.ini that causes:

Last tick: 830775000
Exiting @ tick 9223372036854775807 because simulate() limit reached

Master: 2 O3 CPUs
Slave: one SIMD core with 16 HW threads, numSWTCs=2

I found that if the number of SIMD cores is 2, this kind of abnormal
exit is gone.

config.ini

daniel tian

Jan 9, 2012, 10:18:02 PM
to mv5...@googlegroups.com
Hi, Jiayuan:
Here are some issues I found:
1. When the threads & SIMD cores are increased, the performance may
decrease. In my situation (a memset operation where every thread handles
32 bytes), increasing the SIMD cores from 2 to 4, with 8 threads
per core and a warpsize of 4, slows the performance down from
149391000 ticks to 278929000. This does not make sense.
I found a clue in m5stats.txt: the wait on the D-cache becomes much longer.
Here is my puzzle:
with 8 threads per warp, if one thread (maybe more than one)
encounters a D-cache miss, all other threads will be inactive until
all cache lines are ready. If so, a larger warpsize means a longer wait
for cache lines to become ready. Is this correct in MV5?

2. D-cache competition: in a memcpy operation, two memory references are
unavoidable. So if one thread loads a cache line, and another thread (in
another warp) evicts this line, then the efficiency will also be
impaired. Because in my case I don't want the cache or memory
performance to be a bottleneck, I wonder whether a fully associative
L1 cache is allowed in MV5 and how to configure it.

3. Cache coherence: the cache block is 128 bytes in the shared L2 (32 bytes
in the L1 D-cache). If 4 different threads touch the same L2 cache line,
does this situation make the cache inefficient because of the MSI
coherence protocol? How could I confirm this kind of information in
MV5?

Of course, I don't think the memory bandwidth is the bottleneck right
now, because the performance changes with different strides.

There are quite a few parameters and terminologies I cannot figure out.
Here they are:
--DcacheLookupLatency: this is the D-cache TAG lookup; I just don't
know how this parameter figures into the L1 latency.
--DcachePropagateLatency: I don't get this one.

system.cpus_smp0.batches_waitDcache      278887000.000000000000   # Number of cycles this cpu bears with n batches waiting for Dcache misses
system.cpus_smp0.batches_waitDcache_0    192790000.000000000000
system.cpus_smp0.batches_waitDcache_1     82871000.000000000000
system.cpus_smp0.batches_waitDcache_2      3226000.000000000000
system.cpus_smp0.batches_waitDcache_3 through batches_waitDcache_16: all 0.000000000000
What does the n in "waitDcache_n" mean? I cannot figure out what the n
batches are, because in this config there are 2 OOO master cores, 4 SIMD
cores, 8 threads per core, and the warpsize is 4.

Thank you for your great help. Jiayuan.
Best regards.
Xiaonan

Jiayuan Meng

Jan 10, 2012, 1:21:56 AM
to mv5...@googlegroups.com

Here is my puzzle:
   In 8threads per warp, if one thread (maybe more than one threads)
encounters the dcache miss, all other threads will be inactive until
all cache lines are ready. If so, larger warpsize means longer waiting
time for cache line ready.  Is this correct in MV5?

Right. A larger warpsize means it's more likely for threads to wait upon each other's D-cache misses.
 
2. D-cache competition: in a memcpy operation, two memory references are
unavoidable. So if one thread loads a cache line, and another thread (in
another warp) evicts this line, then the efficiency will also be
impaired. Because in my case I don't want the cache or memory
performance to be a bottleneck, I wonder whether a fully associative
L1 cache is allowed in MV5 and how to configure it.

You can just set the associativity to the number of cache lines to model a fully-associative cache.
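Concretely, for the L1 D-cache flags used elsewhere in this thread (16 kB cache, 32-byte blocks), the arithmetic works out as below. This is just a sketch of the calculation, not an MV5 API call; note that the Cacti latency model can limit how large an associativity is actually accepted, as reported later in this thread.

```python
# Fully-associative means associativity == number of cache lines:
# assoc = cache_size / block_size.
def fully_assoc_ways(cache_bytes, block_bytes):
    assert cache_bytes % block_bytes == 0
    return cache_bytes // block_bytes

ways = fully_assoc_ways(16 * 1024, 32)
print(ways)  # -> 512, i.e. pass --DcacheAssoc=512 for a 16kB/32B cache
```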
 
3. Cache Coherence: Because cache block is 128 in shared L2 (32bytes
in Dcache L1). Then if 4 different threads take this L2 cache line,
does this situation make the cache inefficiency because of MSI
coherency protocol ? How could I confirm this kind of information in
MV5?

Why would MSI make this situation inefficient? MV5 currently doesn't have stats to differentiate per-thread cache accesses, but you can add them --- each memory request records the issuing cpuID and threadID (if it is issued from L1 to L2, the threadID is always 0).
 
There are quite a few parameter and terminologies I cannot figure out.
Here they are:
--DcacheLookupLatency, this is the Dcache TAG lookup, I just don't
know how this parameter performs in L1 Latency.
--DcachePropagateLatency, I don't get this one.

lookupLatency is the tag lookup latency (which is by default set by Cacti).
propagateLatency is the additional latency to reach different banks if the cache is banked.
 
system.cpus_smp0.batches_waitDcache_0    192790000.000000000000
              # # Number of cycles this cpu bares with n batches
waiting for Dcache misses
What does the n in "waitDcache_n" mean? I cannot figure out what the n
batches are, because in this config there are 2 OOO master cores, 4 SIMD
cores, 8 threads per core, and the warpsize is 4.

A batch is basically a SIMD group; it is equal to a warp if there is no divergence. Note that control-flow divergence occurs when there are branches or exceptions (e.g. a TLB miss). Actually, in your operations there might be DTB misses, in which case threads may diverge---the first thread that triggers the miss turns to the miss handler, which can also lead to SIMD inefficiency.
The above means there were 192790000 cycles in which 0 batches were waiting for the D-cache.
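For reference, the batches_waitDcache_n histogram quoted earlier in the thread can be collapsed into an average stall occupancy (the numbers below are copied from that m5stats.txt excerpt; the interpretation of n follows the explanation above):

```python
# batches_waitDcache_n counts cycles during which exactly n batches were
# stalled on D-cache misses; a weighted average summarizes the pressure.
cycles_with_n_waiting = {0: 192790000, 1: 82871000, 2: 3226000}  # n >= 3 were all zero
total_cycles = sum(cycles_with_n_waiting.values())
avg_waiting = sum(n * c for n, c in cycles_with_n_waiting.items()) / total_cycles
print(total_cycles)  # -> 278887000, matching the batches_waitDcache summary stat
print(round(avg_waiting, 4))  # roughly 0.32 batches stalled per cycle on average
```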
 
Jiayuan

Jiayuan Meng

Jan 10, 2012, 1:27:36 AM
to mv5...@googlegroups.com
config.ini looks fine. Can you send me a tarball with the benchmark binary, the python system configuration file, and the commandline so that I can debug this case?

Thanks,

Jiayuan

Jiayuan Meng

Jan 12, 2012, 12:01:44 AM
to mv5...@googlegroups.com
Hi Xiaonan,

The problem is in the O3CPU. The bug occurs when a thread exits on the O3CPU but the program later tries to reallocate a new thread onto the same thread context (see below for more details). I tried changing all ">=" to "<" but there seems to be a problem with the ROB. I'm not sure how exactly the O3CPU deals with the ROB, but I'm guessing it's a bug in this old M5 version that I'm using. Is this a critical problem for you? If so I'll work on it; otherwise I'll leave it as is, since I'm slowly moving towards Gem5.

More details: It was because the O3CPU suspends the monitor thread of the second frContext due to limited hardware resources, which it shouldn't. 
The problem occurred in src/cpu/o3/cpu.cc:

template <class Impl>
void
FullO3CPU<Impl>::activateWhenReady(ThreadID tid)
{
   ...
---    if (freeList.numFreeIntRegs() >= TheISA::NumIntRegs) {
+++    if (freeList.numFreeIntRegs() < TheISA::NumIntRegs) {

        DPRINTF(O3CPU,"[tid:%i] Suspending thread due to not enough "
                "Phys. Int. Regs.\n",
                tid);
        ready = false;
    }

Thanks,

Jiayuan

Jiayuan Meng

Jan 12, 2012, 12:07:43 AM
to mv5...@googlegroups.com
So, a quick fix is to use a single frContext, or to avoid using the O3CPU. The modified source code for memgpu.cpp that I sent you uses one frContext and should work with O3CPUs. Please let me know if things worked out.

Thanks,

Jiayuan

daniel tian

Jan 12, 2012, 12:21:57 AM
to mv5...@googlegroups.com
I didn't receive your modified source code. Is the file in the email?
Thanks
Xiaonan

daniel tian

Jan 12, 2012, 12:35:35 AM
to mv5...@googlegroups.com

Sorry, I found it in another mail.
Thanks, and good night.

Jiayuan Meng

Jan 12, 2012, 9:17:44 AM
to mv5...@googlegroups.com
Thanks Xiaonan. Let me know if there are other issues and I'll be happy to help. I'm gradually working on integrating MV5 into Gem5, and I assume many issues with the old M5 version will go away. There will be some updates hopefully by the end of the year.

Cheers,

Jiayuan

daniel tian

Jan 12, 2012, 6:32:22 PM
to mv5...@googlegroups.com
Ok. Many thanks.
The "limit reached" exit error does not happen again.
However, I have a question: why does the performance of the same
workload seem to decline as I increase the SIMD cores (or threads)?
It means the SIMD core is not scalable.
I don't know why; there must be something in MV5 limiting the
performance when many SIMD cores are integrated.
Do you have any suggestions?

Xiaonan
Thanks

Jiayuan Meng

Jan 12, 2012, 10:27:41 PM
to mv5...@googlegroups.com
My guess is that: 1) the benchmark is memory-intensive, and 2) the cache hierarchy somehow becomes the bottleneck.

In fact, it may not be surprising---the memcpy benchmark is basically streaming data from the memory to the cpu and back to memory. Because there is no prefetching and little reuse, requests are mostly misses, so everything is going to be serialized at the L2 cache and physical memory, no matter how many concurrent threads you have. I just feel that memcpy is not a typical "SIMD" workload per se...
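A back-of-envelope calculation illustrates that serialization argument. The sketch below uses the --MemNetBandWidthMbps=456000 value from the command line earlier in this thread as an assumed peak bandwidth; real latencies, coherence traffic, and protocol overheads would only raise this bound, and it ignores ticks-vs-seconds conversion entirely.

```python
# A 16 kB memcpy must stream every byte in from memory and back out,
# so copy time is bounded below by bytes_moved / bandwidth,
# no matter how many concurrent SIMD threads issue the requests.
bytes_moved = 2 * 16 * 1024                 # 16 kB read + 16 kB written
bandwidth_bytes_per_s = 456000 * 1e6 / 8    # 456000 Mbps -> bytes/s
lower_bound_s = bytes_moved / bandwidth_bytes_per_s
print(lower_bound_s)  # minimum streaming time in seconds at peak bandwidth
```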

I have done experiments with other benchmarks included in api/tests, they do scale with more cpus and warp sizes...

Jiayuan

daniel tian

Jan 12, 2012, 11:31:30 PM
to mv5...@googlegroups.com
hi, Jiayuan:
Is there any workaround to bypass this bottleneck?
You see, in my configuration the CPUs & SIMD cores share the same L2
cache and memory controller. Is this configuration also a problem?
I just checked the cache parameters in frCommons.py. There are
several prefetch parameters, like "prefetcher_size",
"prefetch_access", "prefetch_miss", "prefetch_serial_squash", etc.
How do I enable this kind of prefetching in the cache? Because memcpy
& memset access memory sequentially, my benchmark will probably
benefit from it.

I am also trying to enlarge the memory bus width. I wonder how much
the performance can be sped up through multiple threads and multiple
SIMD cores.


Do you have any suggestions?

Many thanks,
Xiaonan

On Thu, Jan 12, 2012 at 9:27 PM, Jiayuan Meng <meng.j...@gmail.com> wrote:
> My guess is that: 1) the benchmark is memory-intensive 2) the cache
> hierarchy somehow becomes the bottleneck.
>
> In fact, it may not be surprising---the memcpy benchmark is basically

> streaming data from the memory to the cpu and back to memory. Because there

daniel tian

Jan 13, 2012, 3:07:15 AM
to mv5...@googlegroups.com
hi, Jiayuan:
Actually, I am interested in how the SIMD threads are scheduled.
I checked the code; however, it is a little complex. So if you have
time to give me a brief overview, that would be a great help.
Take this memcpy for example:
there is a 16 KB workload, the stride is 32 bytes, and there are 4 SIMD
cores with 16 threads per core; the warpsize is 8 threads.
Then every HW thread copies 32 bytes at a time, so this workload
needs 16 KB / 32 B = 512 threads to finish the work.
There are 4*16 = 64 HW threads in total.
So the master CPU will initially create 64 threads to finish
2 KB (64 * 32 B);
when these finish, the master CPU will create another 64 threads to
finish the next 2 KB,
and so on, until all the work is finished.
Is this correct?
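If that reading is right, the counting works out as below. This only sketches the arithmetic; whether MV5 actually refills hardware contexts in lock-step "waves" like this is exactly the scheduling question being asked here.

```python
# 16 kB of work at 32 bytes per software thread, executed on
# 4 SIMD cores x 16 hardware thread contexts.
total_sw_threads = (16 * 1024) // 32   # 512 software threads in total
hw_contexts = 4 * 16                   # 64 hardware thread contexts
waves, leftover = divmod(total_sw_threads, hw_contexts)
print(waves, leftover)  # -> 8 0: eight full 2 kB "waves", none left over
```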

Another question: do the threads in the same warp take 8 * 32
consecutive bytes?
I mean, thread0 takes the first 32 bytes, thread1 takes the next 32 bytes,
and so on.

Thanks
Xiaonan

Jiayuan Meng

Jan 13, 2012, 11:18:50 AM1/13/12
to mv5...@googlegroups.com
The prefetch should help, although prefetching is not well tested in MV5.

Things to tune:

The toL2net: frequency, bandwidth. Also, instead of Mesh2D, try FullyNoC
The L2 cache: # of L2 banks, L2 lookup and propagate latency. # of MSHRs and TgtsPerMSHR
memnet: frequency & bandwidth
physmem: physmemLatency

You can set those parameters to some unrealistically ideal values, just to see whether that is really the performance/scaling bottleneck. You might get more hints by looking at m5stats.txt and identifying the exact bottlenecks. You know, once a workload is memory-bound, SIMD may not help much...
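As a concrete illustration, the knobs listed above map onto flags that appear in the full command lines quoted in this thread. An "idealized" run for bottleneck hunting might override them like this; the values here are deliberately optimistic guesses, not recommended defaults:

```shell
# Illustrative near-ideal overrides for bottleneck hunting. Flag names are
# taken from the command lines used elsewhere in this thread; the numeric
# values are optimistic placeholders, not validated settings.
build/ALPHA_SE/m5.fast configs/fractal/hetero.py \
    --L2NetType=FullyNoC \
    --L2NetFrequency=2GHz --L2NetBandWidthMbps=1000000 \
    --l2Banks=32 --l2lookupLatency=1ns --l2propagateLatency=1ns \
    --l2MSHRs=1024 --l2TgtsPerMSHR=1024 \
    --MemNetFrequency=2GHz --MemNetBandWidthMbps=1000000 \
    --physmemLatency=1ns
```

If performance barely moves under such settings, the bottleneck is probably elsewhere (e.g. in the cores or the L1s) rather than in the L2/memory path.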

Jiayuan

daniel tian

Jan 14, 2012, 12:29:28 AM1/14/12
to mv5...@googlegroups.com
hi, Jiayuan:
Do you have any time to give me a brief description of SIMD core
scheduling?
I am really interested in this issue.
Thank you very much
Xiaonan

daniel tian

Jan 14, 2012, 6:15:10 PM1/14/12
to mv5...@googlegroups.com
Sorry. I just noticed there are several papers on the MV5 website.
I will read them first.
Thanks
Xiaonan

daniel tian

Mar 21, 2012, 10:13:43 PM3/21/12
to mv5...@googlegroups.com
Hi, Jiayuan:
MV5 doesn't support 32 threads per core, does it?
Because my MV5 got stuck when 32 threads were being set up.

I am looking forward to your confirmation.
Thanks
Xiaonan

Jiayuan Meng

Mar 21, 2012, 10:30:49 PM3/21/12
to mv5...@googlegroups.com
It should... I've tried up to 256. The only constraint is that the warp
size must be <= 64.
What do you mean by "got stuck"? The simulation doesn't stop? Usually
that's because of intense L1 contention or something in the cache
hierarchy. Try larger caches and higher associativity.

Jiayuan

Ziyi Liu

Mar 22, 2012, 2:29:06 AM3/22/12
to mv5...@googlegroups.com
Hi Jiayuan,

I tried enlarging the L1 Dcache size and associativity, but it doesn't help. I increased DcacheSize to 1024 kB, but Cacti only allows us to use a 16-way associative cache.
This is what my command line looks like:

build/ALPHA_SE/m5.fast  configs/fractal/hetero.py --rootdir=. --bindir=./api/binsimd_blocktask/ --simd=True --CpuFrequency=1.00GHz --GpuFrequency=1.00GHz --DcacheAssoc=16 --DcacheUseCacti=True --SIMDDcacheAssoc=512 --DcacheBanks=4 --DcacheBlkSize=32 --SIMDDcacheBlkSize=32 --DcacheHWPFdegree=1 --DcacheHWPFpolicy=none --DcacheLookupLatency=1ns --DcachePropagateLatency=2ns --DcacheRepl=LRU --DcacheSize=1024kB --SIMDDcacheSize=256kB --IcacheAssoc=16 --IcacheBlkSize=32 --IcacheHWPFdegree=1 --IcacheHWPFpolicy=none --IcacheSize=256kB --IcacheUseCacti=False --L2NetBandWidthMbps=456000 --L2NetFrequency=300MHz --L2NetPortOutQueueLength=128 --L2NetRouterBufferSize=512 --L2NetRoutingLatency=1ns --L2NetTimeOfFlight=13t --L2NetType=FullyNoC --L2NetWormHole=True --MemNetBandWidthMbps=$MemoryNetBandWidth --MemNetFrequency=$MemNetFrequ --MemNetPortOutQueueLength=$NetPortQueueLength --MemNetRouterBufferSize=$MemNetRouterBufferSize --MemNetTimeOfFlight=130t --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 --l2HWPFDataOnly=False --l2HWPFdegree=1 --l2HWPFpolicy=none --l2MSHRs=256 --l2Repl=LRU --l2Size=$L2CacheSize --l2TgtsPerMSHR=256 --l2lookupLatency=2ns --l2propagateLatency=12ns --l2cacheUseCacti=True --l2tol1ratio=2 --localAddrPolicy=1 --maxThreadBlockSize=0 --numSWTCs=2 --physmemLatency=$PhyMemLatency --physmemSize=1024MB --portLookup=0 --protocol=mesi --randStackOffset=True --restoreContextDelay=0 --retryDcacheDelay=10 --stackAlloc=3 --switchOnDataAcc=True --numcpus=$2  --warpSize=$4 --numHWTCs=$3 --benchmark=$5 --DTBsize=128

Any advice would be appreciated.
Thanks,
-Ziyi
--
Best Regards,

Ziyi

Jiayuan Meng

Mar 22, 2012, 2:32:23 AM3/22/12
to mv5...@googlegroups.com
Can you give me a command with those variables ($..) replaced with
actual values?

Thanks,

Jiayuan

Ziyi Liu

Mar 22, 2012, 2:39:14 AM3/22/12
to mv5...@googlegroups.com
Sure, sorry for the mistake.

build/ALPHA_SE/m5.fast  configs/fractal/hetero.py --rootdir=. --bindir=./api/binsimd_blocktask/ --simd=True --CpuFrequency=1.00GHz --GpuFrequency=1.00GHz --DcacheAssoc=16 --DcacheUseCacti=True --SIMDDcacheAssoc=512 --DcacheBanks=4 --DcacheBlkSize=32 --SIMDDcacheBlkSize=32 --DcacheHWPFdegree=1 --DcacheHWPFpolicy=none --DcacheLookupLatency=1ns --DcachePropagateLatency=2ns --DcacheRepl=LRU --DcacheSize=1024kB --SIMDDcacheSize=256kB --IcacheAssoc=16 --IcacheBlkSize=32 --IcacheHWPFdegree=1 --IcacheHWPFpolicy=none --IcacheSize=256kB --IcacheUseCacti=False --L2NetBandWidthMbps=456000 --L2NetFrequency=300MHz --L2NetPortOutQueueLength=128 --L2NetRouterBufferSize=512 --L2NetRoutingLatency=1ns --L2NetTimeOfFlight=13t --L2NetType=FullyNoC --L2NetWormHole=True --MemNetBandWidthMbps=456000 --MemNetFrequency=533MHz --MemNetPortOutQueueLength=64 --MemNetRouterBufferSize=2048 --MemNetTimeOfFlight=130t --l2Assoc=16 --l2Banks=16 --l2BlkSize=128 --l2HWPFDataOnly=False --l2HWPFdegree=1 --l2HWPFpolicy=none --l2MSHRs=256 --l2Repl=LRU --l2Size=4096kB --l2TgtsPerMSHR=256 --l2lookupLatency=2ns --l2propagateLatency=12ns --l2cacheUseCacti=True --l2tol1ratio=2 --localAddrPolicy=1 --maxThreadBlockSize=0 --numSWTCs=2 --physmemLatency=25ns --physmemSize=1024MB --portLookup=0 --protocol=mesi --randStackOffset=True --restoreContextDelay=0 --retryDcacheDelay=10 --stackAlloc=3 --switchOnDataAcc=True --numcpus=2  --warpSize=16 --numHWTCs=64 --benchmark=MEMGPU --DTBsize=128

Thanks,
-Ziyi
--
Best Regards,

Ziyi

Ziyi Liu

Mar 22, 2012, 3:21:52 PM3/22/12
to mv5...@googlegroups.com
Hi Jiayuan,

Any suggestions on how to run as many threads as possible on the SIMD cores?

Thanks.
--
Best Regards,

Ziyi

Jiayuan Meng

Mar 22, 2012, 5:28:43 PM3/22/12
to mv5...@googlegroups.com
I'm not sure what caused the problem. Try increasing retryDcacheDelay to
a large number, like 100000. When the cache hierarchy is thrashing,
packets will be rejected, and if this number is too small, the core will
constantly resend each packet only to have it rejected again, which
increases simulation overhead.

You can also press Ctrl-C and then look at the stats to see if there are
a lot of retries or cancellations of cache transitions; that usually
indicates that memory is the bottleneck.

You can further debug it by simplifying the cache hierarchy, or maybe
even just using the physical memory directly, to check whether the
problem lies there.

Another possibility is to use the example benchmarks, like FILTER, as
a starting point; I've personally tried them with up to 256 threads/core.
Then see if you can run them on your system configuration.

It's true that more threads/SIMD core does not always lead to better
performance...

Ziyi Liu

Mar 23, 2012, 12:25:35 AM3/23/12
to mv5...@googlegroups.com
Hi Jiayuan,

I have a question about workload distribution. Say we have 2 CPU cores; is there any way to distribute work evenly across the O3 CPU cores in code?

I noticed that there is a middleware API called smp_thread_create()/smp_thread_exit(). Could we use these functions in our code to create threads?

Or do we need to add a syscall to support fork() in MV5?

Thanks,
-Ziyi
--
Best Regards,

Ziyi

Jiayuan Meng

Mar 23, 2012, 6:05:28 PM3/23/12
to mv5...@googlegroups.com
Hi Ziyi,

Yes, you can do that. The smp_thread_create() API call triggers the
emulated system calls, which pick a tCPU and a thread context to start
that thread. You can insert scheduling policies there.

The exact code in the simulator can be found at
src/smp/smp_syscall_emul.cc, see SmpOS::Smp_sys_ConsumeFreeTC()

Jiayuan

Ziyi Liu

Mar 23, 2012, 11:43:12 PM3/23/12
to mv5...@googlegroups.com
Hi Jiayuan,

I have another question. In MV5, each SIMD core has 16 ALUs by default. How about the O3 cores?

Thanks,
-Ziyi

Jiayuan Meng

Mar 24, 2012, 10:52:25 AM3/24/12
to mv5...@googlegroups.com
I'm actually not quite familiar with M5's O3 core configuration. I
would guess M5 has some parameters that let you choose the number of
ALUs.

Jiayuan
