Hi Hanfeng,
> As you pointed, the calculation of SimPoint region number is not correct.
> For warmup, the SimPoint paper states that if the slice size is large, the
> effect of warmup may not be a problem though it may fail to capture the
> short-lived phase change. In the prior work of Nair. [1] and Patil et al.
> [2], they set slice size as 100M and 250M respectively. I think in my case,
> 250M may be large enough.
It depends on how large the unwarmed microarchitectural state is. If
you have a very large last-level cache, it may take hundreds of
millions of instructions before each entry has been used once, and
thus before the cache can be assumed to be fully warmed. But for
typical 8 MB LLC sizes, SimPoint regions of 250M or 1B instructions
should be OK.
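
To put rough numbers on this, here is a back-of-the-envelope sketch in
Python. It is an estimate, not a measurement: the function name, the
64-byte line size and the MPKI values are assumptions I picked for
illustration, and it optimistically assumes that every LLC miss fills
an entry that was still empty, so it only gives a lower bound.

```python
def min_warmup_instructions(llc_bytes, line_bytes=64, llc_mpki=1.0):
    """Lower bound on the instructions needed to fill every LLC line once.

    Assumes every LLC miss lands in a not-yet-filled entry; real
    workloads revisit sets and will need more instructions than this.
    """
    num_lines = llc_bytes // line_bytes
    misses_per_instruction = llc_mpki / 1000.0
    return num_lines / misses_per_instruction


MB = 1024 * 1024
for size_mb in (8, 64):
    for mpki in (0.5, 5.0):
        n = min_warmup_instructions(size_mb * MB, llc_mpki=mpki)
        print(f"{size_mb:3d} MB LLC, {mpki:4.1f} MPKI: "
              f">= {n / 1e6:.0f}M instructions")
```

Plug in your own cache size and miss rate to see whether the slice
length is long enough for the warmup error to be small.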
> I think I should explain why I ignore to simulate all the pinpoints in
> parallel. Since I need to build multi-programmed workload consists of
> multiple single-threaded programs. However it is impossible to run the whole
> workload due to long simulation time. Therefore, I want to mix the
> representative program regions traces similar with Sim et al [3] .(Please
This also reminds me of the Co-Phase Matrix work:
http://cseweb.ucsd.edu/~calder/papers/ISPASS-04-CoPhaseMatrix.pdf
> refer to section 4.) Considering this requirement, I choose only one
> representative region for each single-threaded program. As I pointed in
You could select two representatives, if they turn out to be very
different, but you would treat them as two different applications that
each have some probability of being co-scheduled with anything else.
> prior mail, I choose the one with the highest weight. Obviously, it is not
> correspondent with SimPoint methodology, which requires to simulate all
> regions. However, if I simulate all regions. e.g., M regions for each
I think you're supposed to use the maxK variable for this. Basically
that's the way of telling SimPoint how many regions you want (it is an
upper bound on the number of clusters). If you only want one region,
set maxK = 1 and you'll see that the representative region will
(usually) be different from the one with the highest weight from a
clustering with a larger maxK. (Imagine maxK=2, which results in two
clusters some way apart: each cluster's center will be represented by
its own region, whereas with maxK=1 the global average / cluster
center will be better represented by a region somewhere in between the
two clusters.)
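
A toy illustration of that effect, in Python. This is not SimPoint
itself: it is a plain k-means on made-up 2-D points standing in for
projected BBVs, with a deterministic initialization, and all names and
data are hypothetical.

```python
def centroid(points):
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    # Deterministic init: evenly spaced entries of the input list.
    centers = [points[i * (len(points) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[nearest].append(i)
        centers = [centroid([points[i] for i in members]) if members else centers[c]
                   for c, members in enumerate(clusters)]
    return clusters, centers


def representatives(points, k):
    """Return (slice index, weight) for the slice closest to each center."""
    clusters, centers = kmeans(points, k)
    reps = []
    for members, center in zip(clusters, centers):
        if members:
            rep = min(members, key=lambda i: dist2(points[i], center))
            reps.append((rep, len(members) / len(points)))
    return reps


# Six slices of one phase, one slice in between, three of another phase.
slices = [(0, 0), (0.1, 0), (0, 0.1), (-0.1, 0), (0, -0.1), (0.05, 0.05),
          (0.4, 0.3),
          (1, 1), (1.1, 1), (0.9, 1)]

print("k=1:", representatives(slices, 1))
print("k=2:", representatives(slices, 2))
```

With k=1 the chosen slice is the in-between one (index 6), while the
highest-weight representative with k=2 sits inside the larger phase
(index 5), so the two selection methods pick different regions.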
> As an extra evaluation, to validate these pinpoints are really
> representative, I simulate them with sniper. I also employ fast-forward (5B)
> plus detailed measurement (1B) approach to simulate SPEC CPU 2k6 in gem5.
> This approach is commonly used in architecture research literature. I also
Did you do a comparison of how this specific region (through its BBV)
compares to the SimPoint representative? I know this approach is used
often, but it's so wrong... You have no guarantee at all that this
region in any way corresponds to what the benchmark is doing most of
the time. Some of the CPU2k6 benchmarks are very long (hundreds of
billions of instructions), so whatever happens 5B instructions from
the start may still be initialization. Also look at the traces that
Jaleel has in [4]. The omnetpp L3 trace is a good example: you see one
type of behavior up to 20B instructions, then something else for
another 700B instructions:
http://www.jaleels.org/ajaleel/workload/SPEC_CPU2006/omnetpp.omnetpp.FULL_RUN.UL3.jpg
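
The BBV comparison could look something like the Python sketch below.
It assumes each BBV has already been parsed into a dict mapping basic
block id to execution count, and it uses the Manhattan distance between
the normalized vectors as one simple similarity metric (my choice here,
not necessarily what SimPoint uses internally). The function names and
the example vectors are made up.

```python
def normalize(bbv):
    """Scale a {basic_block_id: count} dict so that its values sum to 1."""
    total = sum(bbv.values())
    return {block: count / total for block, count in bbv.items()}


def manhattan_distance(bbv_a, bbv_b):
    """0.0 means identical code signatures, 2.0 means no shared blocks."""
    a, b = normalize(bbv_a), normalize(bbv_b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


# Hypothetical BBVs: the fast-forwarded region vs. the SimPoint pick.
fast_forward_region = {1: 900, 2: 100}
simpoint_region = {2: 150, 3: 850}

print(manhattan_distance(fast_forward_region, simpoint_region))
```

A distance close to 2 would tell you that the 5B+1B region is executing
mostly different code than the representative region.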
> compare the final simulation result with Jaleel's work [4] in the same
> simulation configurations ( Though same configuration, I am able to control
> the internal micro-architecture implementation.) Attached please find the
> result figure. On average, the simulation result sounds good. However, for
> several programs, the disparity is large, such as 403.gcc, 410.bwaves and
> 471.omnetpp. The blue bar below the horizontal axis indicates the
> corresponding program is memory intensive as pointed in Jaleel's work.
What are you trying to evaluate exactly? The simulation models, or the
region selection? There are a number of differences here: the
simulators are different, the configuration is most likely slightly
different, and the region of code is different. If you want to see the
effect of anything specific you'll have to isolate that effect by
keeping everything else constant.
Regards,
Wim