how to use pinpoint to select representative samples


hanfeng QIN

Apr 2, 2013, 10:04:41 AM
to snip...@googlegroups.com
Hi all,

Because of the huge instruction counts of the SPEC CPU2006 suite, I want to use PinPoints to select representative instruction slices. After the pinpoints are generated, I choose the one with the highest weight. I am not sure whether this step is OK.

Take 429.mcf as an example: I got the following PinPoints output along with the weights. The main SimPoint parameters are maxK = 10 and slice_size = 250M.

154 0 0.006 0
959 1 0.008 1
84 2 0.146 2
770 3 0.054 3
1133 4 0.166 4
304 5 0.099 5
911 6 0.071 6
215 7 0.309 7
816 8 0.141 8

I choose the 7th pinpoint as the representative instruction region. Then record-trace is used to generate traces that are fed to Sniper. The beginning 215 * 250M instructions are skipped.

$ record-trace -f 53750000000 -d 250000000 -o 429.mcf -- ./mcf_base.x86_64 inp.in 1> mcf.ref.out 2>> mcf.ref.err

$ run-sniper -d 429 -c arch.cfg --traces=429.mcf

Anything I am missing?

Best,
Hanfeng

Trevor E. Carlson

Apr 2, 2013, 11:07:23 AM
to snip...@googlegroups.com
Hanfeng,

You've missed a few points about the SimPoint methodology; you should definitely read their paper [1], as it gives a good overview of the methodology. PinPoints is a Pin (and PinPlay) implementation of SimPoint. Definitely take a look at our recent tutorial [2] if you'd like to read more about PinPoints.

Generation of the PinPoints themselves looks good, except that I think you might have made a mistake with respect to the SimPoint region numbers. Isn't it correct that the 215th region should start at (215-1)*250M (as the 1st region starts at instruction 0, which is (1-1)*250M)? You also want to think a bit about warmup: longer SimPoints might not require warmup, as the effect is small, but it all depends on the memory hierarchy that you are using (large DRAM caches, for example, could require quite a bit of warmup).
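
In numbers (using the slice size and region number from the first post), the two interpretations give different starting offsets:

```python
# Offset arithmetic for the region-number question above. If SimPoint region
# numbers are 1-based, region 215 starts (215 - 1) slices into the run.
slice_size = 250_000_000
region = 215

start_if_zero_based = region * slice_size        # matches the -f value in the record-trace command
start_if_one_based = (region - 1) * slice_size   # one slice (250M instructions) earlier

print(start_if_zero_based, start_if_one_based)
```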

The most common use of SimPoints is to simulate the representative regions to determine the average CPI of your application. What is interesting about SimPoints is that you can simulate each region in parallel and determine a good approximation for the CPI (and therefore runtime) of your application (usually within 5% or better). You will need to simulate all (or most, usually 90 to 95% coverage, depending on your methodology) of the regions that you created, and determine the final CPI based on the weighted average of the CPI of each of the simulated regions. Finally, using the weighted CPI and total instruction count you can determine the runtime of the application.
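
As a hedged sketch of that weighted-average step: the weights below are the 429.mcf SimPoint weights from the first post, but the per-region CPIs, total instruction count, and clock frequency are made-up numbers for illustration only.

```python
# Weighted-CPI estimation from SimPoint regions (illustrative values).
weights = [0.006, 0.008, 0.146, 0.054, 0.166, 0.099, 0.071, 0.309, 0.141]
region_cpis = [1.2, 0.9, 1.5, 1.1, 2.0, 1.3, 0.8, 1.7, 1.4]  # hypothetical per-region CPIs

# Weighted average CPI over the simulated regions; dividing by the weight sum
# also handles the case where only 90-95% coverage was simulated.
weighted_cpi = sum(w * c for w, c in zip(weights, region_cpis)) / sum(weights)

# Runtime estimate: total dynamic instructions x CPI / clock frequency.
total_instructions = 215e9  # hypothetical full-run instruction count
frequency_hz = 2.66e9       # hypothetical core clock
runtime_s = total_instructions * weighted_cpi / frequency_hz

print(weighted_cpi, runtime_s)
```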

I hope that this helps explain SimPoints. Please get back to us if you have any other issues,
Trevor

[1] http://cseweb.ucsd.edu/~calder/simpoint/
--
You received this message because you are subscribed to the Google Groups "Sniper simulator" group.
To post to this group, send email to snip...@googlegroups.com
To unsubscribe from this group and stop receiving emails from it, send an email to snipersim+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

hanfeng QIN

Apr 2, 2013, 10:15:34 PM
to snip...@googlegroups.com
Hi Trevor,

Thanks for your response.

As you pointed out, my calculation of the SimPoint region number is not correct. For warmup, the SimPoint paper states that if the slice size is large, the effect of warmup may not be a problem, though a large slice may fail to capture short-lived phase changes. In prior work, Nair and John [1] and Patil et al. [2] set the slice size to 100M and 250M, respectively. I think that in my case, 250M may be large enough.

I should explain why I do not simulate all the pinpoints in parallel. I need to build multi-programmed workloads, each consisting of multiple single-threaded programs. It is impossible to run the whole workload due to the long simulation time, so I want to mix the representative region traces, similar to Sim et al. [3] (please refer to Section 4). Given this requirement, I choose only one representative region for each single-threaded program; as I mentioned in my previous mail, I choose the one with the highest weight. Obviously this does not follow the SimPoint methodology, which requires simulating all regions. However, if I simulate all regions, e.g., M regions per program on average, I have no idea how to control the simulation of the workload. In fact, I do not know how to build mixed workloads from representative regions sampled with SimPoint. Could you give me some suggestions?

As an extra evaluation, to validate that these pinpoints are really representative, I simulate them with Sniper. I also employ the fast-forward (5B) plus detailed measurement (1B) approach to simulate SPEC CPU2006 in gem5; this approach is commonly used in the architecture research literature. I also compare the final simulation results with Jaleel's work [4] under the same simulation configuration (though the configuration is the same, I am able to control the internal microarchitecture implementation). Attached please find the result figure. On average, the simulation results look good. However, for several programs the disparity is large, such as 403.gcc, 410.bwaves, and 471.omnetpp. A blue bar below the horizontal axis indicates that the corresponding program is memory intensive, as noted in Jaleel's work.


Thanks,
Hanfeng

[1] A. Nair and L. John. Simulation Points for SPEC CPU 2006. In ICCD'08.
[2] H. Patil et al. Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation. In MICRO'04.
[3] J. Sim et al. FLEXclusion: Balancing Cache Capacity and On-chip Bandwidth via Flexible Exclusion. In ISCA'12.
[4] http://www.jaleels.org/ajaleel/workload/SPEC_CPU2006/
sniper_gem5_cmpim.pdf

Wim Heirman

Apr 3, 2013, 4:07:41 AM
to snip...@googlegroups.com
Hi Hanfeng,

> As you pointed, the calculation of SimPoint region number is not correct.
> For warmup, the SimPoint paper states that if the slice size is large, the
> effect of warmup may not be a problem though it may fail to capture the
> short-lived phase change. In the prior work of Nair. [1] and Patil et al.
> [2], they set slice size as 100M and 250M respectively. I think in my case,
> 250M may be large enough.

It depends on how large the unwarmed microarchitectural state is. If
you have a very large last-level cache, it may take 100s of M
instructions before each entry is used once, and thus before it can be
assumed to be fully warmed. But for typical 8M LLC sizes, 250M or 1B
SimPoints should be Ok.

> I think I should explain why I ignore to simulate all the pinpoints in
> parallel. Since I need to build multi-programmed workload consists of
> multiple single-threaded programs. However it is impossible to run the whole
> workload due to long simulation time. Therefore, I want to mix the
> representative program regions traces similar with Sim et al [3] .(Please

This also reminds me of the Co-Phase Matrix work:
http://cseweb.ucsd.edu/~calder/papers/ISPASS-04-CoPhaseMatrix.pdf

> refer to section 4.) Considering this requirement, I choose only one
> representative region for each single-threaded program. As I pointed in

You could select two representatives, if they turn out to be very
different, but you would treat them as two different applications that
each have some probability of being co-scheduled with anything else.

> prior mail, I choose the one with the highest weight. Obviously, it is not
> correspondent with SimPoint methodology, which requires to simulate all
> regions. However, if I simulate all regions. e.g., M regions for each

I think you're supposed to use the maxK variable for this. Basically
that's the way of telling SimPoints how many regions you want. If you
only want one region, set maxK = 1 and you'll see that the
representative region will (usually) be different from the one with
highest weight from a larger maxK clustering. (Imagine maxK=2 which
results in two clusters some way apart: each cluster's center will be
a representative region, whereas with maxK=1 the global average /
cluster center will be better represented by a region somewhere in
between the two clusters).
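
A toy 1-D sketch of that intuition (purely illustrative numbers, not real BBVs): with maxK=1 the representative is the slice nearest the global centroid, which can differ from the highest-weight representative of a maxK=2 clustering.

```python
# Stand-ins for per-slice BBVs, collapsed to one dimension for illustration.
slices = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 13.0]

# Pretend a maxK=2 clustering found these two clusters:
clusters = [[0.0, 1.0, 2.0], [10.0, 11.0, 12.0, 13.0]]

def representative(points):
    """The point closest to the cluster's center (mean)."""
    center = sum(points) / len(points)
    return min(points, key=lambda p: abs(p - center))

# Highest-weight cluster is the larger one; its representative:
heaviest = max(clusters, key=len)
rep_heaviest = representative(heaviest)   # 11.0 (cluster center is 11.5)

# maxK=1: one cluster containing everything; global mean is 7.0.
rep_global = representative(slices)       # 10.0, a different region

print(rep_heaviest, rep_global)
```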

> As an extra evaluation, to validate these pinpoints are really
> representative, I simulate them with sniper. I also employ fast-forward (5B)
> plus detailed measurement (1B) approach to simulate SPEC CPU 2k6 in gem5.
> This approach is commonly used in architecture research literature. I also

Did you do a comparison of how this specific region (through its BBV)
compares to the SimPoint representative? I know this approach is used
often, but it's so wrong... You have no guarantee at all that this
region in any way corresponds to what the benchmark is doing most of
the time. Some of the CPU2k6 benchmarks are very long (100s of B
instructions), whatever happens 5B instructions from the start may
still be initialization. Also look at the traces that Jaleel has in
[4]. Omnet's L3 trace is a good example: you see one type of behavior
up to 20B instructions, then something else for another 700B
instructions:
http://www.jaleels.org/ajaleel/workload/SPEC_CPU2006/omnetpp.omnetpp.FULL_RUN.UL3.jpg

> compare the final simulation result with Jaleel's work [4] in the same
> simulation configurations ( Though same configuration, I am able to control
> the internal micro-archtiecture implementation.) Attached please find the
> result figure. On average, the simulation result sounds good. However, for
> several programs, the disparity is large, such as 403.gcc, 410.bwaves and
> 471.omnetpp. The blue bar below the horizontal axis indicates the
> corresponding program is memory intensive as pointed in Jaleel's work.

What are you trying to evaluate exactly? The simulation models, or the
region selection? There are a number of differences here: the
simulators are different, the configuration is most likely slightly
different, and the region of code is different. If you want to see the
effect of anything specific you'll have to isolate that effect by
keeping everything else constant.

Regards,
Wim

hanfeng QIN

Apr 8, 2013, 10:32:26 AM
to snip...@googlegroups.com
Hi Wim,

Thanks for your help. I can get the desired sample by setting maxK to 1.
Compared with the prior run, the final simulation result looks good this time.

But I still have another question when I feed multiple sift traces to
sniper with the run-time configuration '--sim-end=last-restart'.

$./run-sniper -c arch.cfg -d ../run/410-429 --sim-end=last-restart
--traces=410.sift,429.sift

I thought the '--sim-end=last-restart' option would force both 410.sift
and 429.sift to execute at least once. However, I find that the simulation
does not terminate as expected. The related output is listed as follows.

[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[TRACE:0] -- DONE --
[TRACE:1] -- DONE --

Am I misunderstanding this option? Further, if this option is used,
how many instructions are counted when calculating the final performance
metrics? Only the first run, or the total?

Thanks again.

Regards,
Hanfeng

Wim Heirman

Apr 8, 2013, 3:25:48 PM
to snip...@googlegroups.com
Hanfeng,

Unless your two traces execute for the exact same amount of cycles,
your run-sniper command line should cause one of the traces to be
restarted. Did you make any changes to the source? Is this version 4.2
of Sniper?

Keep in mind that the default number of cores is 1, so you'll need to
add -n2 to run these two traces on different cores. Your command will
time-share both threads on a single core (unless arch.cfg overrides
general/total_cores).

If you'll look at the statistics after a run like this, it will
contain all instructions executed by all runs of each benchmark. If
you want to know information about the first run, you can use a Python
script that listens for the HOOK_APPLICATION_EXIT, which is called
just before an application is restarted, and extract statistics at
that point (either by writing a statistics snapshot and doing some
postprocessing using --partial, or by directly querying the relevant
statistics from sim.stats).

Regards,
Wim

hanfeng QIN

Apr 9, 2013, 7:51:24 PM
to snip...@googlegroups.com
Wim,

Sorry for my delayed feedback.

The version is 4.2. I did not change the source code that controls the
simulation process. I set the total core number to 2 in arch.cfg as well.

This morning I found that the simulation had finally exited. It took tens
of hours to execute the two traces. I think this long simulation time
explains why Sniper did not exit as expected yesterday.

However, the simulation process and final statistics are interesting.
Before 429.mcf (trace 1) exits, 410.bwaves restarts 8 times (traces
0, 2, 3, 4, 5, 6, 7, 8). Please see the following simulation output.

[TRACE: 0] -- DONE --
[TRACE: 2] -- DONE --
[TRACE: 3] -- DONE --
[TRACE: 4] -- DONE --
[TRACE: 5] -- DONE --
[TRACE: 6] -- DONE --
[TRACE: 7] -- DONE --
[TRACE: 8] -- DONE --
[TRACE: 1] -- DONE --
[TRACE: 9] -- STOP --

I checked the final stats and found that the instruction counts of both
410.bwaves and 429.mcf are far more than the expected 1B that I
specified during SIFT trace generation.

             | Core 0     | Core 1
Instructions | 4999999881 | 4203856637
Cycles       | 6266670228 | 8787062044
Time         | 2355891063 | 3303406784

It is reasonable that 429.mcf is slower than 410.bwaves. However, it is
really strange that it consumes so many instructions. I notice that in a
solo run, the instruction count is the expected 1 billion. I am wondering
why so many instructions are consumed by Sniper.

Regards,
Hanfeng

Trevor E. Carlson

Apr 9, 2013, 8:00:51 PM
to snip...@googlegroups.com
HanFeng,

There is a difference between instructions simulated on a core and instructions simulated by software threads. My guess here is that the scheduler you are using is scheduling subsequent runs of bwaves on the same core as mcf. Use the dumpstats.py command to view the detailed output from that run, and take a look at thread.instructions_by_core[]. It should list the instructions that each software thread executed on each hardware core.

Trevor


hanfeng QIN

Apr 9, 2013, 10:07:06 PM
to snip...@googlegroups.com
Trevor,

You are right. I am using the pinned scheduler, which hands out cores in a round-robin way. I reviewed the detailed outputs through dumpstats.py and found the following data.

thread.instructions_by_core[0] = 999999885, 0, 999999999, 0, 999999999, 0, 999999999, 0, 999999999
thread.instructions_by_core[1] = 0, 999998945, 0, 1000000267, 0, 999999966, 0, 1000000131, 0, 203857328

Obviously, this conforms to the pinned scheduler's behavior.
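
As a sanity check, the per-thread counts above sum exactly to the per-core totals Sniper reported earlier in this thread (only numbers already quoted here are used):

```python
# Per-thread instruction counts on each core, from the dumpstats.py output above.
core0_by_thread = [999999885, 0, 999999999, 0, 999999999, 0, 999999999, 0, 999999999]
core1_by_thread = [0, 999998945, 0, 1000000267, 0, 999999966, 0, 1000000131, 0, 203857328]

# These sums match the per-core "Instructions" totals reported earlier:
# 4999999881 on core 0 and 4203856637 on core 1.
print(sum(core0_by_thread), sum(core1_by_thread))
```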

Many thanks again.

Hanfeng

Newton

Jun 21, 2014, 12:10:58 AM
to snip...@googlegroups.com
I didn't understand how we can run multiple simpoints in parallel on Sniper.

What I understood from the Sniper manual [1] (pages 8-9) is that:
1) I can record traces of the required simpoints (as explained in the first post).
2) If I have recorded N traces, I can simulate the traces in parallel on N cores.

Please correct me if what I understood is wrong.

I have one doubt regarding simpoints:
Suppose the SimPoint tool gives me 10 simpoints. To find the estimated CPI of the workload, I need to run simulations on all 10 simpoints and weight their CPIs to find the total CPI.

If I run simulations for all the simpoints, then where am I saving on simulation time? Am I not running the entire workload again, even after using simpoints?

[1] http://snipersim.com/documents/sniper-manual.pdf

Thanks
Newton

Newton Singh

Jun 21, 2014, 7:52:25 AM
to Trevor Carlson, snip...@googlegroups.com
Dear Trevor,

Thanks a lot for clearing up my doubt.

For parallel simulation, are these two steps fine and sufficient?

1) I can record traces of the required simpoints (as explained in the first post).
2) Having recorded N traces, I can simulate the traces in parallel on N cores in Sniper (here N = 2):
./run-sniper -c gainestown -n 2 --traces=gcc_1.sift,gcc_2.sift

Thanks
Newton




On Sat, Jun 21, 2014 at 2:05 PM, Trevor Carlson <trevor....@elis.ugent.be> wrote:
Newton,

     There seem to be two questions here. The first one is in regards to workload reduction. A typical scenario is that by running the SimPoint methodology, you will slice up your single-threaded workload into, for example 1000 slices of 100 million instructions each. After each slice is compared for similarity, SimPoint, in this example, determines that 10 slices are enough to cover 98% of the application. You gain because although you are simulating all 10 SimPoints, you don't have to simulate the entire application (which in this case is 1000 slices). That results in a 100x reduction in detailed simulation.

     The second question revolves around parallel simulation of these SimPoints. Continuing with the example above, if we have a 10 core machine, then we can simulate all 10 SimPoints in parallel, resulting in an additional speedup. You can then use the results of these ten simulations to estimate the performance of the entire application.
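
The arithmetic behind this example, spelled out:

```python
# Workload-reduction arithmetic from Trevor's example: 1000 slices total,
# only 10 SimPoints simulated in detail, on a 10-core host.
total_slices = 1000
simulated_simpoints = 10
host_cores = 10

detailed_sim_reduction = total_slices / simulated_simpoints  # 100x less detailed simulation
wallclock_speedup = detailed_sim_reduction * host_cores      # up to 1000x vs. a full detailed run

print(detailed_sim_reduction, wallclock_speedup)
```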

I hope that clears things up,
Trevor




--
Thanks
Newton
MTech, IIT Bombay

Trevor E. Carlson

Jun 23, 2014, 10:33:28 AM
to snip...@googlegroups.com, newto...@gmail.com
Newton,

    Yes, you can start with recording the SimPoint traces needed, so the first step is fine. Unfortunately, the parallel second step does not make much sense. What I meant by parallel was to run two copies of Sniper in parallel (both simulating a single-core configuration). What you are doing here is running one multi-core simulation with two traces, which will not let you get to the correct performance estimation.

    You'll want to do something like this:
$ ./run-sniper --traces=gcc_1.sift -d gcc_1 &
$ ./run-sniper --traces=gcc_2.sift -d gcc_2 &
$ ./wait-for-sniper-to-finish.sh
$ # This script uses the gcc_1 and gcc_2 runs with SimPoint weights to estimate the entire application's CPI
$ ./estimate-cpi-from-sniper-and-simpoint.sh

    As an alternative to generating the representatives yourself, you could use the pre-packaged pinballs of the SPEC CPU2006 benchmarks that we distribute on our website [1]. You would need to use the PinPlay version of Pin, but it should allow you to get started a bit faster.

Good luck,
Trevor

[1] http://snipersim.org/w/Pinballs

Leye Olorode

Sep 2, 2014, 11:55:29 AM
to snip...@googlegroups.com, newto...@gmail.com
Hi Trevor,

What file contains the weights for the downloadable pinpoints given in the referenced link http://snipersim.org/w/Pinballs?

Regards,
Leye

Trevor Carlson

Sep 3, 2014, 4:43:31 PM
to snip...@googlegroups.com, newto...@gmail.com
Leye,

     The weights are encoded in the file name itself. You should see a number of digits that look something like this: *-0_XYZAB-*. That means the SimPoint has a weight of 0.XYZAB.
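
A quick sketch of extracting such a weight in Python. The file name below is made up; only the "-0_XYZAB-" naming convention comes from this thread.

```python
import re

# Hypothetical pinball file name; the "-0_30942-" part encodes weight 0.30942.
name = "mcf-ref-1-0_30942-5.address"

match = re.search(r"-(\d+)_(\d+)-", name)
weight = float(match.group(1) + "." + match.group(2))

print(weight)  # 0.30942
```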

I hope that this helps,
Trevor

Leye Olorode

Sep 5, 2014, 12:52:38 PM
to snip...@googlegroups.com, newto...@gmail.com
Thanks Trevor.

I see those. I was expecting the weights to add up to 1, but they do not. They must be raw weights, which means I should multiply by each weight and divide by the total of all weights for any given benchmark.
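
That normalization step can be sketched as follows (the raw weight values here are made-up examples):

```python
# Normalize raw per-pinball weights so they sum to 1 for a given benchmark.
raw_weights = [0.30942, 0.20151, 0.14088]  # hypothetical raw weights

total = sum(raw_weights)
normalized = [w / total for w in raw_weights]

print(sum(normalized))  # sums to 1.0, up to floating-point rounding
```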

Regards,
Leye

Trevor Carlson

Apr 12, 2016, 2:26:40 AM
to snip...@googlegroups.com, newto...@gmail.com
Newton,

To simulate multi-program workloads with pinballs, you can use the --pinballs option, separating each .address file with a comma. Statistics are generated into the sim.stats.* and sim.out files. Take a look at --viz for additional visualization and statistics options.

Trevor

On Apr 10, 2016, at 5:37 PM, Zicong Wang <wzc3...@gmail.com> wrote:

Hello, Trevor
I already know how to use the pinpoints files to simulate a single-threaded application, but how do I simulate multiple single-threaded programs (e.g., a multi-programmed workload) using pinpoints files? Should I pass multiple *.address files to the "--pinballs" parameter?
And besides CPI, how do I calculate other statistics?

Thank you!

Zicong Wang

Apr 12, 2016, 9:33:24 PM
to Sniper simulator, newto...@gmail.com
Trevor,

But how do we estimate each entire application's CPI? In other words, we only know the CPI and other statistics of each snippet (pinpoint), not of the entire application. Please correct me if my understanding is wrong.

Thank you!
Zicong

Trevor Carlson

Apr 13, 2016, 3:23:50 AM
to snip...@googlegroups.com, newto...@gmail.com
Zicong,

Things obviously get quite complicated once you move to multi-program workloads. You are right that we typically want to use a number of SimPoints to accurately estimate the performance of an application. But sometimes (depending on the study), one can use a single large (1B+ instructions) SimPoint as a representative for an entire single-threaded application. Once you make this assumption (that we have effectively created a new benchmark out of the representative sample of the original), we can combine representatives in interesting ways with other benchmarks to see what happens.

Once we combine applications into a multi-program workload, we have to consider a number of different things. First, do we consider only the time when both applications are co-running, or do we consider the execution time to be the sum for both, even if one completes earlier than the other? Will you always start the applications at the same time, or will you create a deterministic or random offset? You might also consider moving to a time-based metric once you start working with a number of applications, as they will have different CPIs, and the important metric you are looking for (execution time) should be looked at directly. Also, depending on your research goals, other metrics like ANTT (average normalized turnaround time) or STP (system throughput) could be more applicable for what you are doing.

The metric of interest is experiment dependent, and I would say learning which one applies to the research at hand is an important part of becoming a better researcher.

Good luck,
Trevor

Zicong Wang

Apr 13, 2016, 10:23:34 AM
to Sniper simulator, newto...@gmail.com
Trevor,

Thanks for your detailed reply; your words are quite inspiring to me.

In fact, my research is quite similar to this paper [1] (MICRO 2012). In the paper, the authors use the normalized weighted speedup metric to quantify the benefits of their proposed scheme (Section 4.1), which I think is a proper metric for evaluation because it can avoid the problems you mentioned. I hope to use a similar baseline configuration (Table 1) and execute a multi-programmed workload composed of several different applications from SPEC CPU2006 (Table 2).

Could you give me some advice about performing the experiment using sniper?

Thank you!