Usually one does not compare executions of the entire test-suite, but
looks for which programs have regressed. In that scenario only relative
changes per program matter, so μs are only compared to μs and seconds
only to seconds.
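For that kind of per-program comparison the test-suite also ships a
helper script (utils/compare.py in the test-suite checkout, if I
remember the path correctly) that prints the exec_time of each program
in two result files together with the relative change, e.g.:

  test-suite/utils/compare.py baseline.json patched.json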
> In any case, it would at least be great if the JSON data contained the time unit per test,
> but that is not happening either.
What do you mean? Don't you get the exec_time per program?
> Do you think that the lack of time-unit info is a problem? If yes, do you like the
> solution of adding the time unit to the JSON, or do you want to propose an alternative?
You could also normalize the time unit that is emitted to JSON to s or ms.
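If you only want to post-process an existing result file (rather than
changing what the test-suite emits), something along these lines should
also work, assuming lit's tests[].metrics.exec_time layout and that the
MicroBenchmarks entries are in μs:

  jq '(.tests[] | select((.name | test("MicroBenchmarks")) and .metrics.exec_time != null) | .metrics.exec_time) |= (. / 1e6)' out.json > out_seconds.json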
>
> The second question has to do with re-running the benchmarks: I do
> cmake + make + llvm-lit -v -j 1 -o out.json .
> but if I try to run the latter another time, it just does/shows nothing. Is there any reason
> that the benchmarks can't be run a second time? Could I somehow run them a second time?
Running the programs a second time did work for me in the past.
Remember to write the output to another file, or the previous .json
will be overwritten.
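E.g. for the second run:

  llvm-lit -v -j 1 -o out2.json .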
> Lastly, slightly off-topic but while we're on the subject of benchmarking,
> do you think it's reliable to run with -j <number of cores>? I'm a little bit afraid of
> the shared caches (because misses should be counted in the CPU time, which
> is what is measured in "exec_time" AFAIU)
> and any potential multi-threading that the tests may use.
It depends. You can run in parallel, but then you should increase the
number of samples (executions) appropriately to counter the increased
noise. Depending on how many cores your system has, it might not be
worth it; instead, try to make the system as deterministic as possible
(single thread, thread affinity, no background processes, perf instead
of timeit, avoiding context switches, etc.). To avoid systematic bias
from the same cache-sensitive programs always running in parallel, use
the --shuffle option.
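As a rough sketch of such a setup (the core number, file names and
options are only an illustration, adjust them to your machine):

  cmake -DTEST_SUITE_USE_PERF=ON <other cmake options> <path/to/test-suite>
  make
  taskset -c 2 llvm-lit -j 1 --shuffle -o run1.json .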
Michael
Also, depending on what you are trying to achieve (and what your platform target is), you could enable perf counter collection.
> Btw, when using perf (i.e., TEST_SUITE_USE_PERF in cmake), it seems that perf runs both during the
> build (i.e., make) and the run (i.e., llvm-lit) of the tests. It's not important, but do you happen to know
> why this happens?
You know the unit of time from the top-level folder: MicroBenchmarks
is in microseconds (because Google Benchmark reports microseconds);
everything else is in seconds.
That might be confusing when you don't know about it, but once you do,
there is no ambiguity.
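If you want to check which entries of a result file fall under the μs
rule, something like this should do (again assuming lit's tests[].name
layout):

  jq -r '.tests[].name | select(test("MicroBenchmarks"))' out.json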