Hi All,
My colleague Sanne found a performance improvement on hmmer from SPEC2006 with the ‘-enable-loop-distribute’ option.
In hmmer, there is a loop with a loop-carried dependence. The Loop Distribute pass splits it into three separate loops: one still has the dependence, another is vectorizable, and the third becomes vectorizable after running the LoopBoundSplit pass (which needs a small update). On AArch64, we have seen a 40% improvement on hmmer from SPEC2006 with the Loop Distribute pass enabled, and an 80% improvement with both Loop Distribute and LoopBoundSplit enabled. (A sketch of this loop shape follows below.)
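For illustration, here is a minimal C sketch of the loop shape described above. It is not the actual hmmer source and the names are made up; it only shows how a loop mixing a recurrence with independent work can be split so that the recurrence stays sequential while the rest becomes vectorizable.

/* Illustrative only, not the hmmer code. */
void sketch(int *restrict a, const int *restrict b,
            int *restrict c, int *restrict d, int n) {
  for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i]; /* loop-carried dependence          */
    c[i] = b[i] * 2;        /* independent of a[], vectorizable */
    d[i] = c[i] + b[i];     /* independent of a[], vectorizable */
  }
}

/* Conceptually, after Loop Distribute: */
void sketch_distributed(int *restrict a, const int *restrict b,
                        int *restrict c, int *restrict d, int n) {
  for (int i = 1; i < n; i++)
    a[i] = a[i - 1] + b[i]; /* keeps the dependence, stays scalar */
  for (int i = 1; i < n; i++)
    c[i] = b[i] * 2;        /* vectorizable */
  for (int i = 1; i < n; i++)
    d[i] = c[i] + b[i];     /* vectorizable */
}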
From the llvm-test-suite and SPEC benchmarks, I have not seen any performance degradation with the Loop Distribute pass enabled, because almost all tests are not transformed by the pass; it mostly bails out with the messages below, which I think are reasonable.
Skipping; memory operations are safe for vectorization
Skipping; no unsafe dependences to isolate
Skipping; multiple exit blocks
For compile time, there is no significant change, because almost all tests are rejected by the pass for mainly the three reasons above, and those checks are answered from cached analysis information.
At the moment, we can enable the pass per loop with metadata or globally with a command line option (examples below). If possible, can we enable the Loop Distribute pass by default in the new pass manager pipeline?
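For reference, these are the two opt-in mechanisms meant above; the pragma and the flag are the existing ones, while the loop itself is just an example.

/* Per-loop opt-in: the pragma attaches llvm.loop.distribute.enable metadata. */
void saxpy(float *restrict x, float *restrict y, float a, int n) {
#pragma clang loop distribute(enable)
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

/* Global opt-in via the command line:
 *   clang -O3 -mllvm -enable-loop-distribute file.c
 */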
Thanks
JinGu Kang
I’d be in favour of enabling loop distribution by default as long as it doesn’t hurt compile-time when it’s not needed.
FWIW GCC enables this by default to get the speedup on hmmer. I don’t know enough about the LLVM implementation to compare with GCC’s, but GCC’s loop distribution pass aims to help vectorisation and help detect manual memset, memcpy implementations (I think LLVM does that detection in another pass).
You can read the high-level GCC design in the source: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/tree-loop-distribution.c;h=65aa1df4abae2c6acf40299f710bc62ee6bacc07;hb=HEAD#l39
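To make the memset/memcpy detection concrete, here is a purely illustrative loop (the function is hypothetical). GCC's loop distribution can split out such statements and turn them into library calls, while in LLVM this kind of idiom is recognised by a separate pass, as noted above.

#include <stddef.h>

/* Hypothetical example of manual memset/memcpy mixed into one loop. */
void clear_and_copy(char *restrict zeroed, char *restrict dst,
                    const char *restrict src, size_t n) {
  for (size_t i = 0; i < n; i++) {
    zeroed[i] = 0;   /* distributable into memset(zeroed, 0, n) */
    dst[i] = src[i]; /* distributable into memcpy(dst, src, n)  */
  }
}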
Thanks,
Kyrill
Turning on -enable-loop-distribute by default would mean that we
consider loop distribution to be usually beneficial without causing
major regressions. We need a lot more data to support that conclusion.
Alternatively, we could consider loop distribution a canonicalization.
A later LoopFuse pass would apply a profitability heuristic to re-fuse
the loops if loop distribution did not gain anything (sketched below).
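A hypothetical sketch of that direction (none of this is current pass output; the functions are placeholders):

/* Canonical form after an unconditional LoopDistribute: */
void after_distribution(float *restrict a, float *restrict b,
                        const float *restrict x, int n) {
  for (int i = 0; i < n; i++)
    a[i] = x[i] * 2.0f;
  for (int i = 0; i < n; i++)
    b[i] = a[i] + x[i];
}

/* If neither loop ends up vectorized, a profitability-driven LoopFuse
 * could merge them back into the original single loop: */
void after_refusion(float *restrict a, float *restrict b,
                    const float *restrict x, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = x[i] * 2.0f;
    b[i] = a[i] + x[i];
  }
}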
Michael
The compile time data is below. There may be some noise, but there appears to be no significant compile time regression.
From llvm-test-suite
Metric: compile_time
Program results_base results_loop_dist diff
test-suite...arks/VersaBench/dbms/dbms.test 0.94 0.95 1.6%
test-suite...s/MallocBench/cfrac/cfrac.test 0.89 0.90 1.5%
test-suite...ks/Prolangs-C/gnugo/gnugo.test 0.72 0.73 1.4%
test-suite...yApps-C++/PENNANT/PENNANT.test 8.65 8.75 1.2%
test-suite...marks/Ptrdist/yacr2/yacr2.test 0.84 0.85 1.1%
test-suite.../Builtins/Int128/Builtins.test 0.86 0.87 1.0%
test-suite...s/ASC_Sequoia/AMGmk/AMGmk.test 0.69 0.70 1.0%
test-suite...decode/alacconvert-decode.test 1.16 1.17 0.9%
test-suite...encode/alacconvert-encode.test 1.16 1.17 0.9%
test-suite...peg2/mpeg2dec/mpeg2decode.test 1.71 1.72 0.9%
test-suite.../Applications/spiff/spiff.test 0.88 0.89 0.9%
test-suite...terpolation/Interpolation.test 0.96 0.97 0.9%
test-suite...chmarks/MallocBench/gs/gs.test 4.58 4.62 0.9%
test-suite...-C++/stepanov_abstraction.test 0.69 0.70 0.8%
test-suite...marks/7zip/7zip-benchmark.test 52.35 52.74 0.7%
Geomean difference nan%
results_base results_loop_dist diff
count 117.000000 118.000000 117.000000
mean 4.636126 4.616575 0.002171
std 7.725991 7.737663 0.006310
min 0.607300 0.602200 -0.041930
25% 1.345700 1.313650 -0.001577
50% 1.887000 1.888800 0.002463
75% 4.340800 4.343275 0.005754
max 52.351200 52.736000 0.015861
From SPEC2017
| benchmarks      | baseline | enable-loop-distribute | diff (seconds) |
| 500.perlbench_r | 00:01:06 | 00:01:04               | -2             |
| 502.gcc_r       | 00:05:24 | 00:05:25               |  1             |
| 505.mcf_r       | 00:00:02 | 00:00:02               |  0             |
| 520.omnetpp_r   | 00:00:58 | 00:00:58               |  0             |
| 523.xalancbmk_r | 00:02:30 | 00:02:30               |  0             |
| 525.x264_r      | 00:00:32 | 00:00:31               | -1             |
| 531.deepsjeng_r | 00:00:04 | 00:00:04               |  0             |
| 541.leela_r     | 00:00:06 | 00:00:06               |  0             |
| 557.xz_r        | 00:00:05 | 00:00:05               |  0             |
| 999.specrand_ir | 00:00:01 | 00:00:00               |  1             |
From SPEC2006 (diff in seconds)
| benchmarks     | baseline | enable-loop-distribute | diff (seconds) |
| 400.perlbench  | 00:00:29 | 00:00:29               |  0             |
| 401.bzip2      | 00:00:04 | 00:00:03               | -1             |
| 403.gcc        | 00:01:28 | 00:01:26               | -2             |
| 429.mcf        | 00:00:01 | 00:00:01               |  0             |
| 445.gobmk      | 00:00:24 | 00:00:24               |  0             |
| 456.hmmer      | 00:00:06 | 00:00:06               |  0             |
| 458.sjeng      | 00:00:03 | 00:00:03               |  0             |
| 462.libquantum | 00:00:03 | 00:00:02               | -1             |
| 464.h264ref    | 00:00:29 | 00:00:29               |  0             |
| 471.omnetpp    | 00:00:23 | 00:00:24               |  1             |
| 473.astar      | 00:00:02 | 00:00:02               |  0             |
| 483.xalancbmk  | 00:02:07 | 00:02:06               | -1             |
| 999.specrand   | 00:00:01 | 00:00:01               |  0             |
Thanks
JinGu Kang
From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of
Sjoerd Meijer via llvm-dev
LoopDistribute currently already iterates over all loops to find the
llvm.loop.distribute.enable metadata. The additional compile-time
overhead would be LoopAccessAnalysis, which could be cheap if
LoopAccessAnalysis is used for LoopVectorize anyway.
> ________________________________
> From: Jingu Kang <Jingu...@arm.com>
> Sent: 21 June 2021 14:27
> To: Michael Kruse <llv...@meinersbur.de>; Kyrylo Tkachov <Kyrylo....@arm.com>; Sjoerd Meijer <Sjoerd...@arm.com>
> Cc: llvm...@lists.llvm.org <llvm...@lists.llvm.org>
> Subject: RE: [llvm-dev] Enabling Loop Distribution Pass as default in the pipeline of new pass manager
For some reason I cannot find this email in my inbox, although it was
definitely sent to the mailing-list:
https://lists.llvm.org/pipermail/llvm-dev/2021-June/151306.html
So I am replying within Sjoerd's email.
> Regarding treating the LoopDistribute pass as a canonicalization with the profitability heuristic in the LoopFuse pass, it looks like the LoopFuse pass also does not have a proper profitability function.
Within the loop optimization working group we were considering adding
a heuristic to LoopFuse, which is also not restricted to innermost
loops. However, the advantage is that it could run after LoopVectorize
and re-fuse loops that turned out to be non-vectorizable, or loops
that have been vectorized independently. Unfortunately I think the
legality/profitability analysis is comparatively expensive since it
does not use LoopAccessAnalysis.
Do you have any data on how often LoopDistribute triggers on a larger set of programs (like llvm-test-suite + SPEC)? AFAIK the implementation is very limited at the moment (geared towards catching the case in hmmer) and I suspect lack of generality is one of the reasons why it is not enabled by default yet.
Also, there’s been an effort to improve the cost-modeling for LoopDistribute (https://reviews.llvm.org/D100381). Should we make progress in that direction first, before enabling by default?
Ping.
Additionally, I was not able to see the pass triggered on the llvm-test-suite and SPEC benchmarks except for hmmer.
Thanks
JinGu Kang
From: Sanne Wouda <Sanne...@arm.com>
Sent: 25 June 2021 11:23
To: Jingu Kang <Jingu...@arm.com>; llvm...@lists.llvm.org; Florian Hahn <floria...@apple.com>
Cc: ni...@php.net
Subject: Re: [llvm-dev] Enabling Loop Distribution Pass as default in the pipeline of new pass manager
Hi,
> On 5 Jul 2021, at 15:40, Jingu Kang <jingu...@arm.com> wrote:
>
> Ping.
>
> Additionally, I was not able to see the pass triggered on the llvm-test-suite and SPEC benchmarks except for hmmer.

> On 25 Jun 2021, at 12:23, Sanne Wouda <Sanne...@arm.com> wrote:
>
> Hi,
>
> Do you have any data on how often LoopDistribute triggers on a larger set of programs (like llvm-test-suite + SPEC)? AFAIK the implementation is very limited at the moment (geared towards catching the case in hmmer) and I suspect lack of generality is one of the reasons why it is not enabled by default yet.

It would be good to have some fresh numbers on how often LoopDistribute triggers. From what I remember, there are a handful of cases in the test suite, but nothing that significantly affects performance (other than hmmer, obviously).

> Also, there’s been an effort to improve the cost-modeling for LoopDistribute (https://reviews.llvm.org/D100381). Should we make progress in that direction first, before enabling by default?

Unfortunately, there were some problems with this effort. First, the current implementation of LoopDistribute relies heavily on LoopAccessAnalysis, which made it difficult to adapt.

More importantly though, I'm not convinced that LoopDistribute will be beneficial other than in cases where it enables more vectorization. (The memcpy detection in GCC might be interesting; I didn't look at that.) It reduces both ILP and MLP, which in some cases might be made up for by lower register or cache pressure, but this is hard or impossible for the compiler to know.
While working on this, with a more aggressive LoopDistribute across several benchmarks, I did not see any improvements that didn't turn out to be noise, and plenty of cases where it was actively degrading performance.
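To make the ILP/MLP point above concrete, here is a purely illustrative sketch (not taken from any benchmark): in the single loop the two statements issue independent memory operations whose cache misses can overlap, while after distribution x[] and y[] are streamed through twice and there is less independent work per iteration.

/* Illustrative only. */
void combined(float *restrict a, float *restrict b,
              const float *restrict x, const float *restrict y, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = x[i] * y[i]; /* misses on x[]/y[] can overlap with ... */
    b[i] = x[i] + y[i]; /* ... the work of the second statement   */
  }
}

void distributed(float *restrict a, float *restrict b,
                 const float *restrict x, const float *restrict y, int n) {
  for (int i = 0; i < n; i++)
    a[i] = x[i] * y[i]; /* x[] and y[] streamed through once here */
  for (int i = 0; i < n; i++)
    b[i] = x[i] + y[i]; /* ... and again here                     */
}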
Hi Florian,
Thanks for your kind reply.
On almost all tests, the Loop Distribute pass was not triggered; it bailed out with one of the messages below.
Skipping; memory operations are safe for vectorization
Skipping; no unsafe dependences to isolate
Skipping; multiple exit blocks
It looks like the first and second messages are reasonable.
For the third message, we could try to improve the pass to handle loops with multiple exit blocks (a sketch of such a loop is below), but I am not sure how much effort that would take.
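For reference, the "multiple exit blocks" case is a loop like the hypothetical one below, where the early return adds a second exit, so LoopDistribute currently bails out on this shape.

/* Hypothetical example of a multi-exit loop that is skipped today. */
int find_negative_and_scale(float *restrict a, const float *restrict x, int n) {
  for (int i = 0; i < n; i++) {
    if (x[i] < 0.0f)
      return i;          /* second exit out of the loop */
    a[i] = x[i] * 2.0f;
  }
  return -1;
}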
Thanks
JinGu Kang
From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of
Florian Hahn via llvm-dev
Sent: 14 July 2021 13:06
To: Jingu Kang <Jingu...@arm.com>; llvm...@lists.llvm.org
Cc: ni...@php.net
Subject: Re: [llvm-dev] Enabling Loop Distribution Pass as default in the pipeline of new pass manager
Hi,