Hi All,
My colleague Sanne found a performance improvement on hmmer from SPEC2006 with the ‘-enable-loop-distribute’ option.
In hmmer, there is a loop with a loop-carried dependence. The Loop Distribute pass splits it into three separate loops: one still has the dependence, another is vectorizable, and the third becomes vectorizable after running the LoopBoundSplit pass (which needs a small update). On AArch64, we have seen a 40% improvement on hmmer from SPEC2006 with the Loop Distribute pass enabled, and an 80% improvement with both Loop Distribute and LoopBoundSplit enabled. (A sketch of this loop shape follows below.)
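For illustration, here is a minimal C sketch of the loop shape described above. It is not the actual hmmer source and the names are made up; it only shows how a loop mixing a recurrence with independent work can be split so that the recurrence stays sequential while the rest becomes vectorizable.

/* Illustrative only, not the hmmer code. */
void sketch(int *restrict a, const int *restrict b,
            int *restrict c, int *restrict d, int n) {
  for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i]; /* loop-carried dependence          */
    c[i] = b[i] * 2;        /* independent of a[], vectorizable */
    d[i] = c[i] + b[i];     /* independent of a[], vectorizable */
  }
}

/* Conceptually, after Loop Distribute: */
void sketch_distributed(int *restrict a, const int *restrict b,
                        int *restrict c, int *restrict d, int n) {
  for (int i = 1; i < n; i++)
    a[i] = a[i - 1] + b[i]; /* keeps the dependence, stays scalar */
  for (int i = 1; i < n; i++)
    c[i] = b[i] * 2;        /* vectorizable */
  for (int i = 1; i < n; i++)
    d[i] = c[i] + b[i];     /* vectorizable */
}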
From the llvm-test-suite and SPEC benchmarks, I have not seen any performance degradation with the Loop Distribute pass enabled, because almost all tests are not transformed by the pass; it mostly bails out with the messages below, which I think are reasonable.
Skipping; memory operations are safe for vectorization
Skipping; no unsafe dependences to isolate
Skipping; multiple exit blocks
For compile time, there is no significant change, because almost all tests are rejected by the pass for mainly the three reasons above, and those checks are answered from cached analysis information.
At the moment, we can enable the pass per loop with metadata or globally with a command line option (examples below). If possible, can we enable the Loop Distribute pass by default in the new pass manager pipeline?
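For reference, these are the two opt-in mechanisms meant above; the pragma and the flag are the existing ones, while the loop itself is just an example.

/* Per-loop opt-in: the pragma attaches llvm.loop.distribute.enable metadata. */
void saxpy(float *restrict x, float *restrict y, float a, int n) {
#pragma clang loop distribute(enable)
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

/* Global opt-in via the command line:
 *   clang -O3 -mllvm -enable-loop-distribute file.c
 */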
Thanks
JinGu Kang
I’d be in favour of enabling loop distribution by default as long as it doesn’t hurt compile-time when it’s not needed.
FWIW GCC enables this by default to get the speedup on hmmer. I don’t know enough about the LLVM implementation to compare with GCC’s, but GCC’s loop distribution pass aims to help vectorisation and help detect manual memset, memcpy implementations (I think LLVM does that detection in another pass).
You can read the high-level GCC design in the source: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/tree-loop-distribution.c;h=65aa1df4abae2c6acf40299f710bc62ee6bacc07;hb=HEAD#l39
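To make the memset/memcpy detection concrete, here is a purely illustrative loop (the function is hypothetical). GCC's loop distribution can split out such statements and turn them into library calls, while in LLVM this kind of idiom is recognised by a separate pass, as noted above.

#include <stddef.h>

/* Hypothetical example of manual memset/memcpy mixed into one loop. */
void clear_and_copy(char *restrict zeroed, char *restrict dst,
                    const char *restrict src, size_t n) {
  for (size_t i = 0; i < n; i++) {
    zeroed[i] = 0;   /* distributable into memset(zeroed, 0, n) */
    dst[i] = src[i]; /* distributable into memcpy(dst, src, n)  */
  }
}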
Thanks,
Kyrill
Turning on -enable-loop-distribute by default would mean that we
consider loop distribution to be usually beneficial without causing
major regressions. We need a lot more data to support that conclusion.
Alternatively, we could consider loop distribution a canonicalization.
A later LoopFuse pass would apply a profitability heuristic to re-fuse
the loops if loop distribution did not gain anything (sketched below).
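A hypothetical sketch of that direction (none of this is current pass output; the functions are placeholders):

/* Canonical form after an unconditional LoopDistribute: */
void after_distribution(float *restrict a, float *restrict b,
                        const float *restrict x, int n) {
  for (int i = 0; i < n; i++)
    a[i] = x[i] * 2.0f;
  for (int i = 0; i < n; i++)
    b[i] = a[i] + x[i];
}

/* If neither loop ends up vectorized, a profitability-driven LoopFuse
 * could merge them back into the original single loop: */
void after_refusion(float *restrict a, float *restrict b,
                    const float *restrict x, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = x[i] * 2.0f;
    b[i] = a[i] + x[i];
  }
}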
Michael
The compile time data is below. There may be some noise, but there appears to be no significant compile time regression.
From llvm-test-suite
Metric: compile_time
Program results_base results_loop_dist diff
test-suite...arks/VersaBench/dbms/dbms.test 0.94 0.95 1.6%
test-suite...s/MallocBench/cfrac/cfrac.test 0.89 0.90 1.5%
test-suite...ks/Prolangs-C/gnugo/gnugo.test 0.72 0.73 1.4%
test-suite...yApps-C++/PENNANT/PENNANT.test 8.65 8.75 1.2%
test-suite...marks/Ptrdist/yacr2/yacr2.test 0.84 0.85 1.1%
test-suite.../Builtins/Int128/Builtins.test 0.86 0.87 1.0%
test-suite...s/ASC_Sequoia/AMGmk/AMGmk.test 0.69 0.70 1.0%
test-suite...decode/alacconvert-decode.test 1.16 1.17 0.9%
test-suite...encode/alacconvert-encode.test 1.16 1.17 0.9%
test-suite...peg2/mpeg2dec/mpeg2decode.test 1.71 1.72 0.9%
test-suite.../Applications/spiff/spiff.test 0.88 0.89 0.9%
test-suite...terpolation/Interpolation.test 0.96 0.97 0.9%
test-suite...chmarks/MallocBench/gs/gs.test 4.58 4.62 0.9%
test-suite...-C++/stepanov_abstraction.test 0.69 0.70 0.8%
test-suite...marks/7zip/7zip-benchmark.test 52.35 52.74 0.7%
Geomean difference nan%
results_base results_loop_dist diff
count 117.000000 118.000000 117.000000
mean 4.636126 4.616575 0.002171
std 7.725991 7.737663 0.006310
min 0.607300 0.602200 -0.041930
25% 1.345700 1.313650 -0.001577
50% 1.887000 1.888800 0.002463
75% 4.340800 4.343275 0.005754
max 52.351200 52.736000 0.015861
From SPEC2017
| benchmarks      | baseline | enable-loop-distribute | diff (seconds) |
| 500.perlbench_r | 00:01:06 | 00:01:04               | -2             |
| 502.gcc_r       | 00:05:24 | 00:05:25               |  1             |
| 505.mcf_r       | 00:00:02 | 00:00:02               |  0             |
| 520.omnetpp_r   | 00:00:58 | 00:00:58               |  0             |
| 523.xalancbmk_r | 00:02:30 | 00:02:30               |  0             |
| 525.x264_r      | 00:00:32 | 00:00:31               | -1             |
| 531.deepsjeng_r | 00:00:04 | 00:00:04               |  0             |
| 541.leela_r     | 00:00:06 | 00:00:06               |  0             |
| 557.xz_r        | 00:00:05 | 00:00:05               |  0             |
| 999.specrand_ir | 00:00:01 | 00:00:00               |  1             |
From SPEC2006 (diff in seconds)
| benchmarks     | baseline | enable-loop-distribute | diff (seconds) |
| 400.perlbench  | 00:00:29 | 00:00:29               |  0             |
| 401.bzip2      | 00:00:04 | 00:00:03               | -1             |
| 403.gcc        | 00:01:28 | 00:01:26               | -2             |
| 429.mcf        | 00:00:01 | 00:00:01               |  0             |
| 445.gobmk      | 00:00:24 | 00:00:24               |  0             |
| 456.hmmer      | 00:00:06 | 00:00:06               |  0             |
| 458.sjeng      | 00:00:03 | 00:00:03               |  0             |
| 462.libquantum | 00:00:03 | 00:00:02               | -1             |
| 464.h264ref    | 00:00:29 | 00:00:29               |  0             |
| 471.omnetpp    | 00:00:23 | 00:00:24               |  1             |
| 473.astar      | 00:00:02 | 00:00:02               |  0             |
| 483.xalancbmk  | 00:02:07 | 00:02:06               | -1             |
| 999.specrand   | 00:00:01 | 00:00:01               |  0             |
Thanks
JinGu Kang
From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of
Sjoerd Meijer via llvm-dev
LoopDistribute currently already iterates over all loops to find the
llvm.loop.distribute.enable metadata. The additional compile-time
overhead would be LoopAccessAnalysis, which could be cheap if
LoopAccessAnalysis is used for LoopVectorize anyway.
> ________________________________
> From: Jingu Kang <Jingu...@arm.com>
> Sent: 21 June 2021 14:27
> To: Michael Kruse <llv...@meinersbur.de>; Kyrylo Tkachov <Kyrylo....@arm.com>; Sjoerd Meijer <Sjoerd...@arm.com>
> Cc: llvm...@lists.llvm.org <llvm...@lists.llvm.org>
> Subject: RE: [llvm-dev] Enabling Loop Distribution Pass as default in the pipeline of new pass manager
For some reason I cannot find this email in my inbox, although it was
definitely sent to the mailing-list:
https://lists.llvm.org/pipermail/llvm-dev/2021-June/151306.html
So I am replying within Sjoerd's email.
> Regarding treating the LoopDistribute pass as a canonicalization with the profitability heuristic in the LoopFuse pass, it looks like the LoopFuse pass also does not have a proper profitability function.
Within the loop optimization working group we were considering adding
a heuristic to LoopFuse, which is also not restricted to innermost
loops. However, the advantage is that it could run after LoopVectorize
and re-fuse loops that turned out to be non-vectorizable, or loops
that have been vectorized independently. Unfortunately I think the
legality/profitability analysis is comparatively expensive since it
does not use LoopAccessAnalysis.
Do you have any data on how often LoopDistribute triggers on a larger set of programs (like llvm-test-suite + SPEC)? AFAIK the implementation is very limited at the moment (geared towards catching the case in hmmer) and I suspect lack of generality is one of the reasons why it is not enabled by default yet.
Also, there’s been an effort to improve the cost-modeling for LoopDistribute (https://reviews.llvm.org/D100381). Should we make progress in that direction first, before enabling by default?
Ping.
Additionally, I was not able to see the pass triggered on the llvm-test-suite and SPEC benchmarks except for hmmer.
Thanks
JinGu Kang
From: Sanne Wouda <Sanne...@arm.com>
Sent: 25 June 2021 11:23
To: Jingu Kang <Jingu...@arm.com>; llvm...@lists.llvm.org; Florian Hahn <floria...@apple.com>
Cc: ni...@php.net
Subject: Re: [llvm-dev] Enabling Loop Distribution Pass as default in the pipeline of new pass manager
Hi,
> On 5 Jul 2021, at 15:40, Jingu Kang <jingu...@arm.com> wrote:
>
> Ping.
>
> Additionally, I was not able to see the pass triggered on the llvm-test-suite and SPEC benchmarks except for hmmer.

> On 25 Jun 2021, at 12:23, Sanne Wouda <Sanne...@arm.com> wrote:
>
> Hi,
>
> Do you have any data on how often LoopDistribute triggers on a larger set of programs (like llvm-test-suite + SPEC)? AFAIK the implementation is very limited at the moment (geared towards catching the case in hmmer) and I suspect lack of generality is one of the reasons why it is not enabled by default yet.

It would be good to have some fresh numbers on how often LoopDistribute triggers. From what I remember, there are a handful of cases in the test suite, but nothing that significantly affects performance (other than hmmer, obviously).

> Also, there’s been an effort to improve the cost-modeling for LoopDistribute (https://reviews.llvm.org/D100381). Should we make progress in that direction first, before enabling by default?

Unfortunately, there were some problems with this effort. First, the current implementation of LoopDistribute relies heavily on LoopAccessAnalysis, which made it difficult to adapt.

More importantly though, I'm not convinced that LoopDistribute will be beneficial other than in cases where it enables more vectorization. (The memcpy detection in GCC might be interesting; I didn't look at that.) It reduces both ILP and MLP, which in some cases might be made up for by lower register or cache pressure, but this is hard or impossible for the compiler to know.
While working on this, with a more aggressive LoopDistribute across several benchmarks, I did not see any improvements that didn't turn out to be noise, and plenty of cases where it was actively degrading performance.
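To make the ILP/MLP point above concrete, here is a purely illustrative sketch (not taken from any benchmark): in the single loop the two statements issue independent memory operations whose cache misses can overlap, while after distribution x[] and y[] are streamed through twice and there is less independent work per iteration.

/* Illustrative only. */
void combined(float *restrict a, float *restrict b,
              const float *restrict x, const float *restrict y, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = x[i] * y[i]; /* misses on x[]/y[] can overlap with ... */
    b[i] = x[i] + y[i]; /* ... the work of the second statement   */
  }
}

void distributed(float *restrict a, float *restrict b,
                 const float *restrict x, const float *restrict y, int n) {
  for (int i = 0; i < n; i++)
    a[i] = x[i] * y[i]; /* x[] and y[] streamed through once here */
  for (int i = 0; i < n; i++)
    b[i] = x[i] + y[i]; /* ... and again here                     */
}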
Hi Florian,
Thanks for your kind reply.
On almost all tests, the Loop Distribute pass was not triggered; it bailed out with one of the messages below.
Skipping; memory operations are safe for vectorization
Skipping; no unsafe dependences to isolate
Skipping; multiple exit blocks
It looks like the first and second messages are reasonable.
For the third message, we could try to improve the pass to handle loops with multiple exit blocks (a sketch of such a loop is below), but I am not sure how much effort that would take.
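For reference, the "multiple exit blocks" case is a loop like the hypothetical one below, where the early return adds a second exit, so LoopDistribute currently bails out on this shape.

/* Hypothetical example of a multi-exit loop that is skipped today. */
int find_negative_and_scale(float *restrict a, const float *restrict x, int n) {
  for (int i = 0; i < n; i++) {
    if (x[i] < 0.0f)
      return i;          /* second exit out of the loop */
    a[i] = x[i] * 2.0f;
  }
  return -1;
}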
Thanks
JinGu Kang
From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of
Florian Hahn via llvm-dev
Sent: 14 July 2021 13:06
To: Jingu Kang <Jingu...@arm.com>; llvm...@lists.llvm.org
Cc: ni...@php.net
Subject: Re: [llvm-dev] Enabling Loop Distribution Pass as default in the pipeline of new pass manager
Hi,