+ AOSP LLVM's public email list, CBL
On Wed, Mar 11, 2020 at 4:40 PM Danny Lin <
da...@kdrag0n.dev> wrote:
>
> Hi Nick,
Hi Danny, great to hear from you. Thank you for taking the time to
write all this up! I can tell a lot of care was put into it.
>
> Polly is LLVM's polyhedral loop optimizer, which analyzes loops and optimizes them for cache locality as well as reduced memory accesses, similar to GCC Graphite. It can also perform automatic loop parallelization using OpenMP, but that's definitely out-of-scope for the kernel and most likely for Android as well. You can read more about it here:
https://polly.llvm.org/
I admit I don't know too much about Polly (though I can describe it
better than most people); my opinion is limited to my experience on
TensorFlow's XLA compiler, which intentionally didn't use Polly,
though that team has largely been succeeded by folks working on MLIR
who are now actively researching Polly.
>
> LLVM needs to be built with the "polly" project enabled in LLVM_ENABLE_PROJECTS for it to be available. After that you need to enable the use of Polly by passing flags when invoking Clang. These are LLVM flags, not Clang ones, so each one needs to have "-mllvm" before it. Below is a list of the ones I've tested, along with the explanations to the best of my knowledge:
It's probably low-hanging fruit for us to enable Polly in the
configuration of AOSP LLVM, since making use of it at compile time is
also gated on the flags you describe below. That should give us
fine-grained control to disable it for individual projects should they
miscompile with it enabled, if any.
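For anyone following along, a build configuration with Polly enabled might look roughly like this (the generator, build type, and paths are my assumptions, not something specified in this thread):

```shell
# Sketch: configure an LLVM build with the Polly project enabled.
# Run from a build directory alongside a monorepo checkout; the
# generator and build type here are illustrative choices.
cmake -G Ninja ../llvm-project/llvm \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang;polly"
ninja clang
```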
>
> -polly: Base flag to enable Polly itself
> -polly-run-dce: Polyhedral dead code elimination, analyzes and eliminates statements that can be proven dead (
https://polly.llvm.org/doxygen/DeadCodeElimination_8cpp_source.html)
> -polly-run-inliner: Run an early LLVM inlining pass before running Polly
> -polly-opt-fusion=max: Loop fusion strategy (default: min)
> -polly-ast-use-context: Pass context around loops to the optimizer so that it can make better decisions
> -polly-detect-keep-going: Don't fail on the first error encountered (this is probably a bad idea)
> -polly-vectorizer=stripmine: Generate vector code automatically (
https://polly.llvm.org/docs/UsingPollyWithClang.html#automatic-vector-code-generation)
> -polly-invariant-load-hoisting: Hoist loads of invariant memory values out of loops, when possible (
https://reviews.llvm.org/D31842)
Thanks for the research (and testing of D31842, though seeing Johannes
resign is curious); this is a great starting point for folks looking
to try Polly!
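For reference, a full Clang invocation using the flags above might look like this (a sketch; the input file name and -O2 are placeholders of mine, and I've left out -polly-detect-keep-going per the caveat above):

```shell
# Sketch: each Polly option is an LLVM flag, so every one must be
# prefixed with -mllvm on the Clang command line.
clang -O2 \
    -mllvm -polly \
    -mllvm -polly-run-dce \
    -mllvm -polly-run-inliner \
    -mllvm -polly-opt-fusion=max \
    -mllvm -polly-ast-use-context \
    -mllvm -polly-vectorizer=stripmine \
    -mllvm -polly-invariant-load-hoisting \
    -c loops.c -o loops.o
```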
>
> I have a kernel commit that exposes the ones I deemed useful through a Kconfig option:
https://github.com/kdrag0n/proton_zf6/commit/00f711eead423
> And a prebuilt toolchain with Polly support that can be used for preliminary testing and evaluation:
https://github.com/kdrag0n/proton-clang
So no new compiler warnings, boot issues, or otherwise noticeable
runtime issues? That's impressive, and worth paying attention to.
Thanks, too, for all of the work you do on Proton kernels (and
toolchains), which I've watched from afar. The Android ROM scene on
XDA is a vibrant and competitive edge that Android has in the market.
> While it hasn't provided much of a performance improvement for kernels in the past, I've recently done some new tests and it looks like that's changed drastically. On my 4.14 kernel, Polly is now showing a larger performance improvement than LTO in terms of hackbench times. Without LTO the improvement is 14% over not using Polly, and with LTO it's 10% — still substantial. The results are available here:
https://docs.google.com/spreadsheets/d/1mhjyshujZz8jYI7dMoCe-yFbxymW-fWaC08vMhBbEmQ/edit
That's a good start, though hackbench is not the be-all and end-all of
benchmarks, and N=3 isn't statistically significant. Internally, we
have a composite of numerous first- and third-party performance test
suites. Quantifying it as a speedup relative to what we saw with LTO
is brilliant, and eye-opening, though!
> At least with the kernel, I haven't observed any noticeable differences in compile times.
Impressive as well. Another thing we measure is binary size; kernel
boot time is strongly correlated with kernel image size (decompression
time scales with image size). But everything is a tradeoff;
quantifying the tradeoff is important.
> I'd be willing to submit an AOSP LLVM patch to enable Polly given instructions on doing so.
Great! Personally, it would make me very happy to help you do this; I
think Google and Android in general could be doing more outreach to
the fantastic XDA developers who sacrifice a lot of their time and
money to improve the Android ecosystem. I'm committed to changing the
status quo in that regard.
If you clone this repo:
https://android.googlesource.com/toolchain/llvm_android/
and modify the stage 2 CMake flags (stage2_extra_defines) in do_build.py:
https://android.googlesource.com/toolchain/llvm_android/+/refs/heads/master/do_build.py#1371
Send me a patch on
https://android-review.googlesource.com/ and we'll
go from there.
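A rough sketch of that workflow (the Gerrit push syntax is standard usage, not something specified above, and the commit message is a placeholder):

```shell
# Sketch only: clone the AOSP LLVM build scripts, edit the stage 2
# defines, and upload the change to android-review for code review.
git clone https://android.googlesource.com/toolchain/llvm_android
cd llvm_android
# ...edit stage2_extra_defines in do_build.py to add Polly...
git commit -a -m "Enable Polly in the stage 2 build"
git push origin HEAD:refs/for/master
```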
>
> At any rate, I'd be curious to see the results of your testing and whether there are any stability issues with it in production. Let me know if you need more info!
--
Thanks,
~Nick Desaulniers