Having done no tests at all on my side, I think we need to have
similar numbers on Linux to be able to flip across the board.
I don't want to flip it only for Darwin and not Linux, as that will
fragment the effort too much.
I'll check with Diana and Kristof to know what's the best way forward,
but it should be reasonably quick.
cheers,
--renato
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Just a heads-up for anyone building things with CMake: if you use the
default build type (Release), it will put CMAKE_CXX_FLAGS_RELEASE
after the CMAKE_CXX_FLAGS, so you'll end up running -O3 instead of
-O0. You might want to check your command lines.
Kristof has also found an issue in the test-suite where it actually
runs -O2 due to some weirdness in the build system, we'll try to
commit a fix for that.
Cheers,
Diana
On 30 March 2017 at 16:13, Evgeny Astigeevich via llvm-dev
It does actually work if you add "-fno-vectorize -fno-slp-vectorize"
to CMAKE_CXX_FLAGS_RELASE (and C). Not a standard configuration, but
probably throwing more varied IR at it than just doing -O0.
Cheers.
Tim.
Here's my results so far:
On the test-suite, we get 2 timeouts during execution (paq8p and
scimark2). Other than that, everything seems to run just fine. We have
about 7410 unique fallbacks out of 52367 unique functions. Assuming I
counted right (please let me know if this looks fishy, I personally
find the total number of functions terrifyingly small).
On a stage 2 build of clang, we run check-all successfully. We have
about 64784 unique fallbacks out of 661461 functions.
I'm currently trying to run a stage 3 build to compare binaries (just
for kicks) and I'll also try to do some runs without fallbacks and
count what problems we run into most often.
The way I've been counting the total number of functions was to run
objdump -t on all the .o files in the build directory, grab everything
with an "F" flag and remove duplicates. If anyone knows a better way
to do this I'm all ears.
Thanks,
Diana
Interesting. I'd not seen those failures in the configurations I'd
run. I'll look into them (other than that my best bet for debugging is
a kernel panic, this has to be easier!).
> On the test-suite, we get 2 timeouts during execution (paq8p and
> scimark2). Other than that, everything seems to run just fine. We have
> about 7410 unique fallbacks out of 52367 unique functions. Assuming I
> counted right (please let me know if this looks fishy, I personally
> find the total number of functions terrifyingly small).
It's about 275000 before uniqueing (which roughly matches an earlier
measurement I did). The duplication seems to be dominated by the
halide tests, though you have to be a little careful with "main"
(which occurs about 500 times). At a glance I'd put the total closer
to 53000 (52420 + 492 extra copies of main), but that's a small
difference to your figure.
Tim.
I just ran the tests on Linux. Those two were last to finish, but they
did terminate eventually for me. Bother, back to the kernel.
So, FastISel on AArch64 isn't really an "O0" selector: it has a lot
of smarts and peepholes, because some JIT users had it as the main
optimizing selector for a while.
In that sense, it's a pretty aggressive target that IMO we don't have to match.
> * FastISel doesn't seem to handle functions with switch statements, so it
> falls back to DAGISel. DAGISel produces code that's a lot better than
> GlobalISel for switch statement at -O0. I'm not sure if we need to do
> something here before enabling GlobalISel by default. I'm thinking we may
> need to add a smarter way to lower switch statements rather than just a
> cascaded sequence of conditional branches.
D31080 seems promising, I've been wanting to take a look, hoping we
can use that to emit an optimized lowering. I'm not sure we want that
at O0 though (even if only for FastISel+DAGISel parity).
> I'll try to add the above content to the document Diana created at
> https://goo.gl/IS2Bdw too.
Thanks for the investigation! These are also some of the biggest
problems I've seen (in particular the FP regbanks).
I'll make sure I find the time to file bugs for all the other issues
I'm aware of. (sorry I haven't done that earlier!)
-Ahmed
But how do the compile-times compare vs FastIsel?
On 6 Apr 2017, at 21:06, Ahmed Bougacha <ahmed.b...@gmail.com> wrote:
On Thu, Apr 6, 2017 at 6:53 AM, Kristof Beyls via llvm-dev
<llvm...@lists.llvm.org> wrote:
I've been digging a little bit deeper into the biggest performance
regressions I've observed.
What I've observed so far is:
* A lot of the biggest regressions are caused by unnecessarily moving
floating point values through general purpose registers. I've raised
http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one
definitely needs fixing before enabling GlobalISel by default at -O0.
* FastISel seems to transform division-by-constant-power-of-2 into right
shift (see
https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468).
GlobalISel does not. It seems to me that at -O0 there may be reasons not
perform this transformation, but maybe there is a good reason why FastISel
does this?
So, FastISel on AArch64 isn't really an "O0" selector: it has a lot
of smarts and peepholes, because some JIT users had it as the main
optimizing selector for a while.
In that sense, it's a pretty aggressive target that IMO we don't have to match.
* FastISel doesn't seem to handle functions with switch statements, so it
falls back to DAGISel. DAGISel produces code that's a lot better than
GlobalISel for switch statement at -O0. I'm not sure if we need to do
something here before enabling GlobalISel by default. I'm thinking we may
need to add a smarter way to lower switch statements rather than just a
cascaded sequence of conditional branches.
D31080 seems promising, I've been wanting to take a look, hoping we
can use that to emit an optimized lowering. I'm not sure we want that
at O0 though (even if only for FastISel+DAGISel parity).
I'll try to add the above content to the document Diana created at
https://goo.gl/IS2Bdw too.
Thanks for the investigation! These are also some of the biggest
problems I've seen (in particular the FP regbanks).
I'll make sure I find the time to file bugs for all the other issues
I'm aware of. (sorry I haven't done that earlier!)
+1, we already do this for lots of other things (inline assembly bugs,
bugs building the Linux kernel etc) and it's a good way to keep things
organized.
On Apr 6, 2017, at 6:53 AM, Kristof Beyls <kristo...@arm.com> wrote:I've been digging a little bit deeper into the biggest performance regressions I've observed.
What I've observed so far is:* A lot of the biggest regressions are caused by unnecessarily moving floating point values through general purpose registers. I've raised http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one definitely needs fixing before enabling GlobalISel by default at -O0.
* FastISel seems to transform division-by-constant-power-of-2 into right shift (see https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468). GlobalISel does not. It seems to me that at -O0 there may be reasons not perform this transformation, but maybe there is a good reason why FastISel does this?
* FastISel doesn’t\ seem to handle functions with switch statements, so it falls back to DAGISel. DAGISel produces code that's a lot better than GlobalISel for switch statement at -O0. I'm not sure if we need to do something here before enabling GlobalISel by default. I'm thinking we may need to add a smarter way to lower switch statements rather than just a cascaded sequence of conditional branches.
On 27 Apr 2017, at 00:48, Quentin Colombet <qcol...@apple.com> wrote:
Hi Kristof,
On Apr 6, 2017, at 6:53 AM, Kristof Beyls <kristo...@arm.com> wrote:
I've been digging a little bit deeper into the biggest performance regressions I've observed.
What I've observed so far is:* A lot of the biggest regressions are caused by unnecessarily moving floating point values through general purpose registers. I've raised http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one definitely needs fixing before enabling GlobalISel by default at -O0.
I commented in the PR. This is a known problem and we have a solution. Given this is an optimization in the sense that it does not affect the correctness of the program, we didn’t push for fixing it now.
For O0 we wanted to focus ourselves on generating correct code. Unless the regressions you are seeing are preventing debugging/running of the program, I wouldn’t block the flip of the switch on that.
What do you think?
* FastISel seems to transform division-by-constant-power-of-2 into right shift (see https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468). GlobalISel does not. It seems to me that at -O0 there may be reasons not perform this transformation, but maybe there is a good reason why FastISel does this?
I think FastISel tries to generate the best code it can no matter what. For GISel O0 however, not doing this optimization sounds sensible to me.Now, I would say that the same remark as the previous bullet point apply: we shouldn’t do it unless it gets in the way of running/debugging the program.
* FastISel doesn’t\ seem to handle functions with switch statements, so it falls back to DAGISel. DAGISel produces code that's a lot better than GlobalISel for switch statement at -O0. I'm not sure if we need to do something here before enabling GlobalISel by default. I'm thinking we may need to add a smarter way to lower switch statements rather than just a cascaded sequence of conditional branches.
Sounds optimization-ish to me. Same remark.
On Apr 27, 2017, at 9:47 AM, Kristof Beyls <kristo...@arm.com> wrote:Hi Quentin,On 27 Apr 2017, at 00:48, Quentin Colombet <qcol...@apple.com> wrote:Hi Kristof,On Apr 6, 2017, at 6:53 AM, Kristof Beyls <kristo...@arm.com> wrote:I've been digging a little bit deeper into the biggest performance regressions I've observed.What I've observed so far is:* A lot of the biggest regressions are caused by unnecessarily moving floating point values through general purpose registers. I've raised http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one definitely needs fixing before enabling GlobalISel by default at -O0.I commented in the PR. This is a known problem and we have a solution. Given this is an optimization in the sense that it does not affect the correctness of the program, we didn’t push for fixing it now.For O0 we wanted to focus ourselves on generating correct code. Unless the regressions you are seeing are preventing debugging/running of the program, I wouldn’t block the flip of the switch on that.What do you think?For O0, I think most users care about fast compile time and an excellent debug experience.Next to that, I am told that some users also use -O0 to have a straightforward, simple, mapping between source code and the generated assembly.Out of the issues I've seen so far, I think this is the single "optimization" issue that I feel gives a negative first impression of GlobalISel.If users look at the generated assembly for floating point code it looks more like active de-optimization rather than "no optimization".My guess is also that this may be the biggest reason for the about 20% performance slow-down and 11% code size increase I measured earlier.OTOH, I clearly agree this is an optimization issue, not a correctness issue.Combining the above, I think it would be best if this issue got fixed before we turn on GlobalISel by default at -O0, or we may end up hearing quite a few (non-critical) complaints about this from users.Basically I think this is a tradeoff between giving a better first impression of GlobalISel vs getting more people to use and test it earlier.Thanks for the write-up on the PR, that is very useful.Do you have any rough idea how much effort would be involved in getting this fixed?
I got the impression Daniel is making good progress on the tablegen generation, which is key to getting this issue fixed?
I think that no matter whether we decide to switch the default before or after fixing this issue, this is one of the most urgent issues to fix as far as I can see.
If there is some way I can help contribute to fixing PR32550, I would like to help; but with the dependency on tablegen generation, I'm not sure what the best way is to help make that PR get fixed faster?
* FastISel seems to transform division-by-constant-power-of-2 into right shift (see https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468). GlobalISel does not. It seems to me that at -O0 there may be reasons not perform this transformation, but maybe there is a good reason why FastISel does this?I think FastISel tries to generate the best code it can no matter what. For GISel O0 however, not doing this optimization sounds sensible to me.Now, I would say that the same remark as the previous bullet point apply: we shouldn’t do it unless it gets in the way of running/debugging the program.I agree that these optimizations should not be done at -O0. I think not doing them is actually an improvement: you give the user what they asked, i.e. "no optimization", and an as-straigthforward-as-possible mapping from source to assembly.* FastISel doesn’t\ seem to handle functions with switch statements, so it falls back to DAGISel. DAGISel produces code that's a lot better than GlobalISel for switch statement at -O0. I'm not sure if we need to do something here before enabling GlobalISel by default. I'm thinking we may need to add a smarter way to lower switch statements rather than just a cascaded sequence of conditional branches.Sounds optimization-ish to me. Same remark.Agreed, optimization-ish. In comparison to the above point on FastISel "peepholes", I think that lowering switch statements to something else than a cascaded sequence of conditional branches doesn't make the generated code harder to map to the source. So, in comparison to the above point on FastISel "peepholes", I'm not actively against being smarter about lowering switch statements at -O0.But I agree, this shouldn't hold up turning on GlobalISel by default at -O0.Other than the above remarks, before turning on GlobalISel by default, we'd better test/verify that debug info quality remains good.I haven't looked into that at all, but am hoping to start looking into that soon, with the help of the DIVA tool (https://github.com/SNSystems/DIVA) presented at last EuroLLVM (http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf). I don't recall anyone so far making any statements about the quality of the debug info they've measured or experienced with GlobalISel enabled?
Other than the all of the above, I just wanted to mention that Oliver recently started running csmith on AArch64 GlobalISel and found one issue so far that already got fixed (https://bugs.llvm.org//show_bug.cgi?id=32733).If he finds other correctness issues, I'm sure he'll keep on reporting them.
On May 9, 2017, at 3:41 AM, Kristof Beyls <kristo...@arm.com> wrote:Great Quentin :).
I've rerun the benchmarks comparing "-O0 -g" with "-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2" on r302453, on AArch64 Cortex-A57.I indeed see almost no moves between GPR and FPR registers anymore (see details below for where I still see some).On geomean, I see 13% slow down (down from 17% on my previous run).On geomean, code size increase is about 3% (down from 11% on my previous run).I also checked debug info quality with the DIVA tool on the test-suite; and asked one of our DS-5 validation guys to test debug experience for GlobalISel in combination with the ARM DS-5 Development Studio debugger. Both of us didn't find any issues with the debug info produced at -O0 -g with GlobalISel.I didn't check compile time, nor do I intend to check that as I don't have a good infrastructure set up for that.
Overall, my impression is that GlobalISel is ready to be enabled by default for AArch64 at -O0, if others also believe it's ready.
- Other overall impression when comparing code generation: The code size increase is probably mainly due to creating constants always in the entry BB of the function. For functions with many constants, this leads to lots and lots of constant creation and then immediately spilling it to the stack. (https://bugs.llvm.org//show_bug.cgi?id=32561)
Hi Kristof,
On May 9, 2017, at 3:41 AM, Kristof Beyls <kristo...@arm.com> wrote:Great Quentin :).
I've rerun the benchmarks comparing "-O0 -g" with "-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2" on r302453, on AArch64 Cortex-A57.I indeed see almost no moves between GPR and FPR registers anymore (see details below for where I still see some).On geomean, I see 13% slow down (down from 17% on my previous run).On geomean, code size increase is about 3% (down from 11% on my previous run).I also checked debug info quality with the DIVA tool on the test-suite; and asked one of our DS-5 validation guys to test debug experience for GlobalISel in combination with the ARM DS-5 Development Studio debugger. Both of us didn't find any issues with the debug info produced at -O0 -g with GlobalISel.I didn't check compile time, nor do I intend to check that as I don't have a good infrastructure set up for that.
Overall, my impression is that GlobalISel is ready to be enabled by default for AArch64 at -O0, if others also believe it's ready.Sounds good to me!Renato, Diana, how does that sound?
On May 9, 2017, at 11:55 AM, Eric Christopher <echr...@gmail.com> wrote:Hi Quentin,On Tue, May 9, 2017 at 11:47 AM Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:Hi Kristof,On May 9, 2017, at 3:41 AM, Kristof Beyls <kristo...@arm.com> wrote:Great Quentin :).I've rerun the benchmarks comparing "-O0 -g" with "-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2" on r302453, on AArch64 Cortex-A57.I indeed see almost no moves between GPR and FPR registers anymore (see details below for where I still see some).On geomean, I see 13% slow down (down from 17% on my previous run).On geomean, code size increase is about 3% (down from 11% on my previous run).I also checked debug info quality with the DIVA tool on the test-suite; and asked one of our DS-5 validation guys to test debug experience for GlobalISel in combination with the ARM DS-5 Development Studio debugger. Both of us didn't find any issues with the debug info produced at -O0 -g with GlobalISel.I didn't check compile time, nor do I intend to check that as I don't have a good infrastructure set up for that.Overall, my impression is that GlobalISel is ready to be enabled by default for AArch64 at -O0, if others also believe it's ready.Sounds good to me!Renato, Diana, how does that sound?I'd definitely like to give this a run through on my side as well before you flip the switch. I'll try to do that this week.
Hi Quentin,
I'll defer to Diana, as she was looking into this. If she's happy, I'm happy. :)
On May 9, 2017, at 11:47 AM, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:
Hi Kristof,
On 10 May 2017, at 17:36, Quentin Colombet <qcol...@apple.com> wrote:
- MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode (46%): Function Reference_IDCT: Probably due to creating all constants in the entry BB + spilling floating point data through an X register:
- FastISel:
fadd d0, d1, d0
str d0, [sp,#528]- GlobalISel:
fadd d0, d1, d0
fmov x9, d0
stur x9, [x29,#-48]
Good finding, I forgot to do stores in my previous fix. I’ll do them shortly.
Should be fixed by r302679
I'm still running some validation on this, I'll send an email when
it's done. If that goes well I don't have anything against making the
switch.
For the record, here's a summary of issues that were deferred for
later on (some of which are optimization-ish and we might decide to
never do at -O0 at all):
* Crash in RegBankSelect for half fp types:
https://bugs.llvm.org/show_bug.cgi?id=32560
* Improving constant placement: http://bugs.llvm.org/show_bug.cgi?id=32561
* Fancy switch lowering
* Transforming division-by-constant-power-of-2 into right shift
AFAICT all the other issues that were brought up were fixed (yay!).
Cheers,
Diana
On 11 May 2017 at 08:44, Kristof Beyls via llvm-dev
On May 11, 2017, at 4:01 AM, Diana Picus <diana...@linaro.org> wrote:Hi all,
I'm still running some validation on this, I'll send an email when
it's done. If that goes well I don't have anything against making the
switch.
For the record, here's a summary of issues that were deferred for
later on (some of which are optimization-ish and we might decide to
never do at -O0 at all):
* Crash in RegBankSelect for half fp types:
https://bugs.llvm.org/show_bug.cgi?id=32560
* Fancy switch lowering
* Transforming division-by-constant-power-of-2 into right shift
AFAICT all the other issues that were brought up were fixed (yay!).
Cheers,
Diana
I ran into a little snag on the test-suite:
https://bugs.llvm.org/show_bug.cgi?id=33021
It boils down to GlobalISel generating calls to fabs instead of using
FABSDr (so we get undefined references).
Cheers,
Diana
I don't suppose it's crucial to handle the fabs intrinsic nicely at -O0.
I'm still investigating whether this is an actual GlobalISel issue or
something else (I'll also start a build on a more recent revision to
see how that behaves).
Cheers,
Diana
Indeed.
@Quentin, what is the status of that patch? Have you been working on
it since then? Should I investigate the segfault more and send you a
reproducer? Is this patch the way forward, or do you have other ideas
for reducing the stack usage?
Thanks,
Diana
> On May 18, 2017, at 1:15 AM, Diana Picus <diana...@linaro.org> wrote:
>
> On 18 May 2017 at 09:06, Kristof Beyls <Kristo...@arm.com> wrote:
>> I think Diana found that when enabling r299283, the bootstrap failed with
>> llvm-tblgen segfaulting.
>> So there clearly is some work required there.
>
> Indeed.
>
> @Quentin, what is the status of that patch?
Initially it was meant as a prototype to investigate how we could address those issues at O0. I didn’t mean to publish it.
> Have you been working on
> it since then?
No, I haven’t touched it and I honestly didn’t plan to do that.
> Should I investigate the segfault more and send you a
> reproducer?
Depends, do you guys want to pursue in that direction?
My though was that it is probably best to rely on a better reg allocator (basic or greedy) scheme for O0. I believe the only concern may be compile time but it may just fly (+ I have ideas how to make it better pretty easy).
> Is this patch the way forward, or do you have other ideas
> for reducing the stack usage?
Better reg alloc scheme for O0 :).
Not necessarily, if you think a better reg alloc scheme will do the
trick then I see no point in complicating the pass pipeline for now.
> My though was that it is probably best to rely on a better reg allocator (basic or greedy) scheme for O0. I believe the only concern may be compile time but it may just fly (+ I have ideas how to make it better pretty easy).
>
>> Is this patch the way forward, or do you have other ideas
>> for reducing the stack usage?
>
> Better reg alloc scheme for O0 :).
Ok, let's go with that then :)
On May 19, 2017, at 1:33 AM, Diana Picus <diana...@linaro.org> wrote:On 18 May 2017 at 19:09, Quentin Colombet <qcol...@apple.com> wrote:Hi Diana,On May 18, 2017, at 1:15 AM, Diana Picus <diana...@linaro.org> wrote:
On 18 May 2017 at 09:06, Kristof Beyls <Kristo...@arm.com> wrote:I think Diana found that when enabling r299283, the bootstrap failed with
llvm-tblgen segfaulting.
So there clearly is some work required there.
Indeed.
@Quentin, what is the status of that patch?
Initially it was meant as a prototype to investigate how we could address those issues at O0. I didn’t mean to publish it.Have you been working on
it since then?
No, I haven’t touched it and I honestly didn’t plan to do that.Should I investigate the segfault more and send you a
reproducer?
Depends, do you guys want to pursue in that direction?
Not necessarily, if you think a better reg alloc scheme will do the
trick then I see no point in complicating the pass pipeline for now.My though was that it is probably best to rely on a better reg allocator (basic or greedy) scheme for O0. I believe the only concern may be compile time but it may just fly (+ I have ideas how to make it better pretty easy).Is this patch the way forward, or do you have other ideas
for reducing the stack usage?
Better reg alloc scheme for O0 :).
Ok, let's go with that then :)
I actually did a run with -mllvm -optimize-regalloc -mllvm
-regalloc=greedy over the weekend and the test does pass with that.
Haven't measured the compile time though.
Cheers,
Diana
On May 22, 2017, at 12:51 AM, Kristof Beyls <kristo...@arm.com> wrote:On 22 May 2017, at 09:09, Diana Picus <diana...@linaro.org> wrote:
Hi Quentin,
I actually did a run with -mllvm -optimize-regalloc -mllvm
-regalloc=greedy over the weekend and the test does pass with that.
Haven't measured the compile time though.
Cheers,
Diana
I also did my usual benchmarking run with the same options as Diana did above:
- Comparing against -O0 without globalisel: 2.5% performance drop, 0.8% code size improvement.
- Comparing against -O0 without globalisel but with the above regalloc options: 5.6% performance drop, 1% code size drop.
In summary, the measurements indicate some good improvements.
I also haven't measure the impact on compile time.
Doing a few quick spot checks on the generated code, I still see some constants being created in the entry BB and stored on the stack.
On 23 May 2017, at 21:48, Quentin Colombet <qcol...@apple.com> wrote:
Great!I thought I had to look at our pipeline at O0 to make sure optimized regalloc was supported (https://bugs.llvm.org/show_bug.cgi?id=33022 in mind). Glad I was wrong, it saves me some time.
On May 22, 2017, at 12:51 AM, Kristof Beyls <kristo...@arm.com> wrote:
On 22 May 2017, at 09:09, Diana Picus <diana...@linaro.org> wrote:
Hi Quentin,
I actually did a run with -mllvm -optimize-regalloc -mllvm
-regalloc=greedy over the weekend and the test does pass with that.
Haven't measured the compile time though.
Cheers,
Diana
I also did my usual benchmarking run with the same options as Diana did above:
- Comparing against -O0 without globalisel: 2.5% performance drop, 0.8% code size improvement.
That’s compared to 9.5% performance drop and 2.8% code size regression, without that regalloc scheme, right?
- Comparing against -O0 without globalisel but with the above regalloc options: 5.6% performance drop, 1% code size drop.
In summary, the measurements indicate some good improvements.
I also haven't measure the impact on compile time.
Do you have a mean to make this measurement?Ahmed did a bunch of compile time measurements on our side and I wanted to see if I need to put him on the hook again :).
gisel | gisel+greedy | |
CTMark/7zip/7zip-benchmark | 102.8% | 106.5% |
CTMark/Bullet/bullet | 100.5% | 105.1% |
CTMark/ClamAV/clamscan | 101.6% | 130.8% |
CTMark/SPASS/SPASS | 101.2% | 120.0% |
CTMark/consumer-typeset/consumer-typeset | 105.7% | 138.2% |
CTMark/kimwitu++/kc | 103.1% | 122.6% |
CTMark/lencod/lencod | 106.2% | 143.4% |
CTMark/mafft/pairlocalalign | 96.2% | 135.4% |
CTMark/sqlite3/sqlite3 | 109.1% | 155.1% |
CTMark/tramp3d-v4/tramp3d-v4 | 109.1% | 132.0% |
GEOMEAN | 103.5% | 128.0% |
On 24 May 2017, at 19:31, Quentin Colombet <qcol...@apple.com> wrote:
Hi Kristof,
Thanks for the measurements.
On May 24, 2017, at 6:00 AM, Kristof Beyls <kristo...@arm.com> wrote:
- Comparing against -O0 without globalisel but with the above regalloc options: 5.6% performance drop, 1% code size drop.
In summary, the measurements indicate some good improvements.
I also haven't measure the impact on compile time.
Do you have a mean to make this measurement?Ahmed did a bunch of compile time measurements on our side and I wanted to see if I need to put him on the hook again :).
I did a quick setup with CTMark (part of the test-suite). I ran each of* '-O0 -g',* '-O0 -g -mllvm -global-isel=true -mllvm -global-isel-abort=0', and* '-O0 -g -mllvm -global-isel=true -mllvm -global-isel-abort=0 -mllvm -optimize-regalloc -mllvm -regalloc=greedy'5 times, cross-compiling from X86 to AArch64, and took the median measured compile times.In summary, I see GlobalISel having a compile time that's 3.5% higher than the current -O0 default.With enabling the greedy register allocator, this increases to 28%.28% is probably too high?
I think it is yes.I have attached a quick hack to the greedy allocator to feature a fast mode.Could you give it a try?
To enable the fast mode, please use (-mllvm) -regalloc-greedy-fast=true (default is false).
<regalloc-fastmode.diff>
On May 24, 2017, at 12:57 PM, Kristof Beyls <kristo...@arm.com> wrote:
On 24 May 2017, at 19:31, Quentin Colombet <qcol...@apple.com> wrote:Hi Kristof,Thanks for the measurements.On May 24, 2017, at 6:00 AM, Kristof Beyls <kristo...@arm.com> wrote:- Comparing against -O0 without globalisel but with the above regalloc options: 5.6% performance drop, 1% code size drop.
In summary, the measurements indicate some good improvements.
I also haven't measure the impact on compile time.Do you have a mean to make this measurement?Ahmed did a bunch of compile time measurements on our side and I wanted to see if I need to put him on the hook again :).I did a quick setup with CTMark (part of the test-suite). I ran each of* '-O0 -g',* '-O0 -g -mllvm -global-isel=true -mllvm -global-isel-abort=0', and* '-O0 -g -mllvm -global-isel=true -mllvm -global-isel-abort=0 -mllvm -optimize-regalloc -mllvm -regalloc=greedy'5 times, cross-compiling from X86 to AArch64, and took the median measured compile times.In summary, I see GlobalISel having a compile time that's 3.5% higher than the current -O0 default.With enabling the greedy register allocator, this increases to 28%.28% is probably too high?I think it is yes.I have attached a quick hack to the greedy allocator to feature a fast mode.Could you give it a try?To enable the fast mode, please use (-mllvm) -regalloc-greedy-fast=true (default is false).I'm afraid it doesn't seem to save much compile time. On geomean, I see about 26% compile time increase against the current -O0 default (compared to 28% increase for regalloc greedy without your patch).
Do you know where the time is spent (-time-passes)?
Anyhow, fixing all of those, although this is I think the right approach, will take time, so we can go with the localizer.
Do you know where the time is spent (-time-passes)?I'm afraid I won't have time to have a closer look in the next couple of days - I don't know where the time is spent at the moment.
Anyhow, fixing all of those, although this is I think the right approach, will take time, so we can go with the localizer.Right, I don't understand the register allocator well enough to know if that compile time overhead can be fixed, while still getting the necessary codegen benefits the greedy allocator gives.Is there any specific help you're looking for with getting the localizer work well enough for production use?
Thanks,Kristof
Cheers,
Diana
<localizer_tentative_fix.diff>
<localizer-mo-order.mir>
On 31 May 2017, at 15:33, Diana Picus via llvm-dev <llvm...@lists.llvm.org> wrote:
Cool test :)It seems to work fine now, I don't see any new failures. IIUC, Kristof is also giving it another run.
Cheers,Diana
On May 31, 2017, at 7:45 AM, Kristof Beyls <kristo...@arm.com> wrote:
On 31 May 2017, at 15:33, Diana Picus via llvm-dev <llvm...@lists.llvm.org> wrote:Cool test :)It seems to work fine now, I don't see any new failures. IIUC, Kristof is also giving it another run.Cheers,DianaLatest comparisons on my side, after picking up r304244, i.e. the correct Localizer pass.* CTMark compile time, comparing "-O0 -g" vs '-O0 -g -mllvm -global-isel=true -mllvm -global-isel-abort=0': about 6% increase with globalisel. This was about 3.5% before the Localizer pass landed.
* My usual performance benchmarking run: 8.5% slow-down. This was about 9.5% before the Localizer pass landed, so a slight improvement.* Code size: 3.14% larger. This was about 2.8% before the Localizer pass landed, so a slight regression.
* Debug info quality: I didn't do another recheck, trusting that the Localizer pass wouldn't change debug info quality.* Stack size usage: I don't know of a good way to measure this, but Diana's experiments show that at least for bootstrapping it went from "problematically bad" to "OK".Thanks,
On 31 May 2017, at 17:07, Quentin Colombet <qcol...@apple.com> wrote:
Latest comparisons on my side, after picking up r304244, i.e. the correct Localizer pass.* CTMark compile time, comparing "-O0 -g" vs '-O0 -g -mllvm -global-isel=true -mllvm -global-isel-abort=0': about 6% increase with globalisel. This was about 3.5% before the Localizer pass landed.
That one is surprising too. I wouldn’t have expected this pass to show up in the compile time profile. At least not to this extend.What is the biggest offender?
* My usual performance benchmarking run: 8.5% slow-down. This was about 9.5% before the Localizer pass landed, so a slight improvement.* Code size: 3.14% larger. This was about 2.8% before the Localizer pass landed, so a slight regression.
That one is surprising. Do you have an idea of what is happening?Alternatively if you can point me to the biggest offender, I can have a look.
O0 -g | O0 -g gisel-with-localizer | O0 -g gisel-without-localizer | ||
SingleSource/Benchmarks/Misc/perlin | 14272 | 14640 | 18344 | 25.95% |
SingleSource/Benchmarks/Dhrystone/dry | 16560 | 17144 | 20160 | 18.21% |
SingleSource/Benchmarks/Stanford/QueensProfile | 13912 | 14192 | 15136 | 6.79% |
MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url | 71400 | 72272 | 75504 | 4.53% |
Hi all,I added a buildbot [1] running the test-suite with -O0 -global-isel. It runs into the same 2 timeouts that I reported previously on this thread (paq8p and scimark2). It would be nice to make it green before flipping the switch.
On Jun 14, 2017, at 7:27 AM, Diana Picus <diana...@linaro.org> wrote:On 12 June 2017 at 18:54, Diana Picus <diana...@linaro.org> wrote:Hi all,I added a buildbot [1] running the test-suite with -O0 -global-isel. It runs into the same 2 timeouts that I reported previously on this thread (paq8p and scimark2). It would be nice to make it green before flipping the switch.I did some more investigations on a machine similar to the one running the buildbot. For paq8p and scimark2, I get these results for O0:PAQ8p:Fast isel: 666.344Global isel: 731.384SciMark2-C:Fast isel: 463.908Global isel: 496.22The current timeout is 500s (so in this particular case we didn't hit it for scimark2, and it ran successfully to completion). I don't think the difference between FastISel and GlobalISel is too atrocious, so I would propose increasing the timeout for these 2 benchmarks. I'm not sure if we can do this on a per-bot basis, but I see some precedent for setting custom timeout thresholds for various benchmarks on different architectures (sometimes with comments that it's done so we can run O0 on that particular benchmark).Something along these lines works:What do you guys think about this approach?
Hi all,We had some internal discussions about flipping the default for O0 and we concluded that we wanted to postpone it.*** Why Is That? ***We don’t want to send the wrong message that GlobalISel’s design is set in stone and ready for broader adoption.In particular,1. The APIs are still evolving and can still possibly change significantly2. The TableGen backend to reuse the existing SD patterns is still at its early stage3. We want to investigate closely the performance of global-isel (compile-time, runtime, code size, fallbacks)The rationale behind those items is that we want to minimize the pain of moving forward for everybody. We also want the out-of-the-box experience to be pleasant (like all/most of the tablegen patterns just work, we have documentation on how to target a new backend, etc.) Finally, we want to gain confidence we are going to be able to address the performance issues we have with the current design and if not, derive a plan for that.We purposely left out of the conversation what will be the right time and requirements to flip the switch. We want to gather more data first. Your help would be appreciated!*** Short-Term Proposal ***What we would like to do instead short-term is:A. Repurpose or create an option “-aarch64-enable-global-isel-at-O” to enable GISel with fallbacks and warnings enables (i.e., equivalent of -global-isel -global-isel-abort=2)B. Advertise this option in the next open source release to allow compiler enthusiastic to try it and report problemsC. Have GISel always built so we can push thing in the right place, MachineVerifier in mind, and stop doing some weird gymnasticWhat do people think?
On Jun 16, 2017, at 4:58 PM, Eric Christopher <echr...@gmail.com> wrote:On Fri, Jun 16, 2017 at 4:43 PM Quentin Colombet <qcol...@apple.com> wrote:Hi all,We had some internal discussions about flipping the default for O0 and we concluded that we wanted to postpone it.*** Why Is That? ***We don’t want to send the wrong message that GlobalISel’s design is set in stone and ready for broader adoption.In particular,1. The APIs are still evolving and can still possibly change significantly2. The TableGen backend to reuse the existing SD patterns is still at its early stage3. We want to investigate closely the performance of global-isel (compile-time, runtime, code size, fallbacks)The rationale behind those items is that we want to minimize the pain of moving forward for everybody. We also want the out-of-the-box experience to be pleasant (like all/most of the tablegen patterns just work, we have documentation on how to target a new backend, etc.) Finally, we want to gain confidence we are going to be able to address the performance issues we have with the current design and if not, derive a plan for that.We purposely left out of the conversation what will be the right time and requirements to flip the switch. We want to gather more data first. Your help would be appreciated!*** Short-Term Proposal ***What we would like to do instead short-term is:A. Repurpose or create an option “-aarch64-enable-global-isel-at-O” to enable GISel with fallbacks and warnings enables (i.e., equivalent of -global-isel -global-isel-abort=2)B. Advertise this option in the next open source release to allow compiler enthusiastic to try it and report problemsC. Have GISel always built so we can push thing in the right place, MachineVerifier in mind, and stop doing some weird gymnasticWhat do people think?How about -fexperimental-global-isel as a flag to clang?
On Jun 16, 2017, at 4:58 PM, Eric Christopher <echr...@gmail.com> wrote:On Fri, Jun 16, 2017 at 4:43 PM Quentin Colombet <qcol...@apple.com> wrote:Hi all,We had some internal discussions about flipping the default for O0 and we concluded that we wanted to postpone it.*** Why Is That? ***We don’t want to send the wrong message that GlobalISel’s design is set in stone and ready for broader adoption.In particular,1. The APIs are still evolving and can still possibly change significantly2. The TableGen backend to reuse the existing SD patterns is still at its early stage3. We want to investigate closely the performance of global-isel (compile-time, runtime, code size, fallbacks)The rationale behind those items is that we want to minimize the pain of moving forward for everybody. We also want the out-of-the-box experience to be pleasant (like all/most of the tablegen patterns just work, we have documentation on how to target a new backend, etc.) Finally, we want to gain confidence we are going to be able to address the performance issues we have with the current design and if not, derive a plan for that.We purposely left out of the conversation what will be the right time and requirements to flip the switch. We want to gather more data first. Your help would be appreciated!*** Short-Term Proposal ***What we would like to do instead short-term is:A. Repurpose or create an option “-aarch64-enable-global-isel-at-O” to enable GISel with fallbacks and warnings enables (i.e., equivalent of -global-isel -global-isel-abort=2)B. Advertise this option in the next open source release to allow compiler enthusiastic to try it and report problemsC. Have GISel always built so we can push thing in the right place, MachineVerifier in mind, and stop doing some weird gymnasticWhat do people think?How about -fexperimental-global-isel as a flag to clang?I thought about that and that would work, but I thought we had an implicit contract that clang options are not going away.If that’s not the case, then, yes, we should do that!
On Jun 16, 2017, at 5:17 PM, Eric Christopher <echr...@gmail.com> wrote:On Fri, Jun 16, 2017 at 5:11 PM Quentin Colombet <qcol...@apple.com> wrote:On Jun 16, 2017, at 4:58 PM, Eric Christopher <echr...@gmail.com> wrote:On Fri, Jun 16, 2017 at 4:43 PM Quentin Colombet <qcol...@apple.com> wrote:Hi all,We had some internal discussions about flipping the default for O0 and we concluded that we wanted to postpone it.*** Why Is That? ***We don’t want to send the wrong message that GlobalISel’s design is set in stone and ready for broader adoption.In particular,1. The APIs are still evolving and can still possibly change significantly2. The TableGen backend to reuse the existing SD patterns is still at its early stage3. We want to investigate closely the performance of global-isel (compile-time, runtime, code size, fallbacks)The rationale behind those items is that we want to minimize the pain of moving forward for everybody. We also want the out-of-the-box experience to be pleasant (like all/most of the tablegen patterns just work, we have documentation on how to target a new backend, etc.) Finally, we want to gain confidence we are going to be able to address the performance issues we have with the current design and if not, derive a plan for that.We purposely left out of the conversation what will be the right time and requirements to flip the switch. We want to gather more data first. Your help would be appreciated!*** Short-Term Proposal ***What we would like to do instead short-term is:A. Repurpose or create an option “-aarch64-enable-global-isel-at-O” to enable GISel with fallbacks and warnings enables (i.e., equivalent of -global-isel -global-isel-abort=2)B. Advertise this option in the next open source release to allow compiler enthusiastic to try it and report problemsC. Have GISel always built so we can push thing in the right place, MachineVerifier in mind, and stop doing some weird gymnasticWhat do people think?How about -fexperimental-global-isel as a flag to clang?I thought about that and that would work, but I thought we had an implicit contract that clang options are not going away.If that’s not the case, then, yes, we should do that!*shrug* we have one for the new pass manager to make testing easier, this seems reasonable as well for something that's going to be tested over an extended period of time.
While we're at it, these patches could use some review:
* [GlobalISel] combine not symmetric merge/unmerge nodes [1]
* [GlobalISel] Make multi-step legalization work [2]
* [GlobalISel] Enable specifying how to legalize non-power-of-2 size
types. [NFC-ish] [3]
Note that [2] is a spin-off from [3], which has been in review for a
long, long time.
[1] https://reviews.llvm.org/D33626
[2] https://reviews.llvm.org/D32529
[3] https://reviews.llvm.org/D30529
Cheers,
Diana
On 19 Jun 2017, at 10:43, Diana Picus via llvm-dev <llvm...@lists.llvm.org> wrote:
On 17 June 2017 at 01:43, Quentin Colombet <qcol...@apple.com> wrote:
*** Your Help Is Needed ***
- Please share your experience in using the GISel APIs and how we can make
them better. Moving forward we’ll have those conversations on open source
instead of internally/with a narrower audience.
- Report any performance problem you identify
- Propose patches!
While we're at it, these patches could use some review:
* [GlobalISel] combine not symmetric merge/unmerge nodes [1]
* [GlobalISel] Make multi-step legalization work [2] (https://reviews.llvm.org/D32529)
* [GlobalISel] Enable specifying how to legalize non-power-of-2 size
types. [NFC-ish] [3] (https://reviews.llvm.org/D30529)
FYI, this is now live on the public master:
http://lab.llvm.org:8011/builders/clang-cmake-aarch64-global-isel
I hope people will find it useful.
On Jul 3, 2017, at 5:08 AM, Diana Picus <diana...@linaro.org> wrote:On 12 June 2017 at 18:54, Diana Picus <diana...@linaro.org> wrote:
Hi all,
I added a buildbot [1] running the test-suite with -O0 -global-isel. It runs into the same 2 timeouts that I reported previously on this thread (paq8p and scimark2). It would be nice to make it green before flipping the switch.
At the moment, it lives in an internal buildmaster that I've setup for this purpose. If we fix it and it proves to be stable for a week or two, I'll move it to the public master.
FYI, this is now live on the public master:
http://lab.llvm.org:8011/builders/clang-cmake-aarch64-global-isel
I hope people will find it useful.
On 2017-11-07 19:42, Quentin Colombet via llvm-dev wrote:
> Hi all,
>
> I’d like to resurrect this thread and ask if people are on board for
> enabling this by default for AArch64 O0.
>
> *** What Changed Since June? ***
>
> - We added a way to describe the legalization actions for
> non-power-of-2
> - We gave a tutorial that covers the best practices to target
> GlobalISel
> - We improved the TableGen backend to reuse existing SDISel patterns
> - We built and ran huge internal software with GISel
> - We evaluated the performance of GISel and are confident things are
> in a good shape (with https://reviews.llvm.org/D39034) and moving
> forward would look even better (see the last LLVM Dev talk:
> _GlobalISel: Present, Past, and Future_ when it is available)
>> https://reviews.llvm.org/differential/diff/102547/ [1]
>>
>> What do you guys think about this approach?
>
> Looks reasonable to me.
>
>> Thanks,
>> Diana
>>
>> PS: The buildbot is using the Makefiles because that's what our
>> other AArch64 test-suite bots use. Moving all of them to CMake is a
>> transition for another time.
>
>
>
> Links:
> ------
> [1] https://reviews.llvm.org/differential/diff/102547/
On Nov 13, 2017, at 9:10 AM, Kristof Beyls <Kristo...@arm.com> wrote:Hi Quentin,
My only remaining concern is around ABI compatibility.The following commit seems to indicate that in the previous round of evaluation, we didn’t find an existing ABI compatibility issue:I haven’t looked into the details of this issue - so maybe I’m worried over nothing?
I’m wondering if since then on your side you did any testing around ABI compatibility?E.g. building software where you semi-randomly build some functions through GlobalISel and some functions through DAGISel?
Hi Quentin,
I’ve started running an ABI test suite with global isel on AArch64, and while it hasn’t found any ABI issues it has hit an assertion in clang when using the __fp16 type. Here’s a reproducer:
__fp16 pass_f16(__fp16 p) {
return p;
}
$ /work/llvm/build/bin/clang --target=aarch64-arm-none-eabi -march=armv8-a -c test.c -O0 -mllvm -global-isel -mllvm -global-isel-abort=0
clang-6.0: /work/llvm/llvm/lib/CodeGen/GlobalISel/RegisterBankInfo.cpp:446: static void llvm::RegisterBankInfo::applyDefaultMapping(const llvm::RegisterBankInfo::OperandsMapper &): Assertion `OrigTy.getSizeInBits() == NewTy.getSizeInBits() && "Types with difference size cannot be handled by the default " "mapping"' failed.
#0 0x000000000362a764 PrintStackTraceSignalHandler(void*) (/work/llvm/build/bin/clang-6.0+0x362a764)
#1 0x000000000362aac6 SignalHandler(int) (/work/llvm/build/bin/clang-6.0+0x362aac6)
#2 0x00007f9193b78330 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x10330)
#3 0x00007f919276bc37 gsignal /build/eglibc-oGUzwX/eglibc-2.19/signal/../nptl/sysdeps/unix/sysv/linux/raise.c:56:0
#4 0x00007f919276f028 abort /build/eglibc-oGUzwX/eglibc-2.19/stdlib/abort.c:91:0
#5 0x00007f9192764bf6 __assert_fail_base /build/eglibc-oGUzwX/eglibc-2.19/assert/assert.c:92:0
#6 0x00007f9192764ca2 (/lib/x86_64-linux-gnu/libc.so.6+0x2fca2)
#7 0x0000000003d70eb9 (/work/llvm/build/bin/clang-6.0+0x3d70eb9)
#8 0x0000000003d6b00c llvm::RegBankSelect::applyMapping(llvm::MachineInstr&, llvm::RegisterBankInfo::InstructionMapping const&, llvm::SmallVectorImpl<llvm::RegBankSelect::RepairingPlacement>&) (/work/llvm/build/bin/clang-6.0+0x3d6b00c)
#9 0x0000000003d6b366 llvm::RegBankSelect::assignInstr(llvm::MachineInstr&) (/work/llvm/build/bin/clang-6.0+0x3d6b366)
#10 0x0000000003d6b7f1 llvm::RegBankSelect::runOnMachineFunction(llvm::MachineFunction&) (/work/llvm/build/bin/clang-6.0+0x3d6b7f1)
#11 0x0000000002d934c8 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/work/llvm/build/bin/clang-6.0+0x2d934c8)
#12 0x00000000030c998f llvm::FPPassManager::runOnFunction(llvm::Function&) (/work/llvm/build/bin/clang-6.0+0x30c998f)
#13 0x00000000030c9c53 llvm::FPPassManager::runOnModule(llvm::Module&) (/work/llvm/build/bin/clang-6.0+0x30c9c53)
#14 0x00000000030ca136 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/work/llvm/build/bin/clang-6.0+0x30ca136)
#15 0x00000000037c3dcf clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) (/work/llvm/build/bin/clang-6.0+0x37c3dcf)
#16 0x0000000003d421a0 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/work/llvm/build/bin/clang-6.0+0x3d421a0)
#17 0x0000000004457376 clang::ParseAST(clang::Sema&, bool, bool) (/work/llvm/build/bin/clang-6.0+0x4457376)
#18 0x0000000003ca6ea0 clang::FrontendAction::Execute() (/work/llvm/build/bin/clang-6.0+0x3ca6ea0)
#19 0x0000000003c1fa31 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/work/llvm/build/bin/clang-6.0+0x3c1fa31)
#20 0x0000000003d3bf4b clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/work/llvm/build/bin/clang-6.0+0x3d3bf4b)
#21 0x0000000001f85629 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/work/llvm/build/bin/clang-6.0+0x1f85629)
#22 0x0000000001f83096 main (/work/llvm/build/bin/clang-6.0+0x1f83096)
#23 0x00007f9192756f45 __libc_start_main /build/eglibc-oGUzwX/eglibc-2.19/csu/libc-start.c:321:0
#24 0x0000000001f80029 _start (/work/llvm/build/bin/clang-6.0+0x1f80029)
Stack dump:
0. Program arguments: /work/llvm/build/bin/clang-6.0 -cc1 -triple aarch64-arm-none-eabi -emit-obj -mrelax-all -disable-free -main-file-name test.c -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -fuse-init-array -target-cpu generic -target-feature +neon -target-abi aapcs -fallow-half-arguments-and-returns -dwarf-column-info -debugger-tuning=gdb -coverage-notes-file /work/innovation/cctest/test.gcno -resource-dir /work/llvm/build/lib/clang/6.0.0 -O0 -fdebug-compilation-dir /work/innovation/cctest -ferror-limit 19 -fmessage-length 226 -fno-signed-char -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -mllvm -global-isel -mllvm -global-isel-abort=0 -o test.o -x c test.c
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module 'test.c'.
4. Running pass 'RegBankSelect' on function '@pass_f16'
clang-6.0: error: unable to execute command: Aborted (core dumped)
clang-6.0: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 6.0.0 (ssh://olis...@ds-gerrit.euhpc.arm.com:29418/armcompiler/clang aa2b9952ef98a5fe2d47384ef17106855b8bae51) (ssh://olis...@ds-gerrit.euhpc.arm.com:29418/armcompiler/llvm 29f89772107a79b5f2a816d4748ed9c19416c1b6)
Target: aarch64-arm-none-eabi
Thread model: posix
InstalledDir: /work/llvm/build/bin
clang-6.0: note: diagnostic msg: PLEASE submit a bug report to http://llvm.org/bugs/ and include the crash backtrace, preprocessed source, and associated run script.
clang-6.0: note: diagnostic msg:
********************
PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang-6.0: note: diagnostic msg: /tmp/test-e06964.c
clang-6.0: note: diagnostic msg: /tmp/test-e06964.sh
clang-6.0: note: diagnostic msg:
********************
Oliver
On 13 Nov 2017, at 18:26, Quentin Colombet <qcol...@apple.com> wrote:
Hi Kristof,
On Nov 13, 2017, at 9:10 AM, Kristof Beyls <Kristo...@arm.com> wrote:
Hi Quentin,
My only remaining concern is around ABI compatibility.The following commit seems to indicate that in the previous round of evaluation, we didn’t find an existing ABI compatibility issue:I haven’t looked into the details of this issue - so maybe I’m worried over nothing?
No, you’re right. The problem with ABI is if you are consistently wrong, then you won’t notice :).
I’m wondering if since then on your side you did any testing around ABI compatibility?E.g. building software where you semi-randomly build some functions through GlobalISel and some functions through DAGISel?
Justin will look into that. Clang has utility script for that utils/ABITest.
Given we will only be able to check iOS ABI, you may want to follow the same kind of validation on your side.
I let you sync up with Justin for the method.
A bit light is putting it lightly ;)
> I’m happy to try and run it on linux if you can share some info on the best
> way to run that test, Justin?
I played around with utils/ABITest a bit and I managed to get it to run
with some minor changes, though it is not at all obvious how this thing
works.
The main thing is to change from trying to compare gcc to clang to
compare clang and clang -mllvm -global-isel instead:
diff --git a/utils/ABITest/Makefile.test.common b/utils/ABITest/Makefile.test.common
index 3c208adf0c..419d7cfab1 100644
--- a/utils/ABITest/Makefile.test.common
+++ b/utils/ABITest/Makefile.test.common
@@ -11,13 +11,14 @@ TESTARGS := --no-unsigned --no-vector --no-complex --no-bool
COUNT := 1
TIMEOUT := 5
-CFLAGS := -std=gnu99
+CFLAGS := -std=gnu99 -target arm64-apple-ios
-X_COMPILER := gcc
-X_LL_CFLAGS := -emit-llvm -S
-Y_COMPILER := clang
-Y_LL_CFLAGS := -emit-llvm -S
-CC := gcc
+CC := ${HOME}/build/llvm/bin/clang
+X_COMPILER := ${CC}
+Y_COMPILER := ${CC}
+
+X_CFLAGS :=
+Y_CFLAGS := -mllvm -global-isel
###
I also made a change to the "temps/test.%.out" target to copy the binary
onto a device before trying to run it, since I was cross compiling, and
removed the 32-bit cases from build-and-summarize-all.sh since we're
only dealing with AArch64 here.
To see if it works, `cd` into one of the test directories (like
single-args-64) and run `make VERBOSE=1 test.0.report`. If it finds a
bug make will fail with something like "TEST 0: temps/test.0.yy.diff
failed".
Then, once things work, you can run ./build-and-summarize-all.sh 100
(or any number you like) from the ABITest directory and it will tell
you if it finds anything.
Hi Quentin,
At Kristof’s suggestion, I tried running our ABI test suite for a big-endian AArch64 target, and this found an ABI mismatch between global-isel and regular -O0. Here’s a reproducer for the first one I’ve investigated:
struct foo {
float first;
float second;
};
float get_first(foo p) {
return p.first;
}
This is the code that global-isel currently generates:
// /work/llvm/build/bin/clang --target=aarch64--none-eabi -march=armv8-a -c callees.cpp -O0 -mllvm -global-isel=true -mllvm -global-isel-abort=0 -mbig-endian -o - -S
_Z9get_first3foo: // @_Z9get_first3foo
// BB#0: // %entry
sub sp, sp, #16 // =16
// implicit-def: %X8
fmov w9, s0
mov w10, w9
bfxil x8, x10, #0, #32
fmov w9, s1
mov w10, w9
bfi x8, x10, #32, #32
add x10, sp, #8 // =8
str x8, [sp, #8]
ldr w9, [x10]
fmov s0, w9
add sp, sp, #16 // =16
ret
When run on a big-endian target, this incorrectly returns the second member of the struct, instead of the first.
Oliver
Hi Quentin,
It seems that we also get the calling convention wrong for vector types on big-endian:
#include <arm_neon.h>
int32x2_t load_vector(int32x2_t *p) {
return *p;
}
Global-isel generates this:
// armclang --target=aarch64-arm-none-eabi -march=armv8-a -c callees.cpp -O0 -Wall -std=c++11 -mllvm -global-isel=true -mllvm -global-isel-abort=0 -mbig-endian -o - -S
_Z11load_vectorP11__Int32x2_t: // @_Z11load_vectorP11__Int32x2_t
// BB#0: // %entry
sub sp, sp, #16 // =16
str x0, [sp, #8]
ldr x0, [sp, #8]
ld1 { v0.2s }, [x0]
add sp, sp, #16 // =16
ret
With global-isel off, there is a rev64 instruction between the ld1 and the add, which fixes up the endianness of the vector.
Oliver
Hi Quentin,
One more reproducer, this time with small (<64bit) values being passed on the stack:
int foo(int x0, int x1, int x2, int x3, int x4, int x5, int x6, int x7,
int stack1) {
return stack1;
}
int main() {
int ret = foo(0,1,2,3,4,5,6,7,8);
printf("%d\n", ret);
}
Global isel thinks that the incoming value of stack1 is stored in bytes [0,4) above SP, but for big-endian targets this should be in bytes [4,8):
// /work/llvm/build/bin/clang --target=aarch64-arm-none-eabi -march=armv8-a -c callees.cpp -O0 -Wall -std=c++11 -mllvm -global-isel=true -mllvm -global-isel-abort=0 -mbig-endian -o - -S
_Z3fooiiiiiiiii: // @_Z3fooiiiiiiiii
// BB#0: // %entry
sub sp, sp, #48 // =48
ldr w8, [sp, #48] // <= Should be [sp, #52]
str w0, [sp, #44]
str w1, [sp, #40]
str w2, [sp, #36]
str w3, [sp, #32]
str w4, [sp, #28]
str w5, [sp, #24]
str w6, [sp, #20]
str w7, [sp, #16]
str w8, [sp, #12]
ldr w0, [sp, #12]
add sp, sp, #48 // =48
ret
Oliver
On 17 Nov 2017, at 10:34, Daniel Sanders <daniel_l...@apple.com> wrote:Does the MIR for this one have a G_BITCAST from one vector type to another? It sounds like we haven't implemented the fact that bitcasts are sometimes shuffles on big-endian.
Looping back here, I generated and ran 1000000 test cases with the
ABITest utility and hit some fallbacks and an "invalid zero-sized type"
assert that need looking at, but it didn't uncover any ABI issues.
Hi Quentin,
I’ve raised:
https://bugs.llvm.org/show_bug.cgi?id=35359
https://bugs.llvm.org/show_bug.cgi?id=35360
https://bugs.llvm.org/show_bug.cgi?id=35361
I also left the test suite running over the weekend in little-endian mode with the __fp16 type disabled, and it didn’t find any bugs there.
We’ve had some more internal discussion and we think that we can, and should, still aim to enable this by default before the release. We have a month of testing time available to shake out critical issues which is normally enough. After the release branch is taken we can also merge in any fixes required during the stabilization period.
Thanks,
Amara
On Dec 18, 2017, at 12:37 PM, Kristof Beyls <Kristo...@arm.com> wrote:Hi Amara,
My reasons for preferring the switch to happen after the release branch is based on the following observations:
- As far as I can see, the projects and products following top-of-trunk tend to test much more extensively than the testing that happens for llvm.org releases. I expect that the llvm.org release testing won’t find potential remaining issues, whereas the projects and products following trunk are far more likely to find remaining issues with GlobalISel. Therefore, I think it’s best to give as much time as possible for these projects/products following top-of-trunk to find issues before there’s an llvm.org release with GlobalISel enabled-by-default.
- For the projects and products that follow top-of-trunk that I know off, if globalisel-by-default would really break things, it’s possible to disable it within that project/product without end users of it needing to know about this. If for the llvm.org release it would turn out that globalisel has a big issue after release, we’d need to somehow let all users of the release know to disable it by adding a convoluted command line option (‘-mllvm -global-isel=0’?)
Combining the 2 observations above, I think it’s better to do the switch shortly after a release branch is taken, rather than just before it.
On 18 Dec 2017, at 15:11, Amara Emerson <aeme...@apple.com> wrote:
I wasn’t suggesting that we proceed with the full release with GISel enabled by default if there are serious issues, I was more saying that we have additional time during the RC period to test further on trunk, and then merge in fixes into the RC. Worst case is that we back out the change completely but I doubt it’ll get to that stage. I do agree that trunk sees more of the testing coverage. If one month isn’t enough, what would have been the cut off point?On Dec 18, 2017, at 12:37 PM, Kristof Beyls <Kristo...@arm.com> wrote:
Hi Amara,
My reasons for preferring the switch to happen after the release branch is based on the following observations:
- As far as I can see, the projects and products following top-of-trunk tend to test much more extensively than the testing that happens for llvm.org releases. I expect that the llvm.org release testing won’t find potential remaining issues, whereas the projects and products following trunk are far more likely to find remaining issues with GlobalISel. Therefore, I think it’s best to give as much time as possible for these projects/products following top-of-trunk to find issues before there’s an llvm.org release with GlobalISel enabled-by-default.
- For the projects and products that follow top-of-trunk that I know off, if globalisel-by-default would really break things, it’s possible to disable it within that project/product without end users of it needing to know about this. If for the llvm.org release it would turn out that globalisel has a big issue after release, we’d need to somehow let all users of the release know to disable it by adding a convoluted command line option (‘-mllvm -global-isel=0’?)
Combining the 2 observations above, I think it’s better to do the switch shortly after a release branch is taken, rather than just before it.
Thanks,Amara