[LLVMdev] Enabling the vectorizer for -Os -- ping

Nadav Rotem

Jun 14, 2013, 12:37:58 AM
to Dev
Hi,

Last week I wrote to llvm-dev and presented data showing that enabling the vectorizer at -Os can improve the performance of many workloads with negligible effect on code size. I also added a command-line switch to make it easier for people to benchmark the vectorizer at -Os directly from clang without changing LLVM. Has anyone run any benchmarks on -Os + vectorization?

Thanks,
Nadav
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Renato Golin

Jun 14, 2013, 4:29:16 AM
to Nadav Rotem, Dev
On 14 June 2013 05:37, Nadav Rotem <nro...@apple.com> wrote:
Last week I wrote to llvm-dev and presented data showing that enabling the vectorizer at -Os can improve the performance of many workloads with negligible effect on code size. I also added a command-line switch to make it easier for people to benchmark the vectorizer at -Os directly from clang without changing LLVM. Has anyone run any benchmarks on -Os + vectorization?

Hi Nadav,

I haven't, sorry. I'll run some tests on my Chromebook and will let you know.

cheers,
--renato

Renato Golin

Jun 14, 2013, 10:00:25 AM
to Nadav Rotem, Dev
Hi Nadav,

No noticeable difference between "-Os" and "-Os -fvectorize" in code size or compilation times in my tests, and only minimal performance improvements (small enough to be ignored).
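
For anyone who wants to repeat this kind of comparison, here is a minimal sketch of a loop the loop vectorizer would typically consider: unit stride and no loop-carried dependence. The function name `saxpy` is hypothetical and is not taken from the tests above; whether the loop is actually vectorized depends on the target and on the vectorizer's cost model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel (not from the thread): y[i] += a * x[i].
// Each iteration is independent of the others, which is the shape
// a loop vectorizer looks for.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xs = x.data();
    float* ys = y.data();
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        ys[i] += a * xs[i];
    }
}
```

Compiling the same file once with "-Os" and once with "-Os -fvectorize", then comparing object sizes and run times, gives the kind of comparison described above.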

cheers,
--renato

Nadav Rotem

Jun 14, 2013, 12:13:33 PM
to Renato Golin, Dev
Excellent. Thanks! 

Chandler Carruth

Jun 14, 2013, 2:53:34 PM
to Nadav Rotem, LLVM Developers Mailing List

Sorry for the delays here. I am running our benchmark suite and will have data in a day or so.

Chandler Carruth

Jun 16, 2013, 12:10:53 AM
to Nadav Rotem, LLVM Developers Mailing List
All I have to say is *wow*. The vectorizer performs *remarkably* better now than it did the last time I benchmarked it. I'm stunned.

I measured -O2 and -Os, as well as -march=x86-64 and -march=corei7-avx. My hope with the latter two was to cover both worst-case and best-case in terms of the quality of the vector ISA available.

First, binary size growth. This is measured on average across a reasonably wide selection of binaries including large servers, video codecs, image processing, etc.

O2, x86-64: 1% larger w/ vectorizer
O2, corei7-avx: 1.2% larger
Os, x86-64: 0.1% larger
Os, corei7-avx: < 0.1% larger

This is incredibly impressive IMO. =]

The performance numbers are also pretty good. There are a couple of minor regressions, only one significant one. That one happens to be in open-source code, snappy (https://code.google.com/p/snappy/source/browse/trunk/snappy.cc), which slows down because the vectorizer vectorizes a cold loop, which then gets inlined and blocks subsequent inlining. (Many thanks to Ben Kramer for pointing out the cause so quickly for me.) But there are a lot of potential solutions to this problem:

1) vectorize after inlining -- this has some problems (code growth mostly) but we might be able to solve them.
2) mark the cold path as cold so the optimizer is aware of it (I tested this and it seems to work, but I'm still experimenting)
3) rewrite this part of snappy to be fundamentally better (the code as it is doesn't make a lot of sense to me, but I'm not an expert on it and will need time to figure out the best way to solve the issue)

I'm actually happy with any of the 3, although #2 isn't terribly satisfying. But even if that's the result, I can live with it.
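
A minimal sketch of what option 2 could look like, assuming the GCC/Clang `cold` function attribute and `__builtin_expect`. The function names and the 16-byte threshold are hypothetical; this is not snappy's code and not necessarily the exact change that was tested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical rarely-taken path. The `cold` attribute hints that calls
// to this function are unlikely, so the optimizer can favor size over
// speed here and be less eager to inline it into hot callers.
__attribute__((cold))
void copy_long_slow(const char* src, char* dst, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        dst[i] = src[i];
    }
}

// Hypothetical hot entry point. __builtin_expect marks the short-copy
// branch as the expected one. src and dst must not overlap.
inline void copy_bytes(const char* src, char* dst, std::size_t len) {
    assert(src != nullptr && dst != nullptr);
    if (__builtin_expect(len <= 16, 1)) {
        std::memcpy(dst, src, len);  // common, short case
        return;
    }
    copy_long_slow(src, dst, len);   // rare case, kept out of the hot path
}
```

The idea is only to give the optimizer the hotness information it lacks; how much this changes vectorization and inlining decisions depends on the compiler version.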

So essentially, I think you should turn the vectorizer on completely. What's left seems very much like small isolated issues.

Thanks for driving this whole thing and giving me time to do some evaluation. I'm really thrilled by the result.
-Chandler

Chandler Carruth

Jun 16, 2013, 1:14:38 AM
to Nadav Rotem, LLVM Developers Mailing List

On Sat, Jun 15, 2013 at 9:10 PM, Chandler Carruth <chan...@google.com> wrote:
The performance numbers are also pretty good. There are a couple of minor regressions, only one significant one.

Sorry, my email wasn't as clear as I intended.

I meant this sentence to indicate that I saw only one significant regression (the snappy one I discussed). Everything else was in the noise or got faster.

I did see some code known to vectorize well that sped up significantly, and some server code that slowed down by a small amount, but on average it was a wash. When snappy and some other things are fixed, it'll probably end up being a small performance win on average, and significant in those applications with hot vectorizable loops.

Xinliang David Li

Jun 16, 2013, 12:15:40 PM
to Nadav Rotem, Dev
One more data point for you: Intel's ICC turns on the loop vectorizer at -O2 and -Os too.

Cheers,

David

Renato Golin

Jun 17, 2013, 4:40:06 AM
to Chandler Carruth, LLVM Developers Mailing List
On 16 June 2013 05:10, Chandler Carruth <chan...@google.com> wrote:
So essentially, I think you should turn the vectorizer on completely. What's left seems very much like small isolated issues.

Thanks for driving this whole thing and giving me time to do some evaluation. I'm really thrilled by the result.

+1

--renato

Nadav Rotem

Jun 17, 2013, 12:14:02 PM
to Renato Golin, LLVM Developers Mailing List
Thanks for evaluating the vectorizer and giving me feedback. I will send an email to llvm-dev and enable it later today. 