All I have to say is *wow*. The vectorizer performs *remarkably* better now than it did the last time I benchmarked it. I'm stunned.
I measured -O2 and -Os, as well as -march=x86-64 and -march=corei7-avx. My hope with the latter two was to cover both worst-case and best-case in terms of the quality of the vector ISA available.
First, binary size growth. This is measured on average across a reasonably wide selection of binaries including large servers, video codecs, image processing, etc.
O2, x86-64: 1% larger w/ vectorizer
O2, corei7-avx: 1.2% larger
Os, x86-64: 0.1% larger
Os, corei7-avx: < 0.1% larger
This is incredibly impressive IMO. =]
The performance numbers are also pretty good. There are a couple of minor regressions and only one significant one. That one happens to be open source:
https://code.google.com/p/snappy/source/browse/trunk/snappy.cc slows down: the vectorizer vectorizes a cold loop, which then gets inlined and blocks subsequent inlining. (Many thanks to Ben Kramer for pointing out the cause so quickly for me.) There are several potential solutions to this problem:
1) vectorize after inlining -- this has some problems (code growth mostly) but we might be able to solve them.
2) mark the cold path as cold so the optimizer is aware of it (tested this, it seems to work, but I'm still experimenting)
3) rewrite this part of snappy to be fundamentally better (the code as it is doesn't make a lot of sense to me, but I'm not an expert on it and will need time to figure out the best way to solve the issue)
I'm actually happy with any of the 3, although #2 isn't terribly satisfying. But even if that's the result, I can live with it.
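To make option #2 concrete, here is a rough sketch of what marking a cold path could look like. This is not the actual snappy code and not the change I tested: the function names `slow_overlapping_copy` and `copy_bytes` are made up for illustration, and it assumes a GCC- or Clang-compatible compiler that supports `__attribute__((cold, noinline))` and `__builtin_expect`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical slow path (not snappy's real code): a forward byte-by-byte
// copy used only when the source and destination ranges overlap.
// `cold` hints that this function is rarely executed; `noinline` keeps its
// loop out of the hot caller, whatever the vectorizer does to it.
__attribute__((cold, noinline))
static void slow_overlapping_copy(const char* src, char* dst, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    dst[i] = src[i];
  }
}

// Hypothetical hot entry point: the common, non-overlapping case is a plain
// memcpy, and the branch to the slow path is marked as unlikely.
static inline void copy_bytes(const char* src, char* dst, std::size_t len) {
  // Compare addresses as integers so unrelated pointers can be ordered.
  const std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
  const std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
  const bool overlaps = d < s + len && s < d + len;
  if (__builtin_expect(overlaps, 0)) {
    slow_overlapping_copy(src, dst, len);
    return;
  }
  std::memcpy(dst, src, len);
}
```

The idea is only that the rarely taken loop lives in its own out-of-line function, so the hot caller stays small enough to remain a good inlining candidate. Whether this is enough for the snappy case is exactly what I'm still experimenting with.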
So essentially, I think you should turn the vectorizer on completely. What's left seems very much like small isolated issues.
Thanks for driving this whole thing and giving me time to do some evaluation. I'm really thrilled by the result.
-Chandler