Re: Vectorization control flow analysis different on Oz vs O2?

Takuto Ikuta

Aug 6, 2020, 5:11:34 AM
to Albert Wong (王重傑), Clang maintainers, build
+Clang maintainers 

Better to ask the clang maintainers about this?

On Thu, Aug 6, 2020 at 4:57 PM 'Albert Wong (王重傑)' via build <bu...@chromium.org> wrote:
Hi build@,

I was playing with vectorization of some of the Blink ASCII functions and ran the following sample code in https://godbolt.org/z/vojz6v to see how the vectorizer treated it:

#include <stddef.h>

static inline bool isUpperAscii(char ch) {
  return ch >= 'A' && ch <= 'Z';
}

bool CharacterProperties(const char* str, size_t length) {
  int x = 0;
  int has_upper = 0;
  #pragma clang loop vectorize(enable) interleave(enable)
  for (size_t i = 0; i < length; ++i) {
    x |= str[i] & 0xA0;
    has_upper |= isUpperAscii(str[i]);
  }
  return x;
}

On armv7-a with -O2, this produces vectorized code.  But if I run it with -Oz, then the loop vectorizer says the loop control flow is not understood.

I wouldn't have expected the optimizer profile to affect whether or not the loop vectorizer could analyze the control flow.  Am I missing something?

Thanks,
Albert

Matt Denton

Aug 6, 2020, 5:25:14 AM
to Takuto Ikuta, Albert Wong (王重傑), Clang maintainers, build
I’m very much not an expert, but many optimizations in clang rely on functions being inlined first (cross-function optimization isn't so good), so my guess would be that the compiler flag is preventing isUpperAscii() from being inlined, which means the vectorizer can't "understand" the control flow.

Hans Wennborg

Aug 6, 2020, 5:29:44 AM
to Matt Denton, Sjoerd Meijer, Hal Finkel, Takuto Ikuta, Albert Wong (王重傑), Clang maintainers, build
I think in the case of the godbolt example, isUpperAscii() does get
inlined. I think the problem is really just that llvm's vectorizer
doesn't run at -Oz, even when the pragma is explicitly asking for it,
which is confusing.

Maybe Hal or Sjoerd who are more familiar with this can elaborate.

Sjoerd Meijer

Aug 6, 2020, 6:39:49 AM
to Hans Wennborg, Matt Denton, Hal Finkel, Takuto Ikuta, Albert Wong (王重傑), Clang maintainers, build
I think vectorisation is disabled at -Oz:


...unless requested with a pragma. Requesting extra debug messages on the command line (by adding -mllvm -debug) shows the vectoriser running and dumping:

  "LV: Not vectorizing: The exiting block is not the loop latch."

which is a slightly more specific message about why the control flow is not understood. I think I have seen this before; it must be one of the prep passes that is not running under -Oz. I am guessing LoopRotate, but it could be something else, as I haven't looked into it.
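
If it is LoopRotate, then very roughly, and only as a source-level sketch of my guess (not the actual IR): loop rotation turns a top-tested loop into a guarded bottom-tested one, so that the latch block is also the exiting block, which is the shape the vectoriser expects:

#include <stddef.h>

// Before rotation: the loop exits from the header (the "i < length" test at
// the top), which is a different block from the latch that holds the back edge.
void Before(const char* str, size_t length, int& x) {
  for (size_t i = 0; i < length; ++i)
    x |= str[i] & 0xA0;
}

// After rotation: a guarded do-while, so the bottom test both takes the back
// edge and exits the loop, i.e. the exiting block is the loop latch.
void After(const char* str, size_t length, int& x) {
  if (length != 0) {
    size_t i = 0;
    do {
      x |= str[i] & 0xA0;
      ++i;
    } while (i < length);
  }
}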

The confusing thing is of course this:

  "warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering"

The problem is that the first part of this sentence is spot on; the problematic part is the second. I have also seen this before and had a discussion about it with Michael Kruse. Michael explained to me that there is apparently a good reason for it, but I can't reproduce his explanation accurately. Being very vague: it was something related to the pass manager collecting these remarks while the optimisation passes are not really added to the pipeline.

A high-level remark: because -Oz sacrifices everything for code size, I am not expecting much vectorisation, and I also expect the codegen to be quite poor.

Hope this helps a bit, cheers,
Sjoerd.

Nico Weber

Aug 6, 2020, 9:56:55 AM
to Sjoerd Meijer, Hans Wennborg, Matt Denton, Hal Finkel, Takuto Ikuta, Albert Wong (王重傑), Clang maintainers, build
From a chromium build setup perspective, code that's performance sensitive should be in a target with the optimize_max config applied (which makes it build with -O2). Trying to optimize code that builds with -Oz for performance is kind of inherently contradictory.

Victor Costan

Aug 6, 2020, 1:19:34 PM
to Nico Weber, Sjoerd Meijer, Hans Wennborg, Matt Denton, Hal Finkel, Takuto Ikuta, Albert Wong (王重傑), Clang maintainers, build
On Thu, Aug 6, 2020 at 6:57 AM Nico Weber <tha...@chromium.org> wrote:
From a chromium build setup perspective, code that's performance sensitive should be in a target with the optimize_max config applied (which makes it build with -O2). Trying to optimize code that builds with -Oz for performance is kind of inherently contradictory.

Related question: Do we have a doc for safety requirements around compiling code with different code generation options? Would there be interest in me starting one and collecting our experts' wisdom?

An example I consider very subtle is that the code in the optimize_max target cannot share templates with code outside the target. These templates include the standard library, for example, std::vector. This is needed to avoid ODR violations stemming from having the templates instantiated under different compilation options.

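To make the hazard concrete, here is a contrived sketch (entirely hypothetical code and file names, nothing from the tree):

// shared.h -- an inline template shared across targets.
#include <vector>

template <typename T>
int Sum(const std::vector<T>& v) {
  int total = 0;
  // The machine code generated for this body depends on the flags of the
  // translation unit that happens to instantiate it (-O2 vs -Oz, -mavx2 vs
  // not, ...).
  for (const T& t : v) total += t;
  return total;
}

// a.cc, built inside optimize_max (or with -mavx2):
//   int CallerA(const std::vector<int>& v) { return Sum(v); }
// b.cc, built outside it:
//   int CallerB(const std::vector<int>& v) { return Sum(v); }
//
// Both translation units emit a weak definition of Sum<int>, and the linker
// keeps exactly one of them. If it keeps the optimize_max/AVX2 copy, CallerB
// ends up running code generated under options its own file never asked for,
// which is exactly the ODR violation described above.
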
In case anyone is wondering what makes me think of such things :) ... I've looked at this issue recently in the context of adding fast code paths for modern CPUs (e.g. AVX). I wasn't able to find much documentation online, and I ended up relying on first principles and internal docs. The GCC docs skim the surface with "Applications which perform runtime CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options." This is a start, but it doesn't talk about ODR violations and ABIs.

Thank you,
    Victor

Bruce Dawson

Aug 6, 2020, 1:30:20 PM
to Victor Costan, Nico Weber, Sjoerd Meijer, Hans Wennborg, Matt Denton, Hal Finkel, Takuto Ikuta, Albert Wong (王重傑), Clang maintainers, build
Different optimization options (/O1 versus /O2) should be safe to mix, but different CPU targets (AVX, for instance) are not, or at least mixing them must be done with extreme care, because of the ODR risks you mention. I hit this in Chrome a few years ago and blogged about it:

Bruce Dawson

Albert Wong (王重傑)

Aug 6, 2020, 3:29:01 PM
to Nico Weber, Sjoerd Meijer, Hans Wennborg, Matt Denton, Hal Finkel, Takuto Ikuta, Clang maintainers, build
Thanks for all the responses!

So this ends up being a little odd from a coding perspective. For your average C++ programmer (including me), the instinct would be to pull out hot-path code into an inline function in a header that would "always be fast on all platforms". This specific example comes from such a pattern here.

Since this is inline header code, the utility-function author cannot really specify the optimization level. One can go to each translation unit calling the function and bump up its optimization level, but that becomes unscalable when there are a lot of users. This incentivizes manual vectorization like what is seen in the current source link.

I was hoping to simplify the logic to a plain loop with a pragma specifying that the loop should always be vectorized, but with the O2 vs Oz divergence, the pragma doesn't produce the wanted performance behavior consistently across configurations AND it produces a warning that just breaks the compile on other platforms.

From the programmer perspective, it almost feels like I want a way to say "the codegen for this inlined function in a header should be O2 because there is a performance contract for the code that doesn't directly correlate to a translation unit." Any thoughts on how I might get there? Or should I just stick to the manually vectorized version?

Thanks,
Albert

Andrew Grieve

Aug 6, 2020, 3:55:23 PM
to Albert Wong (王重傑), Nico Weber, Sjoerd Meijer, Hans Wennborg, Matt Denton, Hal Finkel, Takuto Ikuta, Clang maintainers, build
For ThinLTO builds (which I think we use for all platforms?), it might work out to put the function definition into a .cc file, and it will still be inlinable.
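
Something like this, as a rough sketch with made-up file names: only the .cc file has to live in an -O2 / optimize_max target, and ThinLTO can still import and inline the definition into callers in other translation units.

// character_properties.h (hypothetical name) -- declaration only.
#include <stddef.h>
bool CharacterProperties(const char* str, size_t length);

// character_properties.cc -- the only file that needs the aggressive
// optimization settings; with ThinLTO the definition can still be inlined
// into callers from other translation units.
#include "character_properties.h"

bool CharacterProperties(const char* str, size_t length) {
  int x = 0;
  #pragma clang loop vectorize(enable) interleave(enable)
  for (size_t i = 0; i < length; ++i)
    x |= str[i] & 0xA0;
  return x;
}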

Hal Finkel

Aug 6, 2020, 4:03:26 PM
to Albert Wong (王重傑), Nico Weber, Sjoerd Meijer, Hans Wennborg, Matt Denton, Takuto Ikuta, Clang maintainers, build

Would -Os work for you instead of -Oz?

At a higher level, two things:

 1. The optimization level, and other settings, certainly can affect how the vectorizer sees the loop, and thus whether or not it understands the control flow -- the vectorizer runs near the end of the pipeline and Oz vs O2 can affect things before it.

 2. For Oz, the vectorizer should not do anything that might increase code size. This includes, for example, having tail loops. We don't have this kind of hard restriction otherwise. Thus, there are loop structures that we just can't vectorize at Oz that we can at O2.
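
To sketch in source terms what that tail-loop cost looks like (illustrative only, with a made-up function name, not actual codegen):

#include <stddef.h>

// The vectorized form is roughly a wide body plus a scalar tail loop for the
// leftover length % 16 elements, so the original scalar loop is still emitted
// in addition to the new vector code.
int OrWithMask(const char* str, size_t length) {
  int x = 0;
  size_t i = 0;
  // Stand-in for the vector body, handling 16 elements per iteration.
  for (; i + 16 <= length; i += 16)
    for (size_t j = 0; j < 16; ++j)
      x |= str[i + j] & 0xA0;
  // Scalar tail loop for the remainder: extra code size that -Oz refuses to pay for.
  for (; i < length; ++i)
    x |= str[i] & 0xA0;
  return x;
}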

If you want certain functions to be compiled with particular optimization-size levels, and for these to get optimized along with other code compiled with different optimization-size levels, you need to use LTO. Our LTO can keep track of the Os/Oz of a function on a per-function basis even during cross-translation-unit optimization.

 -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Nico Weber

Aug 6, 2020, 4:03:50 PM
to Andrew Grieve, Albert Wong (王重傑), Sjoerd Meijer, Hans Wennborg, Matt Denton, Hal Finkel, Takuto Ikuta, Clang maintainers, build
We currently only use ThinLTO on Linux and Android, and I think we use it in the "don't optimize, just CFI" mode even there. (Eventually we're hoping to use it in optimizing mode everywhere, but not yet.)

Albert: That particular function is only called in wtf which iirc is optimize_max (not sure though).

A function in a header can't know if it should be compiled super optimized, since different callers might be differently hot. So even marking it ALWAYS_INLINE might not be the most appropriate thing -- another option is to instead mark the hot functions that call it as __attribute__((flatten)) to make them inline their callees.
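
For example, a rough sketch with made-up names (not actual Blink code):

#include <stddef.h>

inline bool IsUpperAscii(char ch) {
  return ch >= 'A' && ch <= 'Z';
}

// flatten asks the compiler to inline the calls made *inside* this function,
// so the header helper stays ordinary and only this hot call site pays for
// the aggressive inlining.
__attribute__((flatten)) bool HasUpperAscii(const char* str, size_t length) {
  bool has_upper = false;
  for (size_t i = 0; i < length; ++i)
    has_upper |= IsUpperAscii(str[i]);
  return has_upper;
}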