[LLVMdev] Is there pass to break down <4 x float> to scalars


Liu Xin

Oct 24, 2013, 11:02:52 PM
to LLVM Developers Mailing List
Hi, LLVM community, 

I write some code by hand using LLVM IR. For simplicity, I write it in <4 x float>. Now I have found that some stores of elements are useless.

For example, if I store {0.0, 1.0, 2.0, 3.0} to a <4 x float> %a, maybe only %a.xy is live in my program. Our target doesn't feature SIMD instructions, which means we have to lower vectors into many scalar instructions. I found LLVM doesn't have DSE in codegen, right?
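A minimal sketch of the situation being described (hypothetical IR, written in the typed-pointer syntax of the time; "%a.xy" here just means lanes 0 and 1):

```llvm
; Hypothetical example: all four lanes are stored, but only
; lanes 0 and 1 ("xy") are ever used afterwards.
define float @use_xy() {
entry:
  %a = alloca <4 x float>
  store <4 x float> <float 0.0, float 1.0, float 2.0, float 3.0>,
        <4 x float>* %a
  %v = load <4 x float>* %a
  %x = extractelement <4 x float> %v, i32 0
  %y = extractelement <4 x float> %v, i32 1
  %sum = fadd float %x, %y
  ret float %sum
}
```

On a scalar-only target the vector store is lowered to four scalar stores, of which the z and w ones are dead; unless the vector operations are scalarised first, IR-level DSE never gets a chance to delete them.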


Is there a pass which can break down vector operations into scalars?


thanks,
--lx

Richard Sandiford

Oct 25, 2013, 6:06:10 AM
to Liu Xin, LLVM Developers Mailing List
I wanted the same thing for SystemZ, which doesn't have vectors,
in order to improve the llvmpipe code. FWIW, here's what I have locally.

It is able to decompose loads and stores, but I found in the llvmpipe case
that this made things worse with TBAA, because DAGCombiner::GatherAllAliases
has some fairly strict limits. So I disabled that by default; use
-decompose-vector-load-store to re-enable it.

The main motivation for z was instead to get InstCombine to rewrite
things like scalarised selects.

I haven't submitted it yet because it's less of a win than the TBAA
DAGCombiner patch I posted, so I didn't want to distract from that.
It would also need some TargetTransformInfo hooks to decide which
vectors should be decomposed.

Thanks,
Richard

decompose-vectors.diff

Renato Golin

Oct 25, 2013, 8:53:35 AM
to Liu Xin, LLVM Developers Mailing List, rsan...@linux.vnet.ibm.com
On 25 October 2013 11:06, Richard Sandiford <rsan...@linux.vnet.ibm.com> wrote:
I wanted the same thing for SystemZ, which doesn't have vectors,
in order to improve the llvmpipe code.

Hi Richard,

This is a nice patch. I was wondering how hard it'd be to do that, and it seems that you're catching lots of corner cases.

My interest is also in converting odd vectors into scalars, but then converting them back into CPU vectors, say when going from OpenCL to NEON code.


It would also need some TargetTransformInfo hooks to decide which
vectors should be decomposed.

If I got it right, this may not be necessary, or it may even be harmful. 

Say you decide that <4 x i32> vectors should be left alone, so that your pass only scalarises the others. But when the vectorizer passes again (to try to use CPU vector instructions), it might not match the scalarised version with the vector, and you end up with data movement between the scalar and vector pipelines, which normally slows down CPUs (at least in ARM's case). Also, problematic cases like <5 x i32> could be better split into 3+2 pairs, rather than 4+1.

If you scalarise everything, then the vectorizers will have a better chance of spotting patterns and vectorising the whole lot, based on target transform info.
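For illustration (a hypothetical sketch, not taken from the patch), a fully scalarised <4 x i32> add is just four independent scalar operations, which an SLP-style vectorizer is then free to re-pack to whatever width the target supports:

```llvm
; Scalarised form of: %r = add <4 x i32> %x, %y
; A vectorizer with target transform info can re-pack these
; four adds into one native vector add, or into a 3+2 split
; for odd widths such as <5 x i32>.
%r0 = add i32 %x0, %y0
%r1 = add i32 %x1, %y1
%r2 = add i32 %x2, %y2
%r3 = add i32 %x3, %y3
```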

Is that what you had in mind?

cheers,
--renato

Richard Sandiford

Oct 25, 2013, 9:15:20 AM
to Renato Golin, Liu Xin, LLVM Developers Mailing List
Renato Golin <renato...@linaro.org> writes:
> On 25 October 2013 11:06, Richard Sandiford <rsan...@linux.vnet.ibm.com> wrote:
>> It would also need some TargetTransformInfo hooks to decide which
>> vectors should be decomposed.
>
> If I got it right, this may not be necessary, or it may even be harmful.
>
> Say you decide that <4 x i32> vectors should be left alone, so that your
> pass only scalarises the others. But when the vectorizer passes again (to
> try to use CPU vector instructions), it might not match the scalarised
> version with the vector, and you end up with data movement between the
> scalar and vector pipelines, which normally slows down CPUs (at least in
> ARM's case). Also, problematic cases like <5 x i32> could be better split
> into 3+2 pairs, rather than 4+1.
>
> If you scalarise everything, then the vectorizers will have a better chance
> of spotting patterns and vectorising the whole lot, based on target
> transform info.
>
> Is that what you had in mind?

To be honest I hadn't really thought about targets with vector units
at all. :-) I was just assuming that we'd want to keep vector operations
together if there's native support. E.g. ISTR comments about not wanting
to rewrite vec_selects because it can be hard to synthesise optimal
sequences from a single canonical form. But I might have got that wrong.
Also, llvmpipe uses intrinsics for some things, so it might be strange
if we decompose IR operations but leave the intrinsics alone.

I'd half wondered whether, as an extension, the pass should split wide
vectors into supported widths. I hadn't thought about the possibility of
decomposing everything and then reassembling it though. I can see how
that would cope with more cases, like you say.

Thanks,
Richard

_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Renato Golin

Oct 25, 2013, 9:28:10 AM
to Renato Golin, Liu Xin, LLVM Developers Mailing List, rsan...@linux.vnet.ibm.com
On 25 October 2013 14:15, Richard Sandiford <rsan...@linux.vnet.ibm.com> wrote:
Also, llvmpipe uses intrinsics for some things, so it might be strange
if we decompose IR operations but leave the intriniscs alone.

Yes, this is a problem with OpenCL as well. I wonder if it'd be useful to have compiler-implemented, inlined library code for each type of supported intrinsic, so that you can also lower them to scalar code.

For targets that don't support the intrinsics, well, the IR wouldn't compile, so being slow is better than not working at all, but I wonder how much worse it would be to do that.

cheers,
--renato

Liu Xin

Oct 25, 2013, 9:49:44 AM
to Liu Xin, LLVM Developers Mailing List, rsan...@linux.vnet.ibm.com
Hi, Richard,

I think we are solving the same problem. I am working on a shader language too. I am not satisfied with the current binaries because vector operations are kept through llvm opt.

The glsl shader language has an operation called "swizzle". It can select sub-components of a vector. If a shader only takes the components "xy" of a vec4, it's certainly wasteful to generate 4 operations for a scalar processor.
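In LLVM IR, such a swizzle typically shows up as a shufflevector (a hypothetical sketch):

```llvm
; An "xy" swizzle of a vec4: keep lanes 0 and 1, drop z and w.
%xy = shufflevector <4 x float> %v, <4 x float> undef,
                    <2 x i32> <i32 0, i32 1>
```

Once the vector is scalarised, the computations feeding lanes 2 and 3 become dead, and ordinary DCE/DSE can remove them.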

I think a good place for this in llvm is codegen. Many compilers have a codegen optimizer. A DSE there would be good enough.

Which posted patch about TBAA do you mean? Do you have yet another solution besides decompose-vectors?


thanks,
--lx
 

Pekka Jääskeläinen

Oct 25, 2013, 10:14:22 AM
to Renato Golin, Liu Xin, LLVM Developers Mailing List, rsan...@linux.vnet.ibm.com
Hi,

Great to see someone working on this. It will benefit the performance
portability goal of pocl's OpenCL kernel compiler. It has been one of
the low-hanging fruits in improving the applicability of its implicit
WG vectorization.

The use case there is that it sometimes makes sense to devectorize
the explicit vector-datatype code of OpenCL kernels in order to create
better opportunities for "horizontal" vectorization across the work-items
inside a work-group.

E.g., the last time I checked, the inner loop vectorizer (which pocl exploits)
just refused to vectorize loops with vector instructions. It might not
be so drastic with the SLP or the BB vectorizer, but in general it might
make sense to let the vectorizer make the decisions on how best to map the
parallel (scalar) operations to the vector hardware, and just help it
with the parallelism knowledge propagated from the parallel program.
One can then fall back to the original (hand-vectorized) code in case
the autovectorization failed, to still get some vector hardware
utilization.

On 10/25/2013 04:15 PM, Richard Sandiford wrote:
> To be honest I hadn't really thought about targets with vector units
> at all. :-) I was just assuming that we'd want to keep vector operations
> together if there's native support. E.g. ISTR comments about not wanting
> to rewrite vec_selects because it can be hard to synthesise optimal
> sequences from a single canonical form. But I might have got that wrong.
> Also, llvmpipe uses intrinsics for some things, so it might be strange
> if we decompose IR operations but leave the intrinsics alone.

The issue of intrinsics and vectorization was discussed some time ago.
There it might be better to devectorize to a scalar version of the
intrinsics (if available), as at least the loop vectorizer can also
vectorize a set of selected intrinsics, and the target might have direct
machine instructions for those (which could not be exploited easily from
"inlined" versions).

--
Pekka

Richard Sandiford

Oct 25, 2013, 10:19:36 AM
to Liu Xin, LLVM Developers Mailing List
Liu Xin <navy...@gmail.com> writes:
> I think we are solving the same problem. I am working on a shader language
> too. I am not satisfied with the current binaries because vector operations
> are kept through llvm opt.
>
> The glsl shader language has an operation called "swizzle". It can select
> sub-components of a vector. If a shader only takes the components "xy" of a
> vec4, it's certainly wasteful to generate 4 operations for a scalar
> processor.
>
> I think a good place for this in llvm is codegen. Many compilers have a
> codegen optimizer. A DSE there would be good enough.
>
> Which posted patch about TBAA do you mean? Do you have yet another solution
> besides decompose-vectors?

Ah, no, the TBAA thing is separate really. llvmpipe generally operates
on 4 rows at a time, so some functions end up with patterns like:

load <16 x i8> row0 ...
load <16 x i8> row1 ...
load <16 x i8> row2 ...
load <16 x i8> row3 ...
... do stuff ...
store <16 x i8> row0 ...
store <16 x i8> row1 ...
store <16 x i8> row2 ...
store <16 x i8> row3 ...

Since the row stride is variable, llvm doesn't have enough information
to tell that these rows don't alias. So it has to keep the loads and
stores in order. And z only has 16 general registers, so a naively-
scalarised 16 x i8 operation rapidly runs out. With unmodified llvmpipe
IR we get lots of spills.

Since z also has x86-like register-memory operations, a few spills are
usually OK. But in this case we have to load i8s and immediately
spill them.

So the idea was to add TBAA information to the llvmpipe IR to say that
the rows don't alias. (At the moment I'm only doing that by hand on
saved IR, I've not done it in llvmpipe itself yet.) Combined with
-combiner-alias-analysis -combiner-global-alias-analysis, this allows
the loads and stores to be reordered, which gives much better code.
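A hand-written sketch of what such per-row tags might look like (hypothetical metadata names, in the old-style scalar TBAA syntax of the time):

```llvm
; Each row gets its own leaf node under a common root, so any two
; different rows are considered no-alias by TBAA, even though the
; variable row stride hides that from BasicAA.
%r0 = load <16 x i8>* %row0ptr, !tbaa !1
%r1 = load <16 x i8>* %row1ptr, !tbaa !2

!0 = metadata !{metadata !"llvmpipe row data"}
!1 = metadata !{metadata !"row0", metadata !0}
!2 = metadata !{metadata !"row1", metadata !0}
```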

However, the problem at the moment is that there are other scalar loads
that get rewritten by DAGCombiner and the legalisation code, and in the
process lose their TBAA info. This then interferes with the optimisation
above. So I wanted to make sure that the TBAA information is kept around:

http://llvm-reviews.chandlerc.com/D1894

It was just that if I had a choice of only getting one of the two patches in,
it'd definitely be the D1894 one. It sounds like there's more interest in
the DecomposeVectors patch than I'd expected though, so I'll get back to it.

Maybe as a first cut we can have a TargetTransformInfo hook to enable or
disable the pass wholesale, with a command-line option to override it.

Thanks to you and Renato for the feedback.

Richard

Richard Sandiford

Oct 25, 2013, 10:38:54 AM
to Pekka Jääskeläinen, Liu Xin, LLVM Developers Mailing List
Pekka Jääskeläinen <pekka.jaa...@tut.fi> writes:
> E.g., the last time I checked, the inner loop vectorizer (which pocl exploits)
> just refused to vectorize loops with vector instructions. It might not
> be so drastic with the SLP or the BB vectorizer, but in general it might
> make sense to let the vectorizer make the decisions on how best to map the
> parallel (scalar) operations to the vector hardware, and just help it
> with the parallelism knowledge propagated from the parallel program.
> One can then fall back to the original (hand-vectorized) code in case
> the autovectorization failed, to still get some vector hardware
> utilization.

Sounds like a nice compromise if it could be made to work. Would it be
LLVM that reverts to the autovectorised version, or pocl?

> On 10/25/2013 04:15 PM, Richard Sandiford wrote:
>> To be honest I hadn't really thought about targets with vector units
>> at all. :-) I was just assuming that we'd want to keep vector operations
>> together if there's native support. E.g. ISTR comments about not wanting
>> to rewrite vec_selects because it can be hard to synthesise optimal
>> sequences from a single canonical form. But I might have got that wrong.
>> Also, llvmpipe uses intrinsics for some things, so it might be strange
>> if we decompose IR operations but leave the intrinsics alone.
>
> The issue of intrinsics and vectorization was discussed some time ago.
> There it might be better to devectorize to a scalar version of the
> intrinsics (if available), as at least the loop vectorizer can also
> vectorize a set of selected intrinsics, and the target might have direct
> machine instructions for those (which could not be exploited easily from
> "inlined" versions).

Yeah, I vaguely remember some objections to handling target-specific
intrinsics at the IR level, which I heard is what put others off doing
the pass. In my case life is much simpler: there are no intrinsics
and there's no native vector support. So in some ways I've only done
the easy bit. I'm just hoping it's also the less controversial bit.

Do the OpenCL loops that don't get vectorised (because they already
have some vector ops) also contain vector intrinsics, or is it usually
generic vector IR? Would a pass that just scalarises the generic
operations but keeps intrinsics as-is be any use to you, or would the
intrinsics really need to be handled too?

Thanks for the feedback.

Richard


Pekka Jääskeläinen

Oct 25, 2013, 11:10:21 AM
to Renato Golin, Liu Xin, LLVM Developers Mailing List, rsan...@linux.vnet.ibm.com
On 10/25/2013 05:38 PM, Richard Sandiford wrote:
> Sounds like a nice compromise if it could be made to work. Would it be
> LLVM that reverts to the autovectorised version, or pocl?

In my opinion LLVM, because this benefits not only pocl's OpenCL WG
autovectorization, but any code that uses explicit vector instructions
and might be more efficiently autovectorized if those were devectorized
first. E.g., C code that uses vector data types via Clang's attributes.

> Yeah, I vaguely remember some objections to handling target-specific
> intrinsics at the IR level, which I heard is what put others off doing
> the pass. In my case life is much simpler: there are no intrinsics
> and there's no native vector support. So in some ways I've only done
> the easy bit. I'm just hoping it's also the less controversial bit.

One solution is to try to scalarize the intrinsic calls
too (if one knows the matching scalar ones), and if that fails, keep them
intact (which potentially leads to additional unpack/pack etc. overheads
if their autovectorization fails).

> Do the OpenCL loops that don't get vectorised (because they already
> have some vector ops) also contain vector intrinsics, or is it usually
> generic vector IR? Would a pass that just scalarises the generic
> operations but keeps intrinsics as-is be any use to you, or would the
> intrinsics really need to be handled too?

Yes to both. It is useful without the intrinsics support, but the
above handling might improve the results for some kernels.

OpenCL builtins (math functions) have vector versions, so they are
called if one uses the vector data types. Then one sometimes ends up
with vector intrinsics in the bitcode.

In case I wasn't clear, there are two dimensions along which one can
autovectorize OpenCL C kernels: inside the SPMD kernel description
(a single work-item) itself, or across the "implicit parallel loops"
over all work-items in the work-group. I was referring to the latter,
as that is where the massive data parallelism, and thus the more
scalable vectorization opportunities, usually are.

--
Pekka

Pekka Jääskeläinen

Oct 25, 2013, 11:19:43 AM
to Liu Xin, LLVM Developers Mailing List, rsan...@linux.vnet.ibm.com
On 10/25/2013 05:19 PM, Richard Sandiford wrote:
> Since the row stride is variable, llvm doesn't have enough information
> to tell that these rows don't alias.

This sounds like a use case for parallelism metadata on
unrolled parallel code (which a "parallel program AA"
would exploit). I proposed a format for it some time ago here,
but haven't had time to take it further.

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-March/060270.html

--
Pekka

Liu Xin

Oct 30, 2013, 5:04:32 AM
to Liu Xin, LLVM Developers Mailing List, rsan...@linux.vnet.ibm.com
Hi, Richard, 

Your decompose-vectors patch works perfectly on my side. Unfortunately, I still get poor code because llvm '-dse' fails when run after 'decompose-vectors'.
I read the DSE code and it is definitely capable of eliminating unused memory stores if its AA works. I don't think basic AA works for me; my program has complex memory accesses, such as two-dimensional arrays.

Sorry, I am not good at AA. My understanding was that TBAA is just for C++. Do you mean that you can make use of TBAA to help DSE?
Why is TBAA a total no-op for my program? basicaa does even better than -tbaa.

liuxin@rd58:~/testbed$ opt  -tbaa -aa-eval -decompose-vectors -mem2reg -dse test.bc -debug-pass=Structure -o test.opt.bc  -stats
Pass Arguments:  -targetlibinfo -no-aa -tbaa -aa-eval -decompose-vectors -domtree -mem2reg -memdep -dse -preverify -verify
Target Library Information
No Alias Analysis (always returns 'may' alias)
Type-Based Alias Analysis
  ModulePass Manager
    FunctionPass Manager
      Exhaustive Alias Analysis Precision Evaluator
      Decompose vector operations into smaller pieces
      Dominator Tree Construction
      Promote Memory to Register
      Memory Dependence Analysis
      Dead Store Elimination
      Preliminary module verification
      Module Verifier
    Bitcode Writer
===== Alias Analysis Evaluator Report =====
  1176 Total Alias Queries Performed
  0 no alias responses (0.0%)
  1176 may alias responses (100.0%)
  0 partial alias responses (0.0%)
  0 must alias responses (0.0%)
  Alias Analysis Evaluator Pointer Alias Summary: 0%/100%/0%/0%
  49 Total ModRef Queries Performed
  0 no mod/ref responses (0.0%)
  0 mod responses (0.0%)
  0 ref responses (0.0%)
  49 mod & ref responses (100.0%)
  Alias Analysis Evaluator Mod/Ref Summary: 0%/0%/0%/100%

Our C/C++ compiler uses Steensgaard's points-to algorithm, so I turned to look for -steens-aa. It seems that llvm's poolalloc project implements steens-aa, right? Is it still maintained?
I found I cannot build rDSA with the latest llvm headers.

thanks,
--lx

Richard Sandiford

Oct 30, 2013, 8:13:52 AM
to Liu Xin, LLVM Developers Mailing List
Liu Xin <navy...@gmail.com> writes:
> Your decompose-vectors patch works perfectly on my side. Unfortunately, I
> still get poor code because llvm '-dse' fails when run after
> 'decompose-vectors'.
> I read the DSE code and it is definitely capable of eliminating unused
> memory stores if its AA works. I don't think basic AA works for me; my
> program has complex memory accesses, such as two-dimensional arrays.
>
> Sorry, I am not good at AA. My understanding was that TBAA is just for C++.

Well, as I understand it, what's called TBAA in llvm is mostly an alias
set hierarchy. You can use the same infrastructure for any situation in
which you know that two accesses can't overlap, even if that doesn't
really map to "type"s in the language sense. So it's more than just C++
(or other languages, like LangRef.rst says).

In my case, I'm using TBAA for IR generated by llvmpipe. The information
I'm adding isn't really related to the C types in llvmpipe (or gallium/mesa
generally). It just says that two accesses can't overlap because they
refer to different arrays, or different rows/slices of the same array.

> Do you mean that you can make use of TBAA to help DSE?

The main reason I wanted TBAA is to help scheduling. None of the
accesses are dead, but I want to be able to interleave them to reduce
register pressure.

Sorry, I don't really know this part of llvm, so I'm not sure what to suggest.
Hopefully someone else will comment.

Thanks,

Liu Xin

Oct 31, 2013, 10:56:29 AM
to Liu Xin, LLVM Developers Mailing List, rsan...@linux.vnet.ibm.com
Richard,
Thank you. Building up a points-to analysis is non-trivial. I will investigate along these lines. Thanks for the suggestion!
--lx
