Auto-vectorization: what can I do to help?

14 views
Skip to first unread message

Matt Turner

unread,
Oct 26, 2010, 12:43:19 AM10/26/10
to poll...@googlegroups.com
Hi,
As part of a class research project, I'm interested in implementing
some sort of auto-vectorization in LLVM, but I see that I'd probably
be duplicating a bunch of effort. My due date is 4~5 weeks away, a
which point I'd like to have some proof-of-concept in place (but I'm
not planning to stop work then!)

It looks like you just finished Phase 1, and Phase 2 includes
vectorization, so maybe I came just at the right time. Can someone
tell me if a basic implementation is doable in 4~5 weeks, and if so,
what's the best way to proceed?

Thanks!
Matt Turner

Tobias Grosser

unread,
Oct 26, 2010, 10:43:19 AM10/26/10
to Matt Turner, poll...@googlegroups.com
Hey Matt,

Great that you are planning to work on auto vectorization in LLVM. Polly
might be a good choice. ;-)

I do not believe it is too difficult to work on a basic vectorizer. An
approach that might be doable in 4-5 weeks would be to ignore any
transformations and concentrate on generating vector instructions. This
is not a realistic case, but it would show that we can do this. Later on
we can add the required transformations to generate the vectorizable
loops and work on heuristics and legality checks.
For the moment we can force vectorization using command line arguments.

A way to start would be:

1. Write some C code that contains a loop that can easily be vectorized.

We suppose all transformations for vectorization have already been
applied, and the loops we are working on match the following criteria:

* A (small) constant number of iterations
* All memory accesses are stride 1

E.g. something like this:

for (int i = 0; i < 4; ++i)
A[i] = B[i]

2. Check if Polly can recognize it.

3. Introduce a flag that marks a loop in polly vectorizable

This should be set as soon as a loop is easily vectorizable. Which means
it matches the criteria mentioned above.

4. Add a command line argument that flags all loops in polly to be
vectorizable

We should write an analysis to decide this. However to get started we
can use a command line argument to flag the loops vectorizable, without
applying any analysis.

5. Change polly codegeneration to generate vector instructions for loops
flagged as vectorizable.

The code generation should generate vector instructions instead of a
control flow structure based loop for all loops marked vectorizable.

Instead of generating code like this:
for (int i = 0; i < 4; ++i)
A[i] = B[i]

It should generate code like this:

vec_a[0] = A[0]
vec_a[1] = A[1]
vec_a[2] = A[2]
vec_a[3] = A[3]

vec_b = vec_a

B[0] = vec_b[0]
B[1] = vec_b[1]
B[2] = vec_b[2]
B[3] = vec_b[3]

I hope the llvm backend will transform this code into the correct vector
instructions. You probably need to test this too.

That's it. A proof of concept implementation of vectorization. In case
you need any help feel free to ask back on the mailing list.

Cheers
Tobi

Matt Turner

unread,
Nov 29, 2010, 10:19:10 PM11/29/10
to Tobias Grosser, poll...@googlegroups.com
On Tue, Oct 26, 2010 at 2:43 PM, Tobias Grosser
<gro...@fim.uni-passau.de> wrote:
> 4. Add a command line argument that flags all loops in polly to be
> vectorizable
>
> We should write an analysis to decide this. However to get started we can
> use a command line argument to flag the loops vectorizable, without applying
> any analysis.

Attached is a basic implementation. It's essentially just a copy of
559c6696fe19d54b22598313a35009ca43c89365 (and probably doesn't apply
to current git master). Is this more or less what you had in mind?

Some kind of analysis to determine if vectorization is appropriate is
still to-do.

> 5. Change polly codegeneration to generate vector instructions for loops
> flagged as vectorizable.
>
> The code generation should generate vector instructions instead of a control
> flow structure based loop for all loops marked vectorizable.
>
> Instead of generating code like this:
> for (int i = 0; i < 4; ++i)
>  A[i] = B[i]
>
> It should generate code like this:
>
> vec_a[0] = A[0]
> vec_a[1] = A[1]
> vec_a[2] = A[2]
> vec_a[3] = A[3]
>
> vec_b = vec_a
>
> B[0] = vec_b[0]
> B[1] = vec_b[1]
> B[2] = vec_b[2]
> B[3] = vec_b[3]

Being pretty unfamiliar with LLVM-IR codegen, can you give me a few
pointers for how to do this? Even things like which classes and
functions I should take a look at would be helpful.

> I hope the llvm backend will transform this code into the correct vector
> instructions. You probably need to test this too.

I've been playing with things like

typedef __attribute__(( ext_vector_type(4) )) int int4;

int4 test1(int4 A, int4 B) {
return A + B;
}

and it seems to do OK. :)

> That's it. A proof of concept implementation of vectorization. In case you
> need any help feel free to ask back on the mailing list.
>
> Cheers
> Tobi

Thanks a lot, Tobias. I really appreciate your help :)

Matt

add-vector-command-line-flag.patch

Tobias Grosser

unread,
Nov 30, 2010, 12:56:48 PM11/30/10
to Matt Turner, poll...@googlegroups.com
On 11/29/2010 10:19 PM, Matt Turner wrote:
> On Tue, Oct 26, 2010 at 2:43 PM, Tobias Grosser
> <gro...@fim.uni-passau.de> wrote:
>> > 4. Add a command line argument that flags all loops in polly to be
>> > vectorizable
>> >
>> > We should write an analysis to decide this. However to get started we can
>> > use a command line argument to flag the loops vectorizable, without applying
>> > any analysis.
> Attached is a basic implementation. It's essentially just a copy of
> 559c6696fe19d54b22598313a35009ca43c89365 (and probably doesn't apply
> to current git master). Is this more or less what you had in mind?

Sure. It is enough to get started.

> Some kind of analysis to determine if vectorization is appropriate is
> still to-d

We can work on this.

>> > 5. Change polly codegeneration to generate vector instructions for loops
>> > flagged as vectorizable.
>> >
>> > The code generation should generate vector instructions instead of a control
>> > flow structure based loop for all loops marked vectorizable.
>> >
>> > Instead of generating code like this:
>> > for (int i = 0; i< 4; ++i)
>> > A[i] = B[i]
>> >
>> > It should generate code like this:
>> >
>> > vec_a[0] = A[0]
>> > vec_a[1] = A[1]
>> > vec_a[2] = A[2]
>> > vec_a[3] = A[3]
>> >
>> > vec_b = vec_a
>> >
>> > B[0] = vec_b[0]
>> > B[1] = vec_b[1]
>> > B[2] = vec_b[2]
>> > B[3] = vec_b[3]
> Being pretty unfamiliar with LLVM-IR codegen, can you give me a few
> pointers for how to do this? Even things like which classes and
> functions I should take a look at would be helpful.

I have no idea. ;-) We need to figure this out somehow.

A good starter is Support/IRBuilder.h. This is used to build the LLVM-IR
all over the place.

I believe a good start is to find an example, that we want to vectorize.
E.g. this loop:

float A[4];

void foo() {
for (int i = 0; i < 4; i++)
A[i] = 1;
}

I would start to write a small test program around this loop, that
initializes A to "2", executes foo, and checks if A is "1" at every value.

Than I would generate the LLVM-IR for that program and vectorize it by
hand and see if this was correct.

Cheers
Tobi

>> > I hope the llvm backend will transform this code into the correct vector
>> > instructions. You probably need to test this too.
> I've been playing with things like
>
> typedef __attribute__(( ext_vector_type(4) )) int int4;
>
> int4 test1(int4 A, int4 B) {
> return A + B;
> }
>
> and it seems to do OK.:)
>
>> > That's it. A proof of concept implementation of vectorization. In case you
>> > need any help feel free to ask back on the mailing list.
>> >
>> > Cheers
>> > Tobi
> Thanks a lot, Tobias. I really appreciate your help:)
>
> Matt
>
>

> add-vector-command-line-flag.patch
>
>
> diff --git a/lib/CodeGeneration.cpp b/lib/CodeGeneration.cpp
> index 8320c72..c038307 100644
> --- a/lib/CodeGeneration.cpp
> +++ b/lib/CodeGeneration.cpp
> @@ -49,6 +49,11 @@ ParallelDimension("polly-codegen-parallel",
> cl::desc("The dimension which is parallel"), cl::Hidden,
> cl::value_desc("Name of dimension"),
> cl::ValueRequired, cl::init(""));
> +static cl::opt<std::string>
> +VectorDimension("polly-codegen-vector",
> + cl::desc("The dimension which is vectorizable"), cl::Hidden,
> + cl::value_desc("Name of dimension"),
> + cl::ValueRequired, cl::init(""));

I just changed the name of the other flag. Maybe
-enable-polly-vectorize would be a good match. Can you update this?

> typedef DenseMap<const Value*, Value*> ValueMapT;
> typedef DenseMap<const char*, Value*> CharMapT;
> @@ -445,9 +450,48 @@ public:
> << " is parallel\n");
> }
>
> + /// @brief Check if a loop is vectorizable
> + ///
> + /// Detect if a clast_for loop can be vectorized.
> + ///
> + /// @param f The clast for loop to check.
> + bool isVectorizableFor(struct clast_for *f) {
> +
> + // TODO: Implement the correct analysis. At the moment we just trust the
> + // information given at the command line.
> + std::string DimName(f->iterator);
> + return DimName == VectorDimension;
> + }
> +
> + /// @brief Create an vectorized for loop.
> + void codegenForVector(struct clast_for *f) {
> + APInt Stride = APInt_from_MPZ(f->stride);
> + PHINode *IV;
> + Value *IncrementedIV;
> + BasicBlock *AfterBB;
> + Value *LB = ExpGen.codegen(f->LB);
> + Value *UB = ExpGen.codegen(f->UB);
> + createLoop(Builder, LB, UB, Stride, IV, AfterBB, IncrementedIV, DT);
> + (CharMap)[f->iterator] = IV;
> +
> + if (f->body)
> + codegen(f->body);
> +
> + BasicBlock *HeaderBB = *pred_begin(AfterBB);
> + BasicBlock *LastBodyBB = Builder->GetInsertBlock();
> + Builder->CreateBr(HeaderBB);
> + IV->addIncoming(IncrementedIV, LastBodyBB);
> + Builder->SetInsertPoint(AfterBB);
> + DEBUG(dbgs()<< "Loop with header: "<< HeaderBB->getNameStr()
> +<< " is vectorizable\n");
> + }
> +
> void codegen(struct clast_for *f) {
> if (isParallelFor(f))
> codegenForOpenMP(f);
> + else if (isVectorizableFor(f))
> + codegenForVector(f);
> else
> codegenForSequential(f);
> }

Otherwise the code looks good.

Cheers
Tobi

Tobias Grosser

unread,
Nov 30, 2010, 1:05:09 PM11/30/10
to poll...@googlegroups.com
I forgot to say. I added on inline comment. Please address this and
submit an up to date patch. We can than commit it.

Cheers
Tobi

Matt Turner

unread,
Dec 6, 2010, 5:58:15 PM12/6/10
to Tobias Grosser, poll...@googlegroups.com
On Tue, Nov 30, 2010 at 6:05 PM, Tobias Grosser
<gro...@fim.uni-passau.de> wrote:
> I forgot to say. I added on inline comment. Please address this and submit
> an up to date patch. We can than commit it.
>
> Cheers
> Tobi

Sorry for the delay. It looks like you committed something similar in
the meantime?

Matt

Tobias Grosser

unread,
Dec 7, 2010, 3:26:30 PM12/7/10
to Matt Turner, poll...@googlegroups.com

Yes, I continued the work in this area, as we needed it. However there
is still a lot more that needs to be done. The basic code generation is
already done, but now the interesting stuff starts to pop up. ;-)

You could e.g. take a simple example and try to optimize it and get
performance numbers. Just figuring out the right loop structure for
matrix multiplication is an interesting problem and will help you to
understand how all this stuff works. As stated in my earlier email,
creating a simple test case for which you would like to improve
performance is a good start to get into all this.

If you want more programming you could e.g. work on array privatization
or the recognition of reductions.

Let me know what your plans are and I can give you more details.

Cheers
Tobi

Matt Turner

unread,
Dec 7, 2010, 11:55:49 PM12/7/10
to Tobias Grosser, poll...@googlegroups.com
On Tue, Dec 7, 2010 at 8:26 PM, Tobias Grosser

Thanks for the email! I think recognition of reductions would be a
good topic, so if you're not planning to work on this until at least
Sunday (my project's due date), then I'll see what I can do.

My copy of Allen & Kennedy Optimizing Compilers book has a section on
recognition of reductions and outlines the properties of a reduction:
1. reduce a vector/array down to a single element
2. only final result is used later (ie, no intermediate use)
3. loop operates on the vector/array and nothing else

What do you suggest?

Thanks,
Matt

Tobias Grosser

unread,
Dec 8, 2010, 12:33:45 AM12/8/10
to poll...@googlegroups.com
> My copy of Allen& Kennedy Optimizing Compilers book has a section on

> recognition of reductions and outlines the properties of a reduction:
> 1. reduce a vector/array down to a single element
> 2. only final result is used later (ie, no intermediate use)
> 3. loop operates on the vector/array and nothing else

I am not working on this for the next couple of weeks. So you should be
save. (And sorry again for getting into your project).

Regarding reductions I propose to take a simple example:

for (int i = 0; i <= N; i++)
R[0] += A[i];

Generate a full test program and check that polly is able to detect that
scop. Have a look at the output of -polly-scops -analyze and
-polly-dependences -analyze.

Than you can start and reason on how we can detect a reduction.

What do we know about them?

As you stated reductions are statements, that always read from one same
memory location and later write to this memory location again.

Here a the whole calculation:

R = (((R[0] + A[0]) + A[1]) + A[2]) + ..

And the structure of a single statement:

R[x] A[i]
\ /
STMT
|
R[x]

R[x] = A[i] + R[x]

I propose to start to check each statement/basic_block and see:

1. It has exactly one store and two loads.
2. One of the loads accesses the same memory location as the store
3. .....


You can start in TempScopInfo.cpp by deriving your function from
TempScopInfo::buildAccessFunctions

bool TempScopInfo::isReduction();

Iterate over all instructions and count the number of read and write
accesses. Return true if read == 2 and write == 1.

Cheers
Tobi

Matt Turner

unread,
Dec 8, 2010, 11:36:36 PM12/8/10
to Tobias Grosser, poll...@googlegroups.com

I just sent you two patches doing exactly these things.

Please give any feedback you have, and if they're acceptable please
apply (with `git am` or with `git apply --author=...` so I can have
something to point to :)

isReduction correctly returns false for

void vector_add(void) {
int i;
for (i = 0; i < NUM; i++) {
A[i] = B[i] * C[i];
}
}

and correctly return true for

int reduction(void) {
int i;
for (i = 0; i < NUM; i++) {
R += A[i];
}
return R;
}

Perhaps I should submit a test case too to be included in the test suite?

Anyway, I can now recognize basic reductions, though probably it
incorrectly detects some as well. I'll do some research on how to
optimize and transform reductions and see what I can come up with. Any
suggestions greatly appreciated. :)

Thanks!
Matt

Tobias Grosser

unread,
Dec 8, 2010, 11:41:25 PM12/8/10
to Matt Turner, poll...@googlegroups.com

Good work. I will apply them in a minute. The patches are fine.


> Perhaps I should submit a test case too to be included in the test suite?

Yes. Please add those two test cases to the test suite.

For this I propose to add a attribute

std::set<BasicBlock*> Reductions;

to TempScopInfo. Check all basic blocks, if they are reductions and add
them to this set, if they are. Than you can enhance
TempScop::printDetail() to add print this. In the test suite you can now
check with opt -polly-analyze-ir -analyze %s | if the reductions are
detected correctly.

Cheers
Tobi

Tobias Grosser

unread,
Dec 8, 2010, 11:53:35 PM12/8/10
to poll...@googlegroups.com, Matt Turner

Done.

>> Perhaps I should submit a test case too to be included in the test suite?
> Yes. Please add those two test cases to the test suite.
>
> For this I propose to add a attribute
>
> std::set<BasicBlock*> Reductions;
>
> to TempScopInfo. Check all basic blocks, if they are reductions and add
> them to this set, if they are. Than you can enhance
> TempScop::printDetail() to add print this. In the test suite you can now
> check with opt -polly-analyze-ir -analyze %s | if the reductions are
> detected correctly.
>
> Cheers
> Tobi
>
>
>>
>> Anyway, I can now recognize basic reductions, though probably it
>> incorrectly detects some as well. I'll do some research on how to
>> optimize and transform reductions and see what I can come up with. Any
>> suggestions greatly appreciated. :)

The next check I prose is to check that there is only a single use of
the reduction read (the read that has the same address as the write).
This use needs to be the instruction of which the result is stored by
our single.
This is very conservative, but will allow us to to detect very fast most
of the common cases, while never having any false positives.

The last step you need to take to be 100% sure that this basic block is
a reduction is to check for this single instruction that it is
Associative and Commutative. The corresponding functions are in
Instruction.cpp.

Thanks a lot

Tobi

Matt Turner

unread,
Dec 9, 2010, 7:38:44 PM12/9/10
to Tobias Grosser, poll...@googlegroups.com
On Thu, Dec 9, 2010 at 4:53 AM, Tobias Grosser

Two more patches on the way. In the first, to check the number of uses
I had to check for ->hasNUses(3) since the Value is loaded and stored,
causing 2 uses already.

I'll send along some patches for the test cases tonight. I currently
have four: a basic sanity check, testing for multiple uses, testing
for associativity/commutative, and finally one which is actually a
reduction. Are there any other cases I should consider?

Thanks!
Matt

Tobias Grosser

unread,
Dec 9, 2010, 8:05:53 PM12/9/10
to Matt Turner, poll...@googlegroups.com
I am still not following this. Are you talking about the value itself or
of the pointer used by the load and store instructions. I believe the
code as you send it will probably not work, because the third use is a
use of the actual value.

> I'll send along some patches for the test cases tonighti. I currently


> have four: a basic sanity check, testing for multiple uses, testing
> for associativity/commutative, and finally one which is actually a
> reduction. Are there any other cases I should consider?

Yes, please send at least a patch that connects this functionality to
the build and a single patch to check for what we have already. This is
extremely helpful to understand if the code we commit is correct.

We can than add test cases one by one if needed.

Cheers
Tobi

Matt Turner

unread,
Dec 10, 2010, 12:16:29 AM12/10/10
to Tobias Grosser, poll...@googlegroups.com
On Thu, Dec 9, 2010 at 4:53 AM, Tobias Grosser
<gro...@fim.uni-passau.de> wrote:
> The next check I prose is to check that there is only a single use of the
> reduction read (the read that has the same address as the write). This use
> needs to be the instruction of which the result is stored by our single.

This is what I'm having trouble with. I can certainly iterate through
the second load instruction's uses and test them again the next
instruction in the loop, but this feels dirty. I'm not sure how I
should do this.

Matt

Tobias Grosser

unread,
Dec 10, 2010, 12:42:51 AM12/10/10
to Matt Turner, poll...@googlegroups.com

You could get the value that is stored by the store. Check if it is an
instruction and if this instructions is using the loaded value. Finally
you check if the loaded value is only used once.

Cheers
Tobi

Reply all
Reply to author
Forward
0 new messages