[llvm-dev] [RFC] Register Rematerialization (remat) Extension


vivek pandya via llvm-dev

Sep 12, 2016, 11:51:55 AM
to llvm-dev, Hal Finkel, Matthias Braun
Hello Developers,

I am working with some of my batchmates to improve register remat in LLVM.
We want to remat live ranges made of multiple instructions.

To support our proposal, here is a simple example that the current remat
implementation does not cover:

$ cat ~/tmp/tl.c
void foo(long);
void bar() {
  for (int i = 0; i < 1600; ++i)
    foo(3494348345984503943);
}

$ clang -O3 -S -o - ~/tmp/tl.c -target powerpc64
...
# BB#0:                                 # %entry
...
        lis 3, 12414
        ori 3, 3, 27470
        sldi 3, 3, 32
        oris 3, 3, 35809
        ori 30, 3, 20615
...
.LBB0_1:                                # %for.body
        mr 3, 30
        bl foo
...

There is a sequence of instructions used to materialize the constant. The first
one (the lis) is trivially rematerializable, and the others depend only on that one
and have no side effects. If we otherwise needed to spill the constant, we might
wish to move the entire set of instructions that compute the value into the loop body.
(Many thanks to Hal Finkel for this example and head start.)
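To make the example concrete, the effect of the sequence can be modeled in plain C++ (a sketch of the instruction semantics only, not LLVM code), showing how each instruction contributes 16 bits of the 64-bit constant:

```cpp
#include <cstdint>

// Model of the five-instruction PPC64 sequence above. Each step mirrors
// one instruction's effect on the destination register.
uint64_t materializeConstant() {
  uint64_t r = (uint64_t)12414 << 16; // lis 3, 12414   (load shifted immediate)
  r |= 27470;                         // ori 3, 3, 27470
  r <<= 32;                           // sldi 3, 3, 32
  r |= (uint64_t)35809 << 16;         // oris 3, 3, 35809
  r |= 20615;                         // ori 30, 3, 20615
  return r;                           // == 3494348345984503943
}
```

Running this reproduces the constant from the C source, which is what makes the whole chain a remat candidate: it depends on nothing but immediates.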

We are following the old but effective paper "Rematerialization":
http://dl.acm.org/citation.cfm?id=143143 ------------------------------[1]

This extension will especially improve code quality for RISC backends like 
PowerPC, MIPS, ARM, AArch64, etc.

Here is a tentative approach (after reading the above-mentioned paper and the current remat code) that I would like to follow.

Please share your views, because this may be totally the wrong direction. I will
be happy if this gets into mainline LLVM code, but if the community doesn't want
to make remat this heavy, then please guide me from the perspective of my class project.

1 ) As LLVM MI is already in SSA form before register allocation, I think LLVM does not require building an SSA graph and converting it back after the optimization completes, as mentioned in [1].

2 ) We would like to add a pass similar to SCCP.cpp (Sparse Conditional Constant
Propagation, based on Wegman and Zadeck's work http://dl.acm.org/citation.cfm?id=103136) as described in [1]. This pass will be scheduled to run before register allocation.

3 ) The output of the pass added in step 2 will be a map from defs to instruction pointers (the instructions which can be used to remat the given live range). The map will contain live ranges defined by a single instruction as well as by multiple instructions.

4 ) The remat APIs defined in LiveRangeEdit.cpp will use analysis from the Map
when a spill is required for RA.

5 ) The remat transformation APIs like rematerializeAt() will be taught to remat
live ranges with multiple instructions too.

6 ) A cost analysis will be required to decide between remat and spill. This should be based on at least two factors: register pressure and spill cost.
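As a rough illustration of the kind of cost analysis step 6 calls for, here is a hypothetical sketch. Every name and constant in it is an illustrative assumption, not an existing LLVM API: a remat is profitable when re-executing the defining sequence at the uses is cheaper than a store plus reloads, and it must be forbidden when the sequence would itself raise register pressure past the limit.

```cpp
#include <cstdint>

// Hypothetical remat candidate: a value recomputable from NumInstrs
// instructions that need ExtraRegs scratch registers.
struct RematCandidate {
  unsigned NumInstrs;
  unsigned ExtraRegs;
};

// Spill cost: one store plus one reload per use, weighted by loop depth.
unsigned spillCost(unsigned NumUses, unsigned LoopDepth) {
  const unsigned MemOpCost = 4; // assumed relative cost of a memory op
  return MemOpCost * (1 + NumUses) * (1u << LoopDepth);
}

// Remat cost: re-execute the defining sequence at every use point,
// heavily penalized if it would exceed the available registers.
unsigned rematCost(const RematCandidate &C, unsigned NumUses,
                   unsigned LoopDepth, unsigned FreeRegs) {
  unsigned Cost = C.NumInstrs * NumUses * (1u << LoopDepth);
  if (C.ExtraRegs > FreeRegs)
    Cost += 1000; // remat would itself cause spilling; effectively forbid it
  return Cost;
}

bool shouldRemat(const RematCandidate &C, unsigned NumUses,
                 unsigned LoopDepth, unsigned FreeRegs) {
  return rematCost(C, NumUses, LoopDepth, FreeRegs) <
         spillCost(NumUses, LoopDepth);
}
```

For the motivating example (a 5-instruction sequence, one use inside a depth-1 loop), this model prefers remat as long as a scratch register is available, and falls back to spilling otherwise.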

Few points:
--------------
* The analysis pass to be added as per (2) will use target-specific information
from TargetInstrInfo.cpp, as the current remat infrastructure does.

* This approach will not be on-demand as the current approach is (i.e. remat-specific
code is executed only if there is a spill), so the pass in (2) can be an
overhead; we may want to enable it only at higher optimization levels.

* Will it be possible to use the existing SCCP.cpp code with a few modifications to the lattice
and the related mathematical operations, so that it can serve both purposes?

* No changes to the current register allocators or the spill framework will be required,
because the remat entry point will be LiveRangeEdit.

Any other way with less overhead is always welcome.
Please help us develop a plan to implement this.

Hoping for comments!

Sincerely,
Vivek

Andrew Trick via llvm-dev

Sep 12, 2016, 1:14:12 PM
to vivek pandya, llvm-dev, Matthias Braun
On Sep 12, 2016, at 8:51 AM, vivek pandya via llvm-dev <llvm...@lists.llvm.org> wrote:


1 ) As LLVM MI is already in SSA form before register allocation, I think LLVM does not require building an SSA graph and converting it back after the optimization completes, as mentioned in [1].

2 ) We would like to add a pass similar to SCCP.cpp (Sparse Conditional Constant
Propagation, based on Wegman and Zadeck's work http://dl.acm.org/citation.cfm?id=103136) as described in [1]. This pass will be scheduled to run before register allocation.

3 ) The output of the pass added in step 2 will be a map from defs to instruction pointers (the instructions which can be used to remat the given live range). The map will contain live ranges defined by a single instruction as well as by multiple instructions.

LiveIntervals maintains a quasi-SSA form via VNInfo. It does not allow efficient def-use queries, but use-def is there, which is all that you should need.

It would be great to have better remat during regalloc, but please try to avoid building additional state that needs to be maintained.

-Andy



_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Philip Reames via llvm-dev

Sep 13, 2016, 10:50:36 AM
to vivek pandya, llvm-dev, Hal Finkel, Matthias Braun
OT: Instead of viewing this as a *single* remat problem, can you view this as a series of remat problems?  If each instruction can be moved while only increasing the live range of a single register and shrinking the live range of another, then moving each instruction one by one and iterating gives the same effect.  Note that this only (obviously) works for single use defs.  Extending it to multiple uses would require a more complicated cost model as you point out. 
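Philip's idea of a series of single-instruction moves might be sketched like this toy model (made-up types; real code would work on MachineInstrs and live intervals). Because every def in the chain has exactly one use, sinking the last instruction into the loop and then iterating on its operand never extends two live ranges at once:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy instruction: one def, at most one use (-1 means none, e.g. an
// immediate load). Purely illustrative.
struct Inst {
  std::string Name;
  int DefReg;
  int UseReg;
};

// Sink the chain that feeds RootReg from the preheader into the loop,
// one instruction at a time, last instruction first. Returns what is
// left in the preheader.
std::vector<Inst> sinkChain(std::vector<Inst> Preheader, int RootReg,
                            std::vector<Inst> &Loop) {
  int Wanted = RootReg;
  while (Wanted != -1) {
    auto It = std::find_if(Preheader.begin(), Preheader.end(),
                           [&](const Inst &I) { return I.DefReg == Wanted; });
    if (It == Preheader.end())
      break;                        // operand not movable; stop here
    Loop.insert(Loop.begin(), *It); // one single-instruction move
    Wanted = It->UseReg;            // next iteration sinks its operand
    Preheader.erase(It);
  }
  return Preheader;
}
```

Applied to the 5-instruction constant chain, each iteration shrinks one live range across the loop and grows none, which is the invariant that lets the moves be done independently.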

There is a sequence of instructions used to materialize the constant. The first
one (the lis) is trivially rematerializable, and the others depend only on that one
and have no side effects. If we otherwise needed to spill the constant, we might
wish to move the entire set of instructions that compute the value into the loop body.
(Many thanks to Hal Finkel for this example and head start.)

We are following the old but effective paper "Rematerialization":
http://dl.acm.org/citation.cfm?id=143143 ------------------------------[1]

This extension will especially improve code quality for RISC backends like 
PowerPC, MIPS, ARM, AArch64, etc.

Here is a tentative approach (after reading the above-mentioned paper and the current remat code) that I would like to follow.

Please share your views, because this may be totally the wrong direction. I will
be happy if this gets into mainline LLVM code, but if the community doesn't want
to make remat this heavy, then please guide me from the perspective of my class project.

1 ) As LLVM MI is already in SSA form before register allocation, I think LLVM does not require building an SSA graph and converting it back after the optimization completes, as mentioned in [1].

2 ) We would like to add a pass similar to SCCP.cpp (Sparse Conditional Constant
Propagation, based on Wegman and Zadeck's work http://dl.acm.org/citation.cfm?id=103136) as described in [1]. This pass will be scheduled to run before register allocation.

3 ) The output of the pass added in step 2 will be a map from defs to instruction pointers (the instructions which can be used to remat the given live range). The map will contain live ranges defined by a single instruction as well as by multiple instructions.
This sounds overly complex.  Can you implement this without needing the new side structure?  Maintaining extra state and keeping it up to date is expensive.  (From a maintenance and code complexity perspective.)


4 ) The remat APIs defined in LiveRangeEdit.cpp will use analysis from the Map
when a spill is required for RA.

5 ) The remat transformation APIs like rematerializeAt() will be taught to remat
live ranges with multiple instructions too.

6 ) A cost analysis will be required to decide between remat and spill. This should be based on at least two factors: register pressure and spill cost.

Few points:
--------------
* The analysis pass to be added as per (2) will use target-specific information
from TargetInstrInfo.cpp, as the current remat infrastructure does.

* This approach will not be on-demand as the current approach is (i.e. remat-specific
code is executed only if there is a spill), so the pass in (2) can be an
overhead; we may want to enable it only at higher optimization levels.
This would be unfortunate.  Not fatal, just unfortunate.




vivek pandya via llvm-dev

Sep 13, 2016, 12:03:39 PM
to Philip Reames, llvm-dev, Matthias Braun, Deboleen Roychowdhury, Jishan, Nirav Rana
Hello Philip,

Your idea seems to be effective and can be tried out with a few changes. But correct me if I got this wrong: by multiple uses, do you mean the same remat sequence being rematerialized again? In that case we can run the analysis again to decide which instructions need to be rematerialized, or we may add a simple caching layer which can hold the analysis result for one compilation unit.
Any thoughts?
 

There is a sequence of instructions used to materialize the constant. The first
one (the lis) is trivially rematerializable, and the others depend only on that one
and have no side effects. If we otherwise needed to spill the constant, we might
wish to move the entire set of instructions that compute the value into the loop body.
(Many thanks to Hal Finkel for this example and head start.)

We are following the old but effective paper "Rematerialization":
http://dl.acm.org/citation.cfm?id=143143 ------------------------------[1]

This extension will especially improve code quality for RISC backends like 
PowerPC, MIPS, ARM, AArch64, etc.

Here is a tentative approach (after reading the above-mentioned paper and the current remat code) that I would like to follow.

Please share your views, because this may be totally the wrong direction. I will
be happy if this gets into mainline LLVM code, but if the community doesn't want
to make remat this heavy, then please guide me from the perspective of my class project.

1 ) As LLVM MI is already in SSA form before register allocation, I think LLVM does not require building an SSA graph and converting it back after the optimization completes, as mentioned in [1].

2 ) We would like to add a pass similar to SCCP.cpp (Sparse Conditional Constant
Propagation, based on Wegman and Zadeck's work http://dl.acm.org/citation.cfm?id=103136) as described in [1]. This pass will be scheduled to run before register allocation.

3 ) The output of the pass added in step 2 will be a map from defs to instruction pointers (the instructions which can be used to remat the given live range). The map will contain live ranges defined by a single instruction as well as by multiple instructions.
This sounds overly complex.  Can you implement this without needing the new side structure?  Maintaining extra state and keeping it up to date is expensive.  (From a maintenance and code complexity perspective.)

Currently I don't think we have any facility to annotate instructions, so to keep the analysis available until register allocation we would need an immutable pass.
 

-Vivek 

Gerolf Hoflehner via llvm-dev

Sep 13, 2016, 9:43:03 PM
to Andrew Trick, vivek pandya, llvm-dev, Matthias Braun
On Sep 12, 2016, at 10:14 AM, Andrew Trick via llvm-dev <llvm...@lists.llvm.org> wrote:


On Sep 12, 2016, at 8:51 AM, vivek pandya via llvm-dev <llvm...@lists.llvm.org> wrote:


1 ) As LLVM MI is already in SSA form before register allocation, I think LLVM does not require building an SSA graph and converting it back after the optimization completes, as mentioned in [1].

2 ) We would like to add a pass similar to SCCP.cpp (Sparse Conditional Constant
Propagation, based on Wegman and Zadeck's work http://dl.acm.org/citation.cfm?id=103136) as described in [1]. This pass will be scheduled to run before register allocation.

3 ) The output of the pass added in step 2 will be a map from defs to instruction pointers (the instructions which can be used to remat the given live range). The map will contain live ranges defined by a single instruction as well as by multiple instructions.

LiveIntervals maintains a quasi-SSA form via VNInfo. It does not allow efficient def-use queries, but use-def is there, which is all that you should need.

I also only see a narrow and specific remat cost problem in the example: is it cheaper to issue a chain of instructions rather than a fill? And for this, use-def is sufficient.


It would be great to have better remat during regalloc, but please try to avoid building additional state that needs to be maintained.

You proposed a fairly complex scheme. The question then is always: is it worth it? To answer that question you would need to investigate and break down the current remat problems (spills but should remat, remats but should spill, should remat at a different location, etc.), e.g. for the LLVM test suite, and show that your algorithm solves the most important ones.

James Molloy via llvm-dev

Sep 19, 2016, 8:51:36 AM
to Gerolf Hoflehner, Andrew Trick, vivek pandya, llvm-dev, Matthias Braun
Hi,

I've been looking at this myself for ARM, and came up with a much simpler solution: lower immediate materializations to a post-RA pseudo and expand the chain of materialization instructions after register allocation / remat. Remat only sees one instruction with no dependencies.
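James's approach might be sketched like this toy expansion (AArch64-style movz/movk; the struct and function are illustrative, not the actual LLVM pseudo-expansion code). Before register allocation the whole constant is a single pseudo with no dependencies; only after allocation is it split into the real chain:

```cpp
#include <cstdint>
#include <vector>

// One instruction produced by expanding the (hypothetical) MOVi64 pseudo.
struct ExpandedInst {
  const char *Opcode;
  uint16_t Imm;
  unsigned Shift;
};

// Post-RA expansion: movz writes the low 16 bits and clears the rest;
// movk keeps other bits and inserts a 16-bit chunk at the given shift.
std::vector<ExpandedInst> expandMovImm64(uint64_t Imm) {
  std::vector<ExpandedInst> Out;
  Out.push_back({"movz", (uint16_t)(Imm & 0xFFFF), 0});
  for (unsigned Shift = 16; Shift < 64; Shift += 16) {
    uint16_t Chunk = (uint16_t)((Imm >> Shift) & 0xFFFF);
    if (Chunk != 0) // all-zero chunks are already cleared by movz
      Out.push_back({"movk", Chunk, Shift});
  }
  return Out;
}

// Sanity helper: the emitted chunks recombine to the original constant.
uint64_t reassemble(const std::vector<ExpandedInst> &Insts) {
  uint64_t V = 0;
  for (const ExpandedInst &I : Insts)
    V |= (uint64_t)I.Imm << I.Shift;
  return V;
}
```

The appeal is that remat never has to reason about a multi-instruction live range at all; the trade-off is that the single pseudo hides the sequence from pre-RA scheduling.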

Did you look down this route and discount it?

Cheers,

James

Bruce Hoult via llvm-dev

Sep 19, 2016, 10:10:34 AM
to vivek pandya, llvm-dev, Matthias Braun
The idea seems sound, but do you really have a CPU in which such a complex rematerialization is better than an L1 cache load from the stack frame?

        lis 3, 12414
        ori 3, 3, 27470
        sldi 3, 3, 32
        oris 3, 3, 35809
        ori 30, 3, 20615

I'm not familiar with modern PPC64, but this seems like a loss on the PPC G5 and, from the docs I quickly found (2-cycle latency on dependent int ALU ops), on Power8 too.

OK, maybe (if I didn't screw up the rldimi):

        lis 3, 12414
        lis 30, 35809
        ori 3, 3, 27470
        ori 30, 30, 20615
        rldimi 30, 3, 32, 0

Or is there something that optimizes such constant-building sequences?
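A quick C++ check, under the standard semantics of rldimi (rotate left, then insert under a mask; with SH=32 and MB=0 it inserts the rotated source into the high 32 bits of the destination), suggests the shorter sequence does produce the same constant:

```cpp
#include <cstdint>

// Model of the shorter lis/lis/ori/ori/rldimi sequence above.
uint64_t materializeViaRldimi() {
  uint64_t r3  = ((uint64_t)12414 << 16) | 27470; // lis 3 + ori 3
  uint64_t r30 = ((uint64_t)35809 << 16) | 20615; // lis 30 + ori 30
  // rldimi 30, 3, 32, 0: keep r30's low 32 bits, insert r3 into the high 32.
  r30 = (r30 & 0xFFFFFFFFULL) | (r3 << 32);
  return r30; // same constant as the five-instruction version
}
```

The two lis/ori pairs are independent, so this variant trades one instruction of dependency height for one extra register, which is exactly the shape Bruce is arguing for.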

vivek pandya via llvm-dev

Sep 19, 2016, 2:17:07 PM
to James Molloy, llvm-dev, Nirav Rana, Matthias Braun
On Mon, Sep 19, 2016 at 6:21 PM, James Molloy <ja...@jamesmolloy.co.uk> wrote:
Hi,

I've been looking at this myself for ARM, and came up with a much simpler solution: lower immediate materializations to a post-RA pseudo and expand the chain of materialization instructions after register allocation / remat. Remat only sees one instruction with no dependencies.

Did you look down this route and discount it?
No, actually I am not very familiar with this topic, so I mostly rely on the available research papers.
Your idea seems to be a simple and good solution, but I am not sure it can cover every possible case.

Do you have a patch for this? I can work on it to make it work for the other architectures for which this would be beneficial.

Sincerely,
Vivek

Quentin Colombet via llvm-dev

Sep 19, 2016, 2:27:17 PM
to vivek pandya, llvm-dev, Nirav Rana, Matthias Braun
Hi Vivek,

On Sep 19, 2016, at 11:17 AM, vivek pandya via llvm-dev <llvm...@lists.llvm.org> wrote:



On Mon, Sep 19, 2016 at 6:21 PM, James Molloy <ja...@jamesmolloy.co.uk> wrote:
Hi,

I've been looking at this myself for ARM, and came up with a much simpler solution: lower immediate materializations to a post-RA pseudo and expand the chain of materialization instructions after register allocation / remat. Remat only sees one instruction with no dependencies.

Did you look down this route and discount it?
No, actually I am not very familiar with this topic, so I mostly rely on the available research papers.
Your idea seems to be a simple and good solution, but I am not sure it can cover every possible case.

This is the way all targets deal with simple rematerialization involving several instructions in LLVM AFAIK.

Basically, the target defines a pseudo instruction that encodes this sequence of instructions and expands it after register allocation. This is a case-by-case thing; there is no patch that can be generalized to other targets.
For instance, look at the expansion of AArch64::MOVi64imm.

The bottom line is, our rematerialization scheme is currently limited, but I am not sure your proposal gets us beyond what we already support.

Cheers,
Q.

Hal Finkel via llvm-dev

Sep 24, 2016, 1:39:12 PM
to Bruce Hoult, llvm-dev, Matthias Braun, vivek pandya
From: "Bruce Hoult" <br...@hoult.org>
To: "vivek pandya" <vivekv...@gmail.com>
Cc: "llvm-dev" <llvm...@lists.llvm.org>, "Hal Finkel" <hfi...@anl.gov>, "Matthias Braun" <ma...@braunis.de>
Sent: Monday, September 19, 2016 9:10:17 AM
Subject: Re: [llvm-dev] [RFC] Register Rematerialization (remat) Extension

The idea seems sound, but do you really have a CPU in which such a complex rematerialization is better than an L1 cache load from the stack frame?

        lis 3, 12414
        ori 3, 3, 27470
        sldi 3, 3, 32
        oris 3, 3, 35809
        ori 30, 3, 20615

I'm not familiar with modern PPC64, but this seems like a loss on the PPC G5 and, from the docs I quickly found (2-cycle latency on dependent int ALU ops), on Power8 too.

OK, maybe (if I didn't screw up the rldimi):

        lis 3, 12414
        lis 30, 35809
        ori 3, 3, 27470
        ori 30, 30, 20615
        rldimi 30, 3, 32, 0

Or is there something that optimizes such sequences building constants?
I don't know about the G5, but I did some experiments on the P8 and I was unable to distinguish the performance of a load vs. the materialization sequence. This matches my expectations: The load should have a load-to-use latency of 3 cycles. The materialization-sequence instructions can issue together, each instruction has only a 1-cycle latency with forwarding (and 2 can execute per cycle), and the height of the dependency chain is 3 instructions. All other things being roughly equal, keeping pressure off of the memory subsystem tends to trump other concerns.
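The dependency-height argument can be made concrete with a small sketch (assuming a uniform 1-cycle forwarded ALU latency, which is an assumption about the core, not measured data). Applied to the shorter lis/lis/ori/ori/rldimi sequence, the height comes out as 3, on par with the 3-cycle load-to-use latency:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// A toy op: the value it defines and the values it reads.
struct Op {
  std::string Def;
  std::vector<std::string> Uses;
};

// Critical-path height of a sequence, assuming every op has 1-cycle
// latency with forwarding and unlimited issue width.
unsigned chainHeight(const std::vector<Op> &Seq) {
  std::map<std::string, unsigned> Ready; // cycle each value becomes ready
  unsigned Height = 0;
  for (const Op &O : Seq) {
    unsigned Start = 0;
    for (const std::string &U : O.Uses)
      Start = std::max(Start, Ready[U]);
    Ready[O.Def] = Start + 1;
    Height = std::max(Height, Ready[O.Def]);
  }
  return Height;
}
```

The two lis ops finish in cycle 1, the two oris in cycle 2, and the rldimi in cycle 3, while the original five-instruction version is a single dependent chain of height 5.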

 -Hal
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel via llvm-dev

Sep 26, 2016, 1:13:34 PM
to Quentin Colombet, llvm-dev, Nirav Rana, Matthias Braun, vivek pandya
From: "Quentin Colombet via llvm-dev" <llvm...@lists.llvm.org>
To: "vivek pandya" <vivekv...@gmail.com>
Cc: "llvm-dev" <llvm...@lists.llvm.org>, "Nirav Rana" <h201...@pilani.bits-pilani.ac.in>, "Matthias Braun" <ma...@braunis.de>
Sent: Monday, September 19, 2016 1:27:10 PM
Subject: Re: [llvm-dev] [RFC] Register Rematerialization (remat) Extension

Hi Vivek,

On Sep 19, 2016, at 11:17 AM, vivek pandya via llvm-dev <llvm...@lists.llvm.org> wrote:



On Mon, Sep 19, 2016 at 6:21 PM, James Molloy <ja...@jamesmolloy.co.uk> wrote:
Hi,

I've been looking at this myself for ARM, and came up with a much simpler solution: lower immediate materializations to a post-RA pseudo and expand the chain of materialization instructions after register allocation / remat. Remat only sees one instruction with no dependencies.

Did you look down this route and discount it?
No, actually I am not very familiar with this topic, so I mostly rely on the available research papers.
Your idea seems to be a simple and good solution, but I am not sure it can cover every possible case.

This is the way all targets deal with simple rematerialization involving several instructions in LLVM AFAIK.

Basically, the target defines a pseudo instruction that encodes this sequence of instructions and expands it after register allocation. This is a case-by-case thing; there is no patch that can be generalized to other targets.
For instance, look at the expansion of AArch64::MOVi64imm.

The bottom line is, our rematerialization scheme is currently limited, but I am not sure your proposal gets us beyond what we already support.
I might have misunderstood the proposal, but why do you say that? The problem is not limited to constants (as perhaps evidenced by Ivan Baev's talk at the 2014 dev meeting). One basic thing we should get, for example, is:

  q = ...;
  r = ...;
  for (...) {
    // complicated stuff
    foo(q, r, q - r); // should prefer to remat (q-r) here instead of spilling.
  }

Also, this is all perhaps related to https://llvm.org/bugs/show_bug.cgi?id=25373#c9

Thanks again,
Hal

Quentin Colombet via llvm-dev

Sep 27, 2016, 5:10:48 PM
to Hal Finkel, llvm-dev, Nirav Rana, Matthias Braun, vivek pandya
I agree, but this is not what Vivek is aiming at, is he?

vivek pandya via llvm-dev

Sep 27, 2016, 6:00:22 PM
to Quentin Colombet, llvm-dev, Nirav Rana, Matthias Braun
Hello Quentin,

I am certainly looking to implement the best of what can be achieved, and that will also require a cost estimation module which can address the mentioned bug. However, I am still looking for a simpler (or less heavy) approach than what I proposed, so that whatever I implement can be sent upstream.

Sincerely,
Vivek