[llvm-dev] Open Project : Inter-procedural Register Allocation [GSoC 2016]


vivek pandya via llvm-dev

Feb 10, 2016, 12:17:17 AM
to llvm...@lists.llvm.org
Hello Community,

I would like to know the status of this project and how important it is. If the project is still open, I would like to work on a GSoC 2016 proposal for Inter-procedural Register Allocation. In that case, please also suggest a possible mentor, or let me know if anyone is willing to mentor this.

Sincerely, 
Vivek Pandya

Sanjoy Das via llvm-dev

Mar 22, 2016, 8:25:18 PM
to vivek pandya, llvm-dev, Matthias Braun
Hi Vivek,

[+CC Matthias, Quentin]

Inter-procedural register allocation can be a big win, but my estimate
is that it will be challenging to complete within one summer unless
you're already familiar with LLVM's register allocator.

I've CC'ed some people who can give you some more detailed information.

-- Sanjoy

> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

--
Sanjoy Das
http://playingwithpointers.com

Sanjoy Das via llvm-dev

Mar 22, 2016, 8:27:51 PM
to vivek pandya, llvm-dev, Matthias Braun
Apologies: didn't notice how old this thread is before replying.

Matthias Braun via llvm-dev

Mar 22, 2016, 9:04:18 PM
to Sanjoy Das, llvm-dev, vivek pandya
No need to apologize; this thread surely deserved some answers :)

From my perspective this project sounds doable. I would expect the register allocation parts to be not too hard: I imagine this being just distilling a new clobber regmask after allocating a function. I would expect the challenging (or annoying) part to be getting a machine module pass (or a similar mechanism to influence the order in which functions are processed) and a call graph in the backend. So this might end up being more pass manager / infrastructure work than register allocation.

I'd be happy to answer detail questions or give guidance on the register allocation aspects.

- Matthias

Mehdi Amini via llvm-dev

Mar 23, 2016, 12:04:46 AM
to Matthias Braun, llvm-dev, vivek pandya

> On Mar 22, 2016, at 6:04 PM, Matthias Braun via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> No need to apologize this thread surely deserved some answers :)
>
> From my perspective this project sounds doable. I would expect the register allocation parts to be not too hard: I imagine this being just distilling a new clobber regmask after allocating a function. I would expect the challenging (or annoying) part to get a machine module pass (or a similar mechanism to influence the order in which functions are processed) and a callgraph in the backend.

I have a very tiny patch that wraps the backend in a CGSCC pass manager, which I believe will achieve what is needed here, i.e. running CodeGen for every callee before any caller.
I can rebase it if anyone is interested.

--
Mehdi

C Bergström

Mar 23, 2016, 12:14:02 AM
to Mehdi Amini, llvm-dev, vivek pandya
From the research and code I've seen - Doesn't this break regalloc
down into a global and local allocation strategy? (maybe I'm
remembering incorrectly)

vivek pandya via llvm-dev

Mar 23, 2016, 2:30:33 AM
to Sanjoy Das, llvm-dev, Matthias Braun


On Wed, Mar 23, 2016 at 5:54 AM, Sanjoy Das <san...@playingwithpointers.com> wrote:
> Hi Vivek,
>
> [+CC Matthias, Quentin]
>
> Inter-procedural register allocation can be a big win, but my estimate
> is that it will be challenging to complete within one summer unless
> you're already familiar with LLVM's register allocator.

I am currently working on a graph-coloring-based register allocator for my college project, in which I am applying nature-inspired heuristics to find a better coloring. So I am now quite familiar with LLVM's register allocation code and how the related classes can be used. Many thanks to Lang Hames, who has helped me understand all of this (on chat). I feel that with proper guidance I can complete this.

vivek pandya via llvm-dev

Mar 23, 2016, 2:40:59 AM
to Sanjoy Das, llvm-dev, Matthias Braun


On Wed, Mar 23, 2016 at 5:57 AM, Sanjoy Das <san...@playingwithpointers.com> wrote:
> Apologies: didn't notice how old this thread is before replying.

Thank you for the reply!
At the time I started this thread I had also gathered some papers related to this:
- Minimum Cost Interprocedural Register Allocation - Steven M. Kurlander, Charles N. Fischer
- Global Register Allocation at Link Time - David W. Wall
- Interprocedural Register Allocation for Lazy Functional Languages - Urban Boquist
- A Simple Interprocedural Register Allocation Algorithm and Its Effectiveness for LISP - Peter A. Steenkiste and John L. Hennessy

But given the limited interest from the community, I assumed that this was not useful for LLVM. Apparently GCC has some work on this.

vivek pandya via llvm-dev

Mar 23, 2016, 2:44:11 AM
to Matthias Braun, llvm-dev


On Wed, Mar 23, 2016 at 6:34 AM, Matthias Braun <mbr...@apple.com> wrote:
> No need to apologize; this thread surely deserved some answers :)
>
> From my perspective this project sounds doable. I would expect the register allocation parts to be not too hard: I imagine this being just distilling a new clobber regmask after allocating a function. I would expect the challenging (or annoying) part to be getting a machine module pass (or a similar mechanism to influence the order in which functions are processed) and a call graph in the backend. So this might end up being more pass manager / infrastructure work than register allocation.
>
> I'd be happy to answer detail questions or give guidance on the register allocation aspects.

Actually, I have a draft proposal to add a machine module pass (though it is not a very concrete approach).
I was thinking of implementing a simple inter-procedural register allocator as an example that uses this work. But that may be too much work to complete in the summer.

vivek pandya via llvm-dev

Mar 23, 2016, 2:45:17 AM
to Mehdi Amini, llvm-dev


On Wed, Mar 23, 2016 at 9:34 AM, Mehdi Amini <mehdi...@apple.com> wrote:
>> On Mar 22, 2016, at 6:04 PM, Matthias Braun via llvm-dev <llvm...@lists.llvm.org> wrote:
>>
>> No need to apologize; this thread surely deserved some answers :)
>>
>> From my perspective this project sounds doable. I would expect the register allocation parts to be not too hard: I imagine this being just distilling a new clobber regmask after allocating a function. I would expect the challenging (or annoying) part to be getting a machine module pass (or a similar mechanism to influence the order in which functions are processed) and a call graph in the backend.
>
> I have a very tiny patch that wraps the backend in a CGSCC pass manager, which I believe will achieve what is needed here, i.e. running CodeGen for every callee before any caller.
> I can rebase it if anyone is interested.

Yes, most of the inter-procedural analyses in GCC use call graphs, so this can be very useful.

vivek pandya via llvm-dev

Mar 23, 2016, 2:47:17 AM
to C Bergström, llvm-dev


On Wed, Mar 23, 2016 at 9:43 AM, C Bergström <cberg...@pathscale.com> wrote:
> From the research and code I've seen - Doesn't this break regalloc
> down into a global and local allocation strategy? (maybe I'm
> remembering incorrectly)

Yes, I think you are correct. If I recall correctly, IP register allocation first allocates some registers to variables that are used across procedures, and after that the remaining allocation is done at the intra-procedural level.

vivek pandya via llvm-dev

Mar 23, 2016, 3:33:20 AM
to C Bergström, llvm-dev
The deadline for proposal submission is 26 March. I don't think that is sufficient time to make a solid proposal, but I am also interested in working on this without a stipend.

Sincerely,
Vivek

Vivek Pandya

Quentin Colombet via llvm-dev

Mar 23, 2016, 12:49:00 PM
to Matthias Braun, llvm-dev, vivek pandya
The pass manager already has support for call graph connected regions IIRC.
As for the regmask part, we probably could hack something up in a week or so, but I believe this is not what Vivek had in mind.

I think the main challenge of a real inter-procedural register allocator is to change all of the calling conventions dynamically and, more importantly, to convey the right information to other tools (via CFA, CFI, etc.).

Cheers,
Q.

vivek pandya via llvm-dev

Mar 23, 2016, 5:44:26 PM
to Quentin Colombet, llvm-dev


On Wed, Mar 23, 2016 at 10:18 PM, Quentin Colombet <qcol...@apple.com> wrote:
> The pass manager already has support for call graph connected regions IIRC.

If I am not wrong, Quentin and Mehdi Amini are referring to CallGraphSCCPass.cpp.

> As for the regmask part, we probably could hack something up in a week or so, but I believe this is not what Vivek had in mind.

Deciding which operands should be kept in registers across function calls needs to be justified, and for that we can draw on existing research (I mentioned some of it previously and have to read it again; please suggest some more relevant papers). Once that is implemented, we can update the regmask for a call instruction to indicate which registers are free to be used. Am I going in the correct direction?

> I think the main challenge of a real inter-procedural register allocator is to change all of the calling conventions dynamically and, more importantly, to convey the right information to other tools (via CFA, CFI, etc.).

By calling convention, do you mean that it has to be handled differently for different kinds of backends, or are you referring to something I don't know about? I don't understand what you mean by 'convey the right information to other tools'. If we have updated the regmask for a call instruction, then the MachineFunction should be able to reflect that fact in the MachineFunction pass that is used for intra-procedural register allocation; all we have done is allocate some registers that should stay live across the function call.

Sincerely,
Vivek

Quentin Colombet via llvm-dev

Mar 23, 2016, 5:59:44 PM
to vivek pandya, llvm-dev
On Mar 23, 2016, at 2:44 PM, vivek pandya <vivekv...@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 10:18 PM, Quentin Colombet <qcol...@apple.com> wrote:
>> The pass manager already has support for call graph connected regions IIRC.
> If I am not wrong, Quentin and Mehdi Amini are referring to CallGraphSCCPass.cpp.

Yes.

>> As for the regmask part, we probably could hack something up in a week or so, but I believe this is not what Vivek had in mind.
>
> Deciding which operands should be kept in registers across function calls needs to be justified, and for that we can draw on existing research (I mentioned some of it previously and have to read it again; please suggest some more relevant papers). Once that is implemented, we can update the regmask for a call instruction to indicate which registers are free to be used. Am I going in the correct direction?

I do not know if there is a paper on this, as it is quite trivial, but IIRC the Open64 register allocator does that.
Anyhow, the algorithm is:
Given a call graph SCC:
- Allocate a function that has no calls, or whose callees have all been allocated.
- Propagate the clobbered registers to the callers of that function by updating the related regmasks on the call sites.
Repeat until there are no more candidates.

Allocate the remaining functions “normally”.
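
The steps above could be sketched roughly as follows. This is only an illustration in Python, not LLVM code: the names `propagate_regmasks` and `ABI_CLOBBERS`, the register names, and the `written` table (which stands in for actually running the register allocator on each function) are all assumptions made for the example. It also works on individual functions rather than on SCCs, so anything involved in recursion simply falls through to the "normal" case.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical caller-saved set: what a call is assumed to clobber when
# nothing is known about the callee (i.e. the plain ABI regmask).
ABI_CLOBBERS: FrozenSet[str] = frozenset({"r0", "r1", "r2", "r3"})


def propagate_regmasks(
    calls: Dict[str, List[str]],
    written: Dict[str, Iterable[str]],
) -> Dict[str, FrozenSet[str]]:
    """Return, per function, the registers a call to it may clobber.

    calls   maps each function to the functions it calls.
    written maps each function to the registers its own body writes; this
            stands in for actually running the register allocator.
    """
    masks: Dict[str, FrozenSet[str]] = {}
    pending = set(calls)
    progress = True
    while progress:
        progress = False
        for fn in sorted(pending):
            # Candidate: no calls, or every callee already has a mask.
            if all(callee in masks for callee in calls[fn]):
                clobbered = set(written[fn])
                for callee in calls[fn]:
                    # A caller clobbers whatever its callees clobber.
                    clobbered |= masks[callee]
                masks[fn] = frozenset(clobbered)
                pending.discard(fn)
                progress = True
    # Whatever is left (recursion, unknown callees) is handled "normally".
    for fn in pending:
        masks[fn] = ABI_CLOBBERS
    return masks
```

In a real allocator the mask of a callee would also influence which registers the caller picks, which this sketch does not model.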


>> I think the main challenge of a real inter-procedural register allocator is to change all of the calling conventions dynamically and, more importantly, to convey the right information to other tools (via CFA, CFI, etc.).
>
> By calling convention, do you mean that it has to be handled differently for different kinds of backends, or are you referring to something I don't know about? I don't understand what you mean by 'convey the right information to other tools'. If we have updated the regmask for a call instruction, then the MachineFunction should be able to reflect that fact in the MachineFunction pass that is used for intra-procedural register allocation; all we have done is allocate some registers that should stay live across the function call.

My mistake, I thought you had in mind what I call a “true” inter-procedural register allocator: one that changes the allocation at function boundaries as well. I.e., it may decide that it is more efficient to put the first argument of function foo in register FP0 even if the ABI says R0.
With this kind of scheme, you break the ABI (and you need LTO to be allowed to do that), you need to “dynamically” adjust the calling convention to what the register allocator chooses, and moreover you need to be able to communicate to the other tools (dynamic linker, debugger, etc.) the location of the things that are usually defined by the ABI, like the frame pointer, the return value, etc.

Cheers,
-Quentin

Gerolf Hoflehner via llvm-dev

Mar 23, 2016, 9:38:32 PM
to Quentin Colombet, vivek pandya, llvm-dev
On Mar 23, 2016, at 2:59 PM, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:


> I do not know if there is a paper on this, as it is quite trivial, but IIRC the Open64 register allocator does that.
> Anyhow, the algorithm is:
> Given a call graph SCC:
> - Allocate a function that has no calls, or whose callees have all been allocated.
> - Propagate the clobbered registers to the callers of that function by updating the related regmasks on the call sites.
> Repeat until there are no more candidates.

Right direction overall. The simplest approach to this is feasible within a summer and should definitely give you good results when you have cases of hot calls with many spills/fills around them that could be eliminated.

One does not necessarily need the call graph. The compiler can do this as an opportunistic optimization. The callee collects a resource mask and the caller consumes it when it is “there”. Within a module, when the callee (“leaf”) is compiled before the caller, the information is “there”. When the call graph is available, you want a bottom-up walk for this optimization.

A few things to keep an eye on:
- The twist here could be that the bottom-up order conflicts with the layout order, so the two optimizations would have to run independently. (I have not looked into the layout algorithm, so this might not be an actual issue here.)
- You also need to consider the supported preemption model. When a function can be preempted dynamically, the statically collected information for a callee cannot be used and the optimization may not kick in.
- Most of the work I would expect to be tuning the assignment heuristics in the allocator (a live range that spans two call sites: should it go into a scratch register that is not used in one call but is used in the other? How could profile data change that? etc.). But again, perhaps the cheapest approach is not to go into the heuristics and only remove a scratch register fill/spill around a call site when that register is not destroyed anywhere down in the call tree.
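
A small sketch of that cheapest, opportunistic variant, written in Python for illustration only. The function name `spills_needed`, the `ABI_SCRATCH` set and the register names are assumptions made up for the example; the callee's mask is consumed only when it is already available and the callee cannot be preempted, otherwise the ABI default applies.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical scratch (caller-saved) registers of the target ABI.
ABI_SCRATCH: FrozenSet[str] = frozenset({"r0", "r1", "r2", "r3"})


def spills_needed(
    live_across_call: Iterable[str],
    callee: str,
    callee_masks: Dict[str, Iterable[str]],
    preemptible: Iterable[str] = (),
) -> Set[str]:
    """Registers that must still be spilled and refilled around one call.

    callee_masks holds, for callees compiled earlier, every register that
    is destroyed anywhere down their call tree.
    """
    mask = callee_masks.get(callee)
    if mask is None or callee in set(preemptible):
        # Information is not "there", or may be invalidated at run time:
        # fall back to assuming every scratch register is destroyed.
        mask = ABI_SCRATCH
    return set(live_across_call) & set(mask)
```

For example, with `callee_masks = {"leaf": {"r0"}}`, values living in `r1` and `r2` across a call to `leaf` need no spill code, while the same values around a call to an unknown function still do.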

Pete Cooper via llvm-dev

Mar 24, 2016, 2:09:10 PM
to Gerolf Hoflehner, llvm-dev, vivek pandya
On Mar 23, 2016, at 6:38 PM, Gerolf Hoflehner via llvm-dev <llvm...@lists.llvm.org> wrote:


> A few things to keep an eye on:
> - The twist here could be that the bottom-up order conflicts with the layout order, so the two optimizations would have to run independently. (I have not looked into the layout algorithm, so this might not be an actual issue here.)
Layout is just the order in which functions reach the AsmPrinter, so you’re right that this is going to change the order of the function output.  If we care about the order, which we may, then we’d need to cache the data in the AsmPrinter and reorder it there somehow.

Some bonus features that come from codegen on the call graph, and specifically from having accurate regmasks and similar information:
- The X86 VZeroUpper pass should insert fewer VZeroUpper instructions before calls, and could possibly even learn that the state of vzeroupper is known after the call.
- Values left in registers by the callee can be used by the caller instead of being loaded again.

The second one here is fun.  Imagine this pseudo code:

foo:
r0 = 1000
ret

bar:
call foo
vreg1 = vreg2 + 1000

You know which registers contain which values after the call to foo.  In this case you know that the value 1000 is already available in a register, so you can avoid loading it for use in the add.  You could have other values in registers too, even those which are passed in to foo.  The ‘this’ pointer is the best example, as it’s very likely that r0 still contains the this pointer after a call to a function which didn’t overwrite r0 for the return.

The this pointer example is actually related to what Quentin mentioned as a future direction here: rewriting calling conventions.  If you have

int A::foo() {
  return this->value;
}

then you are going to have code something like

foo:
r0 = load r0, #offset_of_value
ret

If the this pointer is live after the call, and it almost certainly is, then it would be better to rewrite this call to avoid clobbering r0.  That is, return the this pointer in r0 and the value in r1.  That could actually be done as an IR-level pass too, though, if it’s deemed profitable.

Anyway, didn’t mean to distract from the immediate goals of this project.  I’m excited to see the SCC code make it in tree and see what else it enables.

Cheers,
Pete

Pete Cooper via llvm-dev

Mar 24, 2016, 4:01:02 PM
to Gerolf Hoflehner, llvm-dev, vivek pandya
One more, just for fun: Inter-procedural stack allocation.  That is, foo calls bar, and bar needs 4 bytes of stack space.  Instead of bar allocating 4 bytes, it adds an attribute to itself, then foo allocates 4 bytes of space at the bottom of its stack for bar to use.
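
To make the bookkeeping concrete, here is a rough Python sketch of how the frame sizes could work out. Everything in it is an assumption for illustration: the name `frame_sizes`, the set `borrows_from_caller` (standing in for the attribute), taking the maximum over callees because sibling calls are never active at the same time, and an acyclic call graph.

```python
from typing import Dict, Iterable, List


def frame_sizes(
    calls: Dict[str, List[str]],
    local_bytes: Dict[str, int],
    borrows_from_caller: Iterable[str],
) -> Dict[str, int]:
    """Bytes of stack each function allocates itself.

    Functions in borrows_from_caller carry the (hypothetical) attribute
    and allocate nothing; their callers reserve the space for them.
    Assumes the call graph has no cycles.
    """
    borrowers = set(borrows_from_caller)

    def need(fn: str) -> int:
        # Own locals plus the largest reservation required by any
        # attributed callee.
        reserved = [need(c) for c in calls[fn] if c in borrowers]
        return local_bytes[fn] + max(reserved, default=0)

    return {fn: 0 if fn in borrowers else need(fn) for fn in calls}
```

With foo needing 8 bytes of its own and bar needing 4, marking bar gives foo a 12-byte frame and bar none, so bar contains no stack adjustment code.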

Pete

vivek pandya via llvm-dev

Mar 24, 2016, 4:09:10 PM
to Quentin Colombet, llvm-dev


On Thu, Mar 24, 2016 at 3:29 AM, Quentin Colombet <qcol...@apple.com> wrote:
> My mistake, I thought you had in mind what I call a “true” inter-procedural register allocator: one that changes the allocation at function boundaries as well. I.e., it may decide that it is more efficient to put the first argument of function foo in register FP0 even if the ABI says R0.
> With this kind of scheme, you break the ABI (and you need LTO to be allowed to do that), you need to “dynamically” adjust the calling convention to what the register allocator chooses, and moreover you need to be able to communicate to the other tools (dynamic linker, debugger, etc.) the location of the things that are usually defined by the ABI, like the frame pointer, the return value, etc.

I feel that, as I don't have exposure to LTO in LLVM (for GCC I have the basic idea), I should not attempt a true inter-procedural register allocator at this stage; instead, minimizing spill code by propagating callees' register usage to their callers would be enough.

vivek pandya via llvm-dev

Mar 24, 2016, 4:23:57 PM
to Pete Cooper, llvm-dev


On Fri, Mar 25, 2016 at 1:30 AM, Pete Cooper <peter_...@apple.com> wrote:
>> A few things to keep an eye on:
>> - The twist here could be that the bottom-up order conflicts with the layout order, so the two optimizations would have to run independently. (I have not looked into the layout algorithm, so this might not be an actual issue here.)
> Layout is just the order in which functions reach the AsmPrinter, so you’re right that this is going to change the order of the function output.  If we care about the order, which we may, then we’d need to cache the data in the AsmPrinter and reorder it there somehow.

Pete, do you mean caching function-order-related data in the AsmPrinter?

> Some bonus features that come from codegen on the call graph, and specifically from having accurate regmasks and similar information:
> - The X86 VZeroUpper pass should insert fewer VZeroUpper instructions before calls, and could possibly even learn that the state of vzeroupper is known after the call.
> - Values left in registers by the callee can be used by the caller instead of being loaded again.
>
> The second one here is fun.  Imagine this pseudo code:
>
> foo:
> r0 = 1000
> ret
>
> bar:
> call foo
> vreg1 = vreg2 + 1000
>
> You know which registers contain which values after the call to foo.  In this case you know that the value 1000 is already available in a register, so you can avoid loading it for use in the add.  You could have other values in registers too, even those which are passed in to foo.  The ‘this’ pointer is the best example, as it’s very likely that r0 still contains the this pointer after a call to a function which didn’t overwrite r0 for the return.

The above-mentioned case is interesting and useful; perhaps a simple analysis pass that returns a map from values to registers would help.
> One more, just for fun: Inter-procedural stack allocation.  That is, foo calls bar, and bar needs 4 bytes of stack space.  Instead of bar allocating 4 bytes, it adds an attribute to itself, then foo allocates 4 bytes of space at the bottom of its stack for bar to use.

Can you please provide some links that explain the benefits of IP stack allocation?

I have also written the draft proposal; I will share it through the Summer of Code site.
It is not a very strong proposal yet, but I would still like to give it a try. Please review it quickly; I have 23 hours to submit the final PDF.

Thanks !
Vivek

Pete Cooper via llvm-dev

Mar 24, 2016, 4:33:13 PM
to vivek pandya, llvm-dev
Hi Vivek
Yeah, exactly.  So if the module has [foo, bar] in that order, but you compile them as [bar, foo] because of SCC, then you may want to somehow reorder them in the AsmPrinter so that they are emitted as [foo, bar] again.

Of course this *shouldn’t* matter (I can’t think of a case where it would matter), but for ease of debugging at least, it is nice to have functions emitted in the same order as they are in the IR.
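
As a toy illustration of that buffering idea (plain Python, nothing to do with the real AsmPrinter interface; `emit_in_module_order` and the `codegen` callback are made-up names), the output of each function can be held back until everything has been compiled and then replayed in module order:

```python
from typing import Callable, Dict, List


def emit_in_module_order(
    module_order: List[str],
    compile_order: List[str],
    codegen: Callable[[str], str],
) -> List[str]:
    """Compile in compile_order, but return the output in module_order."""
    buffered: Dict[str, str] = {}
    for fn in compile_order:
        # Callees first (e.g. the SCC order), output held back for now.
        buffered[fn] = codegen(fn)
    # Replay the cached output in the order the functions have in the IR.
    return [buffered[fn] for fn in module_order]
```

The cost is that the output of every function has to stay in memory until the last one has been compiled.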


> The above-mentioned case is interesting and useful; perhaps a simple analysis pass that returns a map from values to registers would help.
Yeah, I think it could be interesting.  Of course, one of the interesting things is deciding when it’s more profitable not to use the map.  You would not, for example, choose to reserve a register containing a constant for a long time, as it would almost always be cheaper to just regenerate the constant when needed.  But a constant used very soon after a call may still be useful.

Can you please provide some links to understand the benefits of IP stack allocation?
I actually don’t have any links.  It’s just something I thought about implementing a while ago.  The main benefits I can think of are saving code size and performance, as ‘bar’ in my example would not contain any stack manipulation code.

I have also written the draft proposal; I will share it through the summer of code site.
It may not be very effective, but I would still like to give it a try. Please review it quickly; I have 23 hours to submit the final PDF.
I just read it.  It looks good to me, although I’m not a register allocator or SCC expert, so hopefully others will have good feedback for you.


Thanks
Pete

Mehdi Amini via llvm-dev

Mar 24, 2016, 4:55:25 PM
to Pete Cooper, llvm-dev, vivek pandya
This is something that can be performed with a module pass at the IR level, right? The codegen would just have to be taught to recognize the attribute.

-- 
Mehdi

Mehdi Amini via llvm-dev

Mar 24, 2016, 4:59:38 PM
to Gerolf Hoflehner, llvm-dev, vivek pandya
On Mar 23, 2016, at 6:38 PM, Gerolf Hoflehner via llvm-dev <llvm...@lists.llvm.org> wrote:


On Mar 23, 2016, at 2:59 PM, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:


On Mar 23, 2016, at 2:44 PM, vivek pandya <vivekv...@gmail.com> wrote:



Vivek Pandya


On Wed, Mar 23, 2016 at 10:18 PM, Quentin Colombet <qcol...@apple.com> wrote:
The pass manager already has support for call graph connected regions IIRC.
If I am not wrong, Quentin and Mehdi Amini refer to CallGraphSCCPass.cpp

Yes.

As for the regmask part, we probably could hack something up in a week or so, but I believe this is not what Vivek had in mind.

The choice of which operands to keep in registers across function calls should be justified, and for that we can draw on some research work (some of which I mentioned previously; I have to read it again. Please suggest some more relevant papers). Once that is implemented, we can update the regmask for a call instruction to indicate which registers are free to be used. Am I going in the correct direction?

I do not know if there is a paper on this as this is quite trivial, but IIRC Open64 register allocator does that.
Anyhow, the algo is:
Given a call graph SCC
- Allocate the function with no calls or where each callee has been allocated
- Propagate the clobbered registers to the callers of that function by updating the related regmasks on the callsites.
Repeat until there are no more candidates.
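
As a rough illustration of the algorithm above (a toy model in Python, not LLVM code): each function's own register usage is given as an input that stands in for the allocator's result, and the names `propagate_clobbers`, `calls` and `local_clobbers` are made up for this sketch. Functions caught in a recursive cycle never become candidates here; a real implementation would presumably fall back to the calling convention's default clobber set for them.

```python
def propagate_clobbers(calls, local_clobbers):
    """Return a map from function name to the set of registers a call to it clobbers.

    calls:          function name -> list of callee names (hypothetical call graph)
    local_clobbers: function name -> registers written by the function body itself
                    (stand-in for what the register allocator actually used)
    """
    effective = {}
    pending = set(calls)
    while pending:
        # Candidates: functions with no calls, or whose callees are all done.
        ready = [f for f in pending if all(c in effective for c in calls[f])]
        if not ready:
            # Only recursive functions remain; leave them without a mask.
            break
        for f in sorted(ready):
            regs = set(local_clobbers[f])
            for callee in calls[f]:
                # The caller's mask includes everything its callees clobber.
                regs |= effective[callee]
            effective[f] = regs
            pending.remove(f)
    return effective


calls = {"leaf": [], "mid": ["leaf"], "top": ["mid", "leaf"]}
local = {"leaf": {"r0"}, "mid": {"r1"}, "top": {"r2"}}
masks = propagate_clobbers(calls, local)
```

In this model `masks["mid"]` contains r0 and r1, so a caller of mid could keep a value in r2 or r3 across the call without spilling it.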

Right direction overall. The simplest approach to this is feasible within a summer and should definitely give you good results when you have cases of hot calls with many spills/fills around them that could be eliminated.

One does not necessarily need the call graph. The compiler can do this as an opportunistic optimization. The callee collects a resource mask and the caller consumes it when it is “there”. Within a module, when the callee (“leaf”) is compiled before the caller, the information is “there”. When the call graph is available, you want a bottom-up walk for this optimization.

A few things to keep an eye on:
- The twist here could be that the bottom-up order conflicts with the layout order, so the two optimizations would have to run independently. (I have not looked into the layout algorithm, so this might not be an actual issue here.)

Don't we have the linker reorganizing the layout? 
Or is your comment just targeting "section based" object files without -ffunction-sections?


- You also need to consider the supported preemption model. When a function can be preempted dynamically the statically collected information for a callee cannot be used and the optimization may not kick in. 

We could only do it on private/internal functions anyway, which are not interposable, unless I missed something?


- Most of the work I would expect to be tuning the assignment heuristics in the allocator (a live range that spans two call sites: should it go into a scratch register that is not used in one call but is used in the other? How could profile data change that? etc). But again, perhaps the cheapest approach is not to go into the heuristics and only remove a scratch register fill/spill around a call site when that register is not destroyed anywhere down in the call tree.
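
The cheapest approach described in the last point can be pictured as a set intersection. This is only a sketch: `registers_to_save` and its parameters are invented for illustration, and passing `None` for the callee's mask models the case where no static information may be used (for example an interposable callee), so the default calling convention applies.

```python
def registers_to_save(live_across_call, caller_saved, callee_clobbers=None):
    """Return the registers that need a spill/fill around one call site.

    live_across_call: registers holding values that are live after the call
    caller_saved:     scratch registers under the default calling convention
    callee_clobbers:  registers destroyed anywhere down the call tree,
                      or None when nothing is known about the callee
    """
    if callee_clobbers is None:
        # No information: assume every scratch register is destroyed.
        callee_clobbers = caller_saved
    return live_across_call & caller_saved & callee_clobbers


scratch = {"r0", "r1", "r2", "r3"}
live = {"r0", "r1", "r4"}
without_info = registers_to_save(live, scratch)
with_info = registers_to_save(live, scratch, {"r0"})
```

With the callee's mask available, only r0 is saved in this example; without it, both r0 and r1 are.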

How would these calls be different from any other instruction that clobbers a (set of) fixed register(s)? I'd expect it should already be handled (albeit maybe not tuned) by the current infrastructure.

-- 
Mehdi

Pete Cooper via llvm-dev

Mar 24, 2016, 5:00:15 PM
to Mehdi Amini, llvm-dev, vivek pandya
On Mar 24, 2016, at 1:55 PM, Mehdi Amini <mehdi...@apple.com> wrote:

One more, just for fun: Inter-procedural stack allocation.  That is, if foo calls bar and bar needs 4 bytes of stack space, then instead of bar allocating 4 bytes, it adds an attribute to itself, and foo allocates 4 bytes of space at the bottom of the stack for bar to use.

This is something that can be performed with a module pass at the IR level, right? The codegen would just have to be taught to recognize the attribute.
I thought you would need to run codegen to get a specific number of bytes to allocate.  You would compile bar, note down how many bytes of stack it would have required, then add that as an attribute.  The IR level could only make a good guess as to how many bytes we need.

Saying that, this is basically like having a compiler-controlled redzone.  That’s what made me think of it in the first place.  If bar needed only 4 bytes, and the system supports a redzone, then it’s likely bar wouldn’t have allocated anything on the stack.  I just extended that so that the number of bytes is able to be larger than the number redzones typically provide.
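
A minimal sketch of the bookkeeping this implies, assuming leaf callees whose stack needs are already known from codegen (the function name and parameters are hypothetical). Since only one callee runs at a time, the caller only needs to reserve room for the largest request.

```python
def caller_frame_size(own_bytes, callee_scratch_bytes):
    """Bytes the caller allocates: its own frame plus a shared scratch area.

    own_bytes:            stack bytes the caller needs for itself
    callee_scratch_bytes: bytes each callee asked its caller to provide
                          (the attribute recorded after compiling the callee)
    """
    # Callees run one after another, so they can share the same area.
    return own_bytes + max(callee_scratch_bytes, default=0)


foo_frame = caller_frame_size(16, [4, 8])
```

Here foo would allocate 24 bytes once, and neither callee would need any stack adjustment of its own.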

Cheers,
Pete

-- 
Mehdi

Mehdi Amini via llvm-dev

Mar 24, 2016, 5:01:21 PM
to Quentin Colombet, llvm-dev, vivek pandya
On Mar 23, 2016, at 2:59 PM, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:


On Mar 23, 2016, at 2:44 PM, vivek pandya <vivekv...@gmail.com> wrote:



Vivek Pandya


On Wed, Mar 23, 2016 at 10:18 PM, Quentin Colombet <qcol...@apple.com> wrote:
The pass manager already has support for call graph connected regions IIRC.
If I am not wrong, Quentin and Mehdi Amini refer to CallGraphSCCPass.cpp

Yes.

As for the regmask part, we probably could hack something up in a week or so, but I believe this is not what Vivek had in mind.

The choice of which operands to keep in registers across function calls should be justified, and for that we can draw on some research work (some of which I mentioned previously; I have to read it again. Please suggest some more relevant papers). Once that is implemented, we can update the regmask for a call instruction to indicate which registers are free to be used. Am I going in the correct direction?

I do not know if there is a paper on this as this is quite trivial, but IIRC Open64 register allocator does that.
Anyhow, the algo is:
Given a call graph SCC
- Allocate the function with no calls or where each callee has been allocated
- Propagate the clobbered registers to the callers of that function by updating the related regmasks on the callsites.
Repeat until there are no more candidates.

Here is the patch that I believe achieves the first point.

CGCCC.diff

Mehdi Amini via llvm-dev

Mar 24, 2016, 5:02:05 PM
to Pete Cooper, llvm-dev, vivek pandya
On Mar 24, 2016, at 1:59 PM, Pete Cooper <peter_...@apple.com> wrote:


On Mar 24, 2016, at 1:55 PM, Mehdi Amini <mehdi...@apple.com> wrote:

One more, just for fun: Inter-procedural stack allocation.  That is, if foo calls bar and bar needs 4 bytes of stack space, then instead of bar allocating 4 bytes, it adds an attribute to itself, and foo allocates 4 bytes of space at the bottom of the stack for bar to use.

This is something that can be performed with a module pass at the IR level, right? The codegen would just have to be taught to recognize the attribute.
I thought you would need to run codegen to get a specific number of bytes to allocate.  You would compile bar, note down how many bytes of stack it would have required, then add that as an attribute.  The IR level could only make a good guess as to how many bytes we need.

OK, makes sense. Thanks.

-- 
Mehdi

Adam Husar via llvm-dev

Apr 10, 2016, 11:10:14 AM
to Mehdi Amini, llvm...@lists.llvm.org
Hello Mehdi,

Could I ask you to share your patch?
We are testing whether at least some leaf call register allocation optimization could help, and your patch or some pointers on what to do would be very useful.

Thank you
Adam


On Wed, 23 Mar 2016 05:04:41 +0100, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:

>
>> On Mar 22, 2016, at 6:04 PM, Matthias Braun via llvm-dev <llvm...@lists.llvm.org> wrote:
>>
>> No need to apologize this thread surely deserved some answers :)
>>
>> From my perspective this project sounds doable. I would expect the register allocation parts to be not too hard: I imagine this being just distilling a new clobber regmask after allocating a function. I would expect the challenging (or annoying) part to get a machine module pass (or a similar mechanism to influence the order in which functions are processed) and a callgraph in the backend.
>
> I have a very tiny patch that wraps the backend in a CGSCC pass manager, which will achieve what is needed here I believe: i.e. running CodeGen for every callee before any caller.
> I can rebase it if anyone is interested.

Mehdi Amini via llvm-dev

Apr 10, 2016, 1:13:35 PM
to Adam Husar, llvm...@lists.llvm.org
Hi,

Here is the patch (I already sent it in this thread):

CGCCC.diff