[llvm-dev] Tail call optimization is getting affected by local-function-related optimization with IPRA


vivek pandya via llvm-dev

Jun 25, 2016, 1:33:24 PM
to llvm-dev, Matthias Braun
Hello LLVM Community,

To improve Interprocedural Register Allocation (IPRA) we are trying to force caller-saved
registers for local functions (functions with local linkage). To achieve this
I have modified TargetFrameLowering::determineCalleeSaves() to return early for
functions that satisfy if (F->hasLocalLinkage() && !F->hasAddressTaken()). To also
reflect the fact that such a local function has no callee-saved registers,
I am changing RegUsageInfoCollector.cpp so that it does not mark the calling
convention's callee-saved registers as preserved in the RegMask, with the following change in code:

if (!F->hasLocalLinkage() || F->hasAddressTaken()) {
  const uint32_t *CallPreservedMask =
      TRI->getCallPreservedMask(MF, MF.getFunction()->getCallingConv());
  // Set callee saved registers as preserved.
  for (unsigned i = 0; i < RegMaskSize; ++i)
    RegMask[i] |= CallPreservedMask[i];
}
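For context, the early return in determineCalleeSaves() described above might look roughly like this (a sketch rather than the exact patch; the resize and the trailing logic mirror the existing default implementation):

void TargetFrameLowering::determineCalleeSaves(MachineFunction &MF,
                                               BitVector &SavedRegs,
                                               RegScavenger *RS) const {
  const TargetRegisterInfo &TRI = *MF.getSubtarget().getRegisterInfo();
  // Keep the bit vector sized as callers expect, but leave it all zero for
  // local, non-address-taken functions: they get no callee-saved registers.
  SavedRegs.resize(TRI.getNumRegs());
  const Function *F = MF.getFunction();
  if (F->hasLocalLinkage() && !F->hasAddressTaken())
    return;
  // ... existing logic that sets the bits for the CC's callee-saved registers ...
}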

For more details, please see the following link.

Now consider the following bug caused by forcing caller-saved registers for a local
function when IPRA is enabled:

void makewt(int nw, int *ip, double *w) {
  ...
  bitrv2(nw, ip, w);
}

Here bitrv2 is a local function, so when IPRA is enabled its callee-saved registers
are set to none. The following is the set of clobbered registers for that function,
as per the regmask collected by the RegUsageInfoCollector pass.

Function Name : bitrv2
Clobbered Registers:
AH AL AX BH BL BP BPL BX CH CL CX DI DIL EAX EBP EBX ECX EDI EFLAGS ESI ESP RAX
RBP RBX RCX RDI RSI RSP SI SIL SP SPL R8 R9 R10 R11 R12 R13 R14 R15 R8B R9B R10B
R11B R12B R13B R14B R15B R8D R9D R10D R11D R12D R13D R14D R15D R8W R9W R10W R11W
R12W R13W R14W R15W

However, the caller of bitrv2, makewt, does have callee-saved registers as per its CC.
This code results in a segmentation fault when compiled with -O1: makewt keeps the
value of *ip in R14, which it saves on entry and restores on return, but due to tail
call optimization the following code is generated, and bitrv2 does not preserve R14.
So when execution returns to main (the caller of makewt), the value of *ip is gone
from R14 (which it should not be), and when main calls makewt again the value of *ip
(in R14) is wrong, resulting in a segmentation fault.

Assembly code of makewt:
  _makewt:
  ...
  popq  %rbx
  popq  %r12
  popq  %r13
  popq  %r14
  popq  %r15
  popq  %rbp
  jmp _bitrv2                 ## TAILCALL

There is one more case of failure due to the local-function-related optimization.
I am analyzing it (sorry for taking more time, but I am not very good at assembly).

I need some hints on how to solve this. If you see a problem with my analysis,
please let me know; I can send the generated .s file and the source .c file if you want.

Sincerely,
Vivek

Eli Friedman via llvm-dev

Jun 25, 2016, 3:49:43 PM
to vivek pandya, llvm-dev, Matthias Braun
If you're tail calling bitrv2 from makewt, bitrv2 has to preserve the value of all the registers which the caller of makewt expects it to preserve, i.e. the callee-save registers according to makewt's calling convention.  Otherwise, you're just going to get nonsensical results, like you've discovered.

There aren't really that many options here... you can suppress tail call optimization in certain cases, you can suppress your optimization for functions which are tail-called, or you can redirect tail calls to a different entry point from normal calls.  Which of those options is best probably depends on the specific code you're trying to optimize.  (You aren't allowed to suppress tail call optimization for calls marked musttail.)

More generally, it's not clear that shoving all the spill code from the callee into the callers is always beneficial; in some cases, you could end up with a substantial code-size increase without reducing the number of instructions executed at runtime.  So you probably need to analyze the callers somehow anyway.

-Eli

vivek pandya via llvm-dev

Jun 25, 2016, 3:56:15 PM
to llvm-dev, Matthias Braun
A very naive solution that comes to mind is to convert the above code to the following:

 _makewt:
  ...
  jmp _bitrv2                 ## TAILCALL
  popq  %rbx
  popq  %r12
  popq  %r13
  popq  %r14
  popq  %r15
  popq  %rbp
 
So that when _bitrv2 returns, the caller will overwrite the callee-saved registers (as per that function's CC) with the correct values.
I wanted to try this out, but I have not been able to find the right place in the code to do it.
-Vivek

vivek pandya via llvm-dev

Jun 26, 2016, 12:35:51 PM
to llvm-dev, Matthias Braun
According to http://llvm.org/docs/CodeGenerator.html#tail-call-section, adding a new CC for the purpose of the local-function optimization seems like a good idea, because tail call optimization only takes place when both caller and callee have the fastcc, GHC, or HiPE calling convention.

-Vivek

vivek pandya via llvm-dev

Jun 27, 2016, 12:25:29 PM
to llvm-dev, Matthias Braun
Hello,

To solve this bug locally, I have given preference to tail call optimization over the local-function-related optimization in IPRA. I have added the following method to achieve this:

bool isEligibleForTailCallOptimization(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  if (CC == CallingConv::Fast || CC == CallingConv::GHC || CC == CallingConv::HiPE)
    return true;
  return false;
}

Any other suggestions are always welcome.

I am checking this condition along with hasLocalLinkage() and hasAddressTaken().
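Concretely, the guard from my first mail would then become roughly the following (a sketch of the combined check, not the exact patch):

  if (!F->hasLocalLinkage() || F->hasAddressTaken() ||
      isEligibleForTailCallOptimization(F)) {
    const uint32_t *CallPreservedMask =
        TRI->getCallPreservedMask(MF, MF.getFunction()->getCallingConv());
    // Keep the calling convention's callee-saved registers marked as preserved
    // whenever the function might be tail called, so tail calls stay safe.
    for (unsigned i = 0; i < RegMaskSize; ++i)
      RegMask[i] |= CallPreservedMask[i];
  }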

With this change, the test-suite now has only 2 runtime failures, whereas before there were around 43 due to the local-function-related optimization. But of course, by giving preference to tail calls, many opportunities for IPRA optimization are missed.

The 2 remaining failures are interesting and hard to debug, namely MultiSource/Applications/SPASS/SPASS and MultiSource/Applications/sqlite3/sqlite3.

In the test-suite, sqlite3 is only compiled as a CLI program and thus contains only 2 (big) source files. The sqlite3 source code has a few static functions with a variable number of arguments; such static functions should not get tail-call optimized, so the IPRA optimization could be applied to them. I am digging deeper into this to find the reason for the failures.

Sincerely,
Vivek

vivek pandya via llvm-dev

Jun 28, 2016, 1:51:57 AM
to llvm-dev, Matthias Braun
On Mon, Jun 27, 2016 at 9:55 PM, vivek pandya <vivekv...@gmail.com> wrote:
Hello ,

To solve this bug locally I have given preference to tail call optimization over local function related optimization in IPRA. I have added following method to achieve this:

bool isEligibleForTailCallOptimization(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  if (CC == CallingConv::Fast || CC == CallingConv::GHC || CC == CallingConv::HiPE)
    return true;
  return false;
I have changed the above code to:
  auto Attr = F->getFnAttribute("disable-tail-calls");
  if (Attr.getValueAsString() == "true")
    return false;
  return true;
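Put together, the revised helper reads roughly as follows (same semantics as the fragment above, keeping the original name):

bool isEligibleForTailCallOptimization(Function *F) {
  // Treat a function as a tail-call candidate unless tail calls are explicitly
  // disabled for it via the "disable-tail-calls" attribute.
  auto Attr = F->getFnAttribute("disable-tail-calls");
  return Attr.getValueAsString() != "true";
}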

This disables the local-function-related optimization for functions which may get tail called, but with either of the above checks most of the opportunities are missed; for example, sqlite3's module sqlite3 has more than 500 static functions, yet none of them gets the local-function optimization. Also, according to http://llvm.org/docs/CodeGenerator.html#tail-call-section,
on x86 and PowerPC, when generating PIC, functions with local linkage are considered for tail call optimization, but this may not be the case on other architectures.
After adding the above check I have noticed some improvements in the test-suite (no failures), but this needs to be benchmarked again (especially against the 43 failures reported before, because those test cases certainly contain local functions, and I need to check whether any of them is actually getting optimized). Otherwise, tail calls and the local-function-related optimization cannot be used together.
So just optimizing functions with local linkage is not a good idea; instead, the decision should be based on some call-frequency analysis, and in such cases, if required, we can suppress tail call elimination to give preference to moving spill code.

I am looking into "Register Allocation Across Procedure and Module Boundaries" (http://dl.acm.org/citation.cfm?id=93551). This paper extends a very similar idea of propagating register usage information through the call graph as functions are compiled in bottom-up order. I am trying to find a way to push spill code out of frequently called procedures so that we can improve runtime.
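As a purely illustrative sketch of what a call-frequency-based criterion could look like (the helper name and the threshold are hypothetical, and it assumes profile data is available through Function::getEntryCount()):

// Hypothetical heuristic: force caller-saved registers for a local function
// only when profile data says it runs much more often than all of its direct
// callers combined, so hoisting its spill code into the callers is likely a win.
static bool spillHoistingLikelyProfitable(const Function &F) {
  auto CalleeCount = F.getEntryCount();    // profile-based entry count
  if (!CalleeCount)
    return false;                          // no profile data: be conservative
  uint64_t CallerCounts = 0;
  for (const Use &U : F.uses()) {
    ImmutableCallSite CS(U.getUser());
    if (!CS || !CS.isCallee(&U))
      continue;                            // not a direct call of F
    const Function *Caller = CS.getInstruction()->getParent()->getParent();
    if (auto CallerCount = Caller->getEntryCount())
      CallerCounts += *CallerCount;
  }
  return *CalleeCount > 4 * CallerCounts;  // arbitrary threshold
}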

Sincerely,
Vivek

Mehdi Amini via llvm-dev

Jun 28, 2016, 10:41:54 AM
to vivek pandya, llvm-dev, Matthias Braun
On Jun 27, 2016, at 12:25 PM, vivek pandya <vivekv...@gmail.com> wrote:

Hello ,

To solve this bug locally I have given preference to tail call optimization over local function related optimization in IPRA. I have added following method to achieve this:

bool isEligibleForTailCallOptimization(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  if (CC == CallingConv::Fast || CC == CallingConv::GHC || CC == CallingConv::HiPE)
    return true;
  return false;
}

Any other suggestions are always welcomed.

Why aren't you checking for the presence of a tail call?

— 
Mehdi

vivek pandya via llvm-dev

Jun 28, 2016, 12:53:35 PM
to Mehdi Amini, llvm-dev, Matthias Braun
On Tue, Jun 28, 2016 at 8:11 PM, Mehdi Amini <mehdi...@apple.com> wrote:

On Jun 27, 2016, at 12:25 PM, vivek pandya <vivekv...@gmail.com> wrote:

Hello ,

To solve this bug locally I have given preference to tail call optimization over local function related optimization in IPRA. I have added following method to achieve this:

bool isEligibleForTailCallOptimization(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  if (CC == CallingConv::Fast || CC == CallingConv::GHC || CC == CallingConv::HiPE)
    return true;
  return false;
}

Any other suggestions are always welcomed.

Why aren’t checking for the presence of a tail call?
Are you asking whether tail call optimization is enabled or not? If not, then the above method is inspired by X86ISelLowering::canGuaranteeTCO().

-Vivek

Mehdi Amini via llvm-dev

Jun 28, 2016, 1:09:35 PM
to vivek pandya, llvm-dev, Matthias Braun


Sent from my iPhone

On Jun 28, 2016, at 12:53 PM, vivek pandya <vivekv...@gmail.com> wrote:



On Tue, Jun 28, 2016 at 8:11 PM, Mehdi Amini <mehdi...@apple.com> wrote:

On Jun 27, 2016, at 12:25 PM, vivek pandya <vivekv...@gmail.com> wrote:

Hello ,

To solve this bug locally I have given preference to tail call optimization over local function related optimization in IPRA. I have added following method to achieve this:

bool isEligibleForTailCallOptimization(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  if (CC == CallingConv::Fast || CC == CallingConv::GHC || CC == CallingConv::HiPE)
    return true;
  return false;
}

Any other suggestions are always welcomed.

Why aren’t checking for the presence of a tail call?
Are you asking about if tail call optimization is enable or not?  If not then above method is inspired from X86ISelLowering::canGuaranteeTCO(). 


Are we turning calls into tail calls during codegen?
My assumption is that tail calls are inferred on the IR, so you can inspect every *call site*.

Mehdi

Matthias Braun via llvm-dev

Jun 28, 2016, 2:27:45 PM
to Mehdi Amini, llvm-dev, vivek pandya
On Jun 28, 2016, at 10:09 AM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:



Sent from my iPhone

On Jun 28, 2016, at 12:53 PM, vivek pandya <vivekv...@gmail.com> wrote:



On Tue, Jun 28, 2016 at 8:11 PM, Mehdi Amini <mehdi...@apple.com> wrote:

On Jun 27, 2016, at 12:25 PM, vivek pandya <vivekv...@gmail.com> wrote:

Hello ,

To solve this bug locally I have given preference to tail call optimization over local function related optimization in IPRA. I have added following method to achieve this:

bool isEligibleForTailCallOptimization(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  if (CC == CallingConv::Fast || CC == CallingConv::GHC || CC == CallingConv::HiPE)
    return true;
  return false;
}

Any other suggestions are always welcomed.

Why aren’t checking for the presence of a tail call?
Are you asking about if tail call optimization is enable or not?  If not then above method is inspired from X86ISelLowering::canGuaranteeTCO(). 


Are we turning calls into tail calls during codegen?
My assumption is that tail call is inferred on the IR, so you can inspect every *call site*.
The final decision on whether to tail call or not is done during instruction selection (it is part of X86TargetLowering::LowerCall()/IsEligibleForTailCallOptimization() for example).
The decision there also takes the callee-saved registers into account, but as you can see it uses TRI->getCallPreservedMask(), which is different from what the IPRA analysis computes. Maybe it would be best to somehow get the computed regmask into LowerCall() and IsEligibleForTailCallOptimization() instead of changing the regmask in a separate pass afterwards?
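A rough sketch of that direction (the query for the IPRA-computed mask is hypothetical; regmaskSubsetEqual() is an existing TargetRegisterInfo helper, and the variable names loosely follow the X86 lowering code):

  // Inside IsEligibleForTailCallOptimization(): compare what the caller's own
  // callers expect to be preserved with what the callee actually preserves,
  // preferring the IPRA-computed regmask when one is available.
  const uint32_t *CallerPreserved =
      TRI->getCallPreservedMask(MF, CallerF->getCallingConv());
  const uint32_t *CalleePreserved = TRI->getCallPreservedMask(MF, CalleeCC);
  if (const uint32_t *IPRAMask = getIPRAComputedRegMask(CalleeF)) // hypothetical query
    CalleePreserved = IPRAMask;
  if (!TRI->regmaskSubsetEqual(CallerPreserved, CalleePreserved))
    return false; // the tail call would clobber registers the caller must keep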

- Matthias

Mehdi Amini via llvm-dev

Jun 28, 2016, 2:34:55 PM
to Matthias Braun, llvm-dev, vivek pandya


Sent from my iPhone

On Jun 28, 2016, at 2:27 PM, Matthias Braun <ma...@braunis.de> wrote:


On Jun 28, 2016, at 10:09 AM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:



Sent from my iPhone

On Jun 28, 2016, at 12:53 PM, vivek pandya <vivekv...@gmail.com> wrote:



On Tue, Jun 28, 2016 at 8:11 PM, Mehdi Amini <mehdi...@apple.com> wrote:

On Jun 27, 2016, at 12:25 PM, vivek pandya <vivekv...@gmail.com> wrote:

Hello ,

To solve this bug locally I have given preference to tail call optimization over local function related optimization in IPRA. I have added following method to achieve this:

bool isEligibleForTailCallOptimization(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  if (CC == CallingConv::Fast || CC == CallingConv::GHC || CC == CallingConv::HiPE)
    return true;
  return false;
}

Any other suggestions are always welcomed.

Why aren’t checking for the presence of a tail call?
Are you asking about if tail call optimization is enable or not?  If not then above method is inspired from X86ISelLowering::canGuaranteeTCO(). 


Are we turning calls into tail calls during codegen?
My assumption is that tail call is inferred on the IR, so you can inspect every *call site*.
The final decision on whether to tail call or not is done during instruction selection (it is part of X86TargetLowering::LowerCall()/IsEligibleForTailCallOptimization() for example).

Sorry, I don't have access to the source code right now. Can you clarify whether the backend can tail call when the IR didn't mark the call as such, or whether what you're referring to is "not honoring the tail call from the IR and demoting it to a normal call"?

-- 
Mehdi

Matthias Braun via llvm-dev

Jun 28, 2016, 3:01:59 PM
to Mehdi Amini, llvm-dev, vivek pandya

On Jun 28, 2016, at 11:34 AM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:



Sent from my iPhone

On Jun 28, 2016, at 2:27 PM, Matthias Braun <ma...@braunis.de> wrote:


On Jun 28, 2016, at 10:09 AM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:



Sent from my iPhone

On Jun 28, 2016, at 12:53 PM, vivek pandya <vivekv...@gmail.com> wrote:



On Tue, Jun 28, 2016 at 8:11 PM, Mehdi Amini <mehdi...@apple.com> wrote:

On Jun 27, 2016, at 12:25 PM, vivek pandya <vivekv...@gmail.com> wrote:

Hello ,

To solve this bug locally I have given preference to tail call optimization over local function related optimization in IPRA. I have added following method to achieve this:

bool isEligibleForTailCallOptimization(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  if (CC == CallingConv::Fast || CC == CallingConv::GHC || CC == CallingConv::HiPE)
    return true;
  return false;
}

Any other suggestions are always welcomed.

Why aren’t checking for the presence of a tail call?
Are you asking about if tail call optimization is enable or not?  If not then above method is inspired from X86ISelLowering::canGuaranteeTCO(). 


Are we turning calls into tail calls during codegen?
My assumption is that tail call is inferred on the IR, so you can inspect every *call site*.
The final decision on whether to tail call or not is done during instruction selection (it is part of X86TargetLowering::LowerCall()/IsEligibleForTailCallOptimization() for example).

Sorry i don't have access to the source code right now, can you clarify if the backend can tail call when the IR didn't mark the call as such, or if what you're referring to is "not honoring the tail call From the IR and demoting to a normal call?

The backend only emits a tail call if the middle-end marked the call with the "tail" or "musttail" marker. But that happens for most calls; we can only really transform a fraction of those into real tail calls later.

- Matthias

Mehdi Amini via llvm-dev

Jun 28, 2016, 3:05:14 PM
to Matthias Braun, llvm-dev, vivek pandya
Thanks, so back to my original point: if we have to disable the CSR optimization on functions that “may be tail called”, it would still be better IMO to do something like `llvm::any_of(callsites, isTailCall)` instead of IsEligibleForTailCallOptimization().

— 
Mehdi

vivek pandya via llvm-dev

Jun 29, 2016, 12:21:17 PM
to Mehdi Amini, llvm-dev, Matthias Braun
I have tried out the following code, which examines each call site in the module for a tail call and does not perform the optimization in that case:

vivek pandya via llvm-dev

Jun 29, 2016, 12:25:48 PM
to Mehdi Amini, llvm-dev, Matthias Braun
On Wed, Jun 29, 2016 at 9:51 PM, vivek pandya <vivekv...@gmail.com> wrote:
I have tried out the following code, which examines each call site in the module for a tail call and does not perform the optimization in that case:
bool RegUsageInfoCollector::isEligibleForTailCallOptimization(Function *F) {
  // Scan every call site in the module; if any of them is a tail call to F,
  // report F as a tail-call candidate.
  const Module *M = F->getParent();
  for (const Function &Fu : *M)
    for (const BasicBlock &BB : Fu)
      for (const Instruction &II : BB) {
        if (auto CS = ImmutableCallSite(&II))
          if (CS.getCalledFunction() == F && CS.isTailCall()) {
            outs() << "Function : " << F->getName() << " is tailCall\n";
            return true;
          }
      }
  return false;
}

This allows many static functions from the sqlite3 code to be candidates for the local-function optimization (which is good compared to the previous attempt), but it fails at runtime. I will analyze that, but before doing so I have a test case which fails at -O2 yet works with -O2 -finline-hint-functions; I am currently analyzing it.
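For comparison, a cheaper variant of the same check (a sketch; it walks only the uses of F instead of every instruction in the module, and the name is made up):

bool RegUsageInfoCollector::mayBeTailCalled(const Function *F) {
  // Look only at F's own uses; return true if any direct call of F is marked
  // as a tail call.
  for (const Use &U : F->uses())
    if (auto CS = ImmutableCallSite(U.getUser()))
      if (CS.isCallee(&U) && CS.isTailCall())
        return true;
  return false;
}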

-Vivek