[llvm-dev] Finding all pointers to functions

Russell Wallace via llvm-dev

unread,

Dec 22, 2015, 12:11:29 AM12/22/15

to llvm-dev

I need to track down all pointers anywhere in a module that could be pointing to functions (because some of the optimizations I want to do, require either identifying every use of a function, or conservatively identifying when such cannot be done).

A starting point is to look at all the global variables:

for (auto &G : M.globals())
    for (auto &V : G.operands())
      if (auto F = dyn_cast<Function>(V))

Of course, instructions can also refer to functions, both as direct calls and otherwise:

for (auto &F : M) {
    for (auto &I : inst_range(F)) {
      for (auto &V : I.operands())
        if (auto F = dyn_cast<Function>(V))

But there are other things as well, for example it seems there is something called a personality function that can be a pointer to another function, so need to add that

    if (F.hasPersonalityFn())

It seems there are other things called prefix data and prologue data, which are pointers to constants, which could contain pointers to functions, so would need to include those as well.

Am I correct in thinking that prefix data and prologue data will not be included in the global variables list, so do need special handling?

Could they be recursive? That is, could those constants contain pointers to other constants... which end up containing pointers to functions... such that none of the intermediate constant objects are in the global variable list?

Is there anything else I'm missing?

Russell Wallace via llvm-dev

unread,

Dec 22, 2015, 4:45:37 AM12/22/15

to llvm-dev

Oh, I just came across Function::hasAddressTaken. Maybe I can just use that instead?

John Criswell via llvm-dev

unread,

Dec 22, 2015, 5:55:45 AM12/22/15

to Russell Wallace, llvm-dev

On 12/22/15 4:45 AM, Russell Wallace via llvm-dev wrote:

Oh, I just came across Function::hasAddressTaken. Maybe I can just use that instead?

You could conservatively assume that any function that has its address taken has a pointer to it that escapes into memory or external code. To make things a little more accurate, you could scan the uses of any function for which hasAddressTaken() returns true and see if any of its uses escapes its function or escapes into memory or external code. I believe hasAddressTaken() returns true if the function is subjected to a cast instruction, and functions are often casted if they are used in a call that uses a different signature than the function's declared signature.

To get anything more accurate, you'll need to use alias analysis or points-to analysis. DSA tracks function pointers in the heap and can tell you whether the function is called from external code. However, DSA's accuracy currently suffers if it is run after LLVM's optimizations, and the code needs some serious TLC.

Regards,

John Criswell

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 
John Criswell
Assistant Professor
Department of Computer Science, University of Rochester
http://www.cs.rochester.edu/u/criswell

Russell Wallace via llvm-dev

unread,

Dec 23, 2015, 2:09:18 AM12/23/15

to John Criswell, llvm-dev

On Tue, Dec 22, 2015 at 10:55 AM, John Criswell <jtcr...@gmail.com> wrote:

You could conservatively assume that any function that has its address taken has a pointer to it that escapes into memory or external code.

Right, that's what I'm doing to start with.

To make things a little more accurate, you could scan the uses of any function for which hasAddressTaken() returns true and see if any of its uses escapes its function or escapes into memory or external code. I believe hasAddressTaken() returns true if the function is subjected to a cast instruction, and functions are often casted if they are used in a call that uses a different signature than the function's declared signature.

I'll look into that. It seems reasonable to guess that the major confounding factor in many C++ programs will be references from virtual function tables; there should be some way to optimize those specifically.

To get anything more accurate, you'll need to use alias analysis or points-to analysis. DSA tracks function pointers in the heap and can tell you whether the function is called from external code. However, DSA's accuracy currently suffers if it is run after LLVM's optimizations, and the code needs some serious TLC.

DSA presumably stands for data structure analysis. TLC = tender loving care? Why does DSA become less accurate if run after optimization?

Vedant Kumar via llvm-dev

unread,

Dec 23, 2015, 12:18:08 PM12/23/15

to Russell Wallace, llvm-dev

Hi Russell,

> I need to track down all pointers anywhere in a module that could be pointing to functions (because some of the optimizations I want to do, require either identifying every use of a function, or conservatively identifying when such cannot be done).
>
> A starting point is to look at all the global variables:
>
> for (auto &G : M.globals())
> for (auto &V : G.operands())
> if (auto F = dyn_cast<Function>(V))
>
> Of course, instructions can also refer to functions, both as direct calls and otherwise:
>
> for (auto &F : M) {
> for (auto &I : inst_range(F)) {
> for (auto &V : I.operands())
> if (auto F = dyn_cast<Function>(V))

Your code is reminiscent of the traversal logic in tools/verify-uselistorder.

There's a chance you might be able to reuse the ValueMapping class from that project.

> But there are other things as well, for example it seems there is something called a personality function that can be a pointer to another function, so need to add that
>
> if (F.hasPersonalityFn())
>
> It seems there are other things called prefix data and prologue data, which are pointers to constants, which could contain pointers to functions, so would need to include those as well.
>
> Am I correct in thinking that prefix data and prologue data will not be included in the global variables list, so do need special handling?

Prefix data, prologue data, and function personalities do not need to be globals.

This will require special handling. The ValueMapping class has good skeleton code for this.

> Could they be recursive? That is, could those constants contain pointers to other constants... which end up containing pointers to functions... such that none of the intermediate constant objects are in the global variable list?

I don't see how this is possible. If it is, we'd need to file a PR against verify-uselistorder.

best
vedant

John Criswell via llvm-dev

unread,

Dec 23, 2015, 12:35:59 PM12/23/15

to Russell Wallace, llvm-dev

DSA was built when LLVM's optimizations maintained the type information on GEP and other instructions (DSA existed before LLVM was open-source). As such, it uses LLVM's type information to aid in its type-inference which, in turn, gives it field sensitivity which, in turn, improves its accuracy. Over time, LLVM optimizations have come to modify the type information so that it is just simple byte-level indexing (as opposed to array-of-structure indexing). DSA hasn't been updated to handle that well. That is why its precision is better pre-optimization than post-optimization.

Just out of curiosity, what are you trying to do? I need call graph analysis for C/C++ code with function pointers, and so I'm writing an NSF proposal to seek funding to do that (among other enhancements to my SVA infrastructure). If it's something that would be useful to you (or other LLVM community members), it would be useful for me to know that.

Regards,

John Criswell

Russell Wallace via llvm-dev

unread,

Dec 23, 2015, 12:55:20 PM12/23/15

to John Criswell, llvm-dev

On Wed, Dec 23, 2015 at 5:35 PM, John Criswell <jtcr...@gmail.com> wrote:

DSA was built when LLVM's optimizations maintained the type information on GEP and other instructions (DSA existed before LLVM was open-source). As such, it uses LLVM's type information to aid in its type-inference which, in turn, gives it field sensitivity which, in turn, improves its accuracy. Over time, LLVM optimizations have come to modify the type information so that it is just simple byte-level indexing (as opposed to array-of-structure indexing). DSA hasn't been updated to handle that well. That is why its precision is better pre-optimization than post-optimization.

Ah! I don't suppose you could point to some examples of this? E.g. a simple test program such that one could eyeball the intermediate code before and after optimization?

Just out of curiosity, what are you trying to do? I need call graph analysis for C/C++ code with function pointers, and so I'm writing an NSF proposal to seek funding to do that (among other enhancements to my SVA infrastructure). If it's something that would be useful to you (or other LLVM community members), it would be useful for me to know that.

SVA?

I'm trying to write a superoptimizer that can optimize code based on a high-level understanding of what it's actually doing, so yes, call graph analysis that can deal with function pointers does seem likely to be one of the things that will be needed.

John Criswell via llvm-dev

unread,

Dec 23, 2015, 1:26:27 PM12/23/15

to Russell Wallace, llvm-dev

On 12/23/15 12:55 PM, Russell Wallace wrote:

On Wed, Dec 23, 2015 at 5:35 PM, John Criswell <jtcr...@gmail.com> wrote:

DSA was built when LLVM's optimizations maintained the type information on GEP and other instructions (DSA existed before LLVM was open-source). As such, it uses LLVM's type information to aid in its type-inference which, in turn, gives it field sensitivity which, in turn, improves its accuracy. Over time, LLVM optimizations have come to modify the type information so that it is just simple byte-level indexing (as opposed to array-of-structure indexing). DSA hasn't been updated to handle that well. That is why its precision is better pre-optimization than post-optimization.

Ah! I don't suppose you could point to some examples of this? E.g. a simple test program such that one could eyeball the intermediate code before and after optimization?

Off the top of my head, no, I don't have an example, but I suspect any program with an array indexing operation with a for loop will do.

Just out of curiosity, what are you trying to do? I need call graph analysis for C/C++ code with function pointers, and so I'm writing an NSF proposal to seek funding to do that (among other enhancements to my SVA infrastructure). If it's something that would be useful to you (or other LLVM community members), it would be useful for me to know that.

SVA?

Sorry. SVA is Secure Virtual Architecture. It's my LLVM-based infrastructure for controlling operating system kernel behavior via compiler instrumentation and hardware configuration. I've used it to build a system that protects applications from a compromised operating system kernel as well as to enforce memory safety and control-flow integrity on operating system kernel code.

I need DSA for doing things like:

1) Creating an accurate call graph for kernel code to enforce better control-flow integrity and to test our future infrastructure for measuring the efficacy of defenses against code reuse attacks.

2) Analyzing the memory accesses of kernel modules to see if they modify kernel data structures that they should not modify (e.g., to find rootkits that modify the process list).

3) For optimizing run-time checks that protect kernel data structure, at run-time, from other kernel components (useful for a number of things).

In short, strong points-to and call graph analysis enable some interesting research projects.

I'm trying to write a superoptimizer that can optimize code based on a high-level understanding of what it's actually doing, so yes, call graph analysis that can deal with function pointers does seem likely to be one of the things that will be needed.

Nice.

One thing you might want to investigate is whether building a call graph analysis off of the TBAA metadata would work. If TBAA works for lots of programs (I hear some non-conformant programs cause it problems), then using it as a springboard for analysis may be effective (as TBAA is already well maintained in the LLVM source tree).

Christian Convey via llvm-dev

unread,

Dec 28, 2015, 10:50:19 AM12/28/15

to John Criswell, llvm-dev

On 12/22/15 4:45 AM, Russell Wallace via llvm-dev wrote:

Oh, I just came across Function::hasAddressTaken. Maybe I can just use that instead?

You could conservatively assume that any function that has its address taken has a pointer to it that escapes into memory or external code.

I wonder if a conservative estimate of pointed-to functions would also need to include all functions with externally visible linkage?

I could imagine a few routes by which that allows such a function to have its address show up in the data being handled by the Module's own code.

John Criswell via llvm-dev

unread,

Dec 28, 2015, 11:32:13 AM12/28/15

to Christian Convey, llvm-dev

On 12/28/15 9:50 AM, Christian Convey wrote:

On 12/22/15 4:45 AM, Russell Wallace via llvm-dev wrote:

Oh, I just came across Function::hasAddressTaken. Maybe I can just use that instead?

You could conservatively assume that any function that has its address taken has a pointer to it that escapes into memory or external code.

I wonder if a conservative estimate of pointed-to functions would also need to include all functions with externally visible linkage?

This is correct. Any function that is externally visible can be called be external code that is outside the analysis's scope; you must therefore treat it is a function that can be called by external code (in LLVM CallGraph parlance, it can be called by the external calling node).

If you're going to do this analysis, you want to do it in libLTO after the Internalize pass has been run. The Internalize pass finds functions which are externally visible (because C linkage needs them to be in order for linking to work) and makes them internal since we know that we will not link with any more LLVM bitcode files. You should get much better results that way.

As an FYI, DSA tracks this sort of information using the External (E) flag on the DSNode. It can determine (provided it has sufficient precision) which heap objects are accessible to external code and which are not.

Regards,

John Criswell

I could imagine a few routes by which that allows such a function to have its address show up in the data being handled by the Module's own code.

David Blaikie via llvm-dev

unread,

Jan 7, 2016, 12:40:38 PM1/7/16

to Russell Wallace, llvm-dev

Everyone else has said good/important things here, but for the original problem I'm not quite sure why you were trying to search through to find the uses, rather than walking the Function's use list?

- David

Russell Wallace via llvm-dev

unread,

Jan 7, 2016, 1:37:54 PM1/7/16

to David Blaikie, llvm-dev

Because that was before I realized that functions keep track of a use list.

Reply all

Reply to author

Forward