[llvm-dev] [RFC] Framework for Finding and Using Similarity at the IR Level


Andrew Litteken via llvm-dev

Sep 1, 2020, 5:30:01 PM
to llvm...@lists.llvm.org
Hello,

I’m Andrew Litteken, and I am working on a framework for defining, detecting, and deduplicating similar code sections at the IR level.

Programmers can introduce sections of code that perform nearly the same operation.  By copying and pasting code throughout a code base, or by using a piece of code as a reference when creating a new piece of code with nearly the same structure, redundancies are inadvertently introduced throughout programs.  Furthermore, compilers can introduce similarity through sets of instructions that require the same code generation strategies, or by optimizing to the same patterns.

For example, these two pieces of code:

int fn(const std::vector<int> &myVec) {
    for (auto it = myVec.begin(),
      et = myVec.end(); it != et; ++it) {
        if (*it & 1)
            return 0;
    }
    return 1;
}

And

int fn(const std::vector<int> &myVec) {
    for (const int &x : myVec) {
        if (x % 2)
            return 1;
    }
    return 0;
}

have a conditional that checks whether the current element (*it and x, respectively) is odd.  And they generate the same piece of IR for the conditional:

%11 = load i32, i32* %10, align 4
%12 = and i32 %11, 1
%13 = icmp eq i32 %12, 0
%14 = getelementptr inbounds i32, i32* %10, i64 1
br i1 %13, label %7, label %15

Such sections of code can be called similar.  Two sections of code are considered similar when they perform the same operations on a given set of inputs.  However, this does not require identity.  This particular definition allows for the same sequences of instructions to have different operand names.  For example:

%1 = add i32 %a, 10
%2 = add i32 %a, %1
%3 = icmp slt i32 %1, %2

and

%1 = add i32 11, %a
%2 = sub i32 %a, %1
%3 = icmp sgt i32 %2, %1

are clearly not identical.  There are different constants, the comparison operator is different, and there are flipped operands.  But if the constant 10, the constant 11, and the register %a are considered inputs into these sections, then the two regions could reduce to:

%1 = add i32 %input1, %input2
%2 = sub i32 %input1, %1
%3 = icmp slt i32 %1, %2

Ultimately, these are sets of operations performing the same action, even though the operands are different.  This patch focuses on this sort of “structural” similarity.

However, this concept can be easily generalized.  For example, two sections of code could be considered similar when they contain the same instructions, but are not necessarily used in the same order. Yet, because of the way the operands are used in this different ordering, the instructions still compute the same results for the given inputs.

If these similarities are recognized, it could offer improvements towards reducing code size, analyzing redundancies in a large project, or creating tools to help refactor code for a programmer.

This new framework offers an interface to detect similarity at the IR level.  It has an internal representation that can be accessed via an analysis, allowing easy access by other transformation and analysis passes that want to leverage detected similarity within a program.  It can also be easily serialized into a file format like JSON, or as OptRemarks.  In the future, a more advanced notion of similarity could be used, and it would be relatively simple to swap a new detection algorithm into the similarity framework.

At present, two different applications have been developed on top of this framework.  The first is a similarity visualizer tool called LLVM-Sim that ingests a module, uses the framework to find the similarity in that module, and outputs a report for a programmer in a serialized form, as mentioned above.  The second is an IR Outliner that uses the analysis pass as the basis for its outlining candidates.  I will first talk briefly about the framework before moving on to the two applications.

Finding Similarity:

The framework for finding similarity is centered around adding an analysis pass that finds similar regions of code that do not cross basic blocks, and groups them into different groups of similarity based on their structure.  These are then made available to other passes, such as an outliner, or a tool, such as LLVM-Sim.

In more detail, finding similarity works by hashing instructions to integers and placing them in a long list.  The list can be thought of as a string, with each unsigned integer acting as a character.  This string can then be processed by a Suffix Tree to find repeated substrings in the program.  Following this, the instances of each repeated substring are analyzed for structural similarity: if a one-to-one mapping between the operands of two different instances of a repeated substring can be found, such that each set of operands is operated on in the same way, then those two regions of instructions are considered structurally similar to one another.  These tasks are handled by three new classes: the IRInstructionMapper, the IRSimilarityCandidate, and the IRSimilarityIdentifier.
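To make the string analogy concrete, here is a toy Python sketch of the repeated-substring step.  It uses a brute-force map of n-grams rather than LLVM's actual Suffix Tree, so it is illustrative only:

```python
from collections import defaultdict

# Toy stand-in for the suffix-tree step: find all substrings of the
# integer "string" that occur more than once, with their start indices.
def repeated_substrings(s, min_len=2):
    seen = defaultdict(list)
    for length in range(min_len, len(s)):
        for i in range(len(s) - length + 1):
            seen[tuple(s[i:i + length])].append(i)
    return {k: v for k, v in seen.items() if len(v) > 1}

# The mapped program [1, 2, 3, 1, 2] repeats the substring (1, 2)
# at indices 0 and 3; each occurrence becomes a similarity candidate.
reps = repeated_substrings([1, 2, 3, 1, 2])
assert reps == {(1, 2): [0, 3]}
```

A real suffix tree does this in linear time rather than the quadratic scan above, which is why the framework reuses LLVM's existing implementation.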

Finding Similarity Implementation:

IRInstructionMapper:
Since matching identical instructions does not provide much information when looking for structural similarity, the value of the instruction and the values of the operands are, for most instructions, not considered.  Instead, only the opcode, types, and parameters are considered.  Ultimately, if the opcode and types match, then the instructions are considered to be performing the same operation.  However, for some instructions, like shufflevector or getelementptr, it is ensured that the parameters match, in keeping with the “isSameOperationAs” instruction method.  Each instruction that is unique under this scheme is mapped to an unsigned integer.  Mapping the entire program to integers treats the program like a string with each integer as a character, and allows the use of data structures such as the Suffix Tree (recently refactored into LibSupport in review D80586) to process the string for repeated sequences of instructions in the program.

This maps instructions such as

%1 = add i32 %a, 1
%2 = add i32 %b, %c

to the same value, as they have the same opcode and the same type, while instructions such as

%1 = add i32 %a, 1
%2 = add i64 %b, %c
%3 = mul i64 %b, %c

are mapped to different values, since they have different opcodes and types.
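The mapping scheme can be sketched as follows; this is an illustrative Python stand-in for the IRInstructionMapper, keying only on opcode and type rather than on full instruction semantics:

```python
# Illustrative stand-in for the IRInstructionMapper: instructions are
# keyed only by opcode and type, so value names do not matter.
def make_mapper():
    legend = {}
    def map_instruction(opcode, ty):
        # Assign the next unused integer to any new (opcode, type) pair.
        return legend.setdefault((opcode, ty), len(legend))
    return map_instruction

mapper = make_mapper()
# 'add i32 %a, 1' and 'add i32 %b, %c' get the same integer.
assert mapper("add", "i32") == mapper("add", "i32")
# A different type or opcode gets a different integer.
assert mapper("add", "i32") != mapper("add", "i64")
assert mapper("add", "i64") != mapper("mul", "i64")
```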

IRSimilarityIdentifier:
The list of integers representing the program is then passed to the Suffix Tree, which can quickly find repeated substrings, and an IRSimilarityCandidate is created for each instance of these repeated substrings.  The IRSimilarityCandidate provides an interface for comparing the operand use and instructions of one IRSimilarityCandidate to another to determine structural similarity.

Structural similarity is determined by attempting to find a one-to-one mapping from each register in one candidate to each register in another. If this mapping is found, then the two candidates are performing the same operations on a given set of inputs and are structurally similar.  For example:

%1 = add %2, %3
%4 = add %1, %2

Is equivalent to

%a = add %b, %c
%d = add %b, %a

And the IRSimilarityCandidates containing these sections of code would be considered structurally similar. But

%1 = add %2, %3
%4 = add %1, %2

Is not equivalent to

%a = add %b, %c
%d = add %a, %e

Since the use of %2 in the first section does not match the use of %b in the second, the IRSimilarityCandidates containing these sections would not be considered structurally similar.
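The one-to-one mapping check can be sketched like this.  It is a simplified, purely positional check over (opcode, operand-list) pairs; the real IRSimilarityCandidate comparison also accounts for result values, and commutativity handling is added in a later patch, so the example pairs below are hypothetical variants of the text's examples:

```python
# Simplified positional sketch of the structural-similarity check:
# build a one-to-one operand mapping, failing on any conflict.
def structurally_similar(cand_a, cand_b):
    fwd, bwd = {}, {}
    for (op_a, opers_a), (op_b, opers_b) in zip(cand_a, cand_b):
        if op_a != op_b or len(opers_a) != len(opers_b):
            return False
        for a, b in zip(opers_a, opers_b):
            # Each value must map to exactly one counterpart, both ways.
            if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
                return False
    return True

first = [("add", ["%2", "%3"]), ("add", ["%1", "%2"])]
second = [("add", ["%b", "%c"]), ("add", ["%a", "%b"])]
third = [("add", ["%b", "%c"]), ("add", ["%a", "%e"])]
assert structurally_similar(first, second)
# %2 would have to map to both %b and %e, so no consistent mapping exists.
assert not structurally_similar(first, third)
```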

The IRSimilarityIdentifier takes the IRSimilarityCandidates created from the instances of the repeated substrings and uses the IRSimilarityCandidate’s interface, described above, to determine which instances of a given repeated substring are structurally similar to one another.

Once it has been determined which instances of a repeated substring are structurally similar, sets of IRSimilarityCandidates that have been found to be compatible are grouped together into a Similarity Group.  The IRSimilarityCandidates contained in each group are structurally similar to one another.  In terms of the example above, the first two sections of code would be in the same Similarity Group, but the third is not structurally similar to either.  Therefore, it would be in its own group, and does not have a matching structure with any other IRSimilarityCandidate in the program.
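Grouping pairwise-compatible candidates can be sketched as follows; the `similar` predicate and the candidate names here are hypothetical stand-ins for the structural-similarity check described above:

```python
# Sketch of grouping: each candidate joins the first existing group whose
# representative it is similar to, otherwise it starts a new group.
def group_candidates(candidates, similar):
    groups = []
    for cand in candidates:
        for group in groups:
            if similar(group[0], cand):
                group.append(cand)
                break
        else:
            groups.append([cand])
    return groups

# Hypothetical candidates: a matching leading letter stands in for
# "structurally similar".
grouped = group_candidates(["A1", "A2", "B1"], lambda x, y: x[0] == y[0])
assert grouped == [["A1", "A2"], ["B1"]]
```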

Application: Visualizing Similarity

One potential application is simply exporting any similarities found for viewing by a user.  LLVM-Sim is a tool that accepts a module, passes it to the IRSimilarityIdentifier, and retrieves the Similarity Groups for the module.  These are then exported to a JSON file that has an entry for each Similarity Group, and a list of ranges for each region of similarity contained in that group.  This tool could also be implemented with remarks rather than JSON for faster, more streamlined processing of program information.

If a module had a Similarity Group with two sections, instructions 5 to 10 and instructions 15 to 20, and a different Similarity Group contained two sections between instructions 1 to 4 and 21 to 24, the output JSON would be:

{
  "1": [{"s": 5, "e": 10}, {"s": 15, "e": 20}],
  "2": [{"s": 1, "e": 4}, {"s": 21, "e": 24}]
}
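A report in this shape can be produced with ordinary JSON serialization; this sketch uses hypothetical group data matching the example above, not LLVM-Sim's actual code:

```python
import json

# Hypothetical Similarity Groups: group ID -> list of instruction ranges.
groups = {
    "1": [{"s": 5, "e": 10}, {"s": 15, "e": 20}],
    "2": [{"s": 1, "e": 4}, {"s": 21, "e": 24}],
}
report = json.dumps(groups, indent=2)
# The serialized report round-trips back to the same structure.
assert json.loads(report) == groups
```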

This information could be used to create interfaces for looking at and comparing the similarity present in the program, such as the example interface here: https://reviews.llvm.org/D86974, which shows two instances of the program alongside its IR with two different similar sections highlighted.

Application: Deduplicating Similarity with Outlining:

Alternatively, after identifying the similar sections of the program, the similar regions of code can be deduplicated.  The IR Outliner makes use of the Code Extractor and the new similarity analysis pass to remove each of the similar ranges individually, and then deduplicates the individually extracted functions for each region, ensuring that the information needed for each extraction is not lost.

Outlining at the IR level can benefit any target architecture, reducing the overall size of a program before any target-specific decisions are made.  The Machine Outliner in LLVM, on the other hand, runs after the register allocator and needs to be implemented for each target separately.  The concept of structural code similarity is easier to implement in a general way at the IR level.

The IR Outliner is off by default.  There is a flag -mllvm -ir-outliner that will use the defined cost model to attempt to outline sections of code that will lead to an overall decrease in the size of the program.  There is also a debugging option -mllvm -ir-outliner-no-cost that will outline the Similarity Groups greedily based on how many instructions will be removed, but does not consider the number of instructions added.

There are more details about the implementation of the IR Outliner at the end of the RFC.

Results:
The LLVM Test Suite, compiled with the IR Outliner on top of -Oz, has seen code size reductions of up to 60%, with an average reduction of 1%, for both X86 and AArch64.  In these tests, the IR Outliner is placed as the last stage before code generation for the given target.  There are some regressions.  These occur because our cost model is not as accurate as the assembly-level code generation phase: it can be difficult to estimate at the IR level how many instructions will be added during code generation.  For instance, if many arguments need to be created to handle constants, the decisions made when creating a call at the assembly level are not directly obvious, so our cost model may estimate that fewer instructions are added than actually are.  Using target hooks may be a way to address this issue, giving a better sense of what will happen during code generation than a general model of calling conventions.

CTMark Code Size (x86_64):

Test                  |Original|   With  | Reduction
                      | -Oz    |Outlining|
-----------------------------------------------------
consumer-typeset.test | 484566 | 458642  | -5.3%
SPASS.test            | 490445 | 470669  | -4.0%
sqlite3.test          | 525902 | 512813  | -2.5%
clamscan.test         | 569405 | 567437  | -0.3%
lencod.test           | 762084 | 759454  | -0.3%
bullet.test           | 779072 | 779072  |  0.0%
kc.test               | 482848 | 482848  |  0.0%
tramp3d-v4.test       | 708712 | 708712  |  0.0%
pairlocalalign.test   | 485246 | 485462  | +0.0%
7zip-benchmark.test   | 982651 | 983563  | +0.1%
Geomean difference: -1.3%

Notable Decreases:
GCC-C-execute-20041011-1.test: 6600 before, 2343 after, -64.5%.
This test uses a lot of macros, and the outliner picks up on the similarity of the same set of inserted instructions.
frame_layout: 113631 before, 98110 after, -13.7%.
This test has a lot of repetitive and repeated instructions performed on different sets of arguments, making it a good candidate for outlining.

Related Work:
The previous RFC (RFC: PT.2 Add IR level interprocedural outliner for code size.) and review (D53942: IR Outliner Pass) discuss similar techniques for outlining identical instructions at the machine code level. This patch uses a more general concept of similarity and works at the IR Level instead.  Both share the Suffix Tree data structure (see Interprocedural MIR-level outlining pass) to identify “outlinable” code sequences within a module.

The framework could also be extended to encompass the Machine Outliner, so a MachineCodeSimilarityIdentifier that feeds the MachineOutliner could be used rather than having all of the functionality inside the one MachineOutliner pass.

Similarity Identification Implementation:

IR Outliner Implementation:
- Cost Model:
Using information such as region size, inputs, and outputs across all the candidates, the arguments that need to be passed to the outlined function are determined.  From this, estimates are made for:
  • How many instructions will need to be added to make a call to the function.
  • How many instructions will be removed by inserting the call.
  • How many instructions will be added by creating the new function due to argument handling.
  • How many instructions will be added from stack instructions at the call site.
  • How many instructions will be added from stack instructions inside the call.
Then the outlined function is only created if the number of instructions added is less than the number of instructions removed. 
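The decision can be sketched as a simple comparison; the function name, parameters, and overhead constants below are hypothetical, not the framework's actual cost model:

```python
# Hedged sketch of the outlining decision; all names and constants here
# are hypothetical, not the framework's actual cost model.
def should_outline(num_candidates, region_size, call_overhead,
                   func_overhead):
    # Instructions removed: every candidate region disappears from its caller.
    removed = num_candidates * region_size
    # Instructions added: one call per candidate, plus the outlined
    # function body and its argument/stack handling.
    added = num_candidates * call_overhead + region_size + func_overhead
    return added < removed

# Three 10-instruction regions amortize the new function's cost...
assert should_outline(3, 10, 3, 4)
# ...but two 4-instruction regions do not.
assert not should_outline(2, 4, 3, 4)
```

This also shows why the model is sensitive to underestimating `call_overhead`: a per-call error is multiplied by the number of candidates, which is where target hooks could help.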

Right now, this cost model was developed from a general sense of what happens in most calling conventions, and is conservative.  Target hooks would allow for a target-specific sense of how large each instruction is, so a better sense of how many machine instructions are being outlined could be developed.  They would also give insight into the particular calling convention in use.  This would allow for a more accurate cost model, and for finding more opportunities for outlining.

- Extraction and Consolidation:
If it is determined that outlining all of the IRSimilarityCandidates in a Similarity Group will add fewer instructions than it removes, each candidate's block is extracted using the Code Extractor.  The contents of one of the functions are placed in a new overall function that has inputs and outputs depending on the arguments needed for each region, as determined by the Code Extractor.  When the consolidation is performed, anything that is not the same across each section, like constants, is extracted into arguments as well.  If the arguments are large, they are currently extracted directly into the function arguments and are not passed by reference.

As an example of extraction and consolidation, if the following two blocks are extracted, each of which has a different constant in the second operand of the first add,

%1 = add i32 %a, 10
%2 = add i32 %1, %a

%1 = add i32 %a, 11
%2 = add i32 %1, %a

The extracted function would be:

define internal void @outlined_ir_func(i32 %a, i32 %b) {
entry:
  %1 = add i32 %a, %b
  %2 = add i32 %1, %a
  ret void
}
And calls of:

call void @outlined_ir_func(i32 %a, i32 10)
call void @outlined_ir_func(i32 %a, i32 11)

For values that are created inside the region but used outside of it, a pointer to a location is passed as an argument, and the data is stored through that pointer.  However, between different similar sections, the values used as outputs are not necessarily the same.  If this is the case, a block for each set of possible outputs needed is added, with a switch statement, controlled by an extra argument, branching to the appropriate block.

For example, if the following two blocks were extracted, each of which has a different value used outside of the similar section,

%1 = add i32 %a, %b
%2 = add i32 %1, %a
%3 = mul i32 %1, %1

%4 = add i32 %a, %b
%5 = add i32 %4, %a
%6 = sdiv i32 %5, %5

The extracted function would be:

define internal void @outlined_ir_func(i32 %a, i32 %b, i32* %c, i32 %d) {
entry:
  %1 = add i32 %a, %b
  %2 = add i32 %1, %a
  switch i32 %d, label %final [
    i32 0, label %block1
    i32 1, label %block2
  ]
block1:
  store i32 %1, i32* %c
  br label %final
block2:
  store i32 %2, i32* %c
  br label %final
final:
  ret void
}

And calls of

call void @outlined_ir_func(i32 %a, i32 %b, i32* %c, i32 0)
%val = load i32, i32* %c 
%3 = mul i32 %val, %val

call void @outlined_ir_func(i32 %a, i32 %b, i32* %c, i32 1)
%val1 = load i32, i32* %c 

%6 = div i32 %val1, %val1

Once there is a complete function, the call sites are replaced with calls to the new function.  If a call site does not require a certain argument, that argument is simply replaced with a null value.  For instance, in the example above, if there were a third region that did not need any sort of output value, the call would be:

call void @outlined_ir_func(i32 %a, i32 %b, i32* null, i32 -1)

With negative one as the switch argument, the program simply branches to the default ending block.

Implementation in Many Patches:

Each of these patches includes tests.  The minimum set of patches needed to see outlining is the Instruction Mapper, Candidate, Identifier, and Wrapper Pass, paired with the initial setup and extraction with input consolidation.  These are currently on Phabricator, and the links are included.

Similarity Identification Analysis Pass:
Instruction Mapper:
IRSimilarityCandidate:
IRSimilarityIdentifier:
WrapperPass:
Extra Features:
  • Commutativity IRSimilarityCandidate: Adding flexibility for commutative instructions in structural comparison.
  • Predicate Isomorphism IRInstructionMapper/IRSimilarityCandidate: Adding flexibility for isomorphic predicate instructions in structural comparison.

LLVM-Sim:
IROutliner:
Extraction:
Consolidation:
  • IR Outliner: Adding checks to exclude certain functions that either cannot be outlined, or require extra work to outline.
  • IROutliner Only Inputs: Extract only regions that require inputs, and consolidate each group into one outlined function.
  • IROutliner Constants: Elevate constants to arguments when they are not the same in each region.
  • IROutliner Assumes: Handles the special case of llvm.assumes which are removed from extracted regions.
  • IROutliner Sunken Allocas: Handles the case of objects with their lifetimes entirely included in the region, and elevates them to arguments.
  • IROutliner Outputs: Handles the case of multiple different output schemes in different regions.
  • IROutliner Reduce Output Blocks: Remove duplicate output scheme blocks for output registers in outlined functions.
Miscellaneous Features:
  • IROutliner Attribute Merging: Makes sure that function attributes are merged consistently.
  • IROutliner Attribute Compatibility: Adds an option to handle function attribute incompatibilities, for items like memory sanitization, for outlined functions.
  • IROutliner DebugInfo: Makes sure that debug info is consistent for outlined functions.
  • IROutliner Cost Model: Adds a cost model, so that reductions are prioritized rather than what comes first in terms of sheer code size outlined.
  • IROutliner with OptRemarks: Adding OptRemarks to the outliner.
  • IROutliner with SwiftError: Adding support for swifterror parameters in outlined functions.
  • IROutliner with LinkOnceODR: Adding flag to outline from LinkOnceODR functions.

Any feedback, comments, and discussion are welcome and appreciated!

Thank You,
Andrew

Neil Nelson via llvm-dev

Sep 1, 2020, 10:24:44 PM
to llvm...@lists.llvm.org

This looks like a very worthwhile project. Often we are looking for faster execution, with code size being a secondary issue, though some uses will have the reverse priority. Hopefully, the project will provide execution times along with code-size reductions.

Neil Nelson

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Renato Golin via llvm-dev

Sep 2, 2020, 6:04:21 AM
to Andrew Litteken, LLVM Dev
Indeed, an awesome project and an excellent report!

Code size doesn't really get much attention, so the level of detail and the strong roadmap are refreshing.

> Hopefully, the project will provide execution times along with code-size reductions.

I doubt it.

Outlining will (almost?) always make for slower code due to a lot more calls being made. But that's OK for embedded targets, which don't have strong speculative machinery.

Also, if the program fits versus doesn't, then it will run infinitely faster. :D

Kristof, do you know who at Arm would be good to review those patches?

cheers,
--renato

Andrew Litteken via llvm-dev

Sep 9, 2020, 6:00:18 PM
to Renato Golin, LLVM Dev
Thanks for the positive feedback!

All the patches are up now:

Flexibility for Isomorphic Predicates: [IRSim] Adding support for isomorphic predicates
Flexibility for Commutative Instructions: [IRSim] Adding commutativity matching to structure checking
Matching call instructions for similarity identification: [IRSim] Letting call instructions be legal for similarity identification.
Matching GEP instructions for similarity identification: [IRSim] Letting gep instructions be legal for similarity identification.

Feedback, comments and reviews are welcome!

Andrew

Yvan Roux via llvm-dev

Sep 30, 2020, 7:23:05 AM
to Andrew Litteken, LLVM Dev
Hi Andrew,

Thanks for this great work and presentation; I also have a huge interest in code size reduction. I already looked at the first IR Outliner proposal, and I really like your more general approach.

I wanted to look at the impact on ARM targets, and managed to apply your patchset up to D87311 (the remaining ones, which are related to call instructions and GEP, need to be reworked to apply or work correctly). I looked at CTMark code size over -Oz for both ARM and X86 and got a huge code size increase on both targets (around 30% or 40% in geomean) when compiled with -mllvm -ir-outliner.

Does it ring any bells on your side? Is it expected given that the full series is not applied, or did I just miss something? From what I can observe, and as you mentioned in your first mail, a lot of code is introduced to handle parameter passing to outlined functions, and the biggest code size increase was observed after applying D87296.

Thanks a lot for this work and any hint on how to reproduce your results; I'm eager to play with it and give some help if needed.

Cheers,
Yvan

Andrew Litteken via llvm-dev

Sep 30, 2020, 10:48:05 AM
to Yvan Roux, LLVM Dev
Hi Yvan,

Glad to hear you’re interested! I’m not sure in which order you’re applying the patches, but a cost model was not added to the outliner until D87299. So, if that hasn’t been added, I would expect some pretty big code size increases, since otherwise it just outlines whatever similarity it finds first.

The cost model has also been slightly reworked since the initial results to be more general and to try to use target hooks, but not all of the target hooks that we would really want exist. So, if it has been applied, I’ll definitely need to take a look.

Andrew

> On Sep 30, 2020, at 6:22 AM, Yvan Roux <yvan...@linaro.org> wrote:
>
> Hi Andrew,

Yvan Roux via llvm-dev

Sep 30, 2020, 11:30:18 AM
to Andrew Litteken, LLVM Dev
On Wed, 30 Sep 2020 at 16:48, Andrew Litteken <andrew....@gmail.com> wrote:
>
> Hi Yvan,
>
> Glad to hear you’re interested! I’m not sure which ordering you’re looking at the patches, but a cost model was not added to the outliner until D87299. So, if that hasn’t been added I would expect some pretty big code increases since it just outlines whatever similarity it finds first otherwise.
>
> The cost model has also been slightly reworked since the initial results to be more general, and try to use target hooks, but not all of the target hooks that we would really want exist. So, if it has been applied, I’ll definitely need to take a look.

I applied the patches in order on top of your commit of D86974 (15645d044bc), that is:
D86975 - D86976 - D86977 - D86978 - D87294 - D87296 - D87298 - D87299 - D87300 - D87301 - D87302 - D87309 - D87310

So, the cost model was part of it. I don't remember if I had to resolve some conflicts for it, but I'll re-check and verify that it was applied correctly.

Yvan

Andrew Litteken via llvm-dev

Sep 30, 2020, 2:41:40 PM
to Yvan Roux, LLVM Dev
Hi Yvan,

I went through and found a bug, introduced during some refactoring, that was causing removed instructions to be counted repeatedly in the estimated benefit. I’ve uploaded the corrected diff to the cost model patch: https://reviews.llvm.org/D87299.

This should take care of the large increases that you’re seeing.  However, you should see smaller changes than the results I provided in the RFC, since the call and GEP instructions allow the regions to be larger, making it easier to reach a higher benefit-to-cost ratio.

Additionally, the current set of patches does not outline most intrinsics, since we weren’t sure how safe it was to outline every type of intrinsic. The initial data did not enforce this restriction; we adjusted the patches afterwards to help land them more quickly. The hope is that the different intrinsics will be added piece by piece to make sure that outlining does not cause any miscompiles.

Let me know if there are any more complications.

Andrew

Yvan Roux via llvm-dev

Oct 1, 2020, 4:39:46 AM
to Andrew Litteken, LLVM Dev
Hi Andrew,

On Wed, 30 Sep 2020 at 20:41, Andrew Litteken <andrew....@gmail.com> wrote:
>
> Hi Yvan,
>

> I went through and found a bug that was causing a repeated addition of removed instructions to the estimated benefit that occurred during some refactoring. I’ve uploaded the correct diff into the cost model patch: https://reviews.llvm.org/D87299.

Great, it fixes the issue on my side as well on X86 and ARM targets.

> This should take care of the large increases that you’re seeing. However, you should see smaller changes than the results I provided in the RFC, since the call and GEP instructions allow for the regions to be larger, making it easier to have a higher benefit to cost ratio.

Yes, now the effect is almost flat without calls and GEP. It would be great if you could rebase those patches on top of the changes you made in the previous ones, but that's already a good basis.

> Additionally, the current set of patches do not outline most intrinsics, since we weren’t sure how safe it was to outline every type of intrinsic. The initial data did not enforce this restriction as we adjusted the patches after to help land them more quickly. The hope is that the different intrinsics will be added piece by piece to make sure that outlining does not cause any miscompiles.

OK, it makes sense. I can help at some point with that and/or at least test it on ARM.

Thanks a lot
Yvan
