[llvm-dev] [GlobalISel] A Proposal for global instruction selection

Quentin Colombet via llvm-dev

Nov 18, 2015, 2:26:53 PM

to llvm-dev

Hi,

With this email, I would like to kick off the development of the next instruction selector that I described during the last LLVM Dev’ Meeting.
For the motivations, see Jakob’s proposal (http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-August/064727.html) and for the proposal, see the slides (Keynote: http://llvm.org/viewvc/llvm-project/www/trunk/devmtg/2015-10/slides/Colombet-GlobalInstructionSelection.key?view=co or PDF: http://llvm.org/viewvc/llvm-project/www/trunk/devmtg/2015-10/slides/Colombet-GlobalInstructionSelection.pdf?revision=252430&view=co) or the talk (https://www.youtube.com/watch?v=F6GGbYtae3g&list=PL_R5A0lGi1AA4Lv2bBFSwhgDaHvvpVU21&index=2).

TL;DR This is happening now, feedback invited!

*** Context ***

During the last LLVM Dev’ Meeting, I presented a proposal for the next instruction selector, GlobalISel. The proposal is basically summarized in “High Level Prototype Design” and “Roadmap”. (If you want further details, feel free to reach out to me.)

The first step of the development plan is to prototype the new framework in open source. The idea is to start prototyping now(!) and have the discussion ongoing in parallel. The reason for this approach is to have code that can be used to inform those discussions, e.g., by collecting data and trying different design approaches. Regarding the discussion, I have listed a few points where your feedback would be particularly appreciated (see Feedback Invite).

Also, as I mentioned in my talk, some issues are controversial but I expect them to be resolved during prototype development. Specifically, these concern aspects of legalization (should parts of it be done at the LLVM IR level or all at the MI level?) and code reuse for the instruction combiner. Please feel free to bring up your specific concerns as I move along with the development plan.

I expect the design to evolve with our experimental findings and your feedback and contributions.
Nonetheless, we expect to nail down some design decisions once and for all as the prototype progresses. I have highlighted them with the following pattern [final].

*** Feedback Invite ***

If you follow and support this work you need to be aware of three things and I am eager to hear your feedback and thoughts about them: the overall goals of Global ISel, the goals of the prototype, and the impact of the prototype work on backend design.

In the section “Goals”, I defined (repeated for people who saw the talk) the goals for the Global ISel design.
- Do you see anything missing?
- Do you see something that should not be there?

The prototype will answer critical design questions (see “Design Questions the Prototype Addresses at the End of M1” for examples) before the actual design of Global ISel is finalized, but it cannot cover everything.
Specifically, we will *not* look into improving TableGen or reusing InstCombine (see “Proposed Approach” for the rationale). Please let me know if you see any issue with that.

There is also basic ground work needed to prepare for Global ISel and I need to extend the core MachineInstr-level APIs as explained during the talk. For this, I prepared sketches of patches to illustrate them and describe the details in the “Implications” section below. Please have a look at the patches to have a better idea of the expected impact.

If there is anything else you want to discuss related to Global ISel, feel free to reach out to me. In particular, several people expressed interest during the LLVM Dev Meeting in contributing to the project. Let me know what your area of interest is, so that we can coordinate our efforts.
Anyhow, please add [GlobalISel] in the subject line to help categorize the emails.

*** Goals ***

The high level goals of the new instruction selector are:
- Global instruction selector.
- Fast instruction selector.
- Shared code path for fast and good instruction selection.
- IR that represents ISA concepts better.
- More flexible instruction selector.
- Easier to maintain/understand framework, in particular legalization.
- Self contained machine representation, no back links to LLVM IR.
- No change to LLVM IR.

Note: The goals are common to all targets. In particular, we do not intend to work on target-specific features for the prototype.
The bottom line is: please make sure those goals are compatible with what you want to achieve for your target, even if your requirement is not listed here.

*** Proposed Approach ***

In this section, I describe the approach I plan to pursue in the prototype and the roadmap to get there. The final design will flow out of it.

For this prototype, we purposely exclude any work to improve or use TableGen or InstCombine [final]. We will keep in mind however, that some of the C++ code we write will be table-generated at some point.
The rationale is that we do not want to lay down a new TableGen/InstCombine infrastructure before being able to work on the ISel framework itself.

The prototype vehicle will be AArch64. None of the changes for GlobalISel will negatively impact the existing ISel.

** High Level Prototype Design **

As shown in the talk, the expected pipeline for the prototype is:
LLVM IR -> IRTranslator -> Generic (G) MachineInstr -> Legalizer -> RegBankSelect -> Select -> MachineInstr

Where:
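- Terms in bold in the original (LLVM IR, Generic (G) MachineInstr, and MachineInstr) are intermediate representations; the remaining terms are passes.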
- Generic MachineInstrs are machine instructions with a generic opcode, e.g., ADD, COPY.

- IRTranslator: Translate LLVM IR to (G) MachineInstr.
- Legalizer: Legalize illegal (G) MachineInstr to legal (G) MachineInstr.
- RegBankSelect: Assign virtual registers with a size to virtual registers with a register bank.
- Select: Translate the remaining (G) MachineInstr to MachineInstr.
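
As a rough illustration of the pipeline above, here is a toy model in Python (not LLVM code). Only the pass names and their order come from the proposal. The `Inst` fields, the `G_` prefix used to tell generic opcodes apart, the 32/64-bit legal sizes, the `GPR`/`FPR` bank names, and the selection table are all assumptions made up for the sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Inst:
    opcode: str      # generic (e.g. "G_ADD") or target-specific (e.g. "ADDWrr")
    bits: int        # size of the value the instruction defines
    bank: str = ""   # register bank, empty until reg_bank_select runs


def ir_translator(ir_ops):
    """Translate (operation, bits) pairs standing in for LLVM IR into generic instructions."""
    return [Inst("G_" + op.upper(), bits) for op, bits in ir_ops]


def legalizer(insts, legal_bits=(32, 64)):
    """Widen any illegal size to the smallest legal one (splitting wide values is omitted)."""
    return [
        i if i.bits in legal_bits
        else replace(i, bits=min(b for b in legal_bits if b >= i.bits))
        for i in insts
    ]


def reg_bank_select(insts):
    """Attach a register bank to every sized value (toy rule: floating-point ops go to FPR)."""
    return [replace(i, bank="FPR" if i.opcode.startswith("G_F") else "GPR") for i in insts]


# Hypothetical selection table, keyed by (generic opcode, bank, size).
SELECT_TABLE = {
    ("G_ADD", "GPR", 32): "ADDWrr",
    ("G_ADD", "GPR", 64): "ADDXrr",
    ("G_FADD", "FPR", 64): "FADDDrr",
}


def select(insts):
    """Replace each remaining generic opcode with a target opcode."""
    return [replace(i, opcode=SELECT_TABLE[(i.opcode, i.bank, i.bits)]) for i in insts]


def run_pipeline(ir_ops):
    return select(reg_bank_select(legalizer(ir_translator(ir_ops))))


selected = run_pipeline([("add", 8), ("fadd", 64)])
print([i.opcode for i in selected])  # ['ADDWrr', 'FADDDrr']
```

The 8-bit add is widened to 32 bits by the legalizer before a bank is chosen, so the selector only ever sees legal sizes.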

** Implications **

As part of the bring-up of the prototype, we need to extend some of the core MachineInstr-level APIs:
- Need to remember FastMath flags for each MachineInstr.
- Need to know the type of each MachineInstr. We don’t want ADD8, ADD16, etc.
- Extend the MachineRegisterInfo to support size as well as register classes for virtual registers.
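
To picture what these three extensions amount to, here is a small Python sketch (again a toy model, not the API in the attached patches). The class and method names (`ToyRegisterInfo`, `create_generic_vreg`, `ToyInstr`, and so on) are hypothetical. The point is only that a virtual register starts with a size and gets a class later, and that the type and the fast-math flags sit on the instruction rather than in the opcode.

```python
from dataclasses import dataclass, field


@dataclass
class VRegInfo:
    size_bits: int       # known as soon as the register is created
    reg_class: str = ""  # assigned later, once a class has been chosen


@dataclass
class ToyRegisterInfo:
    vregs: list = field(default_factory=list)

    def create_generic_vreg(self, size_bits):
        self.vregs.append(VRegInfo(size_bits))
        return len(self.vregs) - 1  # the index doubles as the register number

    def set_reg_class(self, vreg, reg_class):
        self.vregs[vreg].reg_class = reg_class


@dataclass
class ToyInstr:
    opcode: str                         # a single "ADD", not ADD8/ADD16
    type_name: str                      # the type lives on the instruction
    operands: list                      # [def, use, use] register numbers
    fast_math: frozenset = frozenset()  # e.g. {"nnan", "nsz"} for FP operations


mri = ToyRegisterInfo()
a, b, c = (mri.create_generic_vreg(16) for _ in range(3))
add16 = ToyInstr("ADD", "i16", [c, a, b])
mri.set_reg_class(c, "GPR32")  # later passes constrain the register further
```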

I have sketched the changes in the attached patches to help picture how they would impact the existing APIs.

Note: I do not intend to commit those changes as they are. They will go through the usual review process in due time.

The patches contain “// ***”-like comments that give a rough explanation of why those changes are needed w.r.t. the goals.
The order of the patches could be modified since the dependencies between them are not sequential. Anyhow, here are the patches:
1. Introduce (some of) the generic opcodes.
2. Make MachineFunction more independent of LLVM IR to eventually be able to delete the LLVM IR instance from memory.
3. Extend MachineInstr to represent additional information attached to generic opcodes.
4. Teach MachineRegisterInfo about size for virtual registers.
5. Introduce a helper class to build MachineInstr related objects.
6. Add new target hooks to lower the ABI directly to MachineInstr.
7. Introduce the IRTranslator pass.

** Roadmap for the Prototype **

We plan to split the prototype in three main milestones:
1. Translation: LLVM IR to (G) MachineInstr translation.
2. Basic selector: Legal LLVM IR to target specific MachineInstr.
3. Simple legalization: Support scalar type legalization and some vector instructions.

Notes:
- For #1, we will not support any fancy instructions like landing pad or switch.
- Each milestone should take about 3-4 months.

- At the end of #2, we would have a FastISel like selector.

Each milestone will be detailed right before starting it. The rationale is that we want to accommodate what we discovered with the prototype for the next milestone. In other words, in this email, I only describe the first milestone in detail, and I will give more details on the next milestone shortly before we start it, and so on. For your information, here is the remainder of the intended roadmap for the full project:
4. Productization: Clean up implementation, stabilize the APIs.
5. Complex legalization: Extend legalization support to everything missing.
6. Completeness: Fill the blanks, e.g., landing pad.
7. Clean-up and performance: Add the necessary bits to be at parity or beat SelectionDAG generated code.
8. Transition: Document how to switch, provide tools to help.

** Milestone 1 **

The first phase is focused on the IRTranslator pass.

The IRTranslator is responsible for translating the LLVM IR into Generic MachineInstr. The IRTranslator pass uses some target hooks to perform the ABI lowering. We can either define a new API for them, e.g., ABILoweringInfo, or extend the existing TargetLowering.
Moreover, the prototype will focus on simple instructions, i.e., we will not support switch or landing pad for this iteration.
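
A minimal Python sketch of the translator delegating ABI lowering to a target hook might look as follows. This is a toy model under stated assumptions: `ToyABILowering`, `lower_formal_arguments`, `translate_function`, and the `LOAD_STACK` pseudo-opcode are hypothetical names, and the convention is simplified to "first eight integer arguments in X0 to X7, the rest in 8-byte stack slots".

```python
class ToyABILowering:
    """Hypothetical target hook; the proposal only names ABILoweringInfo as one option."""

    ARG_REGS = [f"X{i}" for i in range(8)]

    def lower_formal_arguments(self, arg_vregs):
        insts, stack_offset = [], 0
        for n, vreg in enumerate(arg_vregs):
            if n < len(self.ARG_REGS):
                # Argument arrives in a physical register: copy it into the vreg.
                insts.append(("COPY", vreg, self.ARG_REGS[n]))
            else:
                # Argument arrives on the stack: load it from its slot.
                insts.append(("LOAD_STACK", vreg, stack_offset))
                stack_offset += 8
        return insts


def translate_function(arg_names, abi):
    """The translator creates one vreg per argument and lets the hook place them."""
    vregs = [f"%{i}" for i, _ in enumerate(arg_names)]
    return abi.lower_formal_arguments(vregs)


prologue = translate_function(list("abcdefghij"), ToyABILowering())
```

With ten arguments, the first eight become copies from X0 to X7 and the last two become stack loads at offsets 0 and 8, all expressed directly as machine-level instructions.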

At the end of M1, the prototype will not be able to produce code, since we would only have the beginning of the Global ISel pipeline. Instead, we will test the IRTranslator on the generic output that is produced from the tested IR.

* Design Decisions *

- The IRTranslator is a final class. Its purpose is to move away from LLVM IR to MachineInstr world [final].
- Lower the ABI as part of the translation process [final].

* Design Questions the Prototype Addresses at the End of M1 *

- Handling of aggregate types during the translation.
- Lowering of switches.
- What about Module pass for Machine pass?
- Introduce new APIs to have a clearer separation between:
  - Legalization (setOperationAction, etc.)
  - Cost/Combine related (isXXXFree, etc.)
  - Lowering related (LowerFormal, etc.)
- What is the contract with the backends? Is it still “should be able to select any valid LLVM IR”?
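
One way to picture the API separation raised in the questions above is three small, independent interfaces instead of one large class. The Python sketch below is only an illustration: the class names and the "Legal by default" behavior are assumptions, and the method names are loosely modeled on the hooks mentioned in the list.

```python
class ToyLegalizationInfo:
    """Answers only: what should happen to this operation on this type?"""

    def __init__(self):
        self._actions = {}

    def set_operation_action(self, opcode, type_name, action):
        self._actions[(opcode, type_name)] = action

    def get_operation_action(self, opcode, type_name):
        # Assumed default: anything not registered is treated as legal.
        return self._actions.get((opcode, type_name), "Legal")


class ToyCostInfo:
    """Answers only cost/combine questions."""

    def __init__(self, free_opcodes=()):
        self._free = frozenset(free_opcodes)

    def is_free(self, opcode):
        return opcode in self._free


class ToyLoweringInfo:
    """Answers only ABI/lowering questions; left abstract for the target."""

    def lower_formal_arguments(self, arg_vregs):
        raise NotImplementedError("target-specific")


legal = ToyLegalizationInfo()
legal.set_operation_action("SDIV", "i128", "Expand")
cost = ToyCostInfo(free_opcodes=["TRUNC"])
```

A legalizer would then depend on `ToyLegalizationInfo` alone, and a combiner on `ToyCostInfo` alone.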

Thanks,

-Quentin

0001-Extend-generic-opcodes-to-be-able-to-represent-the-i.patch

0006-Add-new-target-hooks-to-be-able-to-lower-the-ABI-rig.patch

0007-Introduce-the-IRtranslator-pass-for-GlobalISel.patch

0002-Pull-more-of-the-LLVM-IR-function-representation-int.patch

0003-Extend-MachineInstr-to-supply-more-information-regar.patch

0004-Teach-MachineRegisterInfo-about-size-for-virtual-reg.patch

0005-Introduce-a-MachineIRBuilder-to-gather-all-the-Machi.patch

James Molloy via llvm-dev

Nov 18, 2015, 2:54:16 PM

to Quentin Colombet, llvm-dev

Hi Quentin,

I'm really excited to see this happening!

My major question is over the testing story for this. How are we going to write unit tests for GIR? Are you intending to leverage the LIR lowering that no one is using yet? Will you be using unit/LIT tests right from the start, or adding them in later?

Cheers,

James

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

David Chisnall via llvm-dev

Nov 18, 2015, 2:55:17 PM

to Quentin Colombet, llvm-dev

Hi Quentin,

On 18 Nov 2015, at 19:26, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> In the section “Goals", I defined (repeated for people that saw the talk) the goals for the Global ISel design.
> - Do you see anything missing?
> - Do you see something that should not be there?

I really like the design that you outlined. I have one very small request:

Please maintain pointers as a distinct type from integers for as long as possible. We currently have some patches in SelectionDAG to add some pointer-specific operations, as in our architecture the operations valid on pointers are not the same as those valid on integers (and pointers are not the same size as integers). Your proposed model looks like it would be *much* easier for us to use as long as that constraint is kept. Various systems with different integer and address registers hit the same problem as us.

Given the way that you’re proposing to do legalisation, this seems like it should be easy (for most architectures, assigning pointers to the same register bank as integers will be a simple choice and then all of the later selection should be the same).

On a related note, keeping pointer address spaces around in the machine IR would make things easier for us and, I think, some of the GPU folks.

David

Quentin Colombet via llvm-dev

Nov 18, 2015, 4:32:18 PM

to James Molloy, llvm-dev

Hi James,

> On Nov 18, 2015, at 11:53 AM, James Molloy <ja...@jamesmolloy.co.uk> wrote:
>
> Hi Quentin,
>
> I'm really excited to see this happening!
>
> My major question is over the testing story for this. How are we going to write unit tests for GIR?

Thanks for bringing that up!

That is a very good question and also one that will require a lot of work to address properly.

Ultimately, I’d like us to be able to write unit tests directly in the MachineInstr representation. Part of the goal of making the IR self-contained, i.e., with no back links to LLVM IR, is to make testing easier.

Now, to answer the question on how we do that, I have a pragmatic answer, though I am not proud of it:

We are going to write unit tests with LLVM IR as input and check the MI output of the pass, e.g., with print-after=IRTranslator.

That’s not great, but at least we can test now!
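
To make the "LLVM IR in, CHECK lines on the MI output" approach concrete, here is a much simplified Python sketch of what in-order CHECK matching does. It is not FileCheck (no regexes, no CHECK-NEXT or CHECK-NOT), the function name `check_in_order` is hypothetical, and the dump text in the usage lines is made up rather than real output of the pass.

```python
def check_in_order(output, checks):
    """Return True if every pattern occurs in `output` as a plain substring, in order."""
    pos = 0
    for pattern in checks:
        found = output.find(pattern, pos)
        if found < 0:
            return False
        pos = found + len(pattern)  # later patterns must match after this one
    return True


# Made-up dump standing in for the printed MI of a translated function.
dump = "%0 = COPY X0\n%1 = COPY X1\n%2 = ADD %0, %1\n"

ok = check_in_order(dump, ["COPY X0", "COPY X1", "ADD %0, %1"])
wrong_order = check_in_order(dump, ["ADD %0, %1", "COPY X0"])
```

Here `ok` is true and `wrong_order` is false, which is the property the tests rely on: the expected instructions must show up, and in the expected order.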

> Are you intending to leverage the LIR lowering that noone is using yet?

That’s a tricky question because I do not intend to work on this in the prototype timeframe and I am not fond of the way this testing works.

However, yes, I believe that we need to redevelop or leverage the LIR lowering for this purpose. Actually, I was looking for volunteers to work on that during the prototype timeframe, so that we have everything we need when we productize the new framework.

Interested? :P

Note: My main concern is that it uses a YAML format, i.e., we cannot dump the output of a machine function and feed that dump back in.

> Will you be using unit/LIT tests right from the start, or adding them in later?

Definitely right from the start, with the “output” method I mentioned.

The hope is that a “LIR lowering”-like mechanism will be developed along the way and we can migrate tests to the new format when it is ready. If we carefully design this “LIR lowering” format, we may just have to change the RUN line :).

Thanks,

-Quentin

Matthias Braun via llvm-dev

Nov 18, 2015, 6:01:42 PM

to Quentin Colombet, llvm-dev

> On Nov 18, 2015, at 1:32 PM, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Hi James,
>
>> On Nov 18, 2015, at 11:53 AM, James Molloy <ja...@jamesmolloy.co.uk> wrote:
>>
>> Hi Quentin,
>>
>> I'm really excited to see this happening!
>>
>> My major question is over the testing story for this. How are we going to write unit tests for GIR?
>
> Thanks for bringing that up!
> That is a very good question and also one that will require a lot of work to address properly.
>
> Ultimately, I’d like we are able to write unit tests directly in the MachineInstr representation. Part of the goal of making the IR self contained, i.e., with no back links to LLVM IR, is to make the testing easier.
>
> Now, to answer the question on how we do that, I have a pragmatic answer, though I am not proud of it:
> We are going to write unit tests with LLVM IR as input and check the MI output of the pass, e.g.,with print-after=IRTranslator.
>
> That’s not great, but at least we can test now!

We do have the .mir dumping and reading. To me that code looks like it basically works and just might need a bug fix here and there. Should be the right thing to use when starting a new project like this, shouldn't it?

- Matthias

Marcello Maggioni via llvm-dev

Nov 18, 2015, 6:06:30 PM

to Quentin Colombet, llvm-dev

Thanks Quentin for the effort in putting this together!!

I’m super excited to see this going forward, and I’m looking forward to helping bring GlobalISel up as much as I can, as it is very promising for our targets!

It also caught my eye that you mentioned the possibility of having Module-level Machine passes.

Having that would simplify some parts of our pipeline, for example; I believe that right now we are using some hacks to obtain basically the same result at the very end of the pipeline.

This would be useful at least for us!

Marcello

<0001-Extend-generic-opcodes-to-be-able-to-represent-the-i.patch>
<0002-Pull-more-of-the-LLVM-IR-function-representation-int.patch>
<0003-Extend-MachineInstr-to-supply-more-information-regar.patch>
<0004-Teach-MachineRegisterInfo-about-size-for-virtual-reg.patch>
<0005-Introduce-a-MachineIRBuilder-to-gather-all-the-Machi.patch>
<0006-Add-new-target-hooks-to-be-able-to-lower-the-ABI-rig.patch>
<0007-Introduce-the-IRtranslator-pass-for-GlobalISel.patch>

Quentin Colombet via llvm-dev

Nov 18, 2015, 6:12:45 PM

to Matthias Braun, llvm-dev

> On Nov 18, 2015, at 3:01 PM, Matthias Braun <mbr...@apple.com> wrote:
>
>> On Nov 18, 2015, at 1:32 PM, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:
>>
>> Hi James,
>>
>>> On Nov 18, 2015, at 11:53 AM, James Molloy <ja...@jamesmolloy.co.uk> wrote:
>>>
>>> Hi Quentin,
>>>
>>> I'm really excited to see this happening!
>>>
>>> My major question is over the testing story for this. How are we going to write unit tests for GIR?
>>
>> Thanks for bringing that up!
>> That is a very good question and also one that will require a lot of work to address properly.
>>
>> Ultimately, I’d like we are able to write unit tests directly in the MachineInstr representation. Part of the goal of making the IR self contained, i.e., with no back links to LLVM IR, is to make the testing easier.
>>
>> Now, to answer the question on how we do that, I have a pragmatic answer, though I am not proud of it:
>> We are going to write unit tests with LLVM IR as input and check the MI output of the pass, e.g.,with print-after=IRTranslator.
>>
>> That’s not great, but at least we can test now!
>
> We do have the .mir dumping and reading. To me that code looks like it basically works and just might need a bug fix here and there. Should be the right thing to use when starting a new project like this, shouldn't it?

That’s the thing, I don’t want to spend time writing mir code. I basically want to be able to take the machine code I expect and add CHECK lines without going through the yaml formatting business.

For just the translator, the story might be different though, because it’s LLVM IR to MI, so maybe that’s already well supported.

Then, I haven’t followed the mir stuff closely enough to know whether or not it would work with “a bug fix here and there”. For instance, I don’t know how it gets the opcodes for the parser, i.e., how automatic/easy it is to add the generic opcodes, plus we will have to teach it how to deal with types on instructions and so on.

My impression was that it was lacking enough in usability that a fresh start would make sense (basically I see it as a prototype), but you may well be right. Again, I don’t plan to look into it for the prototype timeframe, but if you are, by all means!

Cheers,

-Quentin

- Matthias

Quentin Colombet via llvm-dev

Nov 18, 2015, 6:53:03 PM

to David Chisnall, llvm-dev

Hi David,

> On Nov 18, 2015, at 11:55 AM, David Chisnall <David.C...@cl.cam.ac.uk> wrote:
>
> Hi Quentin,
>
> On 18 Nov 2015, at 19:26, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:
>>
>> In the section “Goals", I defined (repeated for people that saw the talk) the goals for the Global ISel design.
>> - Do you see anything missing?
>> - Do you see something that should not be there?
>
> I really like the design that you outlined. I have one very small request:
>
> Please maintain pointers as a distinct type from integers for as long as possible. We currently have some patches in SelectionDAG to add some pointer-specific operations, as in our architecture the operations valid on pointers are not the same as those valid on integers (and pointers are not the same size as integers). Your proposed model looks like it would be *much* easier for us to use as long as that constraint is kept. Various systems with different integer and address registers hit the same problem as us.

I understand the problem, but I feel like Jakob back in the day:
http://lists.llvm.org/pipermail/llvm-dev/2013-August/064734.html
http://lists.llvm.org/pipermail/llvm-dev/2013-August/064760.html

To summarize in my own words:
To me, the pointer/integer distinction is a way for you to specify the register classes you want. This is something the RegBankSelect pass will do for you, and this distinction should not be necessary to produce efficient or correct code.

If that doesn’t work, you should be able to have a target-specific pass to select what you want directly after the translation, or use a custom translation. One can envision some kind of IRTranslationKit that has all the generic translation built in to help you in such a case.

Anyway, the good thing about the prototype is that we will be able to experiment with these things :).

>
> Given the way that you’re proposing to do legalisation, this seems like it should be easy (for most architectures, assigning pointers to the same register bank as integers will be a simple choice and then all of the later selection should be the same).
>
> On a related note, keeping pointer address spaces around in the machine IR would make things easier for us and, I think, some of the GPU folks.

Good point, this is also something that MachineInstr should expose as part of making the IR self-contained.

Thanks,
-Quentin

Quentin Colombet via llvm-dev

Nov 18, 2015, 7:53:59 PM

to Marcello Maggioni, llvm-dev

Hi Marcello,

> On Nov 18, 2015, at 3:06 PM, Marcello Maggioni <mmag...@apple.com> wrote:
>
> Thanks Quentin for the effort in putting this together!!
>
> I’m super excited in seeing this going forward and I’m looking forward in helping in bringing GlobalISel up as much as I can as it is very promising for our targets!
> It also catched my eye that you mentioned the possibility of having Module level Machine passes.
> Having that would simplify some parts of our pipeline for example I believe as now we are using some hacks to obtain basically the same result at the very end of the pipeline.
> This would be useful at least for us!

Good to know!

Right now, I was considering it for the “LLVM IR -> MachineInstr” translation because if we want to go all the way to having everything in MachineInstr, we need to lower the global variables as well, and this conceptually does not fit into a function-like pass.

Knowing that there are other users for it, it sounds like it would indeed be good to have!

Thanks for your feedback!

Cheers,

-Quentin

David Chisnall via llvm-dev

Nov 19, 2015, 4:57:45 AM

to Quentin Colombet, llvm-dev

On 18 Nov 2015, at 23:52, Quentin Colombet <qcol...@apple.com> wrote:
>
> To summarize with my own words and feelings that gives:
> To me the pointer/integer distinction is a way for you to specify the register classes you want. This is something the RegBankSelect pass will do for you and this distinction should not be necessary to produce efficient or correct code.
>
> If that doesn’t work, you should be able to have target specific pass to select what you want directly after the translation or with a custom translation. One can envision some kind of IRTranslationKit that has all the generic translation build into to help you in such case.
>
> Anyway, the good point with the prototype is that we will be able to experiment these things :).

As long as the pointer vs integer distinction is preserved until the RegBankSelect stage, then that will work for us. The problem with the current SelectionDAG ordering is that ‘what integer type do you use to represent pointers?’ is the first question that the generic CodeGen infrastructure asks the back end during type legalisation, and the information is then gone (unless you add new MVTs, as we’ve had to do). If the initial lowering preserves the difference between pointer-in-address-space-X and i64, and we are allowed register banks that overlap for final register allocation (which is almost certainly needed for less exotic use cases anyway), then this scheme would be a *lot* easier for us to work with than the existing CodeGen infrastructure.
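
The constraint can be sketched in a few lines of Python (a toy model, not a proposed API). `ValueType`, `select_bank`, `lower_pointers_early`, and the `PTR`/`GPR` bank names are hypothetical; the sketch only shows that a bank-selection step can honor pointers if, and only if, the translation has not already erased pointer-ness and address space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueType:
    bits: int
    is_pointer: bool = False
    addr_space: int = 0


def select_bank(ty, has_pointer_bank):
    """Pick a register bank; pointers get their own bank only where one exists."""
    if ty.is_pointer and has_pointer_bank:
        return "PTR"
    return "GPR"


def lower_pointers_early(ty):
    """What an early 'pointers are just integers' step does: keep only the size."""
    return ValueType(ty.bits)


fat_ptr = ValueType(bits=256, is_pointer=True, addr_space=1)

# Distinction preserved until bank selection: each target gets what it needs.
conventional = select_bank(ValueType(64, is_pointer=True), has_pointer_bank=False)
with_ptr_bank = select_bank(fat_ptr, has_pointer_bank=True)

# Distinction dropped during translation: the information cannot be recovered.
lost = select_bank(lower_pointers_early(fat_ptr), has_pointer_bank=True)
```

On a conventional target the pointer simply lands in the integer bank, so keeping the distinction costs nothing there, while `lost` ends up in `GPR` even though the target has a pointer bank.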

Jeroen Dobbelaere via llvm-dev

Nov 19, 2015, 11:13:42 AM

to David Chisnall, Quentin Colombet, llvm...@lists.llvm.org

> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev...@lists.llvm.org] On Behalf Of David Chisnall via llvm-dev
[..]
>
> As long as the pointer vs integer distinction is preserved until the RegBankSelect stage, then that will work for us.

> The problem with the...
[..]

+1

Greetings,

Jeroen Dobbelaere

Quentin Colombet via llvm-dev

Nov 19, 2015, 12:49:43 PM

to David Chisnall, llvm-dev

Hi David,

I must be missing something, but I don’t get what the problem is with lowering the pointer to an actual integer.
As far as I can tell, what you want is to do some operations with some integers. The fact that those are used as pointers or integers is orthogonal IMO.
What you really want is to make the best use of your instruction set, meaning that if computing some pointer operations on the integer ISA is more efficient, and vice-versa, this is what we want to do.

The address space information is only relevant when you actually access the address, i.e., on memory operation, right?

What am I missing?

Cheers,
-Quentin

David Chisnall via llvm-dev

Nov 19, 2015, 12:50:59 PM

to Quentin Colombet, llvm-dev

On 19 Nov 2015, at 17:49, Quentin Colombet <qcol...@apple.com> wrote:
>
> I must miss something, but I don’t get what is the problem of lower the pointer to actual integer.

Pointers, in our architecture, are not integers.

Quentin Colombet via llvm-dev

Nov 19, 2015, 1:07:59 PM

to David Chisnall, llvm-dev

> On Nov 19, 2015, at 9:50 AM, David Chisnall <David.C...@cl.cam.ac.uk> wrote:
>
> On 19 Nov 2015, at 17:49, Quentin Colombet <qcol...@apple.com> wrote:
>>
>> I must miss something, but I don’t get what is the problem of lower the pointer to actual integer.
>
> Pointers, in our architecture, are not integers.

Thanks for the clarifications.

So what you’re saying is that an inttoptr instruction is not a no-op on your architecture, is that right?
Or is it a no-op only if the consumer of the pointer value can be executed on the pointer register bank?

Don’t know if that helps, but note that the registers are not typed, they just have size. The operations are typed.

I am trying to understand the constraint to see how that would fit in the framework. That being said, anything that you could do in SDag should be possible as well in the new framework.

Cheers,
-Quentin

David Chisnall via llvm-dev

Nov 19, 2015, 1:53:46 PM

to Quentin Colombet, llvm-dev

On 19 Nov 2015, at 18:07, Quentin Colombet <qcol...@apple.com> wrote:
>
>
>> On Nov 19, 2015, at 9:50 AM, David Chisnall <David.C...@cl.cam.ac.uk> wrote:
>>
>> On 19 Nov 2015, at 17:49, Quentin Colombet <qcol...@apple.com> wrote:
>>>
>>> I must miss something, but I don’t get what is the problem of lower the pointer to actual integer.
>>
>> Pointers, in our architecture, are not integers.
>
> Thanks for the clarifications.
>
> So what you’re saying is that a inttoptr instruction is not a no-op on your architecture, is that right?

Correct.

> Or it can be a no-op only if the consumer of the pointer values can be done on the pointer register bank?

Yes (in some compilation models, we support 64-bit integers as pointers and 256-bit / 128-bit fat pointers, with the integer values being implicitly checked against a large region identified by one of the fat pointer registers, giving a coarse-grained sandbox that can communicate with the outside world via bounded pointers).

We currently have entirely separate fat pointer and integer register banks, though we’re investigating a mode where we’ll overlay the two on the same register file (though they’ll likely treat some things as sub-registers).

It also means that address space casts are not a no-op for us, which I believe is something that we share with some GPU ISAs (e.g. a 32-bit [or 16-bit] local pointer cast to a 64-bit global address space is not a simple sign/zero extension and so must be handled differently to an i32 -> i64 translation).

> Don’t know if that helps, but note that the registers are not typed, they just have size. The operations are typed.

That’s fine, once you’ve assigned values to register banks. The issue is ensuring that we’re not throwing away the information that we need to do that assignment in the translation from LLVM IR to the new machine IR (i.e. which values are pointers, and which address space they are in).

> I am trying to understand the constraint to see how that would fit in the framework. That being said, anything that you could do in SDag should be possible as well in the new framework.

We currently add several new nodes to SDag: INTTOPTR, PTRTOINT, and PTRADD, and a new iFATPTR MVT. The last is somewhat problematic, as we really want to have iFATPTR128 and iFATPTR256 (and, potentially, iFATPTR64 for an IoT/embedded variant).

Philip Reames via llvm-dev

Nov 19, 2015, 2:28:10 PM

to Quentin Colombet, David Chisnall, llvm-dev

Having a distinction between integers and pointers preserved into MI
would be quite useful for garbage collection as well. We currently have
a lowering phase (RewriteStatepointsForGC) which effectively rewrites
operations of references (i.e. managed pointers) so that they can be
treated as integers throughout the rest of the pipeline. If we could
retain the distinction further back through the backend, it would both
simplify a lot of code and likely let us generate better code (spilling,
etc..) around safepoints.

Philip

David Chisnall via llvm-dev

Nov 19, 2015, 2:35:33 PM

to Philip Reames, llvm-dev

On 19 Nov 2015, at 19:27, Philip Reames <list...@philipreames.com> wrote:
>
> Having a distinction between integers and pointers preserved into MI would be quite useful for garbage collection as well. We currently have a lowering phase (RewriteStatepointsForGC) which effectively rewrites operations of references (i.e. managed pointers) so that they can be treated as integers throughout the rest of the pipeline. If we could retain the distinction further back through the backend, it would both simply a lot of code and likely let us generate better code (spilling, etc..) around safepoints.

It’s also important for doing a load of CFI things correctly (see: https://www.ics.uci.edu/~perl/ccs15_stackdefiler.pdf)

Quentin Colombet via llvm-dev

Nov 19, 2015, 3:19:41 PM

to David Chisnall, llvm-dev

Hi Philip and David,

Thanks for the input; we will see how the design can accommodate that when we prototype. Having some inttoptr, etc. kind of MachineInstr with additional information like address space sounds reasonable, and that should fit your constraints.

Anyway, I may ping you to check whether the translation is flexible enough to do what you want, but of course, I invite you to actively review all the incoming patches related to GISel :).

Cheers,
-Quentin

Eric Christopher via llvm-dev

Nov 19, 2015, 3:46:23 PM

to Quentin Colombet, llvm-dev

Hi Quentin,

> *** Goals ***
>
> The high level goals of the new instruction selector are:
> - Global instruction selector.
> - Fast instruction selector.

Are these separate or the same? It reads like two instruction selectors at the moment.

> - Shared code path for fast and good instruction selection.

But then I'm not sure starting here.

> - IR that represents ISA concepts better.
> - More flexible instruction selector.

Some definitions here would be good.

> - Easier to maintain/understand framework, in particular legalization.
> - Self contained machine representation, no back links to LLVM IR.
> - No change to LLVM IR.

These sound great. Would be good to get the assumptions of the legalization pass written down more explicitly as you go through this.

> *** Proposed Approach ***
>
> In this section, I describe the approach I plan to pursue in the prototype and the roadmap to get there. The final design will flow out of it.
>
> For this prototype, we purposely exclude any work to improve or use TableGen or

I'm getting the idea that you really don't want to work on TableGen? ;)

> ** Implications **
>
> As part of the bring-up of the prototype, we need to extend some of the core MachineInstr-level APIs:
> - Need to remember FastMath flags for each MachineInstr.

Not orthogonal to this proposal? I don't mind lumping it in as being able to do this is probably a good goal for the prototype at least, but it seems like being able to do this is something that could be done incrementally as a separate project?

> At the end of M1, the prototype will not be able to produce code, since we would only have the beginning of the Global ISel pipeline. Instead, we will test the IRTranslator on the generic output that is produced from the tested IR.

So this would be targeting Generic MachineInstr? (Better name perhaps?). Which means that it should be serializable and testable in isolation yes?

> * Design Decisions *
>
> - The IRTranslator is a final class. Its purpose is to move away from LLVM IR to MachineInstr world [final].
> - Lower the ABI as part of the translation process [final].
>
> * Design Questions the Prototype Addresses at the End of M1 *
>
> - Handling of aggregate types during the translation.
> - Lowering of switches.
> - What about Module pass for Machine pass?

Could you elaborate a bit more here?

> - Introduce new APIs to have a clearer separation between:
>   - Legalization (setOperationAction, etc.)
>   - Cost/Combine related (isXXXFree, etc.)
>   - Lowering related (LowerFormal, etc.)
> - What is the contract with the backends? Is it still “should be able to select any valid LLVM IR”?

Probably :)

Probably :)

As far as the prototype goes, I think you also need to address a few additional things:

a) Calls

Calls are probably the most important part of any new instruction selector and lowering machinery and I think that the design of the call lowering infrastructure is going to be a critical part of evaluating the prototype. You might have meant this earlier when you said Lowering related, but I wanted to make sure to call it out explicitly.

b) Testing

It's been covered a bit before, but being able to serialize the various IR constructs and use them for testing is important. In particular, I worry about the existing MIR code as I and a few others have tried to use it for testcases and failed. I'm very interested in whatever ideas you have here, all of mine are much more invasive than I think we'd like.

Thanks for tackling this project and being willing to put this out there for discussion and feedback. I'm looking forward to the code and future design.

-eric

Quentin Colombet via llvm-dev

Nov 19, 2015, 5:26:41 PM

to Eric Christopher, llvm-dev

Hi Eric,

On Nov 19, 2015, at 12:46 PM, Eric Christopher <echr...@gmail.com> wrote:

Hi Quentin,

*** Goals ***

The high level goals of the new instruction selector are:
- Global instruction selector.
- Fast instruction selector.

Are these separate or the same? It reads like two instruction selectors at the moment.

They are the same, sorry for the confusion. This should read: we want a global and fast instruction selector where producing code quickly and producing good code exercise the same basic path in the framework. I.e., producing code fast is a trimmed-down version of producing good code. E.g., for the fast path, analyses are less precise, fewer passes are run, etc.
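To make the "same path, trimmed down" idea concrete, here is a minimal sketch, not the actual design. It assumes the pipeline is just an ordered list of pass names: IRTranslator, Legalizer, RegBankSelect, and Select are the standard passes named in this thread, while `SelectMode`, `buildPipeline`, and the optional "Combiner" pass are hypothetical names invented for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical selection modes; not an existing LLVM enum.
enum class SelectMode { Fast, Good };

// Build the ordered list of passes for a mode. Both modes walk the same
// core path; the Good mode only adds optional work on top of it.
std::vector<std::string> buildPipeline(SelectMode Mode) {
  std::vector<std::string> Passes;
  Passes.push_back("IRTranslator");
  if (Mode == SelectMode::Good)
    Passes.push_back("Combiner"); // hypothetical optional pass
  Passes.push_back("Legalizer");
  Passes.push_back("RegBankSelect");
  Passes.push_back("Select");
  return Passes;
}
```

With this shape, `buildPipeline(SelectMode::Fast)` yields only the four core passes, and the good-code configuration is a superset of the fast one rather than a separate selector.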

- Shared code path for fast and good instruction selection.

But then I'm not sure starting here.

- IR that represents ISA concepts better.
- More flexible instruction selector.

Some definitions here would be good.

For IR that represents ISA concepts better, this is in opposition to SDISel or LLVM IR. In other words, the target should be able to insert target-specific code (e.g., instruction, physical register) at any time without needing some extra crust to express that (e.g., intrinsic or custom SDNode).

By more flexible we mean that targets should be able to inject target-specific passes between the generic passes or replace those passes with their own.

- Easier to maintain/understand framework, in particular legalization.
- Self contained machine representation, no back links to LLVM IR.
- No change to LLVM IR.

These sound great. Would be good to get the assumptions of the legalization pass written down more explicitly as you go through this.

Agree.

For now, the assumption is that there are no illegal types, just illegal pairs of operation and type. But yeah, we may need to refine this when we get to legalization.
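As a rough illustration of "no illegal types, just illegal (operation, type) pairs", legality can be looked up on the pair rather than on the type alone. This is only a sketch under stated assumptions: `Op`, `Ty`, `Action`, `LegalityTable`, and `makeExampleTarget` are all hypothetical names, and the default of treating unlisted pairs as legal is an arbitrary choice made for the example.

```cpp
#include <map>
#include <utility>

enum class Op { Add, Mul, SDiv };            // hypothetical generic opcodes
enum class Ty { I8, I32, I64 };              // hypothetical scalar types
enum class Action { Legal, Widen, Libcall }; // hypothetical legalization actions

// Legality is keyed on the (operation, type) pair, never on the type alone.
class LegalityTable {
  std::map<std::pair<Op, Ty>, Action> Table;

public:
  void set(Op O, Ty T, Action A) { Table[std::make_pair(O, T)] = A; }

  // Pairs that were never registered default to Legal in this sketch.
  Action get(Op O, Ty T) const {
    auto It = Table.find(std::make_pair(O, T));
    return It == Table.end() ? Action::Legal : It->second;
  }
};

// An imaginary target: 8-bit adds must be widened and 64-bit signed
// division goes to a library call, while i8 and i64 stay usable elsewhere.
LegalityTable makeExampleTarget() {
  LegalityTable L;
  L.set(Op::Add, Ty::I8, Action::Widen);
  L.set(Op::SDiv, Ty::I64, Action::Libcall);
  return L;
}
```

Here `i64` is not an "illegal type": `get(Op::Add, Ty::I64)` is Legal even though `get(Op::SDiv, Ty::I64)` is not.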

*** Proposed Approach ***

In this section, I describe the approach I plan to pursue in the prototype and the roadmap to get there. The final design will flow out of it.

For this prototype, we purposely exclude any work to improve or use TableGen or

I'm getting the idea that you really don't want to work on TableGen? ;)

Heh, that’s more a pragmatic approach. I don’t want us to spend months improving TableGen before we start working on GlobalISel.

That being said, I think we should push as much as possible into TableGen when we are done with prototyping.

** Implications **

As part of the bring-up of the prototype, we need to extend some of the core MachineInstr-level APIs:
- Need to remember FastMath flags for each MachineInstr.

Not orthogonal to this proposal? I don't mind lumping it in as being able to do this is probably a good goal for the prototype at least, but it seems like being able to do this is something that could be done incrementally as a separate project?

That’s a good point and yes, it could be done as a separate project. The reason this is here is that if we want to experiment with combines and such in the prototype, this is the kind of information we would need.

At the end of M1, the prototype will not be able to produce code, since we would only have the beginning of the Global ISel pipeline. Instead, we will test the IRTranslator on the generic output that is produced from the tested IR.

So this would be targeting Generic MachineInstr?

Yes.

(Better name perhaps?).

Suggestions welcome :).

Which means that it should be serializable and testable in isolation yes?

Partly. The lowering of the body of the function will be generic, but the ABI lowering will be target specific and unless we create some kind of fake target, the tests need to be bound to one target.

* Design Decisions *

- The IRTranslator is a final class. Its purpose is to move away from LLVM IR to MachineInstr world [final].
- Lower the ABI as part of the translation process [final].

* Design Questions the Prototype Addresses at the End of M1 *

- Handling of aggregate types during the translation.
- Lowering of switches.
- What about Module pass for Machine pass?

Could you elaborate a bit more here?

I briefly mentioned in my reply to Marcello why this may be interesting. Let me rephrase my answer here.

Basically, we would like the MachineInstr representation to be self-contained, i.e., to get rid of those back links to LLVM IR. This implies that we would need to lower globals (maybe directly to MC) as part of the translation process. Globals are attached to the module, not to a function, so it seems to make sense to introduce a concept of MachineModulePass.

- Introduce new APIs to have a clearer separation between:
- Legalization (setOperationAction, etc.)
- Cost/Combine related (isXXXFree, etc.)
- Lowering related (LowerFormal, etc.)
- What is the contract with the backends? Is it still “should be able to select any valid LLVM IR”?

Probably :)

As far as the prototype I think you also need to address a few additional things:

a) Calls
Calls are probably the most important part of any new instruction selector and lowering machinery and I think that the design of the call lowering infrastructure is going to be a critical part of evaluating the prototype. You might have meant this earlier when you said Lowering related, but I wanted to make sure to call it out explicitly.

Yes, lowering of calls is definitely going to be evaluated in the prototype for this first milestone and the "lowering related" stuff was about that :).

(You’re good at deciphering messages ;)).

b) Testing
It's been covered a bit before, but being able to serialize and use for testing the various IR constructs is important. In particular, I worry about the existing MIR code as I and a few others have tried to use it for testcases and failed. I'm very interested in whatever ideas you have here, all of mine are much more invasive than I think we'd like.

Honestly I haven’t used the MIR testing infrastructure yet, but yes my impression was it is not really… mature. I would love to have some serialization mechanism for the MI that really works so that we can write those testcases more easily.

As for now, I haven’t looked into it, so I cannot share any ideas. I’ve discussed a bit with Matthias and he thinks that we might not be that far away from having MIR testing usable modulo bug fixes.

It would be helpful if you could file PRs on the cases where MIR was not working for you so that we can look into it at some point.

My hope is that someone could look into it before we actually need proper MI testing in place.

(Hidden message: If you are willing to work on the MIR testing or any other mechanism that would allow us to do MI serialization/deserialization, please come forward, we need you!! :D)

Indeed, for the translation part the MIR testing is not critical since we do have the LLVM IR around.

Then, if we get rid of the LLVM IR back links, serialization should become easier and maybe MIR testing could be leveraged. That being said, it may be possible that we need to start that from scratch, while taking into account what we learnt from the MIR testing.

Thanks for the feedback,

-Quentin

Eric Christopher via llvm-dev

Nov 19, 2015, 7:58:49 PM

to Quentin Colombet, llvm-dev

On Thu, Nov 19, 2015 at 2:26 PM Quentin Colombet <qcol...@apple.com> wrote:

Hi Eric,

On Nov 19, 2015, at 12:46 PM, Eric Christopher <echr...@gmail.com> wrote:

Hi Quentin,

*** Goals ***

The high level goals of the new instruction selector are:
- Global instruction selector.
- Fast instruction selector.

Are these separate or the same? It reads like two instruction selectors at the moment.

They are the same, sorry for the confusion. This reads, we want a global and fast instruction selector where producing the code fast and producing good code quality exercise the same basic path in the framework. I.e., producing code fast is a trimmed down version of producing good code. E.g., for fast, analysis are less precise, fewer passes are run, etc.

Excellent.

- Shared code path for fast and good instruction selection.

But then I'm not sure starting here.

- IR that represents ISA concepts better.
- More flexible instruction selector.

Some definitions here would be good.

For IR that represents ISA concepts better, this is in opposition to SDISel or LLVM IR. In other words, the target should be able to insert target specific code (e.g., instruction, physical register) at anytime without needing some extra crust to express that (e.g., intrinsic or custom SDNode).

I'm not sure that this represents the concepts any better. Basically it means that you have less, and easier, target-independent handling; I'm unconvinced this is that useful. Perhaps an example might help :)

By more flexible we mean that targets should be able to inject target specific passes between the generic passes or replace those passes by their own.

It'll be interesting to see how this is going to be developed and how to keep the target independence of the code generator with this new scheme. I.e. this is basically turning (in my mind) into "every backend for themselves" with very little target-independent unification. Outside of special purpose ports I don't see a lot of need for this, but we'll see. I think it's going to take some discipline to avoid the "every backend is a large C++ project that defines everything it needs custom".

- Easier to maintain/understand framework, in particular legalization.
- Self contained machine representation, no back links to LLVM IR.
- No change to LLVM IR.

These sound great. Would be good to get the assumptions of the legalization pass written down more explicitly as you go through this.

Agree.
For now, the assumptions are there are no illegal types, just illegal pair of operation and type. But yeah, we may need to refine when we get to the legalization.

Also things like canonicalization, etc. Just something to think about.

*** Proposed Approach ***

In this section, I describe the approach I plan to pursue in the prototype and the roadmap to get there. The final design will flow out of it.

For this prototype, we purposely exclude any work to improve or use TableGen or

I'm getting the idea that you really don't want to work on TableGen? ;)

Heh, that’s more a pragmatic approach. I don’t want we spend months improving TableGen before we start working on GlobalISel.
That being said, I think we should push as much thing as possible in tablegen when we are done with prototyping.

Sure.

** Implications **

As part of the bring-up of the prototype, we need to extend some of the core MachineInstr-level APIs:
- Need to remember FastMath flags for each MachineInstr.

Not orthogonal to this proposal? I don't mind lumping it in as being able to do this is probably a good goal for the prototype at least, but it seems like being able to do this is something that could be done incrementally as a separate project?

That’s a good point and yes, it could be done as a separate project. The reason why this is here is because if we want to experiment with combine and such in the prototype, this is the kind of information we would need.

Hmm, I thought you were avoiding combine? :)

At the end of M1, the prototype will not be able to produce code, since we would only have the beginning of the Global ISel pipeline. Instead, we will test the IRTranslator on the generic output that is produced from the tested IR.

So this would be targeting Generic MachineInstr?

Yes.

(Better name perhaps?).

Suggestion welcome :).

Yeah. First suggestion: Let's leave off the r ;)

Which means that it should be serializable and testable in isolation yes?

Partly. The lowering of the body of the function will be generic, but the ABI lowering will be target specific and unless we create some kind of fake target, the tests need to be bound to one target.

That's reasonable.

* Design Decisions *

- The IRTranslator is a final class. Its purpose is to move away from LLVM IR to MachineInstr world [final].
- Lower the ABI as part of the translation process [final].

* Design Questions the Prototype Addresses at the End of M1 *

- Handling of aggregate types during the translation.
- Lowering of switches.
- What about Module pass for Machine pass?

Could you elaborate a bit more here?

I have quickly mentioned in my reply to Marcello why this may be interesting. Let me rephrase my answer here.
Basically, we would like to have the MachineInstr to be self contained, i.e., get rid of those back links to LLVM IR. This implies that we would need to lower globals (maybe directly to MC) as part of the translation process. Globals are not attached to function but module, therefore it seems to make sense to introduce a concept of MachineModulePass.

*nod* I'd like to do something about the AsmPrinter anyhow.

- Introduce new APIs to have a clearer separation between:
- Legalization (setOperationAction, etc.)
- Cost/Combine related (isXXXFree, etc.)
- Lowering related (LowerFormal, etc.)
- What is the contract with the backends? Is it still “should be able to select any valid LLVM IR”?

Probably :)

As far as the prototype I think you also need to address a few additional things:

a) Calls
Calls are probably the most important part of any new instruction selector and lowering machinery and I think that the design of the call lowering infrastructure is going to be a critical part of evaluating the prototype. You might have meant this earlier when you said Lowering related, but I wanted to make sure to call it out explicitly.

Yes, lowering of calls is definitely going to be evaluated in the prototype for this first milestone and the "lowering related” stuff was about that :).
(You’re good at deciphering messages ;)).

I try. Anyhow, glad to hear about calls.

b) Testing
It's been covered a bit before, but being able to serialize and use for testing the various IR constructs is important. In particular, I worry about the existing MIR code as I and a few others have tried to use it for testcases and failed. I'm very interested in whatever ideas you have here, all of mine are much more invasive than I think we'd like.

Honestly I haven’t used the MIR testing infrastructure yet, but yes my impression was it is not really… mature. I would love to have some serialization mechanism for the MI that really work so that we can write those testcases more easily.
As for now, I haven’t looked into it, so I cannot share any ideas. I’ve discussed a bit with Matthias and he thinks that we might not be that far away from having MIR testing useable modulo bug fixes.

It would be helpful if you could file PR on the cases where MIR was not working for you so that we can look into it at some point.

My hope is that someone could look into it before we actually need a proper MI testing in place.

(Hidden message: If you are willing to work on the MIR testing or any other mechanism that would allow us to do MI serialization deserialization, please come forward, we need you!! :D)

Indeed, for the translation part the MIR testing is not critical since we do have the LLVM IR around.
Then, if we get rid of the LLVM IR back links, serialization should become easier and maybe MIR testing could be leverage. That being said, it may be possible that we need to start that from scratch, while taking into account what we learnt from the MIR testing.

Pretty much agree with this. I didn't file a bug because I wasn't sure what to say other than "this serialization wasn't useful for making test cases". Maybe you'll find it more so and we can get some best practices out of it.

Thanks!

-eric

Quentin Colombet via llvm-dev

Nov 19, 2015, 8:43:47 PM

to Eric Christopher, llvm-dev

On Nov 19, 2015, at 4:58 PM, Eric Christopher <echr...@gmail.com> wrote:

On Thu, Nov 19, 2015 at 2:26 PM Quentin Colombet <qcol...@apple.com> wrote:
Hi Eric,

On Nov 19, 2015, at 12:46 PM, Eric Christopher <echr...@gmail.com> wrote:

Hi Quentin,

*** Goals ***

The high level goals of the new instruction selector are:
- Global instruction selector.
- Fast instruction selector.

Are these separate or the same? It reads like two instruction selectors at the moment.

They are the same, sorry for the confusion. This reads, we want a global and fast instruction selector where producing the code fast and producing good code quality exercise the same basic path in the framework. I.e., producing code fast is a trimmed down version of producing good code. E.g., for fast, analysis are less precise, fewer passes are run, etc.

Excellent.

- Shared code path for fast and good instruction selection.

But then I'm not sure starting here.

- IR that represents ISA concepts better.
- More flexible instruction selector.

Some definitions here would be good.

For IR that represents ISA concepts better, this is in opposition to SDISel or LLVM IR. In other words, the target should be able to insert target specific code (e.g., instruction, physical register) at anytime without needing some extra crust to express that (e.g., intrinsic or custom SDNode).

I'm not sure that this represents the concepts any better. Basically it means that you have less and easier target independent handling, I'm unconvinced this is that useful. Perhaps an example might help :)

I don’t have an example off hand, but basically, any time we have to create a custom SDNode, it is useless.

Another thing, which I didn’t call out before because it is a strong statement and I don’t want to commit to it for now, is that we can have a better estimate of register pressure and of things like choosing addressing modes, since we are at a much lower level and we can directly emit the proper memory operation or look at the actual register classes.

Those are the kinds of opportunities that I envision by moving to the MachineInstr level.

By more flexible we mean that targets should be able to inject target specific passes between the generic passes or replace those passes by their own.

It'll be interesting to see how this is going to be developed and how to keep the target independentness of the code generator with this new scheme. I.e. this is basically turning (in my mind) into "every backend for themselves" with very little target independent unification. Outside of special purpose ports I don't see a lot of need for this, but we'll see. I think it's going to take some discipline to avoid the "every backend is a large C++ project that defines everything it needs custom”.

At this point, the idea is to have the standard passes shared (i.e., IRTranslator, Legalizer, RegBankSelect, and Select) and let the targets create their own passes if they want to do more stuff. Then, if we see room for generalization, we can refactor :).
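One way to picture "shared standard passes plus target passes" is a base configuration with overridable hooks. The sketch below is only an illustration under stated assumptions: the four standard pass names come from this thread, but `PipelineConfig`, `preLegalizePasses`, `regBankSelectPass`, `MyTargetConfig`, and the MyTarget pass names are hypothetical and are not meant to describe any existing LLVM interface.

```cpp
#include <string>
#include <vector>

// Hypothetical pipeline description shared by all targets.
class PipelineConfig {
public:
  virtual ~PipelineConfig() = default;

  // Hook: extra target passes to run right before the generic Legalizer.
  virtual std::vector<std::string> preLegalizePasses() const { return {}; }

  // Hook: lets a target substitute its own register bank selection pass.
  virtual std::string regBankSelectPass() const { return "RegBankSelect"; }

  // The generic skeleton; targets only fill in or swap the hooked parts.
  std::vector<std::string> build() const {
    std::vector<std::string> Passes{"IRTranslator"};
    for (const std::string &Name : preLegalizePasses())
      Passes.push_back(Name);
    Passes.push_back("Legalizer");
    Passes.push_back(regBankSelectPass());
    Passes.push_back("Select");
    return Passes;
  }
};

// An imaginary target that injects one pass and replaces another.
class MyTargetConfig : public PipelineConfig {
public:
  std::vector<std::string> preLegalizePasses() const override {
    return {"MyTargetPreLegalizeCombine"};
  }
  std::string regBankSelectPass() const override {
    return "MyTargetRegBankSelect";
  }
};
```

A target that overrides nothing gets exactly the four shared passes, which is the property that would keep the generic path the default rather than the exception.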

This is what happens right now with the IR passes that are target specific. Sometimes those get factored out like GlobalMerge.

I believe the same could happen with MachineInstr passes.

Now, regarding the standard passes themselves, I don’t see how the target independence will be different from that of our current selector. If your concern is about canonicalization, well, yes, if targets start to mix target-specific opcodes in with the generic opcodes, we may lose some of it, but, I would say, that is the point!

If targets want to mess with canonicalization, so be it. This is a problem we have with SDISel: sometimes targets fight the canonicalization and this is very hard. Now, that would be much easier :).

Anyhow, yes, this is something we need to keep an eye on, and at some point in the prototype, it will be nice to have quick targeting of the framework for other backends.

- Easier to maintain/understand framework, in particular legalization.
- Self contained machine representation, no back links to LLVM IR.
- No change to LLVM IR.

These sound great. Would be good to get the assumptions of the legalization pass written down more explicitly as you go through this.

Agree.
For now, the assumptions are there are no illegal types, just illegal pair of operation and type. But yeah, we may need to refine when we get to the legalization.

Also things like canonicalization, etc. Just something to think about.

Yeah, just mentioned canonicalization in my previous paragraph and the bottom line is that I don’t think canonicalization should be required for correctness.

*** Proposed Approach ***

In this section, I describe the approach I plan to pursue in the prototype and the roadmap to get there. The final design will flow out of it.

For this prototype, we purposely exclude any work to improve or use TableGen or

I'm getting the idea that you really don't want to work on TableGen? ;)

Heh, that’s more a pragmatic approach. I don’t want we spend months improving TableGen before we start working on GlobalISel.
That being said, I think we should push as much thing as possible in tablegen when we are done with prototyping.

Sure.

** Implications **

As part of the bring-up of the prototype, we need to extend some of the core MachineInstr-level APIs:
- Need to remember FastMath flags for each MachineInstr.

Not orthogonal to this proposal? I don't mind lumping it in as being able to do this is probably a good goal for the prototype at least, but it seems like being able to do this is something that could be done incrementally as a separate project?

That’s a good point and yes, it could be done as a separate project. The reason why this is here is because if we want to experiment with combine and such in the prototype, this is the kind of information we would need.

Hmm, I thought you were avoiding combine? :)

Heh, figured someone may want to try it during the prototype timeframe :), though what interests me here is to check how easy it is to propagate this kind of information while going for the self-contained IR.

At the end of M1, the prototype will not be able to produce code, since we would only have the beginning of the Global ISel pipeline. Instead, we will test the IRTranslator on the generic output that is produced from the tested IR.

So this would be targeting Generic MachineInstr?

Yes.

(Better name perhaps?).

Suggestion welcome :).

Yeah. First suggestion: Let's leave off the r ;)

Gene.ic :P

Which means that it should be serializable and testable in isolation yes?

Partly. The lowering of the body of the function will be generic, but the ABI lowering will be target specific and unless we create some kind of fake target, the tests need to be bound to one target.

That's reasonable.

The fake target or be bound to one target?

Fingers crossed!

Thanks for the additional feedback, I think it really helps to call all that out!

Q.

Eric Christopher via llvm-dev

Nov 19, 2015, 8:57:31 PM

to Quentin Colombet, llvm-dev

- Shared code path for fast and good instruction selection.

But then I'm not sure starting here.

- IR that represents ISA concepts better.
- More flexible instruction selector.

Some definitions here would be good.

For IR that represents ISA concepts better, this is in opposition to SDISel or LLVM IR. In other words, the target should be able to insert target specific code (e.g., instruction, physical register) at anytime without needing some extra crust to express that (e.g., intrinsic or custom SDNode).

I'm not sure that this represents the concepts any better. Basically it means that you have less and easier target independent handling, I'm unconvinced this is that useful. Perhaps an example might help :)

I don’t have an example off hand, but basically, any time we have to create a custom SDNode, it is useless.
Another thing, that I didn’t call out before because it is a strong statement and I don’t want to commit on that for now, is that we can have a better estimate for register pressure and thing like choosing addressing mode, since we are at a much lower level and that we can directly emit the proper memory operation or look at the actual register classes.

Those are the kind of opportunities that I envision by moving to MachineInstr level.

I guess it'll depend on how we construct the generic MIR. I'm not seeing it, but I'll hope for some magic :)

- Easier to maintain/understand framework, in particular legalization.
- Self contained machine representation, no back links to LLVM IR.
- No change to LLVM IR.

These sound great. Would be good to get the assumptions of the legalization pass written down more explicitly as you go through this.

Agree.
For now, the assumptions are there are no illegal types, just illegal pair of operation and type. But yeah, we may need to refine when we get to the legalization.

Also things like canonicalization, etc. Just something to think about.

Yeah, just mentioned canonicalization in my previous paragraph and the bottom line is that I don’t think canonicalization should be required for correctness.

Well, I mean things like "target A wants op * 2, target B wants op << 1" as far as canonicalization. I'm meaning it from a "how we look at code generation" level, nothing more. I'm also not too fussed here. We'll see how it comes out.

Suggestion welcome :).

Yeah. First suggestion: Let's leave off the r ;)

Gene.ic :P

Hahahahahaha.

Which means that it should be serializable and testable in isolation yes?

Partly. The lowering of the body of the function will be generic, but the ABI lowering will be target specific and unless we create some kind of fake target, the tests need to be bound to one target.

That's reasonable.

The fake target or be bound to one target?

Bound to one target if necessary. I like that we have some code generation tests that are "every target can handle this", but others are "let's see what we get on target X with this input". Both are valuable, but I don't see a need to construct a fake target for anything.

-eric

David Chisnall via llvm-dev

Nov 20, 2015, 4:39:21 AM

to Quentin Colombet, llvm-dev

On 20 Nov 2015, at 01:43, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:
>
>>> By more flexible we mean that targets should be able to inject target specific passes between the generic passes or replace those passes by their own.
>>
>> It'll be interesting to see how this is going to be developed and how to keep the target independentness of the code generator with this new scheme. I.e. this is basically turning (in my mind) into "every backend for themselves" with very little target independent unification. Outside of special purpose ports I don't see a lot of need for this, but we'll see. I think it's going to take some discipline to avoid the "every backend is a large C++ project that defines everything it needs custom”.
>
> At this point, the idea is to have the standard passes shared (i.e., IRTranslator, Legalizer, RegBankSelect, and Select) and let the targets create their own pass if they want to do more stuff. Then, if we see room for generalization, we can refactor :).

To give a concrete example of this:

Currently, SelectionDAG has completely generic infrastructure for expanding unaligned loads and stores into sequences of aligned loads and masks. Unfortunately, the interface is entirely push from the generic side, not pull from the target side. We hit a problem where we needed different handling of unaligned loads and stores based on the address space of the base pointer. We ended up having to duplicate a load of the SelectionDAG logic in the back end. Eventually, extending our CPU to support unaligned loads and stores proved less effort than trying to get SelectionDAG to play nicely with our constraints.

With the new design, I’d imagine that there’d be a generic ExpandUnalignedLoad function in the supporting library that any target could simply use for the places where it makes sense. On MIPS, for example, the load-word-left and load-word-right instructions need some special handling and allow you to generate quite efficient code, but other forms of unaligned load and store may want to be handled generically. Being able to have the targets choose at a fine granularity by explicitly calling into the generic code for the functionality that they need is likely to be a lot better than having to provide a load of predicates up-front and hoping that the ones that the generic infrastructure asks for match the ones that you want (currently, there’s no way to tell SelectionDAG that you want different handling for things with pointers in different address spaces and no way to specify Custom for all address spaces and then call back into the generic behaviour if you want to use it for a subset).
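A rough sketch of that pull model, under stated assumptions: instructions are modelled as plain strings, `expandUnalignedLoad` stands in for the imagined generic library helper, and `lowerUnalignedLoad`, the address space number, and the `mytarget.special_load` mnemonic are all hypothetical. The generic expansion shown here uses byte loads combined with shifts and ors, which is just one possible strategy chosen to keep the example short.

```cpp
#include <string>
#include <vector>

using InstSeq = std::vector<std::string>;

// Generic library helper (hypothetical): expand an unaligned load of
// NumBytes into byte loads that are shifted and or'ed together.
InstSeq expandUnalignedLoad(unsigned NumBytes) {
  InstSeq Seq;
  for (unsigned I = 0; I < NumBytes; ++I) {
    Seq.push_back("load8 +" + std::to_string(I));
    if (I > 0) {
      Seq.push_back("shl " + std::to_string(8 * I));
      Seq.push_back("or");
    }
  }
  return Seq;
}

// Target-side lowering (hypothetical): the target decides per address
// space, and explicitly calls the generic code only where it wants it.
InstSeq lowerUnalignedLoad(unsigned AddrSpace, unsigned NumBytes) {
  if (AddrSpace == 1) // imaginary address space needing special handling
    return {"mytarget.special_load"};
  return expandUnalignedLoad(NumBytes);
}
```

The point of the shape is that the decision lives in target code: nothing in the generic helper needs to know that address spaces exist, and the target does not have to describe its constraints up front through predicates.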

There’s also the issue that, because SelectionDAG is a push model, it’s very easy to get into a situation (especially with the set_cc / br_cc families) where you do a transform, SelectionDAG does the inverse transform, then calls back into the back end, which redoes the transform, which SelectionDAG undoes, and so on. I think everyone who has worked on any back end has encountered this at least once (often identified with ‘why is this one test in the test suite running forever?’).

A few other things that come to mind as being easier to generalise in this model:

- Constant island generation (the ARM constant islands pass has been copied around the place a few times)

- The expansions for ll/sc atomics (we currently do this in LLVM IR, which is the wrong place, because the right place is MIR, which doesn’t exist yet)

I’m sure that there are others.

Daniel Sanders via llvm-dev

Nov 20, 2015, 9:53:30 AM

to Quentin Colombet, llvm-dev

Hi,

I haven't had a chance to read all of this yet, but one minor thing occurred to me during your presentation that I want to mention. At one point you mentioned deleting all the bitcast instructions since they're equivalent to nops, but this isn't always true.

The http://llvm.org/docs/LangRef.html definition of the bitcast instruction includes this sentence:

The conversion is done as if the value had been stored to memory and read back as type ty2.

For big-endian MSA, this is equivalent to a shuffling of the bits in the register because endianness only changes the byte order within each element. The order of the elements is unaffected by endianness. IIRC, big-endian NEON is the same way.

Quentin Colombet via llvm-dev

Nov 20, 2015, 12:14:56 PM

to Daniel Sanders, llvm-dev

Hi Daniel,

Thanks for pointing that out.

Cheers,

-Quentin

Quentin Colombet via llvm-dev

Nov 20, 2015, 12:16:32 PM

to David Chisnall, llvm-dev

> On Nov 20, 2015, at 1:39 AM, David Chisnall <David.C...@cl.cam.ac.uk> wrote:
>
> On 20 Nov 2015, at 01:43, Quentin Colombet via llvm-dev <llvm...@lists.llvm.org> wrote:
>>
>>>> By more flexible we mean that targets should be able to inject target specific passes between the generic passes or replace those passes by their own.
>>>
>>> It'll be interesting to see how this is going to be developed and how to keep the target independentness of the code generator with this new scheme. I.e. this is basically turning (in my mind) into "every backend for themselves" with very little target independent unification. Outside of special purpose ports I don't see a lot of need for this, but we'll see. I think it's going to take some discipline to avoid the "every backend is a large C++ project that defines everything it needs custom”.
>>
>> At this point, the idea is to have the standard passes shared (i.e., IRTranslator, Legalizer, RegBankSelect, and Select) and let the targets create their own pass if they want to do more stuff. Then, if we see room for generalization, we can refactor :).
>
> To give a concrete example of this:
>
> Currently, SelectionDAG has completely generic infrastructure for expanding unaligned loads and stores into sequences of aligned loads and masks. Unfortunately, the interface is entirely push from the generic side, not pull from the target side. We hit a problem where we needed different handling of unaligned loads and stores based on the address space of the base pointer. We ended up having to duplicate a load of the SelectionDAG logic in the back end. Eventually, extending our CPU to support unaligned loads and stores proved less effort than trying to get SelectionDAG to play nicely with our constraints.
>
> With the new design, I’d imagine that there’d be a generic ExpandUnalignedLoad function in the supporting library that any target could simply use for the places where it makes sense.

That is the idea behind the LegalizationKit.

> On MIPS, for example, the load-word-left and load-word-right instructions need some special handling and allow you to generate quite efficient code, but other forms of unaligned load and store may want to be handled generically. Being able to have the targets choose at a fine granularity by explicitly calling into the generic code for the functionality that they need is likely to be a lot better than having to provide a load of predicates up-front and hoping that the ones that the generic infrastructure asks for match the ones that you want (currently, there’s no way to tell SelectionDAG that you want different handling for things with pointers in different address spaces and no way to specify Custom for all address spaces and then call back into the generic behaviour if you want to use it for a subset).
>
> There’s also the issue that, because SelectionDAG is a push model, it’s very easy to get into a situation (especially with the set_cc / br_cc families) where you do a transform, SelectionDAG does the inverse transform, then calls back into the back end, which redoes the transform, which SelectionDAG undoes, and so on. I think everyone who has worked on any back end has encountered this at least once (often identified with ‘why is this one test in the test suite running forever?’).
>
> A few other things that come to mind as being easier to generalise in this model:
>
> - Constant island generation (the ARM constant islands pass has been copied around the place a few times)
>
> - The expansions for ll/sc atomics (we currently do this in LLVM IR, which is the wrong place, because the right place is MIR, which doesn’t exist yet)
>
> I’m sure that there are others.

Agreed, for instance, we have the load/store optimizers for both ARM and AArch64.

Thanks for the detailed example.
Q.

Quentin Colombet via llvm-dev

Nov 20, 2015, 12:17:37 PM

to Eric Christopher, llvm-dev

Agreed!

Thanks Eric!

Q.

Hal Finkel via llvm-dev

Nov 26, 2015, 3:58:32 PM

to Quentin Colombet, llvm-dev

Hi Quentin,

First, thanks a lot for working on this! This is obviously a really-important problem.

One thought:

+ /// *** This is to support:
+ /// *** Self contained machine representation, no back links to LLVM IR.
+ /// Import the attribute from the IR Function.
+ AttributeSet AttributeSets; ///< Parameter attributes

I fully support better modeling of functions without ties to IR-level functions. This will allow very-late outlining, multiversioning, etc., and there are good use cases for these things. That having been said, I think we should have a narrower scope for severing MI <-> IR ties, because at least one important link will continue to exist: MMOs used to provide access to IR-level alias analysis. This is critical to good instruction scheduling, memory-access merging, etc. and replicating AA at the MI level is not feasible.

-Hal

----- Original Message -----

> Hi,

> *** Context ***

> *** Feedback Invite ***

> The prototype will answer critical design questions (see “Design
> Questions the Prototype Addresses at the End of M1” for examples)
> before the actual design of Global ISel is finalized, but it cannot
> cover everything.
> Specifically we will *not* look into improving TableGen or reuse
> InstCombine (see “Proposed Approach” for the rationale). Please let
> me know if you see any issue with that.

> There is also basic ground work needed to prepare for Global ISel and
> I need to extend the core MachineInstr-level APIs as explained
> during the talk. For this, I prepared sketches of patches to
> illustrate them and describe the details in the “Implications”
> section below. Please have a look at the patches to have a better
> idea of the expected impact.

> If there is anything else you want to discuss related to Global ISel
> feel free to reach me. In particular, several people expressed their
> interests during the LLVM Dev Meeting in contributing to the
> project. Let me know what is your area of interest, so that we can
> coordinate our efforts.
> Anyhow, please add [GlobalISel] in the subject line to help
> categorizing the emails.

> *** Goals ***

> The high level goals of the new instruction selector are:
> - Global instruction selector.
> - Fast instruction selector.

> - Shared code path for fast and good instruction selection.

> - IR that represents ISA concepts better.
> - More flexible instruction selector.

> - Easier to maintain/understand framework, in particular
> legalization.
> - Self contained machine representation, no back links to LLVM IR.
> - No change to LLVM IR.

> Note: The goals are common to all targets. In particular, we do not
> intend to work on target-specific features for the prototype.
> The bottom line is please make sure those goals are compatible with
> what you want to achieve for your target, even if your requirement
> does not get listed here.

> *** Proposed Approach ***

> In this section, I describe the approach I plan to pursue in the
> prototype and the roadmap to get there. The final design will flow
> out of it.

> For this prototype, we purposely exclude any work to improve or use
> TableGen or InstCombine [final]. We will keep in mind however, that
> some of the C++ code we write will be table-generated at some point.
> The rationale is that we do not want to lay down a new
> TableGen/InstCombine infrastructure before being able to work on the
> ISel framework itself.

> The prototype vehicle will be AArch64 . None of the changes for

> ** Implications **

> ** Milestone 1 **

> * Design Decisions *

> - The IRTranslator is a final class. Its purpose is to move away from
> LLVM IR to MachineInstr world [final] .
> - Lower the ABI as part of the translation process [final] .

> * Design Questions the Prototype Addresses at the End of M1 *

> - Handling of aggregate types during the translation.
> - Lowering of switches.
> - What about Module pass for Machine pass?

> - Introduce new APIs to have a clearer separation between:
> - Legalization (setOperationAction, etc.)
> - Cost/Combine related (isXXXFree, etc.)
> - Lowering related (LowerFormal, etc.)
> - What is the contract with the backends? Is it still “should be able
> to select any valid LLVM IR”?

> Thanks,

> -Quentin

> --
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Krzysztof Parzyszek via llvm-dev

Nov 30, 2015, 10:30:56 AM

to llvm...@lists.llvm.org

On 11/19/2015 6:58 PM, Eric Christopher via llvm-dev wrote:
> It'll be interesting to see how this is going to be developed and how to
> keep the target independentness of the code generator with this new
> scheme. I.e. this is basically turning (in my mind) into "every backend
> for themselves" with very little target independent unification.

I really don't mind the "every backend for themselves" approach. The
instruction selection pass is about as target-specific as a common pass
can get, and the more work the generic code tries to do, the more
potential it has to be inflexible. This is not to say that a generic
code will necessarily be bad, but that a target-centric approach has a
better chance of working out better, even if it means that more work is
required to implement instruction selection for a new target.

As someone mentioned in another email, the canonicalization currently
done in the DAG combiner has a tendency to interfere with what
individual targets may prefer. One example of it that I remember for
Hexagon was that the LLVM IR had a combination of shifts left and right
to extract a bitfield from a longer integer. Hexagon has an instruction
to do that and it's quite simple to map the shifts into that
instruction. The combiner, however, would fold the shifts leaving only
the minimum sequence of operations necessary to get the bitfield. This
seems to be better from the generic point of view, but it makes it
practically impossible for us to match it to the "extract" instruction,
and in practice the code turns out to be worse. This is the only reason
why we have the HexagonGenExtract pass---we detect the patterns in the
LLVM IR and generate "extract" intrinsics before the combiner mangles
them up into unrecognizable forms. The same goes for replacing ADD with
OR when the bits in the operands do not overlap. We have code that
specifically undoes that, since for us, if the original code had an ADD,
it is pretty much always better if it remains an ADD.
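To make the two canonicalizations mentioned above concrete, here is a small standalone C++ sketch (plain integer arithmetic, not LLVM code). `extractViaShifts` is the shift-left/shift-right shape that maps easily onto an "extract" instruction; `extractViaMask` is only an assumed example of what a folded form could look like, not the DAG combiner's actual output; `addIsOr` states the condition under which ADD and OR are interchangeable. All three names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Extract `width` bits starting at bit `offset` using a shift-left followed by
// a logical shift-right, i.e. the pattern described for the original IR.
inline std::uint32_t extractViaShifts(std::uint32_t x, unsigned offset,
                                      unsigned width) {
  assert(width >= 1 && offset + width <= 32 && "bitfield must fit in 32 bits");
  return (x << (32 - offset - width)) >> (32 - width);
}

// An equivalent shift-and-mask form. This is an assumed illustration of a
// generically "simpler" sequence, not what the combiner necessarily emits.
inline std::uint32_t extractViaMask(std::uint32_t x, unsigned offset,
                                    unsigned width) {
  assert(width >= 1 && offset + width <= 32 && "bitfield must fit in 32 bits");
  const std::uint32_t mask = (width == 32) ? ~0u : ((1u << width) - 1u);
  return (x >> offset) & mask;
}

// ADD and OR give the same result exactly when no bit position is set in both
// operands, because then no carry can be generated.
inline bool addIsOr(std::uint32_t a, std::uint32_t b) {
  return (a & b) == 0;
}
```

Both extract forms compute the same value, which is why a generic combiner is free to pick either one; the complaint in the message is that the form it picks may be the one a given target cannot match.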

There were cases in the past when we had to disable parts of
CodeGenPrepare, or else it would happily promote i32 into i64 where it
wasn't strictly necessary. I64 is a legal type on Hexagon, but it uses
pairs of registers which, in practical terms, means that our register
set is cut by half when 64-bit values are used.

On the other hand, having a relatively simple, generic IR makes it
easier to simplify code that is no longer subjected to the LLVM IR's
constraints (e.g. getelementptr expressed as +/*, etc.). Hexagon has a
lot of very specific complex/compound instructions and a given code can
be written in many different ways. This makes it harder to optimize
code after the specific instructions have been selected. For example, a
pass that would try to simplify arithmetic code would need to deal with
the dozens of variants of add/multiplication instructions, instead of
simply looking at some generic GADD/GMPY.

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Hal Finkel via llvm-dev

Nov 30, 2015, 1:08:15 PM

to Krzysztof Parzyszek, llvm...@lists.llvm.org

I sympathize with this, but this is not uniquely a backend consideration. Even though we generally canonicalize these patterns at the IR level in a roughly consistent way, if the backend is not really robust in its matching logic, you'll miss a lot just from input IR differences. There's ~1K lines of code in PPCISelDAGToDAG.cpp (look for BitPermutationSelector) to match these in a robust way, and I don't see any good way around that.

> The same goes for replacing ADD
> with
> OR when the bits in the operands do not overlap.

This is true for other targets too, at least when the ADD is part of an addressing expression. PowerPC, for example, has code in various places to recognize ORs, in combination with some known-bits information, as surrogates for ADDs. It seems like, globally, we could do a better job here.

> We have code that
> specifically undoes that, since for us, if the original code had an
> ADD,
> it is pretty much always better if it remains an ADD.
>
> There were cases in the past when we had to disable parts of
> CodeGenPrepare, or else it would happily promote i32 into i64 where
> it
> wasn't strictly necessary. I64 is a legal type on Hexagon, but it
> uses
> pairs of registers which, in practical terms, means that our register
> set is cut by half when 64-bit values are used.

This is a common problem, but is not unique to CGP. Parts of the mid-level optimizer (e.g. IndVarSimplify) also do integer promotion in an inopportune way for some targets. Also, CGP has a lot of target hooks to turn off things like this, and this should certainly be optional (in practice, this widening is sometimes information destroying, and thus, not always reversible).

>
> On the other hand, having a relatively simple, generic IR makes it
> easier to simplify code that is no longer subjected to the LLVM IR's
> constraints (e.g. getelementptr expressed as +/*, etc.). Hexagon has
> a
> lot of very specific complex/compound instructions and a given code
> can
> be written in many different ways. This makes it harder to optimize
> code after the specific instructions have been selected. For
> example, a
> pass that would try to simplify arithmetic code would need to deal
> with
> the dozens of variants of add/multiplication instructions, instead of
> simply looking at some generic GADD/GMPY.

I have mixed feelings about this. I agree that sometimes we do too much, and sometimes we make transformations where we should be providing better analysis instead, but I think there is still a lot of value in the common backend optimizations.

In this context, it is worth thinking about the various reasons why these opportunities exist in the first place (not a complete list):

1. The process of lowering GEPs (and some other relatively-higher-level IR constructs) into instructions that represent explicitly the underlying computations being performed (accounting for target capabilities) exposes generic peephole/CSE opportunities.

2. The process of introducing target-specific nodes to represent partial behaviors introduces generic peephole/CSE opportunities. What I mean by this is, for example, when a floating-point division or sqrt is lowered to a target-specific reciprocal estimate function plus some Newton iterations, those Newton iterations are generic floating-point adds, multiplies, etc. that can be further optimized.

3. The process of type/operation legalization, especially when operations are split, promoted and/or turned into procedures involving stack loads and stores, presents opportunities not apparent at the IR level.

4. Some generic optimizations, such as store merging, need more-detailed cost information than is available through TTI, and so are done in the backend.
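As a sketch of point 2 above, the refinement that follows a reciprocal estimate can be written with nothing but ordinary multiplies and subtracts, which is what makes it visible to generic optimizations. The function below is a plain C++ illustration; `refineReciprocal` is a hypothetical name, and the initial estimate is passed in as a parameter because the target-specific estimate instruction is outside the scope of the example.

```cpp
#include <cassert>

// Refine an estimate of 1/d with Newton-Raphson steps: x' = x * (2 - d * x).
// On a real target the initial estimate would come from a target-specific
// reciprocal-estimate instruction; here it is simply an argument.
inline double refineReciprocal(double d, double estimate, int iterations) {
  assert(d != 0.0 && iterations >= 0);
  double x = estimate;
  for (int i = 0; i < iterations; ++i)
    x = x * (2.0 - d * x); // only generic floating-point operations
  return x;
}
```

Each step roughly squares the relative error, so a handful of iterations is enough in practice; the exact count a backend uses depends on the precision of its estimate instruction.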

In short, within our current framework, there are many reasons why we have DAGCombine and related code, and I think that while moving specific aspects into the targets might be good overall, leaving all of that to each target is too much. The fact that you can get a reasonable set of backend optimizations for normalish targets is a strong point of LLVM.

Thanks again,
Hal

>
> -Krzysztof
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> hosted by The Linux Foundation

--

Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Quentin Colombet via llvm-dev

Nov 30, 2015, 2:34:37 PM

to Hal Finkel, llvm-dev

Hi Hal,

The alias information is a good example of the MI to IR back links, thanks for pointing that out.

> On Nov 26, 2015, at 12:58 PM, Hal Finkel <hfi...@anl.gov> wrote:
>
> Hi Quentin,
>
> First, thanks a lot for working on this! This is obviously a really-important problem.
>
> One thought:
>
> + /// *** This is to support:
> + /// *** Self contained machine representation, no back links to LLVM IR.
> + /// Import the attribute from the IR Function.
> + AttributeSet AttributeSets; ///< Parameter attributes
>
> I fully support better modeling of functions without ties to IR-level functions. This will allow very-late outlining, multiversioning, etc., and there are good use cases for these things. That having been said, I think we should have a narrower scope for severing MI <-> IR ties, because at least one important link will continue to exist: MMOs used to provide access to IR-level alias analysis. This is critical to good instruction scheduling, memory-access merging, etc. and replicating AA at the MI level is not feasible.

Honestly, although I understand why we have the MMOs right now, I don’t think this is a clean design and I would rather have an AA working at MI level or, better, a different way of passing the information to MI.

I don’t have something in mind on how to pass the information if we choose that path, but I think it would be important to get rid of the MI -> IR link for aliasing purposes, because we end up with, IMHO, ugly code where a Machine pass patches the IR to fix the alias information. E.g., in the stack coloring pass:

    // AA might be used later for instruction scheduling, and we need it to be
    // able to deduce the correct aliasing relationships between pointers
    // derived from the alloca being remapped and the target of that remapping.
    // The only safe way, without directly informing AA about the remapping
    // somehow, is to directly update the IR to reflect the change being made
    // here.
    Instruction *Inst = const_cast<AllocaInst *>(To);
    if (From->getType() != To->getType()) {
      BitCastInst *Cast = new BitCastInst(Inst, From->getType());
      Cast->insertAfter(Inst);
      Inst = Cast;
    }

Therefore, I would prefer having the alias information expressed as something decoupled from the IR and that could be updated.

What do you think?

Cheers,

-Quentin

Hal Finkel via llvm-dev

Nov 30, 2015, 3:58:13 PM

to Quentin Colombet, llvm-dev

I certainly agree that it is ugly, and as the author of the comment you've highlighted above, I would love to have a better solution. The only problem is that I don't have a good idea how such a solution might work; prerecording all N^2 possible aliasing queries is impractical. I don't think that rewriting our current AA to work on MI, or even refactoring the current AA logic to work in terms of abstractions over both IR and MI is really possible, because we need it to function even after MI has dropped out of SSA form and PHI elimination has happened.

That having been said, prerecording query results still seems like the best solution, but it can't be naive (N^2). We have to understand the constraints that the query results will only be used to disambiguate otherwise-undecidable aliasing queries for the purpose of doing merging, scheduling, etc., and maybe within those use cases, we can constrain the problem enough to make prerecording the query results practical.

The bad news, however, is that prerecording the query results does not remove the comment in stack coloring, but just changes it to discuss operating on some MI-level data structure that is not the IR.

Thanks again,
Hal


Quentin Colombet via llvm-dev

Jan 7, 2016, 2:47:13 PM

to David Chisnall, llvm-dev

Hi David,

I had a quick look at the inttoptr/ptrtoint thing for GlobalISel and, unless I am mistaken, the semantics you want for such instructions do not match what the language reference says.

Indeed, you said that inttoptr instruction is not a no-op on your architecture, whereas the language reference says:

“The ‘inttoptr‘ instruction converts value to type ty2 by applying either a zero extension or a truncation depending on the size of the integer value. If value is larger than the size of a pointer then a truncation is done. If value is smaller than the size of a pointer then a zero extension is done. If they are the same size, nothing is done (no-op cast).”
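Read literally, the quoted rule only ever truncates or zero-extends. A minimal C++ model of that reading, operating on raw bit patterns, is sketched below; `lowBits` and `intToPtrBits` are hypothetical helper names, and widths are limited to 64 bits purely to keep the example small.

```cpp
#include <cassert>
#include <cstdint>

// Keep the low `bits` bits of `v` (bits must be in 1..64).
inline std::uint64_t lowBits(std::uint64_t v, unsigned bits) {
  assert(bits >= 1 && bits <= 64);
  return bits == 64 ? v : (v & ((std::uint64_t{1} << bits) - 1));
}

// Model of the quoted rule: interpret `value` as an intBits-wide integer and
// convert it to a ptrBits-wide pointer bit pattern.
inline std::uint64_t intToPtrBits(std::uint64_t value, unsigned intBits,
                                  unsigned ptrBits) {
  const std::uint64_t v = lowBits(value, intBits);
  if (intBits > ptrBits)
    return lowBits(v, ptrBits); // integer wider than pointer: truncation
  return v; // zero extension (upper bits already zero) or same-size no-op
}
```

Under this model there is no room for any other transformation of the bits, which is the basis of the argument that follows.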

The bottom line is that IMHO, if you rely on inttoptr/ptrtoint instructions to do the conversion from fat pointers to plain integers you are abusing the IR.

I plan to stick to the LLVM IR semantics for the generic opcodes of GlobalISel and thus, it does not seem useful to have INTTOPTR-like nodes around.

For instance, AArch64 has the TBI feature to deal more efficiently with fat pointers when accessing memory, and the masking operations are explicitly set in the LLVM IR and later combined with the memory accesses.

Anyhow, this is just a heads-up, we will see in due time what we can do here.

Cheers,

-Quentin

Quentin Colombet via llvm-dev

Jan 7, 2016, 2:58:15 PM

to Daniel Sanders, llvm-dev

Hi Daniel,

I had a quick look at the language reference for bitcast and I have a different reading than what you were pointing out.

Indeed, my take away is:

"It is always a no-op cast because no bits change with this conversion."

In other words, deleting all bitcast instructions should be fine.

My understanding of the quote you’ve highlighted is that it tells C programmers that this is like a memcpy, not a cast :).

Cheers,

-Quentin

On Nov 20, 2015, at 6:53 AM, Daniel Sanders <Daniel....@imgtec.com> wrote:

David Chisnall via llvm-dev

Jan 8, 2016, 4:34:18 AM

to Quentin Colombet, llvm-dev

On 7 Jan 2016, at 19:47, Quentin Colombet <qcol...@apple.com> wrote:
>
> Indeed, you said that inttoptr instruction is not a no-op on your architecture, whereas the language reference says:
> “The ‘inttoptr‘ instruction converts value to type ty2 by applying either a zero extension or a truncation depending on the size of the integer value. If value is larger than the size of a pointer then a truncation is done. If value is smaller than the size of a pointer then a zero extension is done. If they are the same size, nothing is done (no-op cast).”
>
> The bottom line is that IMHO, if you rely on inttoptr/ptrtoint instructions to do the conversion from fat pointers to plain integers you are abusing the IR.
>

I believe that this is somewhere where the IR specification needs to evolve. Currently, we have no in-tree architectures where pointers are not integers and so that definition is appropriate. Adding a new pair of IR instructions for integer-to-pointer and pointer-to-integer conversions and not calling them inttoptr / ptrtoint is likely to be far more confusing.

Mehdi Amini via llvm-dev

Jan 8, 2016, 11:51:22 AM

to David Chisnall, llvm-dev

> On Jan 8, 2016, at 1:34 AM, David Chisnall via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> On 7 Jan 2016, at 19:47, Quentin Colombet <qcol...@apple.com> wrote:
>>
>> Indeed, you said that inttoptr instruction is not a no-op on your architecture, whereas the language reference says:
>> “The ‘inttoptr‘ instruction converts value to type ty2 by applying either a zero extension or a truncation depending on the size of the integer value. If value is larger than the size of a pointer then a truncation is done. If value is smaller than the size of a pointer then a zero extension is done. If they are the same size, nothing is done (no-op cast).”
>>
>> The bottom line is that IMHO, if you rely on inttoptr/ptrtoint instructions to do the conversion from fat pointers to plain integers you are abusing the IR.
>>
>
> I believe that this is somewhere where the IR specification needs to evolve. Currently, we have no in-tree architectures where pointers are not integers and so that definition is appropriate. Adding a new pair of IR instructions for integer-to-pointer and pointer-to-integer conversions and not calling them inttoptr / ptrtoint is likely to be far more confusing.

I think this would deserve its own RFC email thread.

—
Mehdi

Quentin Colombet via llvm-dev

Jan 8, 2016, 12:12:37 PM

to Mehdi Amini, llvm-dev

> On Jan 8, 2016, at 8:51 AM, Mehdi Amini <mehdi...@apple.com> wrote:
>
>> On Jan 8, 2016, at 1:34 AM, David Chisnall via llvm-dev <llvm...@lists.llvm.org> wrote:
>>
>> On 7 Jan 2016, at 19:47, Quentin Colombet <qcol...@apple.com> wrote:
>>>
>>> Indeed, you said that inttoptr instruction is not a no-op on your architecture, whereas the language reference says:
>>> “The ‘inttoptr‘ instruction converts value to type ty2 by applying either a zero extension or a truncation depending on the size of the integer value. If value is larger than the size of a pointer then a truncation is done. If value is smaller than the size of a pointer then a zero extension is done. If they are the same size, nothing is done (no-op cast).”
>>>
>>> The bottom line is that IMHO, if you rely on inttoptr/ptrtoint instructions to do the conversion from fat pointers to plain integers you are abusing the IR.
>>>
>>
>> I believe that this is somewhere where the IR specification needs to evolve. Currently, we have no in-tree architectures where pointers are not integers and so that definition is appropriate. Adding a new pair of IR instructions for integer-to-pointer and pointer-to-integer conversions and not calling them inttoptr / ptrtoint is likely to be far more confusing.
>
> I think this would deserve its own RFC email thread.

+1

Q.

> —
> Mehdi

Philip Reames via llvm-dev

Jan 8, 2016, 5:41:00 PM

to Quentin Colombet, Mehdi Amini, llvm-dev

FYI, we need something very similar for GC purposes and are going to make a proposal along these lines within the next week or two. We're not yet at the point of having a "final" proposal, but we're planning on starting with some initial experimental support, prototyping on ToT, and evolving the spec language as we need to. Details to follow separately.

David, you and I should probably talk offline to make sure what we're thinking about works for you as well.


Daniel Sanders via llvm-dev

Jan 11, 2016, 10:44:05 AM

to Quentin Colombet, Tim Northover (t.p.northover@gmail.com), llvm-dev

Hi,

It was a comment by Tim that first made me aware of it (see http://lists.llvm.org/pipermail/llvm-dev/2013-August/064714.html but I think he commented on one of my patches before that).

I asked about it on llvm-dev a couple weeks later (http://lists.llvm.org/pipermail/llvm-dev/2013-August/064919.html) highlighting the contradiction and was told that 'no-op cast' referred to the lack of math rather than a requirement that zero instructions are used. It's therefore my understanding that shuffling the bits to preserve the load/store based definition isn't considered to be changing the bits.

I think the main thing the current definition is unclear on is whether it refers to the bits in a physical machine register or the bits in the LLVM-IR virtual register. Most of the time these two views are the same but this doesn't quite work for big-endian MSA/NEON. For example:

%0 = bitcast <4 x i32> <i32 1, i32 2, i32 3, i32 4> to <2 x i64>

%0 = <2 x i64> <i64 (1 << 32) | 2, i64 (3 << 32) | 4>

are equivalent to each other in LLVM-IR terms but the constants are physically laid out in MSA registers as:

0x00000004000000030000000200000001 # <4 x i32> <i32 1, i32 2, i32 3, i32 4>

0x00000003000000040000000100000002 # <2 x i64> <i64 (1 << 32) | 2, i64 (3 << 32) | 4>

and we must therefore shuffle the bits to preserve LLVM-IR's point of view.
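The example above can be reproduced with a small standalone simulation of the store/load definition. This is a sketch under stated assumptions, not backend code: registers are modelled as arrays with lane 0 in the least significant position (matching the layouts shown above), memory is big-endian within each element, and `bitcastBigEndian` and `asWords` are hypothetical helper names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Lane 0 is the least significant lane of the register image.
using V4I32 = std::array<std::uint32_t, 4>;
using V2I64 = std::array<std::uint64_t, 2>;

// bitcast <4 x i32> to <2 x i64> under the store/load definition on a
// big-endian target: store each lane in lane order, most significant byte
// first, then reload the 16 bytes as two big-endian 64-bit lanes.
inline V2I64 bitcastBigEndian(const V4I32 &in) {
  std::array<std::uint8_t, 16> mem{};
  for (std::size_t lane = 0; lane < 4; ++lane)
    for (std::size_t byte = 0; byte < 4; ++byte)
      mem[lane * 4 + byte] =
          static_cast<std::uint8_t>(in[lane] >> (8 * (3 - byte)));

  V2I64 out{};
  for (std::size_t lane = 0; lane < 2; ++lane)
    for (std::size_t byte = 0; byte < 8; ++byte)
      out[lane] = (out[lane] << 8) | mem[lane * 8 + byte];
  return out;
}

// View a <2 x i64> register image as four 32-bit words, least significant
// word first, so it can be compared with the <4 x i32> register image.
inline V4I32 asWords(const V2I64 &v) {
  return {static_cast<std::uint32_t>(v[0]),
          static_cast<std::uint32_t>(v[0] >> 32),
          static_cast<std::uint32_t>(v[1]),
          static_cast<std::uint32_t>(v[1] >> 32)};
}
```

Feeding in lanes 1, 2, 3, 4 gives the two 64-bit lanes (1 << 32) | 2 and (3 << 32) | 4, and viewing that result as 32-bit words yields 2, 1, 4, 3: each adjacent pair of words is swapped relative to the input register image, which is the shuffle in question.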

Quentin Colombet via llvm-dev

Jan 11, 2016, 12:22:38 PM

to Daniel Sanders, llvm-dev

Hi Daniel,

Thanks for the pointers, I wasn’t aware of the second thread you’ve mentioned.

I may be wrong but I think LLVM-IR optimizations really treat bitcasts as no-op casts, in the sense that no instructions are required.

Is there anyone that could chime in on that?

However, it seems SelectionDAG sticks to the load/store semantics:

"BITCAST - This operator converts between integer, vector and FP values, as if the value was stored to memory with one type and loaded from the same address with the other type (or equivalently for vector format conversions, etc)."

I am fine with treating bit casts as equivalent store/load pairs in GISel, I just want to be sure we do not have a semantic gap between the LLVM-IR and the backend if we do.

Thanks,

-Quentin

Daniel Sanders via llvm-dev

Jan 12, 2016, 8:23:18 AM

to Quentin Colombet, llvm-dev

Hi,

I haven't found much time to look into the LLVM-IR-level optimizations yet so I'm not sure how they handle bitcasts. With that disclaimer in mind, I expect it's fine for the LLVM-IR level optimizations to handle them using either definition since they are equivalent at the LLVM-IR level. My thinking is that LLVM-IR is consistent about how virtual bits are assigned to types and that non-zero instruction nops arise when there is inconsistency.

At the LLVM-IR level, bits 0-127 of <4 x i32> map directly onto bits 0-127 of <2 x i64> using the identity map. It's therefore ok to interpret such bitcasts as zero-instruction no-ops. As far as I can tell, LLVM-IR has been defined such that the identity map can be used for bitcasts between all same-sized types, and also such that bitcasting between different-sized types is invalid.

Similarly, most targets have a single mapping of virtual bit numbers to physical bit numbers for each size that is applied consistently when mapping a type to memory. For example 32-bits map like so:

Little Endian Targets: virtual register bits {0..7,8..15,16..23,24..31} map to physical memory bits {0..7,8..15,16..23,24..31}

Big Endian Targets: virtual register bits {0..7,8..15,16..23,24..31} map to physical memory bits {24..31,16..23,8..15,0..7}

regardless of whether it's a float, or an i32. We therefore need zero instructions to re-map physical memory bits for one type onto another type.

The same idea holds for physical register classes. There's a single consistent mapping from physical memory bits to physical register bits that applies for all types that can be stored in that class. As long as this is the case the load/store and zero-instruction interpretation of bitcasts are equivalent.

In the case of big-endian MSA and NEON, there isn't a single consistent mapping from physical memory bits to physical register bits so the equivalence in the two definitions breaks down:

i128: virtual register bits {0..31, 32..63, 64..95, 96..127} map to physical memory bits {96..127, 64..95, 32..63, 0..31}

<4 x i32>: virtual register bits {0..31, 32..63, 64..95, 96..127} map to physical memory bits {0..31, 32..63, 64..95, 96..127}

<2 x i64>: virtual register bits {0..31, 32..63, 64..95, 96..127} map to physical memory bits {32..63, 0..31, 96..127, 64..95}

with these inconsistent mappings we require instructions to bitcast between the types.
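One way to read the three mappings above is as permutations of 32-bit chunks. The sketch below (hypothetical names, a toy model rather than backend code) computes the chunk shuffle a bitcast would need so that every virtual chunk ends up where the destination type expects it. The shuffle is the identity only when both types share the same mapping.

```python
# Chunk i of the virtual value lives at physical chunk MAP[i] (32-bit chunks),
# transcribed from the three mappings listed above.
I128 = [3, 2, 1, 0]
V4I32 = [0, 1, 2, 3]
V2I64 = [1, 0, 3, 2]

def bitcast_shuffle(src_map, dst_map):
    """perm[d] = source physical chunk that must be moved to physical chunk d."""
    perm = [None] * len(src_map)
    for src_chunk, dst_chunk in zip(src_map, dst_map):
        perm[dst_chunk] = src_chunk
    return perm

def is_noop(src_map, dst_map):
    """A bitcast needs no instructions only if the shuffle is the identity."""
    return bitcast_shuffle(src_map, dst_map) == list(range(len(src_map)))
```

For example, `bitcast_shuffle(V4I32, V2I64)` swaps the two 32-bit halves of each 64-bit lane, which is why an instruction is needed in that case.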

I found this thinking quite difficult to explain. Does it make sense?

> I am fine with treating bit casts as equivalent store/load pairs in GISel, I just want to be sure we do not have a semantic gap between the LLVM-IR and the backend if we do.

I think a gap would arise from not having a GISel equivalent to ISD::BITCAST (gBITCAST?) available when it's necessary for correctness. However, I agree that GISel should delete bitcasts for the common case where the store/load and zero-instruction definitions are equivalent.

James Molloy via llvm-dev

Jan 12, 2016, 8:56:13 AM

to Daniel Sanders, Quentin Colombet, llvm-dev

Hi,

> I found this thinking quite difficult to explain. Does it make sense?

It might help to link to the documentation on why bitcasts are weird on big-endian NEON: http://llvm.org/docs/BigEndianNEON.html#bitconverts

Cheers,

James


Daniel Sanders via llvm-dev

Jan 12, 2016, 9:38:18 AM

to James Molloy, Quentin Colombet, llvm-dev

Thanks, I didn't know about that page. It's a much clearer explanation of why the backend chooses the code it does. However, there's a bit I'm trying to explain that isn't covered on that page. I'm trying to explain why the seemingly contradictory statements at http://llvm.org/docs/LangRef.html#bitcast-to-instruction don't actually contradict each other (even for big-endian NEON/MSA) while we're at the LLVM-IR level and why it's safe for LLVM-IR-level optimizations to use the zero-instruction definition despite the backend relying on the store/load definition. It boils down to both definitions being equivalent until we specialize to a target, at which point the two definitions sometimes diverge. They diverge when the mapping of virtual bits to physical bits differs between LLVM-IR types.

Mehdi Amini via llvm-dev

Jan 12, 2016, 11:46:22 AM

to Daniel Sanders, llvm-dev

What happens when you cascade bitcast?

Are these sequences all equivalent at the IR level (i.e. do they reference the same byte from the original i128)?

i128 => <16 x i8> => GEP 0

i128 => <2 x i64> => GEP 0 => <8 x i8> => GEP 0

i128 => <2 x i64> => GEP 0 => <2 x i32> => GEP 0 => <4 x i8> => GEP 0

—

Mehdi


James Molloy via llvm-dev

Jan 12, 2016, 11:54:29 AM

to Mehdi Amini, Daniel Sanders, llvm-dev

> i128 => <16 x i8> => GEP 0
>
> i128 => <2 x i64> => GEP 0 => <8 x i8> => GEP 0
>
> i128 => <2 x i64> => GEP 0 => <2 x i32> => GEP 0 => <4 x i8> => GEP 0

They all reference the same memory object from the same base address. If the result is loaded, the in-register contents will differ between them though (because there's a special "load a vector of this type" instruction (LD1)).

Mehdi Amini via llvm-dev

Jan 12, 2016, 12:12:52 PM

to James Molloy, llvm-dev

I meant extract_element instead of GEP (sorry for the confusion).

Sequence would be load i128, then bitcast to a vector and extract the first byte.

James explained to me on IRC why it works :)
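For readers following along, here is a rough sketch of why the three sequences agree under the store/load definition of bitcast. It models memory with Python's `int.to_bytes` and `struct`; the function name `first_byte_paths` and the test value are invented for the example, and this is an illustration of the IR-level reasoning only, not of what any backend emits.

```python
import struct

def first_byte_paths(x, byteorder):
    """Extract element 0 of the final byte vector along the three cast chains."""
    e = ">" if byteorder == "big" else "<"
    mem = x.to_bytes(16, byteorder)  # store the i128

    # i128 => <16 x i8> => element 0
    p1 = struct.unpack(e + "16B", mem)[0]

    # i128 => <2 x i64> => element 0 => <8 x i8> => element 0
    q0 = struct.unpack(e + "2Q", mem)[0]
    p2 = struct.unpack(e + "8B", struct.pack(e + "Q", q0))[0]

    # ... => <2 x i32> => element 0 => <4 x i8> => element 0
    d0 = struct.unpack(e + "2I", struct.pack(e + "Q", q0))[0]
    p3 = struct.unpack(e + "4B", struct.pack(e + "I", d0))[0]
    return p1, p2, p3

X = 0x0123456789ABCDEFFEDCBA9876543210
```

Each chain ends at the byte stored at the lowest address of the original i128, whichever byte order is chosen.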

—

Mehdi

Philip Reames via llvm-dev

Jan 12, 2016, 7:22:43 PM

to James Molloy, Daniel Sanders, Quentin Colombet, llvm-dev

I think after reading your link I'm actually more confused. This might just be a wording problem, but let me ask a couple of clarifying questions.

1) After compiling the code sequence below (from that page), does the in memory bit pattern differ? The page seemed to contradict itself.

%0 = load <4 x i32> %x
%1 = bitcast <4 x i32> %0 to <2 x i64>
store <2 x i64> %1, <2 x i64>* %y

2) If so, does this mean that performing dead-store-elimination is illegal for ARM?

3) Are loads and stores ever allowed to fault based on the in memory representation?

4) What happens if we have a load of <2xi64> following the store above and we DSE the store before forwarding its value?

Philip

Quentin Colombet via llvm-dev

Jan 13, 2016, 2:23:14 AM

to James Molloy, llvm-dev

Hi James,

I am also confused!

On Jan 12, 2016, at 4:11 PM, Philip Reames <list...@philipreames.com> wrote:

> I think after reading your link I'm actually more confused. This might just be a wording problem, but let me ask a couple of clarifying questions.
>
> 1) After compiling the code sequence below (from that page), does the in memory bit pattern differ? The page seemed to contradict itself.

+1

Thanks,

Q.

James Molloy via llvm-dev

Jan 13, 2016, 3:50:37 AM

to Quentin Colombet, llvm-dev

Hi Philip,

> I think after reading your link I'm actually more confused. This might just be a wording problem, but let me ask a couple of clarifying questions.

Sorry about that :( Every time I explain this I get slightly more embarrassed because it is indeed weird and ugly (but was certainly the least ugly solution).

> 1) After compiling the code sequence below (from that page), does the in memory bit pattern differ? The page seemed to contradict itself.

> %0 = load <4 x i32> %x
> %1 = bitcast <4 x i32> %0 to <2 x i64>
> store <2 x i64> %1, <2 x i64>* %y

Yes. The memory pattern differs. (This is the first diagram on the right at: http://llvm.org/docs/BigEndianNEON.html#bitconverts)

> 2) If so, does this mean that performing dead-store-elimination is illegal for ARM?

Yes, for vector types whose corresponding load differs from the store type.

%0 = load <4 x i32> %x

store <4 x i32> %0, <4 x i32>* %x

is still fine. I should go and check that DSE doesn't do bad things for big-endian NEON actually...

> 3) Are loads and stores ever allowed to fault based on the in memory representation?

No (thank goodness!)

> 4) What happens if we have a load of <2xi64> following the store above and we DSE the store before forwarding its value?

The store can't be DSE'd as above. But value forwarding is fine. It's fine because the IR is strongly typed - there's no way to remove that bitcast and still have the IR correctly formed. However folding bitcasts into memory operands is explicitly illegal:

%1 = bitcast <4 x i32> %x to <2 x i64>
store <2 x i64> %1, <2 x i64>* %y
=>
store <4 x i32> %x, (bitcast <2 x i64>* %y to <4 x i32>*) ; ILLEGAL!

There's a hook somewhere in CGP that disables an optimization that tries to do this.

So in IR, because it's strongly typed, there's not really many special cases or things to worry about. But in SDAG things get more difficult. SDAG is weakly typed and all bitconverts will just get blasted into oblivion, so while SDAG can merge bitconverts (bitconvert (bitconvert %x)) -> (bitconvert %x), it mustn't remove them completely.

I hope I've explained that OK. CCing Tim who can hopefully pick more holes in the explanation.

Also, could you please point me to where the documentation seems contradictory? Then I'll fix it. I wrote it for exactly this scenario!

Cheers,

James

Daniel Sanders via llvm-dev

Jan 13, 2016, 6:49:23 AM

to James Molloy, Quentin Colombet, llvm-dev

> Hi Philip,
>
> > I think after reading your link I'm actually more confused. This might just be a wording problem, but let me ask a couple of clarifying questions.
>
> Sorry about that :( Every time I explain this I get slightly more embarrassed because it is indeed weird and ugly (but was certainly the least ugly solution).
>
> > 1) After compiling the code sequence below (from that page), does the in memory bit pattern differ? The page seemed to contradict itself.
>
> > %0 = load <4 x i32> %x
> > %1 = bitcast <4 x i32> %0 to <2 x i64>
> > store <2 x i64> %1, <2 x i64>* %y
>
> Yes. The memory pattern differs. (This is the first diagram on the right at: http://llvm.org/docs/BigEndianNEON.html#bitconverts)
>
> > 2) If so, does this mean that performing dead-store-elimination is illegal for ARM?
>
> Yes, for vector types whose corresponding load differs from the store type.
>
> %0 = load <4 x i32> %x
> store <4 x i32> %0, <4 x i32>* %x
>
> is still fine. I should go and check that DSE doesn't do bad things for big-endian NEON actually...

I just had a quick look and I think we're ok for this case at least. DSE is checking that the value operand of the StoreInst is a LoadInst. In this case it will be a BitCastInst and therefore the StoreInst won't be deleted.
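To make the shape of that check concrete, here is a toy model of it. The classes and the predicate are hypothetical stand-ins written for this example, not LLVM's actual DSE code, and the extra same-pointer condition is an assumption added so the predicate is meaningful on its own.

```python
from dataclasses import dataclass

@dataclass
class Load:
    pointer: str

@dataclass
class BitCast:
    operand: object
    to_type: str

@dataclass
class Store:
    value: object
    pointer: str

def is_store_of_loaded_value(store):
    """True only if the stored value comes directly from a load of the same pointer."""
    return isinstance(store.value, Load) and store.value.pointer == store.pointer

# store <4 x i32> (load %x), %x : candidate for removal
same_type = Store(Load("%x"), "%x")
# store <2 x i64> (bitcast (load %x)), %y : value operand is a bitcast, so it is kept
via_bitcast = Store(BitCast(Load("%x"), "<2 x i64>"), "%y")
```

In this model the bitcast sits between the load and the store, so the predicate fails and the store survives.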

James Molloy via llvm-dev

Jan 13, 2016, 7:07:43 AM

to Daniel Sanders, Quentin Colombet, llvm-dev

Ok, sounds good.

Hal Finkel via llvm-dev

Jan 13, 2016, 9:34:57 AM

to James Molloy, llvm-dev

----- Original Message -----
> From: "James Molloy via llvm-dev" <llvm...@lists.llvm.org>
> To: "Daniel Sanders" <Daniel....@imgtec.com>, "Quentin Colombet" <qcol...@apple.com>
> Cc: "llvm-dev" <llvm...@lists.llvm.org>
> Sent: Wednesday, January 13, 2016 6:00:01 AM
> Subject: Re: [llvm-dev] [GlobalISel] A Proposal for global instruction selection
>
>

> Ok, sounds good.
>

Hrmm... this, however, could be considered a bug we should fix. Please add a comment in the relevant place to make sure no one "fixes" this.

-Hal

> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

--

Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel via llvm-dev

Jan 13, 2016, 9:39:29 AM

to James Molloy, llvm-dev

[resending so the message is smaller]

From: "James Molloy via llvm-dev" <llvm...@lists.llvm.org>
To: "Quentin Colombet" <qcol...@apple.com>
Cc: "llvm-dev" <llvm...@lists.llvm.org>
Sent: Wednesday, January 13, 2016 2:35:32 AM
Subject: Re: [llvm-dev] [GlobalISel] A Proposal for global instruction selection

> Hi Philip,
>
> > store <2 x i64> %1, <2 x i64>* %y
>
> Yes. The memory pattern differs. (This is the first diagram on the right at: http://llvm.org/docs/BigEndianNEON.html#bitconverts)

I think that teaching the optimizer about big-Endian lane ordering would have been better. Inserting the REV after every LDR sounds very similar to what we do for VSX on little-Endian PowerPC systems (PowerPC may have a slight advantage here in that we don't need to do insertelement / extractelement / shufflevector through memory on systems where little-Endian mode is relevant, see http://llvm.org/devmtg/2014-10/Slides/Schmidt-SupportingVectorProgramming.pdf).

Given what's been done, should we update the LangRef? It currently reads, "The 'bitcast' instruction converts value to type ty2. It is always a no-op cast because no bits change with this conversion. The conversion is done as if the value had been stored to memory and read back as type ty2." But this is now, at the least, misleading, because this process of storing the value as one type and reading it back in as another does, in fact, change the bits. We need to make clear that this might change the bits (perhaps specifically by calling out this case of vector bitcasts on big-Endian systems?).

Also, regarding this, "Most operating systems however do not run with alignment faults enabled, so this is often not an issue." Are you saying that the processor does the correct thing in this case (if alignment faults are not enabled, then it performs a proper unaligned load), or that the operating-system trap handler emulates the unaligned load should one occur?

Thanks again,
Hal

James Molloy via llvm-dev

Jan 13, 2016, 10:54:44 AM

to Hal Finkel, llvm-dev

>I think that teaching the optimizer about big-Endian lane ordering would have been better.

It's certainly arguable. Even in hindsight I'm glad we didn't - that's the approach GCC took and they've been fixing subtle bugs in their vectorizer ever since.

> Inserting the REV after every LDR

We only do this conceptually. In most cases REVs cancel out, and we have the LD1 instruction which is LDR+REV. With enough peepholes there's really no need for code to run slower.

> Given what's been done, should we update the LangRef.

Potentially, yes. I hadn't realised quite how strongly worded it was with respect to this.

James

Hal Finkel via llvm-dev

Jan 13, 2016, 11:03:15 AM

to James Molloy, llvm-dev

----- Original Message -----
> From: "James Molloy" <ja...@jamesmolloy.co.uk>
> To: "Hal Finkel" <hfi...@anl.gov>

> Cc: "llvm-dev" <llvm...@lists.llvm.org>, "Quentin Colombet" <qcol...@apple.com>
> Sent: Wednesday, January 13, 2016 9:54:26 AM
> Subject: Re: [llvm-dev] [GlobalISel] A Proposal for global instruction selection
>
>

> > I think that teaching the optimizer about big-Endian lane ordering
> > would have been better.
>
>
> It's certainly arguable. Even in hindsight I'm glad we didn't -
> that's the approach GCC took and they've been fixing subtle bugs in
> their vectorizer ever since.
>
>
> > Inserting the REV after every LDR
>
>
> We only do this conceptually. In most cases REVs cancel out, and we
> have the LD1 instruction which is LDR+REV. With enough peepholes
> there's really no need for code to run slower.
>
>
> > Given what's been done, should we update the LangRef.
>
>
> Potentially, yes. I hadn't realised quite how strongly worded it was
> with respect to this.
>

Please do ;)

-Hal

>
> James
>
>

Philip Reames via llvm-dev

Jan 13, 2016, 12:31:12 PM

to Hal Finkel, James Molloy, llvm-dev

You should also add test cases for both EarlyCSE and GVN which cover
this case. I had a patch I was working on for EarlyCSE just a couple of
weeks ago which would have utterly broken this assumption. I ended up
rejecting it for other reasons, but still.

Philip Reames via llvm-dev

Jan 13, 2016, 1:09:05 PM

to Hal Finkel, James Molloy, llvm-dev

On 01/13/2016 08:01 AM, Hal Finkel via llvm-dev wrote:
> ----- Original Message -----
>> From: "James Molloy" <ja...@jamesmolloy.co.uk>
>> To: "Hal Finkel" <hfi...@anl.gov>
>> Cc: "llvm-dev" <llvm...@lists.llvm.org>, "Quentin Colombet" <qcol...@apple.com>
>> Sent: Wednesday, January 13, 2016 9:54:26 AM
>> Subject: Re: [llvm-dev] [GlobalISel] A Proposal for global instruction selection
>>
>>
>>> I think that teaching the optimizer about big-Endian lane ordering
>>> would have been better.
>>
>> It's certainly arguable. Even in hindsight I'm glad we didn't -
>> that's the approach GCC took and they've been fixing subtle bugs in
>> their vectorizer ever since.
>>
>>
>>> Inserting the REV after every LDR
>>
>> We only do this conceptually. In most cases REVs cancel out, and we
>> have the LD1 instruction which is LDR+REV. With enough peepholes
>> there's really no need for code to run slower.
>>
>>
>>> Given what's been done, should we update the LangRef.
>>
>> Potentially, yes. I hadn't realised quite how strongly worded it was
>> with respect to this.
>>
> Please do ;)

I'm not sure changing bitcast is the right place. Since the bitcast is
representing the in-register value (which doesn't change), maybe we
should define it as part of the load/store instead? That's essentially
what's going on; we're converting from a canonical register form to a
variety of memory forms. (Right?)

James Molloy via llvm-dev

Jan 13, 2016, 3:21:06 PM

to Philip Reames, Hal Finkel, llvm-dev

> (Right?)

Uh no, the register content explicitly does change :( We insert REV instructions (byteswap) on each bitcast. Bitcasts can be merged and elided etc, but conceptually there's a register content change on every bitcast.

James

Philip Reames via llvm-dev

Jan 13, 2016, 3:31:09 PM

to James Molloy, Hal Finkel, llvm-dev

On 01/13/2016 12:20 PM, James Molloy wrote:

> > (Right?)
>
> Uh no, the register content explicitly does change :( We insert REV instructions (byteswap) on each bitcast. Bitcasts can be merged and elided etc, but conceptually there's a register content change on every bitcast.

Ok. Then we need to change the LangRef as suggested. Given this is a rather important semantic change, I think you need to send a top level RFC to the list.

A couple of points that will need to be clarified:
- Does this only apply to vector types? It definitely doesn't apply between pointer types today. What about integer, floating point, and FCAs?
- Is combining two casts into one a legal operation? I think it is so far, but we need to explicitly state that.
- Do we have a predicate for identifying no-op casts that can be freely removed/combined?
- Is coercing a load to the type it's immediately bitcast to legal under this model?

Daniel Sanders via llvm-dev

Jan 14, 2016, 8:18:01 AM

to Philip Reames, James Molloy, Hal Finkel, llvm...@lists.llvm.org

> Ok. Then we need to change the LangRef as suggested. Given this is a rather important semantic change, I think you need to send a top level RFC to the list.

FWIW, I don't think this is a semantic change to LLVM-IR itself. I think it's more clearing up the misconception that LLVM-IR semantics also apply to SelectionDAG's operations. That said, I do think it's important to mention this in LangRef since it's very easy to make this mistake and very few targets need to worry about the distinction.

To explain why I don't think this is a semantic change to LLVM-IR, let's consider this example from earlier:

%0 = load <4 x i32> %x
%1 = bitcast <4 x i32> %0 to <2 x i64>

store <2 x i64> %1, <2 x i64>* %y

In LLVM-IR terms, if the value of %0 is:

%0 = 0x00112233_44556677_8899aabb_ccddeeff

then the value of %1 is:

%1 = 0x0011223344556677_8899aabbccddeeff

which agrees with the store/load and the 'no bits change' statements in LangRef.

However, the mapping of these bits to physical register bits is not consistent between types:

Physreg(%0) = 0xccddeeff_8899aabb_44556677_00112233

Physreg(%1) = 0x8899aabbccddeeff_0011223344556677

Essentially, I'm saying that BitCastInst and ISD::BITCAST have slightly different semantics because of their different domains. The former is working on an abstract representation of the values where both statements in LangRef are true, but the latter is closer to the target where the 'no bits change' statement ceases to be true in some cases.
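The numbers in this example can be reproduced with a short sketch. It assumes, purely as a model of the example above, that the physical register holds the lanes of a vector in the reverse of the order in which the abstract value is written; `lanes`, `join`, `physreg` and `swap_adjacent_pairs` are hypothetical helpers, not LLVM functions.

```python
def lanes(value, lane_bits, total_bits=128):
    """Split value into lanes, leftmost (most significant) lane first."""
    mask = (1 << lane_bits) - 1
    count = total_bits // lane_bits
    return [(value >> (lane_bits * (count - 1 - i))) & mask for i in range(count)]

def join(lane_values, lane_bits):
    """Inverse of lanes(): concatenate lanes, leftmost first."""
    out = 0
    for lane in lane_values:
        out = (out << lane_bits) | lane
    return out

def physreg(value, lane_bits):
    """Model of the example: same lanes, opposite lane order."""
    return join(list(reversed(lanes(value, lane_bits))), lane_bits)

def swap_adjacent_pairs(value, lane_bits):
    """Shuffle that turns Physreg(%0) into Physreg(%1) in this model."""
    ls = lanes(value, lane_bits)
    swapped = [ls[i ^ 1] for i in range(len(ls))]
    return join(swapped, lane_bits)

V = 0x00112233445566778899AABBCCDDEEFF  # abstract value of both %0 and %1
```

The abstract value is the same for both types, while the two register images differ by a swap of neighbouring 32-bit lanes, which is the work the backend-level bitcast has to do.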

> A couple of points that will need to be clarified:
> - Does this only apply to vector types? It definitely doesn't apply between pointer types today. What about integer, floating point, and FCAs?

I've only seen it for vector types so far but in theory it could happen for other types. I'd expect FCAs to encounter it since the physical registers may contain padding that isn't present in the LLVM-IR representation and the placement and amount of padding will depend on the exact FCA.

I can think of cases where address space casts can encounter the same problem but that's already been covered in LangRef ("It can be a no-op cast or a complex value modification, depending on the target and the address space pair.").

Does anyone use FCAs directly? Most targets seem to convert them to same-sized integers or bitcast an FCA* to i8*.

> - Is combining two casts into one a legal operation? I think it is so far, but we need to explicitly state that.

Yes, A->B->C and A->C are equivalent.
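A quick way to convince oneself of this under the store/load definition: each bitcast is a store followed by a load, and the intermediate load/store pair leaves memory untouched. The sketch below models that with Python's `struct` on a big-endian layout; the type formats chosen for A, B and C are arbitrary examples.

```python
import struct

def bitcast(values, src_fmt, dst_fmt):
    """Store the values with one layout and read them back with another."""
    return struct.unpack(dst_fmt, struct.pack(src_fmt, *values))

A, B, C = ">4I", ">2Q", ">8H"  # <4 x i32>, <2 x i64>, <8 x i16>, big-endian
x = (0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF)

two_step = bitcast(bitcast(x, A, B), B, C)  # A -> B -> C
one_step = bitcast(x, A, C)                 # A -> C
```

Both routes yield the same eight 16-bit elements.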

> - Do we have a predicate for identifying no-op casts that can be freely removed/combined?

James mentioned one in CGP but I haven't been able to find it. I don't think it's necessary to have one at the LLVM-IR level but we do need one in the backends. I remember adding one to the backend but I can't find that either so I think I'm remembering one of my patches from before I split MSA's registers into type-specific classes.

> - Is coercing a load to the type it's immediately bitcast to legal under this model?

Yes.

James Molloy via llvm-dev

Jan 14, 2016, 8:48:33 AM

to Daniel Sanders, Philip Reames, Hal Finkel, llvm...@lists.llvm.org

Hi,

I've given a bit of misinformation here and have caused some confusion. After talking with Tim and Mehdi last night on IRC, I need to correct what I said above to fall more in line with what Daniel is saying. If any of the below contradicts what I've said already, please accept my apologies. This version should be right.

The behaviour of the code generator for big-endian NEON and MIPS is derived from the fact that we did not want to change IR semantics at all. A fundamental property that we do not want to break is memory round-tripping:

%1 = load <4 x i32>, %p32

%2 = bitcast <4 x i32> %1 to <2 x i64>

store <2 x i64> %2, (bitcast %p32 to <2 x i64>*)

The value of memory before and after the store MUST NOT change (contrary to what I said in an earlier post, I know).

So in fact everything you can do in IR is valid. There are no changes to IR semantics in the slightest. However, when it comes to generating code from the IR, there are new rules:

1) Loads and stores are selected to be special loads and stores that do some transform from a canonical form in memory to a type-specific form in register.

2) Because bitcasts are semantically store/load pairs, they must behave as if a store then load was done. Specifically (bitcast TyA to TyB) must transform TyA -> canonical form -> TyB, as a store then load would. Therefore bitcasts are not no-ops during code generation (*but behave as if they are from an IR perspective!*).
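The two rules can be sketched with a toy model. The assumption here, made only for illustration, is that the "type-specific form in register" keeps each lane's bytes and reverses the order of the lanes; real NEON/MSA loads differ in detail, and all names below are hypothetical.

```python
def to_register(mem, lane_bytes):
    """Toy 'special load': canonical memory form -> type-specific register form."""
    lanes = [mem[i:i + lane_bytes] for i in range(0, len(mem), lane_bytes)]
    return b"".join(reversed(lanes))

def to_memory(reg, lane_bytes):
    """Toy 'special store': lane reversal is its own inverse."""
    return to_register(reg, lane_bytes)

def bitcast_in_register(reg, src_lane_bytes, dst_lane_bytes):
    """Rule 2: TyA -> canonical form -> TyB, as a store then load would do."""
    return to_register(to_memory(reg, src_lane_bytes), dst_lane_bytes)

mem_before = bytes(range(16))
r1 = to_register(mem_before, 4)      # load <4 x i32>
r2 = bitcast_in_register(r1, 4, 8)   # bitcast to <2 x i64>
mem_after = to_memory(r2, 8)         # store <2 x i64>
```

In this model the register contents change across the bitcast, yet memory round-trips exactly, which matches the requirement stated above.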

The reason this works neatly in IR is due to the IR's type system. In order to change type, a cast must be inserted or a memory round trip. There is no other way. However in SDAG, things break down a bit. SDAG is more weakly typed, and bitconverts are often simply removed. We need that not to happen. Bitconverts are not no-ops.

Daniel's explanation of physical register mapping was excellent so I'm not going to repeat that.

I apologise for the confusion and misinformation. This is quite a complex topic and takes a bit of mind bending for me to understand, and it was a long time ago.

James

Philip Reames via llvm-dev

Jan 14, 2016, 4:48:59 PM

to James Molloy, Daniel Sanders, Hal Finkel, llvm...@lists.llvm.org

This explanation makes a lot more sense to me. I think it would make sense to document this mental model, but I agree that this interpretation does not seem to require changes to the IR semantics.

Just to check, this implies that DSE *is* legal right?

Philip

Hal Finkel via llvm-dev

Jan 14, 2016, 5:35:03 PM

to Philip Reames, llvm...@lists.llvm.org

----- Original Message -----
> From: "Philip Reames" <list...@philipreames.com>
> To: "James Molloy" <ja...@jamesmolloy.co.uk>, "Daniel Sanders" <Daniel....@imgtec.com>, "Hal Finkel"
> <hfi...@anl.gov>
> Cc: llvm...@lists.llvm.org
> Sent: Thursday, January 14, 2016 3:48:37 PM
> Subject: Re: [llvm-dev] [GlobalISel] A Proposal for global instruction selection
>

> This explanation makes a lot more sense to me. I think it would make
> sense to document this mental model, but I agree that this
> interpretation does not seem to require changes to the IR semantics.

The semantics, no. But we still may want to update the language reference. It says, "It is always a no-op cast because no bits change with this conversion. The conversion is done as if the value had been stored to memory and read back as type ty2." And, what we've learned, is that this second sentence does not always imply the first (the bits might, in fact, change).

-Hal

James Molloy via llvm-dev

Jan 15, 2016, 3:45:57 AM

to Hal Finkel, Philip Reames, llvm...@lists.llvm.org

Hi,

> "It is always a no-op cast because no bits change with this conversion. The conversion is done as if the value had been stored to memory and read back as type ty2."

I think a simple "as-if" in there should be sufficient;

"It is always a no-op cast because it acts as if no bits change with this conversion. The conversion is done as if the value had been stored to memory and read back as type ty2."

What do you think?

James

Daniel Sanders via llvm-dev

Jan 15, 2016, 5:29:43 AM

to James Molloy, Hal Finkel, Philip Reames, llvm...@lists.llvm.org

Hi,

I think we just need to draw attention to the fact that other IRs may vary.

I'm thinking we should add something like this to the ISD::BITCAST doxygen comment:

This is subtly different from the bitcast instruction from LLVM-IR since this node may change the bits in the register. For example, this occurs on big-endian NEON and big-endian MSA where the layout of the bits in the register depends on the vector type and this node acts as a shuffle operation for some vector type combinations.

And have LangRef say something like:

The conversion is done as if the value had been stored to memory and read back as type ty2. This is equivalent to a no-op cast where no bits change with this conversion.

.. caution::

   This equivalence does not necessarily apply to other IRs in LLVM. See ISD::BITCAST for an example.

The '.. caution::' should render in the same way as the 'Rationale' box in http://llvm.org/docs/LangRef.html#volatile-memory-accesses.

Hal Finkel via llvm-dev

Jan 15, 2016, 8:41:57 AM

to James Molloy, llvm...@lists.llvm.org

----- Original Message -----
> From: "James Molloy" <ja...@jamesmolloy.co.uk>
> To: "Hal Finkel" <hfi...@anl.gov>, "Philip Reames" <list...@philipreames.com>
> Cc: llvm...@lists.llvm.org, "Daniel Sanders" <Daniel....@imgtec.com>
> Sent: Friday, January 15, 2016 2:45:32 AM
> Subject: Re: [llvm-dev] [GlobalISel] A Proposal for global instruction selection
>
>

> Hi,
>
>
> > "It is always a no-op cast because no bits change with this
> > conversion. The conversion is done as if the value had been stored
> > to memory and read back as type ty2."
>
>
> I think a simple "as-if" in there should be sufficient;
>
>
> "It is always a no-op cast because it acts as if no bits change with
> this conversion. The conversion is done as if the value had been
> stored to memory and read back as type ty2."
>
>
> What do you think?
>

I think this sounds confusing (and, regardless, we always get to apply 'as if'). I see your point, however, that any changes in the bits are unobservable at the IR level. Is it true that int -> floating-point -> int bitcasts round-trip cleanly for all possible values on all hardware? I was under the impression that this was not true. I think that the best solution might just be to delete the first sentence.

-Hal

Hal Finkel via llvm-dev

Jan 15, 2016, 8:47:13 AM

to Daniel Sanders, llvm...@lists.llvm.org

From: "Daniel Sanders" <Daniel....@imgtec.com>
To: "James Molloy" <ja...@jamesmolloy.co.uk>, "Hal Finkel" <hfi...@anl.gov>, "Philip Reames" <list...@philipreames.com>
Cc: llvm...@lists.llvm.org

Sent: Friday, January 15, 2016 4:29:33 AM
Subject: RE: [llvm-dev] [GlobalISel] A Proposal for global instruction selection

> Hi,
>
> I think we just need to draw attention to the fact that other IR's may vary.
>
> I'm thinking we should add something like this to the ISD::BITCAST doxygen comment:
>
> This is subtly different from the bitcast instruction from LLVM-IR since this node may change the bits in the register. For example, this occurs on big-endian NEON and big-endian MSA where the layout of the bits in the register depends on the vector type and this node acts as a shuffle operation for some vector type combinations.

I agree; improving the ISD::BITCAST documentation is a good idea.

> And have LangRef say something like:
>
> The conversion is done as if the value had been stored to memory and read back as type ty2. This is equivalent to a no-op cast where no bits change with this conversion.
>
> .. caution::
>
> This equivalence does not necessarily apply to other IR's in LLVM. See ISD::BITCAST for an example.
>
> The '.. caution::' should render in the same way as the 'Rationale' box in http://llvm.org/docs/LangRef.html#volatile-memory-accesses.

We should not do this. We try to keep the LangRef as implementation-independent as possible, and thus, we don't explicitly discuss things like ISD nodes there.

-Hal


Daniel Sanders via llvm-dev

Jan 22, 2016, 8:39:52 AM

to Hal Finkel, llvm...@lists.llvm.org

Hi,

I've posted a patch for the ISD::BITCAST change as http://reviews.llvm.org/D16464.
