[LLVMdev] RFC: Machine Instruction Bundle


Evan Cheng

unread,
Dec 2, 2011, 3:40:02 PM12/2/11
to LLVM Dev
Machine Instruction Bundle in LLVM

Hi all,

There has been quite a bit of discussion about adding machine instruction bundles to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share it with the LLVM community.

Design Criteria

1. The bundle representation must be lightweight. We cannot afford to add significant memory or compile time overhead.
2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequences of instructions that must be scheduled as a unit, e.g. an ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.
3. Minimize the amount of changes required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. we don't want code snippets that search for bundle start / end like all the code in the backend that skips over DBG_VALUE).
4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle.

Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasicBlock to keep a list of MachineInstrBundles since that would require a massive amount of code change. So what are the choices?

Bundle Representation

1. A nested MachineInstr: This is the most natural representation (meaning it looks most like the real HW bundle). It has the nice property that most passes do not have to check if a MI is a bundle. The concern is that this can add significant memory overhead if it means adding an ilist or SmallVector field to keep bundled MIs.
2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very lightweight. However, it requires many passes to check whether a MI is part of a bundle.

The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this:

--------------
|  Bundle    | -------
--------------        \
         |           ----------------
         |           |      MI      |
         |           ----------------
         |                   |
         |           ----------------
         |           |      MI      |
         |           ----------------
         |                   |
         |           ----------------
         |           |      MI      |
         |           ----------------
         |
--------------
|  Bundle    | ------
--------------       \
         |           ----------------
         |           |      MI      |
         |           ----------------
         |                   |
         |           ----------------
         |           |      MI      |
         |           ----------------
         |                   |
         |                  …
         |
--------------
|  Bundle    | ------
--------------       \
         |
        ...


This is #1, a series of nested MI's. However, we are going to store the instructions in the same way as it's done right now, i.e. a list<MachineInstr> on MachineBasicBlocks.  Using #2, we will add a bit to MI that indicates whether it is part of a bundle.

                     ----------------
                     |      MI    * |   (* bit indicates next MI is "glued" to this MI, i.e. in the same bundle)
                     ----------------
                             |
                     ----------------
                     |      MI    * |
                     ----------------
                             |
                     ----------------
                     |      MI      |    (no bit, this is the end of the bundle)
                     ----------------
                             |
                     ----------------
                     |      MI    * |   (* a new bundle)
                     ----------------
                             |
                     ----------------
                     |      MI      |
                     ----------------
                             |
                            ...

We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. the first instruction in each bundle). We will add another iterator that allows clients to visit all of the MIs, for those passes that want to look into bundles.

We can use the same representation for arbitrary sequences of instructions that cannot be broken up, e.g. Thumb2 IT blocks.

                     ----------------
                     |      MI      |   (just a MI)
                     ----------------
                            |
                     ----------------
                     |      MI    * |   (* Start of Thumb2 IT block)
                     ----------------
                            |
                     ----------------
                     |      MI    * | 
                     ----------------
                            |
                     ----------------
                     |      MI      |   (last MI in the block)
                     ----------------
                            |
                     ----------------
                     |      MI      | 
                     ----------------
                            |
                           ...

This representation can support VLIW (where top level MI's are all starts of bundles) or non-VLIW (where there can be a mix of MIs and bundles). It is also very cheap since the "Flags" field has plenty of free bits available.

Properties of Bundle

If MI passes can consider each bundle as a single unit, then how are they going to examine properties (i.e. flags and operands) of a MI bundle? Conceptually, the properties of a bundle are the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies, e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

r0 should not be considered a source of the bundle since it is defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. a dead def).

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------
 ...
     = op3 r0, 

r0 is a def, not a use.

What does this mean? It means that in order for passes to operate on a bundle at a time, they must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation.

                     ----------------
                     |    Bundle  * |   (A MI with special opcode "Bundle")
                     ----------------
                             |
                     ----------------
                     |      MI    * |
                     ----------------
                             |
                     ----------------
                     |      MI    * |
                     ----------------
                             |
                     ----------------
                     |      MI      |    (no bit, this is the end of the bundle)
                     ----------------
                             |
                     ----------------
                     |    Bundle  * |   (a new bundle)
                     ----------------
                             |
                     ----------------
                     |      MI    * | 
                     ----------------
                             |
                     ----------------
                     |      MI      |
                     ----------------
                             |
                            ...

The pseudo bundle instructions should be used to capture properties of the bundle. When a bundle is finalized, the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.
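
The def/use computation sketched in the examples above can be modeled as follows. This is a simplified standalone illustration: MIOps, BundleOps, and computeBundleOperands are hypothetical names, registers are plain ints, and real code would work on MachineOperands and additionally mark internally-killed defs as dead.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical, simplified model: each MI lists the registers it defines
// and uses. Real MachineInstrs carry MachineOperands instead.
struct MIOps {
  std::vector<int> Defs, Uses;
};

struct BundleOps {
  std::set<int> Defs;  // everything written inside the bundle
  std::set<int> Uses;  // only values live-in to the bundle
};

// Sketch of the operand computation: a bundle's uses are the union of the
// member uses minus registers already defined earlier in the bundle
// (those are internal reads); every def becomes a bundle def.
inline BundleOps computeBundleOperands(const std::vector<MIOps> &Bundle) {
  BundleOps Out;
  std::set<int> DefinedSoFar;
  for (const MIOps &MI : Bundle) {
    for (int U : MI.Uses)
      if (!DefinedSoFar.count(U))
        Out.Uses.insert(U);
    for (int D : MI.Defs) {
      Out.Defs.insert(D);
      DefinedSoFar.insert(D);
    }
  }
  return Out;
}
```

For the two-instruction example above (r0 = op1 r1, r2; r3 = op2 r0, #c), the bundle's uses are r1 and r2 only, while r0 and r3 are both defs, matching the "defs: vr0, vr3, uses: vr1, vr2" shape shown later in the post.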

Other properties, such as mayLoad, mayStore, are static properties associated with opcodes. They cannot be copied. We will add APIs to examine properties of MIs which will do the *right thing* for bundles (i.e. look into MIs in bundles).
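
Conceptually, such a query API could work like the sketch below (invented names, standalone model; in the real code generator the query would live on MachineInstr and consult the per-opcode description of each member MI):

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for an MI's static, per-opcode properties.
struct SimpleMI {
  bool MayLoad;
  bool MayStore;
};

// "Do the right thing for bundles": a property query on a bundle scans
// its member MIs rather than copying static flags onto the bundle header.
inline bool bundleMayLoad(const std::vector<SimpleMI> &Bundle) {
  for (const SimpleMI &MI : Bundle)
    if (MI.MayLoad)
      return true;  // one loading member makes the whole bundle a load
  return false;
}
```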

Packetizing

The current MI flow looks like this:

1. DAG to MI lowering (and pre-RA schedule)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Post-RA scheduling

In the hopefully not very distant future it should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-RA scheduling
   3d. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy like pseudo instructions.

Packetization should be done in two phases. The first part decides which MIs should be bundled together and adds the "bits" which glue MIs together. This can be done either before or during pre-RA scheduling. The second part of the packetization should only be done after register allocation is completed. There are two very important reasons for this.

1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle bundles transparently. However, we do not want to do this before register allocation is complete. Otherwise it introduces new defs and uses of virtual registers, which messes up the MachineRegisterInfo def-use chains.

e.g. Now vr0 has two defs!
defs: vr0<dead>, vr3, uses: vr1, vr2
----------------------------
| vr0 = op1 vr1, vr2       |
| vr3 = op2 vr0<kill>, #c  |
----------------------------

2. During register allocation, more identity copies will be eliminated, while loads, stores, copies, and re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fit these new instructions into empty slots (for VLIW-like targets).

So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-scheduling packetization (optional)
   3d. Pre-RA scheduling (or integrated packetization)
   3e. Post-scheduling packetization (optional)
   3f. Actual register allocation
   3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

Lowering Bundles to MCInst

There is no need to add the equivalent of MI bundle to MCInst. A MI bundle should be concatenated into a single MCInst by storing opcodes as integer operands. e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------

=>

MCInst: op1 r0, r1, r2, op2, r3, r0, #c
or
MCInst: op1 op2 r0, r1, r2, r3, r0, #c
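
The first of the two encodings above (each sub-instruction's opcode stored as an ordinary operand, followed by that sub-instruction's operands) can be sketched as below. This is a toy standalone model: FlatInst, SubInst, and lowerBundle are invented names, and strings stand in for real opcode/register encodings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a single flat MCInst-like record.
struct FlatInst {
  std::vector<std::string> Operands;  // opcodes and operands, flattened
};

// One member of a bundle: its opcode plus its operands.
struct SubInst {
  std::string Opcode;
  std::vector<std::string> Ops;
};

// Concatenate a bundle into one flat instruction by storing each member's
// opcode as an operand, immediately followed by that member's operands.
inline FlatInst lowerBundle(const std::vector<SubInst> &Bundle) {
  FlatInst Out;
  for (const SubInst &SI : Bundle) {
    Out.Operands.push_back(SI.Opcode);
    for (const std::string &Op : SI.Ops)
      Out.Operands.push_back(Op);
  }
  return Out;
}
```

Lowering the example bundle this way produces the flat sequence op1, r0, r1, r2, op2, r3, r0, #c, matching the first encoding shown above.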

What's Next?

I am hoping to find some time to implement the following in the near future:
1. Add BUNDLE opcode
2. MachineInstr class changes: new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
3. Change MachineBasicBlock::iterator to skip over bundled MIs. Rename the old iterator.
4. Add MachineInstr API to check for instruction properties and switch existing code over.
5. Add API to form a bundle. It would compute the proper def's and use's and add MachineOperands to the bundle MI.
6. Switch Thumb2 IT block to using MI bundles.
7. Add interface for targets to register their own packetization passes.

I would dearly welcome help on any of these tasks, especially on 4, 5, 6. I also would not cry if someone beats me to #6 (or actually any of the tasks). :-)

In the longer term, I would like to see a target independent packetization pass (I believe one is being reviewed). I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!

Evan

Sergei Larin

unread,
Dec 2, 2011, 5:34:58 PM12/2/11
to Evan Cheng, LLVM Dev

Evan,

 

  I will need to comprehend it better, but one small comment right away…

 

Did we not discuss one more option for bundle implementation – a global cycle ID? We would add an unsigned int field to the MI definition representing the "global scheduling cycle". All MIs with the same global cycle value belong to one group/packet. Zero means an unscheduled MI.

 

  That is lightweight and position independent (meaning that if instructions are added or reordered by a pass that does not care about packets/groups, the original grouping can easily be restored; with a single-bit marking, a newly inserted instruction breaks the encoding, or a special API is needed to handle it). The only real downside I see is that iterating over a single packet's members becomes a search, but if the scope is a BB it is not a big deal.
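
A standalone sketch of this alternative scheme (CycleMI and membersOfCycle are invented names for illustration): each MI carries a cycle id, zero meaning unscheduled, and collecting one packet's members is a linear scan over the block.

```cpp
#include <cassert>
#include <vector>

// Sketch of the cycle-id alternative: every MI carries a "global
// scheduling cycle"; all MIs sharing a nonzero cycle form one packet.
struct CycleMI {
  unsigned Cycle;  // 0 = unscheduled
  int Id;          // stand-in for the instruction itself
};

// Gathering a packet is a search over the block, the downside noted above.
inline std::vector<int> membersOfCycle(const std::vector<CycleMI> &MBB,
                                       unsigned Cycle) {
  std::vector<int> Out;
  for (const CycleMI &MI : MBB)
    if (Cycle != 0 && MI.Cycle == Cycle)
      Out.push_back(MI.Id);
  return Out;
}
```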

 

  This way we could also effectively represent NOOP bundles (no bundle for a certain cycle value), i.e. VLIW cycles with no instructions issued, or easily estimate scheduling depth, etc.

 

  I am not voting here in any way, I just wanted to bring it up for discussion.

 

Sergei Larin

 

--

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

Sergei Larin

unread,
Dec 2, 2011, 5:41:01 PM12/2/11
to Evan Cheng, LLVM Dev

… and yes, one more thing. On some architectures it might be desirable to know the _order_ of instructions in the packet. That is a bit trickier….

 

--

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

 


Jakob Stoklund Olesen

unread,
Dec 2, 2011, 5:44:25 PM12/2/11
to Evan Cheng, LLVM Dev
On Dec 2, 2011, at 12:40 PM, Evan Cheng wrote:

                     ----------------
                     |    Bundle  * |   (A MI with special opcode "Bundle")
                     ----------------
                             |
                     ----------------
                     |      MI    * |
                     ----------------
                             |
                     ----------------
                     |      MI    * |
                     ----------------
                             |
                     ----------------
                     |      MI      |    (no bit, this is the end of the bundle)
                     --------------
                             |
                     ----------------
                     |    Bundle  * |   (a new bundle)
                     ----------------
                             |
                     ----------------
                     |      MI    * | 
                     ----------------
                             |
                     ----------------
                     |      MI      |
                     ----------------
                             |
                            ...

I think you should turn the glue bit upside down, so it means "This instruction is glued to its predecessor" instead of "This instruction is glued to its successor". That way, the inner bundled instructions have the glue bit, while the outer bundle headers and unbundled instructions don't.

That should make your iterator a lot simpler when you have mixed bundles and unbundled MIs. It simply skips MIs with the glue bit set:

             |       ----------------
             |-->    |    Bundle    |    (A MI with special opcode "Bundle")
             |       ----------------
             |               |
             |       ----------------
             |       |      MI    * |
             |       ----------------
             |               |
             |       ----------------
             |       |      MI    * |    (last MI in bundle)
             |       ----------------
             |               |
             |       ----------------
             |-->    |      MI      |    (no bit, this is a top-level unbundled MI)
             |       ----------------
             |               |
             |       ----------------
             |-->    |      MI      |    (unbundled MI)
             |       ----------------
             |               |
             |       ----------------
             |-->    |    Bundle    |   (a new bundle)
             |       ----------------
             |               |
             |       ----------------
             |       |      MI    * | 
             |       ----------------
             |               |
             |       ----------------
             |       |      MI    * |
             |       ----------------
             |               |

What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.

/jakob

David A. Greene

unread,
Dec 2, 2011, 5:44:58 PM12/2/11
to Evan Cheng, LLVM Dev
Evan Cheng <evan....@apple.com> writes:

> 2. It must be flexible enough to represent more than VLIW bundles. It
> should be useful to represent arbitrary sequence of instructions that
> must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare +
> branch macro-fusion, or random instruction sequences that are
> currently modeled as pseudo instructions that are expanded late.

This is really important. This kind of support could also aid
scheduling to make sure some later pass doesn't screw up a carefully
crafted schedule.

> So the overall flow should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. Pre-scheduling packetization (optional)
> 3d. Pre-RA scheduling (or integrated packetization)
> 3e. Post-scheduling packetization (optional)
> 3f. Actual register allocation
> 3g. Packet finalization
> 4. Post-RA optimizations
> 5. PEI
> 6. Re-schedule restores, copies

This all looks fairly reasonable to me. I just have one question. How
much machine-level information do we have at the point where bundling
occurs? For example, can we figure out instruction sizes exactly? This
kind of thing is important for certain kinds of scheduling.

If we don't have that information at this point, it might be
advantageous to delay the final packetization until later, once we're at
a point where we have all of the useful information we might want.

-Dave
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Evan Cheng

unread,
Dec 2, 2011, 6:58:21 PM12/2/11
to Jakob Stoklund Olesen, LLVM Dev
Ok. Makes sense.


What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.

That's invalid. If you use the "top level MIs only" iterator, then it's not legal to assign it to a bundled instruction.

Evan


/jakob


Evan Cheng

unread,
Dec 2, 2011, 7:09:31 PM12/2/11
to David A. Greene, LLVM Dev

That depends on the target.

>
> If we don't have that information at this point, it might be
> advantageous to delay the final packetization until later, once we're at
> a point where we have all of the useful information we might want.

Targets are allowed to delay packetization to later. The only restriction is that packets cannot be finalized until register allocation is done, because we do not want to add virtual register defs and uses to bundle instructions.

Evan

Evan Cheng

unread,
Dec 2, 2011, 7:01:52 PM12/2/11
to Sergei Larin, LLVM Dev
On Dec 2, 2011, at 2:34 PM, Sergei Larin wrote:

Evan,
 
  I will need to comprehend it better, but one small comment right away…
 
Did we not discuss one more option for bundle implementation – global cycle ID. We would add an unsigned int field to MI

No, I don't remember discussing this.

definition representing “global scheduling cycle”. All MIs with the same global cycle value belong to one group/packet.  Zero means unscheduled MI.
 
  That is light weight, position independent (means if instructions are added/reordered by a pass that does not care for packets/groups, original grouping could be easily restored. With a single bit marking, a newly inserted instruction breaks the coding or special ABI is needed to be used for that). Only real downside I see, iterating over a single packet members becomes a search, but if the scope is BB it is not a big deal.

Sorry, I'm not seeing the advantages of this. This actually requires one additional field to keep the id. And all the passes which can potentially move code around, breaking MBBs etc. will have to maintain it.

Evan

Evan Cheng

unread,
Dec 2, 2011, 7:06:39 PM12/2/11
to Sergei Larin, LLVM Dev
On Dec 2, 2011, at 2:41 PM, Sergei Larin wrote:

… and yes, one more thing. On some architectures it might be desirable to know the _order_ of instructions in the packet. That is a bit trickier….

Isn't that just the order of the instructions in the list? I don't see anything that prevents getting the order of instructions. It might require an iterator over the MIs in the packet. But for targets like VLIW, the # of instructions should be small. Also note, passes that require this kind of information should aggregate and maintain the data themselves. Information like the schedule cycle belongs in a "schedule unit", not in the MI.

Evan

Pekka Jääskeläinen

unread,
Dec 3, 2011, 10:12:20 AM12/3/11
to LLVM Developers Mailing List
Hi,

I'm glad to see some action with regard to static instruction
scheduling and VLIW support in LLVM. I have some questions and
remarks which might not be relevant as I'm not totally familiar
with the current code generation framework of LLVM nor your plan.

On 12/02/2011 10:40 PM, Evan Cheng wrote:
> 2. It must be flexible enough to represent more than VLIW bundles. It should be
> useful to represent arbitrary sequence of instructions that must be scheduled as
> a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
> instruction sequences that are currently modeled as pseudo instructions that are
> expanded late.

The concept of a "VLIW bundle" is to mark a set of instructions that
should/could be executed in *parallel*. A static parallel instruction
schedule for a single instruction cycle, that is.

In other words, with a VLIW target a bundle might not be just "an atomic,
possibly sequentially executed chunk of instructions" or "a set of
instructions that can be executed in parallel but also sequentially".
In some architectures, the sequential execution might break the schedule
due to visible function unit pipeline latencies and no hardware interlocking.

Is it wise to mix the two concepts of "parallel instructions" and the looser
"instructions that should be executed together"? The "parallel semantics"
imply changes to how the scheduling is done (the earliest/latest cycle where
an instruction can be scheduled) and also, e.g., to the register allocation's
live ranges (if allocating regs on "packetized", i.e. parallel, code).

Moreover, the definition of a VLIW parallel bundle implies that there can be
no "intra-bundle dependencies"; otherwise those instructions could not
actually be executed in parallel.

For example, looking at your example of a bundle with "intra-bundle
dependencies":

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

In case of a static VLIW target the semantics of this instruction is that these
two "RISC instructions are executed in parallel, period". Thus, the first
instruction cannot depend on the latter (or the other way around) but op2 reads
the old value of r0, not the one written in the same bundle.

It depends on the architecture's data hazard detection support, register file
bypasses, etc. whether the r0 update of the 1st instruction is available to
the second instruction in the bundle or whether the new r0 value can be read
only by the succeeding instruction bundles. If it is available, the execution
is sequential in reality as op1 must produce the value before op2 can
execute.

Itanium machines are an example of "parallel bundle architectures"
(and of course also other "more traditional" VLIWs are, like the TI C64x[2]):

"EPIC allows compilers to define independent instruction sequences, which allows
hardware to ignore dependency checks between these instructions. This same
hardware functionality in OOO RISC designs is very costly and complex."
[1]

As an example of the "not truly parallel instruction bundles", on the other
hand, we have played a bit with the Cell SPU, which is a fairly static architecture
but still has hardware data hazard detection and hardware interlocking. It
would differentiate between your case and the one where the order is different
because it follows the sequential instruction order in its hardware data
dependence resolving logic and stalls the pipeline (thus does not really
execute the instructions in parallel) if the sequential order has data hazards.

For how to actually represent the (parallel) instruction bundles I do not have
a strong opinion, as long as this semantic difference between a "parallel
bundle" and "just a chunk of instructions that should be executed together" is
made clear and adhered to everywhere in the code generation.

[1] http://www.dig64.org/about/Itanium2_white_paper_public.pdf
[2] http://www.ti.com/lit/ug/spru395b/spru395b.pdf

Best regards,
--
Pekka from the TCE project
http://tce.cs.tut.fi

Evan Cheng

unread,
Dec 3, 2011, 3:18:26 PM12/3/11
to Pekka Jääskeläinen, LLVM Developers Mailing List
It's important to understand this is a proposal for code generator IR change, not specific to specific passes for register allocator or scheduler. Passes that want to understand more about the internals of instruction bundles are free to design their own data structure. It is imperative that we do not add multiple constructs for various types of bundles. That would add significant overhead for the rest of the code generator.

I have discussed the proposal with the current code owners of the register allocator and instruction scheduler. This is a proposal that will work with the existing infrastructure (or modules being planned for the near future).

Evan

Anshuman Dasgupta

unread,
Dec 5, 2011, 10:17:25 AM12/5/11
to Evan Cheng, LLVM Dev
Hi Evan,

Thanks for sending out the proposal. This pass will, of course, be very important for the Hexagon backend. I have to study the proposal in more detail but overall, it looks good to me.

> I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion).

My team will be happy to help with the tasks and specifically with the packet-formation pass. We already have a pass that forms packets for Hexagon and we can generalize the pass and make it machine-independent.

-Anshu

--
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

Sergei Larin

unread,
Dec 5, 2011, 11:59:20 AM12/5/11
to Evan Cheng, LLVM Dev

Yes it is. Or it could be. All I am saying is that if we make this implied assumption (that the order of instructions in the list == the semantic order in the bundle), it needs to be explicitly stated. Otherwise a pass developer could really treat a bundle as a truly "parallel" entity, potentially causing nasty things.

 

Sergei Larin

 

--

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

 

Jakob Stoklund Olesen

unread,
Dec 5, 2011, 4:50:03 PM12/5/11
to Evan Cheng, LLVM Dev

On Dec 2, 2011, at 12:40 PM, Evan Cheng wrote:

> There have been quite a bit of discussions about adding machine instruction bundle to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share with the LLVM community.

Let me add some information about how the register allocator handles this extension.

First of all, I agree it is important that this can be used for more than VLIW targets. Besides the uses you mention, I would like to use bundle DBG_VALUE instructions so we can avoid silly code like this:

MachineBasicBlock::iterator InsertPos = mi;
while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
  --InsertPos;
MachineBasicBlock::iterator From = KillMI;
MachineBasicBlock::iterator To = llvm::next(From);
while (llvm::prior(From)->isDebugValue())
  --From;

I also think bundles can be used for implementing parallel copies.


Value Semantics
===============

The register allocator and probably most of the target-independent code generator won't care if instructions in a bundle are executed sequentially or in parallel. We do care about value semantics, though. That is, which value is being read by an instruction operand in a bundle.

We definitely want to support the parallel execution semantics where all instructions in a bundle read values defined outside the bundle. This is a swap implemented as a parallel copy bundle:

{
R2 = R3;
R3 = R2;
}

However, even VLIW targets like Hexagon can read values defined inside the same bundle:

{
P0 = cmp.eq(R2,#4)
if (!P0) R5 = #5
if (P0.new) R3 = memw(R4)
}

This Hexagon bundle reads both the P0 predicate register defined inside the bundle (P0.new) and the value defined outside the bundle (!P0). We need to support this.

I propose that we add a new MachineOperand flag, IsInternalRead, to represent this. The flag will mean "This operand is reading a value defined within the same instruction/bundle, not a value from outside the instruction/bundle." The register allocator will treat the <internal> flag almost exactly like it currently treats <undef>, but there is a big difference to the target-specific code.

The register allocator and other target-independent passes don't care if there are multiple defs of the same register inside a bundle. Values are only tracked at the bundle granularity. The semantics of multiple defs is target-defined.

To summarize, all instructions in a bundle read values defined outside the bundle, unless explicitly marked as bundle-internal reads. Multiple defs inside a bundle are indistinguishable except to the target.
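
A simplified standalone sketch of collecting a bundle's reads under this rule (Operand and bundleLiveIns are invented names; in the real design the flag would live on MachineOperand): operands marked as internal reads consume values defined within the bundle and are therefore excluded from the bundle's live-in set.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy operand model: a register, whether this operand defines it, and
// whether a read is of a value defined inside the same bundle.
struct Operand {
  int Reg;
  bool IsDef;
  bool IsInternalRead;  // the proposed <internal> flag
};

// At bundle granularity, the values read from outside the bundle are all
// non-def operands that are not marked as internal reads.
inline std::set<int> bundleLiveIns(const std::vector<Operand> &Ops) {
  std::set<int> Uses;
  for (const Operand &Op : Ops)
    if (!Op.IsDef && !Op.IsInternalRead)
      Uses.insert(Op.Reg);
  return Uses;
}
```

In the Hexagon example above, the !P0 operand (reading the old value from outside the bundle) stays a bundle use, while the P0.new operand, carrying the internal flag, does not.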


Rewriting
=========

The register allocator super-pass needs to rewrite register operands. Virtual-to-virtual rewriting happens during coalescing and live range splitting. Virtual-to-physical rewriting happens only once at the end.

When rewriting virtual registers, a minimal understanding of value semantics is required. In particular, it is possible to split a live range right down the middle of an instruction:

%vr0 = add %vr0, 1

May be rewritten as:

%vr2 = add %vr1, 1

This is assuming the add doesn't have two-address constraints, of course.

When rewriting bundle operands, the <internal> flag will be sufficient to determine the correct virtual register. For example:

{
%vr0 = cmp.eq(R2,#4)
if (!%vr0) R5 = #5
if (%vr0<internal>) R3 = memw(R4)
}

could be rewritten after live range splitting as:

{
%vr2 = cmp.eq(R2,#4)
if (!%vr1) R5 = #5
if (%vr2<internal>) R3 = memw(R4)
}


Constraining spill code insertion
=================================

It is important to note that bundling instructions doesn't constrain the register allocation problem.

For example, this bundle would be impossible with sequential value constraints:

{
call foo
%vr0 = addps %vr1, %vr2
call bar
}

The calls clobber the xmm registers, so it is impossible to register allocate this code without breaking up the bundle and inserting spill code between the calls.

With our definition of bundle value semantics, the addps is reading %vr1 and %vr2 outside the bundle, and the call clobbers are not considered relevant. The fact that the call clobbers all registers before addps is the target's problem.

This is very similar to how inline assembly is treated.


TL;DR
=====

By adding a new <internal> flag to MachineOperand, the register allocator can effectively treat a bundle as a single instruction. All MachineOperands inside a bundle are treated as if they all belong to the single instruction. This even works when rewriting operands.


/jakob

Jakob Stoklund Olesen

unread,
Dec 5, 2011, 5:56:30 PM12/5/11
to Evan Cheng, LLVM Dev

On Dec 5, 2011, at 1:50 PM, Jakob Stoklund Olesen wrote:

> I would like to bundle DBG_VALUE instructions so we can avoid silly code like this:
>
> MachineBasicBlock::iterator InsertPos = mi;
> while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
> --InsertPos;
> MachineBasicBlock::iterator From = KillMI;
> MachineBasicBlock::iterator To = llvm::next(From);
> while (llvm::prior(From)->isDebugValue())
> --From;

Thinking about it, this probably wouldn't work for pre-RA passes that can delete code. They could inadvertently delete DBG_VALUE instructions.

Pekka Jääskeläinen

unread,
Dec 6, 2011, 5:06:27 AM12/6/11
to llv...@cs.uiuc.edu
On 12/05/2011 11:50 PM, Jakob Stoklund Olesen wrote:
> By adding a new <internal> flag to MachineOperand, the register allocator can
> effectively treat a bundle as a single instruction. All MachineOperands
> inside a bundle are treated as if they all belong to the single instruction.
> This even works when rewriting operands.

This sounds like a simple and good solution to the "parallel bundle" semantic
worries I had.

What about instruction scheduling? Has anyone thought about how/if isched could work
with parallel bundles? That is, creating the bundles (referred to as "packing"
by some) in the first place (before and/or after RA).

I'm not familiar enough with the current LLVM isched work so I do not know
if the current algorithms could be used for static scheduling for VLIWs.
Is one really required to write a separate "packer" that just tries to pack
sequential(?) instructions into bundles without trying to reorder them to improve
the packing density?

The way I see it, "packing" is just instruction scheduling with an additional
"what can be in a single bundle and in what order" scheduling constraint, and it
produces fixed cycles (bundles) for the instructions as its outcome. The bundle
constraint (the wide instruction template(s) supported by the machine) can be
implemented with the state machine approach or just a resource table.

VLIW scheduling also requires multiplicity in the FU resource constraints.
Can those be described in the current TableGen format? Clustered (or in
general, not-fully-connected) VLIWs also require connectivity information
(which FUs are connected to which RFs) but that can be modeled with
register classes, I've been told.

--
--Pekka

Anshuman Dasgupta

unread,
Dec 6, 2011, 11:06:40 AM12/6/11
to Pekka Jääskeläinen, llv...@cs.uiuc.edu

> What about instruction scheduling? Has anyone thought about how/if isched could work
> with parallel bundles? That is, creating the bundles (referred to as "packing"
> by some) in the first place (before and/or after RA).
> ...
> The bundle constraint (the wide instruction template(s) supported by the machine)
> can be implemented with the state machine approach or just a resource table.

We have thought about it from a VLIW perspective. Yes, you can integrate scheduling with packetization. I committed a DFA-based packetization mechanism last week and that can be used by the scheduler to make more intelligent decisions for a VLIW target. The flow that Evan described is flexible enough to accommodate both an integrated scheduler and a pre- and post-scheduling packetizer. Some VLIW targets may need all three phases; others can use some combination of the three.

-Anshu

--
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

Evandro Menezes

unread,
Dec 6, 2011, 5:34:22 PM12/6/11
to Pekka Jääskeläinen, LLVM Developers Mailing List
Indeed, there are strict VLIW architectures out there and VLIW
architectures that leverage some aspects of conventional architectures,
sometimes taking advantage of how the microarchitecture was implemented.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

For instance, some such VLIW architectures allow the programmer to
specify whether "r0" in "op2" will have the value from t-1 (before the
bundle) or from t+1 (the result produced by "op1"), typically because
the forwarding of results happens close enough in the pipeline to be
used in the same bundle.

It seems to me that such subtleties are better left near the
target-dependent code, but the target-independent code should refrain
from making strict interpretations of what a bundle should look like.

When it gets to the MC layer, it seems to me that overloading the MCI
operands with opcodes is a bit unusual, but, at first glance, it might
not be too hard to handle.

--
Evandro Menezes Austin, TX emen...@codeaurora.org


Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

Stripf, Timo (ITIV)

unread,
Jan 11, 2012, 7:56:04 AM1/11/12
to Evan Cheng, LLVM Dev

Hi Evan,

 

I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM.

 

 

I would not integrate the packetizer into the register allocator super class since it would reduce the flexibility for the back-end developer to add optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must then deal with PHI and COPY nodes that are eliminated by the RA. The packetizer can simply group all PHI and COPY instructions into bundles consisting of only one instruction.

 

In my experience, a simple packetizer that groups instructions into bundles (like the old IA-64 back-end did) without changing the order of the instructions produces bad code. Instead, a VLIW scheduler that directly outputs bundles produces better code. The current LLVM scheduler (at the end of the instruction selection pass) is not suitable for generating bundled instructions since it operates on scheduling units formed from glued instructions.

However, the post-RA scheduler in combination with a VLIW-aware hazard recognizer can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications to the post-RA scheduler classes are necessary to support virtual registers.

 

I also would not include packet finalization in the register allocator super class, since the subsequent prolog and epilog code insertion (PECI) pass adds extra instructions into the instruction list. So I would add packet finalization after prolog and epilog code insertion. Both the RA and PECI can add their instructions as single-instruction bundles that can be merged into larger bundles during packet finalization. For packet finalization it also makes sense to perform post-RA VLIW scheduling.

 

Timo

girish gulawani

unread,
Jan 11, 2012, 10:43:44 AM1/11/12
to Stripf, Timo (ITIV), Evan Cheng, LLVM Dev

Hi Evan/Timo,
This is great! I've been experimenting on a VLIW target by adding a FinalizeBundle loop and a combination of the Scoreboard / Schedule hazard classes.

The only question nagging me now is how to pass the scheduling information (or the flags) down to PECI, to 'choose' the correct opcode. I am currently hacking a few bits off the "MachineInstr.Flags". But somebody suggested avoiding this and recommended adding a dummy operand instead. I'm not sure if that is the correct way. Any comments on that?

Thanks.
Girish.


Andrew Trick

unread,
Jan 11, 2012, 5:18:36 PM1/11/12
to Stripf, Timo (ITIV), LLVM Dev
On Jan 11, 2012, at 4:56 AM, Stripf, Timo (ITIV) wrote:

> Hi Evan,
>
> I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM.
>
> I would not integrate the packetizer into the register allocator super class since it would reduce the flexibility for the back-end developer to add optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must then deal with PHI and COPY nodes that are eliminated by the RA. The packetizer can simply group all PHI and COPY instructions into bundles consisting of only one instruction.
>
> In my experience, a simple packetizer that groups instructions into bundles (like the old IA-64 back-end did) without changing the order of the instructions produces bad code. Instead, a VLIW scheduler that directly outputs bundles produces better code. The current LLVM scheduler (at the end of the instruction selection pass) is not suitable for generating bundled instructions since it operates on scheduling units formed from glued instructions.
> However, the post-RA scheduler in combination with a VLIW-aware hazard recognizer can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications to the post-RA scheduler classes are necessary to support virtual registers.

What you're describing is how we expect some packetizers to be implemented, although it's really up to the person implementing the packetizer whether or not to integrate it with scheduling. The point is that a framework will support both scheduling and bundling before RA (and after coalescing).

> I also would not include packet finalization in the register allocator super class, since the subsequent prolog and epilog code insertion (PECI) pass adds extra instructions into the instruction list. So I would add packet finalization after prolog and epilog code insertion. Both the RA and PECI can add their instructions as single-instruction bundles that can be merged into larger bundles during packet finalization. For packet finalization it also makes sense to perform post-RA VLIW scheduling.

I think Evan referred to a regalloc "super pass". This means nothing more than a set of passes that all require and preserve liveness information.

Once physical registers are assigned and we throw away vreg liveness, passes can view bundles as individual instructions. That's what I think of as "packet finalization". Nothing really prevents rebundling, though; the late bundler just may not have as much freedom.

-Andy
