Evan,
I will need to study it more closely, but one small comment right away…
Did we not discuss one more option for bundle implementation – a global cycle ID? We would add an unsigned int field to the MI definition representing the “global scheduling cycle”. All MIs with the same global cycle value belong to one group/packet. Zero means an unscheduled MI.
That is lightweight and position independent (meaning that if instructions are added or reordered by a pass that does not care about packets/groups, the original grouping can easily be restored; with a single-bit marking, a newly inserted instruction breaks the encoding, or a special API has to be used to avoid that). The only real downside I see is that iterating over the members of a single packet becomes a search, but if the scope is a BB it is not a big deal.
This way we could also effectively represent NOOP bundles (no instruction for a certain cycle value) – VLIW cycles with no instructions issued – or easily estimate scheduling depth, etc.
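The scheme can be sketched with a toy model (not actual LLVM MI code; the struct and helper names here are hypothetical, purely for illustration):

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a MachineInstr carrying the proposed field.
struct MI {
  int Opcode;
  unsigned GlobalCycle; // 0 = unscheduled; equal nonzero values = same packet
};

// Collecting one packet's members becomes a linear search over the block,
// which is acceptable when the scope is a single basic block.
std::vector<const MI *> packetMembers(const std::vector<MI> &BB,
                                      unsigned Cycle) {
  std::vector<const MI *> Members;
  for (const MI &I : BB)
    if (I.GlobalCycle == Cycle)
      Members.push_back(&I);
  return Members;
}

// A cycle value with no members models a NOOP bundle (an issue cycle with
// nothing scheduled), and the maximum cycle estimates schedule depth.
unsigned scheduleDepth(const std::vector<MI> &BB) {
  unsigned Max = 0;
  for (const MI &I : BB)
    if (I.GlobalCycle > Max)
      Max = I.GlobalCycle;
  return Max;
}
```

A pass that inserts an instruction without assigning a cycle simply leaves GlobalCycle at zero; the grouping of the surrounding instructions is unaffected.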
I am not voting here in any way, I just wanted to bring it up for discussion.
Sergei Larin
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
… and yes, one more thing. On some architectures it might be desirable to know the _order_ of instructions in the packet. That is a bit trickier….
From: llvmdev...@cs.uiuc.edu [mailto:llvmdev...@cs.uiuc.edu] On Behalf Of Evan Cheng
Sent: Friday, December 02, 2011 2:40 PM
To: LLVM Dev
Subject: [LLVMdev] RFC: Machine Instruction Bundle
Machine Instruction Bundle in LLVM
----------------
|  Bundle *  |   (A MI with special opcode "Bundle")
----------------
       |
----------------
|    MI *    |
----------------
       |
----------------
|    MI *    |
----------------
       |
----------------
|    MI      |   (no bit, this is the end of the bundle)
----------------
       |
----------------
|  Bundle *  |   (a new bundle)
----------------
       |
----------------
|    MI *    |
----------------
       |
----------------
|    MI      |
----------------
       |
      ...
> 2. It must be flexible enough to represent more than VLIW bundles. It
> should be useful to represent arbitrary sequence of instructions that
> must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare +
> branch macro-fusion, or random instruction sequences that are
> currently modeled as pseudo instructions that are expanded late.
This is really important. This kind of support could also aid
scheduling to make sure some later pass doesn't screw up a carefully
crafted schedule.
> So the overall flow should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi slim)
> 3b. Coalescing
> 3c. Pre-scheduling packetization (optional)
> 3d. Pre-RA scheduling (or integrated packetization)
> 3e. Post-scheduling packetization (optional)
> 3f. Actual register allocation
> 3g. Packet finalization
> 4. Post-RA optimizations
> 5. PEI
> 6. Re-schedule restores, copies
This all looks fairly reasonable to me. I just have one question. How
much machine-level information do we have at the point where bundling
occurs? For example, can we figure out instruction sizes exactly? This
kind of thing is important for certain kinds of scheduling.
If we don't have that information at this point, it might be
advantageous to delay the final packetization until later, once we're at
a point where we have all of the useful information we might want.
-Dave
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.
/jakob
That depends on the target.
>
> If we don't have that information at this point, it might be
> advantageous to delay the final packetization until later, once we're at
> a point where we have all of the useful information we might want.
Targets are allowed to delay packetization to later. The only restriction is that packets cannot be finalized until register allocation is done, because we do not want to add virtual register defs and uses to bundled instructions.
Evan
I'm glad to see some action with regard to static instruction
scheduling and VLIW support in LLVM. I have some questions and
remarks which might not be relevant as I'm not totally familiar
with the current code generation framework of LLVM nor your plan.
On 12/02/2011 10:40 PM, Evan Cheng wrote:
> 2. It must be flexible enough to represent more than VLIW bundles. It should be
> useful to represent arbitrary sequence of instructions that must be scheduled as
> a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
> instruction sequences that are currently modeled as pseudo instructions that are
> expanded late.
The concept of a "VLIW bundle" is to mark a set of instructions that
should/could be executed in *parallel*. A static parallel instruction
schedule for a single instruction cycle, that is.
In other words, with a VLIW target a bundle might not be just "an atomic,
possibly sequentially executed chunk of instructions" or "a set of
instructions that can be executed in parallel but also sequentially".
In some architectures, the sequential execution might break the schedule
due to visible function unit pipeline latencies and no hardware interlocking.
Is it wise to mix the two concepts of "parallel instructions" and the looser
"instructions that should be executed together"? The "parallel semantics"
implies changes to how scheduling is done (the earliest/latest cycle where
an instruction can be scheduled) and also to, e.g., the register allocator's
live ranges (when allocating registers on "packetized", i.e. parallel, code).
Moreover, the definition of a VLIW parallel bundle implies that there cannot be
any "intra-bundle dependencies"; otherwise those instructions could not really be
executed in parallel.
For example, looking at your example of a bundle with "intra-bundle
dependencies":
-------------------------
| r0 = op1 r1, r2 |
| r3 = op2 r0<kill>, #c |
-------------------------
In case of a static VLIW target the semantics of this bundle is that these
two "RISC instructions" are executed in parallel, period. Thus, neither
instruction can depend on the other; op2 reads
the old value of r0, not the one written in the same bundle.
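The difference can be made concrete with a toy interpreter that commits a bundle's writes only after all reads (a sketch under the parallel semantics described above; the types and function are hypothetical):

```cpp
#include <array>
#include <cassert>
#include <utility>
#include <vector>

using RegFile = std::array<int, 8>;

// Each toy op simply copies: Dst = value of Src.
struct Op { int Dst, Src; };

// Parallel semantics: every operand is read from the register state as it
// was *before* the bundle, then all results are committed at once.
void executeBundleParallel(RegFile &R, const std::vector<Op> &Bundle) {
  std::vector<std::pair<int, int>> Writes;
  for (const Op &O : Bundle)
    Writes.push_back({O.Dst, R[O.Src]}); // read pre-bundle values only
  for (const auto &W : Writes)
    R[W.first] = W.second;               // commit all writes together
}
```

With the bundle { r0 = r1; r3 = r0 }, r3 receives the pre-bundle value of r0, not the value r0 is assigned in the same bundle.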
It depends on the architecture's data hazard detection support, register file
bypasses, etc. whether the r0 update of the 1st instruction is available to
the second instruction in the bundle or whether the new r0 value can be read
only by the succeeding instruction bundles. If it is available, the execution
is sequential in reality as op1 must produce the value before op2 can
execute.
Itanium machines are an example of "parallel bundle architectures"
(and of course also other "more traditional" VLIWs are, like the TI C64x[2]):
"EPIC allows compilers to define independent instruction sequences, which allows
hardware to ignore dependency checks between these instructions. This same
hardware functionality in OOO RISC designs is very costly and complex."
[1]
As an example of the "not truly parallel instruction bundles", on the other
hand, we have played a bit with the Cell SPU, which is a quite static architecture
but still has hardware data hazard detection and hardware interlocking. It
would differentiate between your case and the one where the order is different,
because it follows the sequential instruction order in its hardware data
dependence resolution logic and stalls the pipeline (thus not really
executing the instructions in parallel) if the sequential order has data hazards.
For how to actually represent the (parallel) instruction bundles I do not have
a strong opinion, as long as the semantic difference between a "parallel
bundle" and "just a chunk of instructions that should be executed together" is
made clear and adhered to everywhere in the code generation.
[1] http://www.dig64.org/about/Itanium2_white_paper_public.pdf
[2] http://www.ti.com/lit/ug/spru395b/spru395b.pdf
Best regards,
--
Pekka from the TCE project
http://tce.cs.tut.fi
I have discussed the proposal with the current code owners of the register allocator and instruction scheduler. This is a proposal that will work with the existing infrastructure (or modules being planned for the near future).
Evan
Yes it is. Or it could be. All I am saying is that if we make this implied assumption (that the order of instructions in the list == the semantic order in the bundle), it needs to be explicitly stated. Otherwise a pass developer could treat a bundle as a truly “parallel” entity, potentially causing nasty things.
Sergei Larin
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
> There have been quite a bit of discussions about adding machine instruction bundle to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share with the LLVM community.
Let me add some information about how the register allocator handles this extension.
First of all, I agree it is important that this can be used for more than VLIW targets. Besides the uses you mention, I would like to use bundle DBG_VALUE instructions so we can avoid silly code like this:
// Skip backwards over any DBG_VALUEs preceding the insertion point:
MachineBasicBlock::iterator InsertPos = mi;
while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
  --InsertPos;

// ...and the same dance again around the kill point:
MachineBasicBlock::iterator From = KillMI;
MachineBasicBlock::iterator To = llvm::next(From);
while (llvm::prior(From)->isDebugValue())
  --From;
I also think bundles can be used for implementing parallel copies.
Value Semantics
===============
The register allocator and probably most of the target-independent code generator won't care if instructions in a bundle are executed sequentially or in parallel. We do care about value semantics, though. That is, which value is being read by an instruction operand in a bundle.
We definitely want to support the parallel execution semantics where all instructions in a bundle read values defined outside the bundle. This is a swap implemented as a parallel copy bundle:
{
R2 = R3;
R3 = R2;
}
However, even VLIW targets like Hexagon can read values defined inside the same bundle:
{
P0 = cmp.eq(R2,#4)
if (!P0) R5 = #5
if (P0.new) R3 = memw(R4)
}
This Hexagon bundle reads both the P0 predicate register defined inside the bundle (P0.new) and the value defined outside the bundle (!P0). We need to support this.
I propose that we add a new MachineOperand flag, IsInternalRead, to represent this. The flag will mean "This operand is reading a value defined within the same instruction/bundle, not a value from outside the instruction/bundle." The register allocator will treat the <internal> flag almost exactly like it currently treats <undef>, but there is a big difference to the target-specific code.
The register allocator and other target-independent passes don't care if there are multiple defs of the same register inside a bundle. Values are only tracked at the bundle granularity. The semantics of multiple defs is target-defined.
To summarize, all instructions in a bundle read values defined outside the bundle, unless explicitly marked as bundle-internal reads. Multiple defs inside a bundle are indistinguishable except to the target.
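This rule can be modeled in a few lines (a toy model, not the proposed MachineOperand API; all names here are hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Operand {
  std::string Reg;
  bool IsDef;
  bool IsInternalRead; // the proposed <internal> flag on a use
};
struct Inst { std::vector<Operand> Ops; };
struct Bundle { std::vector<Inst> Insts; };

// Returns "internal" when the use is marked <internal> and a def of the
// same register exists inside the bundle; otherwise the value comes from
// outside, even when the bundle happens to also define that register.
std::string definingScope(const Bundle &B, const Operand &Use) {
  if (Use.IsInternalRead)
    for (const Inst &I : B.Insts)
      for (const Operand &O : I.Ops)
        if (O.IsDef && O.Reg == Use.Reg)
          return "internal";
  return "external";
}
```

Applied to the Hexagon example: the !P0 use resolves outside the bundle while the P0.new use resolves to the cmp.eq def inside it, even though both name the same register.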
Rewriting
=========
The register allocator super-pass needs to rewrite register operands. Virtual-to-virtual rewriting happens during coalescing and live range splitting. Virtual-to-physical rewriting happens only once at the end.
When rewriting virtual registers, a minimal understanding of value semantics is required. In particular, it is possible to split a live range right down the middle of an instruction:
%vr0 = add %vr0, 1
May be rewritten as:
%vr2 = add %vr1, 1
This is assuming the add doesn't have two-address constraints, of course.
When rewriting bundle operands, the <internal> flag will be sufficient to determine the correct virtual register. For example:
{
%vr0 = cmp.eq(R2,#4)
if (!%vr0) R5 = #5
if (%vr0<internal>) R3 = memw(R4)
}
could be rewritten after live range splitting as:
{
%vr2 = cmp.eq(R2,#4)
if (!%vr1) R5 = #5
if (%vr2<internal>) R3 = memw(R4)
}
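The rewrite rule above reduces to a simple lookup (a sketch; this helper is hypothetical, not the actual rewriter): after splitting, a use marked <internal> follows the register that now holds the in-bundle def, while a plain use follows the value live on entry to the bundle.

```cpp
#include <cassert>
#include <string>

// Map a use of OldReg to its new register after live range splitting:
// OutsideReg holds the value live on entry to the bundle, InsideReg the
// register now holding the def inside the bundle.
std::string rewriteUse(const std::string &OldReg, bool IsInternalRead,
                       const std::string &OutsideReg,
                       const std::string &InsideReg) {
  (void)OldReg; // every use of OldReg in the bundle is rewritten
  return IsInternalRead ? InsideReg : OutsideReg;
}
```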
Constraining spill code insertion
=================================
It is important to note that bundling instructions doesn't constrain the register allocation problem.
For example, this bundle would be impossible with sequential value constraints:
{
call foo
%vr0 = addps %vr1, %vr2
call bar
}
The calls clobber the xmm registers, so it is impossible to register allocate this code without breaking up the bundle and inserting spill code between the calls.
With our definition of bundle value semantics, the addps is reading %vr1 and %vr2 outside the bundle, and the call clobbers are not considered relevant. The fact that the call clobbers all registers before addps is the target's problem.
This is very similar to how inline assembly is treated.
TL;DR
=====
By adding a new <internal> flag to MachineOperand, the register allocator can effectively treat a bundle as a single instruction. All MachineOperands inside a bundle are treated as if they all belong to the single instruction. This even works when rewriting operands.
/jakob
> I would like to use bundle DBG_VALUE instructions so we can avoid silly code like this:
>
> MachineBasicBlock::iterator InsertPos = mi;
> while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
> --InsertPos;
> MachineBasicBlock::iterator From = KillMI;
> MachineBasicBlock::iterator To = llvm::next(From);
> while (llvm::prior(From)->isDebugValue())
> --From;
Thinking about it, this probably wouldn't work for pre-RA passes that can delete code. They could inadvertently delete DBG_VALUE instructions.
This sounds like a simple and good solution to the "parallel bundle" semantic
worries I had.
What about instruction scheduling? Has anyone thought about how/if isched could work
with parallel bundles? That is, to create the bundles (referred to as "packing"
by some) in the first place (before and/or after RA).
I'm not familiar enough with the current LLVM isched work so I do not know
if the current algorithms could be used for static scheduling for VLIWs.
Is one really required to write a separate "packer" that just tries to pack
sequential(?) instructions into bundles without trying to reorder them to improve
the packing density?
The way I see it, "packing" is just instruction scheduling with an additional
"what can be in a single bundle and in what order" scheduling constraint, and it has
fixed cycles (bundles) for the instructions as an outcome. The bundle
constraint (the wide instruction template(s) supported by the machine) can be
implemented with the state machine approach or just a resource table.
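The resource-table variant of that constraint can be sketched as follows (a toy functional-unit table; the names and slot layout are hypothetical):

```cpp
#include <cassert>

// A machine with one issue packet per cycle; each functional-unit slot is
// one bit of a mask, and each candidate instruction carries a mask of the
// FUs it can legally execute on.
struct PacketState {
  unsigned UsedSlots = 0; // bitmask of FU slots already occupied this cycle
};

// Try to reserve a slot for an instruction whose legal FUs are FUMask.
// Returns false when no compatible slot is free, i.e. the instruction
// must start a new packet (bundle).
bool reserveSlot(PacketState &P, unsigned FUMask) {
  unsigned Free = FUMask & ~P.UsedSlots;
  if (!Free)
    return false;
  P.UsedSlots |= Free & -Free; // claim the lowest free compatible slot
  return true;
}
```

A DFA-based packetizer answers the same "does this fit in the current packet?" question, but precomputes the legal slot assignments as automaton states instead of searching the table at packetization time.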
VLIW scheduling requires also multiplicity for the FU resource constraints.
Can those be described in the current TableGen format? Clustered (or in
general, not-fully-connected) VLIWs also require connectivity information
(which FUs are connected to which RFs) but that can be modeled with
register classes, I've been told.
--
--Pekka
> What about instruction scheduling? Has anyone thought how/if isched could work
> with parallel bundles? That is, to create the bundles (referred to as "packing"
> by some) the first place (before and/or after RA).
> ...
> The bundle constraint (the wide instruction template(s) supported by the machine)
> can be implemented with the state machine approach or just a resource table.
We have thought about it from a VLIW perspective. Yes, you can integrate scheduling with packetization. I committed a DFA-based packetization mechanism last week, and it can be used by the scheduler to make more intelligent decisions for a VLIW target. The flow that Evan described is flexible enough to accommodate both an integrated scheduler and a pre- and post-scheduling packetizer. Some VLIW targets may need all three phases; others can use some combination of the three.
-Anshu
-------------------------
| r0 = op1 r1, r2 |
| r3 = op2 r0<kill>, #c |
-------------------------
For instance, some such VLIW architectures allow the programmer to
specify whether "r0" in "op2" will have the value from t-1 (before the
bundle) or from t+1 (the result produced by "op1"), if anything
because the forwarding of results is close enough in the pipeline to be
used in the same bundle.
It seems to me that such subtleties are better left near the
target-dependent code, but the target-independent code should refrain
from making strict interpretations of what a bundle should look like.
When it gets to the MC layer, it seems to me that overloading the MCI
operands with opcodes is a bit unusual, but, at first glance, it might
not be too hard to handle.
--
Evandro Menezes Austin, TX emen...@codeaurora.org
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum
Hi Evan,
I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM.
I would not integrate the packetizer into the register allocator super-pass, since it would reduce the flexibility for the back-end developer to add optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must then deal with the PHI and COPY nodes that are eliminated by the RA. The packetizer can simply group all PHI and COPY instructions into single bundles consisting of only one instruction.
From my experience, a simple packetizer that groups instructions into bundles (like the old IA-64 back-end did) without changing their order produces bad code. Instead, a VLIW scheduler that directly outputs bundles produces better code. The current LLVM scheduler (at the end of the instruction selection pass) is not suitable for generating bundled instructions since it operates on scheduling units for glued instructions.
However, the post-RA scheduler in combination with a VLIW-aware hazard recognizer can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications within the post-RA scheduler classes are necessary to support virtual registers.
I also would not include packet finalization in the register allocator super-pass, since the following prolog and epilog code insertion (PEI) pass also adds extra instructions into the instruction list. So I would add packet finalization after prolog and epilog code insertion. Both the RA and PEI can add their instructions as single-instruction bundles, which can be merged into larger bundles during packet finalization. For packet finalization it also makes sense to perform post-RA VLIW scheduling.
Timo