I'll try to describe how the itinerary works a bit; it's non-intuitive.
The itinerary has two lists: a list of pipeline stages and a list of
operand latencies. The latency of an instruction is captured by the
latency of its "definition" operands, so latency does not need to be
modeled in the pipeline stages at all.
A 2 wide, 1 deep pipeline (2x1) would be:
[InstrStage<1, [Pipe0, Pipe1]>]
A 2 wide, 4 deep pipeline (2x4) would be:
[InstrStage<1, [Pipe0, Pipe1]>]
Surprise. There is no difference in the pipeline description, because
the units are fully pipelined and we don't need to express latency
here. (I'm only showing the pipeline stages here, not the operand latency list).
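To make the difference concrete, it is the operand latency list that
separates the two machines. A sketch (the itinerary class name IIAlu and
the operand counts are placeholders, not from a real target):

  // 2x1: one-cycle latency on the def operand
  InstrItinData<IIAlu, [InstrStage<1, [Pipe0, Pipe1]>], [1, 0, 0]>

  // 2x4: identical stage list; only the def operand's latency changes
  InstrItinData<IIAlu, [InstrStage<1, [Pipe0, Pipe1]>], [4, 0, 0]>

The first entry in the latency list corresponds to the definition
operand; the remaining entries are the use operands.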
Let's say you want to treat each stage of a pipeline as a separate
type of unit:
stage0: Decode
stage1: Exec
stage2: Write
[InstrStage<1, [Decode0, Decode1], 0>,
InstrStage<1, [Exec0, Exec1], 0>,
InstrStage<1, [Write0, Write1], 0>]
Now when the first instruction is scheduled, it fills in the current
row of the reservation table with Decode0, Exec0, Write0. This is
counterintuitive because the instruction does not execute on all units
in the same cycle, but it results in a more compact reservation table
and still sufficiently models hazards.
Things only get more complicated if you have functional units that are
not fully pipelined, or you have instructions that use the same functional
units at different pipeline stages.
If I have an instruction that consumes a functional unit for 2 cycles,
during which no other instruction may be issued to that unit, then I
need to do this:
[InstrStage<2, [NonPipelinedUnit]>]
If I have an instruction that splits into two dependent microops that
use the same type of functional unit, but at different times, then I need to
do this:
[InstrStage<1, [ALU0, ALU1], 1>,
InstrStage<1, [ALU0, ALU1]>]
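Assembled into a complete itinerary entry, the split-microop case might
look like this (the class name IIDoubleALU and the latency values are
hypothetical, for illustration only):

  InstrItinData<IIDoubleALU, [InstrStage<1, [ALU0, ALU1], 1>,
                              InstrStage<1, [ALU0, ALU1]>],
                [2, 0, 0]>

The TimeInc of 1 on the first stage starts the second stage one cycle
later, so the instruction reserves an ALU slot in two consecutive cycles
rather than two units in the same cycle.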
-Andy
From TargetSchedule.td:
//===----------------------------------------------------------------------===//
// Instruction stage - These values represent a non-pipelined step in
// the execution of an instruction. Cycles represents the number of
// discrete time slots needed to complete the stage. Units represent
// the choice of functional units that can be used to complete the
// stage. Eg. IntUnit1, IntUnit2. NextCycles indicates how many
// cycles should elapse from the start of this stage to the start of
// the next stage in the itinerary. For example:
//
// A stage is specified in one of two ways:
//
// InstrStage<1, [FU_x, FU_y]> - TimeInc defaults to Cycles
// InstrStage<1, [FU_x, FU_y], 0> - TimeInc explicit
//
class InstrStage<int cycles, list<FuncUnit> units,
int timeinc = -1,
ReservationKind kind = Required> {
int Cycles = cycles; // length of stage in machine cycles
list<FuncUnit> Units = units; // choice of functional units
int TimeInc = timeinc; // cycles till start of next stage
int Kind = kind.Value; // kind of FU reservation
}
> -----Original Message-----
> From: Andrew Trick [mailto:atr...@apple.com]
> Sent: 21 October 2011 02:36
> To: James Molloy
> Cc: Hal Finkel; llvm-commits LLVM; Evan Cheng
> Subject: Re: [llvm-commits] [llvm] r142171 - in /llvm/trunk:
> lib/Target/PowerPC/PPCSchedule440.td test/CodeGen/PowerPC/ppc440-fp-basic.ll
> test/CodeGen/PowerPC/ppc440-msync.ll
>
> On Oct 20, 2011, at 3:24 PM, Evan Cheng wrote:
>
>>
>> On Oct 20, 2011, at 12:04 PM, James Molloy wrote:
>>
>>> Evan,
>>>
>>> Regarding this, I wanted to ask - there's currently a hard limit of 32
> FunctionalUnits. Functional units cannot be pipelined, so for example to
> describe a pipeline for a superscalar machine of issue width N taking M
> cycles, one requires N*M functional units.
>>
>> I don't think that's how it works. You can describe a resource being
> acquired or reserved for M cycles. Perhaps I am not understanding your
> question.
>>
>> Evan
>>
>
> An N-wide machine can be described with N units, regardless of how deep the
> pipeline is.
>
> Furthermore if you only need to model issue width, then you don't even need
> to describe the pipeline at all. You only need to set the
> InstrItineraryData::IssueWidth field. ARMSubtarget::computeIssueWidth does
> this by assuming something about the convention of ARM itineraries. But you
> could simply embed the issue width constants for your subtargets within the
> target initialization code (in place of computeIssueWidth). I never bothered
> to add tablegen support for an IssueWidth field in the itinerary because we
> didn't need it for x86 and it is redundant with the existing ARM
> itineraries.
>
> -Andy
>
>>>
>>> This can quickly take you over the 32 unit limit. Is there any plan (or
> can I implement) pipelined functional units that can accept a new
> instruction every cycle but hold instructions for N cycles?
>>>
>>> Cheers,
>>>
>>> James
>>> ________________________________________
>>> From: llvm-commi...@cs.uiuc.edu [llvm-commi...@cs.uiuc.edu]
> On Behalf Of Evan Cheng [evan....@apple.com]
>>> Sent: 20 October 2011 18:21
>>> To: Hal Finkel
>>> Cc: llvm-c...@cs.uiuc.edu
>>> Subject: Re: [llvm-commits] [llvm] r142171 - in /llvm/trunk:
> lib/Target/PowerPC/PPCSchedule440.td test/CodeGen/PowerPC/ppc440-fp-basic.ll
> test/CodeGen/PowerPC/ppc440-msync.ll
>>>
>>> On Oct 19, 2011, at 7:29 PM, Hal Finkel <hfi...@anl.gov> wrote:
>>>
>>>> Evan,
>>>>
>>>> Thanks for the heads up! Is there a current target that implements the
>>>> scheduling as it will be? And does the bottom-up scheduling also account
>>>
>>> ARM is a good model.
>>>
>>>> for pipeline-conflict hazards?
>>>
>>> Yes, definitely. And it should be doing a much better job of it.
>>>
>>> Evan
>>>
>>>>
>>>> -Hal
>>>>
>>>> On Wed, 2011-10-19 at 16:45 -0700, Evan Cheng wrote:
>>>>> Hi Hal,
>>>>>
>>>>> Heads up. We'll soon abolish the top-down pre-register-allocation scheduler
> and force every target to bottom-up scheduling. The problem is that the list
> scheduler does not handle physical register dependencies at all, but that is
> something that's required for an upcoming legalizer change.
>>>>>
>>>>> If you are interested in PPC, you might want to look into switching its
> scheduler now. The bottom up register pressure aware scheduler should work
> quite well for PPC.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Evan
>>>>>
>>>>> On Oct 16, 2011, at 9:03 PM, Hal Finkel wrote:
>>>>>
>>>>>> Author: hfinkel
>>>>>> Date: Sun Oct 16 23:03:55 2011
>>>>>> New Revision: 142171
>>>>>>
>>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=142171&view=rev
>>>>>> Log:
>>>>>> Add PPC 440 scheduler and some associated tests (new files)
>>>>>>
>>>>>> Added:
>>>>>> llvm/trunk/lib/Target/PowerPC/PPCSchedule440.td
>>>>>> llvm/trunk/test/CodeGen/PowerPC/ppc440-fp-basic.ll
>>>>>> llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll
>>>>>>
>>>>>> Added: llvm/trunk/lib/Target/PowerPC/PPCSchedule440.td
>>>>>> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/PowerPC/PPCSchedul
> e440.td?rev=142171&view=auto
>>>>>>
> ============================================================================
> ==
>>>>>> --- llvm/trunk/lib/Target/PowerPC/PPCSchedule440.td (added)
>>>>>> +++ llvm/trunk/lib/Target/PowerPC/PPCSchedule440.td Sun Oct 16
> 23:03:55 2011
>>>>>> @@ -0,0 +1,568 @@
>>>>>> +//===- PPCSchedule440.td - PPC 440 Scheduling Definitions ----*-
> tablegen -*-===//
>>>>>> +//
>>>>>> +// The LLVM Compiler Infrastructure
>>>>>> +//
>>>>>> +// This file is distributed under the University of Illinois Open
> Source
>>>>>> +// License. See LICENSE.TXT for details.
>>>>>> +//
>>>>>>
> +//===----------------------------------------------------------------------
> ===//
>>>>>> +
>>>>>> +// Primary reference:
>>>>>> +// PowerPC 440x6 Embedded Processor Core User's Manual.
>>>>>> +// IBM (as updated in) 2010.
>>>>>> +
>>>>>> +// The basic PPC 440 does not include a floating-point unit; the
> pipeline
>>>>>> +// timings here are constructed to match the FP2 unit shipped with
> the
>>>>>> +// PPC-440- and PPC-450-based Blue Gene (L and P) supercomputers.
>>>>>> +// References:
>>>>>> +// S. Chatterjee, et al. Design and exploitation of a
> high-performance
>>>>>> +// SIMD floating-point unit for Blue Gene/L.
>>>>>> +// IBM J. Res. & Dev. 49 (2/3) March/May 2005.
>>>>>> +// also:
>>>>>> +// Carlos Sosa and Brant Knudson. IBM System Blue Gene Solution:
>>>>>> +// Blue Gene/P Application Development.
>>>>>> +// IBM (as updated in) 2009.
>>>>>> +
>>>>>>
> +//===----------------------------------------------------------------------
> ===//
>>>>>> +// Functional units on the PowerPC 440/450 chip sets
>>>>>> +//
>>>>>> +def IFTH1 : FuncUnit; // Fetch unit 1
>>>>>> +def IFTH2 : FuncUnit; // Fetch unit 2
>>>>>> +def PDCD1 : FuncUnit; // Decode unit 1
>>>>>> +def PDCD2 : FuncUnit; // Decode unit 2
>>>>>> +def DISS1 : FuncUnit; // Issue unit 1
>>>>>> +def DISS2 : FuncUnit; // Issue unit 2
>>>>>> +def LRACC : FuncUnit; // Register access and dispatch for
>>>>>> + // the simple integer (J-pipe) and
>>>>>> + // load/store (L-pipe) pipelines
>>>>>> +def IRACC : FuncUnit; // Register access and dispatch for
>>>>>> + // the complex integer (I-pipe) pipeline
>>>>>> +def FRACC : FuncUnit; // Register access and dispatch for
>>>>>> + // the floating-point execution (F-pipe)
> pipeline
>>>>>> +def IEXE1 : FuncUnit; // Execution stage 1 for the I pipeline
>>>>>> +def IEXE2 : FuncUnit; // Execution stage 2 for the I pipeline
>>>>>> +def IWB : FuncUnit; // Write-back unit for the I pipeline
>>>>>> +def JEXE1 : FuncUnit; // Execution stage 1 for the J pipeline
>>>>>> +def JEXE2 : FuncUnit; // Execution stage 2 for the J pipeline
>>>>>> +def JWB : FuncUnit; // Write-back unit for the J pipeline
>>>>>> +def AGEN : FuncUnit; // Address generation for the L pipeline
>>>>>> +def CRD : FuncUnit; // D-cache access for the L pipeline
>>>>>> +def LWB : FuncUnit; // Write-back unit for the L pipeline
>>>>>> +def FEXE1 : FuncUnit; // Execution stage 1 for the F pipeline
>>>>>> +def FEXE2 : FuncUnit; // Execution stage 2 for the F pipeline
>>>>>> +def FEXE3 : FuncUnit; // Execution stage 3 for the F pipeline
>>>>>> +def FEXE4 : FuncUnit; // Execution stage 4 for the F pipeline
>>>>>> +def FEXE5 : FuncUnit; // Execution stage 5 for the F pipeline
>>>>>> +def FEXE6 : FuncUnit; // Execution stage 6 for the F pipeline
>>>>>> +def FWB : FuncUnit; // Write-back unit for the F pipeline
>>>>>> +
>>>>>> +def LWARX_Hold : FuncUnit; // This is a pseudo-unit which is used
>>>>>> + // to make sure that no lwarx/stwcx.
>>>>>> + // instructions are issued while another
>>>>>> + // lwarx/stwcx. is in the L pipe.
>>>>>> +
>>>>>> +def GPR_Bypass : Bypass; // The bypass for general-purpose regs.
>>>>>> +def FPR_Bypass : Bypass; // The bypass for floating-point regs.
>>>>>> +
>>>>>> +// Notes:
>>>>>> +// Instructions are held in the FRACC, LRACC and IRACC pipeline
>>>>>> +// stages until their source operands become ready. Exceptions:
>>>>>> +// - Store instructions will hold in the AGEN stage
>>>>>> +// - The integer multiply-accumulate instruction will hold in
>>>>>> +// the IEXE1 stage
>>>>>> +//
>>>>>> +// For most I-pipe operations, the result is available at the end of
>>>>>> +// the IEXE1 stage. Operations such as multiply and divide must
>>>>>> +// continue to execute in IEXE2 and IWB. Divide resides in IWB for
>>>>>> +// 33 cycles (multiply also calculates its result in IWB). For all
>>>>>> +// J-pipe instructions, the result is available
>>>>>> +// at the end of the JEXE1 stage. Loads have a 3-cycle latency
>>>>>> +// (data is not available until after the LWB stage).
>>>>>> +//
>>>>>> +// The L1 cache hit latency is four cycles for floating point loads
>>>>>> +// and three cycles for integer loads.
>>>>>> +//
>>>>>> +// The stwcx. instruction requires both the LRACC and the IRACC
>>>>>> +// dispatch stages. It must be issued from DISS0.
>>>>>> +//
>>>>>> +// All lwarx/stwcx. instructions hold in LRACC if another
>>>>>> +// uncommitted lwarx/stwcx. is in AGEN, CRD, or LWB.
>>>>>> +//
>>>>>> +// msync (a.k.a. sync) and mbar will hold in LWB until all load/store
>>>>>> +// resources are empty. AGEN and CRD are held empty until the
> msync/mbar
>>>>>> +// commits.
>>>>>> +//
>>>>>> +// Most floating-point instructions, computational and move,
>>>>>> +// have a 5-cycle latency. Divide takes longer (30 cycles).
> Instructions that
>>>>>> +// update the CR take 2 cycles. Stores take 3 cycles and, as
> mentioned above,
>>>>>> +// loads take 4 cycles (for L1 hit).
>>>>>> +
>>>>>> +//
>>>>>> +// This file defines the itinerary class data for the PPC 440
> processor.
>>>>>> +//
>>>>>>
> +//===----------------------------------------------------------------------
> ===//
>>>>>> +
>>>>>> +
>>>>>> +def PPC440Itineraries : ProcessorItineraries<
>>>>>> + [IFTH1, IFTH2, PDCD1, PDCD2, DISS1, DISS2, FRACC,
>>>>>> + IRACC, IEXE1, IEXE2, IWB, LRACC, JEXE1, JEXE2, JWB, AGEN, CRD,
> LWB,
>>>>>> + FEXE1, FEXE2, FEXE3, FEXE4, FEXE5, FEXE6, FWB, LWARX_Hold],
>>>>>> + [GPR_Bypass, FPR_Bypass], [
>>>>>> + InstrItinData<IntGeneral , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC, LRACC]>,
>>>>>> + InstrStage<1, [IEXE1, JEXE1]>,
>>>>>> + InstrStage<1, [IEXE2, JEXE2]>,
>>>>>> + InstrStage<1, [IWB, JWB]>],
>>>>>> + [6, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntCompare , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC, LRACC]>,
>>>>>> + InstrStage<1, [IEXE1, JEXE1]>,
>>>>>> + InstrStage<1, [IEXE2, JEXE2]>,
>>>>>> + InstrStage<1, [IWB, JWB]>],
>>>>>> + [6, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntDivW , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<33, [IWB]>],
>>>>>> + [40, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMFFS , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [7, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMTFSB0 , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [7, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMulHW , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMulHWU , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMulLI , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntRotate , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC, LRACC]>,
>>>>>> + InstrStage<1, [IEXE1, JEXE1]>,
>>>>>> + InstrStage<1, [IEXE2, JEXE2]>,
>>>>>> + InstrStage<1, [IWB, JWB]>],
>>>>>> + [6, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntShift , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC, LRACC]>,
>>>>>> + InstrStage<1, [IEXE1, JEXE1]>,
>>>>>> + InstrStage<1, [IEXE2, JEXE2]>,
>>>>>> + InstrStage<1, [IWB, JWB]>],
>>>>>> + [6, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntTrapW , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [6, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<BrB , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<BrCR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<BrMCR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<BrMCRX , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStDCBA , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStDCBF , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStDCBI , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStGeneral , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<2, [LWB]>],
>>>>>> + [9, 5], // FIXME: should be [9, 5] for
> loads and
>>>>>> + // [8, 5] for stores.
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStICBI , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStUX , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5, 5],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLFD , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<2, [LWB]>],
>>>>>> + [9, 5, 5],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLFDU , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [9, 5, 5],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLHA , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLMW , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLWARX , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1]>,
>>>>>> + InstrStage<1, [IRACC], 0>,
>>>>>> + InstrStage<4, [LWARX_Hold], 0>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStSTWCX , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1]>,
>>>>>> + InstrStage<1, [IRACC], 0>,
>>>>>> + InstrStage<4, [LWARX_Hold], 0>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStSync , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<3, [AGEN], 1>,
>>>>>> + InstrStage<2, [CRD], 1>,
>>>>>> + InstrStage<1, [LWB]>]>,
>>>>>> + InstrItinData<SprISYNC , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC], 0>,
>>>>>> + InstrStage<1, [LRACC], 0>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [FEXE1], 0>,
>>>>>> + InstrStage<1, [AGEN], 0>,
>>>>>> + InstrStage<1, [JEXE1], 0>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [FEXE2], 0>,
>>>>>> + InstrStage<1, [CRD], 0>,
>>>>>> + InstrStage<1, [JEXE2], 0>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<6, [FEXE3], 0>,
>>>>>> + InstrStage<6, [LWB], 0>,
>>>>>> + InstrStage<6, [JWB], 0>,
>>>>>> + InstrStage<6, [IWB]>]>,
>>>>>> + InstrItinData<SprMFSR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [6, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMTMSR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [6, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMTSR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [9, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprTLBSYNC , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>]>,
>>>>>> + InstrItinData<SprMFCR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMFMSR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [7, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMFSPR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [10, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMFTB , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [10, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMTSPR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [10, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMTSRIN , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [10, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprRFI , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprSC , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<FPGeneral , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<1, [FWB]>],
>>>>>> + [10, 4, 4],
>>>>>> + [FPR_Bypass, FPR_Bypass, FPR_Bypass]>,
>>>>>> + InstrItinData<FPCompare , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<1, [FWB]>],
>>>>>> + [10, 4, 4],
>>>>>> + [FPR_Bypass, FPR_Bypass, FPR_Bypass]>,
>>>>>> + InstrItinData<FPDivD , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<25, [FWB]>],
>>>>>> + [35, 4, 4],
>>>>>> + [NoBypass, FPR_Bypass, FPR_Bypass]>,
>>>>>> + InstrItinData<FPDivS , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<13, [FWB]>],
>>>>>> + [23, 4, 4],
>>>>>> + [NoBypass, FPR_Bypass, FPR_Bypass]>,
>>>>>> + InstrItinData<FPFused , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<1, [FWB]>],
>>>>>> + [10, 4, 4, 4],
>>>>>> + [FPR_Bypass, FPR_Bypass, FPR_Bypass,
> FPR_Bypass]>,
>>>>>> + InstrItinData<FPRes , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<1, [FWB]>],
>>>>>> + [10, 4],
>>>>>> + [FPR_Bypass, FPR_Bypass]>
>>>>>> +]>;
>>>>>>
>>>>>> Added: llvm/trunk/test/CodeGen/PowerPC/ppc440-fp-basic.ll
>>>>>> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/PowerPC/ppc440-f
> p-basic.ll?rev=142171&view=auto
>>>>>>
> ============================================================================
> ==
>>>>>> --- llvm/trunk/test/CodeGen/PowerPC/ppc440-fp-basic.ll (added)
>>>>>> +++ llvm/trunk/test/CodeGen/PowerPC/ppc440-fp-basic.ll Sun Oct 16
> 23:03:55 2011
>>>>>> @@ -0,0 +1,32 @@
>>>>>> +; RUN: llc < %s -march=ppc32 -mcpu=440 | grep fmadd
>>>>>> +
>>>>>> +%0 = type { double, double }
>>>>>> +
>>>>>> +define void @maybe_an_fma(%0* sret %agg.result, %0* byval %a, %0*
> byval %b, %0* byval %c) nounwind {
>>>>>> +entry:
>>>>>> + %a.realp = getelementptr inbounds %0* %a, i32 0, i32 0
>>>>>> + %a.real = load double* %a.realp
>>>>>> + %a.imagp = getelementptr inbounds %0* %a, i32 0, i32 1
>>>>>> + %a.imag = load double* %a.imagp
>>>>>> + %b.realp = getelementptr inbounds %0* %b, i32 0, i32 0
>>>>>> + %b.real = load double* %b.realp
>>>>>> + %b.imagp = getelementptr inbounds %0* %b, i32 0, i32 1
>>>>>> + %b.imag = load double* %b.imagp
>>>>>> + %mul.rl = fmul double %a.real, %b.real
>>>>>> + %mul.rr = fmul double %a.imag, %b.imag
>>>>>> + %mul.r = fsub double %mul.rl, %mul.rr
>>>>>> + %mul.il = fmul double %a.imag, %b.real
>>>>>> + %mul.ir = fmul double %a.real, %b.imag
>>>>>> + %mul.i = fadd double %mul.il, %mul.ir
>>>>>> + %c.realp = getelementptr inbounds %0* %c, i32 0, i32 0
>>>>>> + %c.real = load double* %c.realp
>>>>>> + %c.imagp = getelementptr inbounds %0* %c, i32 0, i32 1
>>>>>> + %c.imag = load double* %c.imagp
>>>>>> + %add.r = fadd double %mul.r, %c.real
>>>>>> + %add.i = fadd double %mul.i, %c.imag
>>>>>> + %real = getelementptr inbounds %0* %agg.result, i32 0, i32 0
>>>>>> + %imag = getelementptr inbounds %0* %agg.result, i32 0, i32 1
>>>>>> + store double %add.r, double* %real
>>>>>> + store double %add.i, double* %imag
>>>>>> + ret void
>>>>>> +}
>>>>>>
>>>>>> Added: llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll
>>>>>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll?rev=142171&view=auto
>>>>>>
> ==============================================================================
>>>>>> --- llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll (added)
>>>>>> +++ llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll Sun Oct 16 23:03:55 2011
>>>>>> @@ -0,0 +1,23 @@
>>>>>> +; RUN: llc < %s -march=ppc32 -o %t
>>>>>> +; RUN: grep sync %t
>>>>>> +; RUN: not grep msync %t
>>>>>> +; RUN: llc < %s -march=ppc32 -mcpu=440 | grep msync
>>>>>> +
>>>>>> +define i32 @has_a_fence(i32 %a, i32 %b) nounwind {
>>>>>> +entry:
>>>>>> + fence acquire
>>>>>> + %cond = icmp eq i32 %a, %b
>>>>>> + br i1 %cond, label %IfEqual, label %IfUnequal
>>>>>> +
>>>>>> +IfEqual:
>>>>>> + fence release
>>>>>> + br label %end
>>>>>> +
>>>>>> +IfUnequal:
>>>>>> + fence release
>>>>>> + ret i32 0
>>>>>> +
>>>>>> +end:
>>>>>> + ret i32 1
>>>>>> +}
>>>>>> +
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-c...@cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>> --
>>>> Hal Finkel
>>>> Postdoctoral Appointee
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>> 1-630-252-0023
>>>> hfi...@anl.gov
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Thanks in advance,
Hal
On Thu, 2011-10-20 at 10:21 -0700, Evan Cheng wrote:
>
> On Oct 19, 2011, at 7:29 PM, Hal Finkel <hfi...@anl.gov> wrote:
>
> > Evan,
> >
> > Thanks for the heads up! Is there a current target that implements the
> > scheduling as it will be? And does the bottom-up scheduling also account
>
> ARM is a good model.
What part of ARM's implementation is associated with the bottom-up
scheduling? I am confused because it looks like it is essentially using
the same kind of ScoreboardHazardRecognizer that was commented out of
the PPC 440 code.
Thanks in advance,
Hal
>
> > for pipeline-conflict hazards?
>
> Yes, definitely. And it should be doing a much better job of it.
>
> Evan
>
> >
> > -Hal
> >
> > On Wed, 2011-10-19 at 16:45 -0700, Evan Cheng wrote:
> >> Hi Hal,
> >>
> >> Heads up. We'll soon abolish the top-down pre-register allocation scheduler and force every target to bottom-up scheduling. The problem is the top-down list scheduler does not handle physical register dependencies at all, but that is required for some upcoming legalizer changes.
> >>
> >> If you are interested in PPC, you might want to look into switching its scheduler now. The bottom-up register-pressure-aware scheduler should work quite well for PPC.
> >>
> >> Thanks,
> >>
> >> Evan
> >>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Hi Hal,
The best way to ensure the PPC scheduling isn't hosed now or in the
future is probably to make it work as much like ARM as possible.
This means (1) defaulting to the "hybrid" scheduler, (2) implementing the
register pressure limit, and (3) reenabling the hazard recognizer.
(1) TargetLowering::setSchedulingPreference(Sched::Hybrid)
(2) TargetRegisterInfo::getRegisterPressureLimit(...) should probably
return something a bit less than 32, depending on register class.
(3) The standard hazard recognizer works either bottom-up or top-down
on the itinerary data, so it *should* work out of the box. The problem is
that PPC has overridden the API to layer some custom "bundling" logic
on top of basic hazard detection. This logic needs to be reversed for
bottom-up, or you could start by simply disabling it instead of the
entire hazard recognizer.
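To make the scoreboard idea concrete, here is a toy model of a reservation-table hazard recognizer (a Python sketch with invented names; LLVM's actual ScoreboardHazardRecognizer is C++ and considerably more involved):

```python
from collections import deque

# One itinerary stage: occupy one of the units in `units` for `cycles`
# consecutive cycles. Stages are assumed back-to-back for simplicity.
class Stage:
    def __init__(self, cycles, units):
        self.cycles, self.units = cycles, frozenset(units)

# Toy scoreboard: busy[i] is the set of units reserved i cycles from "now".
class ToyScoreboard:
    def __init__(self):
        self.busy = deque()

    def _busy_at(self, i):
        return self.busy[i] if i < len(self.busy) else set()

    def has_hazard(self, stages):
        cycle = 0
        for s in stages:
            for _ in range(s.cycles):
                if s.units <= self._busy_at(cycle):
                    return True  # every alternative unit is already taken
                cycle += 1
        return False

    def emit(self, stages):  # assumes the caller checked has_hazard first
        cycle = 0
        for s in stages:
            for _ in range(s.cycles):
                while cycle >= len(self.busy):
                    self.busy.append(set())
                free = s.units - self.busy[cycle]
                if free:
                    self.busy[cycle].add(min(free))  # reserve one free unit
                cycle += 1

    def advance_cycle(self):
        if self.busy:
            self.busy.popleft()

# A unit that is not fully pipelined: InstrStage<2, [FPU]> in the
# itinerary notation -- the instruction holds FPU for two cycles.
sb = ToyScoreboard()
non_pipelined = [Stage(2, ["FPU"])]
assert not sb.has_hazard(non_pipelined)
sb.emit(non_pipelined)
assert sb.has_hazard(non_pipelined)   # FPU busy this cycle
sb.advance_cycle()
assert sb.has_hazard(non_pipelined)   # still busy one cycle later
sb.advance_cycle()
assert not sb.has_hazard(non_pipelined)
```

The non-pipelined case mirrors the InstrStage<2, [NonPipelinedUnit]> example: a second instruction sees a hazard until two AdvanceCycle() calls have drained the occupancy.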
Now, to generate the best PPC schedules, there is one thing you may
want to override. The scheduler's priority function has a
HasReadyFilter attribute (enum). It can be overridden by specializing
hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
scheduling, and maximizes the instructions that can issue in one
group, regardless of register pressure. We still care about register
pressure enough in ARM to avoid enabling this. I'm really not sure how
much it will help on modern PPC implementations though.
I realize this is confusing because we have a scheduler mode named
"ILP". That mode is intended for targets that do not have an
itinerary. It's currently set up for x86 and would need some tweaking
to work well for other targets. Again, if your target has an
itinerary, you probably want the "hybrid" mode.
-Andy
On Wed, 2011-10-19 at 16:45 -0700, Evan Cheng wrote:
>>>> Hi Hal,
>>>>
>>>> Heads up. We'll soon abolish the top-down pre-register allocation scheduler and force every target to bottom-up scheduling. The problem is the top-down list scheduler does not handle physical register dependencies at all, but that is required for some upcoming legalizer changes.
>>>>
>>>> If you are interested in PPC, you might want to look into switching its scheduler now. The bottom-up register-pressure-aware scheduler should work quite well for PPC.
>>>>
>>>> Thanks,
>>>>
>>>> Evan
>>>>
_______________________________________________
Is EmitInstruction used in bottom-up scheduling at all? The version in
the ARM recognizer seems essential, but in all of the regression tests
(and some other .ll files I have lying around), it is never called. It
seems that only Reset() and getHazardType() are called. Could you please
explain the calling sequence?
Thanks again,
Hal
>
> Now, to generate the best PPC schedules, there is one thing you may
> want to override. The scheduler's priority function has a
> HasReadyFilter attribute (enum). It can be overridden by specializing
> hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
> scheduling, and maximizes the instructions that can issue in one
> group, regardless of register pressure. We still care about register
> pressure enough in ARM to avoid enabling this. I'm really not sure how
> much it will help on modern PPC implementations though.
>
> I realize this is confusing because we have a scheduler mode named
> "ILP". That mode is intended for targets that do not have an
> itinerary. It's currently set up for x86 and would need some tweaking
> to work well for other targets. Again, if your target has an
> itinerary, you probably want the "hybrid" mode.
>
> -Andy
>
> On Wed, 2011-10-19 at 16:45 -0700, Evan Cheng wrote:
> >>>> Hi Hal,
> >>>>
> >>>> Heads up. We'll soon abolish the top-down pre-register allocation scheduler and force every target to bottom-up scheduling. The problem is the top-down list scheduler does not handle physical register dependencies at all, but that is required for some upcoming legalizer changes.
> >>>>
> >>>> If you are interested in PPC, you might want to look into switching its scheduler now. The bottom-up register-pressure-aware scheduler should work quite well for PPC.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Evan
> >>>>
>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
I feel that I should clarify my comment: For PPC, now that Hybrid
scheduling is enabled, EmitInstruction seems never to be called (at
least it is not called when running any PPC codegen test in the
regression-test collection).
Thanks again,
Hal
Andy,
Thanks! Since I have to change PPCHazardRecognizer for bottom-up support
anyway, is there any reason not to have it derive from
ScoreboardHazardRecognizer at this point? It looks like the custom
bundling logic could be implemented on top of the scoreboard recognizer
(that seems similar to what ARM's recognizer is doing).
-Hal
>
>
> See how this is done in the ScoreboardHazardRecognizer ctor:
> > MaxLookAhead = ScoreboardDepth;
>
>
>
> -Andy
>
>
--
Also, how does the ARM hazard recognizer get away with not implementing
RecedeCycle?
Thanks again,
Hal
Andy
On Nov 29, 2011, at 8:51 AM, Hal Finkel <hfi...@anl.gov> wrote:
>> Thanks! Since I have to change PPCHazardRecognizer for bottom-up support
>> anyway, is there any reason not to have it derive from
>> ScoreboardHazardRecognizer at this point? It looks like the custom
>> bundling logic could be implemented on top of the scoreboard recognizer
>> (that seems similar to what ARM's recognizer is doing).
>
> Also, how does the ARM hazard recognizer get away with not implementing
> RecedeCycle?
>
> Thanks again,
> Hal
I should have been more clear, the ARM implementation has:
void ARMHazardRecognizer::RecedeCycle() {
llvm_unreachable("reverse ARM hazard checking unsupported");
}
How does that work?
Thanks again,
Hal
On Tue, 2011-11-29 at 09:47 -0800, Andrew Trick wrote:
> ARM can reuse all the default scoreboard hazard recognizer logic such as recede cycle (naturally, since it's the primary client). If you can do the same with PPC, that's great.
>
> Andy
>
> On Nov 29, 2011, at 8:51 AM, Hal Finkel <hfi...@anl.gov> wrote:
>
> >> Thanks! Since I have to change PPCHazardRecognizer for bottom-up support
> >> anyway, is there any reason not to have it derive from
> >> ScoreboardHazardRecognizer at this point? It looks like the custom
> >> bundling logic could be implemented on top of the scoreboard recognizer
> >> (that seems similar to what ARM's recognizer is doing).
> >
> > Also, how does the ARM hazard recognizer get away with not implementing
> > RecedeCycle?
> >
> > Thanks again,
> > Hal
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Is there any good info/docs on scheduling strategy in LLVM? As I was
complaining to you at the LLVM meeting, I end up reverse engineering/double
guessing more than I would like to... This thread shows that I am not
exactly alone in this... Thanks.
Sergei Larin
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
> Andy,
>
> I should have been more clear, the ARM implementation has:
> void ARMHazardRecognizer::RecedeCycle() {
> llvm_unreachable("reverse ARM hazard checking unsupported");
> }
>
> How does that work?
>
> Thanks again,
> Hal
Hal,
My first answer was off the top of my head, so it missed the subtle issue. Just so you know, to answer questions like this I usually need to instrument the code with tracing or step through in the debugger. Even though I've hacked on the code quite a bit, the interaction between the scheduler and target hooks is still not obvious to me from glancing at the code. FWIW, I'm hoping it can be cleaned up gradually, maybe for the next release.
The preRA scheduler is bottom-up, for register pressure tracking. The postRA scheduler is top-down, for simpler hazard detection logic.
On ARM, the preRA scheduler uses an unspecialized instance of ScoreboardHazardRecognizer. The machine-independent RecedeCycle() logic that operates on the scheduler itinerary is sufficient.
The ARM postRA scheduler specializes the HazardRecognizer to handle additional constraints that cannot be expressed in the itinerary. Since this is a top-down scheduler, RecedeCycle() is not applicable.
-Andy
I would say that each target has its own scheduling strategy that has changed considerably over time. We try to maximize code reuse across targets, but it's not easy and done ad hoc. The result is confusing code that makes it difficult to understand the strategy for any particular target.
The right thing to do is:
1) Make it as easy as possible to understand how scheduling works for each of the primary targets (x86 and ARM) independent of each other.
2) Make it easy for very similar targets to piggyback on one of those implementations, without having to worry about other targets
3) Allow dissimilar targets (e.g. VLIW) to completely bypass the scheduler used by other targets and reuse only nicely self-contained parts of the framework, such as the DAG builder and individual machine description features.
We've recently moved further from this ideal scenario in that we're now forcing targets to implement the bottom-up selection dag scheduler. This is not really so bad, because you can revert to "source order" scheduling, -pre-RA-sched=source, and you don't need to implement many target hooks. It burns compile time for no good reason, but you can probably live with it. Then you're free to implement your own MI-level scheduler.
The next step in making it easier to maintain an llvm scheduler for "interesting" targets is to build an MI-level scheduling framework and move at least one of the primary targets to this framework so it's well supported. This would separate the nasty issues of serializing the selection DAG from the challenge of microarchitecture-level scheduling, and provide a suitable place to inject your own scheduling algorithm. It's easier to implement a scheduler when starting from a valid instruction sequence where all dependencies are resolved and no register interferences exist.
To answer your question, there's no clear way to describe the current overall scheduling strategy. For now, you'll need to ask porting questions on llvm-dev. Maybe someone who's faced a similar problem will have a good suggestion. We do want to improve that situation and we intend to do that by first providing a new scheduler framework. When we get to that point, I'll be sure that the new direction can work for you and is easy to understand. All I can say now is that the new design will allow a target to compose a preRA scheduler from an MI-level framework combined with target-specific logic for selecting the optimal instruction order. I don't see any point in imposing a generic scheduling algorithm across all targets.
-Andy
This was actually the source of my question, it was clear that the ARM
RecedeCycle function was not being called, but, running the PPC code in
the debugger, it was clear that the PPC's RecedeCycle function was being
called. I did not appreciate the preRA vs postRA distinction, so thank
you for explaining that.
Instead of trying to port the PPC bundling logic for use with the
bottom-up scheduler, maybe it would be sufficient for now to use it
postRA (which seems like what ARM is doing).
-Hal
From the perspective of the hazard recognizer, from what I can tell, the
difference between the top-down and bottom-up modes is:
In top-down mode, the scheduling proceeds in the forward direction.
AdvanceCycle() may be used, RecedeCycle() is not used. EmitInstruction()
implies a cycle-count increment. In bottom-up mode, scheduling proceeds
in the backwards direction (last instruction first). AdvanceCycle() is
not used, RecedeCycle() is always used to decrement the current cycle
offset (EmitInstruction() does *not* imply a cycle-count decrement).
Is this right? Have I captured everything?
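One way to picture that bookkeeping is a toy driver loop (a Python sketch under the assumptions in the summary above, except that here the driver alone, not EmitInstruction(), moves the cycle counter; all names are invented):

```python
# Toy hazard recognizer enforcing a 1-instruction-per-cycle issue width.
class OneWideRecognizer:
    def __init__(self):
        self.cycle = 0
        self.issued = set()   # cycles that already carry an instruction

    def get_hazard_type(self):
        return "Hazard" if self.cycle in self.issued else "NoHazard"

    def emit_instruction(self):
        self.issued.add(self.cycle)   # no implicit cycle change here

    def advance_cycle(self):
        self.cycle += 1               # top-down only

    def recede_cycle(self):
        self.cycle -= 1               # bottom-up only

def schedule(instrs, top_down=True):
    # Top-down walks first->last and advances; bottom-up walks
    # last->first and recedes, so cycle offsets count downward.
    hr, order = OneWideRecognizer(), {}
    work = instrs if top_down else list(reversed(instrs))
    for i in work:
        while hr.get_hazard_type() != "NoHazard":
            hr.advance_cycle() if top_down else hr.recede_cycle()
        hr.emit_instruction()
        order[i] = hr.cycle
    return order

td = schedule(["a", "b", "c"], top_down=True)
bu = schedule(["a", "b", "c"], top_down=False)
assert td == {"a": 0, "b": 1, "c": 2}    # cycles count up
assert bu == {"c": 0, "b": -1, "a": -2}  # cycles count down from the end
```

The point of the sketch is only the direction of the cycle counter: EmitInstruction() marks the current cycle, and the driver chooses AdvanceCycle() or RecedeCycle() depending on the scheduling direction.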
Thank you for the extended and prompt answer. Let me try to summarize my
current position so you (and everyone interested) would have a better view
of the world through my eyes ;)
1) LLVM's first robust VLIW target is currently in review. Its needs for
scheduling strategy/quality are rather different from what the current
scheduler(s) can provide.
2) My first attempt at porting (while I was on 2.9) resulted in a new
top-down pre-RA VLIW-enabled scheduler that I was hoping to upstream as soon
as our back end is accepted. I guess I have missed the window, since our
commit took a bit longer than planned. Now Evan has told me (and you have
confirmed) that it would need to change to a bottom-up version for 3.0.
Moreover, the current "level" (exact placement in the DAG->DAG pass) of pre-RA
scheduling is less than optimal (and I agree with that, since I have to bend
over backwards to extract info readily available in MIs).
3) Your group is working on a "new" scheduler, and as best I understand
it, it would be the same general algorithm moved "closer" to RA. I also
understand that at first it would not have added support for
"packets"/bundles/multiops in the VLIW sense (or will it?). If they are
present, an interesting discussion on how subsequent passes would be modified
to recognize them would follow... but we had another thread on this topic not
that long ago.
So, IMHO the following would make sense:
1) It would be very nice if we could have some sort of write-up detailing
the proposed changes, and maybe defining an overall strategy for instruction
scheduling in LLVM __before__ major decisions are made. It could later be
converted into a "how to" or a simple doc chapter on porting the scheduler(s)
to new targets. Public discussion should follow, and we need to try to
accommodate all needs (as much as possible).
2) Any attempt on my part to further the VLIW scheduler design for my target
would be unwise until such a discussion takes place. I also do not
separate this process from the bundle/packet representation. If you perceive
overhead associated with this activity, I could volunteer to help.
Also, please see my comments embedded below.
Thanks.
Sergei Larin
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
> -----Original Message-----
> From: Andrew Trick [mailto:atr...@apple.com]
> Sent: Tuesday, November 29, 2011 3:16 PM
> To: Sergei Larin
> Cc: 'Hal Finkel'; llv...@cs.uiuc.edu
> Subject: Re: [LLVMdev] [llvm-commits] Bottom-Up Scheduling?
>
> Sergei,
>
> I would say that each target has its own scheduling strategy that has
> changed considerably over time. We try to maximize code reuse across
> targets, but it's not easy and done ad hoc. The result is confusing
> code that makes it difficult to understand the strategy for any
> particular target.
>
> The right thing to do is:
> 1) Make it as easy as possible to understand how scheduling works for
> each of the primary targets (x86 and ARM) independent of each other.
[Larin, Sergei]
Sure, that could be achieved with the design document/documentation set I
am talking about.
> 2) Make it easy for very similar targets to piggyback on one of those
> implementations, without having to worry about other targets
[Larin, Sergei]
Yes, and having a robust VLIW scheduler would greatly help here. It would
also IMHO set LLVM apart from GCC, and become an additional selling point
for us.
> 3) Allow dissimilar targets (e.g. VLIW) to completely bypass the
> scheduler used by other targets and reuse only nicely self-contained
> parts of the framework, such as the DAG builder and individual machine
> description features.
[Larin, Sergei]
I think this is rather implementation-dependent, and we can finesse this
once we have the framework better defined.
>
> We've recently moved further from this ideal scenario in that we're now
> forcing targets to implement the bottom-up selection dag scheduler.
[Larin, Sergei]
I really dislike this, especially due to the reasons that led to this
decision. I think general "flexibility"/functionality was sacrificed for
tactical reasons.
> This is not really so bad, because you can revert to "source order"
> scheduling, -pre-RA-sched=source, and you don't need to implement many
> target hooks. It burns compile time for no good reason, but you can
> probably live with it. Then you're free to implement your own MI-level
> scheduler.
[Larin, Sergei]
I am not 100% sure about this statement, but as I get closer to
re-implementing my scheduler I might grasp a better picture.
>
> The next step in making it easier to maintain an llvm scheduler for
> "interesting" targets is to build an MI-level scheduling framework and
> move at least one of the primary targets to this framework so it's well
> supported. This would separate the nasty issues of serializing the
> selection DAG from the challenge of microarchitecture-level scheduling,
> and provide a suitable place to inject your own scheduling algorithm.
> It's easier to implement a scheduler when starting from a valid
> instruction sequence where all dependencies are resolved and no
> register interferences exist.
[Larin, Sergei]
Agree, and my whole point is that it needs to be done with preceding
public discussion, and not de facto with code drops.
>
> To answer your question, there's no clear way to describe the current
> overall scheduling strategy. For now, you'll need to ask porting
> questions on llvm-dev. Maybe someone who's faced a similar problem will
> have a good suggestion. We do want to improve that situation and we
> intend to do that by first providing a new scheduler framework. When we
> get to that point, I'll be sure that the new direction can work for you
[Larin, Sergei]
Any clue on time frame?
> and is easy to understand. All I can say now is that the new design
> will allow a target to compose a preRA scheduler from an MI-level
> framework combined with target-specific logic for selecting the optimal
> instruction order. I don't see any point in imposing a generic
> scheduling algorithm across all targets.
>
> -Andy
[Larin, Sergei]
Thank you again for the explanation. I am really looking forward to digging
into it.
Can this be done without modifying common code? It looks like
hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.
Thanks again,
Hal
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
> On Tue, 2011-10-25 at 21:00 -0700, Andrew Trick wrote:
>> Now, to generate the best PPC schedules, there is one thing you may
>> want to override. The scheduler's priority function has a
>> HasReadyFilter attribute (enum). It can be overridden by specializing
>> hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
>> scheduling, and maximizes the instructions that can issue in one
>> group, regardless of register pressure. We still care about register
>> pressure enough in ARM to avoid enabling this. I'm really not sure how
>> much it will help on modern PPC implementations though.
>> hybrid_ls_rr_sort
>
> Can this be done without modifying common code? It looks like
> hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.
>
> Thanks again,
> Hal
Right. You would need to specialize the priority queue logic. A small amount of common code.
Andy
Andy,
I played around with this some today for my PPC 440 chips. These are
embedded chips (multiple pipelines but in-order), and may be more
similar to your ARMs than to the PPC-970 style designs...
I was able to get reasonable PPC 440 code generation by using the ILP
scheduler pre-RA and then the post-RA scheduler with ANTIDEP_ALL (and my
load/store reordering patch). This worked significantly better than
using either hybrid or ilp alone (with or without setting
HasReadyFilter). I was looking at my primary use case which is
partially-unrolled loops with loads, stores and floating-point
calculations.
This seems to work because ILP first groups the instructions to extract
parallelism and then the post-RA scheduler breaks up the groups to avoid
stalls. This allows the scheduler to find its way out of what seems to
be a "local minimum" of sorts, whereby it wants to schedule each
unrolled iteration of the loop sequentially. The reason this seems
to occur is that the hybrid scheduler would prefer to suffer a large
data-dependency delay over a shorter full-pipeline delay. Do you know
why it would do this? (You can see PR11589 for an example if you'd
like.)
Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
used? Is there a reason that this is a compile-time constant? Both
Hybrid and ILP have isReady() functions. I can certainly propose a patch
to make them command-line options.
Thanks again,
Hal
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
The "ilp" scheduler has several heuristics designed to compensate for lack of itinerary. Each of those heuristics has a flag, so you can see what works for your target. I've never used that scheduler with an itinerary, but it should work. It's just that some of the heuristics effectively override the hazard checker.
The "hybrid" scheduler depends more on the itinerary/hazard checker. It's less likely to schedule instructions close together if they may induce a pipeline stall, regardless of operand latency.
> Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
> used? Is there a reason that this is a compile-time constant? Both
> Hybrid and ILP have isReady() functions. I can certainly propose a patch
> to make them command-line options.
It's a compile time constant because it's clearly on the scheduler's critical path and not used by any active targets. Enabling HasReadyFilter turns the preRA scheduler into a strict scheduler such that the hazard checker overrides all other heuristics. That's not what you want if you're also enabling postRA scheduling!
-Andy
I'd prefer to have a scheduler that just does what I want :) -- How can
I make a modified version of the hybrid scheduler that will weight
operand latency and pipeline stalls more equally?
Here's my "thought experiment" (from PR11589): I have a bunch of
load-fadd-store chains to schedule. A store takes two cycles to clear
its last pipeline stage. The fadd takes longer to compute its result
(say 5 cycles), but can sustain a rate of 1 independent add per cycle.
As the scheduling is bottom-up, it will schedule a store, then it has a
choice: it can schedule another store (at a 1 cycle penalty), or it can
schedule the fadd associated with the store it just scheduled (with a 4
cycle penalty due to operand latency). It seems that the current hybrid
scheduler will choose the fadd, I want a scheduler that will make the
opposite choice.
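Plugging the numbers from this thought experiment into toy arithmetic makes the trade-off explicit (illustrative only; this is not the hybrid scheduler's actual cost model):

```python
# Numbers from the thought experiment: a store occupies its last pipeline
# stage for 2 cycles, and an fadd's result is ready after 5 cycles.
STORE_PIPE_OCCUPANCY = 2   # cycles the store holds its final stage
FADD_LATENCY = 5           # cycles until the fadd result is available
ISSUE_GAP = 1              # back-to-back issue is one cycle apart

# Bottom-up, just after scheduling a store, the two candidates cost:
store_stall = STORE_PIPE_OCCUPANCY - ISSUE_GAP   # another store: 1 stall
fadd_stall = FADD_LATENCY - ISSUE_GAP            # its fadd: 4 stalls

assert store_stall == 1
assert fadd_stall == 4

# A chooser that simply minimizes stall cycles picks the store (the
# choice wanted here); the observation above is that hybrid instead
# vetoes the store outright, because it has a structural hazard in the
# current cycle, before latencies are ever compared.
cheapest = min([("store", store_stall), ("fadd", fadd_stall)],
               key=lambda c: c[1])[0]
assert cheapest == "store"
```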
> > Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
> > used? Is there a reason that this is a compile-time constant? Both
> > Hybrid and ILP have isReady() functions. I can certainly propose a patch
> > to make them command-line options.
>
> It's a compile time constant because it's clearly on the scheduler's critical path and not used by any active targets. Enabling HasReadyFilter turns the preRA scheduler into a strict scheduler such that the hazard checker overrides all other heuristics. That's not what you want if you're also enabling postRA scheduling!
Indeed, that makes sense.
Thanks again,
Hal
>
> -Andy
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Andy, I've already looked at the debug output quite a bit; please help
me understand what I'm missing...
First, looking at the code does seem to confirm my suspicion. This is
certainly low-pressure mode, and so hybrid_ls_rr_sort::operator()
will return the result of BUCompareLatency. That function first checks
for stalls and returns 1 or -1. Only after that does it look at the
relative latencies.
In addition, the stall computation is done using BUHasStall, and that
function only checks the current cycle. Without looking forward, I don't
understand how it could know how long the pipeline hazard will last.
It looks like this may have something to do with the height. Can you
explain how that is supposed to work?
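The short-circuit structure described here can be sketched like this (a Python toy with invented fields, not LLVM's actual BUCompareLatency):

```python
def bu_compare(left, right):
    # Returns -1 if `left` should be scheduled first, +1 if `right`
    # should, mirroring the 1/-1 short-circuit described above: the
    # stall check runs before latency is ever consulted.
    if left["stalls_now"] != right["stalls_now"]:
        return 1 if left["stalls_now"] else -1
    # Only when the stall check ties do latencies matter; here "height"
    # stands in for the latency-derived priority: taller is more
    # critical and is preferred.
    if left["height"] != right["height"]:
        return -1 if left["height"] > right["height"] else 1
    return 0

store = {"stalls_now": True, "height": 2}    # full pipeline this cycle
fadd = {"stalls_now": False, "height": 24}   # long latency, no hazard

# The store's one-cycle structural stall disqualifies it before its
# latency advantage can be weighed, so the fadd wins.
assert bu_compare(store, fadd) == 1
```

This matches the observed behavior: because the stall check only looks at the current cycle, a short structural stall always loses to an arbitrarily long operand-latency gap.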
For the specific example: We start with the initial store...
GPRC: 4 / 31
F4RC: 1 / 31
Examining Available:
Height 2: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910,
0x2c03870, 0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
[ORD=94] [ID=102]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
[ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]
*** Scheduling [21]: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70,
0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%
arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]
then it schedules a "token factor" that is attached to the address
computation required by the store (this is essentially a no-op,
right?)...
GPRC: 5 / 31
F4RC: 2 / 31
Examining Available:
Height 21: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1, 0x2c03a70
[ORD=94] [ID=5]
Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
[ID=105]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
[ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]
*** Scheduling [21]: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1,
0x2c03a70 [ORD=94] [ID=5]
now here is the choice that we may want to be different...
GPRC: 5 / 31
F4RC: 2 / 31
Examining Available:
Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
[ID=105]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
[ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]
(with more debug turned on, I also see a bunch of messages like:
*** Hazard in cycle 3, SU(97): xxx: ch = STFSX ...<Mem:ST4[%
arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
one of these for each of the other possible stores).
*** Scheduling [24]: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70,
0x2bf3710 [ORD=92] [ID=105]
why did it choose this fadd over any of the other stores? The
corresponding SUnit descriptions are:
SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870,
0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94]
[ID=102]
# preds left : 4
# succs left : 1
# rdefs left : 0
Latency : 7
Depth : 0
Height : 0
Predecessors:
val #0x2c11ff0 - SU(105): Latency=3
val #0x2c0cdd0 - SU(32): Latency=1
val #0x2c11db0 - SU(103): Latency=1
ch #0x2c0af70 - SU(5): Latency=0
Successors:
ch #0x2c0ac10 - SU(2): Latency=1
SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]
# preds left : 2
# succs left : 1
# rdefs left : 1
Latency : 11
Depth : 0
Height : 0
Predecessors:
val #0x2c12110 - SU(106): Latency=6
val #0x2c0d130 - SU(35): Latency=6
Successors:
val #0x2c11c90 - SU(102): Latency=3
Just from the debugging messages, it looks like what is happening is
that the scheduler is first rejecting the other stores because of
pipeline hazards and then picking the instruction with the lowest
latency. Looking at the code, it seems that this is exactly what it was
designed to do. If I'm wrong about that, please explain.
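The two-step ordering I am describing can be sketched as follows — a stalled node always loses the comparison, and only among equally (non-)stalled nodes does the relative height matter. This is my simplified reading of the BUCompareLatency heuristic, not the LLVM source:

```cpp
#include <cassert>

// Toy candidate: stalled this cycle or not, plus its height in the DAG.
struct Cand { bool Stalled; int Height; };

// Returns true if L should be scheduled before R (bottom-up: pick L first).
bool pickLeftFirst(const Cand &L, const Cand &R) {
  if (L.Stalled != R.Stalled)
    return !L.Stalled;            // the stall check dominates everything else
  return L.Height >= R.Height;    // then prefer the taller chain
}
```

Under this model the FADDS (ready, height 24) beats every store that reports a hazard in the current cycle, regardless of their latencies.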
Thanks in advance,
Looking at this more carefully, I think that I see the problem. The
heights are set to account for the latencies:
PredSU->setHeightToAtLeast(SU->getHeight() + PredEdge->getLatency());
but the latencies are considered only if the node has an ILP scheduling
preference (the default in TargetLowering.h is None):
bool LStall = (!checkPref || left->SchedulingPref == Sched::ILP) &&
BUHasStall(left, LHeight, SPQ);
...
and the PPC backend does not override getSchedulingPreference.
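In other words, the gating in the quoted line means that with checkPref set, a node whose preference is not Sched::ILP never reports a stall, so its height/latency information is effectively ignored. A minimal sketch of that gating (simplified stand-ins, not the LLVM source):

```cpp
#include <cassert>

enum class Sched { None, ILP };

// Toy stall check: the node's height is not yet covered by the current cycle.
bool toyBUHasStall(int Height, int CurCycle) { return Height > CurCycle; }

// Mirrors: (!checkPref || left->SchedulingPref == Sched::ILP) && BUHasStall(...)
bool leftStall(Sched Pref, bool CheckPref, int Height, int CurCycle) {
  return (!CheckPref || Pref == Sched::ILP) &&
         toyBUHasStall(Height, CurCycle);
}
```

With the PPC default of Sched::None, the stall term is short-circuited to false whenever checkPref is set.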
-Hal
That is exactly what I tried and it works pretty well.
>
>
> BTW - If you set HasReadyFilter, the fadd (105) would not even appear
> in the queue until the scheduler reached cycle [24]. So three
> additional stores would have been scheduled first. HasReadyFilter
> effectively treats operand latency stalls as strictly as pipeline
> hazards. It's not clear to me that you want to do that, though, if you
> fix getSchedulingPreference and do postRA scheduling later anyway.
> getSchedulingPreference and do postRA scheduling later anyway.
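The HasReadyFilter behavior described above can be sketched like this: nodes whose height exceeds the current cycle are kept out of the available queue entirely, instead of merely losing comparisons. Again a toy model with made-up names, not the real LLVM queue:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy pending node: its SU id and its height (cycle at which it is ready).
struct Node { int Id; int Height; };

// Ready filter: a node enters the available set only once the current
// cycle has reached its height, treating a latency stall like a hazard.
std::vector<int> availableIds(const std::vector<Node> &Pending, int CurCycle) {
  std::vector<int> Out;
  for (const Node &N : Pending)
    if (N.Height <= CurCycle)
      Out.push_back(N.Id);
  return Out;
}
```

Under this model SU(105) with height 24 would not even be a candidate at cycle 21, so the remaining stores would be considered first.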
I just ran some quick tests, even with getSchedulingPreference returning
ILP (regardless of how HasReadyFilter is set), using postRA still
results in significantly better code compared to using Hybrid alone.
Whether ILP+postRA or Hybrid+postRA wins depends on the benchmark
(Hybrid+postRA may have a small edge).
Thanks again,
Hal
>
>
> So it should work to do "hybrid" scheduling biased toward ILP, vs.
> "ilp" scheduling, which really does the opposite of what its name
> implies because it is initially biased toward register pressure.
>
>
> -Andy
>