I'll try to describe how the itinerary works a bit; it's non-intuitive.
The itinerary has two lists: a list of pipeline stages and a list of
operand latencies. The latency of an instruction is captured by the
latency of its "definition" operands, so latency does not need to be
modeled in the pipeline stages at all.
A 2 wide, 1 deep pipeline (2x1) would be:
[InstrStage<1, [Pipe0, Pipe1]>]
A 2 wide, 4 deep pipeline (2x4) would be:
[InstrStage<1, [Pipe0, Pipe1]>]
Surprise. There is no difference in the pipeline description, because
the units are fully pipelined and we don't need to express latency
here. (I'm only showing the pipeline stages here, not the operand latency list).
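To make the difference concrete, it is the operand latency list that
separates the two machines. A sketch (the itinerary class name IIAlu and
the operand counts are placeholders, not from a real target):

  // 2x1: one-cycle latency on the def operand
  InstrItinData<IIAlu, [InstrStage<1, [Pipe0, Pipe1]>], [1, 0, 0]>

  // 2x4: identical stage list; only the def operand's latency changes
  InstrItinData<IIAlu, [InstrStage<1, [Pipe0, Pipe1]>], [4, 0, 0]>

The first entry in the latency list corresponds to the definition
operand; the remaining entries are the use operands.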
Let's say you want to treat each stage of a pipeline as a separate
type of unit:
stage0: Decode
stage1: Exec
stage2: Write
[InstrStage<1, [Decode0, Decode1], 0>,
InstrStage<1, [Exec0, Exec1], 0>,
InstrStage<1, [Write0, Write1], 0>]
Now when the first instruction is scheduled, it fills in the current
row of the reservation table with Decode0, Exec0, Write0. This is
counterintuitive because the instruction does not execute on all units
in the same cycle, but it results in a more compact reservation table
and still sufficiently models hazards.
Things only get more complicated if you have functional units that are
not fully pipelined, or you have instructions that use the same functional
units at different pipeline stages.
If I have an instruction that consumes a functional unit for 2 cycles,
during which no other instruction may be issued to that unit, then I
need to do this:
[InstrStage<2, [NonPipelinedUnit]>]
If I have an instruction that splits into two dependent microops that
use the same type of functional unit, but at different times, then I need to
do this:
[InstrStage<1, [ALU0, ALU1], 1>,
InstrStage<1, [ALU0, ALU1]>]
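Assembled into a complete itinerary entry, the split-microop case might
look like this (the class name IIDoubleALU and the latency values are
hypothetical, for illustration only):

  InstrItinData<IIDoubleALU, [InstrStage<1, [ALU0, ALU1], 1>,
                              InstrStage<1, [ALU0, ALU1]>],
                [2, 0, 0]>

The TimeInc of 1 on the first stage starts the second stage one cycle
later, so the instruction reserves an ALU slot in two consecutive cycles
rather than two units in the same cycle.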
-Andy
From TargetSchedule.td:
//===----------------------------------------------------------------------===//
// Instruction stage - These values represent a non-pipelined step in
// the execution of an instruction. Cycles represents the number of
// discrete time slots needed to complete the stage. Units represent
// the choice of functional units that can be used to complete the
// stage. Eg. IntUnit1, IntUnit2. NextCycles indicates how many
// cycles should elapse from the start of this stage to the start of
// the next stage in the itinerary. For example:
//
// A stage is specified in one of two ways:
//
// InstrStage<1, [FU_x, FU_y]> - TimeInc defaults to Cycles
// InstrStage<1, [FU_x, FU_y], 0> - TimeInc explicit
//
class InstrStage<int cycles, list<FuncUnit> units,
int timeinc = -1,
ReservationKind kind = Required> {
int Cycles = cycles; // length of stage in machine cycles
list<FuncUnit> Units = units; // choice of functional units
int TimeInc = timeinc; // cycles till start of next stage
int Kind = kind.Value; // kind of FU reservation
}
> -----Original Message-----
> From: Andrew Trick [mailto:atr...@apple.com]
> Sent: 21 October 2011 02:36
> To: James Molloy
> Cc: Hal Finkel; llvm-commits LLVM; Evan Cheng
> Subject: Re: [llvm-commits] [llvm] r142171 - in /llvm/trunk:
> lib/Target/PowerPC/PPCSchedule440.td test/CodeGen/PowerPC/ppc440-fp-basic.ll
> test/CodeGen/PowerPC/ppc440-msync.ll
>
> On Oct 20, 2011, at 3:24 PM, Evan Cheng wrote:
>
>>
>> On Oct 20, 2011, at 12:04 PM, James Molloy wrote:
>>
>>> Evan,
>>>
>>> Regarding this, I wanted to ask - there's currently a hard limit of 32
> FunctionalUnits. Functional units cannot be pipelined, so for example to
> describe a pipeline for a superscalar machine of issue width N taking M
> cycles, one requires N*M functional units.
>>
>> I don't think that's how it works. You can describe a resource being
> acquired or reserved for M cycles. Perhaps I am not understanding your
> question.
>>
>> Evan
>>
>
> An N-wide machine can be described with N units, regardless of how deep the
> pipeline is.
>
> Furthermore if you only need to model issue width, then you don't even need
> to describe the pipeline at all. You only need to set the
> InstrItineraryData::IssueWidth field. ARMSubtarget::computeIssueWidth does
> this by assuming something about the convention of ARM itineraries. But you
> could simply embed the issue width constants for your subtargets within the
> target initialization code (in place of computeIssueWidth). I never bothered
> to add tablegen support for an IssueWidth field in the itinerary because we
> didn't need it for x86 and it is redundant with the existing ARM
> itineraries.
>
> -Andy
>
>>>
>>> This can quickly take you over the 32 unit limit. Is there any plan (or
> can I implement) pipelined functional units that can accept a new
> instruction every cycle but hold instructions for N cycles?
>>>
>>> Cheers,
>>>
>>> James
>>> ________________________________________
>>> From: llvm-commi...@cs.uiuc.edu [llvm-commi...@cs.uiuc.edu]
> On Behalf Of Evan Cheng [evan....@apple.com]
>>> Sent: 20 October 2011 18:21
>>> To: Hal Finkel
>>> Cc: llvm-c...@cs.uiuc.edu
>>> Subject: Re: [llvm-commits] [llvm] r142171 - in /llvm/trunk:
> lib/Target/PowerPC/PPCSchedule440.td test/CodeGen/PowerPC/ppc440-fp-basic.ll
> test/CodeGen/PowerPC/ppc440-msync.ll
>>>
>>> On Oct 19, 2011, at 7:29 PM, Hal Finkel <hfi...@anl.gov> wrote:
>>>
>>>> Evan,
>>>>
>>>> Thanks for the heads up! Is there a current target that implements the
>>>> scheduling as it will be? And does the bottom-up scheduling also account
>>>
>>> ARM is a good model.
>>>
>>>> for pipeline-conflict hazards?
>>>
>>> Yes, definitely. And it should be doing a much better job of it.
>>>
>>> Evan
>>>
>>>>
>>>> -Hal
>>>>
>>>> On Wed, 2011-10-19 at 16:45 -0700, Evan Cheng wrote:
>>>>> Hi Hal,
>>>>>
>>>>> Heads up. We'll soon abolish the top-down pre-register-allocation scheduler
> and force every target to bottom-up scheduling. The problem is that the list
> scheduler does not handle physical register dependencies at all, but that is
> something that's required for an upcoming legalizer change.
>>>>>
>>>>> If you are interested in PPC, you might want to look into switching its
> scheduler now. The bottom up register pressure aware scheduler should work
> quite well for PPC.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Evan
>>>>>
>>>>> On Oct 16, 2011, at 9:03 PM, Hal Finkel wrote:
>>>>>
>>>>>> Author: hfinkel
>>>>>> Date: Sun Oct 16 23:03:55 2011
>>>>>> New Revision: 142171
>>>>>>
>>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=142171&view=rev
>>>>>> Log:
>>>>>> Add PPC 440 scheduler and some associated tests (new files)
>>>>>>
>>>>>> Added:
>>>>>> llvm/trunk/lib/Target/PowerPC/PPCSchedule440.td
>>>>>> llvm/trunk/test/CodeGen/PowerPC/ppc440-fp-basic.ll
>>>>>> llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll
>>>>>>
>>>>>> Added: llvm/trunk/lib/Target/PowerPC/PPCSchedule440.td
>>>>>> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/PowerPC/PPCSchedul
> e440.td?rev=142171&view=auto
>>>>>>
> ============================================================================
> ==
>>>>>> --- llvm/trunk/lib/Target/PowerPC/PPCSchedule440.td (added)
>>>>>> +++ llvm/trunk/lib/Target/PowerPC/PPCSchedule440.td Sun Oct 16
> 23:03:55 2011
>>>>>> @@ -0,0 +1,568 @@
>>>>>> +//===- PPCSchedule440.td - PPC 440 Scheduling Definitions ----*-
> tablegen -*-===//
>>>>>> +//
>>>>>> +// The LLVM Compiler Infrastructure
>>>>>> +//
>>>>>> +// This file is distributed under the University of Illinois Open
> Source
>>>>>> +// License. See LICENSE.TXT for details.
>>>>>> +//
>>>>>>
> +//===----------------------------------------------------------------------
> ===//
>>>>>> +
>>>>>> +// Primary reference:
>>>>>> +// PowerPC 440x6 Embedded Processor Core User's Manual.
>>>>>> +// IBM (as updated in) 2010.
>>>>>> +
>>>>>> +// The basic PPC 440 does not include a floating-point unit; the
> pipeline
>>>>>> +// timings here are constructed to match the FP2 unit shipped with
> the
>>>>>> +// PPC-440- and PPC-450-based Blue Gene (L and P) supercomputers.
>>>>>> +// References:
>>>>>> +// S. Chatterjee, et al. Design and exploitation of a
> high-performance
>>>>>> +// SIMD floating-point unit for Blue Gene/L.
>>>>>> +// IBM J. Res. & Dev. 49 (2/3) March/May 2005.
>>>>>> +// also:
>>>>>> +// Carlos Sosa and Brant Knudson. IBM System Blue Gene Solution:
>>>>>> +// Blue Gene/P Application Development.
>>>>>> +// IBM (as updated in) 2009.
>>>>>> +
>>>>>>
> +//===----------------------------------------------------------------------
> ===//
>>>>>> +// Functional units on the PowerPC 440/450 chip sets
>>>>>> +//
>>>>>> +def IFTH1 : FuncUnit; // Fetch unit 1
>>>>>> +def IFTH2 : FuncUnit; // Fetch unit 2
>>>>>> +def PDCD1 : FuncUnit; // Decode unit 1
>>>>>> +def PDCD2 : FuncUnit; // Decode unit 2
>>>>>> +def DISS1 : FuncUnit; // Issue unit 1
>>>>>> +def DISS2 : FuncUnit; // Issue unit 2
>>>>>> +def LRACC : FuncUnit; // Register access and dispatch for
>>>>>> + // the simple integer (J-pipe) and
>>>>>> + // load/store (L-pipe) pipelines
>>>>>> +def IRACC : FuncUnit; // Register access and dispatch for
>>>>>> + // the complex integer (I-pipe) pipeline
>>>>>> +def FRACC : FuncUnit; // Register access and dispatch for
>>>>>> + // the floating-point execution (F-pipe)
> pipeline
>>>>>> +def IEXE1 : FuncUnit; // Execution stage 1 for the I pipeline
>>>>>> +def IEXE2 : FuncUnit; // Execution stage 2 for the I pipeline
>>>>>> +def IWB : FuncUnit; // Write-back unit for the I pipeline
>>>>>> +def JEXE1 : FuncUnit; // Execution stage 1 for the J pipeline
>>>>>> +def JEXE2 : FuncUnit; // Execution stage 2 for the J pipeline
>>>>>> +def JWB : FuncUnit; // Write-back unit for the J pipeline
>>>>>> +def AGEN : FuncUnit; // Address generation for the L pipeline
>>>>>> +def CRD : FuncUnit; // D-cache access for the L pipeline
>>>>>> +def LWB : FuncUnit; // Write-back unit for the L pipeline
>>>>>> +def FEXE1 : FuncUnit; // Execution stage 1 for the F pipeline
>>>>>> +def FEXE2 : FuncUnit; // Execution stage 2 for the F pipeline
>>>>>> +def FEXE3 : FuncUnit; // Execution stage 3 for the F pipeline
>>>>>> +def FEXE4 : FuncUnit; // Execution stage 4 for the F pipeline
>>>>>> +def FEXE5 : FuncUnit; // Execution stage 5 for the F pipeline
>>>>>> +def FEXE6 : FuncUnit; // Execution stage 6 for the F pipeline
>>>>>> +def FWB : FuncUnit; // Write-back unit for the F pipeline
>>>>>> +
>>>>>> +def LWARX_Hold : FuncUnit; // This is a pseudo-unit which is used
>>>>>> + // to make sure that no lwarx/stwcx.
>>>>>> + // instructions are issued while another
>>>>>> + // lwarx/stwcx. is in the L pipe.
>>>>>> +
>>>>>> +def GPR_Bypass : Bypass; // The bypass for general-purpose regs.
>>>>>> +def FPR_Bypass : Bypass; // The bypass for floating-point regs.
>>>>>> +
>>>>>> +// Notes:
>>>>>> +// Instructions are held in the FRACC, LRACC and IRACC pipeline
>>>>>> +// stages until their source operands become ready. Exceptions:
>>>>>> +// - Store instructions will hold in the AGEN stage
>>>>>> +// - The integer multiply-accumulate instruction will hold in
>>>>>> +// the IEXE1 stage
>>>>>> +//
>>>>>> +// For most I-pipe operations, the result is available at the end of
>>>>>> +// the IEXE1 stage. Operations such as multiply and divide must
>>>>>> +// continue to execute in IEXE2 and IWB. Divide resides in IWB for
>>>>>> +// 33 cycles (multiply also calculates its result in IWB). For all
>>>>>> +// J-pipe instructions, the result is available
>>>>>> +// at the end of the JEXE1 stage. Loads have a 3-cycle latency
>>>>>> +// (data is not available until after the LWB stage).
>>>>>> +//
>>>>>> +// The L1 cache hit latency is four cycles for floating point loads
>>>>>> +// and three cycles for integer loads.
>>>>>> +//
>>>>>> +// The stwcx. instruction requires both the LRACC and the IRACC
>>>>>> +// dispatch stages. It must be issued from DISS0.
>>>>>> +//
>>>>>> +// All lwarx/stwcx. instructions hold in LRACC if another
>>>>>> +// uncommitted lwarx/stwcx. is in AGEN, CRD, or LWB.
>>>>>> +//
>>>>>> +// msync (a.k.a. sync) and mbar will hold in LWB until all load/store
>>>>>> +// resources are empty. AGEN and CRD are held empty until the
> msync/mbar
>>>>>> +// commits.
>>>>>> +//
>>>>>> +// Most floating-point instructions, computational and move,
>>>>>> +// have a 5-cycle latency. Divide takes longer (30 cycles).
> Instructions that
>>>>>> +// update the CR take 2 cycles. Stores take 3 cycles and, as
> mentioned above,
>>>>>> +// loads take 4 cycles (for L1 hit).
>>>>>> +
>>>>>> +//
>>>>>> +// This file defines the itinerary class data for the PPC 440
> processor.
>>>>>> +//
>>>>>>
> +//===----------------------------------------------------------------------
> ===//
>>>>>> +
>>>>>> +
>>>>>> +def PPC440Itineraries : ProcessorItineraries<
>>>>>> + [IFTH1, IFTH2, PDCD1, PDCD2, DISS1, DISS2, FRACC,
>>>>>> + IRACC, IEXE1, IEXE2, IWB, LRACC, JEXE1, JEXE2, JWB, AGEN, CRD,
> LWB,
>>>>>> + FEXE1, FEXE2, FEXE3, FEXE4, FEXE5, FEXE6, FWB, LWARX_Hold],
>>>>>> + [GPR_Bypass, FPR_Bypass], [
>>>>>> + InstrItinData<IntGeneral , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC, LRACC]>,
>>>>>> + InstrStage<1, [IEXE1, JEXE1]>,
>>>>>> + InstrStage<1, [IEXE2, JEXE2]>,
>>>>>> + InstrStage<1, [IWB, JWB]>],
>>>>>> + [6, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntCompare , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC, LRACC]>,
>>>>>> + InstrStage<1, [IEXE1, JEXE1]>,
>>>>>> + InstrStage<1, [IEXE2, JEXE2]>,
>>>>>> + InstrStage<1, [IWB, JWB]>],
>>>>>> + [6, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntDivW , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<33, [IWB]>],
>>>>>> + [40, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMFFS , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [7, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMTFSB0 , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [7, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMulHW , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMulHWU , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntMulLI , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntRotate , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC, LRACC]>,
>>>>>> + InstrStage<1, [IEXE1, JEXE1]>,
>>>>>> + InstrStage<1, [IEXE2, JEXE2]>,
>>>>>> + InstrStage<1, [IWB, JWB]>],
>>>>>> + [6, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntShift , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC, LRACC]>,
>>>>>> + InstrStage<1, [IEXE1, JEXE1]>,
>>>>>> + InstrStage<1, [IEXE2, JEXE2]>,
>>>>>> + InstrStage<1, [IWB, JWB]>],
>>>>>> + [6, 4, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<IntTrapW , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [6, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<BrB , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<BrCR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<BrMCR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<BrMCRX , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4, 4],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStDCBA , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStDCBF , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStDCBI , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStGeneral , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<2, [LWB]>],
>>>>>> + [9, 5], // FIXME: should be [9, 5] for
> loads and
>>>>>> + // [8, 5] for stores.
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStICBI , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStUX , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5, 5],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLFD , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<2, [LWB]>],
>>>>>> + [9, 5, 5],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLFDU , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [9, 5, 5],
>>>>>> + [NoBypass, GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLHA , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLMW , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStLWARX , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1]>,
>>>>>> + InstrStage<1, [IRACC], 0>,
>>>>>> + InstrStage<4, [LWARX_Hold], 0>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStSTWCX , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1]>,
>>>>>> + InstrStage<1, [IRACC], 0>,
>>>>>> + InstrStage<4, [LWARX_Hold], 0>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<1, [AGEN]>,
>>>>>> + InstrStage<1, [CRD]>,
>>>>>> + InstrStage<1, [LWB]>],
>>>>>> + [8, 5],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<LdStSync , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [LRACC]>,
>>>>>> + InstrStage<3, [AGEN], 1>,
>>>>>> + InstrStage<2, [CRD], 1>,
>>>>>> + InstrStage<1, [LWB]>]>,
>>>>>> + InstrItinData<SprISYNC , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC], 0>,
>>>>>> + InstrStage<1, [LRACC], 0>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [FEXE1], 0>,
>>>>>> + InstrStage<1, [AGEN], 0>,
>>>>>> + InstrStage<1, [JEXE1], 0>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [FEXE2], 0>,
>>>>>> + InstrStage<1, [CRD], 0>,
>>>>>> + InstrStage<1, [JEXE2], 0>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<6, [FEXE3], 0>,
>>>>>> + InstrStage<6, [LWB], 0>,
>>>>>> + InstrStage<6, [JWB], 0>,
>>>>>> + InstrStage<6, [IWB]>]>,
>>>>>> + InstrItinData<SprMFSR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [6, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMTMSR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [6, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMTSR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [9, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprTLBSYNC , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>]>,
>>>>>> + InstrItinData<SprMFCR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMFMSR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [7, 4],
>>>>>> + [GPR_Bypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMFSPR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [10, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMFTB , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [10, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMTSPR , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [10, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprMTSRIN , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<3, [IWB]>],
>>>>>> + [10, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprRFI , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<SprSC , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [IRACC]>,
>>>>>> + InstrStage<1, [IEXE1]>,
>>>>>> + InstrStage<1, [IEXE2]>,
>>>>>> + InstrStage<1, [IWB]>],
>>>>>> + [8, 4],
>>>>>> + [NoBypass, GPR_Bypass]>,
>>>>>> + InstrItinData<FPGeneral , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<1, [FWB]>],
>>>>>> + [10, 4, 4],
>>>>>> + [FPR_Bypass, FPR_Bypass, FPR_Bypass]>,
>>>>>> + InstrItinData<FPCompare , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<1, [FWB]>],
>>>>>> + [10, 4, 4],
>>>>>> + [FPR_Bypass, FPR_Bypass, FPR_Bypass]>,
>>>>>> + InstrItinData<FPDivD , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<25, [FWB]>],
>>>>>> + [35, 4, 4],
>>>>>> + [NoBypass, FPR_Bypass, FPR_Bypass]>,
>>>>>> + InstrItinData<FPDivS , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<13, [FWB]>],
>>>>>> + [23, 4, 4],
>>>>>> + [NoBypass, FPR_Bypass, FPR_Bypass]>,
>>>>>> + InstrItinData<FPFused , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<1, [FWB]>],
>>>>>> + [10, 4, 4, 4],
>>>>>> + [FPR_Bypass, FPR_Bypass, FPR_Bypass,
> FPR_Bypass]>,
>>>>>> + InstrItinData<FPRes , [InstrStage<1, [IFTH1, IFTH2]>,
>>>>>> + InstrStage<1, [PDCD1, PDCD2]>,
>>>>>> + InstrStage<1, [DISS1, DISS2]>,
>>>>>> + InstrStage<1, [FRACC]>,
>>>>>> + InstrStage<1, [FEXE1]>,
>>>>>> + InstrStage<1, [FEXE2]>,
>>>>>> + InstrStage<1, [FEXE3]>,
>>>>>> + InstrStage<1, [FEXE4]>,
>>>>>> + InstrStage<1, [FEXE5]>,
>>>>>> + InstrStage<1, [FEXE6]>,
>>>>>> + InstrStage<1, [FWB]>],
>>>>>> + [10, 4],
>>>>>> + [FPR_Bypass, FPR_Bypass]>
>>>>>> +]>;
>>>>>>
>>>>>> Added: llvm/trunk/test/CodeGen/PowerPC/ppc440-fp-basic.ll
>>>>>> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/PowerPC/ppc440-f
> p-basic.ll?rev=142171&view=auto
>>>>>>
> ============================================================================
> ==
>>>>>> --- llvm/trunk/test/CodeGen/PowerPC/ppc440-fp-basic.ll (added)
>>>>>> +++ llvm/trunk/test/CodeGen/PowerPC/ppc440-fp-basic.ll Sun Oct 16
> 23:03:55 2011
>>>>>> @@ -0,0 +1,32 @@
>>>>>> +; RUN: llc < %s -march=ppc32 -mcpu=440 | grep fmadd
>>>>>> +
>>>>>> +%0 = type { double, double }
>>>>>> +
>>>>>> +define void @maybe_an_fma(%0* sret %agg.result, %0* byval %a, %0*
> byval %b, %0* byval %c) nounwind {
>>>>>> +entry:
>>>>>> + %a.realp = getelementptr inbounds %0* %a, i32 0, i32 0
>>>>>> + %a.real = load double* %a.realp
>>>>>> + %a.imagp = getelementptr inbounds %0* %a, i32 0, i32 1
>>>>>> + %a.imag = load double* %a.imagp
>>>>>> + %b.realp = getelementptr inbounds %0* %b, i32 0, i32 0
>>>>>> + %b.real = load double* %b.realp
>>>>>> + %b.imagp = getelementptr inbounds %0* %b, i32 0, i32 1
>>>>>> + %b.imag = load double* %b.imagp
>>>>>> + %mul.rl = fmul double %a.real, %b.real
>>>>>> + %mul.rr = fmul double %a.imag, %b.imag
>>>>>> + %mul.r = fsub double %mul.rl, %mul.rr
>>>>>> + %mul.il = fmul double %a.imag, %b.real
>>>>>> + %mul.ir = fmul double %a.real, %b.imag
>>>>>> + %mul.i = fadd double %mul.il, %mul.ir
>>>>>> + %c.realp = getelementptr inbounds %0* %c, i32 0, i32 0
>>>>>> + %c.real = load double* %c.realp
>>>>>> + %c.imagp = getelementptr inbounds %0* %c, i32 0, i32 1
>>>>>> + %c.imag = load double* %c.imagp
>>>>>> + %add.r = fadd double %mul.r, %c.real
>>>>>> + %add.i = fadd double %mul.i, %c.imag
>>>>>> + %real = getelementptr inbounds %0* %agg.result, i32 0, i32 0
>>>>>> + %imag = getelementptr inbounds %0* %agg.result, i32 0, i32 1
>>>>>> + store double %add.r, double* %real
>>>>>> + store double %add.i, double* %imag
>>>>>> + ret void
>>>>>> +}
>>>>>>
>>>>>> Added: llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll
>>>>>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll?rev=142171&view=auto
>>>>>>
> ==============================================================================
>>>>>> --- llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll (added)
>>>>>> +++ llvm/trunk/test/CodeGen/PowerPC/ppc440-msync.ll Sun Oct 16 23:03:55 2011
>>>>>> @@ -0,0 +1,23 @@
>>>>>> +; RUN: llc < %s -march=ppc32 -o %t
>>>>>> +; RUN: grep sync %t
>>>>>> +; RUN: not grep msync %t
>>>>>> +; RUN: llc < %s -march=ppc32 -mcpu=440 | grep msync
>>>>>> +
>>>>>> +define i32 @has_a_fence(i32 %a, i32 %b) nounwind {
>>>>>> +entry:
>>>>>> + fence acquire
>>>>>> + %cond = icmp eq i32 %a, %b
>>>>>> + br i1 %cond, label %IfEqual, label %IfUnequal
>>>>>> +
>>>>>> +IfEqual:
>>>>>> + fence release
>>>>>> + br label %end
>>>>>> +
>>>>>> +IfUnequal:
>>>>>> + fence release
>>>>>> + ret i32 0
>>>>>> +
>>>>>> +end:
>>>>>> + ret i32 1
>>>>>> +}
>>>>>> +
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-c...@cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>> --
>>>> Hal Finkel
>>>> Postdoctoral Appointee
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>> 1-630-252-0023
>>>> hfi...@anl.gov
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Thanks in advance,
Hal
On Thu, 2011-10-20 at 10:21 -0700, Evan Cheng wrote:
>
> On Oct 19, 2011, at 7:29 PM, Hal Finkel <hfi...@anl.gov> wrote:
>
> > Evan,
> >
> > Thanks for the heads up! Is there a current target that implements the
> > scheduling as it will be? And does the bottom-up scheduling also account
>
> ARM is a good model.
What part of ARM's implementation is associated with the bottom-up
scheduling? I am confused because it looks like it is essentially using
the same kind of ScoreboardHazardRecognizer that was commented out of
the PPC 440 code.
Thanks in advance,
Hal
>
> > for pipeline-conflict hazards?
>
> Yes, definitely. And it should be doing a much better job of it.
>
> Evan
>
> >
> > -Hal
> >
> > On Wed, 2011-10-19 at 16:45 -0700, Evan Cheng wrote:
> >> Hi Hal,
> >>
> >> Heads up. We'll soon abolish the top-down pre-register allocation scheduler and force every target to bottom-up scheduling. The problem is the top-down list scheduler does not handle physical register dependencies at all, but that is required for some upcoming legalizer changes.
> >>
> >> If you are interested in PPC, you might want to look into switching its scheduler now. The bottom-up register-pressure-aware scheduler should work quite well for PPC.
> >>
> >> Thanks,
> >>
> >> Evan
> >>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Hi Hal,
The best way to ensure the PPC scheduling isn't hosed now or in the
future is probably to make it work as much like ARM as possible.
This means (1) defaulting to the "hybrid" scheduler, (2) implementing the
register pressure limit, and (3) reenabling the hazard recognizer.
(1) TargetLowering::setSchedulingPreference(Sched::Hybrid)
(2) TargetRegisterInfo::getRegisterPressureLimit(...) should probably
return something a bit less than 32, depending on register class.
(3) The standard hazard recognizer works either bottom-up or top-down
on the itinerary data, so it *should* work out of the box. The problem is
that PPC has overridden the API to layer some custom "bundling" logic
on top of basic hazard detection. This logic needs to be reversed for
bottom-up, or you could start by simply disabling it instead of the
entire hazard recognizer.
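To make the scoreboard idea concrete, here is a toy model of a reservation-table hazard recognizer (a Python sketch with invented names; LLVM's actual ScoreboardHazardRecognizer is C++ and considerably more involved):

```python
from collections import deque

# One itinerary stage: occupy one of the units in `units` for `cycles`
# consecutive cycles. Stages are assumed back-to-back for simplicity.
class Stage:
    def __init__(self, cycles, units):
        self.cycles, self.units = cycles, frozenset(units)

# Toy scoreboard: busy[i] is the set of units reserved i cycles from "now".
class ToyScoreboard:
    def __init__(self):
        self.busy = deque()

    def _busy_at(self, i):
        return self.busy[i] if i < len(self.busy) else set()

    def has_hazard(self, stages):
        cycle = 0
        for s in stages:
            for _ in range(s.cycles):
                if s.units <= self._busy_at(cycle):
                    return True  # every alternative unit is already taken
                cycle += 1
        return False

    def emit(self, stages):  # assumes the caller checked has_hazard first
        cycle = 0
        for s in stages:
            for _ in range(s.cycles):
                while cycle >= len(self.busy):
                    self.busy.append(set())
                free = s.units - self.busy[cycle]
                if free:
                    self.busy[cycle].add(min(free))  # reserve one free unit
                cycle += 1

    def advance_cycle(self):
        if self.busy:
            self.busy.popleft()

# A unit that is not fully pipelined: InstrStage<2, [FPU]> in the
# itinerary notation -- the instruction holds FPU for two cycles.
sb = ToyScoreboard()
non_pipelined = [Stage(2, ["FPU"])]
assert not sb.has_hazard(non_pipelined)
sb.emit(non_pipelined)
assert sb.has_hazard(non_pipelined)   # FPU busy this cycle
sb.advance_cycle()
assert sb.has_hazard(non_pipelined)   # still busy one cycle later
sb.advance_cycle()
assert not sb.has_hazard(non_pipelined)
```

The non-pipelined case mirrors the InstrStage<2, [NonPipelinedUnit]> example: a second instruction sees a hazard until two AdvanceCycle() calls have drained the occupancy.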
Now, to generate the best PPC schedules, there is one thing you may
want to override. The scheduler's priority function has a
HasReadyFilter attribute (enum). It can be overridden by specializing
hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
scheduling, and maximizes the instructions that can issue in one
group, regardless of register pressure. We still care about register
pressure enough in ARM to avoid enabling this. I'm really not sure how
much it will help on modern PPC implementations though.
I realize this is confusing because we have a scheduler mode named
"ILP". That mode is intended for targets that do not have an
itinerary. It's currently set up for x86 and would need some tweaking
to work well for other targets. Again, if your target has an
itinerary, you probably want the "hybrid" mode.
-Andy
On Wed, 2011-10-19 at 16:45 -0700, Evan Cheng wrote:
>>>> Hi Hal,
>>>>
>>>> Heads up. We'll soon abolish the top-down pre-register allocation scheduler and force every target to bottom-up scheduling. The problem is the top-down list scheduler does not handle physical register dependencies at all, but that is required for some upcoming legalizer changes.
>>>>
>>>> If you are interested in PPC, you might want to look into switching its scheduler now. The bottom-up register-pressure-aware scheduler should work quite well for PPC.
>>>>
>>>> Thanks,
>>>>
>>>> Evan
>>>>
_______________________________________________
Is EmitInstruction used in bottom-up scheduling at all? The version in
the ARM recognizer seems essential, but in all of the regression tests
(and some other .ll files I have lying around), it is never called. It
seems that only Reset() and getHazardType() are called. Could you please
explain the calling sequence?
Thanks again,
Hal
>
> Now, to generate the best PPC schedules, there is one thing you may
> want to override. The scheduler's priority function has a
> HasReadyFilter attribute (enum). It can be overridden by specializing
> hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
> scheduling, and maximizes the instructions that can issue in one
> group, regardless of register pressure. We still care about register
> pressure enough in ARM to avoid enabling this. I'm really not sure how
> much it will help on modern PPC implementations though.
>
> I realize this is confusing because we have a scheduler mode named
> "ILP". That mode is intended for targets that do not have an
> itinerary. It's currently set up for x86 and would need some tweaking
> to work well for other targets. Again, if your target has an
> itinerary, you probably want the "hybrid" mode.
>
> -Andy
>
> On Wed, 2011-10-19 at 16:45 -0700, Evan Cheng wrote:
> >>>> Hi Hal,
> >>>>
> >>>> Heads up. We'll soon abolish the top-down pre-register allocation scheduler and force every target to bottom-up scheduling. The problem is the top-down list scheduler does not handle physical register dependencies at all, but that is required for some upcoming legalizer changes.
> >>>>
> >>>> If you are interested in PPC, you might want to look into switching its scheduler now. The bottom-up register-pressure-aware scheduler should work quite well for PPC.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Evan
> >>>>
>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
I feel that I should clarify my comment: For PPC, now that Hybrid
scheduling is enabled, EmitInstruction seems never to be called (at
least it is not called when running any PPC codegen test in the
regression-test collection).
Thanks again,
Hal
Andy,
Thanks! Since I have to change PPCHazardRecognizer for bottom-up support
anyway, is there any reason not to have it derive from
ScoreboardHazardRecognizer at this point? It looks like the custom
bundling logic could be implemented on top of the scoreboard recognizer
(that seems similar to what ARM's recognizer is doing).
-Hal
>
>
> See how this is done in the ScoreboardHazardRecognizer ctor:
> > MaxLookAhead = ScoreboardDepth;
>
>
>
> -Andy
>
>
--
Also, how does the ARM hazard recognizer get away with not implementing
RecedeCycle?
Thanks again,
Hal
Andy
On Nov 29, 2011, at 8:51 AM, Hal Finkel <hfi...@anl.gov> wrote:
>> Thanks! Since I have to change PPCHazardRecognizer for bottom-up support
>> anyway, is there any reason not to have it derive from
>> ScoreboardHazardRecognizer at this point? It looks like the custom
>> bundling logic could be implemented on top of the scoreboard recognizer
>> (that seems similar to what ARM's recognizer is doing).
>
> Also, how does the ARM hazard recognizer get away with not implementing
> RecedeCycle?
>
> Thanks again,
> Hal
I should have been more clear, the ARM implementation has:
void ARMHazardRecognizer::RecedeCycle() {
llvm_unreachable("reverse ARM hazard checking unsupported");
}
How does that work?
Thanks again,
Hal
On Tue, 2011-11-29 at 09:47 -0800, Andrew Trick wrote:
> ARM can reuse all the default scoreboard hazard recognizer logic such as recede cycle (naturally, since it's the primary client). If you can do the same with PPC, that's great.
>
> Andy
>
> On Nov 29, 2011, at 8:51 AM, Hal Finkel <hfi...@anl.gov> wrote:
>
> >> Thanks! Since I have to change PPCHazardRecognizer for bottom-up support
> >> anyway, is there any reason not to have it derive from
> >> ScoreboardHazardRecognizer at this point? It looks like the custom
> >> bundling logic could be implemented on top of the scoreboard recognizer
> >> (that seems similar to what ARM's recognizer is doing).
> >
> > Also, how does the ARM hazard recognizer get away with not implementing
> > RecedeCycle?
> >
> > Thanks again,
> > Hal
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Is there any good info/docs on scheduling strategy in LLVM? As I was
complaining to you at the LLVM meeting, I end up reverse engineering/double
guessing more than I would like to... This thread shows that I am not
exactly alone in this... Thanks.
Sergei Larin
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
> Andy,
>
> I should have been more clear, the ARM implementation has:
> void ARMHazardRecognizer::RecedeCycle() {
> llvm_unreachable("reverse ARM hazard checking unsupported");
> }
>
> How does that work?
>
> Thanks again,
> Hal
Hal,
My first answer was off the top of my head, so it missed the subtle issue. Just so you know, to answer questions like this I usually need to instrument the code with tracing or step through in the debugger. Even though I've hacked on the code quite a bit, the interaction between the scheduler and target hooks is still not obvious to me from glancing at the code. FWIW, I'm hoping it can be cleaned up gradually, maybe for the next release.
The preRA scheduler is bottom-up, for register pressure tracking. The postRA scheduler is top-down, for simpler hazard detection logic.
On ARM, the preRA scheduler uses an unspecialized instance of ScoreboardHazardRecognizer. The machine-independent RecedeCycle() logic that operates on the scheduler itinerary is sufficient.
The ARM postRA scheduler specializes the HazardRecognizer to handle additional constraints that cannot be expressed in the itinerary. Since this is a top-down scheduler, RecedeCycle() is not applicable.
-Andy
I would say that each target has its own scheduling strategy that has changed considerably over time. We try to maximize code reuse across targets, but it's not easy and done ad hoc. The result is confusing code that makes it difficult to understand the strategy for any particular target.
The right thing to do is:
1) Make it as easy as possible to understand how scheduling works for each of the primary targets (x86 and ARM) independent of each other.
2) Make it easy for very similar targets to piggyback on one of those implementations, without having to worry about other targets
3) Allow dissimilar targets (e.g. VLIW) to completely bypass the scheduler used by other targets and reuse only nicely self-contained parts of the framework, such as the DAG builder and individual machine description features.
We've recently moved further from this ideal scenario in that we're now forcing targets to implement the bottom-up selection dag scheduler. This is not really so bad, because you can revert to "source order" scheduling, -pre-RA-sched=source, and you don't need to implement many target hooks. It burns compile time for no good reason, but you can probably live with it. Then you're free to implement your own MI-level scheduler.
The next step in making it easier to maintain an llvm scheduler for "interesting" targets is to build an MI-level scheduling framework and move at least one of the primary targets to this framework so it's well supported. This would separate the nasty issues of serializing the selection DAG from the challenge of microarchitecture-level scheduling, and provide a suitable place to inject your own scheduling algorithm. It's easier to implement a scheduler when starting from a valid instruction sequence where all dependencies are resolved and no register interferences exist.
To answer your question, there's no clear way to describe the current overall scheduling strategy. For now, you'll need to ask porting questions on llvm-dev. Maybe someone who's faced a similar problem will have a good suggestion. We do want to improve that situation and we intend to do that by first providing a new scheduler framework. When we get to that point, I'll be sure that the new direction can work for you and is easy to understand. All I can say now is that the new design will allow a target to compose a preRA scheduler from an MI-level framework combined with target-specific logic for selecting the optimal instruction order. I don't see any point in imposing a generic scheduling algorithm across all targets.
-Andy
This was actually the source of my question, it was clear that the ARM
RecedeCycle function was not being called, but, running the PPC code in
the debugger, it was clear that the PPC's RecedeCycle function was being
called. I did not appreciate the preRA vs postRA distinction, so thank
you for explaining that.
Instead of trying to port the PPC bundling logic for use with the
bottom-up scheduler, maybe it would be sufficient for now to use it
postRA (which seems like what ARM is doing).
-Hal
From the perspective of the hazard recognizer, from what I can tell, the
difference between the top-down and bottom-up modes is:
In top-down mode, the scheduling proceeds in the forward direction.
AdvanceCycle() may be used, RecedeCycle() is not used. EmitInstruction()
implies a cycle-count increment. In bottom-up mode, scheduling proceeds
in the backwards direction (last instruction first). AdvanceCycle() is
not used, RecedeCycle() is always used to decrement the current cycle
offset (EmitInstruction() does *not* imply a cycle-count decrement).
Is this right? Have I captured everything?
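One way to picture that bookkeeping is a toy driver loop (a Python sketch under the assumptions in the summary above, except that here the driver alone, not EmitInstruction(), moves the cycle counter; all names are invented):

```python
# Toy hazard recognizer enforcing a 1-instruction-per-cycle issue width.
class OneWideRecognizer:
    def __init__(self):
        self.cycle = 0
        self.issued = set()   # cycles that already carry an instruction

    def get_hazard_type(self):
        return "Hazard" if self.cycle in self.issued else "NoHazard"

    def emit_instruction(self):
        self.issued.add(self.cycle)   # no implicit cycle change here

    def advance_cycle(self):
        self.cycle += 1               # top-down only

    def recede_cycle(self):
        self.cycle -= 1               # bottom-up only

def schedule(instrs, top_down=True):
    # Top-down walks first->last and advances; bottom-up walks
    # last->first and recedes, so cycle offsets count downward.
    hr, order = OneWideRecognizer(), {}
    work = instrs if top_down else list(reversed(instrs))
    for i in work:
        while hr.get_hazard_type() != "NoHazard":
            hr.advance_cycle() if top_down else hr.recede_cycle()
        hr.emit_instruction()
        order[i] = hr.cycle
    return order

td = schedule(["a", "b", "c"], top_down=True)
bu = schedule(["a", "b", "c"], top_down=False)
assert td == {"a": 0, "b": 1, "c": 2}    # cycles count up
assert bu == {"c": 0, "b": -1, "a": -2}  # cycles count down from the end
```

The point of the sketch is only the direction of the cycle counter: EmitInstruction() marks the current cycle, and the driver chooses AdvanceCycle() or RecedeCycle() depending on the scheduling direction.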
Thank you for the extended and prompt answer. Let me try to summarize my
current position so you (and everyone interested) would have a better view
of the world through my eyes ;)
1) LLVM's first robust VLIW target is currently in review. Its needs for
scheduling strategy/quality are rather different from what the current
scheduler(s) can provide.
2) My first attempt at porting (while I was on 2.9) resulted in a new
top-down pre-RA VLIW-enabled scheduler that I was hoping to upstream as soon
as our back end is accepted. I guess I have missed the window, since our
commit took a bit longer than planned. Now Evan has told me (and you have
confirmed) that it would need to change to a bottom-up version for 3.0.
Moreover, the current "level" (exact placement in the DAG->DAG pass) of pre-RA
scheduling is less than optimal (and I agree with that, since I have to bend
over backwards to extract info readily available in MIs).
3) Your group is working on a "new" scheduler, and as best I understand
it, it would be the same general algorithm moved "closer" to RA. I also
understand that at first it would not have added support for
"packets"/bundles/multiops in the VLIW sense (or will it?). If they are
present, an interesting discussion on how subsequent passes would be modified
to recognize them would follow... but we had another thread on this topic not
that long ago.
So, IMHO the following would make sense:
1) It would be very nice if we could have some sort of write-up detailing
the proposed changes, and maybe defining an overall strategy for instruction
scheduling in LLVM __before__ major decisions are made. It could later be
converted into a "how to" or a simple doc chapter on porting the scheduler(s)
to new targets. Public discussion should follow, and we need to try to
accommodate all needs (as much as possible).
2) Any attempt on my part to further the VLIW scheduler design for my target
would be unwise until such a discussion takes place. I also do not
separate this process from the bundle/packet representation. If you perceive
overhead associated with this activity, I could volunteer to help.
Also, please see my comments embedded below.
Thanks.
Sergei Larin
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
> -----Original Message-----
> From: Andrew Trick [mailto:atr...@apple.com]
> Sent: Tuesday, November 29, 2011 3:16 PM
> To: Sergei Larin
> Cc: 'Hal Finkel'; llv...@cs.uiuc.edu
> Subject: Re: [LLVMdev] [llvm-commits] Bottom-Up Scheduling?
>
> Sergei,
>
> I would say that each target has its own scheduling strategy that has
> changed considerably over time. We try to maximize code reuse across
> targets, but it's not easy and done ad hoc. The result is confusing
> code that makes it difficult to understand the strategy for any
> particular target.
>
> The right thing to do is:
> 1) Make it as easy as possible to understand how scheduling works for
> each of the primary targets (x86 and ARM) independent of each other.
[Larin, Sergei]
Sure, that could be achieved with the design document/documentation set I
am talking about.
> 2) Make it easy for very similar targets to piggyback on one of those
> implementations, without having to worry about other targets
[Larin, Sergei]
Yes, and having a robust VLIW scheduler would greatly help here. It would
also IMHO set LLVM apart from GCC, and become an additional selling point
for us.
> 3) Allow dissimilar targets (e.g. VLIW) to completely bypass the
> scheduler used by other targets and reuse only nicely self-contained
> parts of the framework, such as the DAG builder and individual machine
> description features.
[Larin, Sergei]
I think this is rather implementation-dependent, and we can finesse this
once we have the framework better defined.
>
> We've recently moved further from this ideal scenario in that we're now
> forcing targets to implement the bottom-up selection dag scheduler.
[Larin, Sergei]
I really dislike this, especially due to the reasons that led to this
decision. I think general "flexibility"/functionality was sacrificed for
tactical reasons.
> This is not really so bad, because you can revert to "source order"
> scheduling, -pre-RA-sched=source, and you don't need to implement many
> target hooks. It burns compile time for no good reason, but you can
> probably live with it. Then you're free to implement your own MI-level
> scheduler.
[Larin, Sergei]
I am not 100% sure about this statement, but as I get closer to
re-implementing my scheduler I might grasp a better picture.
>
> The next step in making it easier to maintain an llvm scheduler for
> "interesting" targets is to build an MI-level scheduling framework and
> move at least one of the primary targets to this framework so it's well
> supported. This would separate the nasty issues of serializing the
> selection DAG from the challenge of microarchitecture-level scheduling,
> and provide a suitable place to inject your own scheduling algorithm.
> It's easier to implement a scheduler when starting from a valid
> instruction sequence where all dependencies are resolved and no
> register interferences exist.
[Larin, Sergei]
Agree, and my whole point is that it needs to be done with preceding
public discussion, and not de facto with code drops.
>
> To answer your question, there's no clear way to describe the current
> overall scheduling strategy. For now, you'll need to ask porting
> questions on llvm-dev. Maybe someone who's faced a similar problem will
> have a good suggestion. We do want to improve that situation and we
> intend to do that by first providing a new scheduler framework. When we
> get to that point, I'll be sure that the new direction can work for you
[Larin, Sergei]
Any clue on time frame?
> and is easy to understand. All I can say now is that the new design
> will allow a target to compose a preRA scheduler from an MI-level
> framework combined with target-specific logic for selecting the optimal
> instruction order. I don't see any point in imposing a generic
> scheduling algorithm across all targets.
>
> -Andy
[Larin, Sergei]
Thank you again for the explanation. I am really looking forward to digging
into it.
Can this be done without modifying common code? It looks like
hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.
Thanks again,
Hal
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
> On Tue, 2011-10-25 at 21:00 -0700, Andrew Trick wrote:
>> Now, to generate the best PPC schedules, there is one thing you may
>> want to override. The scheduler's priority function has a
>> HasReadyFilter attribute (enum). It can be overridden by specializing
>> hybrid_ls_rr_sort. Setting this to "true" enables proper ILP
>> scheduling, and maximizes the instructions that can issue in one
>> group, regardless of register pressure. We still care about register
>> pressure enough in ARM to avoid enabling this. I'm really not sure how
>> much it will help on modern PPC implementations though.
>> hybrid_ls_rr_sort
>
> Can this be done without modifying common code? It looks like
> hybrid_ls_rr_sort is local to ScheduleDAGRRList.cpp.
>
> Thanks again,
> Hal
Right. You would need to specialize the priority queue logic. A small amount of common code.
Andy
Andy,
I played around with this some today for my PPC 440 chips. These are
embedded chips (multiple pipelines but in-order), and may be more
similar to your ARMs than to the PPC-970 style designs...
I was able to get reasonable PPC 440 code generation by using the ILP
scheduler pre-RA and then the post-RA scheduler with ANTIDEP_ALL (and my
load/store reordering patch). This worked significantly better than
using either hybrid or ilp alone (with or without setting
HasReadyFilter). I was looking at my primary use case which is
partially-unrolled loops with loads, stores and floating-point
calculations.
This seems to work because ILP first groups the instructions to extract
parallelism and then the post-RA scheduler breaks up the groups to avoid
stalls. This allows the scheduler to find its way out of what seems to
be a "local minimum" of sorts, whereby it wants to schedule each
unrolled iteration of the loop sequentially. The reason this seems
to occur is that the hybrid scheduler would prefer to suffer a large
data-dependency delay over a shorter full-pipeline delay. Do you know
why it would do this? (You can see PR11589 for an example if you'd
like.)
Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
used? Is there a reason that this is a compile-time constant? Both
Hybrid and ILP have isReady() functions. I can certainly propose a patch
to make them command-line options.
Thanks again,
Hal
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
The "ilp" scheduler has several heuristics designed to compensate for lack of itinerary. Each of those heuristics has a flag, so you can see what works for your target. I've never used that scheduler with an itinerary, but it should work. It's just that some of the heuristics effectively override the hazard checker.
The "hybrid" scheduler depends more on the itinerary/hazard checker. It's less likely to schedule instructions close together if they may induce a pipeline stall, regardless of operand latency.
> Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
> used? Is there a reason that this is a compile-time constant? Both
> Hybrid and ILP have isReady() functions. I can certainly propose a patch
> to make them command-line options.
It's a compile time constant because it's clearly on the scheduler's critical path and not used by any active targets. Enabling HasReadyFilter turns the preRA scheduler into a strict scheduler such that the hazard checker overrides all other heuristics. That's not what you want if you're also enabling postRA scheduling!
-Andy
I'd prefer to have a scheduler that just does what I want :) -- How can
I make a modified version of the hybrid scheduler that will weight
operand latency and pipeline stalls more equally?
Here's my "thought experiment" (from PR11589): I have a bunch of
load-fadd-store chains to schedule. A store takes two cycles to clear
its last pipeline stage. The fadd takes longer to compute its result
(say 5 cycles), but can sustain a rate of 1 independent add per cycle.
As the scheduling is bottom-up, it will schedule a store, then it has a
choice: it can schedule another store (at a 1 cycle penalty), or it can
schedule the fadd associated with the store it just scheduled (with a 4
cycle penalty due to operand latency). It seems that the current hybrid
scheduler will choose the fadd, I want a scheduler that will make the
opposite choice.
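Plugging the numbers from this thought experiment into toy arithmetic makes the trade-off explicit (illustrative only; this is not the hybrid scheduler's actual cost model):

```python
# Numbers from the thought experiment: a store occupies its last pipeline
# stage for 2 cycles, and an fadd's result is ready after 5 cycles.
STORE_PIPE_OCCUPANCY = 2   # cycles the store holds its final stage
FADD_LATENCY = 5           # cycles until the fadd result is available
ISSUE_GAP = 1              # back-to-back issue is one cycle apart

# Bottom-up, just after scheduling a store, the two candidates cost:
store_stall = STORE_PIPE_OCCUPANCY - ISSUE_GAP   # another store: 1 stall
fadd_stall = FADD_LATENCY - ISSUE_GAP            # its fadd: 4 stalls

assert store_stall == 1
assert fadd_stall == 4

# A chooser that simply minimizes stall cycles picks the store (the
# choice wanted here); the observation above is that hybrid instead
# vetoes the store outright, because it has a structural hazard in the
# current cycle, before latencies are ever compared.
cheapest = min([("store", store_stall), ("fadd", fadd_stall)],
               key=lambda c: c[1])[0]
assert cheapest == "store"
```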
> > Regarding HasReadyFilter: HasReadyFilter just causes isReady() to be
> > used? Is there a reason that this is a compile-time constant? Both
> > Hybrid and ILP have isReady() functions. I can certainly propose a patch
> > to make them command-line options.
>
> It's a compile time constant because it's clearly on the scheduler's critical path and not used by any active targets. Enabling HasReadyFilter turns the preRA scheduler into a strict scheduler such that the hazard checker overrides all other heuristics. That's not what you want if you're also enabling postRA scheduling!
Indeed, that makes sense.
Thanks again,
Hal
>
> -Andy
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Andy, I've already looked at the debug output quite a bit; please help
me understand what I'm missing...
First, looking at the code does seem to confirm my suspicion. This is
certainly low-pressure mode, and so hybrid_ls_rr_sort::operator()
will return the result of BUCompareLatency. That function first checks
for stalls and returns 1 or -1. Only after that does it look at the
relative latencies.
In addition, the stall computation is done using BUHasStall, and that
function only checks the current cycle. Without looking forward, I don't
understand how it could know how long the pipeline hazard will last.
It looks like this may have something to do with the height. Can you
explain how that is supposed to work?
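The short-circuit structure described here can be sketched like this (a Python toy with invented fields, not LLVM's actual BUCompareLatency):

```python
def bu_compare(left, right):
    # Returns -1 if `left` should be scheduled first, +1 if `right`
    # should, mirroring the 1/-1 short-circuit described above: the
    # stall check runs before latency is ever consulted.
    if left["stalls_now"] != right["stalls_now"]:
        return 1 if left["stalls_now"] else -1
    # Only when the stall check ties do latencies matter; here "height"
    # stands in for the latency-derived priority: taller is more
    # critical and is preferred.
    if left["height"] != right["height"]:
        return -1 if left["height"] > right["height"] else 1
    return 0

store = {"stalls_now": True, "height": 2}    # full pipeline this cycle
fadd = {"stalls_now": False, "height": 24}   # long latency, no hazard

# The store's one-cycle structural stall disqualifies it before its
# latency advantage can be weighed, so the fadd wins.
assert bu_compare(store, fadd) == 1
```

This matches the observed behavior: because the stall check only looks at the current cycle, a short structural stall always loses to an arbitrarily long operand-latency gap.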
For the specific example: We start with the initial store...
GPRC: 4 / 31
F4RC: 1 / 31
Examining Available:
Height 2: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910,
0x2c03870, 0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
[ORD=94] [ID=102]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
[ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]
*** Scheduling [21]: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70,
0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%
arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]
then it schedules a "token factor" that is attached to the address
computation required by the store (this is essentially a no-op,
right?)...
GPRC: 5 / 31
F4RC: 2 / 31
Examining Available:
Height 21: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1, 0x2c03a70
[ORD=94] [ID=5]
Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
[ID=105]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
[ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]
*** Scheduling [21]: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1,
0x2c03a70 [ORD=94] [ID=5]
now here is the choice that we may want to be different...
GPRC: 5 / 31
F4RC: 2 / 31
Examining Available:
Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
[ID=105]
Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
[ID=92]
Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
[ORD=70] [ID=85]
(with more debug turned on, I also see a bunch of messages like:
*** Hazard in cycle 3, SU(97): xxx: ch = STFSX ...<Mem:ST4[%
arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
one of these for each of the other possible stores).
*** Scheduling [24]: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70,
0x2bf3710 [ORD=92] [ID=105]
why did it choose this fadd over any of the other stores? The
corresponding SUnit descriptions are:
SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870,
0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94]
[ID=102]
# preds left : 4
# succs left : 1
# rdefs left : 0
Latency : 7
Depth : 0
Height : 0
Predecessors:
val #0x2c11ff0 - SU(105): Latency=3
val #0x2c0cdd0 - SU(32): Latency=1
val #0x2c11db0 - SU(103): Latency=1
ch #0x2c0af70 - SU(5): Latency=0
Successors:
ch #0x2c0ac10 - SU(2): Latency=1
SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]
# preds left : 2
# succs left : 1
# rdefs left : 1
Latency : 11
Depth : 0
Height : 0
Predecessors:
val #0x2c12110 - SU(106): Latency=6
val #0x2c0d130 - SU(35): Latency=6
Successors:
val #0x2c11c90 - SU(102): Latency=3
Just from the debugging messages, it looks like what is happening is
that the scheduler is first rejecting the other stores because of
pipeline hazards and then picking the instruction with the lowest
latency. Looking at the code, it seems that this is exactly what it was
designed to do. If I'm wrong about that, please explain.
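The two-step ordering I am describing can be sketched as follows — a stalled node always loses the comparison, and only among equally (non-)stalled nodes does the relative height matter. This is my simplified reading of the BUCompareLatency heuristic, not the LLVM source:

```cpp
#include <cassert>

// Toy candidate: stalled this cycle or not, plus its height in the DAG.
struct Cand { bool Stalled; int Height; };

// Returns true if L should be scheduled before R (bottom-up: pick L first).
bool pickLeftFirst(const Cand &L, const Cand &R) {
  if (L.Stalled != R.Stalled)
    return !L.Stalled;            // the stall check dominates everything else
  return L.Height >= R.Height;    // then prefer the taller chain
}
```

Under this model the FADDS (ready, height 24) beats every store that reports a hazard in the current cycle, regardless of their latencies.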
Thanks in advance,
Looking at this more carefully, I think that I see the problem. The
heights are set to account for the latencies:
PredSU->setHeightToAtLeast(SU->getHeight() + PredEdge->getLatency());
but the latencies are considered only if the node has an ILP scheduling
preference (the default in TargetLowering.h is None):
bool LStall = (!checkPref || left->SchedulingPref == Sched::ILP) &&
BUHasStall(left, LHeight, SPQ);
...
and the PPC backend does not override getSchedulingPreference.
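In other words, the gating in the quoted line means that with checkPref set, a node whose preference is not Sched::ILP never reports a stall, so its height/latency information is effectively ignored. A minimal sketch of that gating (simplified stand-ins, not the LLVM source):

```cpp
#include <cassert>

enum class Sched { None, ILP };

// Toy stall check: the node's height is not yet covered by the current cycle.
bool toyBUHasStall(int Height, int CurCycle) { return Height > CurCycle; }

// Mirrors: (!checkPref || left->SchedulingPref == Sched::ILP) && BUHasStall(...)
bool leftStall(Sched Pref, bool CheckPref, int Height, int CurCycle) {
  return (!CheckPref || Pref == Sched::ILP) &&
         toyBUHasStall(Height, CurCycle);
}
```

With the PPC default of Sched::None, the stall term is short-circuited to false whenever checkPref is set.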
-Hal
That is exactly what I tried and it works pretty well.
>
>
> BTW - If you set HasReadyFilter, the fadd (105) would not even appear
> in the queue until the scheduler reached cycle [24]. So three
> additional stores would have been scheduled first. HasReadyFilter
> effectively treats operand latency stalls as strictly as pipeline
> hazards. It's not clear to me that you want to do that, though, if you
> fix getSchedulingPreference and do postRA scheduling later anyway.
> getSchedulingPreference and do postRA scheduling later anyway.
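The HasReadyFilter behavior described above can be sketched like this: nodes whose height exceeds the current cycle are kept out of the available queue entirely, instead of merely losing comparisons. Again a toy model with made-up names, not the real LLVM queue:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy pending node: its SU id and its height (cycle at which it is ready).
struct Node { int Id; int Height; };

// Ready filter: a node enters the available set only once the current
// cycle has reached its height, treating a latency stall like a hazard.
std::vector<int> availableIds(const std::vector<Node> &Pending, int CurCycle) {
  std::vector<int> Out;
  for (const Node &N : Pending)
    if (N.Height <= CurCycle)
      Out.push_back(N.Id);
  return Out;
}
```

Under this model SU(105) with height 24 would not even be a candidate at cycle 21, so the remaining stores would be considered first.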
I just ran some quick tests, even with getSchedulingPreference returning
ILP (regardless of how HasReadyFilter is set), using postRA still
results in significantly better code compared to using Hybrid alone.
Whether ILP+postRA or Hybrid+postRA wins depends on the benchmark
(Hybrid+postRA may have a small edge).
Thanks again,
Hal
>
>
> So it should work to do "hybrid" scheduling biased toward ILP, vs.
> "ilp" scheduling, which really does the opposite of what its name
> implies because it is initially biased toward register pressure.
>
>
> -Andy
>