Conditional instruction support

Sam Parker-Haynes

Jul 25, 2023, 4:23:54 AM
to v8-dev
Hi,

I've been wondering how to go about adding general support for conditional operations in Turbofan, specifically conditional compares for AArch64. But I now see that the Intel APX extensions also add more conditional instructions beyond cmov.

I'm not a huge fan of flag continuations, as it's been too easy for me to introduce bugs while using them, so is there a safer way to approach this?

Machine-level predication would seem to require a CFG in a form that isn't really available until during/after jump-threading. So a few questions about jump threading:
- Could predication happen there?
- Why does jump threading run so late? Why not before regalloc?

And, in general, what is the policy on having optimisation passes that run on machine code?

Thanks,
Sam

Nico Hartmann

Jul 25, 2023, 4:37:26 AM
to v8-...@googlegroups.com
Hi Sam,

We are currently migrating Turbofan to a new CFG-based IR (Turboshaft). This could help.

Cheers,
Nico


Sam Parker-Haynes

Jul 25, 2023, 5:05:26 AM
to v8-dev
Hi Nico,

A CFG IR sounds good to me :) I've been hesitant to look into it as I didn't know the status. Is there any documentation? Is it currently piggybacking off Turbofan for some functionality? I can't immediately see where/how graph nodes are defined.

cheers, 

Nico Hartmann

Jul 25, 2023, 5:47:07 AM
to v8-...@googlegroups.com
There is not a lot of (up to date) documentation. A very brief summary:

The new IR's operations (which replace Turbofan's nodes) are defined in turboshaft/operations.h.
The current state of the pipeline is as follows: We run Turbofan's frontend and then translate to the Turboshaft IR (in turboshaft/graph-builder.cc). Next, the Turboshaft optimization and lowering phases run on the CFG (see primarily pipeline.cc and the turboshaft/*-phase.h included from there). Eventually, we translate the CFG back into Turbofan's sea-of-nodes IR (in turboshaft/recreate-schedule.cc) to run the backend.
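In other words, the current flow is roughly:

Turbofan frontend (sea of nodes)
  -> turboshaft/graph-builder.cc       (build the Turboshaft CFG)
  -> Turboshaft phases                 (pipeline.cc, turboshaft/*-phase.h)
  -> turboshaft/recreate-schedule.cc   (back to sea of nodes)
  -> Turbofan backend                  (instruction selection onwards)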

We are currently working on reimplementing the rest of Turbofan's phases to have a full Turboshaft CFG pipeline eventually, but we plan to ship a first version of Turboshaft soonish, which still only covers part of the pipeline and relies on the translation back and forth.

Sam Parker-Haynes

Jul 25, 2023, 5:53:01 AM
to v8-dev
Great, thanks for the info, I'll take a look around.

Sam Parker-Haynes

Jul 26, 2023, 6:35:25 AM
to v8-dev
Hi Nico,

I've been having a poke around using a toy wasm example. I'm trying to understand how the wasm optimization pipeline works. I have this basic block:

BLOCK B7 <- B4                                                                                                                
   26: Constant()[word32: 50]                                                                                                  
   27: Equal(#3, #26)[Word32]                                                                                                  
   28: Constant()[word32: 0]                                                                                                  
   29: Equal(#27, #28)[Word32]                                                                                                
   30: Branch(#29)[B10, B8, None]

And I can see that MachineOptimizationReducer changes the branch (29) condition, but it doesn't remove (28) and (27):

BLOCK B6 <- B4
   25: Constant()[word32: 50]
   26: Equal(#3, #25)[Word32]
   27: Constant()[word32: 0]
   28: Equal(#26, #27)[Word32]
   29: Branch(#26)[B7, B8, None]

I've tried adding DeadCodeElimination, but it always seems to run before MachineOptimizationReducer. How can I get it to run afterwards?

What I find somewhat worrying is that this IR is lowered to equivalent looking Turbofan IR but then ISel appears to be using Word32Equal (28) as the branch condition again:

--- BLOCK B5 id7 <- B4 ---
  24: IfFalse(22)
  25: Int32Constant[50]
  26: Word32Equal(4, 25)
  27: Int32Constant[0]
  28: Word32Equal(26, 27)
  29: Branch[Unspecified, None](26) -> B7, B6

B5: AO#4 (no frame)  instructions: [11, 13)
 predecessors: B4
      11: gap () ()
          v7(R) = Arm64Cmp32 && set if equal v0(R) #50
      12: gap () ()
          Arm64CompareAndBranch32 && branch if not equal v7(R) [rpo_immediate:7] [rpo_immediate:6]

Any idea of what's going on here?

cheers!

Nico Hartmann

Jul 27, 2023, 4:52:27 AM
to v8-...@googlegroups.com
Hi Sam,

Briefly summarized, the Turboshaft pipeline is organized into phases, where each phase is defined by a stack of reducers (e.g. `OptimizationPhase<SomeReducer, SomeOtherReducer, AnotherReducer>::Run()`). A Turboshaft graph is passed to each phase (the "input graph") and the phase then iterates over the graph and passes each operation through the stack. At the bottom of the stack, a new operation (or a sequence of operations) is generated into the phase's output (the "output graph"). The output graph is then the next phase's input. However, some reducers run analyses, and those always process the phase's input graph regardless of where they are in the stack. The `DeadCodeEliminationReducer` is one of these: it runs an analysis on the input graph and then skips emitting dead operations during reduction. As a result, operations that became dead during a previous reduction in the same phase (reducer stack) will not be removed.
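For illustration, a phase definition then has roughly this shape (the phase and reducer names here are placeholders; the real stacks live in pipeline.cc and the turboshaft/*-phase.h files):

void SomePhase::Run(Zone* temp_zone) {
  // Every operation of this phase's input graph is passed through the
  // reducer stack; whatever reaches the bottom is emitted into the output
  // graph, which becomes the next phase's input graph.
  OptimizationPhase<SomeReducer,
                    SomeOtherReducer,
                    AnotherReducer>::Run(temp_zone);
}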

There are two ways to get rid of unused operations:
- When a phase iterates over the input graph, all operations that don't have uses are just skipped. So once an operation that doesn't have any uses is emitted into the output graph, it will be removed (or rather ignored when generating the next graph) at the beginning of the next phase.
- Using the `DeadCodeEliminationReducer`, which is a more powerful liveness analysis. It will run a full fixpoint analysis on the input graph to find all live operations and remove all dead operations in one step. It will also eliminate unnecessary control flow by rewriting BranchOp and GotoOp targets, and it will rewrite PhiOps and such. If you want to make use of this, it has to be in the following phase, because - as described above - it always runs on the phase's input graph (see the sketch below). However, this is a more expensive analysis+reducer, so I would not recommend using it for the example you provided above, but rather just rely on the next OptimizationPhase run to remove the unused operations.
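A minimal sketch of that placement constraint (everything except DeadCodeEliminationReducer and MachineOptimizationReducer is a placeholder name):

// Phase N: reductions that may leave operations without uses behind.
OptimizationPhase<SomeLoweringReducer,
                  MachineOptimizationReducer>::Run(temp_zone);
// Phase N+1: DeadCodeEliminationReducer analyses this phase's input graph,
// i.e. the output of phase N, so it can drop the operations that became
// dead above.
OptimizationPhase<DeadCodeEliminationReducer,
                  SomeOtherReducer>::Run(temp_zone);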

Regarding the ISel, I think the generated code looks okay. `v7(R) = ...` is the comparison of `v0` with `50` (so that's basically `26: Word32Equal(4, 25)`) and then we branch based on this result. Please clarify what you think is wrong here.

Nico

Sam Parker-Haynes

Jul 27, 2023, 6:34:02 AM
to v8-dev
If I'm reading the IR correctly, the branch has already been rewritten to directly use #26, so I'm not expecting there to be two comparisons. The branch targets are the same, but the condition is being inverted (again) during ISel. I think the IR should be translated to this:
 
v7(R) = Arm64Cmp32 && branch if equal v0(R) #50 [rpo_immediate:7] [rpo_immediate:6]

And many thanks for the pipeline overview! I will continue playing.


Sam Parker-Haynes

Jul 27, 2023, 8:24:25 AM
to v8-dev
Ah, I think I was misreading the Arm64 instructions: Arm64CompareAndBranch is cbz/cbnz (it branches on the register being zero or non-zero rather than on the flags), so the control flow does work out... Too many levels of conditionals for me!

Sam Parker-Haynes

Jul 27, 2023, 9:49:11 AM
to v8-dev
Right, it is now working as expected. I've broken the WasmOptimizePhase into two:

void WasmOptimizePhase::Run(Zone* temp_zone) {
  UnparkedScopeIfNeeded scope(PipelineData::Get().broker(),
                              v8_flags.turboshaft_trace_reduction);
  // TODO(14108): Add more reducers as needed.
  OptimizationPhase<WasmLoweringReducer,
                    MachineOptimizationReducerSignallingNanPossible>::Run(temp_zone);
  OptimizationPhase<ValueNumberingReducer>::Run(temp_zone);
}


And this indeed cleans up the graph to remove dead code, which allows the backend to perform optimisations under CanCover:

B5: AO#4 (no frame)  instructions: [11, 12)

 predecessors: B4
      11: gap () ()
          Arm64Cmp32 && branch if equal v0(R) #50 [rpo_immediate:7] [rpo_immediate:6]

So, my final question :) how does one decide how to organize passes? There's always going to be an ordering problem, but what are the trade-offs to grouping/splitting reducers into OptimizationPhase(s)? 

Sam

Nico Hartmann

Jul 27, 2023, 10:47:08 AM
to v8-...@googlegroups.com
Great to hear that it's working as expected.

The phase ordering is a hard problem indeed, and there is no one answer to that question. Every phase has a cost in the sense that the graph has to be iterated and a new graph has to be emitted. Although the cost is relatively small, it is not free; in other words, you might not want to put every reducer in its own phase. Setting aside the reducers that have additional constraints (e.g. they always run on the input graph or they always need to be at the bottom of the stack): if ReducerA recognizes and optimizes a certain pattern, and ReducerB may produce such patterns as part of its own reductions/optimizations, then it is typically a good idea to run ReducerA (somewhere) behind ReducerB in a stack.

Other than that, for the pipeline we have so far, we just tried different combinations and compared performance of generated code and of the compilation itself.

Hope that helps a bit,
Nico

Sam Parker-Haynes

Aug 31, 2023, 8:21:11 AM
to v8-dev
Hi Nico,

I finally got around to trying to make a prototype and I'm unsure how to proceed. As an overview, I want to find a triangle CFG and hoist the conditional 'middle' block into its only predecessor. Something like this:
B0:
  x = ComparisonOp a, b, kind
  BranchOp x, B2, B1
B1:
  y = ComparisonOp c, d, kind
  BranchOp y, B2, B3
B2:
  z = ....
-------------------------------------
B0:
  x = ComparisonOp a, b, kind
  y = ConditionalComparisonOp c, d, x, kind
  BranchOp y, B2, B3
B2:
  z = ...
B3:

My current implementation would, from `Bind`, search the CFG from B2 and decide that B1 can be hoisted into B0. From there, we need to replace the BranchOp in B0 with the one in B1. We're then replacing the condition of that branch with a new ConditionalCompareOp.

If this sounds feasible, where/how should I do the replacing? I haven't yet figured out if there is an interface which a reducer has to implement, but I've seen several examples which sound promising: REDUCE(Branch), ReduceBranchCondition and ReduceInputGraphBranch. Would any of these be the right place, or would they be a misuse of how the reducer stacks operate?

cheers,
sam

Darius Mercadier

Aug 31, 2023, 8:40:30 AM
to v8-...@googlegroups.com
Hi Sam,

ReduceBranch in branch-elimination-reducer.h sounds like the right place to do this (the method is declared with "REDUCE(Branch)" rather than just "ReduceBranch").
You can probably draw some inspiration from "REDUCE(Goto)" in this same file: this method tries to eliminate double diamonds by identifying Gotos to blocks that end with a branch whose condition is currently known, and inlining the destination (see the 2nd part of the topmost comment of the file).

Note that "if_true" in this ReduceBranch is the "true" block in the output graph that we're currently building, which is currently not bound and is thus empty (we always start by emitting all of the predecessors of a block before emitting it (except for loop headers)). You thus have to look at its origin in the input graph, like the beginning of ReduceBranch does with "if_true_origin = Asm().OriginForBlockStart(if_true)".

Also note that in your example, once you've inlined B1 into B0, you do not need to explicitly skip B1 later: B1 will now have 0 predecessors, and will thus automatically be skipped by the OptimizationPhase.
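Very roughly, the shape is something like this (pseudocode - please check the actual signature and helpers in branch-elimination-reducer.h rather than trusting this sketch):

OpIndex REDUCE(Branch)(OpIndex condition, Block* if_true, Block* if_false,
                       BranchHint hint) {
  // if_true is a not-yet-bound block of the output graph, so map it back to
  // its origin in the input graph (and similarly for if_false if needed).
  const Block* if_true_origin = Asm().OriginForBlockStart(if_true);
  // ... if the origin is a single-predecessor block that just compares and
  // branches again, emit a conditional compare here and branch on its
  // result instead of emitting the original branch ...
  // Otherwise fall through to the next reducer in the stack.
  return Next::ReduceBranch(condition, if_true, if_false, hint);
}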

Let me know if you have any questions and feel free to add me as reviewer to your CL ;)

Cheers,
Darius


Sam Parker-Haynes

Aug 31, 2023, 8:46:58 AM
to v8-dev
Great, thanks for the tips Darius!

Sam Parker-Haynes

Oct 4, 2023, 6:30:27 AM
to v8-dev
Hello again,

So, I played with branch-elimination until I realized that I still need to use the current ISel framework, and so I've been trying to figure out how to do this with FlagsContinuations... I've begun by just trying to perform ISel pattern matching of (and/or cmp, cmp) so that, instead of:
cmp
cset
cmp
cset
and/orr

We have:
cmp
ccmp
cset.

To achieve this, I've added another type of FlagsContinuation that performs the conditional compare and materializes the boolean result at the end, though I would also want to support the case where we branch instead. This is okay for combining two comparison operations, but it seems to fall apart if there's a deeper chain.

So, fundamentally, what I need is to be able to support an arbitrary number of conditional compares. Say we had a `ConditionalCompareChain` FlagsContinuation: we could pass an arbitrarily sized array of inputs, where each conditional compare has four inputs: lhs, rhs, default flag value and its predicate. During codegen we could then generate any number of ccmps based upon the size of the input array. Would this be acceptable?
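Say the chain covered something like (a == b) && (c == d) && (e == f); codegen could then emit something along these lines (hand-written AArch64 for illustration, register assignments made up):

cmp   w0, w1            // flags from comparing a, b
ccmp  w2, w3, #0, eq    // if eq so far: compare c, d; else set NZCV to 0000, which fails eq
ccmp  w4, w5, #0, eq    // if still eq:  compare e, f; else set NZCV to 0000
cset  w6, eq            // materialize the boolean result

Each ccmp in the chain carries exactly the four inputs described above: lhs, rhs, the default flag value (#0) and the predicate (eq).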

cheers!
Sam