[llvm-dev] Writing loop transformations on the right representation is more productive

Michael Kruse via llvm-dev

Jan 3, 2020, 6:27:32 AM1/3/20
to llvm-dev
In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
optimizes loops. In summary, the biggest issues are (a) the complexity
of writing a new loop optimization pass (including needing to deal
with a variety of low-level issues, a significant amount of required
boilerplate, the difficulty of analysis preservation, etc.), (b)
independent optimization heuristics and a fixed pass ordering and (c)
code explosion due to versioning. Currently, different people are
working on improving different loop optimization passes such as
LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
'perfect' individual loop passes, they would still have the
aforementioned problems due to being independent, and the system as a
whole would remain suboptimal.

Instead of each loop pass being a major component, the heavy lifting
could be done by a framework of which each transformation itself is a
small part. In this RFC, I would like to work towards a consensus on
how such a framework could look. I already outlined a possible
solution in the same presentation [1] and a publication [7], which is
partially based on a previous RFC [8]. All feedback is welcome,
including a simple '+1'.

The central idea is to use a modifiable loop tree -- similar to
LoopInfo -- as the primary representation. LLVM-IR is converted to a
loop tree, then optimized and finally LLVM-IR is generated again for
subtrees that are considered profitable. This is not a new concept; it
has already been used in compilers such as IBM XL Fortran (called ASTI
[2]) and Silicon Graphics/Open64 (called LNO [10]), and in research
such as the Value State Dependence Graph [3].

Other features include the following:

1. Candidate selection through cost functions
2. Cheap copies using Red/Green Trees
3. Application of transformations from high-level to low-level
4. Represents loops and predicates instead of CFGs
5. Data and control dependencies in one graph
6. Late fallback versioning at IR regeneration
7. Standardized API for analyses with multiple implementations
8. Abstract representation of statements
9. Expansion of use-def chains to arrays when spanning loops

To elaborate on each of these:

1. Candidate selection through cost function
--------------------------------------------
Instead of needing to know which transformation is profitable before
applying it, create a copy of the data structure, modify it, and
compare it to the original. Moreover, holding multiple, differently
optimized copies allows evaluating each variant using a cost function
and selecting the 'best' when re-generating LLVM-IR (or re-using the
original LLVM-IR).

Otherwise, each transformation will have to implement its own
profitability heuristic but cannot know what other passes will do. For
instance, the motivation for LLVM's LoopDistribution is to enable
vectorization of loops that can be separated from other loops.
However, it does not know whether LoopVectorize will actually
vectorize the loop, or whether two vectorizable loops are better
vectorized together.

Instantiating every possible sequence of transformations is of course
not feasible, so the search space needs to be pruned. This could be
made dependent on the optimization level.
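As a minimal sketch of the selection step, assuming hypothetical
`Candidate`/`pickBest` names (not an existing LLVM API): each variant
carries the cost the model assigned to it, and the cheapest one wins,
falling back to the original if nothing beats it.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: a candidate pairs a (cheaply copied) loop tree
// variant with the cost the model estimated for it.
struct Candidate {
  std::string Desc; // which transformation sequence produced this variant
  double Cost;      // estimated execution time from the cost model
};

// Pick the cheapest variant; fall back to the original if no variant
// improves on it, in which case the original LLVM-IR is reused.
const Candidate &pickBest(const Candidate &Original,
                          const std::vector<Candidate> &Variants) {
  const Candidate *Best = &Original;
  for (const Candidate &C : Variants)
    if (C.Cost < Best->Cost)
      Best = &C;
  return *Best;
}
```

For instance, `pickBest({"original", 10.0}, {{"distributed", 12.0},
{"fused+vectorized", 4.0}})` selects the fused and vectorized variant.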


2. Cheap copies using Red/Green Trees
-------------------------------------
A red/green tree [4] is a data structure for cheap copies of DAGs.
Basically, a green tree node references only the node's children but
not the parent. A node from the red tree references its corresponding
green node and its parent (red) node. When modifying the represented
node through copy-on-write, this allows reusing all green nodes except
the ones from the modified node to the tree's root. The same node can
even be used multiple times in the same copy, which can be useful for
e.g. loop unrolling and code versioning. As a result, storing many
optimization variants of the same code only requires copies of the
changed parts.

By contrast, red nodes are created on-demand. Traversing the entire
tree should rarely be necessary (instead, use summary information in
the green nodes) and, when it is, can be limited to subtrees that are
considered interesting for optimization.
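The path-copying scheme can be sketched as follows, with hypothetical
node names (not the actual prototype's classes): green nodes are
immutable and reference only their children, so replacing one child
copies only the path to the root while all other subtrees are shared.

```cpp
#include <memory>
#include <string>
#include <vector>

// Minimal green-tree sketch; names are illustrative only.
struct GreenNode {
  std::string Kind; // e.g. "loop", "stmt", "seq"
  std::vector<std::shared_ptr<const GreenNode>> Children;
};

using GreenRef = std::shared_ptr<const GreenNode>;

// Copy-on-write: replace the child at ChildIdx, duplicating only the
// parent node; untouched children are reused by reference.
GreenRef replaceChild(const GreenRef &Parent, size_t ChildIdx,
                      GreenRef NewChild) {
  auto Copy = std::make_shared<GreenNode>(*Parent); // shallow copy
  Copy->Children[ChildIdx] = std::move(NewChild);
  return Copy;
}
```

Replacing a loop in a two-loop sequence this way leaves the sibling
loop's green node shared between the original and the new variant, so
storing both variants costs only the changed path.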


3. Apply transformations from high-level to low-level
-----------------------------------------------------
Optimizations should be applied from very specialized to very general
(potentially after some canonicalization). For instance, the first
step could be detecting common idioms such as gemm and replacing them
with either a BLAS function call or well-studied optimizations such as
BLIS. After such an idiom has been detected, no other transformation
should be applied to it.

Mid-level transformations may try to map entire loop nests to cache-
and compute hierarchies (SIMT threads, multiprocessors, offloading,
etc) by applying transformations such as tiling, loop interchange and
array packing.

Low-level transformations commonly look at only one loop to improve
processor pipelining and predication: Unrolling, software pipelining,
etc.


4. Represents loops and predicates instead of CFGs
-------------------------------------------------
Control flow graphs make loop optimizations difficult, but are also
not something a loop optimizer typically cares about. For instance, in
the example below, loop distribution would need to understand the
condition `c` and reconstruct the branch to fission

for (int i = 0; i < n; i += 1)
  if (c) {
    foo(i);
    bar(i);
  }

into

for (int i = 0; i < n; i += 1)
  if (c)
    foo(i);
for (int i = 0; i < n; i += 1)
  if (c)
    bar(i);

In the loop tree, control flow is represented through predicates of
the statements. In a first step, it is represented as

for (int i = 0; i < n; i += 1) {
  bool __BB1_executed = false;
  { if (c) { foo(i); __BB1_executed = true; } }
  { if (__BB1_executed) bar(i); }
}

The predicates (`c` and `__BB1_executed`) can be separated into a
"necessary condition" (`c` must be true before `foo` can be executed)
and a "sufficient condition" (if `c` is true, `foo` must be executed).
These can be different to allow speculative execution of the
statement. For instance, a necessary condition of `true` means that
the statement can always be executed speculatively.

The advantage is that control flow becomes a data-dependency problem.
It can trivially be converted back to its CFG representation because
`__BB1_executed` directly corresponds to the CFG edge to which it was
connected. Data flow optimizations such as value propagation can be
applied as well (here: `__BB1_executed == c`), such that we get

for (int i = 0; i < n; i += 1) {
  { if (c) foo(i); }
  { if (c) bar(i); }
}

This can be distributed trivially (assuming foo() and bar() do not
cause dependencies). Whether the copy propagation is beneficial can be
decided by a heuristic (considering whether the original flow can
still be reconstructed), or the variant is simply discarded if the
cost function does not select a loop tree based on it.

As a result of the predicated representation, the loop tree has a
regular pattern: sequence -> loop -> sequence -> loop -> ... . A loop
here is abstract in that it is just something that is executed multiple
times. This can be a sequential loop, but also a collection of SPMD
processes, threads, vector lanes, etc.

At IR regeneration, we have multiple choices of how to lower
conditions. If we lower to LLVM-IR CFG, it may lower to
* Branch instructions, reusing branch conditions from previous statements
* Select instructions

If lowering to SIMD instructions, we can generate
* Branch instructions for uniform predicates
* Execution masks for varying predicates
* A mix of both, since predicates can be uniform relative to
previously taken conditions

If lowering to NVPTX-style SIMT
* @p per-instruction execution masks
* Divergent branches; Every statement can be made a re-convergence
point (by re-evaluating the condition), but will happen, at the
latest, at the end of the loop
* Uniform predicates can be given the ".uni" suffix

When lowering to AMD-style SIMT
* S_CBRANCH (explicitly managed execution mask)
* S_CBRANCH_I/G_FORK (stack of execution masks)


5. Data and control dependencies in one graph
---------------------------------------------
Also as a result of the predicated form, control dependencies and
data dependencies have the same representation during dependency
analysis, as in a program dependence graph. In LLVM-IR, passes such as
AggressiveDeadCodeElimination use DominanceFrontier for this purpose;
with this representation that would no longer be necessary.

When re-generating IR, control dependencies can be lowered either to
branching or to boolean variables. For the purpose of loop
optimizations, the choice of representation does not (and should not)
make a difference.


6. Late fallback versioning at IR regeneration
------------------------------------------
When a transformation is applied, it can attach conditions (no
aliasing, no integer wrap, value restrictions, etc.) under which the
transformation is valid. During LLVM-IR generation, these conditions
are collected and emitted as run-time conditions. If the condition
fails, the original code is executed.

This avoids having every transformation emit its own fallback code.
The fallback should never execute, or execute only in cases that do
little work (e.g. the loop trip count is 0), such that optimizing the
fallback itself is not necessary. In other cases, the transformation
should implement proper loop versioning by adding the original
statement with a `!rtc` condition.


7. Standardized API for analyses with multiple implementations
--------------------------------------------------------------
Common analyses use an interface that can be fulfilled by different
implementations. Among these are:

* Closed form expressions
- None (expressions are always represented through dependencies
between statements)
- LLVM ScalarEvolution
- PolyhedralValueAnalysis [6]

Closed form expressions can be used to remap values (typically array
subscripts, loop lower/upper bounds), simplify expression computations
(including predicates => is a predicate always true?), determine loop
trip counts, etc. Most importantly, it identifies a statement
execution (instance) relative to the loop counters.
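As a small illustration of what a closed-form implementation provides,
here is a hedged sketch (a hypothetical helper, not part of any LLVM
analysis) of deriving the trip count of `for (i = Begin; i < End;
i += Step)` from affine bounds, without executing the loop:

```cpp
#include <cassert>
#include <cstdint>

// Trip count of `for (i = Begin; i < End; i += Step)` with Step > 0,
// computed purely from the closed-form bounds: ceil((End-Begin)/Step),
// clamped at zero for loops that never execute.
int64_t tripCount(int64_t Begin, int64_t End, int64_t Step) {
  assert(Step > 0 && "sketch only handles positive strides");
  if (End <= Begin)
    return 0;
  return (End - Begin + Step - 1) / Step; // ceiling division
}
```

The same closed forms also identify each statement instance relative
to the loop counters, which dependence analysis builds on.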


* Array accesses / access range / provenance
- LLVM AliasAnalysis
- One-dimensional (always linearized)
- Multi-dimensional, delinearized if necessary
- From metadata (that a front-end emits for arrays of pointers to
arrays of data)

Array access analysis determines whether two accesses at a specific
index can alias each other, access ranges and connectivity.


* Dependencies
- none (cannot reorder statements)
- flat (depends on last execution of another statement)
- LLVM DependenceAnalysis
- Polyhedral analysis

There is intentionally no "does X depend on Y?" API as this would make
any analysis based on it quadratic in runtime. Instead, it has to list
all statements another statement depends on such that a graph can be
built and e.g. sorted topologically.

Dependencies include: use-def chains via registers, memory
flow/anti/output, general side-effects and control dependencies.
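A consumer of such a "list everything S depends on" API could build
the graph and sort it topologically as sketched below; the callback
signature and names are hypothetical, and the sketch assumes the
dependencies are acyclic (no cycle detection).

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical API shape: for a statement, list all statements it
// depends on (use-def, memory flow/anti/output, control, ...).
using DepsFn = std::function<std::vector<std::string>(const std::string &)>;

// Emit statements so that every statement appears after everything it
// depends on. Assumes the dependence relation is acyclic.
std::vector<std::string> topoSort(const std::vector<std::string> &Stmts,
                                  const DepsFn &DependsOn) {
  std::vector<std::string> Order;
  std::map<std::string, bool> Done;
  std::function<void(const std::string &)> Visit =
      [&](const std::string &S) {
        if (Done[S])
          return;
        Done[S] = true;
        for (const std::string &D : DependsOn(S)) // dependencies first
          Visit(D);
        Order.push_back(S);
      };
  for (const std::string &S : Stmts)
    Visit(S);
  return Order;
}
```

Because each statement's dependency list is enumerated exactly once,
the sort stays linear in the size of the graph, which is the point of
avoiding a "does X depend on Y?" query.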


* Control flow
- Extracted from original CFG
- Evaluate statement conditions

Control flow analysis answers questions about (post-)dominance and
mutual exclusivity within the acyclic sequence/body of a loop.


* Cost heuristic
- TargetTransformationInfo

Estimate the execution time of a (sub-)tree. Average trip counts from
profiling [11] could be useful.


8. Abstract representation of statements
----------------------------------------
There is no direct relationship between LLVM instructions and loop DAG
statements. If possible, loop tree optimizations should only use the
properties of the statement node, not its contents. Typically, a
statement consists of multiple instructions that do not make sense to
split. For instance, assuming that %add is not used a second time, in
the example below

%add = add i64 %arg, 2
%mul = shl i64 %add, 1

the two instructions should always be computed together in the same
loop iteration. Combining instructions to statements reduces the
computational complexity of loop optimizations and might be controlled
by the optimization level (at lower optimization level more
instructions would be combined). A loop node is itself a statement in
its parent loop node's acyclic body.

Moreover, there is no reason to use only LLVM-IR to carry the
semantics. We could also use e.g. Machine-IR or MLIR operations.


9. Expansion of use-def chains to arrays when spanning loops
------------------------------------------------------------
Consider we would like to fission the following loop:

for (int i = 0; i < n; i += 1) {
  { if (true) double a = f(i); }
  { if (true) g(a); }
}

The problem is that the split cuts the use-def dependency over `a`.
However, the fission can still be done and results in:

double A[n];
for (int i = 0; i < n; i += 1)
  { if (true) A[i] = f(i); }
for (int i = 0; i < n; i += 1)
  { if (true) g(A[i]); }

The transforming pass has to consider this in its profitability
model. The big advantage is that, in terms of correctness, use-def
chains no longer manifest false dependencies.
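The equivalence of the two forms can be demonstrated with a runnable
sketch; `f` and `g` are placeholder computations, not functions from
the source:

```cpp
#include <vector>

// Placeholder computations standing in for the f(i)/g(a) above.
static int f(int I) { return 2 * I + 1; }
static int g(int V) { return V * V; }

// Original loop: def and use of the scalar in the same iteration.
std::vector<int> fused(int N) {
  std::vector<int> Out(N);
  for (int I = 0; I < N; I += 1) {
    int A = f(I);  // def
    Out[I] = g(A); // use
  }
  return Out;
}

// Fissioned loop: the use-def chain is expanded to the array A.
std::vector<int> fissioned(int N) {
  std::vector<int> A(N);
  std::vector<int> Out(N);
  for (int I = 0; I < N; I += 1)
    A[I] = f(I);
  for (int I = 0; I < N; I += 1)
    Out[I] = g(A[I]);
  return Out;
}
```

Both versions produce identical results; what the profitability model
must weigh is the extra O(n) storage for A.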


Possible Transformations
------------------------
In principle, any transformation could be applied on this
representation, but instruction-centric transformations such as
InstCombine are less suitable. Obviously, loop-centric transformations
are a better fit. Ideas for loop transformations include:

* Unroll(-and-jam)
* Peeling
* Iteration range split, concatenation
* Interchange
* Loop fusion and distribution(/fission)
* Tiling/strip-mining/blocking
* Code versioning (including loop unswitching)
* Maybe vectorization
* Accelerator offloading
* Parallelization
* Space-filling curve iteration [12]
* Array packing (e.g. for scratch pads), transposition
* Statement reordering
* SIMT-ization
* Detect and optimize reductions
* Loop skewing/wavefronting
* Loop reversal
* Solver-driven (polyhedral) optimization
* Loop idiom recognition


Optimization Pipeline
---------------------
The optimization pipeline for the hierarchy DAG is straightforward:

void optimizeLoopNest() {
  RedOriginal = generateLoopHierarchy();
  List = createCandidates(RedOriginal);
  sort(List, [](Cand1, Cand2) {
    return estimateExecutionTime(Cand1) < estimateExecutionTime(Cand2);
  });
  if (List[0] != RedOriginal)
    emit(List[0]);
}

Subtree exploration is done recursively, optimizing inner nodes which
can be (re-)used when optimizing outer nodes.

The current node to optimize is compared against a selection of known
heuristics and each matching transformation is added to the list of
candidates. That is, an optimization routine resembles the following,
with more specific heuristics checked before more general ones.

void exploreOptimizations(Root, Worklist) {
  Matches = 0;

  // High level: user annotations
  if (Root.hasHint()) {
    // Hint is mandatory: do not explore any other possibility
    Worklist.replace(Root, applyHint(Root));
    return;
  }

  // High level: idiom recognition
  if (isBLAS(Root)) {
    Optimized = LinkAgainstBLASLibrary ? convertToBlasLibCall(Root)
                                       : applyBLIS(Root);
    Worklist.replace(Root, Optimized);
    return;
  }

  // Mid level: recognize stencils
  if (isStencil(Root))
    ...

  // Mid level: map to compute hierarchy
  // (here: simd -> SMT -> core -> socket for CPUs)
  if (isParallel(Root) &&
      fitsWorkingSet(Root.getBody(), LLCSize) &&
      isParallel(Root.getPerfectlyNestedLoop(1)) &&
      fitsWorkingSet(Root.getPerfectlyNestedLoop(2).getBody(), L2Size) &&
      isParallel(Root.getPerfectlyNestedLoop(2)) &&
      fitsWorkingSet(Root.getPerfectlyNestedLoop(3).getBody(), L1Size) &&
      isVectorizable(Root.getPerfectlyNestedLoop(3))) {
    Optimized = applyVectorization(Root.getPerfectlyNestedLoop(3));
    Optimized = applyParallel(Optimized.getParent(), /*places=*/SMT);
    Optimized = applyParallel(Optimized.getParent(), /*places=*/Cores);
    Optimized = applyParallel(Optimized.getParent(), /*places=*/Sockets);

    // Still explore CPU pipeline optimizations of the innermost loop
    Worklist.insert(Root, Optimized);
    Matches += 1;
  }

  if (Matches >= Threshold)
    return;

  // Low level: optimize working set for L1 cache size using tiling
  Band = maximumTilableNest(Root);
  WorkingSetSizePerIteration = estimateWorkingSet(Band.back().getBody());
  TileSize = floor(nthroot(L1CacheSize / WorkingSetSizePerIteration,
                           Band.size()));
  if (TileSize > 1) {
    Worklist.insert(applyTiling(Root, TileSize.splat(Band.size())));
    Matches += 1;
  }

  if (Matches >= Threshold)
    return;

  // Low level: vectorization for each SIMD level (potentially
  // carried out by VPlan)
  foreach (Width : promisingSimdWidths(Root)) {
    Vectorized = applyVectorization(Root, Width);
    if (Vectorized.isSatisfiable()) {
      Worklist.insert(Vectorized);
      Matches += 1;
    }
  }

  if (Matches >= Threshold)
    return;

  // Low level: rely on a heuristic to find the best unrolling factor
  if (Factor = unrollFactorHeuristic(Root)) {
    Worklist.insert(applyUnroll(Root, Factor).getRoot());
    Matches += 1;
  }

  if (Matches >= Threshold)
    return;

  ...
}


The "levels" encode the precedence of optimizations. Depending on the
optimization level, we might only apply the most promising
optimization.

Instead of this brute-force implementation, for predefined
optimization levels the search space exploration should be guided:
encode the order of the most useful transformations in the search
space exploration. For instance, find all idioms first, then apply
memory hierarchy optimizations, then compute hierarchy, then CPU
pipeline optimizations. Since this is very transformation-specific, I
do not expect to write a 'pass manager' that transformations can
register themselves with (except for exhaustive searches when
explicitly enabled). Instead, I'd write this in code that classifies
loop nests (e.g. compute bound, bandwidth bound, known idioms, tensor
contractions, stencils, etc.) and chooses appropriate optimization
paths. We can, of course, make this more generic over time.

Writing an optimization pass
-----------------------------

Optimizations themselves are written in the following steps:
1. Check preconditions
2. Quick profitability analysis to skip obviously bad transformations
3. Create new green nodes to represent the changed loop nest; reuse
unchanged green nodes as much as possible
4. Create a red node to insert into the parent
5. Preserve analyses (most analyses should be able to use analysis
results stored in reused green nodes and/or lazily reevaluate new
nodes)
6. Check whether the transformation was valid

Especially the possibility of analyzing correctness after the
transformation has been applied could significantly reduce the amount
of code needed to verify correctness.

A transformation like loop fusion could look like:

void applyFuse(Red1, Red2) {
  // Assume Red1 and Red2 are consecutive sibling loops.

  // Bail out on untypical things (atomic accesses, exceptions,
  // convergent instructions, ...)
  if (!Red1->isSimple() || !Red2->isSimple())
    return;

  // Consistency checks (can be less strict by moving conditions
  // into the fused body).
  if (Red1->getIterationCount() != Red2->getIterationCount())
    return;
  if (Red1->getPredicate() != Red2->getPredicate())
    return;

  // Create new loop hierarchy with fused loops.
  GreenFused = new GreenLoop({ Red1->getGreen(), Red2->getGreen() },
                             Red1->getIterationCount());
  RedFused = new RedNode(GreenFused, Red1->getParent());
  NewRoot = RedFused->getRoot();

  // Preserve analyses.
  // In this case not really necessary since the green nodes of the
  // loop bodies are reused.
  DependenceAnalysis->transferAnalysis(Red1->getBody(),
      RedFused->getBody()[0],
      ClosedExprAnalysis->makeIdentityMapping(Red1->getIterator(),
                                              RedFused->getIterator()));
  DependenceAnalysis->transferAnalysis(Red2->getBody(),
      RedFused->getBody()[1],
      ClosedExprAnalysis->makeIdentityMapping(Red2->getIterator(),
                                              RedFused->getIterator()));

  ValidityCondition = DependenceAnalysis->isValid(NewRoot);
  if (ValidityCondition.isSatisfiable()) {
    RedFused.addAssumption(ValidityCondition);
    Candidates.push_back(NewRoot);
  }
}

Questions & Answers
-------------------

Q: Is the loop hierarchy a tree or DAG?
A: Both. A green node can be reused multiple times in the same
structure (e.g. due to unrolling, versioning, peeling, etc.), but a
red node is specific to its position in its parent; there can be
multiple red nodes for the same green node.


Q: No SSA?
A: Since the location of a definition depends on the parent in a green
DAG, possibly from different loop hierarchies after transformations,
it is not possible to directly reference the definition from its use.
It is possible in the red tree, but that runs against the goal of
avoiding heavy red nodes. Instead, to find the previous definition,
one has to traverse the green nodes upwards to find at which sibling
level the value is defined. However, this should not be a common
query; loop optimizations should use dependency information instead.


Q: Loops with multiple exits?
A: If every exiting edge gets a unique number for its __???_executed
variable, it can be handled like a switch.


Q: Unstructured control flow?
A: Unstructured control flow only introduces disjunctions into
statement predicates. That is, the predicate

a or b

corresponds to block BB2 in the CFG

BB0:
  br i1 %a, label %BB2, label %BB1

BB1:
  br i1 %b, label %BB2, label %BB3

BB2:

In structured control flow, BB0 and/or BB1 would be post-dominated by
BB2 and the disjunction disappears.


Q: Irreducible loops?
A: Can be folded into the first iteration of the loop by including a
condition for statements not executed during the first iteration.


Q: The unreachable terminator?
A: If due to a non-returning function call, this introduces a loop
exit (which will also introduce a control dependency on everything
that follows) to the surrounding loops. If due to an assumption, we
may add the branching condition to the sufficient condition, but not
to the necessary condition.


Q: Infinite loops?
A: The closed-form expressions identifying statement instances and
the loop trip count have no upper bound. Either use arbitrary-range
arithmetic, use relative dependence vectors, and/or fall back on
pessimistic assumptions. In addition, as with the unreachable
terminator, this may be handled as another loop exit.


Q: IndirectBr?
A: Since the list of possible targets is part of the instruction, it
can be handled like a switch over the target address.


Q: Loop guards?
A: Loop guards manifest as the loop statement's execution predicate.
A loop whose predicate evaluates to false can be considered a loop
with 0 iterations. The IR generator can also emit the predicate as a
loop guard. It might be useful to ensure that if the predicate
evaluates to true, the loop has at least one iteration and the loop
condition is not evaluated for the first iteration (like
LoopRotation).


Q: Exceptions?
A: Could also be handled with control dependencies, but I would
instead mark the surrounding loops with "contains unpredictable
control flow" to be ignored by most transformations. Similar to
`!isSimple()` of LoadInst and StoreInst.


Q: Relation to the new/legacy pass manager?
A: LLVM's pass managers are unfortunately not designed to apply to
subtrees nor persistent data structures besides the LLVM-IR itself.
Instead, the loop tree optimizer would be its own monolithic pass at
the pass-manager level (like MachinePassManager and VPlan). My idea is
to add it somewhere before LoopVectorize but after the inliner,
potentially replacing most other loop transformations. There should be
just one instance of the pass in the pipeline since creating an
alternative representation of LLVM-IR is not that cheap. It does not
need IR normalization (LoopSimplify, LICM, LoopRotate, LCSSA); these
can be performed on the loop tree and emitted into the LLVM-IR if a
non-modified tree is chosen. Possibly some of these passes can be
removed.
In order to preserve user-directed transformations (such as "#pragma
clang loop distribute"), it may also be interesting to put it earlier
in the pipeline to avoid other passes such as LoopFullUnroll changing
the loop before the transformation can be applied. Alternatively, such
passes can be taught to check for the presence of a user-directed
transformation before applying their own.


Q: Relation to LoopVectorize/VPlan?
A: VPlan has similar design goals [9] but is intended for
vectorization only. It also has a hierarchical data structure, but
uses a CFG for acyclic control flow. It is also intended to have
multiple variants of the same source IR from which the best is
selected via a cost function. However, it lacks cheap copies. Instead
of instructions, it uses recipes/"meta-instructions" that handle what
happens to instructions after vectorization, e.g. do that operation on
each vector lane ("WIDEN"). VPlan is more oriented towards modifying
instructions instead of statements as collection of instructions. With
the loop hierarchy, I do not plan to redo the work of VPlan, but also
do not see a reason why it should not also be able to do
vectorization.


Q: Relation to Polly?
A: Polly uses ISL's schedule tree as the representation of the loop
hierarchy. Its creation may fail if something not representable in a
schedule tree is found. Polly tries to find the maximum representable
region, but once it is chosen, it cannot be changed (e.g. because
optimizing it takes too long). The schedule tree is stored internally
by ISL as a kind of green tree, but is walked over by an iterator that
remembers the path to the root. While this could be used to select a
best schedule tree using a cost function, Polly relies mostly on ISL's
reschedule algorithm to find a profitable transformation.


Q: Relation to MLIR?
A: MLIR is more similar to LLVM-IR than to a loop hierarchy. For
instance, it also does not feature cheap copies. An advantage is that
loops and multi-dimensional arrays can be represented in the language
without needing to be rediscovered, but they have to be inserted by a
front-end. That is, if Clang generated MLIR, loops and arrays would
still have to be rediscovered. However, a loop hierarchy optimizer
could be applied to MLIR just as well as to LLVM-IR.


References
----------
[1] http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk11
[2] https://ieeexplore.ieee.org/abstract/document/5389392/
[3] http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf
[4] https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/
[5] https://github.com/Meinersbur/llvm-project/tree/LOF/llvm/lib/LOF
[6] https://llvm.org/devmtg/2017-10/#src2
[7] https://arxiv.org/abs/1811.00632
[8] https://lists.llvm.org/pipermail/llvm-dev/2017-October/118125.html
[9] https://llvm.org/docs/Proposals/VectorizationPlan.html
[10] https://github.com/open64-compiler/open64/tree/master/osprey/be/lno
[11] https://reviews.llvm.org/D70688
[12] https://arxiv.org/abs/1801.07399
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Michael Kruse via llvm-dev

Jan 3, 2020, 10:58:50 AM1/3/20
to Michael Kruse, llvm-dev
I seem to have dropped the "RFC" in the title accidentally. That is,
this is meant to be a Request For Comments.

Michael

Renato Golin via llvm-dev

Jan 10, 2020, 5:10:27 PM1/10/20
to Michael Kruse, llvm-dev
On Fri, 3 Jan 2020 at 11:27, Michael Kruse via llvm-dev
<llvm...@lists.llvm.org> wrote:
> 1. Candidate selection through cost function
> --------------------------------------------
> Instead of needing to know which transformation is profitable before
> applying it, create a copy of the data structure, modify it, and
> compare it to the original. Moreover, holding multiple, differently
> optimized, copies allows evaluating each variant using a cost function
> and select the 'best' when re-generating LLVM-IR again (or re-use the
> original LLVM-IR).

This sounds a lot like VPlan.

> Instantiating every possible sequence of transformations of course is
> not feasible, so the search space needs to be pruned. This could be
> made dependent on the optimization level.

Are you planning on using heuristic searches? This could make the
vectoriser unstable upon small input changes and therefore hard to get
consistent results and testing.

I'm not against such idea, but I think we need to be conservative in
such a core component of the compiler.

It would be nice to have -Ogocrazy to mean "keep going until you find
something", but usually, -O3 should terminate. :)

> 2. Cheap copies using Red/Green Trees

This seems like an elegant approach to a complex problem.


> 3. Apply transformations from high-level to low-level
> -----------------------------------------------------
> Optimization should be applied from very specialized to very general
> (potentially after some canonicalization). For instance, the first
> step could be detecting common idioms such as gemm and replace them
> with either a BLAS function call or apply well-studied optimizations
> like BLIS to them. After such an idiom has been detected, no other
> transformation should be applied to them.

I'm sceptical of such machinery. People usually write bad code (me
included) and trying to match multiple patterns to the same semantics
will be hard, considering how lenient C++ is about pointer handling
and type conversions.

If you do find a match and convert to a library function call, then
well, you can't do anything with it even if you wanted to. :)

> Mid-level transformations may try to map entire loop nests to cache-
> and compute hierarchies (SIMT threads, multiprocessors, offloading,
> etc) by applying transformations such as tiling, loop interchange and
> array packing.

This is hard but very profitable. However, feels to me again that this
is just VPlan packed differently.

While VPlan still has no way yet to handle even simple outer-loops
(has that landed yet?), once we do, then the natural progression will
be to start understanding their semantics and possibly make high level
assumptions like that.


> 6. Late fallback versioning at IR regeneration
> ------------------------------------------
> When a transformation is applied, it can attach conditions (no
> aliasing, no integer wrap, value restrictions, etc.) under which the
> transformation is valid. During LLVM-IR generation, these conditions
> are collected and emitted as run-time conditions. If the condition
> fails, the original code is executed.

This sounds like it will bloat code for a lot of cold cases. Or worse,
get it wrong, and put hot code in the cold path.


> 7. Standardized API for analyses with multiple implementations

These are good to have regardless of which vectorisation strategy we use.


> 8. Abstract representation of statements
> ----------------------------------------

> For instance, assuming that %add is not used a second time, in
> the example below
>
> %add = add i64 %arg, 2
> %mul = shl i64 %add, 1
>
> the two instructions should always be computed together in the same
> loop iteration.

This may inhibit further combines, or even detection of target
specific patterns for SIMD code that aren't common.

I agree that not forcefully binding statements with instructions is a
good idea, but this may need a target-specific pattern matcher to be
more sensitive to target idiosyncrasies.


> 9. Expansion of use-def chains to arrays when spanning loops
> ------------------------------------------------------------

> The transforming pass has to consider this during its profitability
> model. The big advantage is that in terms of correctness, use-def
> chains do not manifest false dependencies.

Sounds good, but also creates the problem of how to handle the array.
If 'n' is unknown, or dependent on SIMD widths or number of threads,
it's too low level to add anything that is guaranteed to not change
the performance profile of the original loop.


> Q: Relation to the new/legacy pass manager?
> A: LLVM's pass managers are unfortunately not designed to apply to
> subtrees nor persistent data structures besides the LLVM-IR itself.

By design. The more alternative persistent data structures you have
being modified by a series of passes, the harder it is to know what
did what and where you are.

> Instead, the loop tree optimizer would be its own monolithic pass on
> the pass manager level (like MachinePassManager and VPlan). My idea is
> to add it somewhere before LoopVectorize, but after the inliner,
> potentially replace most other loop transformations.

To me this almost sounds like Polly: take LLVM IR into a completely
different representation, do a bunch of transformations on it,
re-generate LLVM IR and spit it back into the pipeline.

By that time, all analyses have to be invalidated. All
canonicalisations that had been done will probably be destroyed and
many current pattern matches will stop working. This infrastructure is
only meaningful at the function level or higher, so the potential for
wide range destruction is not trivial.

Don't get me wrong, I like the idea, it's a cool experiment using some
cool data structures and algorithms. But previous experiences with the
pass manager have, well, not gone smooth in any shape or form.


> Q: Relation to LoopVectorize/VPlan?
> A: VPlan has similar design goals [9] but is intended for
> vectorization only.

Again, by a conservative design. I think adding yet another one won't help.

My point is: if this is the way to go, then we should start to think
how we make everything that makes sense become part of this scheme.
Splitting the pass manager into SSA and Tree, run some passes in one
others in the other, and so on.

But creating multiple, incompatible and potentially destructive whole
new pass managers will make a hard job impossible.

> However, it lacks cheap copies. Instead
> of instructions, it uses recipes/"meta-instructions" that handle what
> happens to instructions after vectorization, e.g. do that operation on
> each vector lane ("WIDEN").

Nothing stops us from implementing a leaner approach to VPlan. It
wouldn't be a trivial implementation, but the volume of work that
would be required in this proposal is staggering, too.


> VPlan is more oriented towards modifying
> instructions instead of statements as collection of instructions.

Fair enough, the design was to enhance SIMD code generation, not any
kind of parallel semantics. I guess it would be possible to add the
concept of higher level blocks to VPlan.

All in all, VPlan is young and in constant refactoring, and perhaps it
would be more productive to move it towards a more inclusive approach
than throwing it away before it fully matures to start a whole new
project.

https://xkcd.com/927/


> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> instance, it also does not feature cheap copies.

If you treat MLIR as your red tree, you could create a green tree
(perhaps as a dialect) and have cheap copies (passing the dialects and
deltas without passing the base).


> An advantage is that
> loops and multi-dimensional arrays can be represented in the language
> without the need of being rediscovered, but have to be inserted by a
> front-end.

Not necessarily. We have discussed introducing dialect annotation to
MLIR during compile time from analysis passes that would basically do
what the front-end should have done.

Conclusions?

This was a long email, with too many proposals, so I don't have any
strong opinion or conclusions, not even from my own comments.

Overall, I like a lot of the ideas (red/green, tree optimisation,
different search strategy), but I dislike the encompassing proposal to
*replace* a lot of the existing infrastructure.

For better or worse, LLVM is a product of its age. Some things could
have been done better, but we have always adopted the "general
consensus and slow movement" way to change things. Sometimes too slow,
but...

Now, MLIR can be a way to speed that up.

It is a much more malleable format than LLVM IR, it was designed for
high-level representation, has a lot of parallelism concepts in it and
it's supposed to interact with LLVM IR seamlessly.

It may be much easier to use MLIR to interoperate the two "pass
managers" _together_ than converting from one to the other and vice
versa.

This is a bold claim and I have no evidence it could ever work. But I
think it would still be less work than creating yet another pass
manager from scratch.

cheers,
--renato

Michael Kruse via llvm-dev

Jan 10, 2020, 7:34:33 PM1/10/20
to Renato Golin, llvm-dev
Am Fr., 10. Jan. 2020 um 16:10 Uhr schrieb Renato Golin <reng...@gmail.com>:
> On Fri, 3 Jan 2020 at 11:27, Michael Kruse via llvm-dev
> <llvm...@lists.llvm.org> wrote:
> > 1. Candidate selection through cost function
> > --------------------------------------------
> > Instead of needing to know which transformation is profitable before
> > applying it, create a copy of the data structure, modify it, and
> > compare it to the original. Moreover, holding multiple, differently
> > optimized, copies allows evaluating each variant using a cost function
> > and select the 'best' when re-generating LLVM-IR again (or re-use the
> > original LLVM-IR).
>
> This sounds a lot like VPlan.

Yes, as mentioned in the Q&A. Unfortunately, VPlan is able to represent
arbitrary code but does not have cheap copies.


> > Instantiating every possible sequence of transformations of course is
> > not feasible, so the search space needs to be pruned. This could be
> > made dependent on the optimization level.
>
> Are you planning on using heuristic searches? This could make the
> vectoriser unstable upon small input changes and therefore hard to get
> consistent results and testing.
>
> I'm not against such idea, but I think we need to be conservative in
> such a core component of the compiler.
>
> It would be nice to have -Ogocrazy to mean "keep going until you find
> something", but usually, -O3 should terminate. :)

I agree, as outlined in the RFC under "predefined optimization levels".


> > 3. Apply transformations from high-level to low-level
> > -----------------------------------------------------
> > Optimization should be applied from very specialized to very general
> > (potentially after some canonicalization). For instance, the first
> > step could be detecting common idioms such as gemm and replace them
> > with either a BLAS function call or apply well-studied optimizations
> > like BLIS to them. After such an idiom has been detected, no other
> > transformation should be applied to them.
>
> I'm sceptical of such machinery. People usually write bad code (me
> included), and trying to match multiple patterns to the same semantics
> will be hard, considering how lenient C++ is with pointer handling and
> type conversions.

This conversion is a possibility and certainly not the main motivation
for a loop hierarchy. Smaller idioms exist as well, such as detecting
popcount. Even for gemm, I think it would be nice if it could be
written in a naive version in the source code that compiles with any
compiler, but also benefits from the target platform's hand-optimized
performance primitives by adding a compiler switch (which could be
-O3).

> > Mid-level transformations may try to map entire loop nests to cache-
> > and compute hierarchies (SIMT threads, multiprocessors, offloading,
> > etc) by applying transformations such as tiling, loop interchange and
> > array packing.
>
> This is hard but very profitable. However, it feels to me again that
> this is just VPlan packed differently.
>
> While VPlan still has no way to handle even simple outer loops (has
> that landed yet?), once it does, the natural progression will be to
> start understanding their semantics and possibly make high-level
> assumptions like that.

I wouldn't have thought that parallelization and offloading were ever
considered on top of VPlan.

> > 6. Late fallback versioning at IR regeneration
> > ------------------------------------------
> > When a transformation is applied, it can attach conditions (no
> > aliasing, no integer wrap, value restrictions, etc.) under which the
> > transformation is valid. During LLVM-IR generation, these conditions
> > are collected and emitted as run-time conditions. If the condition
> > fails, the original code is executed.
>
> This sounds like it will bloat code for a lot of cold cases. Or worse,
> get it wrong, and put hot code in the cold path.

Are you arguing against code versioning? It is already done today by
multiple passes such as LoopVersioningLICM, LoopDistribute,
LoopUnrollAndJam and LoopVectorize. The proposal explicitly tries to
avoid code bloat by having just one fallback copy. Runtime conditions
can be chosen more or less optimistically, but I don't see how this
should be an argument for all kinds of versioning.

If you are concerned about bloat in cold paths, we could use profile
information to optimize cold functions with '-Os', something that GCC
does, but not Clang.


> > 7. Standardized API for analyses with multiple implementations
>
> These are good to have regardless of which vectorisation strategy we use.

In LLVM, AliasAnalysis does this, but it has not yet found another application.

> > 8. Abstract representation of statements
> > ----------------------------------------
> > For instance, assuming that %add is not used a second time, in
> > the example below
> >
> > %add = add i64 %arg, 2
> > %mul = shl i64 %add, 1
> >
> > the two instructions should always be computed together in the same
> > loop iteration.
>
> This may inhibit further combines, or even the detection of
> target-specific patterns for SIMD code that aren't common.
>
> I agree that not forcefully binding statements with instructions is a
> good idea, but this may need a target-specific pattern matcher to be
> more sensitive to target idiosyncrasies.

My idea here is that loop-level optimizations rarely need to know
which target-specific instructions are executed, as long as it knows
its performance-relevant properties. This might be a difference to
vectorization which may be more ISA-specific.

> > 9. Expansion of use-def chains to arrays when spanning loops
> > ------------------------------------------------------------
> > The transforming pass has to consider this during its profitability
> > model. The big advantage is that in terms of correctness, use-def
> > chains do not manifest false dependencies.
>
> Sounds good, but also creates the problem of how to handle the array.
> If 'n' is unknown, or dependent on SIMD widths or number of threads,
> it's too low level to add anything that is guaranteed to not change
> the performance profile of the original loop.

As mentioned, the profitability model has to take this into account.
Conservatively, we may only do this if the resulting array is of a
small constant size, such that we can assume that even multiple of
those fit on the stack.


> > Q: Relation to the new/legacy pass manager?
> > A: LLVM's pass managers are unfortunately not designed to apply to
> > subtrees nor persistent data structures besides the LLVM-IR itself.
>
> By design. The more alternative persistent data structures you have
> being modified by a series of passes, the harder it is to know what
> did what and where you are.

The proposal avoids persistent data structures between separate passes.

Note that MachineFunctionPass maintains the MachineFunction data
structure in parallel to the LLVM-IR.


> > Instead, the loop tree optimizer would be its own monolithic pass on
> > the pass manager level (like MachinePassManager and VPlan). My idea is
> > to add it somewhere before LoopVectorize, but after the inliner,
> > potentially replace most other loop transformations.
>
> To me this almost sounds like Polly: take LLVM IR into a completely
> different representation, do a bunch of transformations on it,
> re-generate LLVM IR and spit it back into the pipeline.

There is indeed an inspiration from Polly.


> By that time, all analyses have to be invalidated. All
> canonicalisations that had been done will probably be destroyed and
> many current pattern matches will stop working. This infrastructure is
> only meaningful at the function level or higher, so the potential for
> wide range destruction is not trivial.
>
> Don't get me wrong, I like the idea, it's a cool experiment using some
> cool data structures and algorithms. But previous experiences with the
> pass manager have, well, not gone smooth in any shape or form.

What experiments? I don't see a problem if the pass manager has to
invalidate analyses and re-run canonicalization passes. This happens
many times in the default pass pipelines. In addition, this
invalidation is only necessary if the loop optimization pass optimizes
something, in which case the additional cost should be justified.


> > Q: Relation to LoopVectorize/VPlan?
> > A: VPlan has similar design goals [9] but is intended for
> > vectorization only.
>
> Again, by a conservative design. I think adding yet another one won't help.
>
> My point is: if this is the way to go, then we should start to think
> how we make everything that makes sense become part of this scheme.
> Splitting the pass manager into SSA and Tree, run some passes in one
> others in the other, and so on.
>
> But creating multiple, incompatible and potentially destructive whole
> new pass managers will make a hard job impossible.

I don't think the proposal qualifies as including a full-flexible new
pass manager, at least no more than the current mechanism LoopVectorize
uses to run passes on VPlan (LoopVectorizationPlanner::plan).


> > However, it lacks cheap copies. Instead
> > of instructions, it uses recipes/"meta-instructions" that handle what
> > happens to instructions after vectorization, e.g. do that operation on
> > each vector lane ("WIDEN").
>
> Nothing stops us from implementing a leaner approach to VPlan. It
> wouldn't be a trivial implementation, but the volume of work that
> would be required in this proposal is staggering, too.
>
> > VPlan is more oriented towards modifying
> > instructions instead of statements as collection of instructions.
>
> Fair enough, the design was to enhance SIMD code generation, not any
> kind of parallel semantics. I guess it would be possible to add the
> concept of higher level blocks to VPlan.
>
> All in all, VPlan is young and in constant refactoring, and perhaps it
> would be more productive to move it towards a more inclusive approach
> than throwing it away before it fully matures to start a whole new
> project.

While I still think the goals of VPlan and a loop hierarchy are
different, I expect VPlan to be production-ready earlier than this
proposal. I fear that combining them would delay them both.


> https://xkcd.com/927/

While I always find this xkcd funny, the loop hierarchy is not
intended to be universal.


> > Q: Relation to MLIR?
> > A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> > instance, it also does not feature cheap copies.
>
> If you treat MLIR as your red tree, you could create a green tree
> (perhaps as a dialect) and have cheap copies (passing the dialects and
> deltas without passing the base).

I don't see how this could work.


> > An advantage is that
> > loops and multi-dimensional arrays can be represented in the language
> > without the need of being rediscovered, but have to be inserted by a
> > front-end.
>
> Not necessarily. We have discussed introducing dialect annotation to
> MLIR during compile time from analysis passes that would basically do
> what the front-end should have done.

The argument is that MLIR has first-class expressions for
multi-dimensional array accesses ("MemRef") while LLVM-IR does not.

https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html

Both of them can have analyses to raise the abstraction to a
multi-dimensional access ("delinearization").


> Conclusions?
>
> This was a long email, with too many proposals, so I don't have any
> strong opinion or conclusions, not even from my own comments.

Thank you for going through it!


> Overall, I like a lot of the ideas (red/green, tree optimisation,
> different search strategy), but I dislike the encompassing proposal to
> *replace* a lot of the existing infrastructure.

Not a replacement, but an addition that does not always need to be
enabled (e.g. -O0).

In a previous RFC [8] I tried to NOT introduce a new data structure but
to re-use LLVM-IR. The only discussion the RFC sparked was about not
'abusing' the LLVM-IR.

https://lists.llvm.org/pipermail/llvm-dev/2017-October/118169.html
https://lists.llvm.org/pipermail/llvm-dev/2017-October/118258.html

I definitely see the merits of using fewer data structures, but it is
also hard to re-use something existing for a different purpose (in
this case: VPlan) without making both more complex.


> For better or worse, LLVM is a product of its age. Some things could
> have been done better, but we have always adopted the "general
> consensus and slow movement" way to change things. Sometimes too slow,
> but...
>
> Now, MLIR can be a way to speed that up.
>
> It is a much more malleable format than LLVM IR, it was designed for
> high-level representation, has a lot of parallelism concepts in it and
> it's supposed to interact with LLVM IR seamlessly.
>
> It may be much easier to use MLIR to interoperate the two "pass
> managers" _together_ than converting from one to the other and vice
> versa.
>
> This is a bold claim and I have no evidence it could ever work. But I
> think it would still be less work than creating yet another pass
> manager from scratch.

This is why I don't want the framework to be too tangled with LLVM-IR.
For the foreseeable future, Clang will generate LLVM-IR, but our
motivation is to (also) optimize C/C++ code. That is, I do not see a
way to not (also) handle LLVM-IR until Clang is changed to generate
MLIR (which then again will be another data structure in the system).


Michael

Renato Golin via llvm-dev

Jan 11, 2020, 12:43:55 PM1/11/20
to Michael Kruse, llvm-dev
On Sat, 11 Jan 2020 at 00:34, Michael Kruse <llv...@meinersbur.de> wrote:
> Yes, as mentioned in the Q&A. Unfortunately, VPlan is able to represent
> arbitrary code but does not have cheap copies.

Orthogonal, but we should also be looking into implementing the cheap
copies in VPlan if we want to search for composable plans.


> This conversion is a possibility and certainly not the main motivation
> for a loop hierarchy.

I know. There are many things that can be done with what you propose,
but we should focus on what's the main motivation.

From what I can tell, the tree representation is a concrete proposal
for the many-year discussion about parallel IR.

The short paper doesn't mention that, nor does it discuss other
opportunities to fix pipeline complexity (which is inherent to any
compiler).

I still believe that many of the techniques you propose are meaningful
ways to solve them, but creating another IR will invariably create
some adoption barriers.

Especially when we already have VPlan and MLIR converging now, which
will need to find their own spaces, too.


> I wouldn't have thought that parallelization and offloading were ever
> considered on top of VPlan.

I don't see why not. VPlan is a structure for picking a path through
composable transformations.

While so far it has mainly been focused on replacing the monolithic
vectorisation, there are concrete plans to look at composition and
more complex idioms.


> Are you arguing against code versioning? It is already done today by
> multiple passes such as LoopVersioningLICM, LoopDistribute,
> LoopUnrollAndJam and LoopVectorize. The proposal explicitly tries to
> avoid code bloat by having just one fallback copy. Runtime conditions
> can be chosen more or less optimistically, but I don't see how this
> should be an argument for all kinds of versioning.

No. I'm cautious about the combination of heuristic search and
versioning, especially when the conditions are runtime-based. It may
be hard to CSE them later.

The paths found may not be optimal in terms of intermediate states.


> > Don't get me wrong, I like the idea, it's a cool experiment using some
> > cool data structures and algorithms. But previous experiences with the
> > pass manager have, well, not gone smooth in any shape or form.
>
> What experiments? I don't see a problem if the pass manager has to
> invalidate analyses and re-run canonicalization passes. This happens
> many times in the default pass pipelines. In addition, this
> invalidation is only necessary if the loop optimization pass optimizes
> something, in which case the additional cost should be justified.

My point goes back to doing that in VPlan, then tree. The more
back-and-forth IR transformations we add to the pipeline, the more
brittle it will be.

The original email also proposes, for the future, to do all sorts of
analyses and transformations in the tree representation, and that will
likely be incompatible with (or at least not propagated through) the
conversions.


> I don't think the proposal qualifies as including a full-flexible new
> pass manager, at least no more than the current mechanism LoopVectorize
> uses to run passes on VPlan (LoopVectorizationPlanner::plan).

Sorry, that came out stronger than it should have been. I agree it's
not a "whole new pass manager".


> While I still think the goals of VPlan and a loop hierarchy are
> different, I expect VPlan to be production-ready earlier than this
> proposal. I fear that combining them would delay them both.

I get it, but I fear taking a completely different approach may make
it harder to get your proposal to show benefits any time soon.


> > https://xkcd.com/927/
>
> While I always find this xkcd funny, the loop hierarchy is not
> intended to be universal.

Sorry, poetic license. :)

I tried to reflect the perils of creating too many, sometimes competing, IRs.


> In a previous RFC [8] I tried to NOT introduce a new data structure but
> to re-use LLVM-IR. The only discussion the RFC sparked was about not
> 'abusing' the LLVM-IR.
>
> https://lists.llvm.org/pipermail/llvm-dev/2017-October/118169.html
> https://lists.llvm.org/pipermail/llvm-dev/2017-October/118258.html
>
> I definitely see the merits of using fewer data structures, but it is
> also hard to re-use something existing for a different purpose (in
> this case: VPlan) without making both more complex.

My point about avoiding more structures and IRs was related to VPlan
and MLIR, not LLVM-IR.

I agree there should be an abstraction layer to do parallelisation
analysis, but we already have two, and I'd rather add many of your
good proposals on those than create a third.

Perhaps it's not clear how we could do that now, but we should at
least try to weigh the options.

I'd seriously look at adding a tree-like annotation as an MLIR
dialect, and use it for lean copies.


> For the foreseeable future, Clang will generate LLVM-IR, but our
> motivation is to (also) optimize C/C++ code. That is, I do not see a
> way to not (also) handle LLVM-IR until Clang is changed to generate
> MLIR (which then again will be another data structure in the system).

Even if/when Clang generates MLIR, there's no guarantee the high-level
dialects will be preserved until the vectorisation pass. And other
front-ends may not generate the same quality of annotations.

We may have to re-generate what we need anyway, so there is no point in
waiting for all the front-ends to do what we need, or for all the
previous passes to guarantee to preserve it.

cheers,
--renato

Chris Lattner via llvm-dev

Jan 13, 2020, 1:07:33 AM1/13/20
to Michael Kruse, llvm-dev
On Jan 3, 2020, at 3:26 AM, Michael Kruse via llvm-dev <llvm...@lists.llvm.org> wrote:
In the 2018 LLVM DevMtg [1], I presented some shortcomings of how LLVM
optimizes loops. In summary, the biggest issues are (a) the complexity
of writing a new loop optimization pass (including needing to deal
with a variety of low-level issues, a significant amount of required
boilerplate, the difficulty of analysis preservation, etc.), (b)
independent optimization heuristics and a fixed pass ordering and (c)
code explosion due to versioning. Currently, different people are
working on improving different loop optimization passes such as
LoopInterchange, LoopFuse and LoopVectorize. Even if LLVM had
'perfect' individual loop passes, they would still have the
aforementioned problems due to still being independent and the system
as a whole will be suboptimal.

Hi Michael,

Thank you for bringing this up.  This is an area of interest, and I certainly share you view of what a pain this all is right now.  I can tell you’ve put a lot of thought into this and time into your RFC!


The central idea is to use a modifiable loop tree -- similar to
LoopInfo -- as the primary representation. LLVM-IR is converted to a
loop tree, then optimized and finally LLVM-IR is generated again for
subtrees that are considered profitable. This is not a new concept, it
has already been used in compilers such as IBM XL Fortran (called ASTI
[4]) and Silicon Graphics/Open64 (called LNO [10]), and in research
such as the Value State Dependence Graph [3].

Ignoring the details of its representation, this is also conceptually how Polly works: code is lifted into its representation, transformed, then lowered back down.

4. Represents loops and predicates instead of CFGs

Yes, totally!



Overall, I think that this discussion would be easier to process if we broke it into a few pieces.  There seems to be consensus that LLVM IR (as is) is not the right representation for aggressive loop transformations.  If we don’t have consensus on this, then I’d make sure to start there.


Once that is established, there is a question of “what is the right representation to use”?  This question has two subcomponents: what data structure should we use, and what is the IR within it.

If you propose introducing a brand new data structure, please expect me to push back on that pretty hard.  This is a perfect application of MLIR, and using it provides a lot of value (including amazing testing tools, round-tripping, location tracking, etc.) which would otherwise have to be reinvented, and it does not dictate the IR structure otherwise.  MLIR absolutely supports nested loop structures etc., as is seen in the affine dialect.

The MLIR community also is highly invested in HPC-style transformations on this, and a lot of thought has gone into it.  You can learn more about this in the slides and videos from the MLIR open design meetings.


Once you achieve consensus on the data structure, there is the question of what IR to use within it.  I would recommend starting with some combination of “existing LLVM IR operations + high-level control flow representation”, e.g. parallel and affine loops.  The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway.  Over time, more high-level concepts can be gradually introduced.  FYI, MLIR already has a reasonable LLVM dialect and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.


Once you have the data structure and the dialect within it decided, you have the set of transformations.  Again, you’ve given a lot of thought to this, and that all sounds great to me, but the priorities can be driven by whoever wants to contribute and what concrete problems they have to solve.


Once the infra for “raising to this representation and lowering back down” is figured out, we can open the box of having clang and other front ends directly generate it.

Q: Relation to MLIR?
A: MLIR is more similar to LLVM-IR than a loop hierarchy.

This is not true, MLIR is great for dialects that want to model loop hierarchies naturally, this is a major focus of the affine dialect (e.g. see affine.for on that page).  MLIR is not limited to affine loops, that is just a choice made by the affine dialect - the loop dialect has more general constructs that are less developed.

For
instance, it also does not feature cheap copies.

I’m not sure what this means.

An advantage is that
loops and multi-dimensional arrays can be represented in the language
without the need of being rediscovered, but have to be inserted by a
front-end.

This is correct, but I don’t see how this helps if your focus is raising code that has already been lowered to LLVM IR, e.g. by Clang or some other frontend that generates LLVM IR today.

That is, if Clang was generating MLIR, loops and arrays
still have to be rediscovered.

This isn’t true; it would be perfectly sensible to lower C control flow structures directly to structured MLIR loops.  The primary concerns are things like unstructured switches (think Duff’s device) and unstructured gotos, but these occur rarely: they need to be handled correctly, but that can be done with a traditionally lowered CFG and other “best effort” attempts to raise them.

Other frontends like Swift and Flang could also generate this directly if they chose to, getting the benefits of progressive lowering.

However, a loop hierarchy optimizer
could be applied to MLIR just as well as to LLVM-IR.

Right!  In addition to structured control flow, MLIR has great support for CFG representations like LLVM of course. :-)

-Chris

Renato Golin via llvm-dev

Jan 13, 2020, 4:52:09 AM1/13/20
to Chris Lattner, llvm-dev
On Mon, 13 Jan 2020 at 06:07, Chris Lattner via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Overall, I think that this discussion would be easier to process if we broke it into a few pieces. There seems to be consensus that LLVM IR (as is) is not the right representation for aggressive loop transformations. If we don’t have consensus on this, then I’d make sure to start there.

From all meetings we had about parallelism representations in IR over
the past few dev meetings, this was pretty much the only agreement. :)

We couldn't find a clear way to represent most parallel concepts in IR
(without breaking the others), but there wasn't a single format on top
of LLVM IR that we could all agree on.


> If you propose introducing a brand new data structure, please expect me to push back on that pretty hard. This is a perfect application of MLIR, and using it provides a lot of value (including amazing testing tools, round-tripping, location tracking, etc.) which would otherwise have to be reinvented, and it does not dictate the IR structure otherwise. MLIR absolutely supports nested loop structures etc., as is seen in the affine dialect.

Today, I believe MLIR can serve that purpose, with (possibly)
overlapping dialects, and can have cheap copies.

Cheap copies are required for heuristic searches, which have at least
polynomial compute/memory cost, but sometimes exponential.

Copying the whole module N^2 times (N = some pre-defined large budget)
won't work, that's why overlays (like the red/green tree idea) are
needed.

Journaling and snapshotting are also possible ways to keep memory
almost constant (and avoid exploding compute), and that should be
possible with MLIR dialects, I think.


> Once you have the data structure and the dialect within it decided, you have the set of transformations. Again, you’ve given a lot of thought to this, and that all sounds great to me, but the priorities can be driven by whoever wants to contribute and what concrete problems they have to solve.

If we don't have a generic enough representation, early designs will
change how later ones will be implemented.

That's why I think MLIR would be the right choice, as it's much more
flexible and composable.

cheers,
--renato

Michael Kruse via llvm-dev
Jan 14, 2020, 11:39:43 PM
to Renato Golin, llvm-dev
On Sat, Jan 11, 2020 at 07:43, Renato Golin <reng...@gmail.com> wrote:
> On Sat, 11 Jan 2020 at 00:34, Michael Kruse <llv...@meinersbur.de> wrote:
> > Yes, as mentioned in the Q&A. Unfortunately VPlan is neither able to
> > represent arbitrary code nor has cheap copies.
>
> Orthogonal, but we should also be looking into implementing the cheap
> copies in VPlan if we want to search for composable plans.

VPlan structures have many references to neighboring structures such as parents and use-def chains. This makes adding cheap copies as an afterthought really hard.



> > This conversion is a possibility and certainly not the main motivation
> > for a loop hierarchy.
>
> I know. There are many things that can be done with what you propose,
> but we should focus on what's the main motivation.
>
> From what I can tell, the tree representation is a concrete proposal
> for the many year discussion about parallel IR.

As I recall, the Parallel IR approaches were trying to add parallel constructs to the existing LLVM-IR. This added the issue that the current infrastructure suddenly needed to handle those as well, which became a major problem for adoption.



> The short paper doesn't mention that, nor does it discuss other
> opportunities to fix pipeline complexity (which is inherent to any
> compiler).
>
> I still believe that many of the techniques you propose are meaningful
> ways to solve them, but creating another IR will invariably create
> some adoption barriers.

I see it as an advantage with respect to adoption: it can be switched on and off without affecting other parts.



> > Are you arguing against code versioning? It is already done today by
> > multiple passes such as LoopVersioningLICM, LoopDistribute,
> > LoopUnrollAndJam and LoopVectorize. The proposal explicitly tries to
> > avoid code bloat by having just one fallback copy. Runtime conditions
> > can be chosen more or less optimistically, but I don't see how this
> > should be an argument for all kinds of versioning.
>
> No. I'm cautious about the combination of heuristic search and
> versioning, especially when the conditions are runtime based. It may
> be hard to CSE them later.
>
> The paths found may not be the most optimal in terms of intermediate states.

Versioning is always a trade-off between how likely the preconditions are to apply and code size (and maybe how expensive the runtime checks are). IMHO this concern is separate from how code versioning is implemented.



> > > Don't get me wrong, I like the idea, it's a cool experiment using some
> > > cool data structures and algorithms. But previous experiences with the
> > > pass manager have, well, not gone smooth in any shape or form.
> >
> > What experiments? I don't see a problem if the pass manager has to
> > invalidate analyses and re-run canonicalization passes. This happens
> > many times in the default pass pipelines. In addition, this
> > invalidation is only necessary if the loop optimization pass optimizes
> > something, in which case the additional cost should be justified.
>
> My point goes back to doing that in VPlan, then tree. The more
> back-and-forth IR transformations we add to the pipeline, the more
> brittle it will be.

Agreed, but IMHO this is the price to pay for better loop optimizations.



> The original email also proposes, for the future, to do all sorts of
> analyses and transformations in the tree representation, and that will
> likely be incompatible with (or at least not propagated through) the
> conversions.

Correct, but I'd argue these are different kinds of analyses, not necessarily even useful for other representations. MLIR also has its own set of analyses, separate from those on LLVM-IR.



> > In a previous RFC [8] I tried to NOT introduce a data structure but to
> > re-use LLVM-IR. The only discussion there was about the RFC, was about
> > not to 'abuse' the LLVM-IR.
> >
> > https://lists.llvm.org/pipermail/llvm-dev/2017-October/118169.html
> > https://lists.llvm.org/pipermail/llvm-dev/2017-October/118258.html
> >
> > I definitely see the merits of using fewer data structures, but it is
> > also hard to re-use something existing for a different purpose (in
> > this case: VPlan) without making both more complex.
>
> My point about avoiding more structures and IRs was related to VPlan
> and MLIR, not LLVM-IR.
>
> I agree there should be an abstraction layer to do parallelisation
> analysis, but we already have two, and I'd rather add many of your
> good proposals on those than create a third.
>
> Perhaps it's not clear how we could do that now, but we should at
> least try to weigh the options.
>
> I'd seriously look at adding a tree-like annotation as an MLIR
> dialect, and use it for lean copies.

Like VPlan, MLIR is a representation with many references between objects from different levels. I do not see how to add cheap copies as an afterthought.




> > For the foreseeable future, Clang will generate LLVM-IR, but our
> > motivation is to (also) optimize C/C++ code. That is, I do not see a
> > way to not (also) handle LLVM-IR until Clang is changed to generate
> > MLIR (which then again will be another data struture in the system).
>
> Even if/when Clang generates MLIR, there's no guarantee the high-level
> dialects will be preserved until the vectorisation pass.

I'd put loop optimizations earlier into the pipeline than vectorization. Where exactly is a phase-ordering problem in itself. I'd want to at least preserve multi-dimensional subscripts. Fortunately, MemRef is a core MLIR construct and unlikely to be lowered before conversion to another representation (likely LLVM-IR).



> And other
> front-ends may not generate the same quality of annotations.
> We may have to re-generate what we need anyway, so no point in waiting
> all the front-ends to do what we need as well as all the previous
> passes to guarantee to keep it.

I don't see how this is relevant for a Clang-based pipeline. Other languages likely need a different pipeline than one intended for C/C++ code.

There are not a lot of high-level semantics required to be preserved to build a loop hierarchy.

Thanks for the productive discussion,
Michael

Michael Kruse via llvm-dev
Jan 14, 2020, 11:40:23 PM
to Chris Lattner, llvm-dev
On Sun, Jan 12, 2020 at 20:07, Chris Lattner <clat...@nondot.org> wrote:
The central idea is to use a modifiable loop tree -- similar to
LoopInfo -- as the primary representation. LLVM-IR is converted to a
loop tree, then optimized and finally LLVM-IR is generated again for
subtrees that are considered profitable. This is not a new concept, it
has already been used in compilers such as IBM XL Fortran (called ASTI
[4]) and Silicon Graphics/Open64 (called LNO [10]), and in research
such as the Value State Dependence Graph [3].

Ignoring the details of its representation, this is also conceptually how Polly works: code is lifted into its representation, transformed, then lowered back down.

Indeed, I tried to improve on Polly's internal representation and on the issue that Polly can only represent a subset of LLVM-IR code.

 
Overall, I think that this discussion would be easier to process if we broke it into a few pieces.  There seems to be consensus that LLVM IR (as is) is not the right representation for aggressive loop transformations.  If we don’t have consensus on this, then I’d make sure to start there.


Once that is established, there is a question of “what is the right representation to use”?  This question has two subcomponents: what data structure should we use, and what is the IR within it.

If you propose introducing a brand new data structure, please expect me to push back on that pretty hard.  

Which I think is a good thing, since I also do not want too many data structures that are only more-or-less well maintained. But I also think there is a good argument for a loop-centric data structure.

 
This is a perfect application of MLIR, and using it provides a lot of value (including amazing testing tools, round-tripping, location tracking, etc) which would otherwise have to be reinvented, and doesn’t dictate the IR structure otherwise.  MLIR absolutely supports nested loop structures etc, as is seen in the affine dialect.

The MLIR community also is highly invested in HPC-style transformations on this, and a lot of thought has gone into it.  You can learn more about this in the slides and videos from the MLIR open design meetings.

I have been following the development of MLIR.
 

Once you achieve consensus on the data structure, there is the question of what IR to use within it.  I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops.  The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway.  Over time, more high level concepts can be gradually introduced.  FYI, MLIR already has a reasonable LLVM dialect and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.

Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.
 

Once you have the data structure and the dialect within it decided, you have the set of transformations.  Again, you’ve given a lot of thought to this, and that all sounds great to me, but the priorities can be driven by whoever wants to contribute and what concrete problems they have to solve.


Once the infra for “raising to this representation and lowering back down” is figured out, we can open the box of having clang and other front ends directly generate it.

This suggestion would also apply to VPlan. Ignoring that work on VPlan started before MLIR, would you have suggested implementing VPlan on MLIR as well? Would you maybe even advise retargeting VPlan to MLIR now?

Q: Relation to MLIR?
A: MLIR is more similar to LLVM-IR than a loop hierarchy.

This is not true, MLIR is great for dialects that want to model loop hierarchies naturally, this is a major focus of the affine dialect (e.g. see affine.for on that page).  MLIR is not limited to affine loops, that is just a choice made by the affine dialect - the loop dialect has more general constructs that are less developed.


This is definitely a subjective question. I think that MLIR is closer to LLVM-IR in how it is processed. Both have a sequence of passes running over a single source of truth. Both allow walking the entire structure from every instruction/operation/block. Analyses are on the function or module level. Both have CFGs (I think for a certain kind of transformations it is an advantage if control flow is handled implicitly).

For
instance, it also does not feature cheap copies.

I’m not sure what this means.


The possibility to make local changes speculatively without copying the entire data structure. IMHO this is a central idea: it allows applying a transformation speculatively to pass it to a legality check and cost heuristic without committing to apply it. As a consequence, passes do not need to implement these in a transformation-specific manner, drastically reducing the implementation burden.

For instance, more loop transformations are feasible if instructions are moved into the innermost loops. With speculative transformations, we can canonicalize the representation by sinking computations into loops -- the opposite of what LICM does -- and then see whether a transformation can be applied. If not, the speculative representation is discarded without having an effect on the original representation (and without needing to hoist those computations again).

Because the MLIR classes have many references to related objects (such as pointers to parents and use-def chains), I don't think it is feasible to implement this on top of MLIR.

An advantage is that
loops and multi-dimensional arrays can be represented in the language
without the need of being rediscovered, but have to be inserted by a
front-end.

This is correct, but I don’t see how this helps if your focus is raising code that has already been lowered to LLVM IR, e.g. by Clang or some other frontend that generates LLVM IR today.

Indeed, I would hope that LLVM-IR can preserve multi-dimensional array accesses in some fashion as well (https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html). However, currently MLIR has the advantage of being able to represent them.
 

That is, if Clang was generating MLIR, loops and arrays
still have to be rediscovered.

This isn’t true, it would be perfectly sensible to lower C control flow structures directly to MLIR.  The primary concerns are things like unstructured switches (think Duff’s device) and unstructured gotos, but these occur rarely: they need to be handled correctly, which can be done with a traditionally lowered CFG and other “best effort” attempts to raise them.

Moreover, syntactical loop structures are not a reliable indicator that there is an actual loop. Often enough, do, for and while are used for purely syntactical reasons (`do { } while(0)`). Yes, you could eliminate them when a non-loop is detected, but handling break, continue, etc. correctly is a lot of effort. Another case are coroutines that are lowered with gotos into loops, unless you think loop optimizers should handle coroutines directly.

On the other hand, natural loop detection on CFGs is quite mature (with the remaining issue of irreducible loops, which might appear but can also be eliminated again). As a plus, optimization depends less on how the source code is written.

Renato Golin via llvm-dev
Jan 15, 2020, 8:30:23 AM
to Michael Kruse, llvm-dev
On Wed, 15 Jan 2020 at 04:39, Michael Kruse <llv...@meinersbur.de> wrote:
> As I recall, the Parallel IR approaches were trying to add parallel constructs to the existing LLVM-IR. This added the issue that the current infrastructure suddenly needed to handle those as well, which became a major problem for adoption.

Yes, and that's why we could never agree on the one representation. A
completely separate one solves that problem, but introduces another,
itself.


> I see it as an advantage with respect to adoption: it can be switched on and off without affecting other parts.

That's not necessarily true.

If we do like Polly, it is, but then the ability to reuse code is very
low and the time spent converting across is high. If we want to reuse,
then we'll invariably add behavioural dependencies and disabling the
pass may have side-effects.


> Versioning is always a trade-off between how likely the preconditions are to apply and code size (and maybe how expensive the runtime checks are). IMHO this concern is separate from how code versioning is implemented.

Agreed.


> Agreed, but IMHO this is the price to pay for better loop optimizations.

This may be true, and I can easily accept that, as long as we'll all
aware of the costs of doing so up front.


> I'd put loop optimizations earlier into the pipeline than vectorization. Where exactly is a phase-ordering problem in itself. I'd want to at least preserve multi-dimensional subscripts. Fortunately, MemRef is a core MLIR construct and unlikely to be lowered before conversion to another representation (likely LLVM-IR).

Many front-ends do that even before lowering to IR because of the
richer semantics of the AST, but it's also common for that to
introduce bugs down the line (don't want to name any proprietary
front-ends here).

I agree moving high-level optimisation passes up and doing so in a
high-level IR is a good idea.


> I don't see how this is relevant for a Clang-based pipeline. Other languages likely need a different pipeline than one intended for C/C++ code.

Yes, but we want our passes to work for all languages and be less
dependent on how well they lower their code.

If they do it well, awesome. If not, and if we can identify patterns
in LLVM IR then there is no reason not to.

Chris Lattner via llvm-dev
Jan 16, 2020, 1:28:07 AM
to Michael Kruse, llvm-dev
On Jan 14, 2020, at 8:39 PM, Michael Kruse <llv...@meinersbur.de> wrote:
Once that is established, there is a question of “what is the right representation to use”?  This question has two subcomponents: what data structure should we use, and what is the IR within it.

If you propose introducing a brand new data structure, please expect me to push back on that pretty hard.  

Which I think is a good thing, since I also do not want too many data structures that are only more-or-less well maintained. But I also think there is a good argument for a loop-centric data structure.

Agreed, I think it is incredibly important for a first class loop optimizer to have first class structured loops, parallel loops etc.


Once you achieve consensus on the data structure, there is the question of what IR to use within it.  I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops.  The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway.  Over time, more high level concepts can be gradually introduced.  FYI, MLIR already has a reasonable LLVM dialect and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.

Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.

Isn’t this true of *any* higher level IR?  Unless I’m missing something big, this seems inherent to your proposal.

Once you have the data structure and the dialect within it decided, you have the set of transformations.  Again, you’ve given a lot of thought to this, and that all sounds great to me, but the priorities can be driven by whoever wants to contribute and what concrete problems they have to solve.


Once the infra for “raising to this representation and lowering back down” is figured out, we can open the box of having clang and other front ends directly generate it.

This suggestion would also apply to VPlan. Ignoring that work on VPlan started before MLIR, would you have suggested implementing VPlan on MLIR as well? Would you maybe even advise retargeting VPlan to MLIR now?

I don’t know enough to say: the tradeoffs depend a lot of where VPlan is, the challenges it faces etc.  I don’t know much about VPlan or the engineering priorities behind it.


Here’s an observation though: if you ignore the engineering expense, it would clearly make sense to reimplement the mid-level LLVM optimizers on top of MLIR and replace include/llvm/IR with a dialect definition in MLIR instead.  

MLIR as an IR is strictly equal to or better than the LLVM IR data structures in all ways that I’m aware of.  In addition to representational flexibility, MLIR allows (and provides) a multithreaded pass manager (function passes run in parallel), has a better representation of PHI nodes, allows better terminators (eliminating the need for the really ugly/unfortunate landingpad/catchpad etc hacks), has a better representation for “operands that must be constants” (immarg etc), provides a better representation for location information (important for debugging optimized code and diagnostic emission from the optimizer), and better testing tools (by building on the better location info).

The additional representational flexibility would allow a much more flexible compiler design - one where you could do progressive lowering of high level loops, OpenMP, separate out ABI lowering from Clang IRGen, etc.

I’m very fond of LLVM IR obviously, but a lot has been learned in the nearly 20 years since it was designed and implemented, and MLIR was implemented with a superset of the experience that built LLVM :)

Q: Relation to MLIR?
A: MLIR is more similar to LLVM-IR than a loop hierarchy.

This is not true, MLIR is great for dialects that want to model loop hierarchies naturally, this is a major focus of the affine dialect (e.g. see affine.for on that page).  MLIR is not limited to affine loops, that is just a choice made by the affine dialect - the loop dialect has more general constructs that are less developed.


This is definitely a subjective question. I think that MLIR is closer to LLVM-IR in how it is processed. Both have a sequence of passes running over a single source of truth. Both allow walking the entire structure from every instruction/operation/block. Analyses are on the function or module level. Both have CFGs (I think for a certain kind of transformations it is an advantage if control flow is handled implicitly).

Right, but a frequent way that MLIR is used is without its CFG: most machine learning kernels use nests of loops and ifs, not CFGs.  CFGs are exposed when those are lowered out.  See some simple examples like:



For
instance, it also does not feature cheap copies.

I’m not sure what this means.


The possibility to make local changes speculatively without copying the entire data structure. IMHO this is a central idea: it allows applying a transformation speculatively to pass it to a legality check and cost heuristic without committing to apply it. As a consequence, passes do not need to implement these in a transformation-specific manner, drastically reducing the implementation burden.

For instance, more loop transformations are feasible if instructions are moved into the innermost loops. With speculative transformations, we can canonicalize the representation by sinking computations into loops -- the opposite of what LICM does -- and then see whether a transformation can be applied. If not, the speculative representation is discarded without having an effect on the original representation (and without needing to hoist those computations again).

Because the MLIR classes have many references to related objects (such as pointers to parents and use-def chains), I don't think it is feasible to implement this on top of MLIR.

Ah yes, I see what you mean.  One way to do that is to represent multiple options as an op with a region for each option.  This means you only fork the part of the IR that you’re producing variants of.  I think this is the red/green tree technique you mentioned, but I’m not sure.


An advantage is that
loops and multi-dimensional arrays can be represented in the language
without the need of being rediscovered, but have to be inserted by a
front-end.

This is correct, but I don’t see how this helps if your focus is raising code that has already been lowered to LLVM IR, e.g. by Clang or some other frontend that generates LLVM IR today.

Indeed, I would hope that LLVM-IR can preserve multi-dimensional array accesses in some fashion as well (https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html). However, currently MLIR has the advantage of being able to represent them.

I don’t think LLVM IR will ever get there without a massive design change.  It is possible that it will support statically shaped accesses in limited ways, though.

 
That is, if Clang was generating MLIR, loops and arrays
still have to be rediscovered.

This isn’t true, it would be perfectly sensible to lower C control flow structures directly to MLIR.  The primary concerns are things like unstructured switches (think Duff’s device) and unstructured gotos, but these occur rarely: they need to be handled correctly, which can be done with a traditionally lowered CFG and other “best effort” attempts to raise them.

Moreover, syntactical loop structures are not a reliable indicator that there is an actual loop. Often enough, do, for and while are used for purely syntactical reasons (`do { } while(0)`). Yes, you could eliminate them when a non-loop is detected, but handling break, continue, etc. correctly is a lot of effort. Another case are coroutines that are lowered with gotos into loops, unless you think loop optimizers should handle coroutines directly.

Yes, you’d want to canonicalize the form of course.

On the other hand, natural loop detection on CFGs is quite mature (with the remaining issue of irreducible loops, which might appear but can also be eliminated again). As a plus, optimization depends less on how the source code is written.

Yep totally.  The question is whether you lose semantic information from lowering to a CFG and reconstructing back up.  This can affect you when you have higher level language semantics (e.g. Fortran parallel loops, openmp or other concurrency constructs etc).  This is where MLIR excels of course.

-Chris

Renato Golin via llvm-dev
Jan 16, 2020, 4:56:27 AM
to Chris Lattner, llvm-dev
On Thu, 16 Jan 2020 at 06:27, Chris Lattner via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Ah yes, I see what you mean. One way to do that is to represent multiple options as an op with region for each option. This means you only fork the part of the IR that you’re producing variants of. I think this is the red/green tree technique you mentioned, but I’m not sure.

IIUC, the green tree is a lighter version of the tree (leaner memory
footprint) but still the entire tree. It's ok to lose that info because
you don't usually need it for your transformations, and you can
always go back to the red tree (via a pointer in the green tree) to ask
harder questions. Managing the semantics of the two becomes
non-trivial when you start adding and replacing nodes; there's a point
where you can't go back to the red tree in the same way anymore.

What I referred to as "journalling" is what you propose here. Add
metadata to the actual graph and during the heuristic search, only
clone those. If you make sure you can append those nodes to the graph,
and guarantee that the extra nodes are composable (ie semantically
valid in any order they may be applied), then having the original
graph + any intermediate state is valid. Therefore, keeping only the
extra nodes, and copying them along to try different routes, becomes
even cheaper than a green tree.

If those extra nodes are an MLIR dialect, with defined semantics and
structured composition, then using them in a heuristic search
produces semantically valid intermediate states and lightens the
burden of proof for every little step.

Michael Kruse via llvm-dev
Jan 21, 2020, 6:54:18 PM
to Renato Golin, llvm-dev
On Wed, Jan 15, 2020 at 03:30, Renato Golin <reng...@gmail.com> wrote:
> > I see it as an advantage in respect of adoption: It can be switched on and off without affecting other parts.
>
> That's not necessarily true.
>
> If we do like Polly, it is, but then the ability to reuse code is very
> low and the time spent converting across is high. If we want to reuse,
> then we'll invariably add behavioural dependencies and disabling the
> pass may have side-effects.

This applies literally to any pass.

I think the problem of reusability is even worse for the current loop
optimization passes. We have multiple, partially
transformation-specific dependence analyses, such as LoopAccessAnalysis,
DependenceInfo, LoopInterchangeLegality, etc. Another one is currently
in the works.

https://xkcd.com/927/ actually does apply here, but I also think that
pass-specific dependence analyses do not scale.


> > I'd put loop optimizations earlier into the pipeline than vectorization. Where exactly is a phase ordering problem. I'd want to at least preserve multi-dimensional subscripts. Fortunately MemRef is a core MLIR construct and unlikely to be lowered before lowering to another representation (likely LLVM-IR).
>
> Many front-ends do that even before lowering to IR because of the
> richer semantics of the AST, but it's also common for that to
> introduce bugs down the line (don't want to name any proprietary
> front-ends here).

This is a problem for any intermediate representation. But isn't that
also the point of MLIR? To be able to express higher-level language
concepts in the IR as dialects? This might introduce bugs as well.

One example is the lowering of multi-dimensional arrays from Clang's
AST to LLVM-IR. We can argue whether the C/C++ spec would allow
GetElementPtr to be emitted with the "inrange" modifier, but for VLAs,
we cannot even express them in the IR, so we had an RFC to change that.

I don't find the argument "there might be bugs" very convincing.


> > I don't see how this is relevant for a Clang-based pipeline. Other languages likely need a different pipeline than one intended for C/C++ code.
>
> Yes, but we want our passes to work for all languages and be less
> dependent on how well they lower their code.
>
> If they do it well, awesome. If not, and if we can identify patterns
> in LLVM IR then there is no reason not to.

This was relevant to the discussion that /all/ front-ends would have
to generate good-enough annotations for loop transformations. Only the
ones that do might enable the loop optimization passes.

Generally, I'd try to make it easy for other front-ends to have loop
optimizations. For instance, avoid isa<LoadInst> in favor of the more
generic "mayReadFromMemory" in analysis/transformation phases.


Michael

Prashanth N R via llvm-dev
Jan 21, 2020, 11:53:20 PM
to Michael Kruse, llvm-dev
Hi Michael-

Liked your proposal and hope that it gets implemented in MLIR. The linearized IR of LLVM is not suitable for LNO.

We have written multiple loop nest optimizers (in LLVM) in the past five years. We sent a talk proposal to the LLVM developer meeting in 2017; it was rejected. From the review comments, it looked like Polly was the preferred path for loop nest optimization. Hope that is not the case any more.

thanks,
-Prashanth

Michael Kruse via llvm-dev
Jan 22, 2020, 3:59:54 AM
to Chris Lattner, llvm-dev
On Wed, Jan 15, 2020 at 20:27, Chris Lattner <clat...@nondot.org> wrote:
Once you achieve consensus on the data structure, there is the question of what IR to use within it.  I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops.  The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway.  Over time, more high level concepts can be gradually introduced.  FYI, MLIR already has a reasonable LLVM dialect and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.

Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.

Isn’t this true of *any* higher level IR?  Unless I’m missing something big, this seems inherent to your proposal.

No. A loop hierarchy may be created on demand and can be skipped if, e.g., the function does not contain a loop. For IRs that are translation-unit based, the entire module has to do a round-trip whether changed or not. To improve the situation, one could, e.g., add a "has been changed" flag to each function, but it would have to be added somewhere in the MLIR data structure and kept up to date on modifications. In a loop-hierarchical structure, only the nodes that have been changed need to be lowered (e.g. an innermost loop) and versioned with the original IR depending on the assumptions taken.
 

This is definitely a subjective question. I think that MLIR is closer to LLVM-IR in how it is processed. Both have a sequence of passes running over a single source of truth. Both allow walking the entire structure from every instruction/operation/block. Analyses are on the function or module level. Both have CFGs (I think for certain kinds of transformations it is an advantage that control flow is handled implicitly).

Right, but a frequent way that MLIR is used is without its CFG: most machine learning kernels use nests of loops and ifs, not CFGs.  CFGs are exposed when those are lowered out.  See some simple examples like:


I agree that a loop nest can be represented in MLIR. What is missing IMHO is being able to have multiple versions of the same code. For instance, raising emitted C++ to such a representation to make it more optimizable may only be possible under preconditions, and the raising may by itself make the code slower. If the raised representation cannot be optimized, we will want to use the original one.


The possibility to make local changes speculatively without copying the entire data structure. IMHO this is a central idea that allows applying a transformation speculatively and passing it to a legality check and cost heuristic without committing to apply it. As a consequence, passes do not need to implement these in a transformation-specific manner, drastically reducing the burden of implementation.

For instance, more loop transformations are feasible if instructions are moved into the innermost loops. With speculative transformations, we can canonicalize the representation to sink computations into loops -- the opposite of what LICM does -- and then see whether a transformation can be applied. If not, the speculative representation is discarded without having an effect on the original representation (and without needing to hoist those computations again).

Because the MLIR classes have many references to related objects (such as pointer to parents and use-def chains), I don't think it is feasible to implement on top of MLIR. 

Ah yes, I see what you mean.  One way to do that is to represent multiple options as an op with region for each option.  This means you only fork the part of the IR that you’re producing variants of.  I think this is the red/green tree technique you mentioned, but I’m not sure.


The red-green tree technique even allows re-inserting entire unchanged subtrees (e.g. loop bodies after an interchange). If an op takes multiple regions, each region must still be deep-copied.


An advantage is that loops and multi-dimensional arrays can be represented in the language without the need of being rediscovered, but they have to be inserted by a front-end.

This is correct, but I don’t see how this helps if your focus is raising code that has already been lowered to LLVM IR, e.g. by Clang or some other frontend that generates LLVM IR today.

Indeed, I would hope that LLVM-IR can preserve multi-dimensional array accesses in some fashion as well (https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html). However, currently MLIR has the advantage of being able to represent it.

I don’t think LLVM IR will ever get there without a massive design change.  It is possible that it will support static shaped accesses in limited ways though.


Static-sized rectangular multi-dimensional arrays are already possible using a standard GetElementPtr and its inrange qualifier. For dynamically sized multi-dimensional arrays, what is needed is to convey the dimensions of the array in the form of an llvm::Value. In the RFC we discussed an intrinsic and operand bundles; neither looks like a massive design change to me.

On the other side, natural loop detection on CFGs is quite mature (with a remaining issue of irreducible loops that might appear, but can also be eliminated again). As a plus, optimization does depend less on how the source code is written.

Yep totally.  The question is whether you lose semantic information from lowering to a CFG and reconstructing back up.  This can affect you when you have higher level language semantics (e.g. Fortran parallel loops, openmp or other concurrency constructs etc).  This is where MLIR excels of course.


Indeed it is easier to not lower these constructs, but not impossible (as shown in https://reviews.llvm.org/D69930). I think the relevant difference is that these constructs come with additional guarantees (e.g. Single-Entry-Single-Exit regions) and optimization hurdles (e.g. thread synchronization; where programmers do not expect the compiler to do a lot of things) compared to C++ loop constructs.


Michael

Renato Golin via llvm-dev

unread,
Jan 22, 2020, 7:14:58 AM1/22/20
to Michael Kruse, llvm-dev
On Tue, 21 Jan 2020 at 23:54, Michael Kruse <llv...@meinersbur.de> wrote:
> On Wed, Jan 15, 2020 at 3:30 AM Renato Golin <reng...@gmail.com> wrote:
> > > I see it as an advantage in respect of adoption: It can be switched on and off without affecting other parts.
> >
> > That's not necessarily true.
>
> This applies literally to any pass.

Precisely why I don't think adding more passes is an advantage to adoption. :)


> I don't find the argument "there might be bugs" very convincing.

Sorry, it wasn't an argument, just a jest at the expense of some old
commercial front-ends.

Pass ordering is complex no matter how you slice it.


> This was relevant to the discussion that /all/ front-ends would have
> to generate good-enough annotations for loop transformations. Only the
> ones that do might enable loop optimization passes.

This doesn't scale. You end up with a few pieces of metadata from a
single front-end justifying a huge new pass that does a few things in
even fewer cases.

VPlan is still crawling, requiring small incremental improvements,
because we're trying to replace an existing pass.

Your proposal is a new pass, that does some new things and therefore
shouldn't need to be incremental (only when the correct info is
present).

But that means now we have two loop transformation infrastructures
that could radically change the IR (and the loop metadata, if any).

Which one passes first will have the advantage, as I think we agreed,
pass ordering is not trivial.

If you restrict this new pass to only doing transformations that are
guaranteed to be valid and better (for some definition of both), then
you'll end like Polly that does wonders to a very small selection of
loops and nothing at all to most of them, just wasted time looking at
all the possibilities.

If you want to be more aggressive, then the IR will change more often
and the pass ordering problem gets worse, requiring changes in later
passes to cope with the changes.

For better or worse, we already have such a pass, the VPlan
infrastructure. It might not be perfect, but it's the path we took
years ago and I'd strongly encourage people to make it better before
throwing it away and coming up with a completely new idea.

VPlan is a simple idea, to represent loops in a more meaningful way,
to represent transformations in a composable way, and to find a
sequence of transforms that would yield the best result.

To me, your proposal is identical in its meaning, but has different
implementation details:
* Tree instead of VPobjects, with additional benefits for new cases.
* Heuristic search of transforms with cheap state copies.

To me, it sounds like we could do both in VPlan, and even call it
LTPlan (Loop Transformation plan) to make it clear it's more generic
than just SIMD vectorisation.

More importantly, it would be a series of incremental steps towards a
better loop infrastructure, building on the work of the last many
years and likely show real benefits much sooner and with a lot less
conflicts than starting from scratch.

--renato

Michael Kruse via llvm-dev

unread,
Jan 25, 2020, 8:31:13 PM1/25/20
to Renato Golin, llvm-dev
On Wed, Jan 15, 2020 at 11:55 PM Renato Golin <reng...@gmail.com> wrote:
> On Thu, 16 Jan 2020 at 06:27, Chris Lattner via llvm-dev
> <llvm...@lists.llvm.org> wrote:
> > Ah yes, I see what you mean. One way to do that is to represent multiple options as an op with region for each option. This means you only fork the part of the IR that you’re producing variants of. I think this is the red/green tree technique you mentioned, but I’m not sure.
>
> IIUC, the green tree is a lighter version of the tree (leaner memory
> footprint) but still the entire tree. It's ok to lose that info because
> you don't usually need that for your transformations, and you can
> always go back to the red tree (via pointer in the green tree) to ask
> harder questions.

The green tree would contain the "heavier" objects, as they do not need
to be copied that often. The red tree is necessary for operations
depending on the ancestry/surrounding code. As an example, checking
whether a loop is actually a subloop of an outer loop with which it
could be interchanged.


> Managing the semantics of the two becomes
> non-trivial when you start adding and replacing nodes, there's a point
> where you can't go back anymore to the red tree in the same way.

The model is that the trees are immutable, hence there is no cost of
managing node replacement. Nodes only point to their children, to their
parent (red tree) of the same code version, and potentially to the
nodes they originate from. When creating a new version, existing nodes
are not updated.

(To avoid unnecessary new versions, chains of transformations -- like
unroll-then-jam -- may modify existing nodes, if it is guaranteed that
no reference to them has been passed.)

> What I referred to as "journalling" is what you propose here. Add
> metadata to the actual graph and during the heuristic search, only
> clone those. If you make sure you can append those nodes to the graph,
> and guarantee that the extra nodes are composable (ie semantically
> valid in any order they may be applied), then having the original
> graph + any intermediate state is valid. Therefore, keeping only the
> extra nodes, and copying them along to try different routes, becomes
> even cheaper than a green tree.
>
> If those extra nodes are an MLIR dialect, with defined semantics and
> structured composition, then using them in an heuristics search
> produces semantically valid intermediate states and lightens the
> burden of proof for every little step.


Journaling (I assume in the sense of file systems and databases:
keeping a log of items before some change so the change can be rolled
back) is an interesting idea. I think it comes with drawbacks:

* Only one 'head' version is active at a time. Switching to a
different version requires rolling back one version and re-applying
another. No two versions are active at the same time.

* Every change/roll-back invalidates analyses (or analyses have to be
taught about journaling and keep previous results in memory)

* Cannot re-use subtrees (e.g. unrolling referencing the same subtree
factor-times)

* Potential abuse of the IR (in the sense
https://lists.llvm.org/pipermail/llvm-dev/2017-October/118258.html)

You mentioned intermediate states remaining valid, so you might have a
different kind of journalling in mind. However, due to the strong
coupling (e.g. use-def chains), I don't think this is possible. Please
elaborate more on your idea.

An analogy would be journaling and cow-btree nodes in Btrfs: Their
purposes are quite different.


Michael

Michael Kruse via llvm-dev

unread,
Jan 25, 2020, 8:36:44 PM1/25/20
to Prashanth N R, llvm-dev
On Tue, Jan 21, 2020 at 6:52 PM Prashanth N R <prasha...@gmail.com> wrote:

> We have written multiple Loop Nest Optimizers (in LLVM) in the past five years. We sent a talk proposal to the LLVM developer meeting in 2017; it was rejected. From the review comments it looked like Polly was the preferred path for Loop Nest Optimization. Hope it is not the case any more.

We put some effort into mainstreaming Polly by integrating it into
the main repository. There were some hurdles, one of the largest being
that it relies on an external library written in C. Others are that it
requires well-formed IR to do anything and may significantly increase
compile time. My proposal is intended to be a solution for these
problems.

Michael

Michael Kruse via llvm-dev

unread,
Jan 25, 2020, 9:06:26 PM1/25/20
to Renato Golin, llvm-dev
On Wed, Jan 22, 2020 at 2:14 AM Renato Golin <reng...@gmail.com> wrote:
> Precisely why I don't think adding more passes is an advantage to adoption. :)

The alternative is to have a separate pass for each kind of loop
transformation, i.e. many more passes than a single loop-transformation
pass.


> > I don't find the argument "there might be bugs" very convincing.
>
> Sorry, it wasn't an argument, just a jest at the expense of some old
> commercial front-ends.
>
> Pass ordering is complex no matter how you slice it.

Indeed. I am already concerned about additional phase-ordering
problems if we implement each loop transformation in its own pass,
e.g. between loop fusion and loop distribution. Do we first fuse into
as few loops as possible and then distribute, or the other way around?


> > This was relevant to the discussion that /all/ front-ends would have
> > to generate good-enough annotations for loop transformations. Only the
> > ones that do might enable loop optimization passes.
>
> This doesn't scale. You end up with a few metadata from a single
> front-end justifying a huge new pass that does a few things in even
> fewer cases.

I'd think that the metadata is not front-end/language specific.

A language where most instructions can access any memory is arguably
harder to optimize than a language where only a selected set of
instructions can do that. But the metadata describing what memory an
instruction can access is not front-end specific.


> Your proposal is a new pass, that does some new things and therefore
> shouldn't need to be incremental (only when the correct info is
> present).
>
> But that means now we have two loop transformation infrastructures
> that could radically change the IR (and the loop metadata, if any).

I don't think LLVM's current loop optimizations are well developed.
Only LoopVectorize and LoopUnroll are even enabled by default.


> Which one passes first will have the advantage, as I think we agreed,
> pass ordering is not trivial.

Which is one of the things this proposal addresses.


> If you restrict this new pass to only doing transformations that are
> guaranteed to be valid and better (for some definition of both),

This is a strange argument. You want transformations that are invalid
and/or worse?

> then
> you'll end like Polly that does wonders to a very small selection of
> loops and nothing at all to most of them, just wasted time looking at
> all the possibilities.

This is exactly what this proposal is addressing.

I think the small selection mostly stems from Polly requiring
well-formed IR. Very often it could algorithmically optimize a
problem, but cannot represent the IR in its internal representation: a
SCoP, which is based on ISL's schedule-tree representation. The main
motivation of the proposal is to address exactly this, meaning there
is no external library that restricts what we can represent.

A second reason is that Polly relies on ISL's solver algorithm, which
minimizes re-use distance and maximizes parallelism, while the
pass-based optimizers use hand-written heuristics. I want to make both
possible within the same framework.


> If you want to be more aggressive, then the IR will change more often
> and the pass ordering problem gets worse, requiring changes in later
> passes to cope with the changes.

My goal is to get a hierarchical optimization order: Once the
higher-level optimizations (that is, loops) have been decided on, only
lower-level optimizations (InstCombine, instruction motion, CFG
simplification, etc) are left to do. If we have to re-do loop
optimizations, something went very wrong.


Michael

Renato Golin via llvm-dev

unread,
Jan 26, 2020, 8:47:28 AM1/26/20
to Michael Kruse, llvm-dev
On Sun, 26 Jan 2020 at 02:06, Michael Kruse <llv...@meinersbur.de> wrote:
> A language where most instructions can access any memory is arguably
> harder to optimize than a language where only a selected set of
> instructions can do that. But the metadata describing what memory an
> instruction can access is not front-end specific.

My point is that not all front-ends have the same pace at implementing
new metadata, and the discussion as to what each means in the context
of each language can take a while.


> I think the small selection mostly stems from Polly requiring
> well-formed IR. Very often it could algorithmically optimize a
> problem, but cannot represent the IR in its internal representation: a
> SCoP which is based on ISL's schedule tree representation. The main
> motivation of the proposal is to exactly address this, meaning there
> is no external library that restricts what we can represent.

I see, so this is basically the old proposal of re-writing ISL into
LLVM, but with a more powerful heuristic search.

I'm not against the idea (as I wasn't back then), but this will take a
very long time, and will somewhat compete with current VPlan (engine
to find transformations) and MLIR (extensible IR) efforts.

I'm also not saying there is a clear and trivial path from VPlan and
MLIR towards your proposal. There may be many hurdles, some even
impractical, so I'm not strongly proposing it either.

But I'd like to explore all the options before starting yet-another
effort to make high-level parallelisation more tractable.

cheers,
--renato

Chris Lattner via llvm-dev

unread,
Jan 26, 2020, 10:04:47 PM1/26/20
to Michael Kruse, llvm-dev
On Jan 22, 2020, at 12:58 AM, Michael Kruse <llv...@meinersbur.de> wrote:
On Wed, Jan 15, 2020 at 8:27 PM Chris Lattner <clat...@nondot.org> wrote:
Once you achieve consensus on the data structure, there is the question of what IR to use within it.  I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops.  The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway.  Over time, more high level concepts can be gradually introduced.  FYI, MLIR already has a reasonable LLVM dialect and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.

Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.

Isn’t this true of *any* higher level IR?  Unless I’m missing something big, this seems inherent to your proposal.

No. A loop hierarchy may be created on-demand and can be skipped if, e.g., the function does not contain a loop.

I don’t see how this is specific to a loop IR.  Any “LLVMIR -> X -> LLVMIR” system has the behavior you describe, whether X is polly, mlir, or some other loop IR; any decision about skipping the round trip could be applied to any of them.

The advantage of MLIR in this discussion is that it has the opportunity to subsume LLVMIR at some point in the future, eliminating the round trip at that point.

For IRs that are translation-unit based, the entire module will have to do a round-trip whether changed or not.

I think that this must be the misunderstanding.  There is no requirement to do a “Full LLVM IR module to MLIR module” conversion, you can convert one function, one loop nest, one basic block or whatever you else you’d want to do.

This is definitely a subjective question. I think that MLIR is closer to LLVM-IR in how it is processed. Both have a sequence of passes running over a single source of truth. Both allow walking the entire structure from every instruction/operation/block. Analyses are on the function or module level. Both have CFGs (I think for certain kinds of transformations it is an advantage that control flow is handled implicitly).

Right, but a frequent way that MLIR is used is without its CFG: most machine learning kernels use nests of loops and ifs, not CFGs.  CFGs are exposed when those are lowered out.  See some simple examples like:


I agree that a loop nest can be represented in MLIR. What is missing IMHO is being able to have multiple versions of the same code. For instance, raising emitted C++ to such a representation to make it more optimizable may only be possible under preconditions, and the raising may by itself make the code slower. If the raised representation cannot be optimized, we will want to use the original one.

This is pretty straight-forward (in principle, I haven’t actually built a system to prove this), because you can just have an operation with regions for each version, e.g.:

for {
  op1
  versioned {
    stuff
  } {
    other stuff
  }
  op2
}

Then transform the code and select which version you want later.

MLIR doesn’t magically make the algorithms happen for you, but it does handle the representational issues, making it super flexible and easy to support things like this.


On the other side, natural loop detection on CFGs is quite mature (with a remaining issue of irreducible loops that might appear, but can also be eliminated again). As a plus, optimization does depend less on how the source code is written.

Yep totally.  The question is whether you lose semantic information from lowering to a CFG and reconstructing back up.  This can affect you when you have higher level language semantics (e.g. Fortran parallel loops, openmp or other concurrency constructs etc).  This is where MLIR excels of course.


Indeed it is easier to not lower these constructs, but not impossible (as shown in https://reviews.llvm.org/D69930). I think the relevant difference is that these constructs come with additional guarantees (e.g. Single-Entry-Single-Exit regions) and optimization hurdles (e.g. thread synchronization; where programmers do not expect the compiler to do a lot of things) compared to C++ loop constructs.

There is a dangerous implication in the way you phrased this, and while it may not have been intentional, it is something I think is important to point out.   The whole point of good IR design is to make certain things *easy*.  IR design rarely makes things “possible”, and so saying something is “not impossible” is dangerous ground to stand on.

To reduce the point to absurdity, you could implement loop transformations on machine code: you can reconstruct a CFG, discover natural loops, raise the machine code to something like LLVM IR, etc.  The problem with this is that it is extremely cumbersome, fragile, and would lose the ability to use high level information that cannot be persisted in machine code.  The same thing is true about LLVM IR, just less absurdly so.

-Chris


Uday Kumar Reddy Bondhugula via llvm-dev

unread,
Jan 27, 2020, 11:06:25 PM1/27/20
to Michael Kruse, llvm...@lists.llvm.org
Hi Michael,

Although the approach to use a higher order in-memory abstraction like the loop tree will make it easier than what you have today, if you used MLIR for this representation, you already get a round trippable textual format that is *very close* to your form. The affine.for/if, std.for/if in MLIR are nearly isomorphic to the tree representation you want, and as such, this drastically reduces the delta between the in-memory data structures your passes want to operate on and what you see when you print the IR. Normally, there'd be resistance to building a textual form / introducing a first class concept in the IR for what you are proposing, but since this was already done for MLIR, it looks like it would be a big win from a compiler developers' productivity standpoint if you just used MLIR for this loop tree representation. With regard to the FAQ, I can't tell whether you meant something else or missed the representation used in MLIR for the affine dialect or in general for "ops with regions".

> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For

It's completely the opposite unless you are looking only at MLIR's std dialect! The affine dialect as well as the std.for/if (currently misnamed as loop.for/loop.if) are actually a loop tree. The affine ops are just an affine loop AST isomorphic to the materialization of polyhedral domains/schedules via code generation. Every isl AST or the output of polyhedral code generation can be represented in the affine dialect and vice versa. MLIR's loop/if ops are a hierarchy rather than a flat form / list of blocks CFG.

> still have to be rediscovered. However, a loop hierarchy optimizer
> could be applied to MLIR just as well as to LLVM-IR.

This is true, but it's easier to apply it to MLIR because the actual IR is close by miles to the in-memory thing your loop hierarchy optimizer would be using. For example, here's the input IR and the output IR of a simple outer loop vectorization performed in MLIR:

~ Uday
--
Founder and Director, PolyMage Labs

Michael Kruse via llvm-dev

unread,
Jan 30, 2020, 4:04:56 AM1/30/20
to Chris Lattner, llvm-dev
On Sun, Jan 26, 2020 at 9:04 PM Chris Lattner <clat...@nondot.org> wrote:

On Jan 22, 2020, at 12:58 AM, Michael Kruse <llv...@meinersbur.de> wrote:
On Wed, Jan 15, 2020 at 8:27 PM Chris Lattner <clat...@nondot.org> wrote:
Once you achieve consensus on the data structure, there is the question of what IR to use within it.  I would recommend starting with some combination of “existing LLVM IR operations + high level control flow representation”, e.g. parallel and affine loops.  The key here is that you need to always be able to lower in a simple and predictable way to LLVM IR (this is one of the things that classic polyhedral systems did suboptimally, making it difficult to reason about the cost model of various transformations), and this is a natural incremental starting point anyway.  Over time, more high level concepts can be gradually introduced.  FYI, MLIR already has a reasonable LLVM dialect and can generate LLVM IR from it, so we’d just need an “LLVM IR -> MLIR LLVM dialect” conversion, which should be straightforward to build.

Adding an LLVM-IR -> MLIR -> LLVM-IR round-trip would at the beginning just introduce compile-time overhead and what Renato described as brittleness. I fear this hurts adoption.

Isn’t this true of *any* higher level IR?  Unless I’m missing something big, this seems inherent to your proposal.

No. A loop hierarchy may be created on-demand and can be skipped if, e.g., the function does not contain a loop.

I don’t see how this is specific to a loop IR.  Any “LLVMIR -> X -> LLVMIR” system has the behavior you describe, whether X is polly, mlir, or some other loop IR; any decision about skipping the round trip could be applied to any of them.

The advantage of MLIR in this discussion is that it has the opportunity to subsume LLVMIR at some point in the future, eliminating the round trip at that point.


As long as there is no plan for this to happen, this is speculative.
 
For IRs that are translation-unit based, the entire module will have to do a round-trip whether changed or not.

I think that this must be the misunderstanding.  There is no requirement to do a “Full LLVM IR module to MLIR module” conversion, you can convert one function, one loop nest, one basic block or whatever you else you’d want to do.

Either the goal is for clang to generate MLIR translation units, or to have per-function round-trips. However, I do see that per-function round-trips could be an intermediate state.

However, for sub-function units, one somehow has to keep references to the llvm::Value objects declared outside the extracted unit so they can be referenced again when generating LLVM-IR. This is possible with a DenseMap<mlir::Operation, llvm::Instruction>, but sounds fragile. Also, these llvm::Instructions must somehow be represented as mlir::Operations.

 

This is pretty straight-forward (in principle, I haven’t actually built a system to prove this), because you can just have an operation with regions for each version, e.g.:

for {
  op1
  versioned {
    stuff
  } {
    other stuff
  }
  op2
}

Then transform the code and select which version you want later.

MLIR doesn’t magically make the algorithms happen for you, but it does handle the representational issues, making it super flexible and easy to support things like this.


What I proposed is much more than the possibility to represent multiple versions in the same IR. (Assuming that we want to keep just one of the versions, the additional ones are visible in use-def chains and may confuse some analyses. They might also be a problem for convergent instructions.)

I would like to stress that multi-versioning is just one of the possibilities enabled by cheap copies (i.e. Red/Green trees) and I remain convinced that cheap copies are not possible with a densely-coupled representation such as MLIR, LLVM-IR and VPlan (unless we separate the in-memory representation from the logical data structure).

Other things enabled by cheap copies:
 * Apply the transformation first, then check legality and profitability
 * Speculative transformations; profitability can be checked for a chain of operations
 * Canonicalizations that remove dependencies (e.g. moving computations into loops)
 * Lazy expansion of nodes (e.g. instructions outside of loops do not need to be represented)
 * Cheap representation of unrolling
 * Code versioning

Your suggestion above still requires copying the entire "stuff" and "other stuff" subtrees. Stencil loops, for example, often have thousands of instructions, all of which would need to be copied for each version. This is a practical barrier for the applications above because of the compile-time increase.

There is an analogy in file systems: Btrfs and ZFS support copy-on-write of files and directories, enabling new applications such as snapshots. Without the copy-on-write feature, a snapshot would have to copy the entire file system. Possible, but not really feasible.


Indeed it is easier to not lower these constructs, but not impossible (as shown in https://reviews.llvm.org/D69930). I think the relevant difference is that these constructs come with additional guarantees (e.g. Single-Entry-Single-Exit regions) and optimization hurdles (e.g. thread synchronization; where programmers do not expect the compiler to do a lot of things) compared to C++ loop constructs.

There is a dangerous implication in the way you phrased this, and while it may not have been intentional, it is something I think is important to point out.   The whole point of good IR design is to make certain things *easy*.  IR design rarely makes things “possible”, and so saying something is “not impossible” is dangerous ground to stand on.

I cannot follow the deduction from "less easy" to "dangerous".

Anyway, this is also the argument I want to make: A loop-only representation can simplify implementing loop transformations by a lot. Instead of each transformation having to implement its own legality check and profitability heuristics (often requiring the most lines of code), they can all share the same implementation.

Michael
 

Michael Kruse via llvm-dev

unread,
Jan 30, 2020, 4:06:48 AM1/30/20
to Uday Kumar Reddy Bondhugula, llvm-dev
On Mon, Jan 27, 2020 at 10:06 PM Uday Kumar Reddy Bondhugula <ud...@polymagelabs.com> wrote:
Hi Michael,

Although the approach to use a higher order in-memory abstraction like the loop tree will make it easier than what you have today, if you used MLIR for this representation, you already get a round trippable textual format that is *very close* to your form. The affine.for/if, std.for/if in MLIR are nearly isomorphic to the tree representation you want, and as such, this drastically reduces the delta between the in-memory data structures your passes want to operate on and what you see when you print the IR. Normally, there'd be resistance to building a textual form / introducing a first class concept in the IR for what you are proposing, but since this was already done for MLIR, it looks like it would be a big win from a compiler developers' productivity standpoint if you just used MLIR for this loop tree representation. With regard to the FAQ, I can't tell whether you meant something else or missed the representation used in MLIR for the affine dialect or in general for "ops with regions".

The point of the proposal is not having a first-class construct for loops, but to allow speculative transformations. Please see my response to Chris Lattner:


> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For

It's completely the opposite unless you are looking only at MLIR's std dialect! The affine dialect as well as the std.for/if (currently misnamed as loop.for/loop.if) are actually a loop tree. The affine ops are just an affine loop AST isomorphic to the materialization of polyhedral domains/schedules via code generation. Every isl AST or the output of polyhedral code generation can be represented in the affine dialect and vice versa. MLIR's loop/if ops are a hierarchy rather than a flat form / list of blocks CFG.

As per my discussion with Chris Lattner, this is a very subjective question. It might be controversial, but I don't see MLIR regions as much more than syntactic sugar for inlined function calls that allow referencing the outer regions' definitions. This does not mean that I think they are useless; on the contrary.

Regarding the affine dialect, I see the same problem that Polly has when creating a schedule tree representation: a lot of work has to be done to make IR originating from Clang compatible. Everything becomes easy if the front-end can generate an affine dialect out of the box.



> still have to be rediscovered. However, a loop hierarchy optimizer
> could be applied to MLIR just as well as to LLVM-IR.

This is true, but it's easier to apply it to MLIR because the actual IR is close by miles to the in-memory thing your loop hierarchy optimizer would be using. For eg., here's the input IR and the output IR of a simple outer loop vectorization performed in MLIR:


Again, the proposal is about the in-memory representation using red/green trees (which I firmly disagree with being close to MLIR's in-memory representation), not the textual format.
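To make concrete what "cheap subtree reuse" means here, below is a minimal red/green tree sketch (all names are hypothetical illustrations, not the proposed implementation): green nodes are immutable and freely shared between tree versions, so a speculative variant only rebuilds the spine from the changed node to the root instead of deep-copying the whole representation.

```python
class Green:
    """Immutable tree node; safely shared between tree versions."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)

def replace_child(root, path, new_node):
    """Return a new root with the node at `path` replaced.
    Only the spine from the root to the change is rebuilt;
    all sibling subtrees are reused (pointer-shared)."""
    if not path:
        return new_node
    i = path[0]
    kids = list(root.children)
    kids[i] = replace_child(kids[i], path[1:], new_node)
    return Green(root.label, kids)

# Original loop nest: for_i contains for_j and for_k.
for_j = Green("for_j", [Green("body1")])
for_k = Green("for_k", [Green("body2")])
for_i = Green("for_i", [for_j, for_k])

# Speculative variant: unroll for_k; for_j is untouched.
variant = replace_child(
    for_i, [1],
    Green("for_k_unrolled", [Green("body2"), Green("body2")]))

assert variant is not for_i                # new spine node
assert variant.children[0] is for_j        # sibling subtree shared, not copied
assert for_i.children[1].label == "for_k"  # original version still intact
```

This is what makes keeping many transformation candidates alive at once cheap: each variant costs O(depth of the change), not O(size of the tree).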


Michael

Uday Kumar Reddy Bondhugula via llvm-dev

unread,
Jan 30, 2020, 5:40:21 AM1/30/20
to Michael Kruse, llvm-dev
Hi Michael,

On Thu, 30 Jan 2020 at 14:36, Michael Kruse <llv...@meinersbur.de> wrote:
Am Mo., 27. Jan. 2020 um 22:06 Uhr schrieb Uday Kumar Reddy Bondhugula <ud...@polymagelabs.com>:
Hi Michael,

Although the approach to use a higher order in-memory abstraction like the loop tree will make it easier than what you have today, if you used MLIR for this representation, you already get a round trippable textual format that is *very close* to your form. The affine.for/if, std.for/if in MLIR are nearly isomorphic to the tree representation you want, and as such, this drastically reduces the delta between the in-memory data structures your passes want to operate on and what you see when you print the IR. Normally, there'd be resistance to building a textual form / introducing a first class concept in the IR for what you are proposing, but since this was already done for MLIR, it looks like it would be a big win from a compiler developers' productivity standpoint if you just used MLIR for this loop tree representation. With regard to the FAQ, I can't tell whether you meant something else or missed the representation used in MLIR for the affine dialect or in general for "ops with regions".

The point of the proposal is not having a first-class construct for loops, but to allow speculative transformations. Please see my response to Chris Lattner:

But that's not how your original post reads or opens, AFAICS. Presentation aside, even if your central goal was to allow speculative transformations, the same arguments hold. More below. (I had read your responses to @clattner.)
 


> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For

It's completely the opposite unless you are looking only at MLIR's std dialect! The affine dialect as well as the std.for/if (currently misnamed as loop.for/loop.if) are actually a loop tree. The affine ops are just an affine loop AST isomorphic to the materialization of polyhedral domains/schedules via code generation. Every IST AST or the output of polyhedral code generation can be represented in the affine dialect and vice versa. MLIR's loop/if ops are a hierarchy rather than a flat form / list of blocks CFG. 

As per my discussion with Chris Lattner, this is a very subjective question. It might be controversial, but I don't see MLIR regions as much more than syntactic sugar for inlined function calls that allow referencing the outer regions' definitions. This does not mean that I think they are useless; on the contrary.

There are multiple ways regions in MLIR can be viewed, but the more relevant point here is you do have a loop tree structure native in the IR with MLIR. Regions in MLIR didn't evolve from modeling inlined calls - the affine.for/affine.if were originally the only two operations in MLIR that could hold blocks (which in turn are a list of operations as you know) and there wasn't anything by the name region. Later, "a list of blocks" was renamed "region" in order to generalize and unify it with other concepts that could be captured with "ops with regions", one of which is isomorphic to a "just inlined" call as you view it. But that doesn't mean a loop tree doesn't exist as a first class thing in the IR when you have the relevant ops around -- there is a hierarchy.
 

Regarding the affine dialect, I see the same problem that Polly has when creating a schedule tree representation: A lot of work has to be done to make IR originating from Clang compatible. Everything becomes easy if the front-end can generate an affine dialect out-of-the box.

Right - but for the purposes of your proposal, this isn't really relevant - for that matter, one could just use the loop.for, loop.if ops if you don't want to leverage affine restrictions. Moreover, with affine.graybox ops, you can always use affine.for/if wherever you have structured loops (otherwise, you would fall back to a flat list of blocks inside the region of the graybox.) While directly generating the affine dialect maximally from the frontend / Clang is one option, the other is to just generate grayboxes with trivial affine.for/if (or just loop.for/if), and then eliminate the grayboxes maximally within MLIR. This way things are reusable across different frontends, and it would be similar to Polly's approach except that you would be dealing with loops/multi-dimensional arrays where possible instead of flat list of CFGs and GEPs.
 



> still have to be rediscovered. However, a loop hierarchy optimizer
> could be applied to MLIR just as well as to LLVM-IR.

This is true, but it's easier to apply it to MLIR because the actual IR is close by miles to the in-memory thing your loop hierarchy optimizer would be using. For eg., here's the input IR and the output IR of a simple outer loop vectorization performed in MLIR:


Again, the proposal is about the in-memory representation using red/green trees

That's actually not how I read it. Red/green trees was *one* of the nine items you mentioned in your list and this didn't come out as the central idea in your opening paras, but let's go with this now that it's clearer to me.
 
(which I firmly disagree with being close to MLIR's in-memory representation), not the textual format.

"close" isn't the right word here, "closer" is! Would you agree that the representation you are proposing is closer to MLIR's representation (both its in-memory and its textual representation) than to LLVM's or is this proximity not really relevant for the purposes of your proposal? I think it's important to know which among the two is the more important question. Note that currently there is really very little difference between MLIR's textual format for 'for'/if's and the in-memory form its passes use. The passes don't create any in-memory higher order representations for these IR units; they directly update them. There is nothing like the kind of complementary abstractions you are proposing (for cost models, copy-wise, etc.).

~ Uday
 


 


Michael

Michael Kruse via llvm-dev

unread,
Feb 3, 2020, 1:36:25 AM2/3/20
to Uday Kumar Reddy Bondhugula, llvm-dev
Am Do., 30. Jan. 2020 um 04:40 Uhr schrieb Uday Kumar Reddy Bondhugula
<ud...@polymagelabs.com>:

> There are multiple ways regions in MLIR can be viewed, but the more relevant point here is you do have a loop tree structure native in the IR with MLIR. Regions in MLIR didn't evolve from modeling inlined calls - the affine.for/affine.if were originally the only two operations in MLIR that could hold blocks (which in turn are a list of operations as you know) and there wasn't anything by the name region. Later, "a list of blocks" was renamed "region" in order to generalize and unify it with other concepts that could be captured with "ops with regions", one of which is isomorphic to a "just inlined" call as you view it. But that doesn't mean a loop tree doesn't exist as a first class thing in the IR when you have the relevant ops around -- there is a hierarchy.

Thanks for the interesting insights into the development history of MLIR.


>> Regarding the affine dialect, I see the same problem that Polly has when creating a schedule tree representation: A lot of work has to be done to make IR originating from Clang compatible. Everything becomes easy if the front-end can generate an affine dialect out-of-the box.
>
> Right - but for the purposes of your proposal, this isn't really relevant - for that matter, one could just use the loop.for, loop.if ops if you don't want to leverage affine restrictions. Moreover, with affine.graybox ops, you can always use affine.for/if wherever you have structured loops (otherwise, you would fall back to a flat list of blocks inside the region of the graybox.) While directly generating the affine dialect maximally from the frontend / Clang is one option, the other is to just generate grayboxes with trivial affine.for/if (or just loop.for/if), and then eliminate the grayboxes maximally within MLIR. This way things are reusable across different frontends, and it would be similar to Polly's approach except that you would be dealing with loops/multi-dimensional arrays where possible instead of flat list of CFGs and GEPs.

I think there is a relevant difference of whether we come from a
high-level code generator and then lift restrictions or from a
low-level IR which has to be raised. If starting with the high-level,
we will have to bail out on representing things because we cannot
ensure the expected semantics of the high-level idioms (while-,
do-loops, coroutines, possibly infinite loops, non-returning elements
in the loop body, ...) and have to work towards poking holes into the
high-level representation that existing passes must be able to handle.
When starting with a low-level approach, useful guarantees are added
to the representation, but everything can be represented at the
beginning.
I am not saying that the two approaches cannot meet, but I am afraid
that the high-level approach, like Polly, adds many bail-outs making
it difficult to use in practice. For instance, we want to apply
strip-mining to a loop. Since it does not change the execution order
of any body code, it is always valid, yet we'd have to bail out if we
cannot uphold the representation's guarantees. I would like to avoid
that.
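As an illustration of why strip-mining is unconditionally legal (a toy sketch, not the proposed API): it only regroups consecutive iterations into tiles, leaving the sequence of body executions untouched.

```python
def original_order(n):
    """Original loop: body executes for i = 0..n-1 in order."""
    return [i for i in range(n)]

def strip_mined_order(n, tile=4):
    """Strip-mined loop: an outer loop over tiles and an inner loop
    within each tile. The body execution order is unchanged, which is
    why the transformation never needs a dependence check."""
    order = []
    for ii in range(0, n, tile):                 # loop over tiles
        for i in range(ii, min(ii + tile, n)):   # loop within a tile
            order.append(i)
    return order

# Identical execution order, including a non-divisible trip count.
assert strip_mined_order(10) == original_order(10)
assert strip_mined_order(7, tile=3) == original_order(7)
```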

However, I agree that MLIR has the expressiveness required for
hierarchical loop structures. I don't think we need to argue about
that.


> That's actually not how I read it. Red/green trees was *one* of the nine items you mentioned in your list and this didn't come out as the central idea in your opening paras, but let's go with this now that it's clearer to me.

Indeed, red/green trees (or DAGs) are only one of the ideas to improve
loop optimizations and do not justify the proposal by themselves.
However, they happen to be effectively necessary for others in the
list (e.g. versioning, profitability heuristics by cost function,
etc.) and are the reason why I think the same cannot be done with
MLIR. In hindsight, I could have pointed this out more in the original
RFC. Note that a hierarchical representation was not an explicit
feature in the list.

To convince me that MLIR is the better IR for loop optimizations, one
might show that each of the features enabled by cheap subtree reuse
can also be done sufficiently efficiently and easily on MLIR (or that
a feature is not something we would actually want).


>> (which I firmly disagree with being close to MLIR's in-memory representation), not the textual format.
>
> "close" isn't the right word here, "closer" is! Would you agree that the representation you are proposing is closer to MLIR's representation (both its in-memory and its textual representation) than to LLVM's or is this proximity not really relevant for the purposes of your proposal? I think it's important to know which among the two is the more important question.

I think MLIR is an evolution of LLVM-IR and Swift-IR, built around
similar principles such as SSA and Control-Flow Graphs (I understand
that in addition to CFGs, MLIR also enables structured control flow
idioms). SSA is a distinguishing feature: it allows quickly
traversing def/uses (where compilers without it need a data-flow
analysis), but makes subtree reuse hard.

Does this answer your question?


> Note that currently there is really very little difference between MLIR's textual format for 'for'/if's and the in-memory form its passes use. The passes don't create any in-memory higher order representations for these IR units; they directly update them. There is nothing like the kind of complementary abstractions you are proposing (for cost models, copy-wise, etc.).

The point I was making is that the in-memory format has references to
related items such as parents and use-def chains. These are implicit
in the textual format, e.g. the parent of an operation is defined by
its syntactical location. When reading into memory, it is not
obligatory for the objects to have all the references to related
objects.


Michael

Chris Lattner via llvm-dev

unread,
Feb 5, 2020, 7:14:30 PM2/5/20
to Michael Kruse, llvm-dev


On Feb 2, 2020, at 10:35 PM, Michael Kruse via llvm-dev <llvm...@lists.llvm.org> wrote:


That's actually not how I read it. Red/green trees was *one* of the nine items you mentioned in your list and this didn't come out as the central idea in your opening paras, but let's go with this now that it's clearer to me.

Indeed, red/green trees (or DAGs) are only one of the ideas to improve
loop optimizations and do not justify the proposal by themselves.
However, they happen to be effectively necessary for others in the
list (e.g. versioning, profitability heuristics by cost function,
etc.) and are the reason why I think the same cannot be done with
MLIR. In hindsight, I could have pointed this out more in the original
RFC. Note that a hierarchical representation was not an explicit
feature in the list.

To convince me that MLIR is the better IR for loop optimizations, one
might show that each of the features enabled by cheap subtree reuse
can also be done sufficiently efficiently and easily on MLIR (or that
a feature is not something we would actually want).

Hi Michael,

If I understand your claims, you are claiming both that red/green trees are essential for a loop optimizer, and that this essential nature “justifies” reinventing an entire compiler infrastructure, i.e. that its cost is lower than the benefit of using (e.g.) MLIR to do this.  I haven’t seen evidence of either point: lots of loop optimizers exist that don’t use red/green trees.

Furthermore, my experience is that specialized IRs never get the investment in (e.g.) testing, location propagation for debugging optimized code, textual round tripping, pass management, and the multitude of other things that are required for a world class compiler implementation.

-Chris

Hal Finkel via llvm-dev

unread,
Feb 5, 2020, 7:52:02 PM2/5/20
to Chris Lattner, Michael Kruse, llvm-dev


On 2/5/20 6:13 PM, Chris Lattner via llvm-dev wrote:


On Feb 2, 2020, at 10:35 PM, Michael Kruse via llvm-dev <llvm...@lists.llvm.org> wrote:


That's actually not how I read it. Red/green trees was *one* of the nine items you mentioned in your list and this didn't come out as the central idea in your opening paras, but let's go with this now that it's clearer to me.

Indeed, red/green trees (or DAGs) are only one of the ideas to improve
loop optimizations and do not justify the proposal by themselves.
However, they happen to be effectively necessary for others in the
list (e.g. versioning, profitability heuristics by cost function,
etc.) and are the reason why I think the same cannot be done with
MLIR. In hindsight, I could have pointed this out more in the original
RFC. Note that a hierarchical representation was not an explicit
feature in the list.

To convince me that MLIR is the better IR for loop optimizations, one
might show that each of the features enabled by cheap subtree reuse
can also be done sufficiently efficiently and easily on MLIR (or that
a feature is not something we would actually want).

Hi Michael,

If I understand your claims, you are claiming both that red/green trees are essential for a loop optimizer, and that this essential nature “justifies” the cost of reinventing an entire compiler infrastructure is lower than the benefit of using (e.g.) MLIR to do this.  I haven’t seen evidence of either point:


Michael and I have discussed this offline, and I think that more quantitative information here would be helpful. One phrasing might be: For a reasonable test suite of programs, for the loops we might optimize, how many different loop-nest variants do we need to represent for costing purposes (also note that, depending on profiling information and/or heuristics, we may need to cost at multiple trip-count values)? Given that, what is the relative cost of creating deep copies of the loop nests and transforming them vs. the cost of using a red/green tree? Does that cost seem significant?

We don't want to invest significant effort into an infrastructure that we know ahead of time won't scale sufficiently.


lots of loop optimizers exist that don’t use red/green trees.


I've worked with many compilers, some from vendors known for having strong loop optimizers, and none of them would meet our goals for performance, robustness, and modeling accuracy (e.g., not regressing performance in many cases). Our goal here is to significantly improve on the state of the art.



Furthermore, my experience is that specialized IRs never get the investment in (e.g.) testing, location propagation for debugging optimized code, textual round tripping, pass management, and the multitude of other things that are required for a world class compiler implementation.


I agree, but I think that these are well-known requirements within this ecosystem. What this means also depends on context (it might be more helpful to think of this more like SCEV than like a custom IR).

 -Hal



-Chris

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Michael Kruse via llvm-dev

unread,
Feb 5, 2020, 11:23:22 PM2/5/20
to Chris Lattner, llvm-dev
Am Mi., 5. Feb. 2020 um 18:13 Uhr schrieb Chris Lattner <clat...@nondot.org>:
> If I understand your claims, you are claiming both that red/green trees are essential for a loop optimizer, and that this essential nature “justifies” the cost of reinventing an entire compiler infrastructure is lower than the benefit of using (e.g.) MLIR to do this. I haven’t seen evidence of either point: lots of loop optimizers exist that don’t use red/green trees.

I have spoken with multiple compiler vendors and we all could agree
that the approach of pass/transformation-local decision making limits
the global optimization goal the more transformations there are,
seriously limiting the quality of global optimizations. Of those who
gave me some details, several have a loop-specific data structure
(although not "red/green" trees). I don't suggest replacing anything
in the existing toolchain/pipeline, so there is no risk of regressing
anything that already works today. Every advancement will need to show
its advantages at some point; this was already the case with e.g. SSA
form.


> Furthermore, my experience is that specialized IRs never get the investment in (e.g.). testing, location propagation for debugging optimized code, textual round tripping, pass management, and the multitude of other things that are required for a world class compiler implementation.

In writing the RFC I am also trying to win over contributors to the
usefulness of the idea, and I think it is feasible. For example, Polly
started as a research project and attracted contributors, but did not
succeed in becoming a main component of LLVM for other reasons.
I think if LLVM also wants a world-class loop optimizer, some time
needs to be invested.

The discussion here is valuable for me, helping me to make my
presentation about it at EuroLLVM as relevant as possible. My current
idea is to take a complex loop nest, and compare optimizing it using
red/green DAGs and traditional pass-based optimizers.

Chris Lattner via llvm-dev

unread,
Feb 7, 2020, 2:07:07 PM2/7/20
to Hal Finkel, llvm-dev
Hi Hal,

Just a disclaimer, I’m rearranging your comments below, I hope you don’t mind:

lots of loop optimizers exist that don’t use red/green trees.

I've worked with many compilers, some from vendors known for having strong loop optimizers, and none of them would meet our goals for performance, robustness, and modeling accuracy (e.g., not regressing performance in many cases). Our goal here is to significantly improve on the state of the art.

Agreed, that is the right goal, no disagreement on that!  I also agree that phase ordering is a very difficult problem to solve, and that red/green trees are one interesting approach to help with them.


Furthermore, my experience is that specialized IRs never get the investment in (e.g.) testing, location propagation for debugging optimized code, textual round tripping, pass management, and the multitude of other things that are required for a world class compiler implementation.

I agree, but I think that these are well-known requirements within this ecosystem. What this means also depends on context (it might be more helpful to think of this more like SCEV than like a custom IR).

SCEV is a great example of the problem that I’m talking about here.  SCEV cannot be easily tested, because the IR doesn’t round trip to a file-check-able textual representation, and cannot have passes applied to it.  This is barely acceptable for SCEV, because there are workarounds for the simple cases, but would be extremely problematic for something as critical as the IR for a loop optimizer as a whole.  SCEV also doesn’t propagate location information, and has several other problems.

The IR for a loop framework is going to require other duplications and have other problems, e.g. you’ll want a pass manager etc.

On Feb 5, 2020, at 4:51 PM, Hal Finkel <hfi...@anl.gov> wrote:
If I understand your claims, you are claiming both that red/green trees are essential for a loop optimizer, and that this essential nature “justifies” the cost of reinventing an entire compiler infrastructure is lower than the benefit of using (e.g.) MLIR to do this.  I haven’t seen evidence of either point:

Michael and I have discussed this offline, and I think that more quantitative information here would be helpful. One phrasing might be: For a reasonable test suite of programs, for the loops we might optimize, how many different loop-nest variants do we need to represent for costing purposes (also note that, depending on profiling information and/or heuristics, we may need to cost at multiple trip-count values)? Given that, what is the relative cost of creating deep copies of the loop nests and transforming them vs. the cost of using a red/green tree? Does that cost seem significant?

We don't want to invest significant effort into an infrastructure that we know ahead of time won't scale sufficiently.

Again, as I mentioned above, I completely agree with your goals.  That said, I have a few concerns about this:

1) My understanding is that this approach is more of a research project that needs to be proven, than an implemented design the entire LLVM community should back as “the plan".  If you’d like to pursue this, then I think the right way to go is to do some significant implementation work to build it out, see how it works, get experience with it, then propose inclusion as “the” loop optimizer when it can be measured empirically.  This is effectively what Polly did.

2) On MLIR, in addition to being a nice technical solution to the IR data structures and infrastructure backing it, a large slice of the MLIR community has exactly the same goals as you propose.  They are literally building loop transformation technology driving state of the art forward, and they also have phase ordering issues :-).  These folks have lots of advanced thoughts on this, and a lot of expertise in vectorization, memory hierarchy optimizations, accelerators,  HPC, and low-level code generation etc.  Leveraging this work makes sense to me, and duplicating it seems wasteful.

3) On the technical point design of red/green trees, my personal opinion is that this isn’t solving the important part of the problem.  Red/green trees (as far as I’m aware) are a memory saving optimization.  However, “solving" the phase ordering issues require exploring an exponential (e.g. in depth of transformations / pass pipeline) search space problem (which is expensive in cycles, not just memory), and it relies on having a very good cost model for the generated code (which is a universally hard problem).  Furthermore, on the cost model point, the performance of the code depends on the later “non loop optimizations” as well, which are not going to get changed to use this new IR.

4) While red/green trees are interesting, there are other approaches to tackle these problems.  However, they are also researchy and need to be proven.  I’m happy to explain my thoughts on this if you are interested.


Overall, I am super enthusiastic about new research directions, but I think they should be done along the lines of Polly, where the “new thing” gets built and can then be quantitatively evaluated based on its merits in practice, rather than being an up-front decision based on ideas.  I am also a bit concerned that this effort is highly duplicative of work already happening in the MLIR community, but duplication isn’t necessarily bad when the path forward isn’t perfectly clear.  In either case, I’d love to see more cross pollination between the efforts!

-Chris





Chris Lattner via llvm-dev

unread,
Feb 7, 2020, 6:04:14 PM2/7/20
to Michael Kruse, llvm-dev, Nicolas Vasilache, Albert Cohen
> On Feb 5, 2020, at 8:22 PM, Michael Kruse <llv...@meinersbur.de> wrote:
>
> Am Mi., 5. Feb. 2020 um 18:13 Uhr schrieb Chris Lattner <clat...@nondot.org>:
>> If I understand your claims, you are claiming both that red/green trees are essential for a loop optimizer, and that this essential nature “justifies” the cost of reinventing an entire compiler infrastructure is lower than the benefit of using (e.g.) MLIR to do this. I haven’t seen evidence of either point: lots of loop optimizers exist that don’t use red/green trees.
>
> I have spoken with multiple compiler vendors and we all could agree
> that the approach of pass/transformation-local decision making limits
> the global optimization goal the more transformations there are,
> seriously limiting the quality of global optimizations. Of those who
> gave me some details, several have a loop-specific data structure
> (although not "red/green" trees). I don't suggest replacing anything
> in the existing toolchain/pipeline, so there is no risk of regressing
> anything that already works today. Every advancement will need to show
> its advantages at some point; this was already the case with e.g. SSA
> form.

Sure, to reiterate, I think that having an explicit loop nest representation is super important. My question was about red/green trees specifically.

>> Furthermore, my experience is that specialized IRs never get the investment in (e.g.). testing, location propagation for debugging optimized code, textual round tripping, pass management, and the multitude of other things that are required for a world class compiler implementation.
>
> In writing the RFC I am also trying to win over contributors to the
> usefulness of the idea, and I think it is feasible. For example, Polly
> started as a research project and attracted contributors, but did not
> succeed in becoming a main component of LLVM for other reasons.
> I think if LLVM also wants a world-class loop optimizer, some time
> needs to be invested.
>
> The discussion here is valuable for me, helping me to make my
> presentation about it at EuroLLVM as relevant as possible. My current
> idea is to take a complex loop nest, and compare optimizing it using
> red/green DAGs and traditional pass-based optimizers.

Cool. I’d really recommend you connect with some of the loop optimization people working on MLIR to learn more about what they are doing, because it is directly related to this and I’d love for there to be more communication.

I’ve cc'd Nicolas Vasilache, Uday Bondhugula, and Albert Cohen as examples that would be great to connect with.

-Chris

Michael Kruse via llvm-dev

unread,
Feb 8, 2020, 1:11:12 AM2/8/20
to Chris Lattner, llvm-dev
Am Fr., 7. Feb. 2020 um 13:06 Uhr schrieb Chris Lattner <clat...@nondot.org>:

Furthermore, my experience is that specialized IRs never get the investment in (e.g.) testing, location propagation for debugging optimized code, textual round tripping, pass management, and the multitude of other things that are required for a world class compiler implementation.

I agree, but I think that these are well-known requirements within this ecosystem. What this means also depends on context (it might be more helpful to think of this more like SCEV than like a custom IR).

SCEV is a great example of the problem that I’m talking about here.  SCEV cannot be easily tested, because the IR doesn’t round trip to a file-check-able textual representation, and cannot have passes applied to it.  This is barely acceptable for SCEV, because there are workarounds for the simple cases, but would be extremely problematic for something as critical as the IR for a loop optimizer as a whole.  SCEV also doesn’t propagate location information, and has several other problems.

A textual representation is not the highest in my priority list, but certainly doable. I'd wait until the data structure settles to something sufficiently stable. For instance, does the textual representation encompass the entire version history or just one snapshot (i.e. just the red tree?). Until then, I think it is sufficient when there is a predictable path from IR to red/green tree.


The IR for a loop framework is going to require other duplications and have other problems, e.g. you’ll want a pass manager etc.

I am not sure that a pass manager is that necessary. It adds flexibility to the optimization process, but as long as we don't support dynamically adding transformations, I think organizing the transformations in code is sufficient at the beginning. Since transformations apply to subtrees, are prioritized, and can add candidates to be selected by a cost function, a pass manager would have to be structured differently.
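A toy sketch of what such a structure could look like instead of a traditional pass pipeline (all names and cost values are hypothetical): transformations propose candidate versions of a subtree, and a shared cost function selects the cheapest one.

```python
# Hypothetical sketch: instead of a fixed pass ordering, each
# "transformation" yields a speculative candidate variant of a
# (sub)tree, and a shared cost model picks the best one.

def candidates(tree):
    """Yield the original tree plus toy transformed variants."""
    yield tree                      # original as fallback
    yield ("unrolled", tree)        # e.g. an unroll-by-2 variant
    yield ("interchanged", tree)    # e.g. a loop-interchange variant

def cost(tree):
    """Toy cost model standing in for a real profitability heuristic."""
    costs = {
        "loop": 10,
        ("unrolled", "loop"): 7,
        ("interchanged", "loop"): 9,
    }
    return costs.get(tree, 10)

# Candidate selection: keep all variants alive, pick the cheapest.
best = min(candidates("loop"), key=cost)
assert best == ("unrolled", "loop")
```

With cheap subtree reuse, keeping all candidates alive until selection is affordable, which is exactly why the selection step can replace a fixed pass ordering.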


Michael and I have discussed this offilne, and I think that more quantitative information here would be helpful. One phrasing might be: For a reasonable test suite of programs, for the loops we might optimize, how many different loop-nest variants do we need to represent for costing purposes (also note that, depending on profiling information and/or heuristics, we may need to cost at multiple trip-count values)? Given that, what is the relative cost of creating deep copies of the loop nests and transforming them vs. the cost of using a red/green tree? Does that cost seem significant?

We don't want to invest significant effort into an infrastructure that we know ahead of time won't scale sufficiently.

Again, as I mentioned above, I completely agree with your goals.  That said, I have a few concerns about this:

1) My understanding is that this approach is more of a research project that needs to be proven, than an implemented design the entire LLVM community should back as “the plan".  If you’d like to pursue this, then I think the right way to go is to do some significant implementation work to build it out, see how it works, get experience with it, then propose inclusion as “the” loop optimizer when it can be measured empirically.  This is effectively what Polly did.

100% agree
 

2) On MLIR, in addition to being a nice technical solution to the IR data structures and infrastructure backing it, a large slice of the MLIR community has exactly the same goals as you propose.  They are literally building loop transformation technology driving state of the art forward, and they also have phase ordering issues :-).  These folks have lots of advanced thoughts on this, and a lot of expertise in vectorization, memory hierarchy optimizations, accelerators,  HPC, and low-level code generation etc.  Leveraging this work makes sense to me, and duplicating it seems wasteful.

Agreed, although I think there is some difference in the level we operate on. My focus was on optimizing code coming from relatively low-level C/C++, ensuring that we do not lose opportunities because the high-level representation does not exactly match the low-level IR. This does not mean we cannot learn from each other's experience.


3) On the technical point design of red/green trees, my personal opinion is that this isn’t solving the important part of the problem.  Red/green trees (as far as I’m aware) are a memory saving optimization.

Not entirely; there would also be a run-time cost if we had to copy the complete IR to create a new "red" representation. If this were necessary, I think it would be a barrier to speculatively applying transformations.

 
 However, “solving" the phase ordering issues require exploring an exponential (e.g. in depth of transformations / pass pipeline) search space problem (which is expensive in cycles, not just memory), and it relies on having a very good cost model for the generated code (which is a universally hard problem).  Furthermore, on the cost model point, the performance of the code depends on the later “non loop optimizations” as well, which are not going to get changed to use this new IR.

Very good point. I don't claim a red/green tree is the universal solution to this problem, but I think it makes problem space exploration significantly easier, especially when combining multiple transformations before knowing the outcome.

As you point out, "non-loop optimizations" will be necessary after loop optimizations as well (as with LLVM's current loop optimizations: for instance, undoing LCSSA), such as new InstCombine opportunities after loop fusion. These can be difficult to predict by a cost function, but they are usually less significant than what can be gained by a loop transformation. Others, such as LICM, can be significant though, and should be considered by a cost function (or simply applied to the red/green tree before evaluating the cost function, so that they are considered automatically).

This is why I am enthusiastic about speculative transformations -- a generic cost function can handle many cases without requiring code for each transformation.
 


4) While red/green trees are interesting, there are other approaches to tackle these problems.  However, they are also researchy and need to be proven.  I’m happy to explain my thoughts on this if you are interested.


Overall, I am super enthusiastic about new research directions, but I think they should be done along the lines of Polly, where the “new thing” gets built and can then be quantitatively evaluated based on its merits in practice, rather than being an up-front decision based on ideas.  I am also a bit concerned that this effort is highly duplicative of work already happening in the MLIR community, but duplication isn’t necessarily bad when the path forward isn’t perfectly clear.  In either case, I’d love to see more cross pollination between the efforts!

One of the motivations for writing the RFC is to clarify goals and requirements. I would like to avoid the situation we had with Polly, where 'blockers' were discovered only when mainstreaming.

Another motivation is to find more interested parties. 


Michael

Michael Kruse via llvm-dev

unread,
Feb 8, 2020, 1:17:27 AM2/8/20
to Chris Lattner, llvm-dev, Nicolas Vasilache, Albert Cohen
On Fri, Feb 7, 2020 at 17:03, Chris Lattner <clat...@nondot.org> wrote:
> The discussion here is valuable for me, helping me to make my
> presentation about it at EuroLLVM as relevant as possible. My current
> idea is to take a complex loop nest, and compare optimizing it using
> red/green DAGs and traditional pass-based optimizers.

Cool.  I’d really recommend you connect with some of the loop optimization people working on MLIR to learn more about what they are doing, because it is directly related to this and I’d love for there to be more communication.

I’ve cc'd Nicolas Vasilache, Uday Bondhugula, and Albert Cohen as examples that would be great to connect with.

You may have already seen my discussion with Uday on the mailing list. I would like to discuss approaches with all 3 of them, at the latest at EuroLLVM (or contact me before that, e.g. on this mailing-list thread).

Michael 

Chris Lattner via llvm-dev

unread,
Feb 8, 2020, 12:11:32 PM2/8/20
to Michael Kruse, llvm-dev, Nicolas Vasilache, Albert Cohen
Great!  I’d be happy to join a discussion about this at EuroLLVM too.  I think it is really important to improve the loop optimizer in LLVM, and I’m glad you’re pushing on it!

-Chris

Uday Kumar Reddy Bondhugula via llvm-dev

unread,
Feb 8, 2020, 1:22:37 PM2/8/20
to Michael Kruse, llvm-dev
I suspect that you are looking at this as "MLIR vs your proposal", but the point I've been trying to make is to compare "your proposal on MLIR" vs "your proposal on LLVM", i.e., you may want to evaluate the cost/benefit of hosting your approach on MLIR vis-a-vis LLVM while accounting for the fact that you'll need the conversion from LLVM to MLIR and back for the relevant IR units with the former. On this note, in my opinion, MLIR doesn't have a unified/clear strategy for loop optimization yet (even op representation wise); there are different ops/representations that are being peddled by different folks, and some I believe are redundant or are simply duplicating infrastructure in less powerful ways. I'm saying this because if you choose MLIR to host your stuff, you'll have choices to make on which forms to use.
 

>> (which I firmly disagree with being close to MLIR's in-memory representation), not the textual format.
>
> "close" isn't the right word here, "closer" is! Would you agree that the representation you are proposing is closer to MLIR's representation (both its in-memory and its textual representation) than to LLVM's or is this proximity not really relevant for the purposes of your proposal? I think it's important to know which among the two is the more important question.

I think MLIR is an evolution of LLVM-IR and Swift-IR, built around
similar principles such as SSA and Control-Flow Graphs (I understand
that in addition to CFGs, MLIR also enables structured control flow
idioms). SSA is a distinguishing feature: it allows quickly
traversing def/use chains (where compilers without it need a data-flow
analysis), but makes subtree reuse hard.

Does this answer your question?

Not directly, but I think I have more information now. You are probably saying it's nearly as easy/difficult to host your approach on LLVM IR as it is on MLIR. 

All the best!

~ Uday


 


> Note that currently there is really very little difference between MLIR's textual format for 'for'/if's and the in-memory form its passes use. The passes don't create any in-memory higher order representations for these IR units; they directly update them. There is nothing like the kind of complementary abstractions you are proposing (for cost models, copy-wise, etc.).

The point I was making is that the in-memory format has references to
related items such as parents and use-def chains. These are implicit
in the textual format, e.g. the parent of an operation is defined by
its syntactical location. When reading into memory, it is not
obligatory for the objects to have all the references to related
objects.


Michael

Nicolas Vasilache via llvm-dev

unread,
Feb 10, 2020, 12:53:40 PM2/10/20
to Albert Cohen, llvm-dev
Albert, please loop me in too in your discussions (haha..), I'll be out the whole of next week though.


On Mon, Feb 10, 2020 at 10:50 AM Albert Cohen <alber...@google.com> wrote:
Thanks Chris for CCing us.

I remember Michael's presentation and suggestion to consider Roslyn's design and experience. I'll be glad to discuss further in April. Michael, we can also talk later this week if you'd like. I'll send you a separate email.

Loop transformations in MLIR take many different paths. Polyhedral affine-to-affine, generative/lowering in linalg, and also exploring lifting lower level constructs into affine and further into linalg and tensor compute ops. I'm all for exchanging on the rationale, use cases, and design of these paths, alongside with LLVM LNO. One practical option being to compose these lifting, affine transformations and lowering steps to build an LLVM LNO. Ideally this could be one generic effort that would also interop with numerical libraries or specialized hardware ops where they exist, and that can be used to implement domain-specific code generators.

More on Roslyn: I'm not convinced yet about the added value of red-green trees. I see them as an implementation detail at the moment: much like sea of nodes is a convenient abstraction for implementing SSA-based global optimization passes, red-green trees may improve on the practice of AST/loop-nest transformations, but I don't see much fundamental or solid engineering benefits... yet.

Albert




--
N

Michael Kruse via llvm-dev

unread,
Feb 10, 2020, 1:04:48 PM2/10/20
to Uday Kumar Reddy Bondhugula, llvm-dev
On Sat, Feb 8, 2020 at 12:22, Uday Kumar Reddy Bondhugula
<ud...@polymagelabs.com> wrote:

>> To convince me that MLIR is the better IR for loop optimizations,
>> one might show that each of the features enabled by cheap subtree reuse
>> can also be done sufficiently efficiently and easily on MLIR (or that a
>> feature is not what we would actually want).
>
>
> I suspect that you are looking at this as "MLIR vs your proposal", but the point I've been trying to make is to compare "your proposal on MLIR" vs "your proposal on LLVM", i.e., you may want to evaluate the cost/benefit of hosting your approach on MLIR vis-a-vis LLVM while accounting for the fact that you'll need the conversion from LLVM to MLIR and back for the relevant IR units with the former. On this note, in my opinion, MLIR doesn't have a unified/clear strategy for loop optimization yet (even op representation wise); there are different ops/representations that are being peddled by different folks, and some I believe are redundant or are simply duplicating infrastructure in less powerful ways. I'm saying this because if you choose MLIR to host your stuff, you'll have choices to make on which forms to use.

I was indeed interpreting your arguments as "MLIR vs your proposal".
As mentioned in the original RFC, I would also like to make "proposal
on MLIR" work (mostly for flang), but "proposal on LLVM" for now seems
to offer the greater benefit. MLIR has the advantage that it can carry
more semantics from the front-end that can be used (multi-dimensional
array subscripts, already parallel loops, ...).

>> >> (which I firmly disagree with being close to MLIR's in-memory representation), not the textual format.
>> >
>> > "close" isn't the right word here, "closer" is! Would you agree that the representation you are proposing is closer to MLIR's representation (both its in-memory and its textual representation) than to LLVM's or is this proximity not really relevant for the purposes of your proposal? I think it's important to know which among the two is the more important question.
>>
>> I think MLIR is an evolution of LLVM-IR and Swift-IR, built around
>> similar principles such as SSA and Control-Flow Graphs (I understand
>> that in addition to CFGs, MLIR also enables structured control flow
>> idioms). SSA is a distinguishing feature: it allows quickly
>> traversing def/use chains (where compilers without it need a data-flow
>> analysis), but makes subtree reuse hard.
>>
>> Does this answer your question?
>
>
> Not directly, but I think I have more information now. You are probably saying it's nearly as easy/difficult to host your approach on LLVM IR as it is on MLIR.

Yes.

Nicolai Hähnle via llvm-dev

unread,
Feb 10, 2020, 3:52:57 PM2/10/20
to Michael Kruse, llvm-dev
On Sat, Feb 8, 2020 at 7:11 AM Michael Kruse via llvm-dev
<llvm...@lists.llvm.org> wrote:
> I am not sure that a pass manager is that necessary. It adds flexibility to the optimization process, but as long as we don't support dynamically adding transformations, I think organizing the transformations in code is sufficient at the beginning. Since transformations apply on subtrees, are prioritized, and can add candidates to be selected by a cost function, a pass manager has to be structured differently.

It can be convenient to be able to just plug different pass orders
together in a tool like opt, to explore what happens with different
pass structures.


[snip]


>> 3) On the technical point design of red/green trees, my personal opinion is that this isn’t solving the important part of the problem. Red/green trees (as far as I’m aware) are a memory saving optimization.
>
> Not entirely, there would also be a run-time cost if we had to copy the complete IR before for creating a new "red" representation. If this was necessary, I think it would be a barrier to speculatively apply transformations.

Agreed.


>> However, “solving" the phase ordering issues require exploring an exponential (e.g. in depth of transformations / pass pipeline) search space problem (which is expensive in cycles, not just memory), and it relies on having a very good cost model for the generated code (which is a universally hard problem). Furthermore, on the cost model point, the performance of the code depends on the later “non loop optimizations” as well, which are not going to get changed to use this new IR.
>
> Very good point. I don't claim a red/green tree is the universal solution to this problem, but I think it makes problem space exploration significantly easier, especially when combining multiple transformations before knowing the outcome.
>
> As you point out, "non-loop optimizations" will be necessary after loop optimizations as well (as with LLVM's current loop optimizations: for instance, undoing LCSSA), such as new InstCombine opportunities after loop fusion. These can be difficult to predict by a cost function, but they are usually less significant than what can be gained by a loop transformation. Others, such as LICM, can be significant though, and should be considered by a cost function (or simply applied to the red/green tree before evaluating the cost function, so that they are considered automatically).

I've actually seen a bunch of code with loops that would traditionally
be considered quite weird, where loop unrolling unlocks a huge number
of InstCombine and even SimplifyCFG opportunities to the point where
after loop unrolling, the code is barely larger than the original
code. So being able to speculatively combine not just loop passes but
also generic IR passes is definitely interesting.

Cheers,
Nicolai


--
Learn how the world really is,
but never forget how it ought to be.

Michael Kruse via llvm-dev

unread,
Feb 10, 2020, 8:35:00 PM2/10/20
to Nicolai Hähnle, llvm-dev
On Mon, Feb 10, 2020 at 14:52, Nicolai Hähnle <nhae...@gmail.com> wrote:
> On Sat, Feb 8, 2020 at 7:11 AM Michael Kruse via llvm-dev
> <llvm...@lists.llvm.org> wrote:
> > I am not sure that a pass manager is that necessary. It adds flexibility to the optimization process, but as long as we don't support dynamically adding transformations, I think organizing the transformations in code is sufficient at the beginning. Since transformations apply on subtrees, are prioritized, and can add candidates to be selected by a cost function, a pass manager has to be structured differently.
>
> It can be convenient to be able to just plug different pass orders
> together in a tool like opt, to explore what happens with different
> pass structures.

Such a tool would need to know which subtree to apply to, or define a
traversal order (such as innermost-to-outermost, revisiting modified
subtrees and applying each transformation at most once). I currently
don't know what makes the most sense.

For regression tests, I'd prefer to use unittests: a test holds a
reference to a node, applies the transformation, and validates the
result via introspection/visitor/matcher patterns.


> I've actually seen a bunch of code with loops that would traditionally
> be considered quite weird, where loop unrolling unlocks a huge number
> of InstCombine and even SimplifyCFG opportunities to the point where
> after loop unrolling, the code is barely larger than the original
> code. So being able to speculatively combine not just loop passes but
> also generic IR passes is definitely interesting.

Really interesting scenario; it might be worth implementing a
subset of InstCombine on the tree representation. In contrast, I'd
have no clue how to make the current LoopUnroll heuristic consider
this.

Michael

--

> --
> Learn how the world really is,
> but never forget how it ought to be.

If only we could agree on how it should be ;-)

Albert Cohen via llvm-dev

unread,
Feb 10, 2020, 11:12:58 PM2/10/20
to Chris Lattner, llvm-dev, Nicolas Vasilache
Thanks Chris for CCing us.

I remember Michael's presentation and suggestion to consider Roslyn's design and experience. I'll be glad to discuss further in April. Michael, we can also talk later this week if you'd like. I'll send you a separate email.

Loop transformations in MLIR take many different paths. Polyhedral affine-to-affine, generative/lowering in linalg, and also exploring lifting lower level constructs into affine and further into linalg and tensor compute ops. I'm all for exchanging on the rationale, use cases, and design of these paths, alongside with LLVM LNO. One practical option being to compose these lifting, affine transformations and lowering steps to build an LLVM LNO. Ideally this could be one generic effort that would also interop with numerical libraries or specialized hardware ops where they exist, and that can be used to implement domain-specific code generators.

More on Roslyn: I'm not convinced yet about the added value of red-green trees. I see them as an implementation detail at the moment: much like sea of nodes is a convenient abstraction for implementing SSA-based global optimization passes, red-green trees may improve on the practice of AST/loop-nest transformations, but I don't see much fundamental or solid engineering benefits... yet.

Albert



On Sat, Feb 8, 2020 at 6:11 PM Chris Lattner <clat...@nondot.org> wrote:

Hal Finkel via llvm-dev

unread,
Feb 11, 2020, 6:04:29 AM2/11/20
to Albert Cohen, Chris Lattner, llvm-dev, Nicolas Vasilache


On 2/10/20 9:50 AM, Albert Cohen via llvm-dev wrote:
Thanks Chris for CCing us.

I remember Michael's presentation and suggestion to consider Roslyn's design and experience. I'll be glad to discuss further in April. Michael, we can also talk later this week if you'd like. I'll send you a separate email.

Loop transformations in MLIR take many different paths. Polyhedral affine-to-affine, generative/lowering in linalg, and also exploring lifting lower level constructs into affine and further into linalg and tensor compute ops. I'm all for exchanging on the rationale, use cases, and design of these paths, alongside with LLVM LNO. One practical option being to compose these lifting, affine transformations and lowering steps to build an LLVM LNO. Ideally this could be one generic effort that would also interop with numerical libraries or specialized hardware ops where they exist, and that can be used to implement domain-specific code generators.

More on Roslyn: I'm not convinced yet about the added value of red-green trees. I see them as an implementation detail at the moment: much like sea of nodes is a convenient abstraction for implementing SSA-based global optimization passes, red-green trees may improve on the practice of AST/loop-nest transformations, but I don't see much fundamental or solid engineering benefits... yet.


Hi, Albert,

I definitely think that we're focused too much on the particular data-structure choice being proposed, rather than on the motivation for that proposal. At a high level, it seems like the questions here are:

 1. When considering the optimization of a given loop nest, how many potential transformations should be considered?

 2. For those potential transformations, how many of them need to be given a concrete representation in order to generate a cost?

As I see it, the underlying motivation here anticipates that:

 1. The number of potential transformations might be large - not that it always must be large, but that the infrastructure should support it being large, and...

 2. A large number of them need a concrete representation in order to generate a cost (because we'll need to hand off certain aspects to backend models, etc.), given one or more potential sets of trip-count values.

To start, do you have thoughts on these?

Thanks,

Hal



Albert



On Sat, Feb 8, 2020 at 6:11 PM Chris Lattner <clat...@nondot.org> wrote:


On Feb 7, 2020, at 10:16 PM, Michael Kruse <llv...@meinersbur.de> wrote:

On Fri, Feb 7, 2020 at 17:03, Chris Lattner <clat...@nondot.org> wrote:
> The discussion here is valuable for me, helping me to make my
> presentation about it at EuroLLVM as relevant as possible. My current
> idea is to take a complex loop nest, and compare optimizing it using
> red/green DAGs and traditional pass-based optimizers.

Cool.  I’d really recommend you connect with some of the loop optimization people working on MLIR to learn more about what they are doing, because it is directly related to this and I’d love for there to be more communication.

I’ve cc'd Nicolas Vasilache, Uday Bondhugula, and Albert Cohen as examples that would be great to connect with.

You may have already seen my discussion with Uday on the mailing list. I would like to discuss approaches with all 3 of them, at the latest at EuroLLVM (or contact me before that, e.g. on this mailing-list thread).

Great!  I’d be happy to join a discussion about this at EuroLLVM too.  I think it is really important to improve the loop optimizer in LLVM, and I’m glad you’re pushing on it!

-Chris


_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev