General question about FLAME methodology and libflame


SUYASH BAKSHI

Jan 3, 2022, 10:28:35 AM
to libflame-discuss
Hello,
I am fairly new to the FLAME methodology and the libflame library, and I want to get a general idea of what they can be used for. From a few of the listed publications, I understand that libflame is a high-performance dense linear algebra (DLA) library derived using the FLAME methodology.
But the specific question I have is: how is the methodology different from some of the recent frameworks (e.g. dMazeRunner, ZigZag, MAESTRO, Timeloop) that search for optimal implementations (i.e. loop ordering, tiling, PE parallelism) of loop-nest-based algorithms on different types of hardware architectures?

I'd greatly appreciate any comments/reading material that would be helpful to get a good understanding.

Thank you!

Robert van de Geijn

Jan 3, 2022, 7:52:39 PM
to libflame-discuss
The fundamental difference between the FLAME methodology and the various compiler approaches is that with FLAME, you can discover algorithms for operations for which no algorithms or implementations were previously known.

Now, beyond that, it has been shown that with the FLAME methodology you can sidestep the usual loop-ordering problems (the phase-ordering problem). This was shown in Tze Meng Low's dissertation (available from https://www.cs.utexas.edu/~flame/web/FLAMEPublications.html). Tiling is covered by deriving blocked algorithms. I'll ask Tze Meng to comment further.
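To make "tiling is covered by deriving blocked algorithms" concrete, here is a generic sketch of a blocked matrix-matrix multiply. This is an illustration of blocking in general, not code from libflame, and the block size `nb` is an arbitrary placeholder:

```python
# Sketch: a blocked matrix-matrix multiply, C += A * B.
# Generic illustration of tiling via blocking; not libflame code.

def blocked_gemm(A, B, C, nb=2):
    """Update C += A @ B by marching over nb-by-nb blocks."""
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, nb):          # block row of C
        for j0 in range(0, n, nb):      # block column of C
            for p0 in range(0, k, nb):  # block of the inner dimension
                for i in range(i0, min(i0 + nb, m)):
                    for j in range(j0, min(j0 + nb, n)):
                        for p in range(p0, min(p0 + nb, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
C = [[0] * 3 for _ in range(3)]
blocked_gemm(A, B, C)  # C now equals A, since B is the identity
```

In a derived blocked algorithm the three outer block loops correspond to a partitioning of the operands; in a high-performance implementation `nb` would be chosen to fit a level of cache.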

You may also want to look at Bryan Marker's dissertation, also available from https://www.cs.utexas.edu/~flame/web/FLAMEPublications.html.

Tze Meng Low

Jan 6, 2022, 4:34:41 PM
to libflame-discuss
This will be a long reply. IMO, FLAME and the mentioned frameworks take different approaches toward implementing high-performance code.

I will try to describe the differences with an analogy.

Let us imagine the process of implementing/optimizing a high-performance code as a tree, where the root is the unoptimized version, the intermediate nodes are different decision/optimization points, and the leaves are the different codes obtained by applying those decisions/optimizations along the path from the root to the leaf. In addition, let us assume that the tree describes all code paths attainable through the classical loop transformations (tiling, permuting, etc.).

The above-mentioned frameworks (among many others) can be viewed as performing a depth-first search through the tree to find an optimal solution. A path from root to leaf is chosen and an implementation is generated and evaluated against a predefined metric (energy, power, performance) and then the process of choosing a different path is repeated. Different frameworks use different heuristics and techniques to more effectively search the tree for an optimized implementation.
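The search described above can be sketched as an exhaustive walk over candidate (loop order, tile size) pairs, each scored by a cost model. The cost function below is a made-up stand-in for the energy/power/performance metrics those frameworks use; real frameworks prune the tree with heuristics rather than enumerating it:

```python
# Sketch: exhaustive search over an optimization tree. Each candidate is
# one root-to-leaf path (a loop order plus a tile size); the cost model
# is a hypothetical stand-in for an energy/power/performance metric.
from itertools import permutations

def toy_cost(loop_order, tile):
    """Hypothetical cost model: pretend 'ijk' order and tile 32 are best."""
    penalty = sum(a != b for a, b in zip(loop_order, "ijk"))
    return penalty * 10 + abs(tile - 32)

candidates = [("".join(order), tile)
              for order in permutations("ijk")
              for tile in (8, 16, 32, 64)]

best = min(candidates, key=lambda c: toy_cost(*c))
print(best)  # ('ijk', 32) under this toy model
```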

The FLAME methodology is different in that every derived FLAME algorithm is either a subtree or a segment of a path. Picking a loop invariant in the FLAME methodology means that a sequence of one or more decisions/optimizations has been picked. Therefore, if an algorithm needs a high amount of parallelism in order to map it to PEs, then it is a matter of picking the appropriate loop invariant(s) that will yield an algorithm with that desired performance characteristic. Different FLAME algorithms can be composed to obtain a path (or paths) from the root to different implementations. This means that one algorithm can be used to optimize for the number of PEs, while another algorithm may be layered on top (thereby creating a tiled algorithm), where this newer algorithm may be chosen to reduce memory traffic between different layers of the memory hierarchy.
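A small concrete illustration of "different loop invariants yield different algorithms": the two routines below both compute y := A x, but one traverses A by rows (dot-product based) and the other by columns (axpy based), giving different memory-access and parallelization characteristics. This is a generic example written as plain loops, not FLAME-derived code:

```python
# Two algorithmic variants of y := A x. In FLAME terms, each corresponds
# to a different loop invariant over a partitioning of A; here they are
# written as plain loops for illustration.

def gemv_by_rows(A, x):
    """Dot-product variant: each y[i] is one independent reduction."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def gemv_by_cols(A, x):
    """Axpy variant: accumulate x[j] times column j of A into y."""
    m, n = len(A), len(x)
    y = [0] * m
    for j in range(n):
        for i in range(m):
            y[i] += A[i][j] * x[j]
    return y

A = [[1, 2], [3, 4]]
x = [5, 6]
assert gemv_by_rows(A, x) == gemv_by_cols(A, x) == [17, 39]
```

The row variant maps each y[i] to an independent unit of work (good for PE parallelism), while the column variant streams through A column by column (a different memory-traffic profile); picking between them is analogous to picking a loop invariant.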

SUYASH BAKSHI

Jan 7, 2022, 11:49:40 AM
to libflame-discuss
Thank you Robert and Tze! The dissertations and the analogy are very insightful. I have a few follow-up questions if you don't mind.

  1.  If my understanding of loop invariants is correct from Bryan Marker's dissertation, they are (or can be) generated from a knowledge base of optimization techniques. My questions are:
    1. Are loop invariants generalizable across different problem sizes? That is, can a loop invariant that can theoretically be expected to reduce memory traffic be expected to do so for all problem sizes? For example, for convolutions, multiple publications have shown that the same loop order or tile sizes may result in different costs depending on the size/dimensions of the output or weight operands.
    2. How are the loop invariants parameterized for different problem sizes? In Bryan's dissertation, it is mentioned that the tunnels in the DAGs can encode the partitioning direction and block sizes. But in section 5.2.7, it is mentioned that the block sizes were hand-tuned. So is an auto-tuning step expected once the invariants have been generated? (I feel like I'm probably mixing up two independent concepts, i.e. the loop invariants and the DAG representation used in DxTer.)
    3. Is the knowledge base generalizable to different hardware architectures?

  2. Are there any references that provide details about the time taken to find an optimal algorithm for different operations? I was able to find performance comparisons of FLAME/DxTer-generated codes with other BLAS libraries, but not the search time required to find the optimal algorithm.

Thank you, and please feel free to correct me if my understanding of any concept seems incorrect/incomplete.