I think a similar result could be obtained by combining:
- Block matrices
- Structural recursion
- "Tail recursion modulo cons"
The last of these turns the recursion into a `while` loop. This should at least derive C1, the "lazy" algorithm.
I think some of the C2-C16 variants can be obtained by moving updates from the end of the while-loop body to its beginning.
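To make the idea concrete, here is a minimal sketch using forward substitution (solving `L x = b` for lower-triangular `L`) as a hypothetical running example; the actual computation behind C1-C16 isn't specified above, so the correspondence is only illustrative. The recursive version works on the 2x2 block partition; because only a single cons follows the recursive call, "tail recursion modulo cons" turns it into a while loop, and moving the update step between the end and the beginning of the loop body gives two different variants.

```python
def forwardsub_rec(L, b):
    """Structural recursion on the block partition
       L = [[l11, 0], [l21, L22]], b = [b1; b2]."""
    if not b:
        return []
    x1 = b[0] / L[0][0]
    # Update the trailing right-hand side: b2 := b2 - l21 * x1.
    b2 = [bi - row[0] * x1 for row, bi in zip(L[1:], b[1:])]
    L22 = [row[1:] for row in L[1:]]
    # Only a single cons follows the recursive call ("tail recursion
    # modulo cons"), so this recursion can become a while loop.
    return [x1] + forwardsub_rec(L22, b2)

def forwardsub_eager(L, b):
    """While-loop form: the update sits at the END of the loop body,
       applied eagerly to the remaining entries of b."""
    b = list(b)
    x = []
    j = 0
    while j < len(b):
        x.append(b[j] / L[j][j])
        for i in range(j + 1, len(b)):
            b[i] -= L[i][j] * x[j]
        j += 1
    return x

def forwardsub_lazy(L, b):
    """Variant obtained by moving the update from the end of the loop
       body to its BEGINNING: each b[j] is corrected only when needed."""
    x = []
    j = 0
    while j < len(b):
        s = b[j] - sum(L[j][k] * x[k] for k in range(j))
        x.append(s / L[j][j])
        j += 1
    return x
```

All three compute the same result; the eager and lazy loops differ only in where the update work is placed, which is the kind of transformation that might distinguish the variants.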
I'm still wondering why the other 15 variants might be useful. Is cache behavior the reason?
To explain my interest: I'm looking for very general methods for deriving numerical linear algebra algorithms.