First, the caveats:
Sivan Toledo's recursive LU factorization ( http://dx.doi.org/10.1137/S0895479896297744 ) isn't just good for optimizing cache behavior; it also avoids top-level overheads. A fairly straightforward iterative implementation in Julia, attached, performs about as quickly as the current LAPACK call for square matrices of dimension 3000 and above, so long as you limit OpenBLAS to a single thread. I haven't tested against MKL.
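For readers unfamiliar with the recursive formulation, here is a minimal pure-Julia sketch of the idea (this is *not* the attached reclu.jl; it omits pivoting and the iterative restructuring, and all names are illustrative): factor the leading block, solve triangular systems for the off-diagonal blocks, update the Schur complement, and recurse.

```julia
# Minimal recursive LU sketch (square, no pivoting) in the spirit of
# Toledo's algorithm. Illustrative only -- not the attached reclu.jl.
function reclu!(A::AbstractMatrix)
    n = size(A, 1)
    n == 1 && return A
    k = n ÷ 2
    A11 = view(A, 1:k, 1:k)
    A12 = view(A, 1:k, k+1:n)
    A21 = view(A, k+1:n, 1:k)
    A22 = view(A, k+1:n, k+1:n)
    reclu!(A11)                          # A11 = L11 * U11, stored in place
    # A12 <- L11 \ A12 (unit lower-triangular forward substitution)
    for j in axes(A12, 2), i in 1:k, t in 1:i-1
        A12[i, j] -= A11[i, t] * A12[t, j]
    end
    # A21 <- A21 / U11 (upper-triangular solve from the right)
    for i in axes(A21, 1), j in 1:k
        s = A21[i, j]
        for t in 1:j-1
            s -= A21[i, t] * A11[t, j]
        end
        A21[i, j] = s / A11[j, j]
    end
    # Schur complement: A22 <- A22 - A21 * A12, then recurse on it
    for j in axes(A22, 2), i in axes(A22, 1)
        s = A22[i, j]
        for t in 1:k
            s -= A21[i, t] * A12[t, j]
        end
        A22[i, j] = s
    end
    reclu!(A22)
    return A
end
```

The point of the recursion is that the triangular-solve and Schur-update steps operate on half-sized blocks, so most of the flops land in large matrix-matrix products with good cache reuse.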
Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM and xTRSV directly. The allocations and construction of StridedMatrix views at every step destroy any performance gain, and a lightweight wrapper attempt, LASubDomain below, fares no better. I also have to write explicit loops rather than vector expressions, which makes for ugly code compared to modern Fortran versions.
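For contrast, here is roughly what the "pretty" version of the Schur-complement update looks like through the high-level BLAS interface on strided views (the function name `luschur_blas!` is mine, purely for illustration). This is the style the attached code avoids in favor of direct xGEMM calls, because constructing these view objects at every recursion step allocates:

```julia
using LinearAlgebra

# Schur update A22 <- A22 - A21 * A12 on sub-blocks of A, written with
# BLAS.gemm! on strided views. Illustrative name; the attached code
# instead calls xGEMM directly to avoid per-step view construction.
function luschur_blas!(A::Matrix{Float64}, k::Int)
    n = size(A, 1)
    A21 = view(A, k+1:n, 1:k)
    A12 = view(A, 1:k, k+1:n)
    A22 = view(A, k+1:n, k+1:n)
    BLAS.gemm!('N', 'N', -1.0, A21, A12, 1.0, A22)
    return A
end
```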
Current ratio of recursive LU time to Julia's LAPACK-dispatch time on a dual-core Sandy Bridge server, by matrix dimension N (rows) and number of threads NT (columns):
| N \ NT | 1 | 2 | 4 | 8 | 16 | 24 | 32 |
|---|---|---|---|---|---|---|---|
| 10 | 2301.01 | 2291.16 | 2335.39 | 2268.1 | 2532.12 | 2304.23 | 2320.84 |
| 30 | 661.25 | 560.34 | 443.07 | 514.52 | 392.04 | 425.56 | 472 |
| 100 | 98.62 | 90.38 | 96.69 | 94.31 | 74.6 | 73.61 | 76.33 |
| 300 | 8.12 | 8.14 | 12.19 | 16.15 | 14.55 | 14.16 | 15.82 |
| 1000 | 1.6 | 1.55 | 2.36 | 3.35 | 3.5 | 3.86 | 3.48 |
| 3000 | 1.04 | 1.13 | 1.36 | 1.99 | 1.62 | 1.36 | 1.39 |
| 10000 | 1.02 | 1.06 | 1.13 | 1.26 | 1.55 | 1.54 | 1.58 |
For comparison, the best case for a traditional (non-recursive) LU factorization relying on xGEMM for the Schur complement is 7.7x slower than the LAPACK dispatch, and it scales far worse.
If there's interest, I'd love to work this into a faster generic base/linalg/lu.jl for use with other data types. I know I need to consider whether the laswap!, lutrisolve!, and luschur! functions are the right utility methods. I'm interested in supporting double-double and similar extended precisions with relatively high performance. That implies building everything out of at least dot products. For "multiplied" precisions, those can be optimized to avoid continual re-normalization (see the work of Siegfried Rump et al.).
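To illustrate why dot products are the right building block: a triangular solve written so that its entire inner reduction is a single `dot` call gives an extended-precision type one hook to specialize. This is only a sketch of the shape such a utility might take, not the attached code's `lutrisolve!`, and it assumes real element types (for complex types you'd want an unconjugated product):

```julia
using LinearAlgebra

# Generic unit-lower-triangular solve, L x = b in place, with the inner
# reduction expressed as one dot product. A double-double element type
# could overload `dot` to defer renormalization across the whole
# reduction (cf. Rump et al.). Sketch only; not the attached version.
function lutrisolve!(L::AbstractMatrix{T}, b::AbstractVector{T}) where {T}
    n = length(b)
    for i in 2:n
        b[i] -= dot(view(L, i, 1:i-1), view(b, 1:i-1))
    end
    return b
end
```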
(I actually wrote the code last November, but this is the first chance I've had to describe it. It might be a tad out of date style-wise with respect to current Julia. Given that lag, I suppose I shouldn't promise to beat this into shape at any fast pace…)
<#part type="text/plain" filename="~/src/julia-stuff/reclu/reclu.jl" disposition=inline>
<#/part>
<#part type="text/plain" filename="~/src/julia-stuff/reclu/tst.jl" disposition=inline>
<#/part>