LU factorization in Julia can be almost as fast as native code.


Jason Riedy

May 26, 2014, 7:36:30 PM5/26/14
to julia...@googlegroups.com

First, the caveats:

  • on large matrices,
  • with few threads, and
  • punching a hole directly to the BLAS library.

Sivan Toledo's recursive LU factorization (http://dx.doi.org/10.1137/S0895479896297744) isn't just good for optimizing cache behavior; it also avoids top-level overheads. A fairly straightforward iterative implementation in Julia, attached, performs about as quickly as the current LAPACK call for square matrices of dimension 3000 or above, so long as you limit OpenBLAS to a single thread. I haven't tested against MKL.
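For readers without the attachment, the recursion itself is easy to sketch: split the columns in half, factor the left panel recursively, triangular-solve for the upper-right block, form the Schur complement, and recurse on the trailing block. Here is a minimal pure-Python illustration with partial pivoting (this is my own sketch for exposition, not the attached Julia code; the function name and layout are invented):

```python
def rec_lu(A, piv, i0, j0, m, n):
    """Recursive, in-place LU with partial pivoting of the m-by-n block
    of A whose top-left corner is (i0, j0).  Row swaps are applied to
    full rows (LAPACK-style) and recorded in piv."""
    if n == 1:
        # Base case: pivot and scale a single column.
        p = max(range(m), key=lambda k: abs(A[i0 + k][j0]))
        if p != 0:
            A[i0], A[i0 + p] = A[i0 + p], A[i0]
            piv[i0], piv[i0 + p] = piv[i0 + p], piv[i0]
        d = A[i0][j0]
        for k in range(1, m):
            A[i0 + k][j0] /= d
        return
    n1 = n // 2
    # Factor the left panel recursively (swaps whole rows, so pivots
    # propagate to the right half automatically).
    rec_lu(A, piv, i0, j0, m, n1)
    # Triangular solve U12 = L11^{-1} A12 (unit lower triangular L11).
    for j in range(j0 + n1, j0 + n):
        for i in range(1, n1):
            s = A[i0 + i][j]
            for k in range(i):
                s -= A[i0 + i][j0 + k] * A[i0 + k][j]
            A[i0 + i][j] = s
    # Schur complement update A22 -= L21 * U12 (the xGEMM call in the
    # real code).
    for i in range(n1, m):
        for j in range(j0 + n1, j0 + n):
            s = A[i0 + i][j]
            for k in range(n1):
                s -= A[i0 + i][j0 + k] * A[i0 + k][j]
            A[i0 + i][j] = s
    # Recurse on the trailing block.
    rec_lu(A, piv, i0 + n1, j0 + n1, m - n1, n - n1)
```

In the real implementation the Schur-complement loop becomes a single xGEMM call and the triangular solve a call into the BLAS, which is where all the performance comes from on large blocks.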

Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM and xTRSV directly. The allocation and construction of StridedMatrix views at every step destroy any performance gain; the same goes for a lightweight wrapper attempt called LASubDomain below. I also have to write loops rather than vector expressions, which leads to ugly code compared to modern Fortran versions.

Current ratio of recursive LU time to the Julia LAPACK dispatch on a dual-core, Sandy Bridge server by dimension (vertical) and number of threads (horizontal):

    N \ NT       1        2        4        8       16       24       32
       10  2301.01  2291.16  2335.39  2268.10  2532.12  2304.23  2320.84
       30   661.25   560.34   443.07   514.52   392.04   425.56   472.00
      100    98.62    90.38    96.69    94.31    74.60    73.61    76.33
      300     8.12     8.14    12.19    16.15    14.55    14.16    15.82
     1000     1.60     1.55     2.36     3.35     3.50     3.86     3.48
     3000     1.04     1.13     1.36     1.99     1.62     1.36     1.39
    10000     1.02     1.06     1.13     1.26     1.55     1.54     1.58

The best case for traditional LU factorization relying on xGEMM for the Schur complement is 7.7x slower than the LAPACK dispatch, and it scales far worse.

If there's interest, I'd love to work this into a faster, generic base/linalg/lu.jl for use with other data types. I need to consider whether the laswap!, lutrisolve!, and luschur! functions are the right utility methods. I'm interested in supporting double-double, etc. with relatively high performance, which implies building everything out of at least dot products. For "multiplied" precisions, those can be optimized to avoid continual re-normalization (see the work of Siegfried Rump, et al.).

(I actually wrote the code last November, but this is the first chance I've had to describe it. It might be a tad out of date style-wise with respect to current Julia. Given that lag, I suppose I shouldn't promise to beat this into shape at any fast pace…)


reclu.jl
tst.jl

Alan Edelman

May 27, 2014, 8:56:33 AM5/27/14
to julia...@googlegroups.com, ja...@lovesgoodfood.com
This is great to hear!  More to do ..

Viral Shah

May 27, 2014, 3:00:49 PM5/27/14
to julia...@googlegroups.com, ja...@lovesgoodfood.com
This is really fascinating! There is certainly interest in supporting linear algebra operations for types other than the ones BLAS/LAPACK support.

Do see the discussion in the issue below. I know that you do not like GitHub much, but seeing the discussion should not require a login.

Andreas Noack Jensen

May 27, 2014, 3:35:51 PM5/27/14
to julia...@googlegroups.com
Have you tried timing the recursive algorithm without the BLAS calls against the no-BLAS version?

I agree it would be great if it weren't necessary to write everything out in loops to get high performance. Hopefully this can improve over time.
--
Med venlig hilsen

Andreas Noack Jensen

Viral B. Shah

May 27, 2014, 3:50:23 PM5/27/14
to julia...@googlegroups.com
With array views, we should be able to avoid writing loops. Hopefully we will tackle this early in 0.4.

-viral
