The handling of BLAS/LAPACK libraries suck, Sage basically hardcodes ATLAS or OSX accelerate into the build process. Matrix operations are one place where using ISA extensions and careful choice of blocking to match processor cache really matters, but we have no mechanism to use a good implementation in binary builds.
My proposal is to make the BLAS library configurable at runtime with defaults chosen by an automatic benchmark run. This is similar to what GMP/MPIR does but at a higher level. Instead of writing our own "FAT" BLAS, we use a wrapper library that resolves (using dlopen/dlsym) the BLAS library at runtime to the reference/netlib BLAS, ATLAS, OpenBlas, Intel MKL, or any other implementation. The backend can be selected with the LINEAR_ALGEBRA_CONFIG environment variable.
Ultimately, there will be a Python program to build different BLAS/LAPACK libraries and install them either in /usr/lib/linear_algebra/NAME/ (or /usr/lib64 or any compile-time prefix) or $(HOME)/.local/lib/linear_algebra/NAME/ where NAME is a combination of implementation and compile options used. The LINEAR_ALGEBRA_CONFIG environment variable will then be set to NAME.
As a demo, I wrote the wrapper library linalg_cblas.so, see
which wraps two CBLAS functions. For now, LINEAR_ALGEBRA_CONFIG is used as name (possibly with full path info) of the cblas library, though this will be changed. The example program
is just a call to these two CBLAS functions and is linked with -llinalg_cblas -ldl. Then the CBLAS library can be selected at runtime (here switched between Sage GSL CBLAS, the Fedora GSL CBLAS, and ATLAS):
[vbraun@volker-desktop linear_algebra]$ LINEAR_ALGEBRA_CONFIG=~/Sage/sage/local/lib/libgslcblas.so ./test/startup
Address of cblas_zdotc_sub: d817a0b0
Address of cblas_dgemm: d815ffb0
destructor
[vbraun@volker-desktop linear_algebra]$ LINEAR_ALGEBRA_CONFIG=libgslcblas.so.0 ./test/startup
Address of cblas_zdotc_sub: c30260f0
Address of cblas_dgemm: c300c000
destructor
[vbraun@volker-desktop linear_algebra]$ LINEAR_ALGEBRA_CONFIG=libcblas.so ./test/startup
Address of cblas_zdotc_sub: c58159e0
Address of cblas_dgemm: c580c420
destructor
The computational cost is of course an extra stack frame whenever you call a BLAS/LAPACK function. But then thats already faster than any operation that you can do in Python.
In the Sage build process we would then build the wrapper library and the reference blas by default. The wrapper will try to pick up system BLAS/LAPACK implementations if they are available. Installing ATLAS or OpenBlas would then be optional. Right now we compile ATLAS in parallel together with anything else, which makes little sense if you want to get accurate timings. Also the ATLAS library interface changed in ATLAS-3.10 so we'll have to modify everything sooner or later. Finally, ATLAS-3.10 doesn't work on Itanium any more and there is not much interest in making it work there, so if we ever want to get AVX support, say, on modern Intel chips we need to fall back to a dumb implementation on Itanium (or face the fact that Itanium is dead).