[racket] performance problem in math/matrix


Berthold Bäuml

Jan 20, 2013, 7:44:51 AM
to Neil Toronto, us...@racket-lang.org
Dear Neil,

First, I want to thank you for the great work on the math package. We hope to use it heavily in our "Racket in Robotics" environment, which we are currently developing.

I tried a little test with mildly large Flonum matrices using math/matrix in Typed Racket: multiplying two 1000x1000 matrices with (array-strictness #t). Running this in DrRacket results in an "out of memory" error (even with the memory limit set to 1GB), and running it with racket on the command line makes the computer start to swap heavily. Where is this high memory consumption coming from? I would have expected it to take no more than 3*8*10^6 bytes = 24MB to hold the three involved matrices in memory.

I also tried matrix* with two 100x100 matrices. This finishes, but the multiplication alone takes about 1s, which is quite long. For comparison, I tried the same with Mathematica, Python/NumPy, and C++ using the Eigen library (I know this is not quite fair, as the former two also use an optimized C library as their backend, but it gives a baseline of what performance is possible).

                100x100    1000x1000
Typed Racket    1000ms     (out of memory)
Mathematica     2ms        250ms
Python/NumPy    1ms        240ms
C++/Eigen       <2ms       <300ms

Here are the configuration details of the MacBook I used:
Racket 5.3.1.900
OS X 10.8.2
2.53 GHz Intel Core 2 Duo with 4GB memory

And here are the code snippets:


Typed Racket:
------------------------------------

#lang typed/racket/base
(require math)

(array-strictness #t)

(define dim 1000)

(define big1 (build-matrix dim dim (lambda (i j) (random))))

(define big2 (build-matrix dim dim (lambda (i j) (random))))

(define res (time (matrix* big1 big2)))


Mathematica:
------------------------------------------
dim = 1000;
big1 = Array[Random[] &, {dim, dim}];
big2 = Array[Random[] &, {dim, dim}];

Timing[res = big1.big2 ;]


Python/NumPy
------------------------------------------
import numpy
import time

dim = 100
big1 = numpy.random.random((dim,dim))
big2 = numpy.random.random((dim,dim))

start = time.time()
res = numpy.dot(big1, big2)
stop = time.time()

print(stop - start)


C++/Eigen (compiled with g++ -O2)
---------------------------------------------

#include <Eigen/Dense>
using Eigen::MatrixXd;

int main()
{
  const int dim = 100;
  // initialize with random entries, matching the other benchmarks
  MatrixXd big1 = MatrixXd::Random(dim, dim);
  MatrixXd big2 = MatrixXd::Random(dim, dim);

  MatrixXd res(dim, dim);
  res = big1 * big2;
}

--
-----------------------------------------------------------------------
Berthold Bäuml
DLR, Robotics and Mechatronics Center (RMC)
Münchner Str. 20, D-82234 Wessling
Phone +49 8153 282489
http://www.robotic.de/Berthold.Baeuml



____________________
Racket Users list:
http://lists.racket-lang.org/users

Jens Axel Søgaard

Jan 20, 2013, 9:09:42 AM
to Berthold Bäuml, us...@racket-lang.org
Hi Berthold,

2013/1/20 Berthold Bäuml <berthol...@dlr.de>:
> I tried a little test with mildly large Flonum matrices using math/matrix in Typed
> Racket. Two 1000x1000 dimensional matrices should be multiplied with (array-
> strictness #t). Running this in DrRacket results in an "out of memory" error (also
> having set the memory limit to 1GB) and running it with racket on the command
> line makes the computer start to heavily swap.

> Where is this high memory consumption coming from?

I consider this a bug in matrix*.

The culprit is inline-matrix-multiply in "untyped-matrix-arithmetic.rkt".
When multiplying an m×p matrix with a p×n matrix, it builds a
temporary array of size m*p*n containing all the products needed, and then
computes sums of these products.

Replacing this algorithm with the standard O(n^3) one ought to be easy.
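
As a rough illustration of why that temporary blows up (sketched in plain Python for concreteness; the point is the shapes and byte counts, not any particular library):

```python
m = p = n = 1000

# The approach described above materializes every elementwise product
# A[i,k] * B[k,j] in one m*p*n temporary before summing over k.
# For 1000x1000 inputs that temporary alone holds 10^9 doubles:
temp_bytes = m * p * n * 8
print(temp_bytes)    # 8000000000 bytes, roughly 8 GB -- before any boxing overhead

# The standard triple loop instead accumulates each sum in a scalar,
# so beyond the inputs only the m*n result needs to be allocated:
result_bytes = m * n * 8
print(result_bytes)  # 8000000 bytes, roughly 8 MB
```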

Numpy (and Mathematica, I believe) uses BLAS and LAPACK, which use
Strassen's algorithm (or better?) for matrices this large, so there will
still be a performance gap.

If this performance gap turns out to be too large, we must figure
out a way to integrate BLAS and LAPACK bindings into math/matrix.

For the time being, I have attached an implementation of a matrix
library supporting matrices over doubles (math/matrix supports
matrices over arbitrary Racket numbers). The implementation uses Racket
bindings to BLAS and LAPACK. All names ought to be the same;
just replace "matrix" with "flmatrix".

Your example becomes:

#lang racket
(require "flmatrix.rkt")
(define dim 1000)
(define big1 (build-flmatrix dim dim (lambda (i j) (random))))
(define big2 (build-flmatrix dim dim (lambda (i j) (random))))
(define res (time (flmatrix* big1 big2)))

Note: Disable "Populate "compiled" directories (for faster loading)"
to run it in DrRacket. To disable: Choose the language menu.
Choose the menu item "Choose language". Click the button
"advanced". Then disable.

If it is enabled, I currently get this curious error from the compiler:

../collects/compiler/cm.rkt:438:6: write: cannot marshal value that is
embedded in compiled code
value: #<ctype:float>

I am unsure whether it is a bug in my code, or in cm.rkt.

--
Jens Axel Søgaard
flmatrix.rkt

Tobias Hammer

Jan 20, 2013, 9:24:34 AM
to Jens Axel Søgaard, us...@racket-lang.org
On Sun, 20 Jan 2013 15:09:42 +0100, Jens Axel Søgaard
<jens...@soegaard.net> wrote:

> Note: Disable "Populate "compiled" directories (for faster loading)"
> to run it in DrRacket. To disable: Choose the language menu.
> Choose the menu item "Choose language". Click the button
> "advanced". Then disable.
>
> If it is enabled, I currently get this curious error from the compiler:
>
> ../collects/compiler/cm.rkt:438:6: write: cannot marshal value that is
> embedded in compiled code
> value: #<ctype:float>
>
> I unsure whether it is a bug in my code, or in cm.rkt.

Most likely you ran into 3D syntax (as I did before :)
http://lists.racket-lang.org/dev/archive//2012-July/010101.html

Without knowing the exact location of the error, I think you tried to pass
a ctype from phase 1 to phase 0 by simply inserting it there (e.g. by #,ctype).
But as a ctype can't be printed, it can't be stored in the compiled
bytecode.

Tobias

Neil Toronto

Jan 20, 2013, 12:09:00 PM
to Jens Axel Søgaard, us...@racket-lang.org
On 01/20/2013 07:09 AM, Jens Axel Søgaard wrote:
> Hi Berthold,
>
> 2013/1/20 Berthold Bäuml <berthol...@dlr.de>:
>> I tried a little test with mildly large Flonum matrices using math/matrix in Typed
>> Racket. Two 1000x1000 dimensional matrices should be multiplied with (array-
>> strictness #t). Running this in DrRacket results in an "out of memory" error (also
>> having set the memory limit to 1GB) and running it with racket on the command
>> line makes the computer start to heavily swap.
>
>> Where is this high memory consumption coming from?
>
> I consider this a bug in matrix*.
>
> The culprit is inline-matrix-multiply in "untyped-matrix-arithmetic.rkt".
> When multiplying an mxp matrix with an pxn matrix, it builds a
> temporary array of size n*p*m with all products needed, then
> it computes sums of these products.
>
> Replacing this algorithm with the standard O(n^3) ought to be easy.

Aye, that's been done. The fix isn't in the current release candidate,
though. It'll be in the next one.

(I made arrays strict by default in two stages: one for `math/array',
and one for `math/matrix'. The release candidate was rolled out between
stages, so the `math/matrix' changes aren't in it. The change for
`matrix*' ensures that the n*p*m-size matrix is nonstrict, so it never
allocates an n*p*m-size vector to store its elements.)

> Numpy (and Mathematica, I believe) uses BLAS and LAPACK, which use
> Strassen's algorithm (or better?) for matrices this large, so there will
> still be a performance gap.
>
> If this performance gaps turns out to be too large, we must figure
> out a way to integrate BLAS and LAPACK bindings into math/matrix.

Yes, this.

I'd like to make an `FlMatrix' type that uses BLAS and LAPACK when
they're available, and falls back to a Racket floating-point-only
implementation. The latter would at least fix the biggest performance
issue with flonum matrices, which is that their elements are heap-allocated.

In the meantime, I'll profile matrix multiply and see if I can speed it
up. My 1000x1000 multiply just finished at 418sec, which is... pitiful.

Neil ⊥

Berthold Bäuml

Jan 20, 2013, 12:48:57 PM
to Neil Toronto, Jens Axel Søgaard, us...@racket-lang.org

With the ffi-binding example from Jens (thank you!) I get 450ms for the 1000x1000 multiply -- only 2x slower than Mathematica or NumPy. So integrating such a binding into math/matrix looks promising.

Nevertheless, my original motivation for the little test was to get an impression of what performance could be achieved in pure Typed Racket for numerical algorithms. Would it in principle be possible to come close to pure C-code performance (using the same algorithm) with a floating-point-only implementation?

> In the meantime, I'll profile matrix multiply and see if I can speed it up. My 1000x1000 multiply just finished at 418sec, which is... pitiful.
>
> Neil ⊥
>


Berthold

--
-----------------------------------------------------------------------
Berthold Bäuml
DLR, Robotics and Mechatronics Center (RMC)
Münchner Str. 20, D-82234 Wessling
Phone +49 8153 282489
http://www.robotic.de/Berthold.Baeuml

Matthias Felleisen

Jan 20, 2013, 12:55:04 PM
to Berthold Bäuml, Jens Axel Søgaard, us...@racket-lang.org

On Jan 20, 2013, at 12:48 PM, Berthold Bäuml wrote:

> Nevertheless, my original motivation for the little test was to get an impression of what performance could be achieved in purely Typed Racket for numerical algorithms. Would it in principle be possible to come close to pure C-code performance (using the same algorithm) when using a floating-point-implementation?


This is an interesting challenge for all layers involved here:
-- typed racket
-- racket plus unsafe (the target language)
-- the compiler
-- and possibly the optimization coach

But we need to be careful to compare apples to apples (as much as we can). Our doubles aren't the same as C++'s: they will always carry a 'tag'. Our vectors have a different layout, and that has performance implications.

Finally, our goal isn't high performance -- we would have to raise a lot more funding and employ pure compiler people, memory-hierarchy people (a C++ compiler can account for cache sizes and memory placement), etc.

We should still try to do a good job. -- Matthias

Jens Axel Søgaard

Jan 20, 2013, 1:14:56 PM
to Berthold Bäuml, us...@racket-lang.org
2013/1/20 Berthold Bäuml <berthol...@dlr.de>:

> With the ffi-binding example from Jens (thank you!) I get for the 1000x1000
> multiply 450ms -- only 2x slower than Mathematica or Numpy. So integrating
> such a binding in math/matrix looks promising.

Huh? I am surprised it is twice as slow.

Ah! The Numpy example uses floats and not doubles.

> Nevertheless, my original motivation for the little test was to get an impression of
> what performance could be achieved in purely Typed Racket for numerical
> algorithms. Would it in principle be possible to come close to pure C-code
> performance (using the same algorithm) when using a floating-point-
> implementation?

Let's try and write a matrix* that works for matrices represented as
vectors of floats and see how fast it is.

/Jens Axel

Berthold Bäuml

Jan 20, 2013, 1:26:23 PM
to Jens Axel Søgaard, us...@racket-lang.org

On 20.01.2013, at 19:14, Jens Axel Søgaard <jens...@soegaard.net> wrote:

> 2013/1/20 Berthold Bäuml <berthol...@dlr.de>:
>
>> With the ffi-binding example from Jens (thank you!) I get for the 1000x1000
>> multiply 450ms -- only 2x slower than Mathematica or Numpy. So integrating
>> such a binding in math/matrix looks promising.
>
> Huh? I am surprised it is twice as slow.
>
> Ah! The Numpy example uses floats and not doubles.

On my machine Numpy says it is using float64, hence double.
big1.dtype.name -> float64


>> Nevertheless, my original motivation for the little test was to get an impression of
>> what performance could be achieved in purely Typed Racket for numerical
>> algorithms. Would it in principle be possible to come close to pure C-code
>> performance (using the same algorithm) when using a floating-point-
>> implementation?
>
> Let's try and write a matrix* that works for matrices represented as
> vectors of floats and see how fast it is.

Looking forward to seeing what is possible!

Berthold



> /Jens Axel

--
-----------------------------------------------------------------------
Berthold Bäuml
DLR, Robotics and Mechatronics Center (RMC)
Münchner Str. 20, D-82234 Wessling
Phone +49 8153 282489
http://www.robotic.de/Berthold.Baeuml



Neil Toronto

Jan 20, 2013, 1:59:44 PM
to Jens Axel Søgaard, us...@racket-lang.org
On 01/20/2013 11:14 AM, Jens Axel Søgaard wrote:
> 2013/1/20 Berthold Bäuml <berthol...@dlr.de>:
>
>> With the ffi-binding example from Jens (thank you!) I get for the 1000x1000
>> multiply 450ms -- only 2x slower than Mathematica or Numpy. So integrating
>> such a binding in math/matrix looks promising.
>
> Huh? I am surprised it is twice as slow.
>
> Ah! The Numpy example uses floats and not doubles.

It may be FFI overhead as well.

>> Nevertheless, my original motivation for the little test was to get an impression of
>> what performance could be achieved in purely Typed Racket for numerical
>> algorithms. Would it in principle be possible to come close to pure C-code
>> performance (using the same algorithm) when using a floating-point-
>> implementation?
>
> Let's try and write a matrix* that works for matrices represented as
> vectors of floats and see how fast it is.

I just did that. Here are the types:

real-matrix* : (Array Real) (Array Real) -> (Array Real)

flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)

flmatrix* : FlArray FlArray -> FlArray

Results so far, measured in DrRacket with debugging off:

Function        Size       Time
-----------------------------------------
matrix*         100x100    340ms
real-matrix*    100x100    40ms
flonum-matrix*  100x100    10ms
flmatrix*       100x100    6ms

matrix*         1000x1000  418000ms
real-matrix*    1000x1000  76000ms
flonum-matrix*  1000x1000  7000ms
flmatrix*       1000x1000  4900ms

The only difference between `real-matrix*' and `flonum-matrix*' is that
the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if
I can inline `real-matrix*', TR's optimizer will change the former to
the latter, making `flonum-matrix*' unnecessary. (FWIW, this would be
the largest speedup TR's optimizer will have ever shown me.)

It looks like the biggest speedup comes from doing only flonum ops in
the inner loop sum, which keeps all the intermediate flonums unboxed
(i.e. not heap-allocated or later garbage-collected).

Right now, `flmatrix*' is implemented a bit stupidly, so I could speed
it up further. I won't yet, because I haven't settled on a type for
matrices of unboxed flonums. The type has to work with LAPACK if it's
installed, which `FlArray' doesn't do because its data layout is
row-major and LAPACK expects column-major.
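
For readers unfamiliar with the layout issue: here is a hypothetical index-mapping sketch (plain Python, not the actual FlArray or LAPACK code) showing how the same (i, j) element lands at different flat offsets in the two layouts:

```python
def row_major_index(i, j, nrows, ncols):
    # C-style layout (used by FlArray): rows are contiguous in memory
    return i * ncols + j

def col_major_index(i, j, nrows, ncols):
    # Fortran-style layout (expected by LAPACK): columns are contiguous
    return j * nrows + i

# Element (1, 2) of a 3x4 matrix sits at a different offset in each layout:
print(row_major_index(1, 2, 3, 4))  # 6
print(col_major_index(1, 2, 3, 4))  # 7
```

This is why handing an FlArray's backing vector directly to LAPACK would silently transpose the matrix (or require a copy).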

I'll change `matrix*' to look like `real-matrix*'. It won't give the
very best performance, but it's a 60x speedup for 1000x1000 matrices.

Neil ⊥

Jens Axel Søgaard

Jan 20, 2013, 2:35:31 PM
to Berthold Bäuml, us...@racket-lang.org
2013/1/20 Berthold Bäuml <berthol...@dlr.de>:

>> Ah! The Numpy example uses floats and not doubles.
>
> On my machine Numpy says it is using float64, hence, double.
> big1.dtype.name -> float64

Sorry for the confusion. I am used to the Racket documentation,
where float means single precision, so I misinterpreted the Numpy
documentation.

Also

> numpy.random.random((5,5))
array([[ 0.04748764, 0.15242073, 0.2366255 , 0.86926654, 0.16586114],
       [ 0.33117124, 0.7504899 , 0.43179013, 0.1992083 , 0.83266119],
       [ 0.25741902, 0.36912007, 0.38273193, 0.29622522, 0.75817261],
       [ 0.49313726, 0.20514653, 0.45329344, 0.964105  , 0.77459153],
       [ 0.5267172 , 0.87269101, 0.34530438, 0.38218051, 0.05107181]])

suggested that the numbers were single precision. It turns out
that random only produces 53-bit random floats.
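
A quick way to check this directly (a small NumPy sketch; dtype and itemsize confirm the precision):

```python
import numpy as np

big = np.random.random((5, 5))
print(big.dtype)           # float64 -- IEEE double precision, not single
print(big.dtype.itemsize)  # 8 bytes per element
```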

Anyways, here is a standard matrix multiplication algorithm in
Typed Racket.

#lang typed/racket
(require racket/unsafe/ops
         racket/flonum)

(define-type NF Nonnegative-Fixnum)

(: matrix* :
   (Vectorof Flonum) (Vectorof Flonum) Index Index Index
   -> (Vectorof Flonum))
(define (matrix* A B m p n)
  (define size (* m n))
  (define: v : (Vectorof Flonum) (make-vector size 0.0))
  (define index 0)
  (: loop-i : NF -> (Vectorof Flonum))
  (define (loop-i i)
    (cond [(< i m)
           (loop-j i 0)
           (loop-i (+ i 1))]
          [else v]))
  (: loop-j : NF NF -> Void)
  (define (loop-j i j)
    (when (< j n)
      (loop-k i j 0 0.0)
      (set! index (+ index 1))
      (loop-j i (+ j 1))))
  (: loop-k : NF NF NF Flonum -> Void)
  (define (loop-k i j k sum)
    (define i*p (* i p))
    (define k*n (* k n))
    (cond
      [(< k p)
       (loop-k i j (+ k 1)
               (+ sum
                  (* (unsafe-vector-ref A (+ i*p k))
                     (unsafe-vector-ref B (+ k*n j)))))]
      [else
       (vector-set! v index sum)]))
  (loop-i 0))

(define dim 200)
(define A (build-vector (* dim dim) (λ (_) (random))))
(define B (build-vector (* dim dim) (λ (_) (random))))
(collect-garbage)
(collect-garbage)
(define A*B (time (matrix* A B dim dim dim)))

/Jens Axel

Ray Racine

Jan 20, 2013, 4:25:13 PM
to Jens Axel Søgaard, Racket
Modification of the above to use FlVector did not show a difference.

Ray Racine

Jan 20, 2013, 4:40:15 PM
to Jens Axel Søgaard, Racket
Disregard that last message. Always measure at the command line <sigh>. I'm seeing FlVector at 30% faster.

#lang typed/racket

(provide main)

(require racket/unsafe/ops
         racket/fixnum
         racket/flonum)

(define-type NF Nonnegative-Fixnum)

(: matrix* :
   FlVector FlVector Index Index Index
   -> FlVector)
(define (matrix* A B m p n)
  (define size (* m n))
  (define: v : FlVector (make-flvector size 0.0))
  (define index 0)
  (: loop-i : NF -> FlVector)
  (define (loop-i i)
    (cond [(< i m)
           (loop-j i 0)
           (loop-i (+ i 1))]
          [else v]))
  (: loop-j : NF NF -> Void)
  (define (loop-j i j)
    (when (< j n)
      (loop-k i j 0 0.0)
      (set! index (+ index 1))
      (loop-j i (+ j 1))))
  (: loop-k : NF NF NF Flonum -> Void)
  (define (loop-k i j k sum)
    (define i*p (* i p))
    (define k*n (* k n))
    (cond
     [(< k p)
      (loop-k i j (+ k 1)
              (+ sum
                 (* (unsafe-flvector-ref A (+ i*p k))
                    (unsafe-flvector-ref B (+ k*n j)))))]
     [else
      (unsafe-flvector-set! v index sum)]))
  (loop-i 0))

(: mkFlVector (Fixnum (-> Flonum) -> FlVector))
(define (mkFlVector sz init)
  (let ((v (make-flvector sz)))
    (do ((i #{0 : Fixnum} (+ i 1)))
        ((>= i sz) v)
      (unsafe-flvector-set! v i (init)))))

(define: dim : Index 300)
(define: A : FlVector (mkFlVector (fx* dim dim) random))
(define: B : FlVector (mkFlVector (fx* dim dim) random))

(define (main)
  (collect-garbage)
  (collect-garbage)
  (let ((m (time (matrix* A B dim dim dim))))
    (void)))

Tobias Hammer

Jan 21, 2013, 4:56:59 AM
to Jens Axel Søgaard, us...@racket-lang.org
The attached patch should fix it.

Tobias
---------------------------------------------------------
Tobias Hammer
DLR / Institute of Robotics and Mechatronics
Muenchner Str. 20, D-82234 Wessling
Tel.: 08153/28-1487
Mail: tobias...@dlr.de
flmatrix.patch

Jens Axel Søgaard

Jan 21, 2013, 8:19:31 AM
to Tobias Hammer, us...@racket-lang.org
2013/1/21 Tobias Hammer <tobias...@dlr.de>:
> The attached patch should fix it.

Thanks!

/Jens Axel

Berthold Bäuml

Jan 21, 2013, 6:33:39 PM
to Neil Toronto, Jens Axel Søgaard, us...@racket-lang.org

>
> I just did that. Here are the types:
>
> real-matrix* : (Array Real) (Array Real) -> (Array Real)
>
> flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)
>
> flmatrix* : FlArray FlArray -> FlArray
>
> Results so far, measured in DrRacket with debugging off:
>
> Function Size Time
> -----------------------------------------
> matrix* 100x100 340ms
> real-matrix* 100x100 40ms
> flonum-matrix* 100x100 10ms
> flmatrix* 100x100 6ms
>
> matrix* 1000x1000 418000ms
> real-matrix* 1000x1000 76000ms
> flonum-matrix* 1000x1000 7000ms
> flmatrix* 1000x1000 4900ms
>
> The only difference between `real-matrix*' and `flonum-matrix*' is that the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if I can inline `real-matrix*', TR's optimizer will change the former to the latter, making `flonum-matrix*' unnecessary. (FWIW, this would be the largest speedup TR's optimizer will have ever shown me.)
>
> It looks like the biggest speedup comes from doing only flonum ops in the inner loop sum, which keeps all the intermediate flonums unboxed (i.e. not heap-allocated or later garbage-collected).
>
> Right now, `flmatrix*' is implemented a bit stupidly, so I could speed it up further. I won't yet, because I haven't settled on a type for matrices of unboxed flonums. The type has to work with LAPACK if it's installed, which `FlArray' doesn't do because its data layout is row-major and LAPACK expects column-major.
>
> I'll change `matrix*' to look like `real-matrix*'. It won't give the very best performance, but it's a 60x speedup for 1000x1000 matrices.
>

These results look very promising, especially if, as you mentioned, in the end real-matrix* will automatically reach flonum-matrix* performance for Flonums, and flmatrix* automatically switches to a LAPACK-based variant when available. For the latter, it would be great if one could even change the library used, e.g., to redirect to an installation of the highly efficient MKL library from Intel.

Looking forward to benchmarking it against Numpy and Mathematica (which is MKL-based) again!

Berthold

Neil Toronto

Jan 22, 2013, 12:19:30 AM
to Berthold Bäuml, Jens Axel Søgaard, us...@racket-lang.org

The next release candidate will have a `matrix*' that runs as fast as
`flonum-matrix*' on (Matrix Flonum) arguments, and as fast as
`real-matrix*' on (Matrix Real) arguments.

(Curiously, I wasn't able to speed up `flmatrix*' at all. The "a bit
stupidly" implementation was actually the fastest I could do, which
probably means it's as fast as it can be done in Racket without dropping
to C.)

Additionally, all the matrix functions will have more precise types to
make it easy for TR to prove that operations on (Matrix Flonum) values
yield (Matrix Flonum) values. (Currently, it only proves (Matrix Real)
and (Matrix Number).) Example:

> (:print-type (matrix-qr (matrix [[1.0 2.0] [3.0 4.0]])))
(Values (Array Flonum) (Array Flonum))
> (:print-type (matrix-qr (matrix [[1.0+0.0i 2.0+0.0i]
                                   [3.0+0.0i 4.0+0.0i]])))
(Values (Array Float-Complex) (Array Float-Complex))

The current release candidate returns two (Array Real) and (Array
Number) instead.

Since I've brought up decompositions, you should know that these,
`matrix-inverse', `matrix-solve', and pretty much every O(n^3) operation
except multiplication will be slow on large matrices. I can't think of a
way to inline them in a way that TR can optimize. For fast versions,
which will use LAPACK if it's installed, you'll have to wait for 5.3.3. :/

Is the MKL library binary-compatible with LAPACK?

Neil ⊥

Berthold Bäuml

Jan 22, 2013, 4:09:45 PM
to Neil Toronto, Jens Axel Søgaard, us...@racket-lang.org

On 22.01.2013, at 06:19, Neil Toronto <neil.t...@gmail.com> wrote:

> On 01/21/2013 04:33 PM, Berthold Bäuml wrote:
>>
>>>
>>> I just did that. Here are the types:
>>>
>>> real-matrix* : (Array Real) (Array Real) -> (Array Real)
>>>
>>> flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)
>>>
>>> flmatrix* : FlArray FlArray -> FlArray
>>>
>>> Results so far, measured in DrRacket with debugging off:
>>>
>>> Function Size Time
>>> -----------------------------------------
>>> matrix* 100x100 340ms
>>> real-matrix* 100x100 40ms
>>> flonum-matrix* 100x100 10ms
>>> flmatrix* 100x100 6ms
>>>
>>> matrix* 1000x1000 418000ms
>>> real-matrix* 1000x1000 76000ms
>>> flonum-matrix* 1000x1000 7000ms
>>> flmatrix* 1000x1000 4900ms
>>>
>>> The only difference between `real-matrix*' and `flonum-matrix*' is that the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if I can inline `real-matrix*', TR's optimizer will change the former to the latter, making `flonum-matrix*' unnecessary. (FWIW, this would be the largest speedup TR's optimizer will have ever shown me.)
>>>
>>> It looks like the biggest speedup comes from doing only flonum ops in the inner loop sum, which keeps all the intermediate flonums unboxed (i.e. not heap-allocated or later garbage-collected).
>>>
>>> Right now, `flmatrix*' is implemented a bit stupidly, so I could speed it up further. I won't yet, because I haven't settled on a type for matrices of unboxed flonums. The type has to work with LAPACK if it's installed, which `FlArray' doesn't do because its data layout is row-major and LAPACK expects column-major.
>>>
>>> I'll change `matrix*' to look like `real-matrix*'. It won't give the very best performance, but it's a 60x speedup for 1000x1000 matrices.
>>>
>>
>> These results look very promising, esp. if , as you mentioned, in the end the real-matrix* will automatically reach the flonum-matrix* performance for Flonums and the flmatrix* automatically switches to a LAPACK based variant, when available. For the latter it would be great if one could even change the used library to, e.g., redirect to a installation of the highly efficient MKL library from Intel.
>>
>> Looking forward to benchmark it against Numpy and Mathematica (which is MKL based) again!
>
> The next release candidate will have a `matrix*' that runs as fast as `flonum-matrix*' on (Matrix Flonum) arguments, and as fast as `real-matrix*' on (Matrix Real) arguments.

Cool!

> (Curiously, I wasn't able to speed up `flmatrix*' at all. The "a bit stupidly" implementation was actually the fastest I could do, which probably means it's as fast as it can be done in Racket without dropping to C.)
>
> Additionally, all the matrix functions will have more precise types to make it easy for TR to prove that operations on (Matrix Flonum) values yield (Matrix Flonum) values. (Currently, it only proves (Matrix Real) and (Matrix Number).) Example:
>
> > (:print-type (matrix-qr (matrix [[1.0 2.0] [3.0 4.0]])))
> (Values (Array Flonum) (Array Flonum))
> > (:print-type (matrix-qr (matrix [[1.0+0.0i 2.0+0.0i]
> [3.0+0.0i 4.0+0.0i]])))
> (Values (Array Float-Complex) (Array Float-Complex))
>
> The current release candidate returns two (Array Real) and (Array Number) instead.
>
> Since I've brought up decompositions, you should know that these, `matrix-inverse', `matrix-solve', and pretty much every O(n^3) operation except multiplication will be slow on large matrices. I can't think of a way to inline them in a way that TR can optimize. For fast versions, which will use LAPACK if it's installed, you'll have to wait for 5.3.3. :/

Will the LAPACK bindings be in the git head sooner, maybe? ;-)
>

> Is the MKL library binary-compatible with LAPACK?
>

I did not try it myself yet, but the Intel documentation does say so:
"If your application already relies on the BLAS or LAPACK functionality, simply re-link with Intel® MKL to get better performance on Intel and compatible architectures."
http://software.intel.com/en-us/intel-mkl/

Berthold

Jens Axel Søgaard

Jan 22, 2013, 5:13:38 PM
to Berthold Bäuml, us...@racket-lang.org
2013/1/22 Berthold Bäuml <berthol...@dlr.de>:

> On 22.01.2013, at 06:19, Neil Toronto <neil.t...@gmail.com> wrote:
>> On 01/21/2013 04:33 PM, Berthold Bäuml wrote:

>>> For the latter it would be great if one could even change the used library to,
>>> e.g., redirect to a installation of the highly efficient MKL library from Intel.
>>> Looking forward to benchmark it against Numpy and Mathematica (which is
>>> MKL based) again!

>> Is the MKL library binary-compatible with LAPACK?


>
> I did not try it myself yet, but the Intel documentation does say so:
> "If your application already relies on the BLAS or LAPACK functionality, simply
> re-link with Intel® MKL to get better performance on Intel and compatible
> architectures."

> http://software.intel.com/en-us/intel-mkl/

In that case you can try it yourself.
All you need to do is change the paths of BLAS and LAPACK in:

(define veclib-lib ; CBLAS
  (ffi-lib "/System/Library/Frameworks/vecLib.framework/Versions/Current/vecLib"))

(define cblas-lib veclib-lib)

(define lapack-lib
  (ffi-lib
   (string-append
    "/System/Library/Frameworks/Accelerate.framework/"
    "Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK")))


--
Jens Axel Søgaard
