> Note: Disable "Populate "compiled" directories (for faster loading)"
> to run it in DrRacket. To disable it: open the Language menu, choose
> the menu item "Choose Language", click the "advanced" button, and
> then disable the option.
>
> If it is enabled, I currently get this curious error from the compiler:
>
> ../collects/compiler/cm.rkt:438:6: write: cannot marshal value that is
> embedded in compiled code
> value: #<ctype:float>
>
> I am unsure whether it is a bug in my code or in cm.rkt.
Most likely you ran into 3D syntax (as I did before :)
http://lists.racket-lang.org/dev/archive//2012-July/010101.html
Without knowing the exact location of the error, I think you tried to pass
a ctype from phase 1 to phase 0 by simply inserting it there (e.g. by #,ctype).
But as a ctype can't be printed it can't be stored in the compiled
bytecode.
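A minimal sketch of the two patterns (the macro name `float-type' is made
up; I don't know where in your code the ctype gets inserted):

#lang racket
(require ffi/unsafe
         (for-syntax racket/base ffi/unsafe))

;; BAD (3D syntax): the macro inserts the phase-1 ctype *value* into its
;; expansion, so the compiler has to write that value into the bytecode:
;;   (define-syntax (float-type stx) #`#,_float)
;;   => write: cannot marshal value that is embedded in compiled code

;; OK: expand to a reference to the phase-0 binding `_float'; only the
;; identifier is stored in the compiled code, and the ctype is looked up
;; at run time.
(define-syntax (float-type stx) #'_float)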
Tobias
Aye, that's been done. The fix isn't in the current release candidate,
though. It'll be in the next one.
(I made arrays strict by default in two stages: one for `math/array',
and one for `math/matrix'. The release candidate was rolled out between
stages, so the `math/matrix' changes aren't in it. The change for
`matrix*' ensures that the n*p*m-size matrix is nonstrict, so it never
allocates an n*p*m-size vector to store its elements.)
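A quick way to see the difference with math/array directly (a sketch;
whether `build-array' returns a strict array by default depends on the
release, so it forces nonstrictness explicitly):

#lang racket
(require math/array)

;; A nonstrict array is backed by a procedure, so building a large
;; intermediate array does not allocate storage for its elements.
(define big
  (parameterize ([array-strictness #f])
    (build-array #(1000 1000 100) (λ (js) (vector-ref js 0)))))

(array-strict? big)    ; #f: elements are computed on demand
;; (array-strict! big) ; forcing it would allocate a 10^8-element vector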
> Numpy (and Mathematica, I believe) uses BLAS and LAPACK, which use
> Strassen's algorithm (or better?) for matrices this large, so there will
> still be a performance gap.
>
> If this performance gap turns out to be too large, we must figure
> out a way to integrate BLAS and LAPACK bindings into math/matrix.
Yes, this.
I'd like to make an `FlMatrix' type that uses BLAS and LAPACK when
they're available, and falls back to a Racket floating-point-only
implementation. The latter would at least fix the biggest performance
issue with flonum matrices, which is that their elements are heap-allocated.
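For example, something along these lines (just a sketch; the `flmatrix'
struct name and accessors are placeholders, not the final API):

#lang typed/racket
(require racket/flonum)

;; Store the elements unboxed in an FlVector instead of a
;; (Vectorof Flonum), whose elements are individually heap-allocated.
(struct flmatrix ([rows : Index] [cols : Index] [data : FlVector]))

(: flmatrix-ref (flmatrix Index Index -> Flonum))
(define (flmatrix-ref M i j)
  (flvector-ref (flmatrix-data M)
                (+ (* i (flmatrix-cols M)) j)))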
In the meantime, I'll profile matrix multiply and see if I can speed it
up. My 1000x1000 multiply just finished at 418sec, which is... pitiful.
Neil ⊥
With the FFI binding example from Jens (thank you!) I get 450ms for the 1000x1000 multiply -- only 2x slower than Mathematica or Numpy. So integrating such a binding into math/matrix looks promising.
Nevertheless, my original motivation for the little test was to get an impression of what performance could be achieved purely in Typed Racket for numerical algorithms. Would it in principle be possible to come close to pure C-code performance (using the same algorithm) with a floating-point implementation?
> In the meantime, I'll profile matrix multiply and see if I can speed it up. My 1000x1000 multiply just finished at 418sec, which is... pitiful.
>
> Neil ⊥
>
Berthold
--
-----------------------------------------------------------------------
Berthold Bäuml
DLR, Robotics and Mechatronics Center (RMC)
Münchner Str. 20, D-82234 Wessling
Phone +49 8153 282489
http://www.robotic.de/Berthold.Baeuml
> With the FFI binding example from Jens (thank you!) I get 450ms for the
> 1000x1000 multiply -- only 2x slower than Mathematica or Numpy. So integrating
> such a binding into math/matrix looks promising.
Huh? I am surprised it is twice as slow.
Ah! The Numpy example uses floats and not doubles.
> Nevertheless, my original motivation for the little test was to get an impression of
> what performance could be achieved in purely Typed Racket for numerical
> algorithms. Would it in principle be possible to come close to pure C-code
> performance (using the same algorithm) when using a floating-point-
> implementation?
Let's try and write a matrix* that works for matrices represented as
vectors of floats and see how fast it is.
/Jens Axel
It may be FFI overhead as well.
>> Nevertheless, my original motivation for the little test was to get an impression of
>> what performance could be achieved in purely Typed Racket for numerical
>> algorithms. Would it in principle be possible to come close to pure C-code
>> performance (using the same algorithm) when using a floating-point-
>> implementation?
>
> Let's try and write a matrix* that works for matrices represented as
> vectors of floats and see how fast it is.
I just did that. Here are the types:
real-matrix* : (Array Real) (Array Real) -> (Array Real)
flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)
flmatrix* : FlArray FlArray -> FlArray
Results so far, measured in DrRacket with debugging off:
Function        Size         Time
-----------------------------------------
matrix*         100x100        340ms
real-matrix*    100x100         40ms
flonum-matrix*  100x100         10ms
flmatrix*       100x100          6ms

matrix*         1000x1000   418000ms
real-matrix*    1000x1000    76000ms
flonum-matrix*  1000x1000     7000ms
flmatrix*       1000x1000     4900ms
The only difference between `real-matrix*' and `flonum-matrix*' is that
the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if
I can inline `real-matrix*', TR's optimizer will change the former to
the latter, making `flonum-matrix*' unnecessary. (FWIW, this would be
the largest speedup TR's optimizer will have ever shown me.)
It looks like the biggest speedup comes from doing only flonum ops in
the inner loop sum, which keeps all the intermediate flonums unboxed
(i.e. not heap-allocated or later garbage-collected).
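To illustrate the difference in the inner loop (a sketch of just the
summation, not the actual math/matrix code):

#lang typed/racket
(require racket/flonum)

;; Generic `+' and `*' on Real operands allocate boxed flonums for the
;; intermediate results ...
(: dot-real ((Vectorof Real) (Vectorof Real) -> Real))
(define (dot-real xs ys)
  (for/fold ([sum : Real 0]) ([x (in-vector xs)] [y (in-vector ys)])
    (+ sum (* x y))))

;; ... while `fl+' and `fl*' let TR keep the intermediate flonums unboxed.
(: dot-flonum ((Vectorof Flonum) (Vectorof Flonum) -> Flonum))
(define (dot-flonum xs ys)
  (for/fold ([sum : Flonum 0.0]) ([x (in-vector xs)] [y (in-vector ys)])
    (fl+ sum (fl* x y))))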
Right now, `flmatrix*' is implemented a bit stupidly, so I could speed
it up further. I won't yet, because I haven't settled on a type for
matrices of unboxed flonums. The type has to work with LAPACK if it's
installed, which `FlArray' doesn't do because its data layout is
row-major and LAPACK expects column-major.
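For reference, the two layouts differ only in how the flat index of element
(i,j) is computed for an m-by-n matrix (the function names below are just
for illustration):

#lang typed/racket

(: row-major-index (Index Index Index Index -> Nonnegative-Integer))
(define (row-major-index m n i j)              ; FlArray, C
  (+ (* i n) j))

(: column-major-index (Index Index Index Index -> Nonnegative-Integer))
(define (column-major-index m n i j)           ; LAPACK, Fortran
  (+ i (* j m)))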
I'll change `matrix*' to look like `real-matrix*'. It won't give the
very best performance, but it's a 60x speedup for 1000x1000 matrices.
Neil ⊥
Sorry for the confusion. I am used to the Racket documentation,
where "float" means single precision, so I misinterpreted the Numpy
documentation.
Also
> numpy.random.random((5,5))
array([[ 0.04748764, 0.15242073, 0.2366255 , 0.86926654, 0.16586114],
       [ 0.33117124, 0.7504899 , 0.43179013, 0.1992083 , 0.83266119],
       [ 0.25741902, 0.36912007, 0.38273193, 0.29622522, 0.75817261],
       [ 0.49313726, 0.20514653, 0.45329344, 0.964105 , 0.77459153],
       [ 0.5267172 , 0.87269101, 0.34530438, 0.38218051, 0.05107181]])
suggested that the numbers were single precision. It turns out
that random only produces 53-bit random floats.
Anyway, here is a standard matrix multiplication algorithm in
Typed Racket.
#lang typed/racket
(require racket/unsafe/ops
         racket/flonum)

(define-type NF Nonnegative-Fixnum)

(: matrix* :
   (Vectorof Flonum) (Vectorof Flonum) Index Index Index
   -> (Vectorof Flonum))
;; Multiply the m x p matrix A by the p x n matrix B, both stored row-major
;; in flat vectors; the m x n result is returned as a flat row-major vector.
(define (matrix* A B m p n)
  (define size (* m n))
  (define: v : (Vectorof Flonum) (make-vector size 0.0))
  (define index 0)                      ; position of (i,j) in the result
  (: loop-i : NF -> (Vectorof Flonum))
  (define (loop-i i)                    ; loop over the rows of A
    (cond [(< i m)
           (loop-j i 0)
           (loop-i (+ i 1))]
          [else v]))
  (: loop-j : NF NF -> Void)
  (define (loop-j i j)                  ; loop over the columns of B
    (when (< j n)
      (loop-k i j 0 0.0)
      (set! index (+ index 1))
      (loop-j i (+ j 1))))
  (: loop-k : NF NF NF Flonum -> Void)
  (define (loop-k i j k sum)            ; inner dot-product accumulation
    (define i*p (* i p))
    (define k*n (* k n))
    (cond
      [(< k p)
       (loop-k i j (+ k 1)
               (+ sum
                  (* (unsafe-vector-ref A (+ i*p k))
                     (unsafe-vector-ref B (+ k*n j)))))]
      [else
       (vector-set! v index sum)]))
  (loop-i 0))

(define dim 200)
(define A (build-vector (* dim dim) (λ (_) (random))))
(define B (build-vector (* dim dim) (λ (_) (random))))
(collect-garbage)
(collect-garbage)
(define A*B (time (matrix* A B dim dim dim)))
/Jens Axel
The next release candidate will have a `matrix*' that runs as fast as
`flonum-matrix*' on (Matrix Flonum) arguments, and as fast as
`real-matrix*' on (Matrix Real) arguments.
(Curiously, I wasn't able to speed up `flmatrix*' at all. The "a bit
stupidly" implementation was actually the fastest I could do, which
probably means it's as fast as it can be done in Racket without dropping
to C.)
Additionally, all the matrix functions will have more precise types to
make it easy for TR to prove that operations on (Matrix Flonum) values
yield (Matrix Flonum) values. (Currently, it only proves (Matrix Real)
and (Matrix Number).) Example:
> (:print-type (matrix-qr (matrix [[1.0 2.0] [3.0 4.0]])))
(Values (Array Flonum) (Array Flonum))
> (:print-type (matrix-qr (matrix [[1.0+0.0i 2.0+0.0i]
                                   [3.0+0.0i 4.0+0.0i]])))
(Values (Array Float-Complex) (Array Float-Complex))
The current release candidate returns two (Array Real) values and two
(Array Number) values, respectively.
Since I've brought up decompositions, you should know that these,
`matrix-inverse', `matrix-solve', and pretty much every O(n^3) operation
except multiplication will be slow on large matrices. I can't think of a
way to inline them in a way that TR can optimize. For fast versions,
which will use LAPACK if it's installed, you'll have to wait for 5.3.3. :/
Is the MKL library binary-compatible with LAPACK?
Neil ⊥
> On 01/21/2013 04:33 PM, Berthold Bäuml wrote:
>>
>>> [... results and explanation quoted in full above ...]
>>
>> These results look very promising, especially if, as you mentioned, real-matrix* will in the end automatically reach flonum-matrix* performance for Flonums and flmatrix* will automatically switch to a LAPACK-based variant when available. For the latter it would be great if one could even change the library used, e.g. to redirect to an installation of the highly efficient MKL library from Intel.
>>
>> Looking forward to benchmarking it against Numpy and Mathematica (which is MKL-based) again!
>
> The next release candidate will have a `matrix*' that runs as fast as `flonum-matrix*' on (Matrix Flonum) arguments, and as fast as `real-matrix*' on (Matrix Real) arguments.
Cool!
> (Curiously, I wasn't able to speed up `flmatrix*' at all. The "a bit stupidly" implementation was actually the fastest I could do, which probably means it's as fast as it can be done in Racket without dropping to C.)
>
> Additionally, all the matrix functions will have more precise types to make it easy for TR to prove that operations on (Matrix Flonum) values yield (Matrix Flonum) values. (Currently, it only proves (Matrix Real) and (Matrix Number).) Example:
>
> > (:print-type (matrix-qr (matrix [[1.0 2.0] [3.0 4.0]])))
> (Values (Array Flonum) (Array Flonum))
> > (:print-type (matrix-qr (matrix [[1.0+0.0i 2.0+0.0i]
> [3.0+0.0i 4.0+0.0i]])))
> (Values (Array Float-Complex) (Array Float-Complex))
>
> The current release candidate returns two (Array Real) and (Array Number) instead.
>
> Since I've brought up decompositions, you should know that these, `matrix-inverse', `matrix-solve', and pretty much every O(n^3) operation except multiplication will be slow on large matrices. I can't think of a way to inline them in a way that TR can optimize. For fast versions, which will use LAPACK if it's installed, you'll have to wait for 5.3.3. :/
Will the LAPACK bindings be in the git head sooner, maybe? ;-)
>
> Is the MKL library binary-compatible with LAPACK?
>
I have not tried it myself yet, but the Intel documentation does say so:
"If your application already relies on the BLAS or LAPACK functionality, simply re-link with Intel® MKL to get better performance on Intel and compatible architectures."
http://software.intel.com/en-us/intel-mkl/
Berthold
>>> For the latter it would be great if one could even change the library used,
>>> e.g. to redirect to an installation of the highly efficient MKL library from Intel.
>>> Looking forward to benchmarking it against Numpy and Mathematica (which is
>>> MKL-based) again!
>> Is the MKL library binary-compatible with LAPACK?
>
> I have not tried it myself yet, but the Intel documentation does say so:
> "If your application already relies on the BLAS or LAPACK functionality, simply
> re-link with Intel® MKL to get better performance on Intel and compatible
> architectures."
> http://software.intel.com/en-us/intel-mkl/
In that case you can try it yourself.
All you need to do is change the paths of BLAS and LAPACK in:
(define veclib-lib ; CBLAS
  (ffi-lib "/System/Library/Frameworks/vecLib.framework/Versions/Current/vecLib"))
(define cblas-lib veclib-lib)

(define lapack-lib
  (ffi-lib
   (string-append
    "/System/Library/Frameworks/Accelerate.framework/"
    "Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK")))
--
Jens Axel Søgaard