In spite of Erik's nice signature, which I have chosen for this message
too, I'm still interested in the low-level performance of my programs.
In my case (I'm doing numerical analysis for partial differential
equations), it is especially floating-point performance that matters.
I'm using CMUCL, and it doesn't perform badly in comparison with C, at
least on my computer (some of you will remember that this group helped
me with my first steps in CL on exactly this problem).
Now, what I would like to have is some more data on how Lisp
implementations run this program. In particular, I would be interested
in results for CMUCL on Sun workstations, and for ACL, LispWorks, ... on
x86 and other architectures. If you would like to test it, please go
ahead; I'm very interested in the results. Please always report the
results for the C program as well.
Nicolas.
P.S.: The demo versions of the commercial Lisps will probably not
allocate the memory needed by the program. Also: don't be too
disappointed if your Lisp does not perform very well. Floating-point
performance is not of the highest importance for most applications.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;; mflop.lisp
;;;; (C) Nicolas Neuss (Nicola...@iwr.uni-heidelberg.de)
;;;; mflop.lisp is in the public domain.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(defconstant +N-long+ #x100000) ; does not fit in secondary cache
(defconstant +N-short+ #x100) ; fits in primary cache
(defparameter *mflop-delta* 5.0
"Time interval in seconds over which we measure performance.")
(defun make-double-float-array (size &optional (initial 0.0d0))
(make-array size :element-type 'double-float :initial-element initial))
(defun ddot (x y n)
(declare (type fixnum n)
(type (simple-array double-float (*)) x y))
(declare (optimize (safety 0) (space 0) (debug 0) (speed 3)))
(loop for i fixnum from 0 below n
summing (* (aref x i) (aref y i)) double-float))
(defun daxpy (x y n)
(declare (type fixnum n)
(type (simple-array double-float (*)) x y))
(declare (optimize (safety 0) (space 0) (debug 0) (speed 3)))
(loop with s double-float = 0.3d0
for i from 0 below n do
(setf (aref x i) (+ (* s (aref y i))))))
(defun test (fn size)
(let ((x (make-double-float-array +N-long+))
(y (make-double-float-array +N-long+)))
(format
t "~A-~A: ~$ MFLOPS~%"
fn
(if (= size +N-long+) "long" "short")
(loop with after = 0
for before = (get-internal-run-time) then after
and count = 1 then (* count 2)
do
(loop repeat count do (funcall fn x y size))
(setq after (get-internal-run-time))
(when (> (/ (- after before) internal-time-units-per-second)
*mflop-delta*)
(return (/ (* 2 size count internal-time-units-per-second)
(* 1e6 (- after before)))))))))
(defun mflop-test ()
"Returns several numbers characteristic for floating point efficiency of
your CL implementation. Please compare these numbers to those obtained by
the C version in mflop.c."
(test 'ddot +N-long+)
(test 'ddot +N-short+)
(test 'daxpy +N-long+)
(test 'daxpy +N-short+))
#+ignore (mflop-test)
/**********************************************************************
mflop.c -- performance testing
(C) Nicolas Neuss (Nicola...@iwr.uni-heidelberg.de)
mflop.c is public domain.
**********************************************************************/
/* Reasonable compilation lines are
Linux: gcc -O3 mflop.c
IRIS Octane: cc -Ofast mflop.c
Sparc Ultra II: cc -fast mflop.c
IBM RS6000/590: xlc -O3 -qarch=pwrx -qtune=pwrx mflop.c */
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#define MFLOP_DELTA 5.0 /* time interval over which we measure performance */
#define Nlong 1000000 /* does not fit in secondary cache */
#define Nshort 256 /* fits in primary cache */
#define CURRENT_TIME (((double)clock()) / ((double)CLOCKS_PER_SEC))
double ddot (double *x, double *y, int n) {
int j;
double s = 0.0;
for (j=0; j<n; j++)
s += x[j]*y[j];
return s;
}
double daxpy (double *x, double *y, int n) {
int j;
double s = 0.1;
for (j=0; j<n; j++)
y[j] += s*x[j];
return 0.0;
}
typedef double testfun (double *, double *, int n);
void test (testfun f, char *name, int n) {
int i, nr;
double start_time, end_time;
double s = 0.0;
double *x = (double *) malloc(sizeof(double)*Nlong);
double *y = (double *) malloc(sizeof(double)*Nlong);
for (i=0; i<Nlong; i++)
x[i] = 0.0; y[i] = 0.9;
nr = 1;
do {
nr = 2*nr;
start_time = CURRENT_TIME;
for (i=0; i<nr; i++)
s += f(x, y, n);
end_time = CURRENT_TIME;
} while (end_time-start_time<MFLOP_DELTA);
printf ("%s%s %4.2f MFLOPS\n", name, ((n==Nlong) ? "-long":"-short"),
1.0e-6*2*n*(s+nr/(end_time-start_time)));
}
int main (void) {
test(ddot, "ddot", Nlong);
test(ddot, "ddot", Nshort);
test(daxpy, "daxpy", Nlong);
test(daxpy, "daxpy", Nshort);
return 0;
}
Sample results for my Toshiba TECRA 8000 Laptop:
CMUCL:
* ;;; Evaluate mflop-test
DDOT-long: 42.01 MFLOPS
DDOT-short: 108.90 MFLOPS
DAXPY-long: 23.46 MFLOPS
DAXPY-short: 136.26 MFLOPS
NIL
gcc -O3 mflop-neu.c; a.out
ddot-long 62.75 MFLOPS
ddot-short 178.36 MFLOPS
daxpy-long 22.82 MFLOPS
daxpy-short 119.70 MFLOPS
Speed disadvantage of CMUCL:
ddot-long: 1.7
ddot-short: 0.61
daxpy-long: 1.0
daxpy-short: 0.9
--
Performance is the last refuge of the miserable programmer.
-- Erik Naggum
1. Of course, you should drop the #+ignore in front of the call
(mflop-test) in mflop.lisp.
2. Speed disadvantage of CMUCL:
ddot-long: 0.67
ddot-short: 0.61
daxpy-long: 1.0
daxpy-short: 0.9
I.e. not too much difference (about 0 - 40% loss of efficiency).
By the way: this program was optimized for CMUCL. Some changes may be
necessary for good performance on ACL or LispWorks. I would be
interested in those changes as well.
Yours, Nicolas
Too large for the trial version of Allegro/Franz.
DDOT-long: 17.73 MFLOPS
Error: An allocation request for 8388624 bytes caused a need for
22282240 more bytes of heap. This request cannot be satisfied
because you have hit the Allegro CL Trial heap limit.
Igor.
x86 - celeron 900 MHz - WinXP Home :
Lispworks:
Error: Unknown LOOP keyword in (... DOUBLE-FLOAT). Maybe missing
OF-TYPE loop keyword.
CLISP:
DDOT-long: 0.20 MFLOPS
DDOT-short: 0.20 MFLOPS
DAXPY-long: 0.16 MFLOPS
DAXPY-short: 0.16 MFLOPS
ACL 6.1 (trial version):
DDOT-long: 16.73 MFLOPS
and then error because of trial version.
The CLISP results are depressing.
Igor.
Compiled my mflop.c with mingw32 2.95.2
gcc -O3 -c mflop.c
dllwrap --output-lib=libmflop.a --dllname=mflop.dll --driver-name=gcc mflop.o
Results on 1.4GHz Athlon
CL-USER 36 > (mflop-test)
DDOT-long: 49.18 MFLOPS
DDOT-short: 339.85 MFLOPS
DAXPY-long: 41.30 MFLOPS
DAXPY-short: 372.31 MFLOPS
NIL
CL-USER 37 >
The best I could get for LWW (with modifications) with straight Lisp was
CL-USER 27 > (mflop-test)
DDOT-long: 5.32 MFLOPS
DDOT-short: 5.65 MFLOPS
DAXPY-long: 3.19 MFLOPS
DAXPY-short: 3.64 MFLOPS
NIL
mflop.c -----------------------------------
#ifdef BUILD_DLL
// the dll exports
#define EXPORT __declspec(dllexport)
#else
// the exe imports
#define EXPORT __declspec(dllimport)
#endif
// function to be imported/exported
EXPORT double ddot (double*, double*, int);
EXPORT double daxpy (double*, double*, int);
EXPORT double ddot (double *x, double *y, int n) {
int j;
double s = 0.0;
for (j=0; j<n; j++)
s += x[j]*y[j];
return s;
}
EXPORT double daxpy (double *x, double *y, int n) {
int j;
double s = 0.1;
for (j=0; j<n; j++)
y[j] += s*x[j];
return 0.0;
}
mflop.lisp ------------------------------
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;; mflop.lisp
;;;; (C) Nicolas Neuss (Nicola...@iwr.uni-heidelberg.de)
;;;; mflop.lisp is in the public domain.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(defconstant +N-long+ #x100000) ; does not fit in secondary cache
(defconstant +N-short+ #x100) ; fits in primary cache
(defparameter *mflop-delta* 5.0
"Time interval in seconds over which we measure performance.")
(fli:register-module #p"d:/user/wade/lww/mflop.dll")
(fli:define-foreign-function (ddot "ddot")
((x (:pointer :double))
(y (:pointer :double))
(n :int))
:result-type :double
:calling-convention :cdecl)
(fli:define-foreign-function (daxpy "daxpy")
((x (:pointer :double))
(y (:pointer :double))
(n :int))
:result-type :double
:calling-convention :cdecl)
(defun allocate-double-float-array (size &optional (initial 0.0d0))
(fli:allocate-foreign-object :type :double :nelems size :initial-element initial))
(defun free-double-float-array (array)
(fli:free-foreign-object array))
(defun test (fn size)
(let ((x (allocate-double-float-array +N-long+))
(y (allocate-double-float-array +N-long+)))
(unwind-protect
(format
t "~A-~A: ~$ MFLOPS~%"
fn
(if (= size +N-long+) "long" "short")
(loop with after = 0
for before = (get-internal-run-time) then after
and count = 1 then (* count 2)
do
(loop repeat count do (funcall fn x y size))
(setq after (get-internal-run-time))
(when (> (/ (- after before) internal-time-units-per-second)
*mflop-delta*)
(return (/ (* 2 size count internal-time-units-per-second)
(* 1e6 (- after before)))))))
(free-double-float-array x)
(free-double-float-array y))))
> Just for the fun of it, I changed around your code for LWW 4.1.20. Made a
> DLL and did a foreign function approach.
>
> Compiled my mflop.c with mingw32 2.95.2
>
> gcc -O3 -c mflop.c
> dllwrap --output-lib=libmflop.a --dllname=mflop.dll --driver-name=gcc
> mflop.o
>
> Results on 1.4GHz Athlon
>
> CL-USER 36 > (mflop-test)
> DDOT-long: 49.18 MFLOPS
> DDOT-short: 339.85 MFLOPS
> DAXPY-long: 41.30 MFLOPS
> DAXPY-short: 372.31 MFLOPS
> NIL
>
> CL-USER 37 >
>
> The best I could get for LWW (with modifications) with straight Lisp was
>
> CL-USER 27 > (mflop-test)
> DDOT-long: 5.32 MFLOPS
> DDOT-short: 5.65 MFLOPS
> DAXPY-long: 3.19 MFLOPS
> DAXPY-short: 3.64 MFLOPS
> NIL
Mine was as follows (with LW4.2.6):
Results on Duron 800 :
jsc@shsp0629:~ > gcc -O3 mflop.c
jsc@shsp0629:~ > ./a.out
ddot-long 78.29 MFLOPS
ddot-short 382.80 MFLOPS
daxpy-long 34.59 MFLOPS
daxpy-short 594.87 MFLOPS
CL-USER 8 > (mflop-test)
DDOT-long: 33.55 MFLOPS
DDOT-short: 65.39 MFLOPS
DAXPY-long: 28.99 MFLOPS
DAXPY-short: 56.69 MFLOPS
NIL
For the CL version I had to reduce the size of the longer double array
because it exceeded the maximum array size in LW4.2.6.
The bad numbers for the short tests may be due to the smaller caches on
the Duron. Setting the "float" optimization flag to 0 doesn't seem to
bring much gain.
This is the actual code I used
(note that I changed the setf in daxpy to incf - which looks better if you
compare it with the C source):
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;; mflop.lisp
;;;; (C) Nicolas Neuss (Nicola...@iwr.uni-heidelberg.de)
;;;; mflop.lisp is in the public domain.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(defconstant +N-long+ (floor #x100000 2)) ; does not fit in secondary cache
(defconstant +N-short+ #x100) ; fits in primary cache
(defparameter mflop-delta 5.0
"Time interval in seconds over which we measure performance.")
(defun make-double-float-array (size &optional (initial 0.0d0))
(make-array size :element-type 'double-float :initial-element initial))
(defun ddot (x y n)
(declare (type fixnum n)
(type (simple-array double-float (*)) x y))
(declare (optimize (safety 0) (space 0) (debug 0) (speed
3)#+lispworks(float 0)))
(loop for i of-type fixnum from 0 below n
summing (* (aref x i) (aref y i)) of-type double-float))
(defun daxpy (x y n)
(declare (type fixnum n)
(type (simple-array double-float (*)) x y))
(declare (optimize (safety 0) (space 0) (debug 0) (speed 3)#-lispworks
(float 0)))
(loop with s of-type double-float = 0.3d0
for i of-type fixnum from 0 below n do
(incf (aref x i) (* s (aref y i)))))
(defun test (fn size)
(let ((x (make-double-float-array +N-long+))
(y (make-double-float-array +N-long+)))
(format
t "~A-~A: ~$ MFLOPS~%"
fn
(if (= size +N-long+) "long" "short")
(loop with after = 0
for before = (get-internal-run-time) then after
and count = 1 then (* count 2)
do
(loop repeat count do (funcall fn x y size))
(setq after (get-internal-run-time))
(when (> (/ (- after before) internal-time-units-per-second)
mflop-delta)
(return (/ (* 2 size count internal-time-units-per-second)
(* 1e6 (- after before)))))))))
(defun mflop-test ()
"Returns several numbers characteristic for floating point efficiency of
your CL implementation. Please compare these numbers to those obtained by
the C version in mflop.c."
(test 'ddot +N-long+)
(test 'ddot +N-short+)
(test 'daxpy +N-long+)
(test 'daxpy +N-short+))
;(mflop-test)
> (defun daxpy (x y n)
> (declare (type fixnum n)
> (type (simple-array double-float (*)) x y))
> (declare (optimize (safety 0) (space 0) (debug 0) (speed 3)#-lispworks
> (float 0)))
> (loop with s of-type double-float = 0.3d0
> for i of-type fixnum from 0 below n do
> (incf (aref x i) (* s (aref y i)))))
For the given test values I had #+lispworks in the above code and not
#-lispworks.
ciao,
Jochen
Used your code. The result on a 1.4GHz Athlon, LWW 4.1.20 is
CL-USER 1 > (mflop-test)
DDOT-long: 51.55 MFLOPS
DDOT-short: 116.80 MFLOPS
DAXPY-long: 42.62 MFLOPS
DAXPY-short: 91.48 MFLOPS
NIL
CL-USER 2 >
So the key is (float 0).
Wade
> Mine was as follows (with LW4.2.6):
>
> Results on Duron 800 :
>
> jsc@shsp0629:~ > gcc -O3 mflop.c
> jsc@shsp0629:~ > ./a.out
> ddot-long 78.29 MFLOPS
> ddot-short 382.80 MFLOPS
> daxpy-long 34.59 MFLOPS
> daxpy-short 594.87 MFLOPS
>
> CL-USER 8 > (mflop-test)
> DDOT-long: 33.55 MFLOPS
> DDOT-short: 65.39 MFLOPS
> DAXPY-long: 28.99 MFLOPS
> DAXPY-short: 56.69 MFLOPS
> NIL
LWL4.2.6 on a Pentium IV 1.7 GHz:
cartan@darkstar:[lisp]$ gcc -O3 -o mflop mflop.c
cartan@darkstar:[lisp]$ ./mflop
ddot-long 381.38 MFLOPS
ddot-short 610.08 MFLOPS
daxpy-long 150.59 MFLOPS
daxpy-short 525.06 MFLOPS
CL-USER 1 > (mflop-test)
DDOT-long: 118.25 MFLOPS
DDOT-short: 117.86 MFLOPS
DAXPY-long: 132.40 MFLOPS
DAXPY-short: 251.46 MFLOPS
NIL
Regards,
--
Nils Goesche
Ask not for whom the <CONTROL-G> tolls.
PGP key ID #xC66D6E6F
* (mflop-test)
DDOT-long: 71.39 MFLOPS
DDOT-short: 446.00 MFLOPS
DAXPY-long: 72.65 MFLOPS
DAXPY-short: 465.83 MFLOPS
> LWL4.2.6 on a Pentium IV 1.7 GHz:
>
> cartan@darkstar:[lisp]$ gcc -O3 -o mflop mflop.c
> cartan@darkstar:[lisp]$ ./mflop
> ddot-long 381.38 MFLOPS
> ddot-short 610.08 MFLOPS
> daxpy-long 150.59 MFLOPS
> daxpy-short 525.06 MFLOPS
>
>
> CL-USER 1 > (mflop-test)
> DDOT-long: 118.25 MFLOPS
> DDOT-short: 117.86 MFLOPS
> DAXPY-long: 132.40 MFLOPS
> DAXPY-short: 251.46 MFLOPS
> NIL
Oh, and CMUCL 18d:
* (mflop-test)
ddot-long: 280.72 MFLOPS
ddot-short: 454.01 MFLOPS
daxpy-long: 161.22 MFLOPS
daxpy-short: 318.15 MFLOPS
[Note: These are informal benchmarks. I tried to ensure that the machines were
relatively idle (load avg close to 0) before running them. That's about it.]
Linux/x86: Pentium IV 2.4GHz 900 MB RAM
ACL 6.0 Enterprise (Linux/x86) ((speed 1) (safety 1) (debug 2) (space 1)):
DDOT-long: 123.70 MFLOPS
DDOT-short: 129.37 MFLOPS
DAXPY-long: 69.36 MFLOPS
DAXPY-short: 122.99 MFLOPS
ACL 6.0 Enterprise (Linux/x86) ((speed 3) (safety 1) (debug 0) (space 0)):
DDOT-long: 118.91 MFLOPS
DDOT-short: 133.05 MFLOPS
DAXPY-long: 71.97 MFLOPS
DAXPY-short: 125.73 MFLOPS
ACL 6.0 Enterprise (Linux/x86) ((speed 3) (safety 0) (debug 0) (space 0)):
DDOT-long: 119.44 MFLOPS
DDOT-short: 163.18 MFLOPS
DAXPY-long: 68.48 MFLOPS
DAXPY-short: 124.13 MFLOPS
GCC (Linux/x86) (no opt):
ddot-long 162.28 MFLOPS
ddot-short 186.41 MFLOPS
daxpy-long 68.91 MFLOPS
daxpy-short 324.88 MFLOPS
GCC (Linux/x86) (-O):
ddot-long 238.97 MFLOPS
ddot-short 920.68 MFLOPS
daxpy-long 69.85 MFLOPS
daxpy-short 1077.78 MFLOPS
[gcc -O2,3 gave worse performance!]
Looks like ACL 6.0 either doesn't have Pentium IV optimizations or I don't know
how to turn them on.
Solaris/SPARC:
SunOS <host omitted> 5.8 Generic_108528-13 sun4u sparc SUNW,Ultra-60
sparcv9 at 360MHz, has sparcv9 FP processor
512MB RAM
ACL 6.0 Enterprise (Solaris/SPARC) ((speed 1) (safety 1) (debug 2) (space 1))
DDOT-long: 23.38 MFLOPS
DDOT-short: 43.23 MFLOPS
DAXPY-long: 27.42 MFLOPS
DAXPY-short: 56.51 MFLOPS
ACL 6.0 Enterprise (Solaris/SPARC) ((speed 3) (safety 0) (debug 0) (space 0))
DDOT-long: 23.59 MFLOPS
DDOT-short: 43.16 MFLOPS
DAXPY-long: 27.31 MFLOPS
DAXPY-short: 56.57 MFLOPS
ucbcc (Solaris/SPARC) (-fast):
Segmentation Fault (core dumped)
GCC (Solaris/SPARC) (-O3):
ddot-long 22.42 MFLOPS
ddot-short 63.76 MFLOPS
Segmentation Fault (core dumped)
[ I had to add free(x); free(y); at end of test() to even get it this far! ]
--
; Matthew Danish <mda...@andrew.cmu.edu>
; OpenPGP public key: C24B6010 on keyring.debian.org
; Signed or encrypted mail welcome.
; "There is no dark side of the moon really; matter of fact, it's all dark."
GCC (Solaris/SPARC) (-O3):
daxpy-long 21.26 MFLOPS
daxpy-short 58.80 MFLOPS
> for (i=0; i<Nlong; i++)
> x[i] = 0.0; y[i] = 0.9;
I receive a segmentation fault on Solaris using gcc and the Sun
compiler. One problem I noticed is the for loop above in test(). There
are two statements there:
x[i] = 0.0;
y[i] = 0.9;
You must use { ... } when there is more than one statement. Once I do
this, the code runs fine on Solaris. I have not had a chance to test it
further. Is this what you intended:
for (i=0; i<Nlong; i++) {
x[i] = 0.0;
y[i] = 0.9;
}
?
--
Shaun Rowland row...@cis.ohio-state.edu
http://www.cis.ohio-state.edu/~rowland/
> Now, what I would like to have is some more data, about how Lisp
> implementations run this program. Especially, I would be interested
> with CMUCL on SUN workstations, ACL, Lispworks, ... on X86 and other
> architectures. If someone would like to test it, please go ahead.
> I'm very interested in the results. Please always report the results
> for the C program
With the for loop fix I mentioned before, these are the results I have
been able to produce using the provided code:
AMD Athlon(TM) XP1800+ (1.55GHz) Red Hat 7.3
============================================
CMUCL 18d
---------
DDOT-long: 103.24 MFLOPS
DDOT-short: 530.90 MFLOPS
DAXPY-long: 89.18 MFLOPS
DAXPY-short: 530.90 MFLOPS
gcc 2.96 (no optimization)
--------------------------
ddot-long 88.58 MFLOPS
ddot-short 170.98 MFLOPS
daxpy-long 78.89 MFLOPS
daxpy-short 166.47 MFLOPS
gcc 2.96 (-O3 optimization)
---------------------------
ddot-long 110.34 MFLOPS
ddot-short 736.70 MFLOPS
daxpy-long 97.15 MFLOPS
daxpy-short 891.07 MFLOPS
Solaris 8 4x480MHz US-II (Ultra Enterprise 450)
===============================================
CMUCL 18c
---------
DDOT-long: 22.41 MFLOPS
DDOT-short: 59.06 MFLOPS
DAXPY-long: 29.34 MFLOPS
DAXPY-short: 85.22 MFLOPS
CMUCL 18d
---------
DDOT-long: 21.93 MFLOPS
DDOT-short: 65.79 MFLOPS
DAXPY-long: 33.68 MFLOPS
DAXPY-short: 91.00 MFLOPS
gcc 3.0.4 (no optimization)
---------------------------
ddot-long 15.08 MFLOPS
ddot-short 28.96 MFLOPS
daxpy-long 13.62 MFLOPS
daxpy-short 26.06 MFLOPS
gcc 3.0.4 (-O3 optimization)
----------------------------
ddot-long 22.07 MFLOPS
ddot-short 84.41 MFLOPS
daxpy-long 22.03 MFLOPS
daxpy-short 77.25 MFLOPS
Sun cc v. 6 update 1 (no optimization)
--------------------------------------
ddot-long 14.87 MFLOPS
ddot-short 28.29 MFLOPS
daxpy-long 12.24 MFLOPS
daxpy-short 21.30 MFLOPS
Sun cc v. 6 update 1 (-fast optimization with -xarch=native [v8plusa])
----------------------------------------------------------------------
ddot-long 38.04 MFLOPS
ddot-short 416.99 MFLOPS
daxpy-long 24.29 MFLOPS
daxpy-short 152.52 MFLOPS
GCC (Linux/x86) (no opt):
ddot-long 117.57 MFLOPS
ddot-short 187.06 MFLOPS
daxpy-long 72.01 MFLOPS
daxpy-short 327.86 MFLOPS
GCC (Linux/x86) (-O):
ddot-long 119.35 MFLOPS
ddot-short 921.67 MFLOPS
daxpy-long 72.01 MFLOPS
daxpy-short 1083.22 MFLOPS
ucbcc (Solaris/SPARC) (-fast):
ddot-long 42.52 MFLOPS
ddot-short 317.68 MFLOPS
daxpy-long 24.38 MFLOPS
daxpy-short 115.46 MFLOPS
GCC (Solaris/SPARC) (-O3):
ddot-long 22.42 MFLOPS
ddot-short 64.53 MFLOPS
daxpy-long 21.26 MFLOPS
daxpy-short 58.74 MFLOPS
1. When using short (IEEE single precision) floating point, code
written in C will likely be noticeably faster than the equivalent
code written in Lisp.
2. When using long (IEEE double precision) floating point, code
written in C will have approximately the same performance as the
equivalent code written in Lisp.
The "short" measurements actually used a smaller array of doubles than
the "long" test. The reason (following the comment in the code) is that
the long test should exceed the caches. On my Duron 800 both parts seem
to have exceeded the caches.
<homer>D'oh!</homer>
Ok, so when using a short array (one that fits in the cache), the C code
should run noticeably faster than the Lisp code. When using a long array
(which does not fit in the cache), the Lisp and the C code seem to take
about the same amount of time (probably because cache refill dominates).
any fix for the lispworks problem ?
Igor.
Of course - you can use the code from my posting in this thread, which
fixes this. The original code used the old LOOP syntax for declaring the
types of LOOP variables.
;; Old Syntax Example
(loop with n fixnum = 5)
;; New Syntax Example
(loop with n of-type fixnum = 5)
For backwards compatibility one can use the old syntax for the types
FIXNUM, FLOAT, T, and NIL. It is unclear to me whether this is meant to
include subtypes of those (like the DOUBLE-FLOAT in the code of the
original posting).
http://www.xanalys.com/software_tools/reference/HyperSpec/Body/06_aag.htm
describes the case in detail.
I would recommend always using OF-TYPE when declaring types for LOOP
variables.
[...]
>
> (defun daxpy (x y n)
> (declare (type fixnum n)
> (type (simple-array double-float (*)) x y))
> (declare (optimize (safety 0) (space 0) (debug 0) (speed 3)))
> (loop with s double-float = 0.3d0
> for i from 0 below n do
> (setf (aref x i) (+ (* s (aref y i))))))
[...]
Probably, the last line in DAXPY should be
(setf (aref x i) (+ (aref x i) (* s (aref y i))))))
Honestly, I don't understand what you are trying to measure in this
test. Is it the speed of floating-point operations, or the speed of
accessing elements of Lisp and C arrays?
On my Linux box with 1.2 GHz Athlon and CMU Common Lisp 18d+:
;;original version of mflop.lisp
DDOT-long: 72.94 MFLOPS
DDOT-short: 451.39 MFLOPS
DAXPY-long: 65.87 MFLOPS
DAXPY-short: 413.77 MFLOPS
;; Version of mflop.lisp with the last line in DAXPY fixed (see above)
DDOT-long: 72.94 MFLOPS
DDOT-short: 449.97 MFLOPS
DAXPY-long: 51.97 MFLOPS
DAXPY-short: 356.72 MFLOPS
Results for C version
GCC-2.95.3
(gcc -o mflop -O3 -malign-functions=4 -malign-loops=4 \
-funroll-loops -fexpensive-optimizations mflop.c)
ddot-long 124.57 MFLOPS
ddot-short 574.19 MFLOPS
daxpy-long 71.41 MFLOPS
daxpy-short 845.47 MFLOPS
(gcc -o mflop -O3 mflop.c)
ddot-long 100.99 MFLOPS
ddot-short 574.96 MFLOPS
daxpy-long 58.92 MFLOPS
daxpy-short 892.92 MFLOPS
(gcc -o mflop -O mflop.c)
ddot-long 78.41 MFLOPS
ddot-short 572.66 MFLOPS
daxpy-long 65.73 MFLOPS
daxpy-short 750.87 MFLOPS
GCC-3.1
(gcc -o mflop -O3 -falign-functions=4 -falign-loops=4 \
-funroll-loops -fexpensive-optimizations mflop.c)
ddot-long 123.08 MFLOPS
ddot-short 574.96 MFLOPS
daxpy-long 72.73 MFLOPS
daxpy-short 884.65 MFLOPS
(gcc-3.1 -o mflop -O3 mflop.c)
ddot-long 78.29 MFLOPS
ddot-short 575.73 MFLOPS
daxpy-long 58.25 MFLOPS
daxpy-short 896.65 MFLOPS
(gcc-3.1 -o mflop -O mflop.c)
ddot-long 123.67 MFLOPS
ddot-short 576.51 MFLOPS
daxpy-long 68.36 MFLOPS
daxpy-short 692.74 MFLOPS
> Nicolas Neuss <Nicola...@iwr.uni-heidelberg.de> writes:
>
> > for (i=0; i<Nlong; i++)
> > x[i] = 0.0; y[i] = 0.9;
>
> I receive a segmentation fault on Solaris using gcc and the Sun
> compiler. One problem I noticed is the for loop above in test(). There
> are two statements there:
>
> x[i] = 0.0;
> y[i] = 0.9;
>
> You must use { ... } when there is more than one statement. Once I do
> this, the code runs fine on Solaris. I have not had a chance to test it
> further. Is this what you intended:
>
> for (i=0; i<Nlong; i++) {
> x[i] = 0.0;
> y[i] = 0.9;
> }
Of course. Here we see drastically the advantage of using Lisp
instead of C for getting correct code :-)
Nicolas.
>
> The CLISP results are depressing.
>
> Igor.
Do not be too depressed. For many problems, performance depends on
other factors, e.g. how fast the network connection is, or the
efficiency of built-in functions like SORT. Also note that you can
get very efficient matrix routines simply by using the Matlisp
interface to the Fortran BLAS/LAPACK routines (see
http://sourceforge.net/projects/matlisp). This would probably even
surpass the C speed, at least for long array computations, where the
generic function overhead can be neglected.
Even if I'm not using it myself, I guess that CLISP is a nice
development environment. Furthermore, it runs on many platforms. And
should you need native code compilation, you would be able to switch
to another implementation quite easily (i.e., with only small
changes, like using OF-TYPE in loop declarations :-)
Yours, Nicolas.
> Nicolas Neuss <Nicola...@iwr.uni-heidelberg.de> writes:
>
>
> [...]
>
> >
> > (defun daxpy (x y n)
> > (declare (type fixnum n)
> > (type (simple-array double-float (*)) x y))
> > (declare (optimize (safety 0) (space 0) (debug 0) (speed 3)))
> > (loop with s double-float = 0.3d0
> > for i from 0 below n do
> > (setf (aref x i) (+ (* s (aref y i))))))
>
> [...]
>
> Probably, the last line in DAXPY should be
> (setf (aref x i) (+ (aref x i) (* s (aref y i))))))
Of course. Even better is using incf as Jochen noted. And to be
completely correct with the naming one should also switch x and y.
(incf (aref y i) (* s (aref x i)))
> Honestly, I don't understand what you are trying to measure in this
> test. Is it the speed of floating-point operations, or the speed of
> accessing elements of Lisp and C arrays?
You're right. I have changed the comment for mflop-test now to:
"Returns several numbers characteristic of the efficiency with which
your CL implementation will process numeric code written in a C/Fortran
style. These results should also be significant for other code that
uses arrays to achieve good data locality. Please compare these numbers
to those obtained by the C version in mflop.c."
I have put updated code at
http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop.c
http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop.lisp
Nicolas
First I want to thank you very much for all the responses I got.
Second, I guess that my code needs some updates which are provided
here:
http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop.c
http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop.lisp
I changed the following:
1. Fixed the bug discovered by Jochen Schmidt and Alexander Skobelev
in the Lisp daxpy routine.
2. Fixed the bug discovered by Shaun Rowland and Roland Kaufmann
(private mail) in the C program.
3. Inserted OF-TYPE in my loop type declarations. Does it work on
Lispworks now out of the box?
4. Changed the interface of the test routine to take both function and
name analogous to the C code. For me this makes no difference, but
for faster machines with slow symbol-lookup this might be
significant.
5. Changed the comment of mflop-test to reflect Alexander Skobelev's
point. These values will be significant for all code using uniform
arrays. Vector operations on long vectors are just an important
example of such code.
Thanks again,
Nicolas.
> 4. Changed the interface of the test routine to take both function and
> name analogous to the C code. For me this makes no difference, but
> for faster machines with slow symbol-lookup this might be
> significant.
I forgot: this was pointed out to me by Paul Foley in private mail.
Nicolas.
Here are some more numbers...
Machine: Intel(R) Pentium(R) 4 CPU 1700MHz
** gcc -O3:
ddot-long 269.12 MFLOPS
ddot-short 624.27 MFLOPS
daxpy-long 174.45 MFLOPS
daxpy-short 523.78 MFLOPS
** cmucl 18d:
DDOT-long: 271.83 MFLOPS
DDOT-short: 483.67 MFLOPS
DAXPY-long: 168.83 MFLOPS
DAXPY-short: 334.50 MFLOPS
So, looks pretty good, right? C and lisp are roughly the same in
performance. Well, unfortunately for cmucl, we get much better
results using Intel's compiler (version 6.0):
** icc -xW -ip -tpp7 -O3
ddot-long 295.53 MFLOPS
ddot-short 1410.50 MFLOPS
daxpy-long 179.02 MFLOPS
daxpy-short 1651.91 MFLOPS
Now lisp doesn't look so good, and neither does gcc. But there's some
good news... using the BLAS routines from Intel's MKL, we get:
** icc -xW -ip -tpp7 -O3 -lmkl_p4
ddot-long 291.74 MFLOPS
ddot-short 1533.92 MFLOPS
daxpy-long 182.53 MFLOPS
daxpy-short 1214.98 MFLOPS
Presumably, you could get these same results from lisp or gcc by using
the appropriate BLAS calls.
Which, I think, leads to a question about what it is you are trying to
do. Rather than worrying about how you can most efficiently implement
numeric kernels for your programs, wouldn't it make more sense to
figure out how to best use the highly optimized numeric codes that are
already out there? For just about anything you want to do, someone's
written a killer fortran subroutine to do just that. Why try to
reinvent it in lisp?
Rob Malouf
malouf at let dot rug dot nl
> Now lisp doesn't look so good, and neither does gcc. But there's some
> good news... using the BLAS routines from Intel's MKL, we get:
>
> ** icc -xW -ip -tpp7 -O3 -lmkl_p4
>
> ddot-long 291.74 MFLOPS
> ddot-short 1533.92 MFLOPS
> daxpy-long 182.53 MFLOPS
> daxpy-short 1214.98 MFLOPS
>
> Presumably, you could get these same results from lisp or gcc by using
> the appropriate BLAS calls.
Yes, I mentioned in my reply to Igor Carron that there even exists a
CL interface for BLAS/LAPACK called Matlisp.
> Which, I think, leads to a question about what it is you are trying to
> do. Rather than worrying about how you can most efficiently implement
> numeric kernels for your programs, wouldn't it make more sense to
> figure out how to best use the highly optimized numeric codes that are
> already out there? For just about anything you want to do, someone's
> written a killer fortran subroutine to do just that. Why try to
> reinvent it in lisp?
This is a model situation. My actual application is a pde solver on
unstructured grids where similar operations occur on small vectors.
To be reasonably fast I will probably have to inline such operations
into more complex functions. This needs more flexibility than a FFI
gets you. For handling larger linear algebra problems involving block
matrices and vectors I use Matlisp, of course.
Nicolas
Thanks Jochen,
The Lispworks personal edition now tells me :
Error: can't make array of 8388624 bytes.
So both commercial versions in their personal/trial versions can't go
through the benchmark.
Igor.
I agree; however, the Matlisp build doesn't seem to be available
for the Windows OS. It could actually be rebuilt, given a compiler,
using either Open Watcom or gcc/Cygwin, but that looks like a full-time
job :-)
> Even if I'm not using it, I guess that CLISP is a nice development
> environment. Furthermore, it is running on many platforms. And when
> you should need native code compilation you should be able to switch
> to another implementation quite easily. (i.e., with only small
> changes like using OF-TYPE in loop declarations :-)
Sure, but like you, I am really trying to figure out if Lisp can do
rapidly the things I already know how to do in Fortran or C. It looks
like it can on the right platform (CMUCL is not available on the Windows
OS). So unless I spend some time building Matlisp, the rapid
prototyping capability in Lisp will not be available to me... then
again, I could also set up a dual boot on my current laptop.
Thanks,
Igor.
Here is another data point for Corman Lisp:
win XP, celeron, 900 MHz
DDOT-long: 0.581552 MFLOPS
DDOT-short: 0.533124 MFLOPS
DAXPY-long: 0.200635 MFLOPS
DAXPY-short: 0.919238 MFLOPS
Igor.
> Thanks Jochen,
>
> The Lispworks personal edition now tells me :
> Error: can't make array of 8388624 bytes.
>
> So both commercial versions in their personal/trial versions can't go
> through the benchmark.
I fixed that too in the version posted elsewhere in this thread. The
longer array exceeds the maximum array size of LispWorks. If you make it
half as big, it fits within the array dimension limit.
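For reference, the change amounts to halving the long-vector size in
mflop.lisp, roughly like this (a sketch; the exact limit is given by the
implementation's ARRAY-TOTAL-SIZE-LIMIT and may differ between versions):

```lisp
;; Original: (defconstant +N-long+ #x100000)  ; 2^20 doubles = 8 MB, too big
;; Halved so the array fits under LispWorks' array size limit while
;; still being much larger than the secondary cache:
(defconstant +N-long+ #x80000)   ; 2^19 doubles = 4 MB
```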
Actually, I had hoped that Corman Lisp would be an alternative to
ACL/Lispworks on Windows:-(
Thanks, Nicolas.
P.S.: Somehow I had in mind that Corman Lisp would compile to native
code, but now I assume that it does not. Can someone say more?
Igor> I agree; however, the build for matlisp doesn't seem to be available
Igor> for Windows. It could probably be rebuilt given a compiler,
Igor> using either Open Watcom or gcc/cygwin, but this looks like a full-time
Igor> job :-)
I believe there is (was?) a version of matlisp using ACL on Windows.
I do not know if this still works.
>> Even if I'm not using it, I guess that CLISP is a nice development
>> environment. Furthermore, it is running on many platforms. And when
>> you should need native code compilation you should be able to switch
>> to another implementation quite easily. (i.e., with only small
>> changes like using OF-TYPE in loop declarations :-)
Igor> Sure, but like you, I am really trying to figure out if Lisp can do
Igor> rapidly the things I already know how to do in Fortran or C. It looks
Igor> like it can on the right platform (CMUCL is not available on
Igor> Windows). So unless I spend some time building matlisp, the rapid
Igor> prototyping capability in Lisp will not be available to me... then
Igor> again, I could also dual boot my current laptop.
You may want to try CLISP or GCL on Windows. While matlisp doesn't
support either at this time, it shouldn't be difficult to get it
working on either of them, if you know how their FFIs work.
The only real requirement is that you be able to access the actual
memory used by Lisp arrays. Even if you can't, you can always do a
copy-in/out, but that can cause severe degradation in performance,
obviously.
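A copy-in/out wrapper might look roughly like this (a sketch only; the
%ALLOC-DOUBLES, %SET-DOUBLE, %FOREIGN-DDOT, and %FREE names are
hypothetical stand-ins for whatever primitives the implementation's FFI
actually provides):

```lisp
;; Hedged sketch of the copy-in/out strategy: copy the Lisp data into a
;; foreign buffer, call the foreign routine, then release the buffer.
;; The copying adds O(n) work per call, which is why direct access to
;; the Lisp array's memory is preferable when the FFI allows it.
(defun ddot-via-copy (x y n)
  (let ((fx (%alloc-doubles n))          ; hypothetical foreign allocation
        (fy (%alloc-doubles n)))
    (unwind-protect
        (progn
          (dotimes (i n)                 ; copy-in
            (%set-double fx i (aref x i))
            (%set-double fy i (aref y i)))
          (%foreign-ddot n fx fy))       ; the actual foreign BLAS call
      (%free fx)                         ; always release foreign memory
      (%free fy))))
```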
Ray
It does compile to native code. I expect that there is something
wrong that is causing those awful numbers.
Can somebody else check these numbers?
Igor.
> >
> > P.S.: Somehow I had in mind that Corman Lisp would compile to native
> > code, but now I assume that it does not. Can someone say more?
>
> It does compile to native code. I expect that there is something
> wrong that is causing those awful numbers.
Another possibility would be that it does not have uniform
double-float arrays or that it does not eliminate the generic
arithmetic.
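That hypothesis can be checked with standard Common Lisp: if the
implementation specializes double-float arrays, the elements are stored
unboxed; if not, every AREF yields a boxed float and the arithmetic
stays generic.

```lisp
;; Returns DOUBLE-FLOAT if the implementation has specialized
;; double-float arrays; returns T if it falls back to general vectors
;; of boxed floats.
(upgraded-array-element-type 'double-float)

;; The same information for a concrete array:
(array-element-type
 (make-array 16 :element-type 'double-float :initial-element 0.0d0))
```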
To Igor:
1. Did you use the program from
http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop.lisp
unchanged?
2. Does it help to use this version (avoids loop):
http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop2.lisp
Nicolas.
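The loop-avoiding variant presumably rewrites the LOOP-based inner
product with a plain DO loop, along these lines (a sketch of what such
a loop-free DDOT might look like, not the actual text of mflop2.lisp):

```lisp
;; Same semantics as the LOOP version of DDOT from mflop.lisp, but
;; using DO and an explicit accumulator, which weaker LOOP
;; implementations sometimes compile better.
(defun ddot (x y n)
  (declare (type fixnum n)
           (type (simple-array double-float (*)) x y)
           (optimize (safety 0) (space 0) (debug 0) (speed 3)))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (do ((i 0 (1+ i)))
        ((>= i n) sum)
      (declare (type fixnum i))
      (incf sum (* (aref x i) (aref y i))))))
```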
Since Corman Lisp is fully compiled, all the time, there is a
lot it can do to easily get far better performance. Right now
all floating point operations are out-of-line function calls,
and all floating point values (at least IEEE single and double)
are boxed and heap-allocated. Much of the time in this
benchmark is measuring garbage collection times.
In the 2.0 release, simple arrays of unboxed single and double
floats have been added, and this will lead to support for
inline floating point operations. Presumably in the
next (post-2.0) release you will see significantly better numbers.
At the moment, using 2.0, I am getting these results on a
1.0-ghz Athlon:
DDOT-long: 1.258809 MFLOPS
DDOT-short: 1.295908 MFLOPS
DAXPY-long: 1.83115 MFLOPS
DAXPY-short: 1.887128 MFLOPS
On my system using Corman Lisp 1.5, I get this:
DDOT-long: 1.156547 MFLOPS
DDOT-short: 1.167793 MFLOPS
DAXPY-long: 0.336824 MFLOPS
DAXPY-short: 1.694285 MFLOPS
Probably there is some speedup in 2.0 due to the reduced
storage requirements by supporting arrays of unboxed floats.
We realize these are low relative to some other compilers.
We expect to be able to produce floating point performance
comparable to the fastest compilers in the near future.
Corman Lisp is still actively evolving, if slowly, because of
limited development resources. It is to a large degree
driven by market demand: the things that most people
are asking for are the things we prioritize.
Roger
-----------------------------------------
Nicolas Neuss <Nicola...@iwr.uni-heidelberg.de> wrote in message news:<87vg7mu...@ortler.iwr.uni-heidelberg.de>...
straight out of your examples:
Corman Lisp 1.5 Copyright © 2001 Roger Corman. All rights reserved.
+N-LONG+
+N-SHORT+
*MFLOP-DELTA*
MAKE-DOUBLE-FLOAT-ARRAY
;;; Autoloading C:\Program Files\Corman Tools\Corman Lisp
1.5\sys\loop.lisp ...
DDOT
DAXPY
TEST
MFLOP-TEST
DDOT-long: 0.638327 MFLOPS
DDOT-short: 0.634276 MFLOPS
DAXPY-long: 0.147311 MFLOPS
DAXPY-short: 0.573144 MFLOPS
NIL
;;;
;;;and then mflop2
;;;
Corman Lisp 1.5 Copyright © 2001 Roger Corman. All rights reserved.
+N-LONG+
+N-SHORT+
*MFLOP-DELTA*
MAKE-DOUBLE-FLOAT-ARRAY
DDOT
DAXPY
TEST
MFLOP-TEST
ddot-long: 0.770556 MFLOPS
ddot-short: 0.786894 MFLOPS
;;; An error occurred in function +:
;;; Error: Cannot call function '+' with these operands: #< UNKNOWN
OBJECT TYPE: #xBFFC15 > and 0.0d0
;;; Entering Corman Lisp debug loop.
;;; Use :C followed by an option to exit. Type :HELP for help.
;;; Restart options:
;;; 1 Abort to top level.
Win XP, Celeron 900 MHz. I tried that on other machines (Win 2000, NT)
and got similar results.
Igor.