Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss
Groups keyboard shortcuts have been updated
Dismiss
See shortcuts

Floating Point speed in Common Lisp

301 views
Skip to first unread message

Rainer Joswig

unread,
Mar 10, 1998, 3:00:00 AM3/10/98
to

What is the current speed of CL implementations
when it comes to FP code?


This is the raytracer code from the ANSI CL book by Paul Graham.
How about compiling it as it is and run it
with (time (ray-test 5)) ?


Somebody might also want to come up with a
version which is optimized for
the various CL implementations:
- optimizations settings changed
- type declarations added
- inlining
- reduced consing
- ???

;;; From Paul Graham's book ANSI Common Lisp, Example 9.8

(defun sq (x) (* x x))

(defun mag (x y z)
(sqrt (+ (sq x) (sq y) (sq z))))

(defun unit-vector (x y z)
(let ((d (mag x y z)))
(values (/ x d) (/ y d) (/ z d))))

(defstruct (point (:conc-name nil))
x y z)

(defun distance (p1 p2)
(mag (- (x p1) (x p2))
(- (y p1) (y p2))
(- (z p1) (z p2))))

(defun minroot (a b c)
(if (zerop a)
(/ (- c) b)
(let ((disc (- (sq b) (* 4 a c))))
(unless (minusp disc)
(let ((discrt (sqrt disc)))
(min (/ (+ (- b) discrt) (* 2 a))
(/ (- (- b) discrt) (* 2 a))))))))

(defstruct surface color)


(defparameter *world* nil)

(defconstant eye (make-point :x 0 :y 0 :z 200))

(defun tracer (pathname &optional (res 1))
(with-open-file (p pathname :direction :output)
(format p "P2 ~A ~A 255" (* res 100) (* res 100))
(let ((inc (/ res)))
(do ((y -50 (+ y inc)))
((< (- 50 y) inc))
(do ((x -50 (+ x inc)))
((< (- 50 x) inc))
(print (color-at x y) p))))))

(defun color-at (x y)
(multiple-value-bind (xr yr zr)
(unit-vector (- x (x eye))
(- y (y eye))
(- 0 (z eye)))
(round (* (sendray eye xr yr zr) 255))))

(defun sendray (pt xr yr zr)
(multiple-value-bind (s int) (first-hit pt xr yr zr)
(if s
(* (lambert s int xr yr zr)
(surface-color s))
0)))

(defun first-hit (pt xr yr zr)
(let (surface hit dist)
(dolist (s *world*)
(let ((h (intersect s pt xr yr zr)))
(when h
(let ((d (distance h pt)))
(when (or (null dist) (< d dist))
(setf surface s hit h dist d))))))
(values surface hit)))

(defun lambert (s int xr yr zr)
(multiple-value-bind (xn yn zn) (normal s int)
(max 0 (+ (* xr xn) (* yr yn) (* zr zn)))))

(defstruct (sphere (:include surface))
radius center)

(defun defsphere (x y z r c)
(let ((s (make-sphere
:radius r
:center (make-point :x x :y y :z z)
:color c)))
(push s *world*)
s))

(defun intersect (s pt xr yr zr)
(funcall (typecase s (sphere #'sphere-intersect))
s pt xr yr zr))

(defun sphere-intersect (s pt xr yr zr)
(let* ((c (sphere-center s))
(n (minroot (+ (sq xr) (sq yr) (sq zr))
(* 2 (+ (* (- (x pt) (x c)) xr)
(* (- (y pt) (y c)) yr)
(* (- (z pt) (z c)) zr)))
(+ (sq (- (x pt) (x c)))
(sq (- (y pt) (y c)))
(sq (- (z pt) (z c)))
(- (sq (sphere-radius s)))))))
(if n
(make-point :x (+ (x pt) (* n xr))
:y (+ (y pt) (* n yr))
:z (+ (z pt) (* n zr))))))

(defun normal (s pt)
(funcall (typecase s (sphere #'sphere-normal))
s pt))


(defun sphere-normal (s pt)
(let ((c (sphere-center s)))
(unit-vector (- (x c) (x pt))
(- (y c) (y pt))
(- (z c) (z pt)))))

(defun ray-test (&optional (res 1))
(setf *world* nil)
(defsphere 0 -300 -1200 200 .8)
(defsphere -80 -150 -1200 200 .7)
(defsphere 70 -100 -1200 200 .9)
(do ((x -2 (1+ x)))
((> x 2))
(do ((z 2 (1+ z)))
((> z 7))
(defsphere (* x 200) 300 (* z -400) 40 .75)))
(tracer (make-pathname :name "spheres.pgm") res))


Larry Hunter

unread,
Mar 10, 1998, 3:00:00 AM3/10/98
to

Rainer Joswig askes about the current speed of CL implementations on FP
code, proposing the Graham raytracer code as a benchmark.

Using ACL 4.3.1 (with current patch set) on an SGI Indigo^2 running Irix
6.2, with the code he supplied and the default optimizations I get:

USER(58): (time (ray-test 5))
; cpu time (non-gc) 195,970 msec (00:03:15.970) user, 1,280 msec system
; cpu time (gc) 64,700 msec (00:01:04.700) user, 370 msec system
; cpu time (total) 260,670 msec (00:04:20.670) user, 1,650 msec system
; real time 292,094 msec (00:04:52.094)
; space allocation:
; 515,900 cons cells, 0 symbols, 2,240,640,560 other bytes

Setting optimize (speed 3) (safety 0) (debug 0), I get about a 15% speedup:

USER(66): (time (ray-test 5))
; cpu time (non-gc) 170,460 msec (00:02:50.460) user, 1,200 msec system
; cpu time (gc) 65,000 msec (00:01:05) user, 360 msec system
; cpu time (total) 235,460 msec (00:03:55.460) user, 1,560 msec system
; real time 252,231 msec (00:04:12.231)
; space allocation:
; 515,803 cons cells, 0 symbols, -2,054,326,848 other bytes

Judging by a quick look at the code (not to mention the huge amount of
boxing indicated above), there is a lot of optimizing that could be done. I
don't have time to do it, but if someone else takes a crack, I'll test the
resulting code on the ACL/SGI implementation.

Larry


--
Lawrence Hunter, PhD.
National Library of Medicine phone: +1 (301) 496-9303
Bldg. 38A, 9th fl, MS-54 fax: +1 (301) 496-0673
Bethesda. MD 20894 USA email: hun...@nlm.nih.gov

Will Hartung

unread,
Mar 10, 1998, 3:00:00 AM3/10/98
to

"Rainer Joswig" <jos...@lavielle.com> writes:

>What is the current speed of CL implementations
>when it comes to FP code?


>This is the raytracer code from the ANSI CL book by Paul Graham.
>How about compiling it as it is and run it
>with (time (ray-test 5)) ?


>Somebody might also want to come up with a
>version which is optimized for
>the various CL implementations:
>- optimizations settings changed
>- type declarations added
>- inlining
>- reduced consing
>- ???

This is a pretty cool idea. It would be interesting to see "how fast"
it can go, and what are necessary for it to happen.

We had an interesting optimization thread a little while back that was
pretty interesting, but if this one takes off, I think it can be more
approachable.

Also, there was that fellow from Nichimen Graphics (my apologies for
forgeting his name and possibly not spelling his company name
properly) whose life pursuits seem to have been speeding up floating
point operations, specifically by, what, "unboxing" the data? So,
hopefully he'll have some input into this thread.

It would also be interesting to see what the fastest "portable" version
is compared to the fastest for each implementation.

--
Will Hartung - Rancho Santa Margarita. It's a dry heat. vfr...@netcom.com
1990 VFR750 - VFR=Very Red "Ho, HaHa, Dodge, Parry, Spin, HA! THRUST!"
1993 Explorer - Cage? Hell, it's a prison. -D. Duck

David J. Fiander

unread,
Mar 10, 1998, 3:00:00 AM3/10/98
to

vfr...@netcom.com (Will Hartung) writes:

> We had an interesting optimization thread a little while back that was
> pretty interesting, but if this one takes off, I think it can be more
> approachable.

This came up just as I started spending serious amounts of time
playing with Lisp. Graham talks about the fact that it's
possible to make lisp go fast, but it may require lots of
declarations that are optional. He also says that one wouldn't
actually stuff crufty little (declare)s and (the)s all over the
code, but would use macros. So, here are a couple that I came up
with along that vein. Do these look like things that might
appear in a mature (ie. well used and customized) development
environment?

;;;
;;; Declare a function to be inline, without too much fur
;;;
(defmacro definline (name args &body body)
`(progn
(declaim (inline ,name))
(defun ,name ,args ,@body)))

;;;
;;; Type a variable, to give a few extra hints to the compiler
;;;
(defmacro deftypedvar (type name &optional initval)
`(progn
(declaim (type ,type ,name))
(defvar ,name ,@(if initval `((the ,type ,initval))))))

Frank A. Adrian

unread,
Mar 11, 1998, 3:00:00 AM3/11/98
to

Larry Hunter wrote in message ...

>Judging by a quick look at the code (not to mention the huge amount of
>boxing indicated above), there is a lot of optimizing that could be done.
I
>don't have time to do it, but if someone else takes a crack, I'll test the
>resulting code on the ACL/SGI implementation.

Yup, boxing floats is the main thing that seems to slow down LISP FP
computation (and from looking at this code, I'd imagine that it would cons a
bit, being written for clarity rather than speed). I was wondering if
anyone had done an implementation that used tag bits in the lower end of the
word to differentiate between primitive types (e.g., smallInt / pointer /
singleFloat / constant) and just burned the last few bits of IEEE
compatibility. It seems that you could get single float performance to
within a small amount of "C"-speed with only a small penalty in accuracy. I
know that most compilers can do a good job of not boxing floats once they
get into the stack frame, but whenever they're returned or stored to memory,
most implementations seem to have to box them. So has anyone tried this
approach?
--
Frank A. Adrian
First DataBank
frank_...@firstdatabank.com (W)
fra...@europa.com (H)
This message does not necessarily reflect those of my employer,
its parent company, or any of the co-subsidiaries of the parent
company.


Bulent Murtezaoglu

unread,
Mar 11, 1998, 3:00:00 AM3/11/98
to

>>>>> "FAA" == Frank A Adrian <frank_...@firstdatabank.com> writes:

FAA> ... I know that
FAA> most compilers can do a good job of not boxing floats once
FAA> they get into the stack frame, but whenever they're returned
FAA> or stored to memory, most implementations seem to have to box
FAA> them. So has anyone tried this approach?

It appears that CMUCL block-compile extension should be able to do this
without undue decleration noise. It enables you to declare a bunch of
functions in " a block" as being only called from others in the same block
with the exception of ones you declare as entry points (ie called from
elsewhere). Excellent idea, as then just like as in inlining, all kinds
of optimizations become feasible without the programmer having to spell
everything out in declares. That should as far as I can see take care of
the above. Of course I was unable to test it since I couldn't yet coerce
the experimental version of CMUCL I have to do block compiles, but I think
the idea is an elegant and feasible one. The inconvenience of not being
able to redefine functions and such is a small trade off for finished code.

BM


Raymond Toy

unread,
Mar 11, 1998, 3:00:00 AM3/11/98
to

"Rainer Joswig" <jos...@lavielle.com> writes:

> What is the current speed of CL implementations
> when it comes to FP code?
>

The example isn't particulary good about *using* FP, but it could be
made so.

What kind of changes am I allowed to do?

I've made some mods to this code so that everything is a single-float,
as much as possible, and added some appropriate declarations for
CMUCL. (CMUCL derives the rest for itself.)

With these mods, on a 300-MHz Ultra, running CMUCL "18b", I get

(time (ray-test 5))

Evaluation took:
63.17 seconds of real time
44.21 seconds of user run time
11.37 seconds of system run time
[Run times include 13.84 seconds GC run time]
0 page faults and
1263144016 bytes consed.

Pretty bad compared to what someone else posted for SGI ACL.

The GC for CMUCL is not great, so it seems to spend lots of time doing
GC.

Someone more knowledgeable about CMUCL could probably speed this up
some more.

Ray

Modified code:

;;; From Paul Graham's book ANSI Common Lisp, Example 9.8

(declaim (optimize (speed 3)))

(eval-when (compile load eval)


(defstruct (point (:conc-name nil))

(x 0.0 :type single-float)
(y 0.0 :type single-float)
(z 0.0 :type single-float))
(defstruct surface
(color 0.0 :type single-float))


(defstruct (sphere (:include surface))

(radius 0.0 :type single-float)
(center (required-argument) :type point))
)

(defun sq (x)
(declare (type single-float x))
(* x x))

(defun mag (x y z)

(declare (type single-float x y z))


(sqrt (+ (sq x) (sq y) (sq z))))

(defun unit-vector (x y z)

(declare (type single-float x y z)
(values (single-float 0.0 1.0)
(single-float 0.0 1.0)
(single-float 0.0 1.0)))


(let ((d (mag x y z)))
(values (/ x d) (/ y d) (/ z d))))


(defun distance (p1 p2)
(declare (type point p1 p2))


(mag (- (x p1) (x p2))
(- (y p1) (y p2))
(- (z p1) (z p2))))

(declaim (inline minroot))


(defun minroot (a b c)

(declare (type single-float a b c))


(if (zerop a)
(/ (- c) b)
(let ((disc (- (sq b) (* 4 a c))))
(unless (minusp disc)
(let ((discrt (sqrt disc)))
(min (/ (+ (- b) discrt) (* 2 a))
(/ (- (- b) discrt) (* 2 a))))))))

(defparameter *world* nil)

(defconstant eye (make-point :x 0.0 :y 0.0 :z 200.0))

(defun tracer (pathname &optional (res 1))

(declare (type (integer 1 100) res))


(with-open-file (p pathname :direction :output)
(format p "P2 ~A ~A 255" (* res 100) (* res 100))

(let ((inc (/ (float res 1.0))))
(do ((y -50.0 (+ y inc)))
((< (- 50.0 y) inc))
(declare (single-float y))
(do ((x -50.0 (+ x inc)))
((< (- 50.0 x) inc))
(declare (single-float x))


(print (color-at x y) p))))))

(defun defsphere (x y z r c)
(declare (type single-float x y z r))


(let ((s (make-sphere
:radius r
:center (make-point :x x :y y :z z)
:color c)))
(push s *world*)
s))

#+cmu
(declaim (ext:start-block color-at sendray first-hit lambert intersect))

#+nil
(defun color-at (x y)
(declare (type single-float x y))


(multiple-value-bind (xr yr zr)
(unit-vector (- x (x eye))
(- y (y eye))
(- 0 (z eye)))

(round (* (sendray eye xr yr zr) 255.0))))

(defun sendray (pt xr yr zr)

(declare (type point pt)
(type single-float xr yr zr))


(multiple-value-bind (s int)
(first-hit pt xr yr zr)
(if s
(* (lambert s int xr yr zr)
(surface-color s))

0.0)))

(defun first-hit (pt xr yr zr)

(declare (type point pt)
(type single-float xr yr zr))


(let (surface hit dist)
(dolist (s *world*)

(declare (type sphere s))


(let ((h (intersect s pt xr yr zr)))
(when h
(let ((d (distance h pt)))
(when (or (null dist) (< d dist))
(setf surface s hit h dist d))))))
(values surface hit)))

(defun lambert (s int xr yr zr)

(declare (type single-float xr yr zr)
(type point int))


(multiple-value-bind (xn yn zn)
(normal s int)

(max 0.0 (+ (* xr xn) (* yr yn) (* zr zn)))))

#+nil


(defun intersect (s pt xr yr zr)

(declare (type point pt)
(single-float xr yr zr))


(funcall (typecase s (sphere #'sphere-intersect))
s pt xr yr zr))
(defun intersect (s pt xr yr zr)

(declare (type sphere s)
(type point pt)
(single-float xr yr zr))
(sphere-intersect s pt xr yr zr))

(defun sphere-intersect (s pt xr yr zr)

(declare (type sphere s)
(type point pt)
(type single-float xr yr zr))


(let* ((c (sphere-center s))
(n (minroot (+ (sq xr) (sq yr) (sq zr))
(* 2 (+ (* (- (x pt) (x c)) xr)
(* (- (y pt) (y c)) yr)
(* (- (z pt) (z c)) zr)))
(+ (sq (- (x pt) (x c)))
(sq (- (y pt) (y c)))
(sq (- (z pt) (z c)))
(- (sq (sphere-radius s)))))))
(if n
(make-point :x (+ (x pt) (* n xr))
:y (+ (y pt) (* n yr))
:z (+ (z pt) (* n zr))))))

(defun color-at (x y)
(declare (type single-float x y))


(multiple-value-bind (xr yr zr)
(unit-vector (- x (x eye))
(- y (y eye))
(- 0 (z eye)))

(let ((color (* (sendray eye xr yr zr) 255.0)))
(declare (type (single-float 0.0 (255.0)) color))
(floor (+ color 0.5)))))

(declaim (ext:end-block))

#+nil
(defun normal (s pt)
(declare (type point pt)
(values single-float single-float single-float))


(funcall (typecase s (sphere #'sphere-normal))
s pt))

(defun normal (s pt)
(declare (type sphere s)
(type point pt)
(values single-float single-float single-float))
(sphere-normal s pt))


(defun sphere-normal (s pt)
(let ((c (sphere-center s)))
(unit-vector (- (x c) (x pt))
(- (y c) (y pt))
(- (z c) (z pt)))))

(defun ray-test (&optional (res 1))
(setf *world* nil)

(defsphere 0.0 -300.0 -1200.0 200.0 .8)
(defsphere -80.0 -150.0 -1200.0 200.0 .7)
(defsphere 70.0 -100.0 -1200.0 200.0 .9)
(do ((x -2.0 (1+ x)))
((> x 2.0))
(declare (type single-float x))
(do ((z 2.0 (1+ z)))
((> z 7.0))
(declare (type single-float z))
(defsphere (* x 200) 300.0 (* z -400) 40.0 .75)))

Bulent Murtezaoglu

unread,
Mar 12, 1998, 3:00:00 AM3/12/98
to

Ok here are my results using both the Graham's original, Raymond Toy's
code and my modifications to RT's code for both ACL and CMUCL.

This took a lot longer than I expected (I figure 4 hours at least,
semi-distracted). Even barely usable benchmarking takes a lot of effort.

Short summary the fastest version runs under 10s to render the image
(no file access). Tables and code are at the end.

Notes:

RT's code did not produce 500x500 pictures as advertised in the header
(you guys have been looking at the output right?). My guess is
rational arithmetic was getting used originally so there were no
errors. In the short-float version, accumulated errors were causing
the do loops to terminate one iteration sooner. Maybe the point could
be made that optimizing with declares is not as straightforward as
advertised. It took me more than 15 minutes of scratching my head to
understand and fix the problem.

I also spend about an hour and half on top of what RT spent optimizing
the code using CMUCL. Usually silencing CMUCL's efficiency notes gets
the code in a reasonably good shape for other Lisps also. Exceptions in
this case to the last sentence are

(1) ACL's inlining policy. Franz's compiler (at least as far as Linux
and v. 4.3 for SunOS goes) does not ever inline user functions. So I
turned the square function into a macro for ACL and timed it again.

(2) CMUCL block compile feature (non-ANSI)

(3) CMUCL values declaration (non-ANSI, but there's an ANSI equivalent, I
didn't get to converting the code so the results are BIASED in favor
of CMUCL).

I think it is a good idea to split timings for the generation of the image
and the dumping of the text file. Output tends to be slow and sometimes
conses with print, it is also dependent on the filesystem/NFS speed. Since
we are after raw number crunching performance here, I modified the code
to stuff an array with the pixel values.


Hardware: Homegrown system, PentiumPro 180 overclocked to 231Mhz with
256K L1 cache and 128M EDO ram sitting on a 77MHZ bus.

OS: Linux 2.0.32. Fileserver on NSF over 10-base-2 (slow server, lightly
loaded network).

Lisp:
ACL is Allegro Common Lisp Linux version fully patched up to November 97
CMUCL is the version dated 30/01/98 off ftp.cons.org (Peter VanEynde's
binary kit)

Timing: fresh lisp on an very lightly loaded system from the command
line. I compiled everything once at the start and ran them three times
with full gc in between #+acl (gc t) #+cmu (ext:gc) and I'm reporting the
third time (too lazy to average and it doesn't matter given the large
we're looking for). There was negligible variation between the runs (2% max).
I don't think there were any page faults during the timed runs
(ACL doesn't report it [by default?], and CMUCL didn't report page
faults after the first run.)

All times in ms. There are differences in what time reports between the
systems. I took a reasonable inetersection. One notable difference is
ACL reports both cons cells counts and "other bytes" while CMUCL reports
everything as cons cells in bytes. I report CMU's byte count under ACL's
"other" column.

Original code as posted:

Real Time User System GC time cons other
ACL 188,472 185,360 500 2,190 507,541 2,238,426,232
CMUCL 263,970 244.650 15,870 50,380 1,422,303,632

Original code + global (declaim (optimize (speed 3) (safety 0) (debug 0)))

ACL 185,948 182,750 600 2,370 507,649 2,238,455,232
CMUCL 285,890 266.300 16.050 50,430 1,422,346,560

Please note the following are somehwat unfair to ACL in varying degrees


RT's code as posted (w/ the bug but I did add one #+ to get allegro to load it)

Real Time User System GC time cons other
ACL 158,726 154,930 650 4,550 512,941 -19,566,024 (!)
CMUCL 117,530 99,230 15,340 45,910 1,288,595,072

RT's code + my additions and fix (generate and write to file as usual)
Real Time User System GC time cons other
CMUCL 25,280 21.290 2.000 5,690 161,279,400
ACL 156,760 153,080 740 4,460 514,974 4,259,431,824

when (setf *bytes-consed-between-gcs* 8000000) for CMUCL

CMUCL 22,750 18,000 1.770 2.200 161,590,736


After turning sq into a macro for ACL's benefit:
Real Time User System GC time cons other
ACL 59,061 56,450 370 850 503,473 1,321,391,360

(here I should turn CMUCL values extension into "the" declations, but it's
late already...)

Timing for no output file (time (ray-test-no 5)) see attached file
Real Time User System GC time cons other
CMUCL 9,650 9,060 240 910 19,902,152
ACL 40,334 40,310 20 920 39 1,320,743,280


----Code:

;; from Graham's ANSi CL book, modified by Raymond Toy and Bulent
;; Murtezaoglu.
;; More speed can be gained by (1) fixing cmucl so
;; it works better, (2) going overboard with inline and such.
;; As it is the code generates about 15 efficiency notes some of which
;; I (BM) couldn't fix, (the others I didn't bother with).

(declaim (optimize (speed 3) (safety 0) (debug 0))) ;; does compilation speed do anything?

#+cmu (declaim (start-block ray-test tracer tracer-no ray-test-no))
;;would crash with just ray-test

(eval-when (compile load eval)
(defstruct (point (:conc-name nil))
(x 0.0 :type single-float)
(y 0.0 :type single-float)
(z 0.0 :type single-float))
(defstruct surface
(color 0.0 :type single-float))
(defstruct (sphere (:include surface))
(radius 0.0 :type single-float)
(center (required-argument) :type point))
)

(declaim (inline sq))
(declaim (inline minroot))
(declaim (inline sqrt))
(declaim (inline mag))

#-allegro


(defun sq (x)
(declare (type single-float x))
(* x x))

#+allegro
(defmacro sq (x)
(let ((foo (gensym)))
`(let ((,foo ,x)) (declare (type single-float ,foo)) (* ,foo ,foo))))

(defun mag (x y z)
(declare (type single-float x y z))
(sqrt (+ (sq x) (sq y) (sq z))))

(defun unit-vector (x y z)
(declare (type single-float x y z)
(values (single-float 0.0 1.0)
(single-float 0.0 1.0)
(single-float 0.0 1.0)))
(let ((d (mag x y z)))
(values (/ x d) (/ y d) (/ z d))))


(defun distance (p1 p2)
(declare (type point p1 p2)

(values single-float))


(mag (- (x p1) (x p2))
(- (y p1) (y p2))
(- (z p1) (z p2))))

(defun minroot (a b c)


(declare (type single-float a b c))

(if (zerop a) ;;zerop doesn't get optimized for some reason, neither does =


(/ (- c) b)
(let ((disc (- (sq b) (* 4 a c))))

(unless (minusp disc) ; cmucl should not need the cruft below but it does
(let ((discrt (the (single-float 0.0 *) (sqrt (the (single-float 0.0 *) disc)))))


(min (/ (+ (- b) discrt) (* 2 a))
(/ (- (- b) discrt) (* 2 a))))))))

(defparameter *world* nil)

(defvar *image* nil)

(defconstant eye (make-point :x 0.0 :y 0.0 :z 200.0))

(defun tracer (pathname &optional (res 1))
(declare (type (integer 1 100) res))
(with-open-file (p pathname :direction :output)
(format p "P2 ~A ~A 255" (* res 100) (* res 100))

(let* ((inc (/ (float res)))
(xymax (- (* res 100 inc) 50)))
(declare (single-float xymax))


(do ((y -50.0 (+ y inc)))

((< xymax y))


(declare (single-float y))
(do ((x -50.0 (+ x inc)))

((< xymax x))
(declare (single-float x))
(print (color-at x y) p)))))))

(let (surface hit (dist most-positive-single-float))
(declare (type single-float dist))


(dolist (s *world*)
(declare (type sphere s))
(let ((h (intersect s pt xr yr zr)))
(when h
(let ((d (distance h pt)))

(when (< d dist) ;; was (or (null dist) (< d dist))


(setf surface s hit h dist d))))))
(values surface hit)))

(defun lambert (s int xr yr zr)
(declare (type single-float xr yr zr)
(type point int))
(multiple-value-bind (xn yn zn)
(normal s int)
(max 0.0 (+ (* xr xn) (* yr yn) (* zr zn)))))

#+nil
(defun intersect (s pt xr yr zr) ;; note the dispatch on type (slow)

#+nil
(defun normal (s pt);; note the dispatch on type (slow)

;;; these are the no-output versions

(defvar *imgarray*)

(defun tracer-no (&optional (res 1))


(declare (type (integer 1 100) res))

(setf *imgarray*
(make-array (* 100 100 res res) :element-type '(unsigned-byte 8)
:fill-pointer 0))
(let* ((inc (/ (float res)))
(xymax (- (* res 100 inc) 50)))
(declare (single-float xymax))


(do ((y -50.0 (+ y inc)))

((< xymax y))


(declare (single-float y))
(do ((x -50.0 (+ x inc)))

((< xymax x))
(declare (single-float x))
(vector-push (color-at x y) *imgarray*)))))))


(defun ray-test-no (&optional (res 1))


(setf *world* nil)
(defsphere 0.0 -300.0 -1200.0 200.0 .8)
(defsphere -80.0 -150.0 -1200.0 200.0 .7)
(defsphere 70.0 -100.0 -1200.0 200.0 .9)
(do ((x -2.0 (1+ x)))
((> x 2.0))
(declare (type single-float x))
(do ((z 2.0 (1+ z)))
((> z 7.0))
(declare (type single-float z))
(defsphere (* x 200) 300.0 (* z -400) 40.0 .75)))

(tracer-no res))


David J. Fiander

unread,
Mar 12, 1998, 3:00:00 AM3/12/98
to

Bulent Murtezaoglu <mu...@servtech.com> writes:
> (declaim (inline sq))
> (declaim (inline minroot))
> (declaim (inline sqrt))
> (declaim (inline mag))
>
> #-allegro
> (defun sq (x)
> (declare (type single-float x))
> (* x x))

I found, for CMUCL at least, that once you've declaimed the
function inline, there's no need to declare the types in the
function. The compiler will produce efficiency notes about
compiling sq, minroot, mag, but when you actually _call_ minroot
elsewhere with the types of the arguments declared properly, it
will very happily inline just the correct version of the code and
perform an inline single-float multiplication.

- David

Raymond Toy

unread,
Mar 12, 1998, 3:00:00 AM3/12/98
to

Bulent Murtezaoglu <mu...@servtech.com> writes:

>
> I also spend about an hour and half on top of what RT spent optimizing
> the code using CMUCL. Usually silencing CMUCL's efficiency notes gets

About an hour or two, part time.


>
> (defun minroot (a b c)
> (declare (type single-float a b c))
> (if (zerop a) ;;zerop doesn't get optimized for some reason, neither does =

This seems to be optimized for me.

> (/ (- c) b)
> (let ((disc (- (sq b) (* 4 a c))))
> (unless (minusp disc) ; cmucl should not need the cruft below but it does
> (let ((discrt (the (single-float 0.0 *) (sqrt (the (single-float 0.0 *) disc)))))

This might be because of the fact that for CMUCL (sqrt -0.0) is #c(0.0
0.0) and -0.0 is an element of (single-float 0.0 *). You may want to
say (the (or (eql 0.0) (single-float (0.0))) disc) to get better
code. Alternatively, if you have the :negative-zero-is-not-zero
feature, then you probably don't need any extra declarations. I
didn't need them.

> (defun color-at (x y)
> (declare (type single-float x y))
> (multiple-value-bind (xr yr zr)
> (unit-vector (- x (x eye))
> (- y (y eye))
> (- 0 (z eye)))
> (let ((color (* (sendray eye xr yr zr) 255.0)))
> (declare (type (single-float 0.0 (255.0)) color))
> (floor (+ color 0.5)))))

Since color > 0, it helps to replace floor with truncate in this case
because truncate can use the FPU directly. (Hmm, need to add this
optimization to CMUCL.)

With these changes, I now get only 3 compiler efficiency notes from
minroot. The run time is now

Evaluation took:
18.07 seconds of real time
15.23 seconds of user run time
1.41 seconds of system run time
[Run times include 1.96 seconds GC run time]
0 page faults and
151782136 bytes consed.

About a 3-4 times faster than my previous post.

Ray

Bulent Murtezaoglu

unread,
Mar 12, 1998, 3:00:00 AM3/12/98
to

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
RT> About an hour or two, part time.

So, it's probably safe to say that it took about 2 solid hours total
to get to fast code (adjusting for my inexperience with cmucl).
I say this is a bit too long even when we consider we weren't
the original authors. It would have been much easier of course if
we [I?] used tools like graphical profilers etc. On the plus side,
the resulting code is not too cluttered IMHO (and some of the clutter
is probably unnecessary). I'd be very interested to see what kind of
performance can be gotten from LW and Lucid/Liquid at what kind of price.

It would also be nice to get some figures from C code. And maybe compare
.s files with the output from disassemble.

>> (defun minroot (a b c) (declare (type single-float a b c)) (if
>> (zerop a) ;;zerop doesn't get optimized for some reason,
>> neither does =

RT> This seems to be optimized for me.

On mine I get an efficiency note warning me about some compiler-generated
temporaries not being guaranteed to be the same type so zerop doesn't get
open coded. I don't remember the exact note. Maybe you are running a
different version?

>> (/ (- c) b) (let ((disc (- (sq b) (* 4 a c)))) (unless (minusp
>> disc) ; cmucl should not need the cruft below but it does (let
>> ((discrt (the (single-float 0.0 *) (sqrt (the (single-float 0.0
>> *) disc)))))

RT> This might be because of the fact that for CMUCL (sqrt -0.0)
RT> is #c(0.0 0.0) and -0.0 is an element of (single-float 0.0 *).
RT> You may want to say (the (or (eql 0.0) (single-float (0.0)))
RT> disc) to get better code. Alternatively, if you have the
RT> :negative-zero-is-not-zero feature, then you probably don't
RT> need any extra declarations. I didn't need them.

Hmmm. My comment concerns:

(unless (minusp disc) ;up till here we know disc is a single-float
(let ((disc (sqrt disc))) ; here we know disc is non-negative
...

I was hoping the compiler would notice that additional constraint on
disc and not need any declarations to open code sqrt. Does your
explanation above apply to minusp? What does [should?] minusp do to -0.0?

[...]
RT> Evaluation took: 18.07 seconds of real time 15.23 seconds of
RT> user run time 1.41 seconds of system run time [Run times
RT> include 1.96 seconds GC run time] 0 page faults and 151782136
RT> bytes consed.

Did you also try the no-output version? I'll try to run your new code
tonight and see what happens.

BM

Raymond Toy

unread,
Mar 12, 1998, 3:00:00 AM3/12/98
to

Bulent Murtezaoglu <mu...@servtech.com> writes:

> >>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
> RT> About an hour or two, part time.
>
> So, it's probably safe to say that it took about 2 solid hours total
> to get to fast code (adjusting for my inexperience with cmucl).

Perhaps. I think if I had concentrated on it, the conversion should
have only taken 30 minutes or so. The compiler notes were very good.

> I say this is a bit too long even when we consider we weren't
> the original authors. It would have been much easier of course if
> we [I?] used tools like graphical profilers etc. On the plus side,

Actually, I did use CMUCL's profile to figure out some things. I
think sphere-intersect takes up most of the time and generates by far
the most garbage.

> open coded. I don't remember the exact note. Maybe you are running a
> different version?

I was running the latest development version.

>
> (unless (minusp disc) ;up till here we know disc is a single-float
> (let ((disc (sqrt disc))) ; here we know disc is non-negative
> ...
>
> I was hoping the compiler would notice that additional constraint on
> disc and not need any declarations to open code sqrt. Does your
> explanation above apply to minusp? What does [should?] minusp do to -0.0?

It does notice the constraint (well, it's supposed to). However, the
problem is that CMUCL currently has (sqrt -0.0) returning #c(0.0 0.0),
so the generic sqrt needs to be called. Also (minusp -0.0) = (minusp
0.0) = NIL, as required by IEEE arithmetic. To get CMUCL to produce
the desired code, disc has be be declared as (or (eql 0.0)
(single-float (0.0) *)) so that the compiler knows -0.0 isn't a valid
value. My version has :negative-zero-is-not-zero feature, so I can
say (single-float 0.0 *) to get the desired result. (This feature
seems to contradict ANSI CL, but ANSI CL doesn't really quite cover
this issue either.)


> Did you also try the no-output version? I'll try to run your new code
> tonight and see what happens.
>

Here is the no-output result:

Evaluation took:
6.38 seconds of real time
6.1 seconds of user run time
0.1 seconds of system run time
[Run times include 0.11 seconds GC run time]
0 page faults and
9944752 bytes consed.

3 times better than before. The I/O generates a lot of garbage and is
pretty slow!

Ray

Marc Battyani

unread,
Mar 12, 1998, 3:00:00 AM3/12/98
to

Bulent Murtezaoglu wrote in message <87btvc8...@isttest.bogus>...


>
>Ok here are my results using both the Graham's original, Raymond Toy's
>code and my modifications to RT's code for both ACL and CMUCL.

>Hardware: Homegrown system, PentiumPro 180 overclocked to 231Mhz with


>256K L1 cache and 128M EDO ram sitting on a 77MHZ bus.
>
>OS: Linux 2.0.32. Fileserver on NSF over 10-base-2 (slow server, lightly
>loaded network).
>
>Lisp:
>ACL is Allegro Common Lisp Linux version fully patched up to November 97
>CMUCL is the version dated 30/01/98 off ftp.cons.org (Peter VanEynde's
>binary kit)

>Original code + global (declaim (optimize (speed 3) (safety 0) (debug 0)))
>
>ACL 185,948 182,750 600 2,370 507,649 2,238,455,232
>CMUCL 285,890 266.300 16.050 50,430 1,422,346,560

>RT's code as posted (w/ the bug but I did add one #+ to get allegro to load
it)
> Real Time User System GC time cons other
>ACL 158,726 154,930 650 4,550 512,941 -19,566,024 (!)
>CMUCL 117,530 99,230 15,340 45,910 1,288,595,072

Here are the results for Harlequin LispWorks 4.0.1 with latest patches on a
Pentium Pro
200 MHz with 80 Mo RAM
Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float 3)))

CL-USER 15 > (extended-time (ray-test 5))
757.5 seconds used.
Standard Allocation 3397066608 bytes.
Fixlen Allocation 9912 bytes.
total gc activity = 172.106
main promote ( 4 calls) = 0.110
mark and sweep (13235 calls) = 171.996
internal promote ( 4 calls) = 0.000
promote ( 0 calls) = 0.000
fixup ( 4 calls) = 0.16
compact ( 0 calls) = 0.000
NIL

>RT's code as posted (w/ the bug but I did add one #+ to get allegro to load
it)

CL-USER 16 > (extended-time (ray-test-no 5))
613.9 seconds used.
Standard Allocation 4535355248 bytes.
Fixlen Allocation 1056 bytes.
total gc activity = 187.337
main promote ( 4 calls) = 1.346
mark and sweep (17683 calls) = 185.991
internal promote ( 5 calls) = 0.94
promote ( 0 calls) = 0.000
fixup ( 6 calls) = 0.438
compact ( 0 calls) = 0.000
NIL

Marc Battyani


Bulent Murtezaoglu

unread,
Mar 12, 1998, 3:00:00 AM3/12/98
to

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:

RT> Perhaps. I think if I had concentrated on it, the conversion
RT> should have only taken 30 minutes or so. The compiler notes
RT> were very good.

I'll haggle with you on that! _I_ couldn't have gotten as far as I got
even starting from your already-optimized code in 30 min. I had to
fire up allegro, get it to compile the file and run it to convince
myself that the bug I was seeing wasn't a cmucl artifact. For the slower
versions of the code you have to wait an appreciable amount to test after
a few steps. Given what we are trying to get a sense of, I think it is not
unreasonable to assume that programmers will introduce bugs during
optimization, and incremental inprovement/testing will take place (out of
habit if nothing else). So I'd say pretty close to an hour for me if I
had everything (manuals and such) handy. What do other people think?

RT> I was running the latest development version.

That probably explains a number of things. I haven't figured out how to
build cmucl from sources yet...

[...]
RT> Here is the no-output result:

RT> Evaluation took: 6.38 seconds of real time 6.1 seconds of user
RT> run time 0.1 seconds of system run time [Run times include
RT> 0.11 seconds GC run time] 0 page faults and 9944752 bytes
RT> consed.

RT> 3 times better than before. The I/O generates a lot of
RT> garbage and is pretty slow!

Indeed. Maybe not too surprising given that we're using print. I
wonder what's still producing the 9Megs of garbage? Unless I screwed
up its decleration the image array should take about 250k, say another
few 10k's for misc. unavoidable consing we still have a huge amount.
I'm not familiar with cmu's profiler, maybe you could post sth. about
what's generating all this garbage?

BM

Raymond Toy

unread,
Mar 12, 1998, 3:00:00 AM3/12/98
to

Bulent Murtezaoglu <mu...@servtech.com> writes:

> >>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
>
> RT> Perhaps. I think if I had concentrated on it, the conversion
> RT> should have only taken 30 minutes or so. The compiler notes
> RT> were very good.
>
> I'll haggle with you on that! _I_ couldn't have gotten as far as I got
> even starting from your already-optimized code in 30 min. I had to
> fire up allegro, get it to compile the file and run it to convince
> myself that the bug I was seeing wasn't a cmucl artifact. For the slower

Perhaps I'm being optimistic? :-) I don't remember. But I didn't try
running anything else except CMUCL, so that saves some time.


>
> RT> I was running the latest development version.
>
> That probably explains a number of things. I haven't figured out how to
> build cmucl from sources yet...

It's not easy to do, unless you're just compiling the current version
with the current version, but you would need to do that anyway. :-)

> I'm not familiar with cmu's profiler, maybe you could post sth. about
> what's generating all this garbage?
>

First do (profile:profile-all) to turn on profiling everywhere. Run
the code, then (profile:report-time) to see the results, and
(profile:reset-time) to reset the time. Here are some results that I
got from your version. I had to run off the block compile because
then the functions wouldn't exist.

* (time (ray-test 2))
[snip]
Evaluation took:
27.96 seconds of real time
20.89 seconds of user run time
5.48 seconds of system run time
[Run times include 0.48 seconds GC run time]
0 page faults and
30963304 bytes consed.
NIL
* (profile:report-time)
Seconds | Consed | Calls | Sec/Call | Name:
------------------------------------------------------
0.145 | 1,470,600 | 61,275 | 0.00000 | UNIT-VECTOR
0.043 | 500,976 | 20,874 | 0.00000 | SPHERE-NORMAL
0.010 | 264 | 33 | 0.00029 | DEFSPHERE
0.001 | 258,800 | 32,350 | 0.00000 | DISTANCE
0.000 | 1,295,320 | 32,383 | 0.00000 | MAKE-POINT
0.000 | 333,984 | 40,401 | 0.00000 | SENDRAY
0.000 | 0 | 1,333,233 | 0.00000 | INTERSECT
0.000 | 0 | 40,401 | 0.00000 | FIRST-HIT
0.000 | 1,035,200 | 1,333,233 | 0.00000 | SPHERE-INTERSECT
0.000 | 1,056 | 33 | 0.00000 | MAKE-SPHERE
0.000 | 23,615,544 | 1 | 0.00000 | TRACER
0.000 | 0 | 20,874 | 0.00000 | NORMAL
0.000 | 834,960 | 20,874 | 0.00000 | LAMBERT
0.000 | 1,616,040 | 40,401 | 0.00000 | COLOR-AT
0.000 | 512 | 1 | 0.00000 | RAY-TEST
------------------------------------------------------
0.199 | 30,963,256 | 2,976,367 | | Total

Estimated total profiling overhead: 23.81 seconds

So, actually unit-vector takes the most time now and tracer generates
the most garbage.

Ray

Bulent Murtezaoglu

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:

RT> Here is the no-output result:

RT> Evaluation took: 6.38 seconds of real time 6.1 seconds of user

RT> run time 0.1 seconds of system run time [Run times include
RT> 0.11 seconds GC run time] 0 page faults and 9944752 bytes
RT> consed.

After incorporating your suggestions and a few other minor things
the best I can do over here is

Evaluation took:
5.41 seconds of real time
5.14 seconds of user run time
0.14 seconds of system run time
[Run times include 0.36 seconds GC run time]
0 page faults and
8300816 bytes consed.

(array stuffing version with CMUCL)

Here's what I couldn't get to work (ie gave up after a cursory look):

inlining unit-vector seems to break type inference somehow (either that or
there's some subtlety I don't understand right now).

I tried explicitly defining the structs as vectors hoping to save some
more space but apparently unnamed vector structs cannot be used as types
with the name given to defstruct in declarations. It seemed too ugly
to declare points and spheres explicitly as vectors, so I didn't go for
that workaround.

BM

Bulent Murtezaoglu

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
[...]
RT> Perhaps I'm being optimistic? :-) I don't remember. But I
RT> didn't try running anything else except CMUCL, so that saves
RT> some time.

Maybe you are, but more importantly one of the points of this exercise
is to get some sense of whether authors like Norvig and Graham are being
too optimistic about how much time it would take to go from elegant working
code to something within 2-3x of C.

I think for simple things that could also be easily implemented in C right
from the start, it isn't obvious that Common Lisp is a good environment.
Incremental development etc. is all very nice, but when the smallest problem
you can try takes as long to run as several edit-make-crash cycles in C,
the benefits of Lisp are less clear. The raytracing example does not expose
this problem adequately because you can go to tiny pictures and do your
exploratory programming there if you need to. Nonetheless I think the point
is valid, as there are problems where Lisp could really shine but doesn't
because of inadequate performance. (machine learning comes to my mind where
for some algorithms there's a knee in the problem size graph you have to hit
to observe interesting stuff. If the smallest such interesting problem
takes hours with unoptimized code, some of the benefits of Lisp are lost.
This isn't horrible, and one can't have everything, but maybe we ought
to recognize that costs for all this dynamic luxury might be prohibitive
more often than one would first think.)

We should also note that we got the relatively decent speeds using
non-ANSI extensions on an experimental implementation. Lisp vendors
probably don't get many requests for good float performance.

>> That probably explains a number of things. I haven't figured
>> out how to build cmucl from sources yet...

RT> It's not easy to do, unless you're just compiling the current
RT> version with the current version, but you would need to do
RT> that anyway. :-)

Yeah, I found out the other day that it isn't easy. It appears that having
functions/variables/macros available by name at run time in the working
Lisp complicates things when it comes to compiling the compiler.
Am I understanding this correctly? I naively thought that some clever
package manipulation and uninterning would fix things in an elegant manner,
but didn't get very far even in my head. Kept thinking about two levels of
packages with relative paths, but there's no such thing in Common Lisp.
Anyhow, how do you guys do it? Is this documented anywhere? What am I
missing?


BM

Erik Naggum

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

* Bulent Murtezaoglu

| I think for simple things that could also be easily implemented in C
| right from the start, it isn't obvious that Common Lisp is a good
| environment. Incremental development etc. is all very nice, but when the
| smallest problem you can try takes as long to run as several
| edit-make-crash cycles in C, the benefits of Lisp are less clear.

one of the curious properties of programming in Common Lisp is that you
adopt different styles according as the needs you foresee for your code
differ. usually, you know beforehand that some code is speed-critical
and you care about consing, declarations, and such early, and take care
not to violate the principles that you know will result in fast code.

adopting a number of different styles and becoming willing to write code
that is not "beautiful Lisp" but only _exports_ a "beautiful Lisp"
interface takes a lot of experience. I remember I spent a lot of time
getting to grips with the need to write really _ugly_ code, but then it
dawned on me that while I liked C because I could bury ugliness in code
and I like Common Lisp because it is elegant and beautiful, these are
_not_ mutually exclusive qualities. I can bury ugliness in Common Lisp,
too, but if I take care to write an elegant interface to it, it's gone
forever, quite unlike C, where the interface still sucks.

there's one thing I miss, though, but I'll make that a separate thread.

#:Erik
--
God grant me serenity to accept the code I cannot change,
courage to change the code I can, and wisdom to know the difference.

Jason Trenouth

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

On Thu, 12 Mar 1998 20:29:14 +0100, "Marc Battyani"
<marc_b...@email.msn.com> wrote:

> Here are the results for Harlequin LispWorks 4.0.1 with latest patches on a
> Pentium Pro
> 200 MHz with 80 Mo RAM
> Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float 3)))

^^^^^^^

NB "Float" is a misnomer for "float-safety" in LispWorks, so you really want
"(float 0)". Also this won't make much difference in the current PC release
anyway. A future PC release will include floating point optimization
comparable with LispWorks on Unix.

__Jason

Raymond Toy

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

Bulent Murtezaoglu <mu...@servtech.com> writes:

>
> inlining unit-vector seems to break type inference somehow (either that or
> there's some subtlety I don't understand right now).

This should work. I'll take a peek at it.

>
> I tried explicitly defining the structs as vectors hoping to save some
> more space but apparently unnamed vector structs cannot be used as types
> with the name given to defstruct in declarations. It seemed too ugly
> to declare points and spheres explicitly as vectors, so I didn't go for
> that workaround.

Converting the structs to vectors might help, but I think structure
access is already quite fast, and it probably doesn't reduce consing.

Ray

Raymond Toy

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

Bulent Murtezaoglu <mu...@servtech.com> writes:

>
> We should also note that we got the relatively decent speeds using
> non-ANSI extensions on an experimental implementation. Lisp vendors

No, I disagree. The only non-ANSI extension was CMUCL's start-block,
which essentially wraps everything up in a labels form. You could
have done that yourself and made a portable version. Even without it,
I don't think the speed was that much slower.

> probably don't get many requests for good float performance.

Perhaps, but when I sent a small comparison of FP performance between
CMUCL and Franz, an engineer from Franz volunteered a fair amount of
time helping me get good FP performance. He mentioned that he was
actively looking at improving the compiler to handle FP better. And
this was the free Linux version! If I ever need a commercial one, I
will surely look at Franz.

>
> >> That probably explains a number of things. I haven't figured
> >> out how to build cmucl from sources yet...
>
> RT> It's not easy to do, unless you're just compiling the current
> RT> version with the current version, but you would need to do
> RT> that anyway. :-)
>
> Yeah, I found out the other day that it isn't easy. It appears that having
> functions/variables/macros available by name at run time in the working
> Lisp complicates things when it comes to compiling the compiler.
> Am I understanding this correctly? I naively thought that some clever
> package manipulation and uninterning would fix things in an elegant manner,
> but didn't get very far even in my head. Kept thinking about two levels of
> packages with relative paths, but there's no such thing in Common Lisp.
> Anyhow, how do you guys do it? Is this documented anywhere? What am I
> missing?

Compiling a new version from an old is sometimes quite hard. What you
say can be done and has been, I hear, but for the really hard builds,
I think a cross compiler (either from a different architecture, or the
same architecture) is used. I've never used the cross compiler.

If you want more info about building, try www.cons.org. For
complicated builds, join the mailing list. Or, do like me, and just
wait for someone to build it for you. :-)

Ray

Christopher J. Vogt

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to Bulent Murtezaoglu

Bulent Murtezaoglu wrote:
>
> >>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
> [...]
> RT> Perhaps I'm being optimistic? :-) I don't remember. But I
> RT> didn't try running anything else except CMUCL, so that saves
> RT> some time.
>
> Maybe you are, but more importantly one of the points of this exercise
> is to get some sense of whether authors like Norvig and Graham are being
> too optimistic about how much time it would take to go from elegant working
> code to something within 2-3x of C.

I forget if it was Steele or Gabriel who observed that one of the things that
makes programming in common lisp uniquely different from other languages, is
that to get good performance, you have to be intimately familiar with your
implementation. This sort of issue doesn't exist in C, for example. One
implementation of C is generally as fast (within a factor of 2) as another.
However, CL implementations can vary quite a bit, in different areas. One
might have (relatively) good FP performance but horrible Bignum performance
etc.

In my experience with CL, much of it involving graphics and FP, the two
areas that CL implementations fall far behind C in terms of performance are
FP and I/O. So if there is something in particular about CL that you want
to benchmark, e.g. array performance, make sure that you aren't doing any
I/O and are not using FP numbers. I find it interesting that the original
message that started this thread was titled "Floating Point speed in CL",
but the code wasn't doing very much FP, it was mostly ratios!

>
> I think for simple things that could also be easily implemented in C right
> from the start, it isn't obvious that Common Lisp is a good environment.
> Incremental development etc. is all very nice, but when the smallest problem
> you can try takes as long to run as several edit-make-crash cycles in C,

> the benefits of Lisp are less clear. The raytracing example does not expose
> this problem adequately because you can go to tiny pictures and do your
> exploratory programming there if you need to. Nonetheless I think the point
> is valid, as there are problems where Lisp could really shine but doesn't
> because of inadequate performance. (machine learning comes to my mind where
> for some algorithms there's a knee in the problem size graph you have to hit
> to observe interesting stuff. If the smallest such interesting problem
> takes hours with unoptimized code, some of the benefits of Lisp are lost.
> This isn't horrible, and one can't have everything, but maybe we ought
> to recognize that costs for all this dynamic luxury might be prohibitive
> more often than one would first think.)

I think that it is *very* important that everyone understand the difference
between a language i.e. Common Lisp, and an implementation, i.e. CMUCL. There
is very little in Common Lisp the language that prohibits somebody from
developing an implementation that is competitive with C in terms of performance.
It may not be easy, and it may not exist yet, and it will require the proverbial
SSC (Sufficiently Smart Compiler), but it can be done! Common Lisp FP is not
slower than C, but FP in every CL implementation I have used has been slower
than C, but it doesn't have to be so.

>
> We should also note that we got the relatively decent speeds using
> non-ANSI extensions on an experimental implementation. Lisp vendors

> probably don't get many requests for good float performance.

I don't know how many requests they get for good FP performance, but I can
guarantee you that they do get some, speaking from personal experience :-)

>
>[...]

--
Christopher J. Vogt - Computer Consultant - Lisp, AI, Graphics, etc.
http://members.home.com/vogt/

Erik Naggum

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

* Christopher J. Vogt

| I forget if it was Steele or Gabriel who observed that one of the things
| that makes programming in common lisp uniquely different from other
| languages, is that to get good performance, you have to be intimately
| familiar with your implementation. This sort of issue doesn't exist in
| C, for example. One implementation of C is generally as fast (within a
| factor of 2) as another. However, CL implementations can vary quite a
| bit, in different areas. One might have (relatively) good FP performance
| but horrible Bignum performance etc.

I beg to differ. C systems (compilers, libraries, etc) generally differ
by just as large factors as Common Lisp, often more, when non-trivial
functions are used, and they tend to be for non-trivial programs, and
optimizing C is generally very hard beyond the simple cases.

I'm not saying that CL is doing a better job than C in the general case,
but I'm getting somewhat peeved with the repetitive claims that C is so
fast. it isn't. it never was. it's just that the language is fairly
transparent down to the machine level so when you change something in the
C code, you can predict the effect on the compiled machine code. that
doesn't make it _fast_, it only ingratiates the control-freak aspect of
most programmers.

Marc Battyani

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

Jason Trenouth wrote in message <357c3188.3987049031@newshost>...


>On Thu, 12 Mar 1998 20:29:14 +0100, "Marc Battyani"
><marc_b...@email.msn.com> wrote:
>
>> Here are the results for Harlequin LispWorks 4.0.1 with latest patches on
a
>> Pentium Pro
>> 200 MHz with 80 Mo RAM
>> Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float
3)))
> ^^^^^^^
>
>NB "Float" is a misnomer for "float-safety" in LispWorks, so you really
want
>"(float 0)". Also this won't make much difference in the current PC release


Hoops sorry,
I tried with float 0 and you were right it does not seem to change much.
BTW I didn't find "Float" in the LWW doc (section 7.4 Compiler Control)

>anyway. A future PC release will include floating point optimization
>comparable with LispWorks on Unix.
>
>__Jason

That's good news! We do OpenGL and signal processing apps in LWW so this
would be quite welcome.

Marc Battyani


Raffael Cavallaro

unread,
Mar 13, 1998, 3:00:00 AM3/13/98
to

Rainer,

we've seen posts of times for CMUCL, Franz's ACL for Linux, and Harlequin's
LispWorks for Windows.

How about MCL 4.2? You do run that, don't you? We'd like to see what MCL's
floating point performance is like.

thanks,

raf


Matthew McDonald

unread,
Mar 16, 1998, 3:00:00 AM3/16/98
to

Erik Naggum <cle...@naggum.no> writes:

> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial
> functions are used,

Do you have any examples to substantiate this claim?

-- Matthew McDonald <ma...@cs.uwa.edu.au>
"The Statistician is the natural predator of the Scientist" -- Ross Ihaka

Erik Naggum

unread,
Mar 16, 1998, 3:00:00 AM3/16/98
to

* Matthew McDonald

| Do you have any examples to substantiate this claim?

not any good set of particular examples, but 15 years of experience with
C on lots of platforms and lots of compilers has alerted me to the fact
that C compilers and environments differ _very_ much in quality, more
than my experience with (native-compiled) Common Lisp implementations and
environments do. Common Lisp implementations are generally quite good,
but the same is not necessarily true for C compilers -- naive compilation
of C is often a viable option for small systems.

if you need a specific example, consider the bundled C compiler with
SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
appears to be slightly better than GCC.

I find it interesting that you jump up to require substantiation of the
claim that C implementations differ, but do not challenge that Common
Lisp implementations do. this is the general sentiment out there, as
sort of a cultural requirement on programmers.

C is not fast. C is primitive, which means it has a potential of being
fast if the programmer is a genius and the compiler is on his side. if
the programmer is an idiot, and the compiler expects smarter programmers,
you will get abysmal performance. e.g., a system currently written in C
but which I am replacing with one rewritten in Common Lisp shows a
promise of reducing system requirements by a factor of between 5 and 6,
after months of work to decipher the incredibly bad C code to figure out
what it actually does.

#:Erik
--
religious cult update in light of new scientific discoveries:
"when we cannot go to the comet, the comet must come to us."

Raymond Toy

unread,
Mar 16, 1998, 3:00:00 AM3/16/98
to

ma...@cs.uwa.edu.au (Matthew McDonald) writes:

> Erik Naggum <cle...@naggum.no> writes:
>
> > I beg to differ. C systems (compilers, libraries, etc) generally differ
> > by just as large factors as Common Lisp, often more, when non-trivial
> > functions are used,
>

> Do you have any examples to substantiate this claim?
>

Compare gcc on a Linux Alpha system against the DEC C compiler for
Alpha. (Or was that f77 vs DEC Fortran?) I understand that
the difference in performance can be a factor of 2 or more for the
same code on the same machine.

See comp.os.linux.alpha.

Ray

Tim Bradshaw

unread,
Mar 16, 1998, 3:00:00 AM3/16/98
to

* Erik Naggum wrote:

> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial

> functions are used, and they tend to be for non-trivial programs, and
> optimizing C is generally very hard beyond the simple cases.

I think that in general this is quite right (and in general most
programs are a long way away from the case where even language
performance matters), but in the *specific* case of heavily numerical
code it is significantly difficult to get good performance out of CL
implementations, and even harder to get good performance with code
which is nearly portable. (In fact, it's probably really hard to get
good performance in all cases where most of the data should be
represented as immediate objects rather than pointers, but numerical
code is the best and commonest example of this).

And in these cases, when you get it wrong (or the compiler isn't smart
enough), you often end up with stuff that is 10s or 100s of times
slower than it should be. Languages like C or Fortran are unlikely to
cause you to take this kind of hit in cases where everything is
basically a float or int of some flavour. Of course they will
silently add two large positive integers & get a large negative
result...

--tim

Andreas Eder

unread,
Mar 16, 1998, 3:00:00 AM3/16/98
to

Tim Bradshaw <t...@aiai.ed.ac.uk> writes:

> And in these cases, when you get it wrong (or the compiler isn't smart
> enough), you often end up with stuff that is 10s or 100s of times
> slower than it should be. Languages like C or Fortran are unlikely to
> cause you to take this kind of hit in cases where everything is
> basically a float or int of some flavour. Of course they will
> silently add two large positive integers & get a large negative
> result...
>
> --tim

That is quite right, but then C is also almost always at least a factor
of 2 slower than a good Fortran compiler. I have seen this with lots of
programs on C and Fortran on Convex and Cray computers.
But then, this was 5 years ago, things may have changed.

Andreas Eder NetWorker (ReliantUnix)
SNI AG Phone: +49-89-636-42549
OEC HES XP 2 Fax: +49-89-636-40601
D-81739 Munich, Germany

EMail: mailto:Andrea...@mch.sni.de
for public access: http://www.sni.de/public/mr/nsr/nsr_en.htm
for SNI personnel only: http://athen.mch.sni.de/networker/welcome.htm

---
Is "Windows for Dummies" another case of "this sentence no verb?"


Christopher J. Vogt

unread,
Mar 16, 1998, 3:00:00 AM3/16/98
to Erik Naggum

The following exchange got trimmed, and I have added it back in for direct
reference.

> * Christopher J. Vogt
> | I forget if it was Steele or Gabriel who observed that one of the things
> | that makes programming in common lisp uniquely different from other
> | languages, is that to get good performance, you have to be intimately
> | familiar with your implementation. This sort of issue doesn't exist in
> | C, for example. One implementation of C is generally as fast (within a
> | factor of 2) as another. However, CL implementations can vary quite a
> | bit, in different areas. One might have (relatively) good FP performance
> | but horrible Bignum performance etc.
>

Eirc Naggum

> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial
> functions are used, and they tend to be for non-trivial programs, and
> optimizing C is generally very hard beyond the simple cases.


Erik Naggum wrote:
>
> * Matthew McDonald


> | Do you have any examples to substantiate this claim?
>

> not any good set of particular examples, but 15 years of experience with
> C on lots of platforms and lots of compilers has alerted me to the fact
> that C compilers and environments differ _very_ much in quality, more
> than my experience with (native-compiled) Common Lisp implementations and
> environments do. Common Lisp implementations are generally quite good,
> but the same is not necessarily true for C compilers -- naive compilation
> of C is often a viable option for small systems.
>
> if you need a specific example, consider the bundled C compiler with
> SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
> the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
> appears to be slightly better than GCC.

Well, I did say "within a factor of 2", so you seem to be in violent agreement
with me? Yes, commercial C compilers vary in speed, generally within a factor
of 2. Commercial CL systems, however, seem to vary by much larger factors, in
many different areas.

Marco Antoniotti

unread,
Mar 16, 1998, 3:00:00 AM3/16/98
to

ma...@cs.uwa.edu.au (Matthew McDonald) writes:

> Erik Naggum <cle...@naggum.no> writes:
>
> > I beg to differ. C systems (compilers, libraries, etc) generally differ
> > by just as large factors as Common Lisp, often more, when non-trivial
> > functions are used,
>

> Do you have any examples to substantiate this claim?
>

Here is a simple example

void
main()
{
void C_entry();

C_entry();
}

void
C_entry()
{
auto signed char buf[50];
strcpy(buf, "abcd");
return;
}

The program is redundant but it was written in this way to prove
exactly the point of "impredictability" of compiled C code. (The
compiler was for the ARM). The compiler is so aggressive that
recognizes the 'strcpy' and "inlines" it with a 'memcpy' when the
constant string pased to it fits in a given memory pattern.
Substituting "abcd" with "abc" makes the compiler change its strategy,
hence the optimization level. Substituting 'strcpy' with the obvious
counterpart 'mystrcpy' (taken straight out of K&R) seem to yield more
predictable performance.

Hence the proposition "there exists C compilers with various degrees
of resulting optimizations" is constructively proved :)

(Of course this may or may not be an answer to the question, but IMHO,
it comes pretty close).

The example is taken from lectures of the EECS 290a course held by
Professor Sangiovanni-Vincentelli at UCB last
fall. http://www-cad.eecs.berkeley.edu/~polis/class/index.html.
Check the second Lecture by Baker (actually the author is Grant) on
"Software Performance Estimation".

Cheers

--
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - (0)6 - 68 80 79 23, fax. +39 - (0)6 - 68 80 79 26
http://www.parades.rm.cnr.it

Matthew McDonald

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

I didn't really want to write a long article about this since (a) I'm
not a lisp expert - (though I've had some experience and success
optimising lisp code under CMUCL) and (b) most of my experience comparing
lisp and C implementations is several years old and therefore probably
somewhat outdated.

That said ......

Erik Naggum <cle...@naggum.no> writes:

>| Do you have any examples to substantiate this claim?

> not any good set of particular examples,
[...]


> SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
> the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
> appears to be slightly better than GCC.

I was originally planning to post a longer message in which I was
going to mention that I could remember seeing C code differ in
performance by a factor of 2 occasionally and around 3 only rarely.

Unless I misunderstood something, one of the versions of the ray-tracing
code ran *four* times faster on CMUCL than it did on ACL. Admittedly, it
had been optimised for CMUCL, but I'd be surprised to see a non-contrived
piece of C that revealed similar differences in C implementations.

> I find it interesting that you jump up to require substantiation of the
> claim that C implementations differ, but do not challenge that Common
> Lisp implementations do. this is the general sentiment out there, as
> sort of a cultural requirement on programmers.

I asked for examples because:

(1) I thought by that by "large differences", we meant something on the
order of a factor of four - not factors of two or less.

(2) Your claim doesn't match my personal experience,

The thread has already given an example of a piece of code that
differed by a factor of 4 under two differ lisp implementations. I
can't think of code I've seen that revealed similar differences
between serious C implementations.

> C is not fast. C is primitive, which means it has a potential of being
> fast if the programmer is a genius and the compiler is on his side.

With simple languages like C, the difference between naive compilation
and full optimisation is usually less than a factor of two (in my
experience). Look at the gcc vs oberon comparisions (the standard
oberon compilers do practically no optimisation) or even gcc with and
without optimisation.

Even a stupid C compiler can typically produce code that's within
spitting distance of what's produced by an optimising compiler,
because the language is basically a portable assembler. The same can't
be said of common lisp. When you're optimising floating point code in
C you don't usually need to go to much trouble to ensure that the
major operations are on unboxed data!

I keep seeing lisp people say that code written in lisp can easily be
made to perform as well as code written in C, and it just doesn't
match my (limited) experience. With CMUCL, floating point code can be
made to run about as fast as in C, but it seems like hard work. I'm
not even sure it's feasible in ACL.

It was only three or four years ago that a well-known lisp implentor
posted something here suggesting that: when you want lots of speed
from a certain lisp implementation, you should estimate the number of
registers required by the sub-expressions of important expressions so
you can then rearrange the expression to minimise the number of loads
and stores during the execution of the expression, since the compiled
code would execute the expression strictly left-to-right.

In the C/pascal/FORTRAN world, even straightforward compilers weigh
subexpressions before deciding what order to evaluate them in.

Perhaps I missing something, or I misunderstood the "well-known lisp
implentor", and the advice he posted was thirty years out of date and
for historical interest only.

The only substantial example of lisp code I can remember that was
competive is the Koza genetic algorithm code was easily optimised to
the point where it beat the C equivalent. I'd be delighted to see
large numbers of significant examples of straightforward lisp programs
that performed within a factor of two of C equivalents.

-- Matthew McDonald <ma...@cs.uwa.edu.au>

Appendix: to substantiate the claim that C implementations often don't
differ dramatically, here's a benchmark given at the end of the lcc
OVERVIEW.TEX file (from some time in 1993).

benchmark
_compiler__1._gcc1.35__8._espresso__22._li_23._eqntott__


VAX: MicroVAX II w/16MB running Ultrix 3.1
lcc 1734 2708 7015 3532
cc 1824 2782 7765 3569
cc -O 1661 2653 7086 3757
gcc 1439 2757 7353 3263
gcc -O 1274 2291 6397 1131


68020: Sun 3/60 w/24MB running SunOS 4.0.3
lcc 544 1070 2591 567
cc 514 1005 3308 619
cc -O 428 882 2237 571
gcc 426 1048 2498 591
gcc -O 337 834 1951 326


MIPS: IRIS 4D/220GTX w/32MB running IRIX 3.3.1
lcc 116 150 352 111
cc 107 153 338 100
cc -O 92 130 299 70
gcc 188 502 132
gcc -O 145 411 112


SPARC: Sun 4/260 w/32MB running SunOS 4.0.3
lcc 196 370 790 209
cc 203 381 1094 275
cc -O 150 296 707 183
gcc 186 411 1139 256
gcc -O 127 309 788 179

Martin Volf

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

Matthew McDonald wrote:
>
> With simple languages like C, the difference between naive compilation
> and full optimisation is usually less than a factor of two (in my
> experience). Look at the gcc vs oberon comparisions (the standard
> oberon compilers do practically no optimisation) or even gcc with and
> without optimisation.
>

Well, this is just what I have done some time ago, and the result may be
quite surprising:
without optimization (-O0) 2000 ms
optimizing (-02) 330 ms

So you have factor 6. The program was very simple, something like
bubble-sort. (I looked carefully if there is no huge job running
simultaneously and tried it more times, so it is no mistake.)

Martin
(martin.volf at st dot ms dot mff dot cuni dot cz)

John Watton

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to


>I keep seeing lisp people say that code written in lisp can easily be
>made to perform as well as code written in C, and it just doesn't
>match my (limited) experience. With CMUCL, floating point code can be
>made to run about as fast as in C, but it seems like hard work. I'm
>not even sure it's feasible in ACL.

Of course its feasible in ACL (4.3 varity on unix, NT and linux). I do
it all the time. I posted the benchmark below on comp.lang.lisp
sometime ago and didn't generate a fraction of the interest directed
to the ray tracing code. (Unlike the ray tracing code, I guess the
benchmark was too easy to get your hands around.) As you can see it is
comparible to unoptimized C code. With ACL I find I use the nonANSI
declare keyword :explain to tell me where the boxing is (declare
(:explain :boxing :calls :types)). Also with ACL is a nonANSI and
undocumented feature called immediate-args for passing and returning
unboxed arguments. I don't use it in this benchmark but it is
available at least on the Unix variety. I suspect that this is one
area that the ray tracing code suffers in the ACL version. In general
I prefer to pass floats inside arrays to avoid boxing - it's easy and
ANSI. But, I don't do that here to be fair to C.

John Watton
ALCOA

Machine: Pentium 90 Mhz, 40 Meg RAM
OS: Windows NT

Double floating point benchmark
Pnpoly area sec rel
----------- --- ---
C | VC++ 4.0 (cl -Ox) 16.5 1.0
Java | JDK 1.1.4 (JIT) 20.2 1.2
C | VC++ 4.0 (cl) 23.5 1.4
Lisp | Franz ACL 4.3.2 25.6 1.6
Java | JDK 1.1.4 (-O) 106.9 6.5
Lisp | Franz ACL 3.0.1 521.3 31.6
Perl | Perl5.003 1007.0 61.0


PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C
#include "math.h"

int pnpoly(int npol, double *xp, double *yp, double x, double y)
{
int i, j, c = 0;
for (i = 0, j = npol-1; i < npol; j = i++) {
if ((((yp[i]<=y) && (y<yp[j])) ||
((yp[j]<=y) && (y<yp[i]))) &&
(x < (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i]))
c = !c;
}
return c;
}

main(int argc, char *argv[]) {
int npol = atoi(argv[1]); /* 200 */
int div = atoi(argv[2]); /* 400 */
int count = 0;
double pi = acos(-1.0);
double* xp;
double* yp;
double theta;
double a = 10.0;
double fdiv = 2*a/div;
int i=0, j=0;
xp = malloc(npol*sizeof(double));
yp = malloc(npol*sizeof(double));
for(i=0;i<npol;i++) {
theta = i*2*pi/npol;
xp[i] = a+a*pow(cos(theta), 3);
yp[i] = a+a*pow(sin(theta), 3);
}
for(i=0;i<=div;i++) {
for(j=0;j<=div;j++) {
if (pnpoly(npol,xp,yp,i*fdiv,j*fdiv)) count++;
}
}
printf("\nArea: %f \n", 4*a*a*count/((div+1)*(div+1)));
return(0);
}

PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP
(defun pnpoly (npol xp yp x y)
(declare (optimize (speed 3) (safety 0))
(fixnum npol)
(double-float x y)
(type (simple-array double-float (*)) xp yp))
(let* ((c nil)
(j (1- npol)))
(declare (fixnum j))
(dotimes (i npol c)
(declare (fixnum i))
(if (and (or (and (<= (aref yp i) y) (< y (aref yp j)))
(and (<= (aref yp j) y) (< y (aref yp i))))
(< x (+ (aref xp i) (/ (* (- (aref xp j) (aref xp i))
(- y (aref yp i)))
(- (aref yp j) (aref yp i))))))
(setq c (not c)))
(setq j i))))

(defun pnpolymain (npol div) ; call with 200 400
(declare (optimize (speed 3) (safety 0))
(fixnum npol div))
(let* ((xp (make-array npol :element-type 'double-float))
(yp (make-array npol :element-type 'double-float))
(theta 0.0d0)
(a 10.0d0)
(fdiv (/ (* 2 a) div))
(count 0))
(declare (double-float fdiv a theta)
(fixnum count)
(type (simple-array double-float (*)) xp yp))
(dotimes (i npol)
(declare (fixnum i))
(setq theta (/ (* 2 i pi) npol))
(setf (aref xp i) (+ a (* a (expt (cos theta) 3)))
(aref yp i) (+ a (* a (expt (sin theta) 3)))))
(dotimes (u (1+ div))
(declare (fixnum u))
(dotimes (v (1+ div))
(declare (fixnum v))
(if (pnpoly npol xp yp (* u fdiv) (* v fdiv)) (incf count))))
(format t "~%Area: ~a" (/ (* count 4 a a) (* (1+ div) (1+ div))))))

Raymond Toy

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

ma...@cs.uwa.edu.au (Matthew McDonald) writes:

>
> Unless I misunderstood something, one of the versions of the ray-tracing
> code ran *four* times faster on CMUCL than it did on ACL. Admittedly, it
> had been optimised for CMUCL, but I'd be surprised to see a non-contrived
> piece of C that revealed similar differences in C implementations.

The code was not really optimized for CMUCL. The changes were
portable, and the only CMUCL specific item was the start-block stuff.
This could easily have been made portable by wrapping up everything in
a labels block.

I think that if *I* wrote a C compiler for a pentium, I could produce
one that was produces code that's probably 2-10 times slower than any
other C compiler. It would take me forever to do it, though.

I guess the bottom line is that writing a compiler is hard, and
writing good compilers are even harder.

Ray

Raymond Toy

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

John Watton <john....@alcoa.com> writes:

> it all the time. I posted the benchmark below on comp.lang.lisp
> sometime ago and didn't generate a fraction of the interest directed
> to the ray tracing code. (Unlike the ray tracing code, I guess the

Perhaps because you had already optimized the code? CMUCL only gives
a few notes about (* (1+ DIV) (1+ DIV)) because the result is not a
fixnum. This should not impact speed too much.

> (:explain :boxing :calls :types)). Also with ACL is a nonANSI and
> undocumented feature called immediate-args for passing and returning
> unboxed arguments. I don't use it in this benchmark but it is
> available at least on the Unix variety. I suspect that this is one
> area that the ray tracing code suffers in the ACL version. In general
> I prefer to pass floats inside arrays to avoid boxing - it's easy and
> ANSI. But, I don't do that here to be fair to C.

I don't remember precisely, but aren't parameters allowed to be passed
in registers in C? If so, Lisp should be allowed the same freedom.


Here are my times on an Ultra-30 300 MHz sparc, using Sun C compiler
(-O), and CMUCL 18a+:

sec
cc 5.7
cc -O 2.0
CMUCL 3.1

I note that the test cons up 5 MB or so, as reported by time. If I
add the CMUCL extension (declaim (ext:start-block pnpolymain)), the
run time doesn't change, but it now only conses 8000 bytes.

Ray


Raymond Toy

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

Raymond Toy <t...@rtp.ericsson.se> writes:

> John Watton <john....@alcoa.com> writes:
>
> > it all the time. I posted the benchmark below on comp.lang.lisp
> > sometime ago and didn't generate a fraction of the interest directed
> > to the ray tracing code. (Unlike the ray tracing code, I guess the
>
> Perhaps because you had already optimized the code? CMUCL only gives
> a few notes about (* (1+ DIV) (1+ DIV)) because the result is not a
> fixnum. This should not impact speed too much.
>

Here is one small optimization for Lisp which a C compiler probably
would have already done. This does the redundant array references
just once at the beginning of the loop. Are there any Lisp compilers
that do subexpression elimination (is that what it's called?)?

This change reduces the Lisp time from 3.2 sec to 2.5 sec. We are now
within .4 sec (20%) of the optimized C code.

Ray

(defun pnpoly (npol xp yp x y)
(declare (optimize (speed 3) (safety 0))
(fixnum npol)
(double-float x y)
(type (simple-array double-float (*)) xp yp))
(let* ((c nil)
(j (1- npol)))
(declare (fixnum j))
(dotimes (i npol c)
(declare (fixnum i))

(let ((ypi (aref yp i))
(ypj (aref yp j))
(xpi (aref xp i))
(xpj (aref xp j)))
(if (and (or (and (<= ypi y) (< y ypj))
(and (<= ypj y) (< y ypi)))
(< x (+ xpi (/ (* (- xpj xpi)
(- y ypi))
(- ypj ypi)))))
(setq c (not c))))
(setq j i))))

Bulent Murtezaoglu

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
[...]
RT> I note that the test cons up 5 MB or so, as reported by time.
RT> If I add the CMUCL extension (declaim (ext:start-block
RT> pnpolymain)), the run time doesn't change, but it now only
RT> conses 8000 bytes.

Yes I've been noticing that too. That's probably due to return values
not getting consed when the file is block compiled. Here's what I've been
wondering lately:

(deftsruct foo ;bla bla
)

(defun f1 (x)
;; some computation
(make-foo ;some results
))

(defun f2 (y)
;; some computation
(let ((local1 (f1 y)))
;; some more compuation
))

you get the idea (excuse the bad paranthesis placement):
the make-foo call generates garbage. Is Common Lisp allowed to
stack-allocate what f1 returns under certain circumstances?
Maybe if f1 is inlined and local1 is declared dynamic-extent?
(make-foo ing in f2 and passing that to f1 is too ugly)
Can some enlightened soul shed light on this?

cheers,

BM

Christopher J. Vogt

unread,
Mar 17, 1998, 3:00:00 AM3/17/98