
Floating Point speed in Common Lisp


Rainer Joswig
Mar 10, 1998

What is the current speed of CL implementations
when it comes to FP code?


This is the raytracer code from the ANSI CL book by Paul Graham.
How about compiling it as is and running it
with (time (ray-test 5))?


Somebody might also want to come up with a
version which is optimized for
the various CL implementations:
- optimizations settings changed
- type declarations added
- inlining
- reduced consing
- ???
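For a taste of what those changes might look like, here is an untested sketch (the exact settings and declarations will vary by implementation):

```lisp
;; Untested sketch of the kind of tuning meant above: global optimization
;; settings plus a float type declaration, so the compiler can use an
;; unboxed single-float multiply.
(declaim (optimize (speed 3) (safety 0)))
(declaim (inline sq))

(defun sq (x)
  (declare (type single-float x))
  (* x x))
```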

;;; From Paul Graham's book ANSI Common Lisp, Example 9.8

(defun sq (x) (* x x))

(defun mag (x y z)
  (sqrt (+ (sq x) (sq y) (sq z))))

(defun unit-vector (x y z)
  (let ((d (mag x y z)))
    (values (/ x d) (/ y d) (/ z d))))

(defstruct (point (:conc-name nil))
  x y z)

(defun distance (p1 p2)
  (mag (- (x p1) (x p2))
       (- (y p1) (y p2))
       (- (z p1) (z p2))))

(defun minroot (a b c)
  (if (zerop a)
      (/ (- c) b)
      (let ((disc (- (sq b) (* 4 a c))))
        (unless (minusp disc)
          (let ((discrt (sqrt disc)))
            (min (/ (+ (- b) discrt) (* 2 a))
                 (/ (- (- b) discrt) (* 2 a))))))))

(defstruct surface color)


(defparameter *world* nil)

(defconstant eye (make-point :x 0 :y 0 :z 200))

(defun tracer (pathname &optional (res 1))
  (with-open-file (p pathname :direction :output)
    (format p "P2 ~A ~A 255" (* res 100) (* res 100))
    (let ((inc (/ res)))
      (do ((y -50 (+ y inc)))
          ((< (- 50 y) inc))
        (do ((x -50 (+ x inc)))
            ((< (- 50 x) inc))
          (print (color-at x y) p))))))

(defun color-at (x y)
  (multiple-value-bind (xr yr zr)
      (unit-vector (- x (x eye))
                   (- y (y eye))
                   (- 0 (z eye)))
    (round (* (sendray eye xr yr zr) 255))))

(defun sendray (pt xr yr zr)
  (multiple-value-bind (s int) (first-hit pt xr yr zr)
    (if s
        (* (lambert s int xr yr zr)
           (surface-color s))
        0)))

(defun first-hit (pt xr yr zr)
  (let (surface hit dist)
    (dolist (s *world*)
      (let ((h (intersect s pt xr yr zr)))
        (when h
          (let ((d (distance h pt)))
            (when (or (null dist) (< d dist))
              (setf surface s hit h dist d))))))
    (values surface hit)))

(defun lambert (s int xr yr zr)
  (multiple-value-bind (xn yn zn) (normal s int)
    (max 0 (+ (* xr xn) (* yr yn) (* zr zn)))))

(defstruct (sphere (:include surface))
  radius center)

(defun defsphere (x y z r c)
  (let ((s (make-sphere
            :radius r
            :center (make-point :x x :y y :z z)
            :color c)))
    (push s *world*)
    s))

(defun intersect (s pt xr yr zr)
  (funcall (typecase s (sphere #'sphere-intersect))
           s pt xr yr zr))

(defun sphere-intersect (s pt xr yr zr)
  (let* ((c (sphere-center s))
         (n (minroot (+ (sq xr) (sq yr) (sq zr))
                     (* 2 (+ (* (- (x pt) (x c)) xr)
                             (* (- (y pt) (y c)) yr)
                             (* (- (z pt) (z c)) zr)))
                     (+ (sq (- (x pt) (x c)))
                        (sq (- (y pt) (y c)))
                        (sq (- (z pt) (z c)))
                        (- (sq (sphere-radius s)))))))
    (if n
        (make-point :x (+ (x pt) (* n xr))
                    :y (+ (y pt) (* n yr))
                    :z (+ (z pt) (* n zr))))))

(defun normal (s pt)
  (funcall (typecase s (sphere #'sphere-normal))
           s pt))


(defun sphere-normal (s pt)
  (let ((c (sphere-center s)))
    (unit-vector (- (x c) (x pt))
                 (- (y c) (y pt))
                 (- (z c) (z pt)))))

(defun ray-test (&optional (res 1))
  (setf *world* nil)
  (defsphere 0 -300 -1200 200 .8)
  (defsphere -80 -150 -1200 200 .7)
  (defsphere 70 -100 -1200 200 .9)
  (do ((x -2 (1+ x)))
      ((> x 2))
    (do ((z 2 (1+ z)))
        ((> z 7))
      (defsphere (* x 200) 300 (* z -400) 40 .75)))
  (tracer (make-pathname :name "spheres.pgm") res))


Larry Hunter
Mar 10, 1998

Rainer Joswig asks about the current speed of CL implementations on FP
code, proposing the Graham raytracer code as a benchmark.

Using ACL 4.3.1 (with current patch set) on an SGI Indigo^2 running Irix
6.2, with the code he supplied and the default optimizations I get:

USER(58): (time (ray-test 5))
; cpu time (non-gc) 195,970 msec (00:03:15.970) user, 1,280 msec system
; cpu time (gc) 64,700 msec (00:01:04.700) user, 370 msec system
; cpu time (total) 260,670 msec (00:04:20.670) user, 1,650 msec system
; real time 292,094 msec (00:04:52.094)
; space allocation:
; 515,900 cons cells, 0 symbols, 2,240,640,560 other bytes

Setting optimize (speed 3) (safety 0) (debug 0), I get about a 15% speedup:

USER(66): (time (ray-test 5))
; cpu time (non-gc) 170,460 msec (00:02:50.460) user, 1,200 msec system
; cpu time (gc) 65,000 msec (00:01:05) user, 360 msec system
; cpu time (total) 235,460 msec (00:03:55.460) user, 1,560 msec system
; real time 252,231 msec (00:04:12.231)
; space allocation:
; 515,803 cons cells, 0 symbols, -2,054,326,848 other bytes

Judging by a quick look at the code (not to mention the huge amount of
boxing indicated above), there is a lot of optimizing that could be done. I
don't have time to do it, but if someone else takes a crack, I'll test the
resulting code on the ACL/SGI implementation.

Larry


--
Lawrence Hunter, PhD.
National Library of Medicine phone: +1 (301) 496-9303
Bldg. 38A, 9th fl, MS-54 fax: +1 (301) 496-0673
Bethesda. MD 20894 USA email: hun...@nlm.nih.gov

Will Hartung
Mar 10, 1998

"Rainer Joswig" <jos...@lavielle.com> writes:

>What is the current speed of CL implementations
>when it comes to FP code?


>This is the raytracer code from the ANSI CL book by Paul Graham.
>How about compiling it as it is and run it
>with (time (ray-test 5)) ?


>Somebody might also want to come up with a
>version which is optimized for
>the various CL implementations:
>- optimizations settings changed
>- type declarations added
>- inlining
>- reduced consing
>- ???

This is a pretty cool idea. It would be interesting to see "how fast"
it can go, and what changes are necessary to make that happen.

We had an optimization thread a little while back that was
pretty interesting, but if this one takes off, I think it can be more
approachable.

Also, there was that fellow from Nichimen Graphics (my apologies for
forgetting his name and possibly not spelling his company name
properly) whose life pursuit seems to have been speeding up floating
point operations, specifically by, what, "unboxing" the data? So,
hopefully he'll have some input into this thread.

It would also be interesting to see what the fastest "portable" version
is compared to the fastest for each implementation.

--
Will Hartung - Rancho Santa Margarita. It's a dry heat. vfr...@netcom.com
1990 VFR750 - VFR=Very Red "Ho, HaHa, Dodge, Parry, Spin, HA! THRUST!"
1993 Explorer - Cage? Hell, it's a prison. -D. Duck

David J. Fiander
Mar 10, 1998

vfr...@netcom.com (Will Hartung) writes:

> We had an interesting optimization thread a little while back that was
> pretty interesting, but if this one takes off, I think it can be more
> approachable.

This came up just as I started spending serious amounts of time
playing with Lisp. Graham talks about the fact that it's
possible to make Lisp go fast, but that it may require lots of
declarations, all of which are optional. He also says that one wouldn't
actually stuff crufty little (declare)s and (the)s all over the
code, but would use macros. So, here are a couple that I came up
with in that vein. Do these look like things that might
appear in a mature (i.e., well used and customized) development
environment?

;;;
;;; Declare a function to be inline, without too much fur
;;;
(defmacro definline (name args &body body)
  `(progn
     (declaim (inline ,name))
     (defun ,name ,args ,@body)))

;;;
;;; Type a variable, to give a few extra hints to the compiler
;;;
(defmacro deftypedvar (type name &optional initval)
  `(progn
     (declaim (type ,type ,name))
     (defvar ,name ,@(if initval `((the ,type ,initval))))))
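For instance, applied to the raytracer they might be used like this (untested sketch; *ambient* is a made-up variable, just for illustration):

```lisp
;; Untested sketch of the macros above applied to the raytracer's sq.
(definline sq (x)
  (declare (type single-float x))
  (* x x))

;; *ambient* is hypothetical; it only shows the deftypedvar pattern.
(deftypedvar single-float *ambient* 0.1)
```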

Frank A. Adrian
Mar 11, 1998

Larry Hunter wrote in message ...

>Judging by a quick look at the code (not to mention the huge amount of
>boxing indicated above), there is a lot of optimizing that could be done. I
>don't have time to do it, but if someone else takes a crack, I'll test the
>resulting code on the ACL/SGI implementation.

Yup, boxing floats is the main thing that seems to slow down LISP FP
computation (and from looking at this code, I'd imagine that it conses a
fair bit, being written for clarity rather than speed). I was wondering if
anyone had done an implementation that used tag bits in the lower end of the
word to differentiate between primitive types (e.g., smallInt / pointer /
singleFloat / constant) and just burned the last few bits of IEEE
compatibility. It seems that you could get single float performance to
within a small factor of "C" speed with only a small penalty in accuracy. I
know that most compilers can do a good job of not boxing floats once they
get into the stack frame, but whenever they're returned or stored to memory,
most implementations seem to have to box them. So has anyone tried this
approach?
--
Frank A. Adrian
First DataBank
frank_...@firstdatabank.com (W)
fra...@europa.com (H)
This message does not necessarily reflect those of my employer,
its parent company, or any of the co-subsidiaries of the parent
company.


Bulent Murtezaoglu
Mar 11, 1998

>>>>> "FAA" == Frank A Adrian <frank_...@firstdatabank.com> writes:

FAA> ... I know that
FAA> most compilers can do a good job of not boxing floats once
FAA> they get into the stack frame, but whenever they're returned
FAA> or stored to memory, most implementations seem to have to box
FAA> them. So has anyone tried this approach?

It appears that CMUCL's block-compile extension should be able to do this
without undue declaration noise. It lets you declare a bunch of
functions in "a block" as being called only from others in the same block,
with the exception of ones you declare as entry points (i.e., called from
elsewhere). Excellent idea: then, just as with inlining, all kinds
of optimizations become feasible without the programmer having to spell
everything out in declares. As far as I can see, that should take care of
the above. Of course I was unable to test it, since I couldn't yet coerce
the experimental version of CMUCL I have to do block compiles, but I think
the idea is an elegant and feasible one. The inconvenience of not being
able to redefine functions and such is a small trade-off for finished code.
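For reference, the declaims look something like this minimal sketch (untested here, for the reason given above):

```lisp
;; Minimal sketch of CMUCL's block-compilation declaims: helper can only
;; be reached through entry, the declared entry point, so the compiler
;; may turn the call to helper into a fast local call.
#+cmu (declaim (ext:start-block entry))

(defun helper (x)
  (declare (single-float x))
  (* x x))

(defun entry (x)
  (declare (single-float x))
  (helper (+ x 1.0)))

#+cmu (declaim (ext:end-block))
```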

BM


Raymond Toy
Mar 11, 1998

"Rainer Joswig" <jos...@lavielle.com> writes:

> What is the current speed of CL implementations
> when it comes to FP code?
>

The example isn't particularly good about *using* FP, but it could be
made so.

What kind of changes am I allowed to do?

I've made some mods to this code so that everything is a single-float,
as much as possible, and added some appropriate declarations for
CMUCL. (CMUCL derives the rest for itself.)

With these mods, on a 300-MHz Ultra, running CMUCL "18b", I get

(time (ray-test 5))

Evaluation took:
63.17 seconds of real time
44.21 seconds of user run time
11.37 seconds of system run time
[Run times include 13.84 seconds GC run time]
0 page faults and
1263144016 bytes consed.

Pretty bad compared to what someone else posted for SGI ACL.

The GC for CMUCL is not great, so it seems to spend lots of time doing
GC.

Someone more knowledgeable about CMUCL could probably speed this up
some more.

Ray

Modified code:

;;; From Paul Graham's book ANSI Common Lisp, Example 9.8

(declaim (optimize (speed 3)))

(eval-when (compile load eval)

  (defstruct (point (:conc-name nil))
    (x 0.0 :type single-float)
    (y 0.0 :type single-float)
    (z 0.0 :type single-float))

  (defstruct surface
    (color 0.0 :type single-float))

  (defstruct (sphere (:include surface))
    (radius 0.0 :type single-float)
    (center (required-argument) :type point))
  )

(defun sq (x)
  (declare (type single-float x))
  (* x x))

(defun mag (x y z)
  (declare (type single-float x y z))
  (sqrt (+ (sq x) (sq y) (sq z))))

(defun unit-vector (x y z)
  (declare (type single-float x y z)
           (values (single-float 0.0 1.0)
                   (single-float 0.0 1.0)
                   (single-float 0.0 1.0)))
  (let ((d (mag x y z)))
    (values (/ x d) (/ y d) (/ z d))))

(defun distance (p1 p2)
  (declare (type point p1 p2))
  (mag (- (x p1) (x p2))
       (- (y p1) (y p2))
       (- (z p1) (z p2))))

(declaim (inline minroot))

(defun minroot (a b c)
  (declare (type single-float a b c))
  (if (zerop a)
      (/ (- c) b)
      (let ((disc (- (sq b) (* 4 a c))))
        (unless (minusp disc)
          (let ((discrt (sqrt disc)))
            (min (/ (+ (- b) discrt) (* 2 a))
                 (/ (- (- b) discrt) (* 2 a))))))))

(defparameter *world* nil)

(defconstant eye (make-point :x 0.0 :y 0.0 :z 200.0))

(defun tracer (pathname &optional (res 1))
  (declare (type (integer 1 100) res))
  (with-open-file (p pathname :direction :output)
    (format p "P2 ~A ~A 255" (* res 100) (* res 100))
    (let ((inc (/ (float res 1.0))))
      (do ((y -50.0 (+ y inc)))
          ((< (- 50.0 y) inc))
        (declare (single-float y))
        (do ((x -50.0 (+ x inc)))
            ((< (- 50.0 x) inc))
          (declare (single-float x))
          (print (color-at x y) p))))))

(defun defsphere (x y z r c)
  (declare (type single-float x y z r))
  (let ((s (make-sphere
            :radius r
            :center (make-point :x x :y y :z z)
            :color c)))
    (push s *world*)
    s))

#+cmu
(declaim (ext:start-block color-at sendray first-hit lambert intersect))

#+nil
(defun color-at (x y)
  (declare (type single-float x y))
  (multiple-value-bind (xr yr zr)
      (unit-vector (- x (x eye))
                   (- y (y eye))
                   (- 0 (z eye)))
    (round (* (sendray eye xr yr zr) 255.0))))

(defun sendray (pt xr yr zr)
  (declare (type point pt)
           (type single-float xr yr zr))
  (multiple-value-bind (s int)
      (first-hit pt xr yr zr)
    (if s
        (* (lambert s int xr yr zr)
           (surface-color s))
        0.0)))

(defun first-hit (pt xr yr zr)
  (declare (type point pt)
           (type single-float xr yr zr))
  (let (surface hit dist)
    (dolist (s *world*)
      (declare (type sphere s))
      (let ((h (intersect s pt xr yr zr)))
        (when h
          (let ((d (distance h pt)))
            (when (or (null dist) (< d dist))
              (setf surface s hit h dist d))))))
    (values surface hit)))

(defun lambert (s int xr yr zr)
  (declare (type single-float xr yr zr)
           (type point int))
  (multiple-value-bind (xn yn zn)
      (normal s int)
    (max 0.0 (+ (* xr xn) (* yr yn) (* zr zn)))))

#+nil
(defun intersect (s pt xr yr zr)
  (declare (type point pt)
           (single-float xr yr zr))
  (funcall (typecase s (sphere #'sphere-intersect))
           s pt xr yr zr))

(defun intersect (s pt xr yr zr)
  (declare (type sphere s)
           (type point pt)
           (single-float xr yr zr))
  (sphere-intersect s pt xr yr zr))

(defun sphere-intersect (s pt xr yr zr)
  (declare (type sphere s)
           (type point pt)
           (type single-float xr yr zr))
  (let* ((c (sphere-center s))
         (n (minroot (+ (sq xr) (sq yr) (sq zr))
                     (* 2 (+ (* (- (x pt) (x c)) xr)
                             (* (- (y pt) (y c)) yr)
                             (* (- (z pt) (z c)) zr)))
                     (+ (sq (- (x pt) (x c)))
                        (sq (- (y pt) (y c)))
                        (sq (- (z pt) (z c)))
                        (- (sq (sphere-radius s)))))))
    (if n
        (make-point :x (+ (x pt) (* n xr))
                    :y (+ (y pt) (* n yr))
                    :z (+ (z pt) (* n zr))))))

(defun color-at (x y)
  (declare (type single-float x y))
  (multiple-value-bind (xr yr zr)
      (unit-vector (- x (x eye))
                   (- y (y eye))
                   (- 0 (z eye)))
    (let ((color (* (sendray eye xr yr zr) 255.0)))
      (declare (type (single-float 0.0 (255.0)) color))
      (floor (+ color 0.5)))))

(declaim (ext:end-block))

#+nil
(defun normal (s pt)
  (declare (type point pt)
           (values single-float single-float single-float))
  (funcall (typecase s (sphere #'sphere-normal))
           s pt))

(defun normal (s pt)
  (declare (type sphere s)
           (type point pt)
           (values single-float single-float single-float))
  (sphere-normal s pt))

(defun sphere-normal (s pt)
  (let ((c (sphere-center s)))
    (unit-vector (- (x c) (x pt))
                 (- (y c) (y pt))
                 (- (z c) (z pt)))))

(defun ray-test (&optional (res 1))
  (setf *world* nil)
  (defsphere 0.0 -300.0 -1200.0 200.0 .8)
  (defsphere -80.0 -150.0 -1200.0 200.0 .7)
  (defsphere 70.0 -100.0 -1200.0 200.0 .9)
  (do ((x -2.0 (1+ x)))
      ((> x 2.0))
    (declare (type single-float x))
    (do ((z 2.0 (1+ z)))
        ((> z 7.0))
      (declare (type single-float z))
      (defsphere (* x 200) 300.0 (* z -400) 40.0 .75)))
  (tracer (make-pathname :name "spheres.pgm") res))

Bulent Murtezaoglu
Mar 12, 1998

Ok here are my results using Graham's original code, Raymond Toy's
code, and my modifications to RT's code, for both ACL and CMUCL.

This took a lot longer than I expected (I figure 4 hours at least,
semi-distracted). Even barely usable benchmarking takes a lot of effort.

Short summary: the fastest version runs in under 10 s to render the image
(no file access). Tables and code are at the end.

Notes:

RT's code did not produce 500x500 pictures as advertised in the header
(you guys have been looking at the output, right?). My guess is that
rational arithmetic was being used originally, so there were no rounding
errors. In the single-float version, accumulated errors were causing
the do loops to terminate one iteration sooner. Maybe the point could
be made that optimizing with declares is not as straightforward as
advertised. It took me more than 15 minutes of head scratching to
understand and fix the problem.

I also spent about an hour and a half on top of what RT spent optimizing
the code using CMUCL. Usually silencing CMUCL's efficiency notes gets
the code into reasonably good shape for other Lisps as well. The exceptions
to that in this case are

(1) ACL's inlining policy. Franz's compiler (at least as far as Linux
and v. 4.3 for SunOS goes) does not ever inline user functions. So I
turned the square function into a macro for ACL and timed it again.

(2) CMUCL block compile feature (non-ANSI)

(3) CMUCL values declaration (non-ANSI, but there's an ANSI equivalent;
I didn't get to converting the code, so the results are BIASED in favor
of CMUCL).

I think it is a good idea to split the timings for the generation of the
image and the dumping of the text file. Output tends to be slow and
sometimes conses with print, and it is also dependent on filesystem/NFS
speed. Since we are after raw number-crunching performance here, I
modified the code to stuff an array with the pixel values.

Hardware: Homegrown system, PentiumPro 180 overclocked to 231MHz with
256K L1 cache and 128M EDO RAM sitting on a 77MHz bus.

OS: Linux 2.0.32. Fileserver on NFS over 10base2 (slow server, lightly
loaded network).

Lisp:
ACL is Allegro Common Lisp Linux version fully patched up to November 97
CMUCL is the version dated 30/01/98 off ftp.cons.org (Peter VanEynde's
binary kit)

Timing: fresh Lisp on a very lightly loaded system from the command
line. I compiled everything once at the start and ran them three times
with full gc in between, #+acl (gc t) #+cmu (ext:gc), and I'm reporting the
third time (too lazy to average, and it doesn't matter given the large
differences we're looking for). There was negligible variation between the
runs (2% max). I don't think there were any page faults during the timed
runs (ACL doesn't report them [by default?], and CMUCL didn't report page
faults after the first run.)

All times in ms. There are differences in what time reports between the
systems; I took a reasonable intersection. One notable difference is that
ACL reports both cons cell counts and "other bytes", while CMUCL reports
everything as cons cells in bytes. I report CMU's byte count under ACL's
"other" column.

Original code as posted:

Real Time User System GC time cons other
ACL 188,472 185,360 500 2,190 507,541 2,238,426,232
CMUCL 263,970 244,650 15,870 50,380 1,422,303,632

Original code + global (declaim (optimize (speed 3) (safety 0) (debug 0)))

ACL 185,948 182,750 600 2,370 507,649 2,238,455,232
CMUCL 285,890 266,300 16,050 50,430 1,422,346,560

Please note the following are somewhat unfair to ACL in varying degrees


RT's code as posted (w/ the bug but I did add one #+ to get allegro to load it)

Real Time User System GC time cons other
ACL 158,726 154,930 650 4,550 512,941 -19,566,024 (!)
CMUCL 117,530 99,230 15,340 45,910 1,288,595,072

RT's code + my additions and fix (generate and write to file as usual)
Real Time User System GC time cons other
CMUCL 25,280 21,290 2,000 5,690 161,279,400
ACL 156,760 153,080 740 4,460 514,974 4,259,431,824

when (setf *bytes-consed-between-gcs* 8000000) for CMUCL

CMUCL 22,750 18,000 1,770 2,200 161,590,736


After turning sq into a macro for ACL's benefit:
Real Time User System GC time cons other
ACL 59,061 56,450 370 850 503,473 1,321,391,360

(here I should turn the CMUCL values extension into "the" declarations, but
it's late already...)

Timing for no output file (time (ray-test-no 5)) see attached file
Real Time User System GC time cons other
CMUCL 9,650 9,060 240 910 19,902,152
ACL 40,334 40,310 20 920 39 1,320,743,280


----Code:

;; from Graham's ANSI CL book, modified by Raymond Toy and Bulent
;; Murtezaoglu.
;; More speed can be gained by (1) fixing cmucl so
;; it works better, (2) going overboard with inline and such.
;; As it is, the code generates about 15 efficiency notes, some of which
;; I (BM) couldn't fix (the others I didn't bother with).

(declaim (optimize (speed 3) (safety 0) (debug 0))) ;; does compilation speed do anything?

#+cmu (declaim (start-block ray-test tracer tracer-no ray-test-no))
;; would crash with just ray-test

(eval-when (compile load eval)
  (defstruct (point (:conc-name nil))
    (x 0.0 :type single-float)
    (y 0.0 :type single-float)
    (z 0.0 :type single-float))
  (defstruct surface
    (color 0.0 :type single-float))
  (defstruct (sphere (:include surface))
    (radius 0.0 :type single-float)
    (center (required-argument) :type point))
  )

(declaim (inline sq))
(declaim (inline minroot))
(declaim (inline sqrt))
(declaim (inline mag))

#-allegro
(defun sq (x)
  (declare (type single-float x))
  (* x x))

#+allegro
(defmacro sq (x)
  (let ((foo (gensym)))
    `(let ((,foo ,x)) (declare (type single-float ,foo)) (* ,foo ,foo))))

(defun mag (x y z)
  (declare (type single-float x y z))
  (sqrt (+ (sq x) (sq y) (sq z))))

(defun unit-vector (x y z)
  (declare (type single-float x y z)
           (values (single-float 0.0 1.0)
                   (single-float 0.0 1.0)
                   (single-float 0.0 1.0)))
  (let ((d (mag x y z)))
    (values (/ x d) (/ y d) (/ z d))))

(defun distance (p1 p2)
  (declare (type point p1 p2)
           (values single-float))
  (mag (- (x p1) (x p2))
       (- (y p1) (y p2))
       (- (z p1) (z p2))))

(defun minroot (a b c)
  (declare (type single-float a b c))
  (if (zerop a) ;; zerop doesn't get optimized for some reason, neither does =
      (/ (- c) b)
      (let ((disc (- (sq b) (* 4 a c))))
        (unless (minusp disc) ; cmucl should not need the cruft below but it does
          (let ((discrt (the (single-float 0.0 *)
                             (sqrt (the (single-float 0.0 *) disc)))))
            (min (/ (+ (- b) discrt) (* 2 a))
                 (/ (- (- b) discrt) (* 2 a))))))))

(defparameter *world* nil)

(defvar *image* nil)

(defconstant eye (make-point :x 0.0 :y 0.0 :z 200.0))

(defun tracer (pathname &optional (res 1))
  (declare (type (integer 1 100) res))
  (with-open-file (p pathname :direction :output)
    (format p "P2 ~A ~A 255" (* res 100) (* res 100))
    (let* ((inc (/ (float res)))
           (xymax (- (* res 100 inc) 50)))
      (declare (single-float xymax))
      (do ((y -50.0 (+ y inc)))
          ((< xymax y))
        (declare (single-float y))
        (do ((x -50.0 (+ x inc)))
            ((< xymax x))
          (declare (single-float x))
          (print (color-at x y) p))))))

(defun first-hit (pt xr yr zr)
  (declare (type point pt)
           (type single-float xr yr zr))
  (let (surface hit (dist most-positive-single-float))
    (declare (type single-float dist))
    (dolist (s *world*)
      (declare (type sphere s))
      (let ((h (intersect s pt xr yr zr)))
        (when h
          (let ((d (distance h pt)))
            (when (< d dist) ;; was (or (null dist) (< d dist))
              (setf surface s hit h dist d))))))
    (values surface hit)))

(defun lambert (s int xr yr zr)
  (declare (type single-float xr yr zr)
           (type point int))
  (multiple-value-bind (xn yn zn)
      (normal s int)
    (max 0.0 (+ (* xr xn) (* yr yn) (* zr zn)))))

#+nil
(defun intersect (s pt xr yr zr) ;; note the dispatch on type (slow)

#+nil
(defun normal (s pt);; note the dispatch on type (slow)

;;; these are the no-output versions

(defvar *imgarray*)

(defun tracer-no (&optional (res 1))
  (declare (type (integer 1 100) res))
  (setf *imgarray*
        (make-array (* 100 100 res res) :element-type '(unsigned-byte 8)
                    :fill-pointer 0))
  (let* ((inc (/ (float res)))
         (xymax (- (* res 100 inc) 50)))
    (declare (single-float xymax))
    (do ((y -50.0 (+ y inc)))
        ((< xymax y))
      (declare (single-float y))
      (do ((x -50.0 (+ x inc)))
          ((< xymax x))
        (declare (single-float x))
        (vector-push (color-at x y) *imgarray*)))))


(defun ray-test-no (&optional (res 1))
  (setf *world* nil)
  (defsphere 0.0 -300.0 -1200.0 200.0 .8)
  (defsphere -80.0 -150.0 -1200.0 200.0 .7)
  (defsphere 70.0 -100.0 -1200.0 200.0 .9)
  (do ((x -2.0 (1+ x)))
      ((> x 2.0))
    (declare (type single-float x))
    (do ((z 2.0 (1+ z)))
        ((> z 7.0))
      (declare (type single-float z))
      (defsphere (* x 200) 300.0 (* z -400) 40.0 .75)))
  (tracer-no res))


David J. Fiander
Mar 12, 1998

Bulent Murtezaoglu <mu...@servtech.com> writes:
> (declaim (inline sq))
> (declaim (inline minroot))
> (declaim (inline sqrt))
> (declaim (inline mag))
>
> #-allegro
> (defun sq (x)
> (declare (type single-float x))
> (* x x))

I found, for CMUCL at least, that once you've declaimed the
function inline, there's no need to declare the types in the
function. The compiler will produce efficiency notes when
compiling sq, minroot, and mag, but when you actually _call_ minroot
elsewhere with the types of the arguments declared properly, it
will very happily inline just the correct version of the code and
perform an inline single-float multiplication.
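Something along these lines (untested sketch):

```lisp
;; Untested sketch: sq carries no declarations of its own; the type
;; information flows in from the call site once sq is inlined there.
(declaim (inline sq))
(defun sq (x) (* x x))

(defun mag (x y z)
  (declare (type single-float x y z))  ; call-site types reach the inlined sq
  (sqrt (+ (sq x) (sq y) (sq z))))
```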

- David

Raymond Toy
Mar 12, 1998

Bulent Murtezaoglu <mu...@servtech.com> writes:

>
> I also spend about an hour and half on top of what RT spent optimizing
> the code using CMUCL. Usually silencing CMUCL's efficiency notes gets

About an hour or two, part time.


>
> (defun minroot (a b c)
> (declare (type single-float a b c))
> (if (zerop a) ;;zerop doesn't get optimized for some reason, neither does =

This seems to be optimized for me.

> (/ (- c) b)
> (let ((disc (- (sq b) (* 4 a c))))
> (unless (minusp disc) ; cmucl should not need the cruft below but it does
> (let ((discrt (the (single-float 0.0 *) (sqrt (the (single-float 0.0 *) disc)))))

This might be because of the fact that for CMUCL (sqrt -0.0) is #c(0.0
0.0) and -0.0 is an element of (single-float 0.0 *). You may want to
say (the (or (eql 0.0) (single-float (0.0))) disc) to get better
code. Alternatively, if you have the :negative-zero-is-not-zero
feature, then you probably don't need any extra declarations. I
didn't need them.

> (defun color-at (x y)
> (declare (type single-float x y))
> (multiple-value-bind (xr yr zr)
> (unit-vector (- x (x eye))
> (- y (y eye))
> (- 0 (z eye)))
> (let ((color (* (sendray eye xr yr zr) 255.0)))
> (declare (type (single-float 0.0 (255.0)) color))
> (floor (+ color 0.5)))))

Since color > 0, it helps to replace floor with truncate in this case
because truncate can use the FPU directly. (Hmm, need to add this
optimization to CMUCL.)
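In other words, something like this sketch of the change:

```lisp
;; Sketch of the floor-to-truncate change: for color in [0.0, 255.0)
;; the two agree, and truncate can be open coded on the FPU.
(defun color-byte (color)
  (declare (type (single-float 0.0 (255.0)) color))
  (truncate (+ color 0.5)))
```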

With these changes, I now get only 3 compiler efficiency notes from
minroot. The run time is now

Evaluation took:
18.07 seconds of real time
15.23 seconds of user run time
1.41 seconds of system run time
[Run times include 1.96 seconds GC run time]
0 page faults and
151782136 bytes consed.

About 3-4 times faster than my previous post.

Ray

Bulent Murtezaoglu
Mar 12, 1998

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
RT> About an hour or two, part time.

So it's probably safe to say that it took about 2 solid hours total
to get to fast code (adjusting for my inexperience with CMUCL).
I say this is a bit too long, even when we consider that we weren't
the original authors. It would have been much easier, of course, if
we [I?] had used tools like graphical profilers etc. On the plus side,
the resulting code is not too cluttered IMHO (and some of the clutter
is probably unnecessary). I'd be very interested to see what kind of
performance can be gotten from LW and Lucid/Liquid, and at what price.

It would also be nice to get some figures from C code. And maybe compare
.s files with the output from disassemble.

>> (defun minroot (a b c) (declare (type single-float a b c)) (if
>> (zerop a) ;;zerop doesn't get optimized for some reason,
>> neither does =

RT> This seems to be optimized for me.

On mine I get an efficiency note warning me about some compiler-generated
temporaries not being guaranteed to be the same type so zerop doesn't get
open coded. I don't remember the exact note. Maybe you are running a
different version?

>> (/ (- c) b) (let ((disc (- (sq b) (* 4 a c)))) (unless (minusp
>> disc) ; cmucl should not need the cruft below but it does (let
>> ((discrt (the (single-float 0.0 *) (sqrt (the (single-float 0.0
>> *) disc)))))

RT> This might be because of the fact that for CMUCL (sqrt -0.0)
RT> is #c(0.0 0.0) and -0.0 is an element of (single-float 0.0 *).
RT> You may want to say (the (or (eql 0.0) (single-float (0.0)))
RT> disc) to get better code. Alternatively, if you have the
RT> :negative-zero-is-not-zero feature, then you probably don't
RT> need any extra declarations. I didn't need them.

Hmmm. My comment concerns:

(unless (minusp disc) ;up till here we know disc is a single-float
(let ((disc (sqrt disc))) ; here we know disc is non-negative
...

I was hoping the compiler would notice that additional constraint on
disc and not need any declarations to open code sqrt. Does your
explanation above apply to minusp? What does [should?] minusp do to -0.0?

[...]
RT> Evaluation took: 18.07 seconds of real time 15.23 seconds of
RT> user run time 1.41 seconds of system run time [Run times
RT> include 1.96 seconds GC run time] 0 page faults and 151782136
RT> bytes consed.

Did you also try the no-output version? I'll try to run your new code
tonight and see what happens.

BM

Raymond Toy
Mar 12, 1998

Bulent Murtezaoglu <mu...@servtech.com> writes:

> >>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
> RT> About an hour or two, part time.
>
> So, it's probably safe to say that it took about 2 solid hours total
> to get to fast code (adjusting for my inexperience with cmucl).

Perhaps. I think if I had concentrated on it, the conversion should
have only taken 30 minutes or so. The compiler notes were very good.

> I say this is a bit too long even when we consider we weren't
> the original authors. It would have been much easier of course if
> we [I?] used tools like graphical profilers etc. On the plus side,

Actually, I did use CMUCL's profile to figure out some things. I
think sphere-intersect takes up most of the time and generates by far
the most garbage.

> open coded. I don't remember the exact note. Maybe you are running a
> different version?

I was running the latest development version.

>
> (unless (minusp disc) ;up till here we know disc is a single-float
> (let ((disc (sqrt disc))) ; here we know disc is non-negative
> ...
>
> I was hoping the compiler would notice that additional constraint on
> disc and not need any declarations to open code sqrt. Does your
> explanation above apply to minusp? What does [should?] minusp do to -0.0?

It does notice the constraint (well, it's supposed to). However, the
problem is that CMUCL currently has (sqrt -0.0) returning #c(0.0 0.0),
so the generic sqrt needs to be called. Also (minusp -0.0) = (minusp
0.0) = NIL, as required by IEEE arithmetic. To get CMUCL to produce
the desired code, disc has to be declared as (or (eql 0.0)
(single-float (0.0) *)) so that the compiler knows -0.0 isn't a valid
value. My version has the :negative-zero-is-not-zero feature, so I can
say (single-float 0.0 *) to get the desired result. (This feature
seems to contradict ANSI CL, but ANSI CL doesn't really quite cover
this issue either.)


> Did you also try the no-output version? I'll try to run your new code
> tonight and see what happens.
>

Here is the no-output result:

Evaluation took:
6.38 seconds of real time
6.1 seconds of user run time
0.1 seconds of system run time
[Run times include 0.11 seconds GC run time]
0 page faults and
9944752 bytes consed.

3 times better than before. The I/O generates a lot of garbage and is
pretty slow!

Ray

Marc Battyani
Mar 12, 1998

Bulent Murtezaoglu wrote in message <87btvc8...@isttest.bogus>...


>
>Ok here are my results using both the Graham's original, Raymond Toy's
>code and my modifications to RT's code for both ACL and CMUCL.

>Hardware: Homegrown system, PentiumPro 180 overclocked to 231MHz with
>256K L1 cache and 128M EDO ram sitting on a 77MHz bus.
>
>OS: Linux 2.0.32. Fileserver on NFS over 10-base-2 (slow server, lightly
>loaded network).
>
>Lisp:
>ACL is Allegro Common Lisp Linux version fully patched up to November 97
>CMUCL is the version dated 30/01/98 off ftp.cons.org (Peter VanEynde's
>binary kit)

>Original code + global (declaim (optimize (speed 3) (safety 0) (debug 0)))
>
>       Real Time   User      System  GC time   cons       other
>ACL    185,948     182,750   600     2,370     507,649    2,238,455,232
>CMUCL  285,890     266,300   16,050  50,430    1,422,346,560

>RT's code as posted (w/ the bug but I did add one #+ to get allegro to load it)
>       Real Time   User      System  GC time   cons       other
>ACL    158,726     154,930   650     4,550     512,941    -19,566,024 (!)
>CMUCL  117,530     99,230    15,340  45,910    1,288,595,072

Here are the results for Harlequin LispWorks 4.0.1 with latest patches on a
Pentium Pro 200 MHz with 80 MB RAM
Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float 3)))

CL-USER 15 > (extended-time (ray-test 5))
757.5 seconds used.
Standard Allocation 3397066608 bytes.
Fixlen Allocation 9912 bytes.
total gc activity = 172.106
main promote ( 4 calls) = 0.110
mark and sweep (13235 calls) = 171.996
internal promote ( 4 calls) = 0.000
promote ( 0 calls) = 0.000
fixup ( 4 calls) = 0.16
compact ( 0 calls) = 0.000
NIL

>RT's code as posted (w/ the bug but I did add one #+ to get allegro to load it)

CL-USER 16 > (extended-time (ray-test-no 5))
613.9 seconds used.
Standard Allocation 4535355248 bytes.
Fixlen Allocation 1056 bytes.
total gc activity = 187.337
main promote ( 4 calls) = 1.346
mark and sweep (17683 calls) = 185.991
internal promote ( 5 calls) = 0.94
promote ( 0 calls) = 0.000
fixup ( 6 calls) = 0.438
compact ( 0 calls) = 0.000
NIL

Marc Battyani


Bulent Murtezaoglu

Mar 12, 1998, 3:00:00 AM

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:

RT> Perhaps. I think if I had concentrated on it, the conversion
RT> should have only taken 30 minutes or so. The compiler notes
RT> were very good.

I'll haggle with you on that! _I_ couldn't have gotten as far as I got
even starting from your already-optimized code in 30 min. I had to
fire up allegro, get it to compile the file and run it to convince
myself that the bug I was seeing wasn't a cmucl artifact. For the slower
versions of the code you have to wait an appreciable amount to test after
a few steps. Given what we are trying to get a sense of, I think it is not
unreasonable to assume that programmers will introduce bugs during
optimization, and incremental improvement/testing will take place (out of
habit if nothing else). So I'd say pretty close to an hour for me if I
had everything (manuals and such) handy. What do other people think?

RT> I was running the latest development version.

That probably explains a number of things. I haven't figured out how to
build cmucl from sources yet...

[...]
RT> Here is the no-output result:

RT> Evaluation took: 6.38 seconds of real time 6.1 seconds of user
RT> run time 0.1 seconds of system run time [Run times include
RT> 0.11 seconds GC run time] 0 page faults and 9944752 bytes
RT> consed.

RT> 3 times better than before. The I/O generates a lot of
RT> garbage and is pretty slow!

Indeed. Maybe not too surprising given that we're using print. I
wonder what's still producing the 9 MB of garbage? Unless I screwed
up its declaration, the image array should take about 250k; even with
another few tens of kilobytes for miscellaneous unavoidable consing, we
still have a huge amount. I'm not familiar with CMUCL's profiler; maybe
you could post something about what's generating all this garbage?

BM

Raymond Toy

Mar 12, 1998, 3:00:00 AM

Bulent Murtezaoglu <mu...@servtech.com> writes:

> >>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
>
> RT> Perhaps. I think if I had concentrated on it, the conversion
> RT> should have only taken 30 minutes or so. The compiler notes
> RT> were very good.
>
> I'll haggle with you on that! _I_ couldn't have gotten as far as I got
> even starting from your already-optimized code in 30 min. I had to
> fire up allegro, get it to compile the file and run it to convince
> myself that the bug I was seeing wasn't a cmucl artifact. For the slower

Perhaps I'm being optimistic? :-) I don't remember. But I didn't try
running anything else except CMUCL, so that saves some time.


>
> RT> I was running the latest development version.
>
> That probably explains a number of things. I haven't figured out how to
> build cmucl from sources yet...

It's not easy to do, unless you're just compiling the current version
with the current version, but you would need to do that anyway. :-)

> I'm not familiar with cmu's profiler, maybe you could post sth. about
> what's generating all this garbage?
>

First do (profile:profile-all) to turn on profiling everywhere. Run
the code, then (profile:report-time) to see the results, and
(profile:reset-time) to reset the time. Here are some results that I
got from your version. I had to turn off the block compilation,
because otherwise the individual functions wouldn't exist to be profiled.
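The workflow just described, collected into one session (RAY-TEST is the
benchmark from this thread):

```lisp
;; CMUCL profiling session using the PROFILE package as described above.
(profile:profile-all)    ; instrument every currently defined function
(time (ray-test 2))      ; run the workload under the profiler
(profile:report-time)    ; per-function seconds, bytes consed, call counts
(profile:reset-time)     ; zero the counters before another run
```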

* (time (ray-test 2))
[snip]
Evaluation took:
27.96 seconds of real time
20.89 seconds of user run time
5.48 seconds of system run time
[Run times include 0.48 seconds GC run time]
0 page faults and
30963304 bytes consed.
NIL
* (profile:report-time)
Seconds | Consed | Calls | Sec/Call | Name:
------------------------------------------------------
0.145 | 1,470,600 | 61,275 | 0.00000 | UNIT-VECTOR
0.043 | 500,976 | 20,874 | 0.00000 | SPHERE-NORMAL
0.010 | 264 | 33 | 0.00029 | DEFSPHERE
0.001 | 258,800 | 32,350 | 0.00000 | DISTANCE
0.000 | 1,295,320 | 32,383 | 0.00000 | MAKE-POINT
0.000 | 333,984 | 40,401 | 0.00000 | SENDRAY
0.000 | 0 | 1,333,233 | 0.00000 | INTERSECT
0.000 | 0 | 40,401 | 0.00000 | FIRST-HIT
0.000 | 1,035,200 | 1,333,233 | 0.00000 | SPHERE-INTERSECT
0.000 | 1,056 | 33 | 0.00000 | MAKE-SPHERE
0.000 | 23,615,544 | 1 | 0.00000 | TRACER
0.000 | 0 | 20,874 | 0.00000 | NORMAL
0.000 | 834,960 | 20,874 | 0.00000 | LAMBERT
0.000 | 1,616,040 | 40,401 | 0.00000 | COLOR-AT
0.000 | 512 | 1 | 0.00000 | RAY-TEST
------------------------------------------------------
0.199 | 30,963,256 | 2,976,367 | | Total

Estimated total profiling overhead: 23.81 seconds

So, actually unit-vector takes the most time now and tracer generates
the most garbage.

Ray

Bulent Murtezaoglu

Mar 13, 1998, 3:00:00 AM

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:

RT> Here is the no-output result:

RT> Evaluation took: 6.38 seconds of real time 6.1 seconds of user

RT> run time 0.1 seconds of system run time [Run times include
RT> 0.11 seconds GC run time] 0 page faults and 9944752 bytes
RT> consed.

After incorporating your suggestions and a few other minor things
the best I can do over here is

Evaluation took:
5.41 seconds of real time
5.14 seconds of user run time
0.14 seconds of system run time
[Run times include 0.36 seconds GC run time]
0 page faults and
8300816 bytes consed.

(array stuffing version with CMUCL)

Here's what I couldn't get to work (i.e., I gave up after a cursory look):

inlining unit-vector seems to break type inference somehow (either that or
there's some subtlety I don't understand right now).

I tried explicitly defining the structs as vectors, hoping to save some
more space, but apparently structs defined as unnamed vectors cannot be
used in type declarations under the name given to defstruct. It seemed
too ugly to declare points and spheres explicitly as vectors, so I didn't
go for that workaround.
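For reference, the workaround in question would look roughly like this (a
sketch; the exact vector type one has to declare at the call sites is
implementation-dependent):

```lisp
;; With an explicit :TYPE, DEFSTRUCT represents the struct as a
;; specialized vector and still defines MAKE-POINT and the accessors,
;; but POINT is no longer usable as a type in declarations.
(defstruct (point (:type (vector single-float)) (:conc-name nil))
  (x 0.0 :type single-float)
  (y 0.0 :type single-float)
  (z 0.0 :type single-float))

(defun point-mag (p)
  ;; (declare (type point p)) would be invalid here; we must name the
  ;; underlying vector type instead.
  (declare (type (simple-array single-float (3)) p))
  (sqrt (+ (* (x p) (x p)) (* (y p) (y p)) (* (z p) (z p)))))
```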

BM

Bulent Murtezaoglu

Mar 13, 1998, 3:00:00 AM

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
[...]
RT> Perhaps I'm being optimistic? :-) I don't remember. But I
RT> didn't try running anything else except CMUCL, so that saves
RT> some time.

Maybe you are, but more importantly one of the points of this exercise
is to get some sense of whether authors like Norvig and Graham are being
too optimistic about how much time it would take to go from elegant working
code to something within 2-3x of C.

I think for simple things that could also be easily implemented in C right
from the start, it isn't obvious that Common Lisp is a good environment.
Incremental development etc. is all very nice, but when the smallest problem
you can try takes as long to run as several edit-make-crash cycles in C,
the benefits of Lisp are less clear. The raytracing example does not expose
this problem adequately because you can go to tiny pictures and do your
exploratory programming there if you need to. Nonetheless I think the point
is valid, as there are problems where Lisp could really shine but doesn't
because of inadequate performance. (machine learning comes to my mind where
for some algorithms there's a knee in the problem size graph you have to hit
to observe interesting stuff. If the smallest such interesting problem
takes hours with unoptimized code, some of the benefits of Lisp are lost.
This isn't horrible, and one can't have everything, but maybe we ought
to recognize that costs for all this dynamic luxury might be prohibitive
more often than one would first think.)

We should also note that we got the relatively decent speeds using
non-ANSI extensions on an experimental implementation. Lisp vendors
probably don't get many requests for good float performance.

>> That probably explains a number of things. I haven't figured
>> out how to build cmucl from sources yet...

RT> It's not easy to do, unless you're just compiling the current
RT> version with the current version, but you would need to do
RT> that anyway. :-)

Yeah, I found out the other day that it isn't easy. It appears that having
functions/variables/macros available by name at run time in the working
Lisp complicates things when it comes to compiling the compiler.
Am I understanding this correctly? I naively thought that some clever
package manipulation and uninterning would fix things in an elegant manner,
but didn't get very far even in my head. Kept thinking about two levels of
packages with relative paths, but there's no such thing in Common Lisp.
Anyhow, how do you guys do it? Is this documented anywhere? What am I
missing?


BM

Erik Naggum

Mar 13, 1998, 3:00:00 AM

* Bulent Murtezaoglu

| I think for simple things that could also be easily implemented in C
| right from the start, it isn't obvious that Common Lisp is a good
| environment. Incremental development etc. is all very nice, but when the
| smallest problem you can try takes as long to run as several
| edit-make-crash cycles in C, the benefits of Lisp are less clear.

one of the curious properties of programming in Common Lisp is that you
adopt different styles according as the needs you foresee for your code
differ. usually, you know beforehand that some code is speed-critical
and you care about consing, declarations, and such early, and take care
not to violate the principles that you know will result in fast code.

adopting a number of different styles and becoming willing to write code
that is not "beautiful Lisp" but only _exports_ a "beautiful Lisp"
interface takes a lot of experience. I remember I spent a lot of time
getting to grips with the need to write really _ugly_ code, but then it
dawned on me that while I liked C because I could bury ugliness in code
and I like Common Lisp because it is elegant and beautiful, these are
_not_ mutually exclusive qualities. I can bury ugliness in Common Lisp,
too, but if I take care to write an elegant interface to it, it's gone
forever, quite unlike C, where the interface still sucks.

there's one thing I miss, though, but I'll make that a separate thread.

#:Erik
--
God grant me serenity to accept the code I cannot change,
courage to change the code I can, and wisdom to know the difference.

Jason Trenouth

Mar 13, 1998, 3:00:00 AM

On Thu, 12 Mar 1998 20:29:14 +0100, "Marc Battyani"
<marc_b...@email.msn.com> wrote:

> Here are the results for Harlequin LispWorks 4.0.1 with latest patches on a
> Pentium Pro
> 200 MHz with 80 Mo RAM
> Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float 3)))

^^^^^^^

NB "Float" is a misnomer for "float-safety" in LispWorks, so you really want
"(float 0)". Also this won't make much difference in the current PC release
anyway. A future PC release will include floating point optimization
comparable with LispWorks on Unix.

__Jason

Raymond Toy

Mar 13, 1998, 3:00:00 AM

Bulent Murtezaoglu <mu...@servtech.com> writes:

>
> inlining unit-vector seems to break type inference somehow (either that or
> there's some subtlety I don't understand right now).

This should work. I'll take a peek at it.

>
> I tried explicitly defining the structs as vectors hoping to save some
> more space but apparently unnamed vector structs cannot be used as types
> with the name given to defstruct in declarations. It seemed too ugly
> to declare points and spheres explicitly as vectors, so I didn't go for
> that workaround.

Converting the structs to vectors might help, but I think structure
access is already quite fast, and it probably doesn't reduce consing.

Ray

Raymond Toy

Mar 13, 1998, 3:00:00 AM

Bulent Murtezaoglu <mu...@servtech.com> writes:

>
> We should also note that we got the relatively decent speeds using
> non-ANSI extensions on an experimental implementation. Lisp vendors

No, I disagree. The only non-ANSI extension was CMUCL's start-block,
which essentially wraps everything up in a labels form. You could
have done that yourself and made a portable version. Even without it,
I don't think the speed was that much slower.
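A portable sketch of the same idea (MAG3 is a hypothetical name; SQ and MAG
are from Graham's code):

```lisp
;; Portable stand-in for CMUCL's block compilation: wrapping the helpers
;; in LABELS makes the calls between them local calls, which the
;; compiler is then free to inline.
(defun mag3 (x y z)
  (labels ((sq (v) (* v v))
           (mag (x y z) (sqrt (+ (sq x) (sq y) (sq z)))))
    (declare (inline sq mag))   ; a hint; implementations may ignore it
    (mag x y z)))
```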

> probably don't get many requests for good float performance.

Perhaps, but when I sent a small comparison of FP performance between
CMUCL and Franz, an engineer from Franz volunteered a fair amount of
time helping me get good FP performance. He mentioned that he was
actively looking at improving the compiler to handle FP better. And
this was the free Linux version! If I ever need a commercial one, I
will surely look at Franz.

>
> >> That probably explains a number of things. I haven't figured
> >> out how to build cmucl from sources yet...
>
> RT> It's not easy to do, unless you're just compiling the current
> RT> version with the current version, but you would need to do
> RT> that anyway. :-)
>
> Yeah, I found out the other day that it isn't easy. It appears that having
> functions/variables/macros available by name at run time in the working
> Lisp complicates things when it comes to compiling the compiler.
> Am I understanding this correctly? I naively thought that some clever
> package manipulation and uninterning would fix things in an elegant manner,
> but didn't get very far even in my head. Kept thinking about two levels of
> packages with relative paths, but there's no such thing in Common Lisp.
> Anyhow, how do you guys do it? Is this documented anywhere? What am I
> missing?

Compiling a new version from an old is sometimes quite hard. What you
say can be done and has been, I hear, but for the really hard builds,
I think a cross compiler (either from a different architecture, or the
same architecture) is used. I've never used the cross compiler.

If you want more info about building, try www.cons.org. For
complicated builds, join the mailing list. Or, do like me, and just
wait for someone to build it for you. :-)

Ray

Christopher J. Vogt

Mar 13, 1998, 3:00:00 AM
to Bulent Murtezaoglu

Bulent Murtezaoglu wrote:
>
> >>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
> [...]
> RT> Perhaps I'm being optimistic? :-) I don't remember. But I
> RT> didn't try running anything else except CMUCL, so that saves
> RT> some time.
>
> Maybe you are, but more importantly one of the points of this exercise
> is to get some sense of whether authors like Norvig and Graham are being
> too optimistic about how much time it would take to go from elegant working
> code to something within 2-3x of C.

I forget if it was Steele or Gabriel who observed that one of the things that
makes programming in Common Lisp uniquely different from other languages is
that to get good performance, you have to be intimately familiar with your
implementation. This sort of issue doesn't exist in C, for example. One
implementation of C is generally as fast (within a factor of 2) as another.
However, CL implementations can vary quite a bit, in different areas. One
might have (relatively) good FP performance but horrible Bignum performance
etc.

In my experience with CL, much of it involving graphics and FP, the two
areas that CL implementations fall far behind C in terms of performance are
FP and I/O. So if there is something in particular about CL that you want
to benchmark, e.g. array performance, make sure that you aren't doing any
I/O and are not using FP numbers. I find it interesting that the original
message that started this thread was titled "Floating Point speed in CL",
but the code wasn't doing very much FP, it was mostly ratios!

>
> I think for simple things that could also be easily implemented in C right
> from the start, it isn't obvious that Common Lisp is a good environment.
> Incremental development etc. is all very nice, but when the smallest problem
> you can try takes as long to run as several edit-make-crash cycles in C,

> the benefits of Lisp are less clear. The raytracing example does not expose
> this problem adequately because you can go to tiny pictures and do your
> exploratory programming there if you need to. Nonetheless I think the point
> is valid, as there are problems where Lisp could really shine but doesn't
> because of inadequate performance. (machine learning comes to my mind where
> for some algorithms there's a knee in the problem size graph you have to hit
> to observe interesting stuff. If the smallest such interesting problem
> takes hours with unoptimized code, some of the benefits of Lisp are lost.
> This isn't horrible, and one can't have everything, but maybe we ought
> to recognize that costs for all this dynamic luxury might be prohibitive
> more often than one would first think.)

I think that it is *very* important that everyone understand the difference
between a language i.e. Common Lisp, and an implementation, i.e. CMUCL. There
is very little in Common Lisp the language that prohibits somebody from
developing an implementation that is competitive with C in terms of performance.
It may not be easy, and it may not exist yet, and it will require the proverbial
SSC (Sufficiently Smart Compiler), but it can be done! Common Lisp FP is not
inherently slower than C; FP in every CL implementation I have used has been
slower than C, but it doesn't have to be so.

>
> We should also note that we got the relatively decent speeds using
> non-ANSI extensions on an experimental implementation. Lisp vendors

> probably don't get many requests for good float performance.

I don't know how many requests they get for good FP performance, but I can
guarantee you that they do get some, speaking from personal experience :-)

>
>[...]

--
Christopher J. Vogt - Computer Consultant - Lisp, AI, Graphics, etc.
http://members.home.com/vogt/

Erik Naggum

Mar 13, 1998, 3:00:00 AM

* Christopher J. Vogt

| I forget if it was Steele or Gabriel who observed that one of the things
| that makes programming in common lisp uniquely different from other
| languages, is that to get good performance, you have to be intimately
| familiar with your implementation. This sort of issue doesn't exist in
| C, for example. One implementation of C is generally as fast (within a
| factor of 2) as another. However, CL implementations can vary quite a
| bit, in different areas. One might have (relatively) good FP performance
| but horrible Bignum performance etc.

I beg to differ. C systems (compilers, libraries, etc) generally differ
by just as large factors as Common Lisp, often more, when non-trivial
functions are used, and they tend to be for non-trivial programs, and
optimizing C is generally very hard beyond the simple cases.

I'm not saying that CL is doing a better job than C in the general case,
but I'm getting somewhat peeved with the repetitive claims that C is so
fast. it isn't. it never was. it's just that the language is fairly
transparent down to the machine level so when you change something in the
C code, you can predict the effect on the compiled machine code. that
doesn't make it _fast_, it only gratifies the control-freak aspect of
most programmers.

Marc Battyani

Mar 13, 1998, 3:00:00 AM

Jason Trenouth wrote in message <357c3188.3987049031@newshost>...


>On Thu, 12 Mar 1998 20:29:14 +0100, "Marc Battyani"
><marc_b...@email.msn.com> wrote:
>
>> Here are the results for Harlequin LispWorks 4.0.1 with latest patches on a
>> Pentium Pro 200 MHz with 80 MB RAM
>> Compiled with (proclaim '(optimize (debug 0) (safety 0) (speed 3)(float 3)))
>                                                                    ^^^^^^^
>
>NB "Float" is a misnomer for "float-safety" in LispWorks, so you really want
>"(float 0)". Also this won't make much difference in the current PC release


Oops, sorry.
I tried with (float 0) and you were right, it does not seem to change much.
BTW I didn't find "Float" in the LWW doc (section 7.4 Compiler Control)

>anyway. A future PC release will include floating point optimization
>comparable with LispWorks on Unix.
>
>__Jason

That's good news! We do OpenGL and signal processing apps in LWW so this
would be quite welcome.

Marc Battyani


Raffael Cavallaro

Mar 13, 1998, 3:00:00 AM

Rainer,

we've seen posts of times for CMUCL, Franz's ACL for Linux, and Harlequin's
LispWorks for Windows.

How about MCL 4.2? You do run that, don't you? We'd like to see what MCL's
floating point performance is like.

thanks,

raf


Matthew McDonald

Mar 16, 1998, 3:00:00 AM

Erik Naggum <cle...@naggum.no> writes:

> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial
> functions are used,

Do you have any examples to substantiate this claim?

-- Matthew McDonald <ma...@cs.uwa.edu.au>
"The Statistician is the natural predator of the Scientist" -- Ross Ihaka

Erik Naggum

Mar 16, 1998, 3:00:00 AM

* Matthew McDonald

| Do you have any examples to substantiate this claim?

not any good set of particular examples, but 15 years of experience with
C on lots of platforms and lots of compilers has alerted me to the fact
that C compilers and environments differ _very_ much in quality, more
than my experience with (native-compiled) Common Lisp implementations and
environments do. Common Lisp implementations are generally quite good,
but the same is not necessarily true for C compilers -- naive compilation
of C is often a viable option for small systems.

if you need a specific example, consider the bundled C compiler with
SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
appears to be slightly better than GCC.

I find it interesting that you jump up to require substantiation of the
claim that C implementations differ, but do not challenge that Common
Lisp implementations do. this is the general sentiment out there, as
sort of a cultural requirement on programmers.

C is not fast. C is primitive, which means it has a potential of being
fast if the programmer is a genius and the compiler is on his side. if
the programmer is an idiot, and the compiler expects smarter programmers,
you will get abysmal performance. e.g., a system currently written in C
but which I am replacing with one rewritten in Common Lisp shows a
promise of reducing system requirements by a factor of between 5 and 6,
after months of work to decipher the incredibly bad C code to figure out
what it actually does.

#:Erik
--
religious cult update in light of new scientific discoveries:
"when we cannot go to the comet, the comet must come to us."

Raymond Toy

Mar 16, 1998, 3:00:00 AM

ma...@cs.uwa.edu.au (Matthew McDonald) writes:

> Erik Naggum <cle...@naggum.no> writes:
>
> > I beg to differ. C systems (compilers, libraries, etc) generally differ
> > by just as large factors as Common Lisp, often more, when non-trivial
> > functions are used,
>

> Do you have any examples to substantiate this claim?
>

Compare gcc on a Linux Alpha system against the DEC C compiler for
Alpha. (Or was that f77 vs DEC Fortran?) I understand that
the difference in performance can be a factor of 2 or more for the
same code on the same machine.

See comp.os.linux.alpha.

Ray

Tim Bradshaw

Mar 16, 1998, 3:00:00 AM

* Erik Naggum wrote:

> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial

> functions are used, and they tend to be for non-trivial programs, and
> optimizing C is generally very hard beyond the simple cases.

I think that in general this is quite right (and in general most
programs are a long way away from the case where even language
performance matters), but in the *specific* case of heavily numerical
code it is significantly difficult to get good performance out of CL
implementations, and even harder to get good performance with code
which is nearly portable. (In fact, it's probably really hard to get
good performance in all cases where most of the data should be
represented as immediate objects rather than pointers, but numerical
code is the best and commonest example of this).

And in these cases, when you get it wrong (or the compiler isn't smart
enough), you often end up with stuff that is 10s or 100s of times
slower than it should be. Languages like C or Fortran are unlikely to
cause you to take this kind of hit in cases where everything is
basically a float or int of some flavour. Of course they will
silently add two large positive integers & get a large negative
result...
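The contrast with CL is easy to demonstrate: fixnum addition overflows into
bignums rather than wrapping (a quick illustration):

```lisp
;; In Common Lisp, integer arithmetic is exact: the sum of two large
;; positive integers is a positive bignum, never a wrapped negative.
(let ((n most-positive-fixnum))
  (plusp (+ n n)))   ; => T in any conforming implementation
```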

--tim

Andreas Eder

Mar 16, 1998, 3:00:00 AM

Tim Bradshaw <t...@aiai.ed.ac.uk> writes:

> And in these cases, when you get it wrong (or the compiler isn't smart
> enough), you often end up with stuff that is 10s or 100s of times
> slower than it should be. Languages like C or Fortran are unlikely to
> cause you to take this kind of hit in cases where everything is
> basically a float or int of some flavour. Of course they will
> silently add two large positive integers & get a large negative
> result...
>
> --tim

That is quite right, but then C code is also almost always at least a factor
of 2 slower than what a good Fortran compiler produces. I have seen this with
lots of programs in C and Fortran on Convex and Cray computers.
But then, this was 5 years ago, things may have changed.

Andreas Eder NetWorker (ReliantUnix)
SNI AG Phone: +49-89-636-42549
OEC HES XP 2 Fax: +49-89-636-40601
D-81739 Munich, Germany

EMail: mailto:Andrea...@mch.sni.de
for public access: http://www.sni.de/public/mr/nsr/nsr_en.htm
for SNI personnel only: http://athen.mch.sni.de/networker/welcome.htm

---
Is "Windows for Dummies" another case of "this sentence no verb?"


Christopher J. Vogt

Mar 16, 1998, 3:00:00 AM
to Erik Naggum

The following exchange got trimmed, and I have added it back in for direct
reference.

> * Christopher J. Vogt
> | I forget if it was Steele or Gabriel who observed that one of the things
> | that makes programming in common lisp uniquely different from other
> | languages, is that to get good performance, you have to be intimately
> | familiar with your implementation. This sort of issue doesn't exist in
> | C, for example. One implementation of C is generally as fast (within a
> | factor of 2) as another. However, CL implementations can vary quite a
> | bit, in different areas. One might have (relatively) good FP performance
> | but horrible Bignum performance etc.
>

Erik Naggum

> I beg to differ. C systems (compilers, libraries, etc) generally differ
> by just as large factors as Common Lisp, often more, when non-trivial
> functions are used, and they tend to be for non-trivial programs, and
> optimizing C is generally very hard beyond the simple cases.


Erik Naggum wrote:
>
> * Matthew McDonald


> | Do you have any examples to substantiate this claim?
>

> not any good set of particular examples, but 15 years of experience with
> C on lots of platforms and lots of compilers has alerted me to the fact
> that C compilers and environments differ _very_ much in quality, more
> than my experience with (native-compiled) Common Lisp implementations and
> environments do. Common Lisp implementations are generally quite good,
> but the same is not necessarily true for C compilers -- naive compilation
> of C is often a viable option for small systems.
>
> if you need a specific example, consider the bundled C compiler with
> SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
> the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
> appears to be slightly better than GCC.

Well, I did say "within a factor of 2", so you seem to be in violent agreement
with me? Yes, commercial C compilers vary in speed, generally within a factor
of 2. Commercial CL systems, however, seem to vary by much larger factors, in
many different areas.

Marco Antoniotti

Mar 16, 1998, 3:00:00 AM

ma...@cs.uwa.edu.au (Matthew McDonald) writes:

> Erik Naggum <cle...@naggum.no> writes:
>
> > I beg to differ. C systems (compilers, libraries, etc) generally differ
> > by just as large factors as Common Lisp, often more, when non-trivial
> > functions are used,
>

> Do you have any examples to substantiate this claim?
>

Here is a simple example:

#include <string.h>

void C_entry(void);

int
main(void)
{
    C_entry();
    return 0;
}

void
C_entry(void)
{
    char buf[50];           /* was "auto signed char buf[50]", which
                               strcpy's char* prototype rejects */
    strcpy(buf, "abcd");
}

The program is redundant, but it was written this way to prove
exactly the point of "unpredictability" of compiled C code. (The
compiler was for the ARM.) The compiler is so aggressive that it
recognizes the 'strcpy' and "inlines" it as a 'memcpy' when the
constant string passed to it fits in a given memory pattern.
Substituting "abcd" with "abc" makes the compiler change its strategy,
hence the optimization level. Substituting 'strcpy' with the obvious
counterpart 'mystrcpy' (taken straight out of K&R) seems to yield more
predictable performance.

Hence the proposition "there exist C compilers with various degrees
of resulting optimizations" is constructively proved :)

(Of course this may or may not be an answer to the question, but IMHO,
it comes pretty close).

The example is taken from lectures of the EECS 290a course held by
Professor Sangiovanni-Vincentelli at UCB last
fall. http://www-cad.eecs.berkeley.edu/~polis/class/index.html.
Check the second Lecture by Baker (actually the author is Grant) on
"Software Performance Estimation".

Cheers

--
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - (0)6 - 68 80 79 23, fax. +39 - (0)6 - 68 80 79 26
http://www.parades.rm.cnr.it

Matthew McDonald

Mar 17, 1998, 3:00:00 AM

I didn't really want to write a long article about this since (a) I'm
not a Lisp expert (though I've had some experience and success
optimising Lisp code under CMUCL) and (b) most of my experience comparing
lisp and C implementations is several years old and therefore probably
somewhat outdated.

That said ......

Erik Naggum <cle...@naggum.no> writes:

>| Do you have any examples to substantiate this claim?

> not any good set of particular examples,
[...]


> SunOS 4.X with the commercial compiler for Solaris 2.X. GCC 2.8.1 beats
> the former by a factor of 1.7, but the Solaris 2.X commercial C compiler
> appears to be slightly better than GCC.

I was originally planning to post a longer message in which I was
going to mention that I could remember seeing C code differ in
performance by a factor of 2 occasionally and around 3 only rarely.

Unless I misunderstood something, one of the versions of the ray-tracing
code ran *four* times faster on CMUCL than it did on ACL. Admittedly, it
had been optimised for CMUCL, but I'd be surprised to see a non-contrived
piece of C that revealed similar differences in C implementations.

> I find it interesting that you jump up to require substantiation of the
> claim that C implementations differ, but do not challenge that Common
> Lisp implementations do. this is the general sentiment out there, as
> sort of a cultural requirement on programmers.

I asked for examples because:

(1) I thought that by "large differences", we meant something on the
order of a factor of four - not factors of two or less.

(2) Your claim doesn't match my personal experience.

The thread has already given an example of a piece of code that
differed by a factor of 4 under two different lisp implementations. I
can't think of code I've seen that revealed similar differences
between serious C implementations.

> C is not fast. C is primitive, which means it has a potential of being
> fast if the programmer is a genius and the compiler is on his side.

With simple languages like C, the difference between naive compilation
and full optimisation is usually less than a factor of two (in my
experience). Look at the gcc vs oberon comparisons (the standard
oberon compilers do practically no optimisation) or even gcc with and
without optimisation.

Even a stupid C compiler can typically produce code that's within
spitting distance of what's produced by an optimising compiler,
because the language is basically a portable assembler. The same can't
be said of common lisp. When you're optimising floating point code in
C you don't usually need to go to much trouble to ensure that the
major operations are on unboxed data!

I keep seeing lisp people say that code written in lisp can easily be
made to perform as well as code written in C, and it just doesn't
match my (limited) experience. With CMUCL, floating point code can be
made to run about as fast as in C, but it seems like hard work. I'm
not even sure it's feasible in ACL.

It was only three or four years ago that a well-known lisp implementor
posted something here suggesting that: when you want lots of speed
from a certain lisp implementation, you should estimate the number of
registers required by the sub-expressions of important expressions so
you can then rearrange the expression to minimise the number of loads
and stores during the execution of the expression, since the compiled
code would execute the expression strictly left-to-right.

In the C/pascal/FORTRAN world, even straightforward compilers weigh
subexpressions before deciding what order to evaluate them in.

Perhaps I'm missing something, or I misunderstood the "well-known lisp
implementor", and the advice he posted was thirty years out of date and
for historical interest only.

The only substantial example of competitive lisp code I can remember
is the Koza genetic algorithm code, which was easily optimised to the
point where it beat the C equivalent. I'd be delighted to see
large numbers of significant examples of straightforward lisp programs
that performed within a factor of two of C equivalents.

-- Matthew McDonald <ma...@cs.uwa.edu.au>

Appendix: to substantiate the claim that C implementations often don't
differ dramatically, here's a benchmark given at the end of the lcc
OVERVIEW.TEX file (from some time in 1993).

benchmark
compiler     1. gcc1.35   8. espresso   22. li   23. eqntott


VAX: MicroVAX II w/16MB running Ultrix 3.1
lcc 1734 2708 7015 3532
cc 1824 2782 7765 3569
cc -O 1661 2653 7086 3757
gcc 1439 2757 7353 3263
gcc -O 1274 2291 6397 1131


68020: Sun 3/60 w/24MB running SunOS 4.0.3
lcc 544 1070 2591 567
cc 514 1005 3308 619
cc -O 428 882 2237 571
gcc 426 1048 2498 591
gcc -O 337 834 1951 326


MIPS: IRIS 4D/220GTX w/32MB running IRIX 3.3.1
lcc 116 150 352 111
cc 107 153 338 100
cc -O 92 130 299 70
gcc 188 502 132
gcc -O 145 411 112


SPARC: Sun 4/260 w/32MB running SunOS 4.0.3
lcc 196 370 790 209
cc 203 381 1094 275
cc -O 150 296 707 183
gcc 186 411 1139 256
gcc -O 127 309 788 179

Martin Volf

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

Matthew McDonald wrote:
>
> With simple languages like C, the difference between naive compilation
> and full optimisation is usually less than a factor of two (in my
> experience). Look at the gcc vs oberon comparisions (the standard
> oberon compilers do practically no optimisation) or even gcc with and
> without optimisation.
>

Well, this is just what I have done some time ago, and the result may be
quite surprising:
without optimization (-O0) 2000 ms
optimizing (-O2) 330 ms

So you have a factor of 6. The program was very simple, something like
bubble-sort. (I looked carefully to make sure that no huge job was running
simultaneously, and tried it several times, so it is no mistake.)
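For context, the kind of program Martin describes is roughly the following (a hypothetical reconstruction, not his actual code); it is memory-access heavy, which is exactly where -O2's register allocation and load/store elimination pay off:

```c
/* Naive bubble sort: repeatedly swap adjacent out-of-order
   elements. At -O0 every a[] access is a memory load or store;
   at -O2 the hot values live in registers, hence the large gap. */
void bubble_sort(int *a, int n)
{
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++)
            if (a[j] > a[j + 1]) {
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
}
```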

Martin
(martin.volf at st dot ms dot mff dot cuni dot cz)

John Watton

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to


>I keep seeing lisp people say that code written in lisp can easily be
>made to perform as well as code written in C, and it just doesn't
>match my (limited) experience. With CMUCL, floating point code can be
>made to run about as fast as in C, but it seems like hard work. I'm
>not even sure it's feasible in ACL.

Of course it's feasible in ACL (4.3 variety on unix, NT and linux). I do
it all the time. I posted the benchmark below on comp.lang.lisp
some time ago and it didn't generate a fraction of the interest directed
at the ray tracing code. (Unlike the ray tracing code, I guess the
benchmark was too easy to get your hands around.) As you can see it is
comparable to unoptimized C code. With ACL I find I use the non-ANSI
declare keyword :explain to tell me where the boxing is (declare
(:explain :boxing :calls :types)). Also in ACL there is a non-ANSI and
undocumented feature called immediate-args for passing and returning
unboxed arguments. I don't use it in this benchmark but it is
available at least on the Unix variety. I suspect that this is one
area where the ray tracing code suffers in the ACL version. In general
I prefer to pass floats inside arrays to avoid boxing - it's easy and
ANSI. But I don't do that here, to be fair to C.

John Watton
ALCOA

Machine: Pentium 90 Mhz, 40 Meg RAM
OS: Windows NT

Double floating point benchmark
Pnpoly area sec rel
----------- --- ---
C | VC++ 4.0 (cl -Ox) 16.5 1.0
Java | JDK 1.1.4 (JIT) 20.2 1.2
C | VC++ 4.0 (cl) 23.5 1.4
Lisp | Franz ACL 4.3.2 25.6 1.6
Java | JDK 1.1.4 (-O) 106.9 6.5
Lisp | Franz ACL 3.0.1 521.3 31.6
Perl | Perl5.003 1007.0 61.0


PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C_PNPOLY_C
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int pnpoly(int npol, double *xp, double *yp, double x, double y)
{
int i, j, c = 0;
for (i = 0, j = npol-1; i < npol; j = i++) {
if ((((yp[i]<=y) && (y<yp[j])) ||
((yp[j]<=y) && (y<yp[i]))) &&
(x < (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i]))
c = !c;
}
return c;
}

int main(int argc, char *argv[]) {
int npol = atoi(argv[1]); /* 200 */
int div = atoi(argv[2]); /* 400 */
int count = 0;
double pi = acos(-1.0);
double* xp;
double* yp;
double theta;
double a = 10.0;
double fdiv = 2*a/div;
int i=0, j=0;
xp = malloc(npol*sizeof(double));
yp = malloc(npol*sizeof(double));
for(i=0;i<npol;i++) {
theta = i*2*pi/npol;
xp[i] = a+a*pow(cos(theta), 3);
yp[i] = a+a*pow(sin(theta), 3);
}
for(i=0;i<=div;i++) {
for(j=0;j<=div;j++) {
if (pnpoly(npol,xp,yp,i*fdiv,j*fdiv)) count++;
}
}
printf("\nArea: %f \n", 4*a*a*count/((div+1)*(div+1)));
return(0);
}

PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP_PNPOLY_LISP
(defun pnpoly (npol xp yp x y)
(declare (optimize (speed 3) (safety 0))
(fixnum npol)
(double-float x y)
(type (simple-array double-float (*)) xp yp))
(let* ((c nil)
(j (1- npol)))
(declare (fixnum j))
(dotimes (i npol c)
(declare (fixnum i))
(if (and (or (and (<= (aref yp i) y) (< y (aref yp j)))
(and (<= (aref yp j) y) (< y (aref yp i))))
(< x (+ (aref xp i) (/ (* (- (aref xp j) (aref xp i))
(- y (aref yp i)))
(- (aref yp j) (aref yp i))))))
(setq c (not c)))
(setq j i))))

(defun pnpolymain (npol div) ; call with 200 400
(declare (optimize (speed 3) (safety 0))
(fixnum npol div))
(let* ((xp (make-array npol :element-type 'double-float))
(yp (make-array npol :element-type 'double-float))
(theta 0.0d0)
(a 10.0d0)
(fdiv (/ (* 2 a) div))
(count 0))
(declare (double-float fdiv a theta)
(fixnum count)
(type (simple-array double-float (*)) xp yp))
(dotimes (i npol)
(declare (fixnum i))
(setq theta (/ (* 2 i pi) npol))
(setf (aref xp i) (+ a (* a (expt (cos theta) 3)))
(aref yp i) (+ a (* a (expt (sin theta) 3)))))
(dotimes (u (1+ div))
(declare (fixnum u))
(dotimes (v (1+ div))
(declare (fixnum v))
(if (pnpoly npol xp yp (* u fdiv) (* v fdiv)) (incf count))))
(format t "~%Area: ~a" (/ (* count 4 a a) (* (1+ div) (1+ div))))))

Raymond Toy

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

ma...@cs.uwa.edu.au (Matthew McDonald) writes:

>
> Unless I misunderstood something, one of the versions of the ray-tracing
> code ran *four* times faster on CMUCL than it did on ACL. Admittedly, it
> had been optimised for CMUCL, but I'd be surprised to see a non-contrived
> piece of C that revealed similar differences in C implementations.

The code was not really optimized for CMUCL. The changes were
portable, and the only CMUCL specific item was the start-block stuff.
This could easily have been made portable by wrapping up everything in
a labels block.

I think that if *I* wrote a C compiler for a pentium, I could produce
one that produces code that's probably 2-10 times slower than any
other C compiler's. It would take me forever to do it, though.

I guess the bottom line is that writing a compiler is hard, and
writing a good compiler is even harder.

Ray

Raymond Toy

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

John Watton <john....@alcoa.com> writes:

> it all the time. I posted the benchmark below on comp.lang.lisp
> sometime ago and didn't generate a fraction of the interest directed
> to the ray tracing code. (Unlike the ray tracing code, I guess the

Perhaps because you had already optimized the code? CMUCL only gives
a few notes about (* (1+ DIV) (1+ DIV)) because the result is not a
fixnum. This should not impact speed too much.

> (:explain :boxing :calls :types)). Also with ACL is a nonANSI and
> undocumented feature called immediate-args for passing and returning
> unboxed arguments. I don't use it in this benchmark but it is
> available at least on the Unix variety. I suspect that this is one
> area that the ray tracing code suffers in the ACL version. In general
> I prefer to pass floats inside arrays to avoid boxing - it's easy and
> ANSI. But, I don't do that here to be fair to C.

I don't remember precisely, but aren't parameters allowed to be passed
in registers in C? If so, Lisp should be allowed the same freedom.


Here are my times on an Ultra-30 300 MHz sparc, using Sun C compiler
(-O), and CMUCL 18a+:

sec
cc 5.7
cc -O 2.0
CMUCL 3.1

I note that the test conses up 5 MB or so, as reported by time. If I
add the CMUCL extension (declaim (ext:start-block pnpolymain)), the
run time doesn't change, but it now only conses 8000 bytes.

Ray


Raymond Toy

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

Raymond Toy <t...@rtp.ericsson.se> writes:

> John Watton <john....@alcoa.com> writes:
>
> > it all the time. I posted the benchmark below on comp.lang.lisp
> > sometime ago and didn't generate a fraction of the interest directed
> > to the ray tracing code. (Unlike the ray tracing code, I guess the
>
> Perhaps because you had already optimized the code? CMUCL only gives
> a few notes about (* (1+ DIV) (1+ DIV)) because the result is not a
> fixnum. This should not impact speed too much.
>

Here is one small optimization for Lisp which a C compiler probably
would have already done. This does the redundant array references
just once at the beginning of the loop. Are there any Lisp compilers
that do subexpression elimination (is that what it's called?)?
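The usual name is common subexpression elimination (CSE). In C terms, the hand transformation looks like this hypothetical sketch (the function name and shapes are mine, not from the thread):

```c
/* Manual CSE: each yp[i], yp[j], xp[i], xp[j] is loaded from
   memory once per call instead of at every use, mirroring a
   Lisp rewrite with LET bindings around the array references. */
double edge_x(const double *xp, const double *yp,
              int i, int j, double y)
{
    double ypi = yp[i], ypj = yp[j];
    double xpi = xp[i], xpj = xp[j];
    return xpi + (xpj - xpi) * (y - ypi) / (ypj - ypi);
}
```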

This change reduces the Lisp time from 3.2 sec to 2.5 sec. We are now
within .4 sec (20%) of the optimized C code.

Ray

(defun pnpoly (npol xp yp x y)
(declare (optimize (speed 3) (safety 0))
(fixnum npol)
(double-float x y)
(type (simple-array double-float (*)) xp yp))
(let* ((c nil)
(j (1- npol)))
(declare (fixnum j))
(dotimes (i npol c)
(declare (fixnum i))

(let ((ypi (aref yp i))
(ypj (aref yp j))
(xpi (aref xp i))
(xpj (aref xp j)))
(if (and (or (and (<= ypi y) (< y ypj))
(and (<= ypj y) (< y ypi)))
(< x (+ xpi (/ (* (- xpj xpi)
(- y ypi))
(- ypj ypi)))))
(setq c (not c))))
(setq j i))))

Bulent Murtezaoglu

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
[...]
RT> I note that the test conses up 5 MB or so, as reported by time.
RT> If I add the CMUCL extension (declaim (ext:start-block
RT> pnpolymain)), the run time doesn't change, but it now only
RT> conses 8000 bytes.

Yes I've been noticing that too. That's probably due to return values
not getting consed when the file is block compiled. Here's what I've been
wondering lately:

(defstruct foo ;bla bla
)

(defun f1 (x)
;; some computation
(make-foo ;some results
))

(defun f2 (y)
;; some computation
(let ((local1 (f1 y)))
;; some more computation
))

you get the idea (excuse the bad parenthesis placement):
the make-foo call generates garbage. Is Common Lisp allowed to
stack-allocate what f1 returns under certain circumstances?
Maybe if f1 is inlined and local1 is declared dynamic-extent?
(make-foo-ing in f2 and passing that to f1 is too ugly.)
Can some enlightened soul shed light on this?

cheers,

BM

Christopher J. Vogt

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to John Watton

John Watton wrote:
>
> >I keep seeing lisp people say that code written in lisp can easily be
> >made to perform as well as code written in C, and it just doesn't
> >match my (limited) experience. With CMUCL, floating point code can be
> >made to run about as fast as in C, but it seems like hard work. I'm
> >not even sure it's feasible in ACL.
>
> Of course its feasible in ACL (4.3 varity on unix, NT and linux). I do
> it all the time. I posted the benchmark below on comp.lang.lisp
> sometime ago and didn't generate a fraction of the interest directed
> to the ray tracing code. (Unlike the ray tracing code, I guess the
> benchmark was too easy to get your hands around.) As you can see it is
> comparible to unoptimized C code. With ACL I find I use the nonANSI
> declare keyword :explain to tell me where the boxing is (declare
> (:explain :boxing :calls :types)). Also with ACL is a nonANSI and
> undocumented feature called immediate-args for passing and returning
> unboxed arguments. I don't use it in this benchmark but it is
> available at least on the Unix variety. I suspect that this is one
> area that the ray tracing code suffers in the ACL version. In general
> I prefer to pass floats inside arrays to avoid boxing - it's easy and
> ANSI. But, I don't do that here to be fair to C.
>
> John Watton
> ALCOA
>
> Machine: Pentium 90 Mhz, 40 Meg RAM
> OS: Windows NT
>
> Double floating point benchmark
> Pnpoly area sec rel
> ----------- --- ---
> C | VC++ 4.0 (cl -Ox) 16.5 1.0
> Java | JDK 1.1.4 (JIT) 20.2 1.2
> C | VC++ 4.0 (cl) 23.5 1.4
> Lisp | Franz ACL 4.3.2 25.6 1.6
> Java | JDK 1.1.4 (-O) 106.9 6.5
> Lisp | Franz ACL 3.0.1 521.3 31.6
> Perl | Perl5.003 1007.0 61.0

One needs to be very careful with benchmarks. It is easy to think you are
benchmarking something in particular, in this case DFP performance, when
in actuality you are measuring something else, in this case conditionals,
incremental loops, function calling and integer-to-DFP conversion. The code
benchmarked only executes the following DFP code 1% of the time:


(xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i]))

If you change the code to what I have below, I think you'll *really* be
testing DFP performance, and not something else. I have omitted the
CL code for brevity, but it should be simple to make similar changes.

I realize that what started out as a meaningful program is now a
synthetic meaningless program, but it will truly test DFP performance, and
not something else. OTOH, you could also use the FFT benchmark in Gabriel.

I'll be surprised if you don't find a factor of 10 performance difference
between CL and C.


#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int pnpoly(int npol, double *xp, double *yp, double x, double y)
{
int i, j, c = 0;
for (i = 0, j = npol-1; i < npol; j = i++) {
xp[i] = (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i];
yp[j] = (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i];
}
return c;
}

int poly(int npol, int div)
{
int count = 0;
double pi = acos(-1.0);
double* xp;
double* yp;
double theta;
double a = 10.0;
double fdiv = 2*a/div;
int i=0, j=0;

xp = (double *)malloc(npol*sizeof(double) + 1);
yp = (double *)malloc(npol*sizeof(double) + 1);

for(i=0;i<npol;i++) {
theta = i*2*pi/npol;
xp[i] = a+a*pow(cos(theta), 3);
yp[i] = a+a*pow(sin(theta), 3);
}
for(i=0;i<=div;i++) {
for(j=0;j<=div;j++) {
pnpoly(npol,xp,yp,i*fdiv,j*fdiv);
}
}
printf("\nArea: %f \n", 4*a*a*count/((div+1)*(div+1)));
return(0);
}

--

Raymond Toy

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

"Christopher J. Vogt" <vo...@computer.org> writes:

>
> If you change the code to what I have below, I think you'll *really* be
> testing DFP performance, and not something else. I have omitted the
> CL code for brevity, but it should be simple to make the similar changes.

CL code appended. Hope it's right.

> I'll be surprised if you don't find a factor of 10 performance difference
> between CL and C.

I don't see why you would say that. For a smart Lisp compiler like
CMUCL no boxing of any float is needed in pnpoly, so CL doesn't have
to be any slower than C code.

Here is what I get, using Sun C and CMUCL 18a+

cc -O 7.25 sec
cc -xO5 6.06 sec
CMUCL 6.67 sec
CMUCL 5.86 sec (don't do redundant array refs)

So not only is CL not 10 times slower than C, but CL is actually
competitive with C. (Perhaps I made a mistake in translating the C
code to Lisp?)

Also, with CMUCL, I get overflow errors, which I turned off to run
this benchmark. C quietly runs to completion.

Ray

;; Translation of original


(defun pnpoly (npol xp yp x y)
(declare (optimize (speed 3) (safety 0))
(fixnum npol)
(double-float x y)
(type (simple-array double-float (*)) xp yp))
(let* ((c 0)
(j (1- npol)))
(declare (fixnum j))
(dotimes (i npol c)
(declare (fixnum i))
(setf (aref xp i) (+ (/ (* (- (aref xp j) (aref xp i))
(- y (aref yp i)))
(- (aref yp j) (aref yp i)))
(aref xp i)))
(setf (aref yp j) (+ (/ (* (- (aref xp j) (aref xp i))
(- y (aref yp i)))
(- (aref yp j) (aref yp i)))
(aref xp i)))
(setq j i))))

;; Move some redundant array refs. Not all removed because I didn't
;; want to figure out if they were really redundant.


(defun pnpoly (npol xp yp x y)
(declare (optimize (speed 3) (safety 0))
(fixnum npol)
(double-float x y)
(type (simple-array double-float (*)) xp yp))
(let* ((c 0)
(j (1- npol)))
(declare (fixnum j))
(dotimes (i npol c)
(declare (fixnum i))
(let ((ypi (aref yp i))
(ypj (aref yp j))
(xpi (aref xp i))
(xpj (aref xp j)))
(setf (aref xp i) (+ (/ (* (- xpj xpi)
(- y ypi))
(- ypj ypi))
xpi))
(setf (aref yp j) (+ (/ (* (- (aref xp j) (aref xp i))
(- y ypi))
(- ypj ypi))
(aref xp i))))
(setq j i))))

Christopher J. Vogt

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to Raymond Toy

Raymond Toy wrote:
>
> "Christopher J. Vogt" <vo...@computer.org> writes:
>
> >
> > If you change the code to what I have below, I think you'll *really* be
> > testing DFP performance, and not something else. I have omitted the
> > CL code for brevity, but it should be simple to make the similar changes.
>
> CL code appended. Hope it's right.
>
> > I'll be surprised if you don't find a factor of 10 performance difference
> > between CL and C.
>
> I don't see why you would say that. For a smart Lisp compiler like
> CMUCL no boxing of any float is needed in pnpoly, so CL doesn't have
> to be any slower than C code.
>
> Here is what I get, using Sun C and CMUCL 18a+
>
> cc -O 7.25 sec
> cc -xO5 6.06 sec
> CMUCL 6.67 sec
> CMUCL 5.86 sec (don't do redundant array refs)
>
> So not only is CL not 10 times slower than C, but CL is actually
> competitive with C. (Perhaps I made a mistake in translating the C
> code to Lisp?)

Well, obviously I have been using the *wrong* Lisp implementation. Sadly,
CMUCL doesn't run under Windows NT or MacOS, which are my two current
setups. However, this confirms what I have said for many years (and
most recently earlier in this thread), which is that there is nothing
in the CL spec that prohibits an implementation from attaining performance
similar to C (within 10%). Apparently CMUCL is an SSC (Sufficiently Smart
Compiler).

Now, how do I go about getting CMUCL ported to NT or MacOS?

>
> Also, with CMUCL, I get overflow errors, which I turned off to run
> this benchmark. C quietly runs to completion.

Indeed, one of the many advantages of Lisp over C.

Raymond Toy

unread,
Mar 17, 1998, 3:00:00 AM3/17/98
to

"Christopher J. Vogt" <vo...@computer.org> writes:

>
> Now, how do I go about getting CMUCL ported to NT or MacOS?
>

I doubt if anyone would want to do the MacOS port, especially if it's the
68k version. Porting to a new architecture is *extremely* hard. I
have a hard enough time just compiling up a new version! The PowerPC
Mac might be possible since there was an RS-6000 port.

The first step is to massage the C code to run on NT. Then you'll
have to figure out the Lisp part so that it could talk to the OS
correctly and all of the other OS specific things.

There was some discussion on the CMUCL mailing list about this, but it
seemed that the cygwin stuff for NT wasn't quite ready and nobody was
really motivated for lack of time, energy, desire, etc. to do a native
NT port. But I think many people would love to have one.

If you're interested, join the CMUCL mailing list for hints and
suggestions. See http://www.cons.org.

Ray

Espen Vestre

unread,
Mar 18, 1998, 3:00:00 AM3/18/98
to

"Christopher J. Vogt" <vo...@computer.org> writes:

> I'll be surprised if you don't find a factor of 10 performance difference
> between CL and C.

I think I already tried to explain this to you in a thread in
comp.lang.lisp.mcl: the bad FP performance you see is caused
by the bad FP implementation of MCL. ACL and CMUCL behave much,
much better. (But then, my execution times for FP code with
ACL4.3.1/Solaris were so low that you didn't believe them, since
the specific code you posted compiled to pure FP-crunching machine
code).

It's very sad that Digitool hasn't done anything about this yet,
because (most of) the rest of MCL is really, really great. But
the real world need for good FP performance is probably greater
than the real world need for the outstanding bignum performance
of MCL. (But take any piece of integer/symbol-crunching software
and port it to MCL, and you'll see very good performance. And
try and port it to C, and you'll find yourself reinventing the
concept of symbols. I've seen _that_ done several times :-()

--

regards,
Espen Vestre

crac...@wavehh.hanse.de

unread,
Mar 18, 1998, 3:00:00 AM3/18/98
to

In article <350EFD44...@computer.org>,

vo...@computer.org wrote:
> > Also, with CMUCL, I get overflow errors, which I turned off to run
> > this benchmark. C quietly runs to completion.
>
> Indeed, one of the many advantages of Lisp over C.

In the usual case, you can make your C code send SIGFPE on exceptions
like overflow, in FreeBSD see fpsetmask(3). In CMUCL and C, the native
features of the CPU are used.

Martin


-----== Posted via Deja News, The Leader in Internet Discussion ==-----
http://www.dejanews.com/ Now offering spam-free web-based newsreading

Pierre Mai

unread,
Mar 18, 1998, 3:00:00 AM3/18/98
to

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:

RT> If you're interested, join the CMUCL mailing list for hints
RT> and suggestions. See http://www.cons.org.

Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I don't
think so), or is one being planned? Or what other Common Lisp
implementation -- preferably free or low-cost, and
"sufficiently fast"[1] -- is available for the Alpha?

I'm finding it somewhat difficult to contact *.cons.org (except
ftp2.cons.org, which is Walnut Creek's FTP Server) recently, getting
messages that these hosts don't exist (DNS problem?)... I'll try
again later today...

Regs, Pierre.

[1] FP performance is *not* my foremost concern here, but CLOS
performance and general funcalling and list manipulation are.

--
Pierre Mai <de...@cs.tu-berlin.de> http://home.pages.de/~trillian/
"Such is life." -- Fiona in "Four Weddings and a Funeral" (UK/1994)

Christopher J. Vogt

unread,
Mar 18, 1998, 3:00:00 AM3/18/98
to

Espen Vestre wrote:
>
> "Christopher J. Vogt" <vo...@computer.org> writes:
>
> > I'll be surprised if you don't find a factor of 10 performance difference
> > between CL and C.
>
> I think I already tried to explain you this in a thread in
> comp.lang.lisp.mcl: The bad FP performance you achieve is caused
> by the bad FP implementation of MCL.

Sloppy writing on my part, I guess. Earlier in the thread I stated:


In my experience with CL, much of it involving graphics and FP, the two
areas in which CL implementations fall far behind C in terms of performance
are FP and I/O. So if there is something in particular about CL that you want
to benchmark, e.g. array performance, make sure that you aren't doing any
I/O and are not using FP numbers. I find it interesting that the original
message that started this thread was titled "Floating Point speed in CL",
but the code wasn't doing very much FP, it was mostly ratios!

> ACL and CMUCL behave much,
> much better.

I have not used CMUCL, which apparently has comparable performance. This
is great, and is an existence proof of what I have been saying, and should
be a target for all the commercial lisp vendors: Digitool, Franz and Harlequin.
I have not used the most recent versions of ACL, which appear to be
within a factor of 2 of C. Older versions of ACL, however, were as I described,
as are current versions of LWW and MCL. Unfortunately for me, I need something
that runs under MacOS or WinNT, and CMUCL doesn't run in either place. I need
to look at the current version of ACL though, as it sounds promising.

> (But then, my execution times for FP code with
> ACL4.3.1/Solaris were so low that you didn't believe them, since
> the specific code you posted compiled to pure FP-crunching machine
> code).

What I found difficult to believe is that you showed the time for floating
point CL code 7 times faster than fixnum code. I find this counterintuitive
in the extreme. It suggested to me that some investigation was needed. It isn't
the "absolute" time of the FP code, but the time relative to the fixnum
version of the same code. I remain skeptical. Did you look at the disassembly
to be sure the compiler didn't, for example, optimize out the floating point
multiply because the result was never used? Or maybe, and I'm not familiar
with your hardware, you have 10 floating point units and only one integer
unit in your CPU; that might explain why the FP code was faster than the
fixnum code.

> [...]

Patrick Tufts

unread,
Mar 18, 1998, 3:00:00 AM3/18/98
to

Pierre Mai <de...@cs.tu-berlin.de> writes:

>think so), or is it being planned? Or which other Common Lisp
>implementation -- preferably free or low-cost -- which is
>"sufficiently fast"[1] is available for the Alpha?

I doubt it's low-cost, but I remember reading that Symbolics ported
Genera to the DEC Alpha. I think the price was several thousand US
dollars a few years ago.

--Pat

Fred Gilham

unread,
Mar 18, 1998, 3:00:00 AM3/18/98
to

What's the answer that (pnpoly 200 400) is supposed to give? In C I
get 177.287828. In CMU lisp I get 0.0d0.

Seems like it's a little useless to compare two programs that give
different answers to the same problem.

--
Fred Gilham gilham @ csl . sri . com
King Christ, this world is all aleak, / And life preservers there are none,
And waves that only He may walk / Who dared to call Himself a man.
-- e. e. cummings, from Jehovah Buried, Satan Dead

Erik Naggum

unread,
Mar 18, 1998, 3:00:00 AM3/18/98
to

* Christopher J. Vogt

| Older versions of ACL, however, were as I described, as are current
| versions of LWW and MCL.

FWIW, "Allegro Common Lisp" is used both for the Windows and the Unix
version of Franz Inc's product line. the Windows version is a cheap and
slow CLtL1 implementation with CLOS added -- I was very disappointed with
it. however, its main fortes are its small size and the GUI builder.
the Unix version purports to be conforming to ANSI Common Lisp (and is
very close), is fast and holds very high quality -- I have been very
satisfied with it. if all you have is ACL for Windows, the Unix version
is like it's from a different world. (actually, it's from a different
vendor, some years back.)

the latest releases of these products are ACL 3.0.2 for Windows, ACL 4.3
for Linux, ACL 4.3.1 for Unix, and ACL 4.3.2 for NT, which is based on
the Unix version plus the GUI builder for Windows NT, and I understand it
is a preliminary release towards ACL 5.0.

Istvan Marko

unread,
Mar 18, 1998, 3:00:00 AM3/18/98
to

Pierre Mai <de...@cs.tu-berlin.de> writes:

> >>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
>
> RT> If you're interested, join the CMUCL mailing list for hints
> RT> and suggestions. See http://www.cons.org.
>
> Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I don't

> think so), or is it being planned?


ftp://ftp2.cons.org/pub/lisp/cmucl/release/cmucl-18a.alpha_DU4*

--
Istvan

Espen Vestre

unread,
Mar 19, 1998, 3:00:00 AM3/19/98
to

"Christopher J. Vogt" <vo...@computer.org> writes:

> What I found difficult to believe is that you showed the time for floating
> point CL code 7 times faster than fixnum code. I find this counter
> intuitive to the extreme. It suggested to me some investigation. It isn't
> the "absolute" time of the FP code, but the time relative to the fixnum
> version of the same code. I remain skeptical. Did you look at the disassembly
> to be sure the compiler didn't, for example, optimize out the floating point
> multiply because the result was never used?

I glanced at the disassembly of both the ACL and MCL compiled functions,
and as far as I could tell (I'm not very fluent in reading assembly code -
if I have some time, I could try it again), the FP code was as optimized
as you could wish, but the fixnum code wasn't quite optimized. For MCL,
the fixnum code was _very_ good, but the FP code looked like a disaster
(much less optimized than the ACL _fixnum_ code).

> Or maybe, and I'm not familiar
> with your hardware, you have 10 floating point units, and only one integer
> unit in you CPU, that might explain why the FP code was faster than the
> fixnum code.

Compared to the overall speed of the machine, the 140MHz UltraSPARC
processor of my workstation, which I use for ACL, has very good FP
specs (take a look at http://www.spec.org).

--

regards,
Espen Vestre
Telenor Nextel
Norway

Raymond Toy

unread,
Mar 19, 1998, 3:00:00 AM3/19/98
to

Fred Gilham <gil...@snapdragon.csl.sri.com> writes:

> What's the answer that (pnpoly 200 400) is supposed to give? In C I
> get 177.287828. In CMU lisp I get 0.0d0.

Bummer. I get the same answer as C using CMUCL 18a+ Sparc. What
version are you running? Time to get a newer version?

Ray

Raymond Toy

unread,
Mar 19, 1998, 3:00:00 AM3/19/98
to

Pierre Mai <de...@cs.tu-berlin.de> writes:

>
> Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I don't

> think so), or is it being planned? Or which other Common Lisp

It exists for Digital Unix (no Linux/Alpha) and you can find it at
www.cons.org.

Well, except that apparently some of the machines there have died in
the past few days. They're being worked on and should be up in a few
days.

>
> [1] FP performance is *not* my foremost concern here, but CLOS
> performance and general funcalling and list manipulation is.
>

The Alpha version supposedly has quite good FP performance.

Ray

Fred Gilham

unread,
Mar 19, 1998, 3:00:00 AM3/19/98
to

Raymond Toy <t...@rtp.ericsson.se> writes:

After I thought about it a little, it became pretty clear that 0.0d0
isn't the right answer.... :-)

I try to run the newer versions of CMUCL. Oddly enough, the older,
release version for x86 (CMU Common Lisp 97.02.22) gives the same
answer as the C version. This is under FreeBSD on a PII-300.

snapdragon:tests > lisp
CMU Common Lisp 97.02.22, running on snapdragon.csl.sri.com
Send bug reports and questions to your local CMU CL maintainer, or to
cmucl...@cs.cmu.edu.
Loaded subsystems:
Python 1.0, target Intel x86
CLOS based on PCL version: September 16 92 PCL (f)
* (load "pnpoly.x86f")

; Loading #p"/a/japonica/carnelian/homes/gilham/cmucl/tests/pnpoly.x86f".
T
* (time (pnpolymain 200 400))
Compiling LAMBDA NIL:
Compiling Top-Level Form:
[GC threshold exceeded with 2,000,384 bytes in use. Commencing GC.]
[GC completed with 41,984 bytes retained and 1,958,400 bytes freed.]
[GC will next occur when at least 2,041,984 bytes are in use.]
[GC threshold exceeded with 2,042,368 bytes in use. Commencing GC.]
[GC completed with 41,984 bytes retained and 2,000,384 bytes freed.]
[GC will next occur when at least 2,041,984 bytes are in use.]

Area: 117.28782781201609d0
Evaluation took:
3.15 seconds of real time
2.962338 seconds of user run time
0.018394 seconds of system run time
[Run times include 0.21 seconds GC run time]
0 page faults and
5326336 bytes consed.
NIL
*

Now for the punch line. The following is under Linux on a P-Pro 200.

boron:tests > lisp
Allegro CL 4.3 [Linux/X86; R1] (12/11/96 1:33)
Copyright (C) 1985-1996, Franz Inc., Berkeley, CA, USA. All Rights Reserved.
;; Optimization settings: safety 1, space 1, speed 1, debug 2.
;; For a complete description of all compiler switches given the
;; current optimization settings evaluate (EXPLAIN-COMPILER-SETTINGS).
USER(1): (compile-file "pnpoly.lisp")
;;; Compiling file pnpoly.lisp
; Compiling PNPOLY
; Compiling PNPOLYMAIN
;;; Writing fasl file pnpoly.fasl
Warning: No IN-PACKAGE form seen in pnpoly.lisp. (Allegro Presto will
be ineffective when loading a file having no IN-PACKAGE form.)
;;; Fasl write complete
#p"pnpoly.fasl"
T
NIL
USER(2): (load "pnpoly.fasl")
; Fast loading ./pnpoly.fasl
T
USER(3): (time (pnpolymain 200 400))

Area: 180.0784820989919d0
; cpu time (non-gc) 3,900 msec user, 10 msec system
; cpu time (gc) 140 msec user, 0 msec system
; cpu time (total) 4,040 msec user, 10 msec system
; real time 4,264 msec
; space allocation:
; 19 cons cells, 0 symbols, 7,804,928 other bytes
NIL
USER(4):

Mike Mcdonald

unread,
Mar 20, 1998, 3:00:00 AM3/20/98
to Pierre Mai

[Posted and mailed]

In article <m3iupckr...@torus.cs.tu-berlin.de>,


Pierre Mai <de...@cs.tu-berlin.de> writes:
>>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:
>
> RT> If you're interested, join the CMUCL mailing list for hints
> RT> and suggestions. See http://www.cons.org.
>

> Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I don't
> think so), or is it being planned? Or which other Common Lisp

> implementation -- preferably free or low-cost -- which is
> "sufficiently fast"[1] is available for the Alpha?


bigdec=>cmucl
CMU Common Lisp 18a-RELEASE, running on bigdec


Send bug reports and questions to your local CMU CL maintainer, or to

cmuc...@cons.org.
Loaded subsystems:
Python 1.0, target Alpha


CLOS based on PCL version: September 16 92 PCL (f)

* (quit)
bigdec=>uname -a
OSF1 bigdec V4.0 564 alpha
bigdec=>

> I'm finding it somewhat difficult to contact *.cons.org (except
> ftp2.cons.org, which is Walnut Creek's FTP Server) recently, getting
> messages that these hosts don't exist (DNS problem?)... I'll try
> again later today...

Some machines died and they're slowly getting
fixed. Please be patient.

Mike McDonald
mik...@mikemac.com


Pierre Mai

unread,
Mar 20, 1998, 3:00:00 AM3/20/98
to

>>>>> "RT" == Raymond Toy <t...@rtp.ericsson.se> writes:

RT> Pierre Mai <de...@cs.tu-berlin.de> writes:
>> Does a port to Linux/Alpha and/or Digital Unix/Alpha exist (I
>> don't think so), or is it being planned? Or which other Common
>> Lisp

RT> It exists for Digital Unix (no Linux/Alpha) and you can find
RT> it www.cons.org.

Oooops, my fault for relying on my poor memory. Of course the
Alpha is supported, and having gotten the 18a-Release yesterday,
working with CMUCL on Alpha really is *fun* ;)

Thanks a lot for all those who corrected me! Isn't it funny what you
can miss out on, by relying on that thing between your ears... ;)

>> [1] FP performance is *not* my foremost concern here, but CLOS
>> performance and general funcalling and list manipulation is.
>>

RT> The Alpha version supposedly has quite good FP performance.

I think I should write more number-crunching programs ;)

Regs, Pierre.

Harvey J. Stein

unread,
Mar 25, 1998, 3:00:00 AM3/25/98
to

ma...@cs.uwa.edu.au (Matthew McDonald) writes:

> Erik Naggum <cle...@naggum.no> writes:
>
> > C is not fast. C is primitive, which means it has a potential of being
> > fast if the programmer is a genius and the compiler is on his side.

>
> With simple languages like C, the difference between naive compilation
> and full optimisation is usually less than a factor of two (in my
> experience). Look at the gcc vs oberon comparisions (the standard
> oberon compilers do practically no optimisation) or even gcc with and
> without optimisation.
>

> Even a stupid C compiler can typically produce code that's within
> spitting distance of what's produced by an optimising compiler,
> because the language is basically a portable assembler. The same can't
> be said of common lisp. When you're optimising floating point code in
> C you don't usually need to go to much trouble to ensure that the
> major operations are on unboxed data!

Very untrue. Although -O2 with gcc on intel typically reduces run
time by a factor of 1.2 to 1.5, I've seen egcs give a factor
significantly larger than 2 on a big project. I've also seen such
large differences more frequently on the alpha & sparc CPUs, most
notably with DEC's C compiler on alpha CPUs, where factors of 2 are
common, and larger factors aren't too unusual.

To substantiate some of this, here're the application runtimes when
compiled with egcs & with gcc on linux intel:

pII 300, gcc -O2 -ffloat-store:
24.85user 0.83system 0:45.53elapsed 56%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2092major+2575minor)pagefaults 217swaps
pII 300, egcs -O3 -ffloat-store -march=pentiumpro:
17.59user 0.33system 0:26.53elapsed 67%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (748major+2545minor)pagefaults 0swaps

(on this version of gcc, -O2 was giving us the best performance).

I don't have unoptimized run times handy, but the above shows the
large difference one can see using optimization from compilers even as
similar as gcc 2.7.2.1 & egcs 1.0.1. Also, if you assume that gcc -O2
gives a 30% reduction in run time over unoptimized code then egcs -O3
is giving slightly better than a factor of 2.

The differences between C & optimized C are better register allocation
and instruction reordering to take advantage of pipelining & multiple
execution units, not to mention common subexpression elimination,
hiking invariants out of loops, etc.
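As a hedged illustration (my own toy example, not from the thread), here is the kind of rewrite behind "common subexpression elimination" and "hiking invariants out of loops" — the `naive`/`hoisted` names are purely illustrative:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Naive form: "scale / n" is loop-invariant and "v[i] * v[i]" is a
   common subexpression; both are recomputed on every iteration. */
double naive(const double *v, size_t n, double scale)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i] * v[i] * (scale / n) + v[i] * v[i];
    return sum;
}

/* Roughly what the optimizer emits: the invariant hoisted out of
   the loop, the square computed once per iteration. */
double hoisted(const double *v, size_t n, double scale)
{
    double k = scale / n + 1.0;   /* x*(scale/n) + x == x*k */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double sq = v[i] * v[i];  /* common subexpression, once */
        sum += sq * k;
    }
    return sum;
}
```

Register allocation and instruction scheduling buy more on top of this, but those two transformations alone remove a division and a multiply from every iteration of the naive loop.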

This is necessary for good performance on RISC chips. Instruction
reordering isn't as important on intel, but register allocation is
extremely important due to the lack of registers. And, with Merced,
optimization & instruction ordering is going to be even more
important. It will be almost impossible to write fast assembler by
hand for Merced, let alone do it with unoptimized C code.

All this actually makes C quite far from a "portable assembler".

As for optimizing floating point code in Lisp, one's largely just
adding in all the declarations that one had to start with in C, so I
don't see your point about it being so much more trouble. Especially
when most of the time is spent in a small section of the code, where
one just has to add the appropriate declarations in a small amount of
code, whereas in C the declarations are everywhere.

I think what generally makes optimization more difficult in Lisp is
that people do it so rarely. Mostly because it's not needed so
often. There's fast and there's fast enough, and Lisp is usually fast
enough.

--
Harvey J. Stein
Berger Financial Research
hjs...@bfr.co.il

Christopher J. Vogt

unread,
Mar 25, 1998, 3:00:00 AM3/25/98
to Harvey J. Stein

Harvey J. Stein wrote:
>
>[...]

>
>
> As for optimizing floating point code in Lisp, one's largely just
> adding in all the declarations that one had to start with in C, so I
> don't see your point about it being so much more trouble. Especially
> when most of the time is spent in a small section of the code, where
> one just has to add the appropriate declarations in a small amount of
> code, whereas in C the declarations are everywhere.

Your programming environment has to support this. The CL standard allows
implementations to ignore type declarations. There are some implementations
that ignore the floating point type declaration, so no performance
improvement is available via declarations.

>
> I think what generally makes optimization more difficult in Lisp is
> that people do it so rarely. Mostly because it's not needed so
> often. There's fast and there's fast enough, and Lisp is usually fast
> enough.

I would agree with you that CL is generally "fast enough". However, in my
experience, I have found that floating point and I/O tend not to be
"fast enough" for the applications I have worked on.

Matthew McDonald

unread,
Mar 26, 1998, 3:00:00 AM3/26/98
to

hjs...@bfr.co.il (Harvey J. Stein) writes:

>> Even a stupid C compiler can typically produce code that's within
>> spitting distance of what's produced by an optimising compiler,
>> because the language is basically a portable assembler. The same can't

[...]


>Very untrue. Although -O2 with gcc on intel typically reduces run
>time by a factor of 1.2 to 1.5, I've seen egcs give a factor
>significantly larger than 2 on a big project.

I'm sorry, but I think you're quibbling. The post you're replying to
(and the ones it referenced) said that for the purposes of
argument, "close" meant performance that was within a factor of
approximately two, and "not close" meant a factor of >= four.

>To substantiate some of this, here're the application runtimes when
>compiled with egcs & with gcc on linux intel:

[...]


>Also, if you assume that gcc -O2
>gives a 30% reduction in run time over unoptimized code then egcs -O3
>is giving slightly better than a factor of 2.

As I said in the post you're replying to, the difference between optimised
and unoptimised C is usually not much more than a factor of two (IME).

It may be significant that three articles in this thread have
included vague claims that "significantly" larger differences are sometimes
observed, but all the actual timings reported show factors of < 2.

[...]


>important. It will be almost impossible to write fast assembler by
>hand for Merced, let alone do it with unoptimized C code.

Obviously, C compilers have to work harder to get good performance, as
processors deviate further from the PDP-11. Lisp compilers will too.

Some languages *do* optimise better than C on modern or parallel
machines, but they seem to be mainly languages like FORTRAN and
SISAL, that are simpler than C in several obvious and important
respects. As far as I know, Common Lisp isn't one of these languages
and there's no recent machine on which a common lisp compiler reliably
produces faster code than the standard C compiler.

>As for optimizing floating point code in Lisp, one's largely just
>adding in all the declarations that one had to start with in C, so I
>don't see your point about it being so much more trouble. Especially
>when most of the time is spent in a small section of the code, where

I agree that with an appropriate compiler, you can usually get reasonable
performance with a reasonable amount of effort. However, (IME) you need to
have read the manual for your implementation fairly carefully, and the
declarations you add for one implementation are unlikely to be *equally*
*useful* for some randomly chosen other implementation.

-- Matthew McDonald <ma...@cs.uwa.edu.au>

Harvey J. Stein

unread,
Mar 26, 1998, 3:00:00 AM3/26/98
to

"Christopher J. Vogt" <vo...@computer.org> writes:

> Harvey J. Stein wrote:
> >
> > As for optimizing floating point code in Lisp, one's largely just
> > adding in all the declarations that one had to start with in C, so I
> > don't see your point about it being so much more trouble. Especially
> > when most of the time is spent in a small section of the code, where

> > one just has to add the appropriate declarations in a small amount of
> > code, whereas in C the declarations are everywhere.
>
> Your programming environment has to support this. The CL standard allows
> implementations to ignore type declarations. There are some implementations
> that ignore the floating point type declaration, and so no performance
> impmrovement is available via. declarations.

Then get a better Lisp compiler. The Lisp compiler can ignore
declarations and the C compiler can do lousy instruction scheduling &
register allocation. In fact, the C compiler can sometimes produce
*slower* code when certain optimizations are *enabled*. BFD.

There are also Lisp compilers which do type inference and sometimes
produce code which runs faster than comparable C code. Bigloo &
Stalin come to mind for scheme, and I think that CMUCL does this
sometimes for Common Lisp.

Andi Kleen

unread,
Mar 29, 1998, 3:00:00 AM3/29/98
to

hjs...@bfr.co.il (Harvey J. Stein) writes:

> There are also Lisp compilers which do type inference and sometimes
> produce code which runs faster than comparable C code. Bigloo &
> Stalin come to mind for scheme, and I think that CMUCL does this
> sometimes for Common Lisp.

How can bigloo or Stalin produce faster code than C when they both compile
into C?

-Andi

Erik Naggum

unread,
Mar 29, 1998, 3:00:00 AM3/29/98
to

* Andi Kleen

| How can bigloo or Stalin produce faster code than C when they both
| compile into C?

the same way some C compilers produce faster and better assembly code
than humans could do even though they produce assembly code: by going way
beyond the mental limitations of the human mind in the code produced,
such as the number of "open issues" that can be dealt with. for the most
part, this is a purely mechanical limitation, so if a programmer can make
another program (such as a compiler) juggle the open issues for him, he
can deal with more of them.  since C is an amazingly stupid language, the
number of open issues for even an excellent C programmer is strictly
limited to a small constant number.  (the human consciousness is reported
to be able to deal with 7 simultaneously open issues, which vary from the
concrete, like something to watch out for, to the very abstract, like a free
variable. most programming languages do not optimize for the use of
these "slots" in the human psyche, and so can vary greatly in how many
are required at any time. compilers, however, do not have to worry about
such slots, and can have any number of issues open at a time, and can
thus make what would appear to a human being to be highly intelligent
conclusions; recall the references to the "sufficiently smart compiler".)

Jeffrey Mark Siskind

unread,
Mar 29, 1998, 3:00:00 AM3/29/98
to

> | How can bigloo or Stalin produce faster code than C when they both
> | compile into C?
>
> the same way some C compilers produce faster and better assembly code
> than humans could do even though they produce assembly code: by going way
> beyond the mental limitations of the human mind in the code produced,

Let me give a very concrete example. Suppose that one wanted to write a
numerical integration procedure. The natural way to do this in Scheme (and
Common Lisp) is as a higher-order procedure:

(define (integrate-1d l u f)
(let ((d (/ (- u l) 8.0)))
(* (+ (* (f l) 0.5)
(f (+ l d))
(f (+ l (* 2.0 d)))
(f (+ l (* 3.0 d)))
(f (+ l (* 4.0 d)))
(f (- u (* 3.0 d)))
(f (- u (* 2.0 d)))
(f (- u d))
(* (f u) 0.5))
d)))

Now suppose that one wanted to compute the value of a double integral of some
function. The natural way to do this in Scheme (and Common Lisp) is by making
nested calls to the one-dimensional numerical-integration procedure defined
above:

(define (integrate-2d l1 u1 l2 u2 f)
(integrate-1d
l2 u2 (lambda (y) (integrate-1d l1 u1 (lambda (x) (f x y))))))

(define (zark u v) (integrate-2d 0.0 u 0.0 v (lambda (x y) (* x y))))

Likewise, the natural way to do this by hand in C is by using function
pointers:

double integrate_1d(double l, double u, double (*f)(double))
{ double d = (u-l)/8.0;
return ((*f)(l)*0.5+
(*f)(l+d)+
(*f)(l+2.0*d)+
(*f)(l+3.0*d)+
(*f)(l+4.0*d)+
(*f)(u-3.0*d)+
(*f)(u-2.0*d)+
(*f)(u-d)+
(*f)(u)*0.5)*d;}

double integrate_2d(double l1, double u1, double l2, double u2,
double (*f)(double, double))
{ double outer(double y)
{ double inner(double x) {return (*f)(x, y);}
return integrate_1d(l1, u1, inner);}
return integrate_1d(l2, u2, outer);}

double zark(double u, double v)
{ double zarkon(double x, double y) {return x*y;}
return integrate_2d(0.0, u, 0.0, v, zarkon);}

(Note that the above C code uses GNU C extensions, i.e. nested functions,
to simulate downward funargs, i.e. lambdas.)

Now, when one runs a benchmark written around the above Scheme code, vs. the
same benchmark written around the above C code, on my machine (a Goliath
motherboard with two 200MHz P6s each with 512K L2 running Red Hat 4.1, kernel
2.0.33, gcc 2.7.2.1), the Scheme code runs more than 21 times faster than the
C code (1.08 CPU sec for Scheme vs. 23.27 CPU sec for C). This is the case
even though the Scheme code was compiled into C with Stalin and both the
Stalin-generated C code and the hand-written C code were compiled with the
same C compiler with the same options (-O2).

The reason why is obvious. No C compiler that I am aware of will in-line
across an indirect function call. Yet Stalin will. So Stalin in-lines
INTEGRATE-1D, INTEGRATE-2D, and the lambda-expression analogues of the
nested C functions inner, outer, and zarkon into ZARK. Stalin generates one C
function for ZARK. Not only does this eliminate all of the function-call
overhead, it also allows the C compiler to do register allocation and
instruction scheduling across a much larger fragment of code. As this example
shows, these effects combined can make a huge difference in performance.
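For intuition, here is a hedged C sketch (my own, not Stalin's actual output) of what zark reduces to once every indirect call is inlined: since f(x,y) = x*y and the 9-point rule is linear in its integrand, the inner integral is y times a constant, and the double integral collapses to a product of two direct-arithmetic 1-D rules, with no function pointers left for the C compiler to trip over. The names `rule_1d` and `zark_inlined` are illustrative:

```c
#include <assert.h>
#include <math.h>

/* The thread's 9-point rule, specialized by hand to f(x) = x. */
static double rule_1d(double l, double u)
{
    double d = (u - l) / 8.0;
    return (l * 0.5 +
            (l + d) + (l + 2.0 * d) + (l + 3.0 * d) + (l + 4.0 * d) +
            (u - 3.0 * d) + (u - 2.0 * d) + (u - d) +
            u * 0.5) * d;
}

/* With all indirect calls inlined, the inner integral of x*y over
   x in [0,u] is y * rule_1d(0,u); by linearity of the outer rule,
   the double integral is just the product of two 1-D integrals. */
double zark_inlined(double u, double v)
{
    return rule_1d(0.0, u) * rule_1d(0.0, v);
}
```

For this bilinear f the rule is exact, so zark_inlined(u, v) is simply (u^2/2)*(v^2/2); the point is not the closed form but that, after inlining, everything the compiler sees is straight-line arithmetic it can register-allocate and schedule freely.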

Thanks to Richard O'Keefe for giving me this example.
Jeff (http://www.neci.nj.nec.com/homepages/qobi)

Alexey Goldin

unread,
Apr 1, 1998, 3:00:00 AM4/1/98
to

Jeffrey Mark Siskind <qo...@research.nj.nec.com> writes:
>
> Now, when one runs a benchmark written around the above Scheme code, vs. the
> same benchmark written around the above C code, on my machine (a Goliath
> motherboard with two 200MHz P6s each with 512K L2 running Red Hat 4.1, kernel
> 2.0.33, gcc 2.7.2.1), the Scheme code runs more than 21 times faster than the
> C code (1.08 CPU sec for Scheme vs. 23.27 CPU sec for C). This is the case
> even though the Scheme code was compiled into C with Stalin and both the
> Stalin-generated C code and the hand-written C code were compiled with the
> same C compiler with the same options (-O2).
>

Could you please post your benchmarks? I tried to call this code million times
and got only 2 times difference (Scheme is faster).

Jeffrey Mark Siskind

unread,
Apr 1, 1998, 3:00:00 AM4/1/98
to Alexey Goldin

Enclosed is the handwritten C code and the Stalin-generated C code. This was
generated by the development version of Stalin. Since polyvariance is
currently broken in the development compiler, I simulated polyvariance
manually by giving it the enclosed Scheme program. Note that even though that
Scheme program is horrendous, the code generated is very good. Stalin flattens
it into a single C function. And even though there are many copies of the
parameters F1 ... F9, they are all eliminated because they are fictitious.

Now, I don't intend that people write Scheme code like this manually.
Eventually, I will fix polyvariance and Stalin will generate this
automatically from code like this:

(define (integrate-1D L U F)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F L) 0.5)
(F (+ L D))
(F (+ L (* 2.0 D)))
(F (+ L (* 3.0 D)))
(F (+ L (* 4.0 D)))
(F (- U (* 3.0 D)))
(F (- U (* 2.0 D)))
(F (- U D))
(* (F U) 0.5))
D)))

(define (integrate-2D L1 U1 L2 U2 F)
(integrate-1D L2 U2 (lambda (y) (integrate-1D L1 U1 (lambda (x) (F x y))) )))

(define (zark U V)
(integrate-2d 0.0 U 0.0 V (lambda (X Y) (* X Y)) ))

(define (r-total N)
(do ((I 1 (+ I 1))
(Sum 0.0 (+ Sum (zark (* I 1.0) (* I 2.0)))))
((> I N) Sum)))

(define (i-total N)
(do ((I 1 (+ I 1))
(Sum 0.0 (+ Sum (let ((I2 (* (* I I) 1.0))) (* I2 I2)))))
((> I N) Sum)))

(define (error-sum-of-squares N)
(do ((I 1 (+ I 1))
(Sum 0.0 (+ Sum (let ((E (- (r-total I) (i-total I)))) (* E E)))))
((> I N) Sum)))

(begin (display (error-sum-of-squares 1000)) (newline))

Jeff (http://www.neci.nj.nec.com/homepages/qobi)
-------------------------------------------------------------------------------
(define (integrate-1Da L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db1 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db2 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db3 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db4 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db5 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db6 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db7 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db8 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-1Db9 L U F1 F2 F3 F4 F5 F6 F7 F8 F9)
(let ((D (/ (- U L) 8.0)))
(* (+ (* (F1 L) 0.5)
(F2 (+ L D))
(F3 (+ L (* 2.0 D)))
(F4 (+ L (* 3.0 D)))
(F5 (+ L (* 4.0 D)))
(F6 (- U (* 3.0 D)))
(F7 (- U (* 2.0 D)))
(F8 (- U D))
(* (F9 U) 0.5))
D)))

(define (integrate-2D L1 U1 L2 U2
F11 F12 F13 F14 F15 F16 F17 F18 F19
F21 F22 F23 F24 F25 F26 F27 F28 F29
F31 F32 F33 F34 F35 F36 F37 F38 F39
F41 F42 F43 F44 F45 F46 F47 F48 F49
F51 F52 F53 F54 F55 F56 F57 F58 F59
F61 F62 F63 F64 F65 F66 F67 F68 F69
F71 F72 F73 F74 F75 F76 F77 F78 F79
F81 F82 F83 F84 F85 F86 F87 F88 F89
F91 F92 F93 F94 F95 F96 F97 F98 F99)
(integrate-1Da
L2 U2
(lambda (y)
(integrate-1Db1 L1 U1
(lambda (x) (F11 x y))
(lambda (x) (F12 x y))
(lambda (x) (F13 x y))
(lambda (x) (F14 x y))
(lambda (x) (F15 x y))
(lambda (x) (F16 x y))
(lambda (x) (F17 x y))
(lambda (x) (F18 x y))
(lambda (x) (F19 x y))))
(lambda (y)
(integrate-1Db2 L1 U1
(lambda (x) (F21 x y))
(lambda (x) (F22 x y))
(lambda (x) (F23 x y))
(lambda (x) (F24 x y))
(lambda (x) (F25 x y))
(lambda (x) (F26 x y))
(lambda (x) (F27 x y))
(lambda (x) (F28 x y))
(lambda (x) (F29 x y))))
(lambda (y)
(integrate-1Db3 L1 U1
(lambda (x) (F31 x y))
(lambda (x) (F32 x y))
(lambda (x) (F33 x y))
(lambda (x) (F34 x y))
(lambda (x) (F35 x y))
(lambda (x) (F36 x y))
(lambda (x) (F37 x y))
(lambda (x) (F38 x y))
(lambda (x) (F39 x y))))
(lambda (y)
(integrate-1Db4 L1 U1
(lambda (x) (F41 x y))
(lambda (x) (F42 x y))
(lambda (x) (F43 x y))
(lambda (x) (F44 x y))
(lambda (x) (F45 x y))
(lambda (x) (F46 x y))
(lambda (x) (F47 x y))
(lambda (x) (F48 x y))
(lambda (x) (F49 x y))))
(lambda (y)
(integrate-1Db5 L1 U1
(lambda (x) (F51 x y))
(lambda (x) (F52 x y))
(lambda (x) (F53 x y))
(lambda (x) (F54 x y))
(lambda (x) (F55 x y))
(lambda (x) (F56 x y))
(lambda (x) (F57 x y))
(lambda (x) (F58 x y))
(lambda (x) (F59 x y))))
(lambda (y)
(integrate-1Db6 L1 U1
(lambda (x) (F61 x y))
(lambda (x) (F62 x y))
(lambda (x) (F63 x y))
(lambda (x) (F64 x y))
(lambda (x) (F65 x y))
(lambda (x) (F66 x y))
(lambda (x) (F67 x y))
(lambda (x) (F68 x y))
(lambda (x) (F69 x y))))
(lambda (y)
(integrate-1Db7 L1 U1
(lambda (x) (F71 x y))
(lambda (x) (F72 x y))
(lambda (x) (F73 x y))
(lambda (x) (F74 x y))
(lambda (x) (F75 x y))
(lambda (x) (F76 x y))
(lambda (x) (F77 x y))
(lambda (x) (F78 x y))
(lambda (x) (F79 x y))))
(lambda (y)
(integrate-1Db8 L1 U1
(lambda (x) (F81 x y))
(lambda (x) (F82 x y))
(lambda (x) (F83 x y))
(lambda (x) (F84 x y))
(lambda (x) (F85 x y))
(lambda (x) (F86 x y))
(lambda (x) (F87 x y))
(lambda (x) (F88 x y))
(lambda (x) (F89 x y))))
(lambda (y)
(integrate-1Db9 L1 U1
(lambda (x) (F91 x y))
(lambda (x) (F92 x y))
(lambda (x) (F93 x y))
(lambda (x) (F94 x y))
(lambda (x) (F95 x y))
(lambda (x) (F96 x y))
(lambda (x) (F97 x y))
(lambda (x) (F98 x y))
(lambda (x) (F99 x y))))))

(define (zark U V)
(integrate-2d 0.0 U 0.0 V
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))
(lambda (X Y) (* X Y))))

(define (r-total N)
(do ((I 1 (+ I 1))
(Sum 0.0 (+ Sum (zark (* I 1.0) (* I 2.0)))))
((> I N) Sum)))

(define (i-total N)
(do ((I 1 (+ I 1))
(Sum 0.0 (+ Sum (let ((I2 (* (* I I) 1.0))) (* I2 I2)))))
((> I N) Sum)))

(define (error-sum-of-squares N)
(do ((I 1 (+ I 1))
(Sum 0.0 (+ Sum (let ((E (- (r-total I) (i-total I)))) (* E E)))))
((> I N) Sum)))

(begin (display (error-sum-of-squares 1000)) (newline))
-------------------------------------------------------------------------------
#include <stdio.h>

double integrate_1d(double, double, double (*)(double));
double integrate_2d(double, double, double, double,
double (*)(double, double));
double zark(double, double);
double r_total(int);
double sqr(double);
double i_total(int);
double error_sum_of_squares(int);
int main(void);

double integrate_1d(double l, double u, double (*f)(double))
{ double d = (u-l)/8.0;
return ((*f)(l)*0.5+
(*f)(l+d)+
(*f)(l+2.0*d)+
(*f)(l+3.0*d)+
(*f)(l+4.0*d)+
(*f)(u-3.0*d)+
(*f)(u-2.0*d)+
(*f)(u-d)+
(*f)(u)*0.5)*d;}

double integrate_2d(double l1, double u1, double l2, double u2,
double (*f)(double, double))
{ double outer(double y)
{ double inner(double x) {return (*f)(x, y);}
return integrate_1d(l1, u1, inner);}
return integrate_1d(l2, u2, outer);}

double zark(double u, double v)
{ double zarkon(double x, double y) {return x*y;}
return integrate_2d(0.0, u, 0.0, v, zarkon);}

double r_total(int n)
{ double sum = 0.0;
int i;
for (i = 1; i<=n; i++) sum += zark(i*1.0, i*2.0);
return sum;}

double sqr(double x) {return x*x;}

double i_total(int n)
{ double sum = 0.0;
int i;
for (i = 1; i<=n; i++) sum += sqr(i*i*1.0);
return sum;}

double error_sum_of_squares(int n)
{ double sum = 0.0;
int i;
for (i = 1; i<=n; i++) sum += sqr(r_total(i)-i_total(i));
return sum;}

int main(void)
{ printf("%f\n", error_sum_of_squares(1000));
return 0;}
-------------------------------------------------------------------------------
#include <stdio.h>

double closure_y, closure_l1, closure_u1, (*closure_f)(double, double);

double integrate_1d(double, double, double (*)(double));
double inner(double);
double outer(double);
double integrate_2d(double, double, double, double,
double (*)(double, double));
double zarkon(double, double);
double zark(double, double);
double r_total(int);
double sqr(double);
double i_total(int);
double error_sum_of_squares(int);
int main(void);

double integrate_1d(double l, double u, double (*f)(double))
{ double d = (u-l)/8.0;
return ((*f)(l)*0.5+
(*f)(l+d)+
(*f)(l+2.0*d)+
(*f)(l+3.0*d)+
(*f)(l+4.0*d)+
(*f)(u-3.0*d)+
(*f)(u-2.0*d)+
(*f)(u-d)+
(*f)(u)*0.5)*d;}

double inner(double x) {return (*closure_f)(x, closure_y);}

double outer(double y)
{ closure_y = y;
return integrate_1d(closure_l1, closure_u1, inner);}

double integrate_2d(double l1, double u1, double l2, double u2,
double (*f)(double, double))
{ closure_l1 = l1;
closure_u1 = u1;
closure_f = f;
return integrate_1d(l2, u2, outer);}

double zarkon(double x, double y) {return x*y;}

double zark(double u, double v) {return integrate_2d(0.0, u, 0.0, v, zarkon);}

double r_total(int n)
{ double sum = 0.0;
int i;
for (i = 1; i<=n; i++) sum += zark(i*1.0, i*2.0);
return sum;}

double sqr(double x) {return x*x;}

double i_total(int n)
{ double sum = 0.0;
int i;
for (i = 1; i<=n; i++) sum += sqr(i*i*1.0);
return sum;}

double error_sum_of_squares(int n)
{ double sum = 0.0;
int i;
for (i = 1; i<=n; i++) sum += sqr(r_total(i)-i_total(i));
return sum;}

int main(void)
{ printf("%f\n", error_sum_of_squares(1000));
return 0;}
-------------------------------------------------------------------------------
#include <math.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#define IMIN(x,y) (x<y?x:y)
#define IMAX(x,y) (x>y?x:y)
#define RMIN(x,y) (x<y?x:y)
#define RMAX(x,y) (x>y?x:y)
struct rectangular {double r; double i;};
FILE *current_output_port = stdout;
void print_flonum(double x);
void print_flonum(double x)
{char buffer[80];
if (x==HUGE_VAL) fputs("#*INFINITY*", current_output_port);
else if (x==-HUGE_VAL) fputs("-#*INFINITY*", current_output_port);
else if (x!=x) fputs("#*NOT-A-NUMBER*", current_output_port);
else
{sprintf(buffer, "%.16g", x);
if (!strpbrk(buffer, ".e")) strcat(buffer, ".0");
fputs(buffer, current_output_port);}}
#define VALUE_OFFSET 0
struct nonheaded_vector_type13
{unsigned int length;
char **element;};
void f0(void);
void f0(void)
{int a676;
int a679;
double a680;
double a681;
int a682;
int a685;
double a686;
double a687;
int a688;
int a691;
double a692;
double a693;
double a694;
double a695;
double a696;
double a697;
double a698;
double a699;
double a700;
double a701;
double a702;
double a703;
double a704;
double a705;
double a706;
double a707;
double a708;
double a709;
double a710;
double a711;
double a712;
double a713;
double a714;
double a715;
double a716;
double a717;
double a718;
double a719;
double a720;
double a721;
double a722;
double a723;
double a724;
double a725;
double a726;
double a727;
double a728;
double a729;
double a730;
double a731;
double a732;
double a733;
double a734;
double a735;
double a736;
double a737;
double a738;
double a739;
double a740;
double a741;
double a742;
double a743;
double a744;
double a745;
double a746;
double a747;
double a748;
double a749;
double a750;
double a751;
double a752;
double a753;
double a754;
double a755;
double a756;
double a757;
double a758;
double a759;
double a760;
double a761;
double a762;
double a763;
double a764;
double a765;
double a766;
double a767;
double a768;
double a769;
double a770;
double a771;
double a772;
double a773;
double a774;
double a775;
double a776;
double a777;
double a778;
double a779;
double a780;
double a781;
double a782;
double a783;
double a784;
double a785;
double a786;
double a787;
double a788;
double a789;
double a790;
double a791;
double a792;
double a793;
double a794;
double a795;
double a796;
double a797;
double a798;
double a799;
double a800;
double a801;
double a802;
double a803;
double a804;
double a805;
double a806;
double a807;
double a808;
double a809;
double a810;
double a811;
double a812;
double a813;
double a814;
double a815;
double a816;
double a817;
double a818;
double a819;
double a820;
double a821;
double a822;
double a823;
double a824;
double a825;
double a826;
double a827;
double a828;
double a829;
double a830;
double a831;
double a832;
double a833;
double a834;
double a835;
double a836;
double a837;
double a838;
double a839;
double a840;
double a841;
double a842;
double a843;
double a844;
double a845;
double a846;
double a847;
double a848;
double a849;
double a850;
double a851;
double a852;
double a853;
double a854;
double a855;
double a856;
double a857;
double a858;
double a859;
double a860;
double a942;
double a943;
double a944;
double a945;
double a946;
double a947;
double a948;
double a949;
double a950;
double a951;
double a952;
double a953;
double a954;
double a955;
double a956;
double a957;
double a958;
double a959;
double a960;
double a961;
double a962;
double a963;
double a964;
double a965;
double a966;
double a967;
double a968;
double a969;
double a970;
double a971;
double a972;
double a973;
double a974;
double a975;
double a976;
double a977;
double a978;
double a979;
double a980;
double a981;
double a982;
double a983;
double a984;
double a985;
double a986;
double a987;
double a988;
double a989;
double a990;
double a991;
double a992;
double a993;
double a994;
double a995;
double a996;
double a997;
double a998;
double a999;
double a1000;
double a1001;
double a1002;
double a1003;
double a1004;
double a1005;
double a1006;
double a1007;
double a1008;
double a1009;
double a1010;
double a1011;
double a1012;
double a1013;
double a1014;
double a1015;
double a1016;
double a1017;
double a1018;
double a1019;
double a1020;
double a1021;
double a1022;
double a1023;
double a1024;
double a1025;
double a1026;
double a1027;
double a1028;
double a1029;
double a1030;
double a1031;
double a1032;
double a1033;
double a1043;
double a1044;
double a1045;
double a1055;
double a1056;
double a1057;
double a1067;
double a1068;
double a1069;
double a1079;
double a1080;
double a1081;
double a1091;
double a1092;
double a1093;
double a1103;
double a1104;
double a1105;
double a1115;
double a1116;
double a1117;
double a1127;
double a1128;
double a1129;
double a1139;
double a1140;
double a1141;
double a1151;
double t0;
int t1;
int t2;
double t3;
int t4;
int t5;
int t6;
double t7;
int t8;
int t9;
double t10;
double t11;
double t12;
double t13;
double t14;
double t15;
double t16;
int t17;
int t18;
double t19;
int t20;
int t21;
int t22;
double t23;
int t24;
int t25;
double t26;
double t27;
double t28;
double t29;
double t30;
double t31;
double t32;
double t33;
double t34;
double t35;
double t36;
double t37;
double t38;
double t39;
double t40;
double t41;
double t42;
double t43;
double t44;
double t45;
double t46;
double t47;
double t48;
double t49;
double t50;
double t51;
double t52;
double t53;
double t54;
double t55;
double t56;
double t57;
double t58;
double t59;
double t60;
double t61;
double t62;
double t63;
double t64;
double t65;
double t66;
double t67;
double t68;
double t69;
double t70;
double t71;
double t72;
double t73;
double t74;
double t75;
double t76;
double t77;
double t78;
double t79;
double t80;
double t81;
double t82;
double t83;
double t84;
double t85;
double t86;
double t87;
double t88;
double t89;
double t90;
double t91;
double t92;
double t93;
double t94;
double t95;
double t96;
double t97;
double t98;
double t99;
double t100;
double t101;
double t102;
double t103;
double t104;
double t105;
double t106;
double t107;
double t108;
double t109;
double t110;
double t111;
double t112;
double t113;
double t114;
double t115;
double t116;
double t117;
double t118;
double t119;
double t120;
double t121;
double t122;
double t123;
double t124;
double t125;
double t126;
double t127;
double t128;
double t129;
double t130;
double t131;
double t132;
double t133;
double t134;
double t135;
double t136;
double t137;
double t138;
double t139;
double t140;
double t141;
double t142;
double t143;
double t144;
double t145;
double t146;
double t147;
double t148;
double t149;
double t150;
double t151;
double t152;
double t153;
double t154;
double t155;
double t156;
double t157;
double t158;
double t159;
double t160;
double t161;
double t162;
double t163;
double t164;
double t165;
double t166;
double t167;
double t168;
double t169;
double t170;
double t171;
double t172;
double t173;
double t174;
double t175;
double t176;
double t177;
double t178;
double t179;
double t180;
double t181;
double t182;
double t183;
double t184;
double t185;
double t186;
double t187;
double t188;
double t189;
double t190;
double t191;
double t192;
double t193;
double t194;
double t195;
double t196;
double t197;
double t198;
double t199;
double t200;
double t201;
double t202;
double t203;
double t204;
double t205;
double t206;
double t207;
double t208;
double t209;
double t210;
double t211;
double t212;
double t213;
double t214;
double t215;
double t216;
double t217;
double t218;
double t219;
double t220;
double t221;
double t222;
double t223;
double t224;
double t225;
double t226;
double t227;
double t228;
double t229;
double t230;
double t231;
double t232;
double t233;
double t234;
double t235;
double t236;
double t237;
double t238;
double t239;
double t240;
double t241;
double t242;
double t243;
double t244;
double t245;
double t246;
double t247;
double t248;
double t249;
double t250;
double t251;
double t252;
double t253;
double t254;
double t255;
double t256;
double t257;
double t258;
double t259;
double t260;
double t261;
double t262;
double t263;
double t264;
double t265;
double t266;
double t267;
double t268;
double t269;
double t270;
double t271;
double t272;
double t273;
double t274;
double t275;
double t276;
double t277;
double t278;
double t279;
double t280;
double t281;
double t282;
double t283;
double t284;
double t285;
double t286;
double t287;
double t288;
double t289;
double t290;
double t291;
double t292;
double t293;
double t294;
double t295;
double t296;
double t297;
double t298;
double t299;
double t300;
double t301;
double t302;
double t303;
double t304;
double t305;
double t306;
double t307;
double t308;
double t309;
double t310;
double t311;
double t312;
double t313;
double t314;
double t315;
double t316;
double t317;
double t318;
double t319;
double t320;
double t321;
double t322;
double t323;
double t324;
double t325;
double t326;
double t327;
double t328;
double t329;
double t330;
double t331;
double t332;
double t333;
double t334;
double t335;
double t336;
double t337;
double t338;
double t339;
double t340;
double t341;
double t342;
double t343;
double t344;
double t345;
double t346;
double t347;
double t348;
double t349;
double t350;
double t351;
double t352;
double t353;
double t354;
double t355;
double t356;
double t357;
double t358;
double t359;
double t360;
double t361;
double t362;
double t363;
double t364;
double t365;
double t366;
double t367;
double t368;
double t369;
double t370;
double t371;
double t372;
double t373;
double t374;
double t375;
double t376;
double t377;
double t378;
double t379;
double t380;
double t381;
double t382;
double t383;
double t384;
double t385;
double t386;
double t387;
double t388;
double t389;
double t390;
double t391;
double t392;
double t393;
double t394;
double t395;
double t396;
double t397;
double t398;
double t399;
double t400;
double t401;
double t402;
double t403;
double t404;
double t405;
double t406;
double t407;
double t408;
double t409;
double t410;
double t411;
double t412;
double t413;
double t414;
double t415;
double t416;
double t417;
double t418;
double t419;
double t420;
double t421;
double t422;
double t423;
double t424;
double t425;
double t426;
double t427;
double t428;
double t429;
double t430;
double t431;
double t432;
double t433;
double t434;
double t435;
double t436;
double t437;
double t438;
double t439;
double t440;
double t441;
double t442;
double t443;
double t444;
double t445;
double t446;
double t447;
double t448;
double t449;
double t450;
double t451;
double t452;
double t453;
double t454;
double t455;
double t456;
double t457;
double t458;
double t459;
double t460;
double t461;
double t462;
double t463;
double t464;
double t465;
double t466;
double t467;
double t468;
double t469;
double t470;
double t471;
double t472;
double t473;
double t474;
double t475;
double t476;
double t477;
double t478;
double t479;
double t480;
double t481;
double t482;
double t483;
double t484;
double t485;
double t486;
double t487;
double t488;
double t489;
double t490;
double t491;
double t492;
double t493;
double t494;
double t495;
double t496;
double t497;
double t498;
double t499;
double t500;
double t501;
double t502;
double t503;
double t504;
double t505;
double t506;
double t507;
double t508;
double t509;
double t510;
double t511;
double t512;
double t513;
double t514;
double t515;
double t516;
double t517;
double t518;
double t519;
double t520;
double t521;
double t522;
double t523;
double t524;
double t525;
double t526;
double t527;
double t528;
double t529;
double t530;
double t531;
double t532;
double t533;
double t534;
double t535;
double t536;
double t537;
double t538;
double t539;
double t540;
double t541;
double t542;
double t543;
double t544;
double t545;
double t546;
double t547;
double t548;
double t549;
double t550;
double t551;
double t552;
double t553;
double t554;
double t555;
double t556;
double t557;
double t558;
double t559;
double t560;
double t561;
double t562;
double t563;
double t564;
double t565;
double t566;
double t567;
double t568;
double t569;
double t570;
double t571;
double t572;
double t573;
double t574;
double t575;
double t576;
double t577;
double t578;
double t579;
double t580;
double t581;
double t582;
double t583;
double t584;
double t585;
double t586;
double t587;
double t588;
double t589;
double t590;
double t591;
double t592;
double t593;
double t594;
double t595;
double t596;
double t597;
double t598;
double t599;
double t600;
double t601;
double t602;
double t603;
double t604;
double t605;
double t606;
double t607;
double t608;
double t609;
double t610;
double t611;
double t612;
double t613;
double t614;
double t615;
double t616;
double t617;
double t618;
double t619;
double t620;
double t621;
double t622;
double t623;
double t624;
double t625;
double t626;
double t627;
double t628;
double t629;
double t630;
double t631;
double t632;
double t633;
double t634;
double t635;
double t636;
double t637;
double t638;
double t639;
double t640;
double t641;
double t642;
double t643;
double t644;
double t645;
double t646;
double t647;
double t648;
double t649;
double t650;
double t651;
double t652;
double t653;
double t654;
double t655;
double t656;
double t657;
double t658;
double t659;
double t660;
double t661;
double t662;
double t663;
double t664;
double t665;
double t666;
double t667;
double t668;
double t669;
double t670;
double t671;
double t672;
double t673;
double t674;
double t675;
double t676;
double t677;
double t678;
double t679;
double t680;
double t681;
double t682;
double t683;
double t684;
double t685;
double t686;
double t687;
double t688;
double t689;
double t690;
double t691;
double t692;
double t693;
double t694;
double t695;
double t696;
double t697;
double t698;
double t699;
double t700;
double t701;
double t702;
double t703;
double t704;
double t705;
double t706;
double t707;
double t708;
double t709;
double t710;
double t711;
double t712;
double t713;
double t714;
double t715;
double t716;
double t717;
double t718;
double t719;
double t720;
double t721;
double t722;
double t723;
double t724;
double t725;
double t726;
double t727;
double t728;
double t729;
double t730;
double t731;
double t732;
double t733;
double t734;
double t735;
double t736;
double t737;
double t738;
double t739;
double t740;
double t741;
double t742;
double t743;
double t744;
double t745;
double t746;
double t747;
double t748;
double t749;
double t750;
double t751;
double t752;
double t753;
double t754;
double t755;
double t756;
double t757;
double t758;
double t759;
double t760;
double t761;
double t762;
double t763;
double t764;
double t765;
double t766;
double t767;
double t768;
double t769;
double t770;
double t771;
double t772;
double t773;
double t774;
double t775;
double t776;
double t777;
double t778;
double t779;
double t780;
double t781;
double t782;
double t783;
double t784;
double t785;
double t786;
double t787;
double t788;
double t789;
double t790;
double t791;
double t792;
double t793;
double t794;
double t795;
double t796;
double t797;
double t798;
double t799;
double t800;
double t801;
double t802;
double t803;
double t804;
double t805;
double t806;
double t807;
double t808;
double t809;
double t810;
double t811;
double t812;
double t813;
double t814;
double t815;
double t816;
double t817;
double t818;
double t819;
double t820;
double t821;
double t822;
double t823;
double t824;
double t825;
double t826;
double t827;
double t828;
double t829;
double t830;
double t831;
double t832;
double t833;
double t834;
double t835;
double t836;
double t837;
double t838;
double t839;
double t840;
double t841;
double t842;
double t843;
double t844;
double t845;
double t846;
double t847;
double t848;
double t849;
double t850;
double t851;
double t852;
double t853;
double t854;
double t855;
double t856;
double t857;
double t858;
double t859;
double t860;
double t861;
double t862;
double t863;
double t864;
double t865;
double t866;
double t867;
double t868;
double t869;
double t870;
double t871;
double t872;
double t873;
double t874;
double t875;
double t876;
double t877;
double t878;
double t879;
double t880;
double t881;
double t882;
double t883;
double t884;
double t885;
double t886;
double t887;
double t888;
double t889;
double t890;
double t891;
double t892;
double t893;
double t894;
double t895;
double t896;
double t897;
double t898;
double t899;
double t900;
double t901;
double t902;
double t903;
double t904;
double t905;
double t906;
double t907;
int t908;
double t909;
int t910;
double t911;
int t912;
int t913;
double t914;
int t915;
int t916;
int t917;
double t918;
int t919;
int t920;
double t921;
double t922;
double t923;
double t924;
double t925;
int t926;
double t927;
int t928;
int t929;
t1 = 1000;
a676 = t1;
t2 = 1;
t3 = 0.;
a679 = t2;
a680 = t3;
h241:
t4 = a679;
t5 = a676;
if (!(t4>t5)) goto l1;
t0 = a680;
goto l2;
l1:
t8 = a679;
t9 = 1;
t6 = t8+t9;
t10 = a680;
t17 = a679;
a688 = t17;
t18 = 1;
t19 = 0.;
a691 = t18;
a692 = t19;
h257:
t20 = a691;
t21 = a688;
if (!(t20>t21)) goto l4;
t15 = a692;
goto l5;
l4:
t24 = a691;
t25 = 1;
t22 = t24+t25;
t26 = a692;
t908 = a691;
t909 = 1.;
t28 = t908*t909;
t910 = a691;
t911 = 2.;
t29 = t910*t911;
a693 = t28;
a694 = t29;
t30 = 0.;
t31 = a693;
t32 = 0.;
t33 = a694;
a857 = t30;
a858 = t31;
a859 = t32;
a860 = t33;
t34 = a859;
t35 = a860;
a1140 = t34;
a1141 = t35;
t906 = a1141;
t907 = a1140;
t904 = t906-t907;
t905 = 8.;
t36 = t904/t905;
a1151 = t36;
t50 = a1140;
a942 = t50;
t51 = a857;
t52 = a858;
a1128 = t51;
a1129 = t52;
t140 = a1129;
t141 = a1128;
t138 = t140-t141;
t139 = 8.;
t53 = t138/t139;
a1139 = t53;
t67 = a1128;
a943 = t67;
t68 = a943;
t69 = a942;
a695 = t68;
a696 = t69;
t70 = a695;
t71 = a696;
t65 = t70*t71;
t66 = 0.5;
t56 = t65*t66;
t77 = a1128;
t78 = a1139;
t72 = t77+t78;
a944 = t72;
t73 = a944;
t74 = a942;
a697 = t73;
a698 = t74;
t75 = a697;
t76 = a698;
t57 = t75*t76;
t84 = a1128;
t86 = 2.;
t87 = a1139;
t85 = t86*t87;
t79 = t84+t85;
a945 = t79;
t80 = a945;
t81 = a942;
a699 = t80;
a700 = t81;
t82 = a699;
t83 = a700;
t58 = t82*t83;
t93 = a1128;
t95 = 3.;
t96 = a1139;
t94 = t95*t96;
t88 = t93+t94;
a946 = t88;
t89 = a946;
t90 = a942;
a701 = t89;
a702 = t90;
t91 = a701;
t92 = a702;
t59 = t91*t92;
t102 = a1128;
t104 = 4.;
t105 = a1139;
t103 = t104*t105;
t97 = t102+t103;
a947 = t97;
t98 = a947;
t99 = a942;
a703 = t98;
a704 = t99;
t100 = a703;
t101 = a704;
t60 = t100*t101;
t111 = a1129;
t113 = 3.;
t114 = a1139;
t112 = t113*t114;
t106 = t111-t112;
a948 = t106;
t107 = a948;
t108 = a942;
a705 = t107;
a706 = t108;
t109 = a705;
t110 = a706;
t61 = t109*t110;
t120 = a1129;
t122 = 2.;
t123 = a1139;
t121 = t122*t123;
t115 = t120-t121;
a949 = t115;
t116 = a949;
t117 = a942;
a707 = t116;
a708 = t117;
t118 = a707;
t119 = a708;
t62 = t118*t119;
t129 = a1129;
t130 = a1139;
t124 = t129-t130;
a950 = t124;
t125 = a950;
t126 = a942;
a709 = t125;
a710 = t126;
t127 = a709;
t128 = a710;
t63 = t127*t128;
t133 = a1129;
a951 = t133;
t134 = a951;
t135 = a942;
a711 = t134;
a712 = t135;
t136 = a711;
t137 = a712;
t131 = t136*t137;
t132 = 0.5;
t64 = t131*t132;
t54 = (((((((t56+t57)+t58)+t59)+t60)+t61)+t62)+t63)+t64;
t55 = a1139;
t48 = t54*t55;
t49 = 0.5;
t39 = t48*t49;
t234 = a1140;
t235 = a1151;
t142 = t234+t235;
a952 = t142;
t143 = a857;
t144 = a858;
a1116 = t143;
a1117 = t144;
t232 = a1117;
t233 = a1116;
t230 = t232-t233;
t231 = 8.;
t145 = t230/t231;
a1127 = t145;
t159 = a1116;
a953 = t159;
t160 = a953;
t161 = a952;
a713 = t160;
a714 = t161;
t162 = a713;
t163 = a714;
t157 = t162*t163;
t158 = 0.5;
t148 = t157*t158;
t169 = a1116;
t170 = a1127;
t164 = t169+t170;
a954 = t164;
t165 = a954;
t166 = a952;
a715 = t165;
a716 = t166;
t167 = a715;
t168 = a716;
t149 = t167*t168;
t176 = a1116;
t178 = 2.;
t179 = a1127;
t177 = t178*t179;
t171 = t176+t177;
a955 = t171;
t172 = a955;
t173 = a952;
a717 = t172;
a718 = t173;
t174 = a717;
t175 = a718;
t150 = t174*t175;
t185 = a1116;
t187 = 3.;
t188 = a1127;
t186 = t187*t188;
t180 = t185+t186;
a956 = t180;
t181 = a956;
t182 = a952;
a719 = t181;
a720 = t182;
t183 = a719;
t184 = a720;
t151 = t183*t184;
t194 = a1116;
t196 = 4.;
t197 = a1127;
t195 = t196*t197;
t189 = t194+t195;
a957 = t189;
t190 = a957;
t191 = a952;
a721 = t190;
a722 = t191;
t192 = a721;
t193 = a722;
t152 = t192*t193;
t203 = a1117;
t205 = 3.;
t206 = a1127;
t204 = t205*t206;
t198 = t203-t204;
a958 = t198;
t199 = a958;
t200 = a952;
a723 = t199;
a724 = t200;
t201 = a723;
t202 = a724;
t153 = t201*t202;
t212 = a1117;
t214 = 2.;
t215 = a1127;
t213 = t214*t215;
t207 = t212-t213;
a959 = t207;
t208 = a959;
t209 = a952;
a725 = t208;
a726 = t209;
t210 = a725;
t211 = a726;
t154 = t210*t211;
t221 = a1117;
t222 = a1127;
t216 = t221-t222;
a960 = t216;
t217 = a960;
t218 = a952;
a727 = t217;
a728 = t218;
t219 = a727;
t220 = a728;
t155 = t219*t220;
t225 = a1117;
a961 = t225;
t226 = a961;
t227 = a952;
a729 = t226;
a730 = t227;
t228 = a729;
t229 = a730;
t223 = t228*t229;
t224 = 0.5;
t156 = t223*t224;
t146 = (((((((t148+t149)+t150)+t151)+t152)+t153)+t154)+t155)+t156;
t147 = a1127;
t40 = t146*t147;
t328 = a1140;
t330 = 2.;
t331 = a1151;
t329 = t330*t331;
t236 = t328+t329;
a962 = t236;
t237 = a857;
t238 = a858;
a1104 = t237;
a1105 = t238;
t326 = a1105;
t327 = a1104;
t324 = t326-t327;
t325 = 8.;
t239 = t324/t325;
a1115 = t239;
t253 = a1104;
a963 = t253;
t254 = a963;
t255 = a962;
a731 = t254;
a732 = t255;
t256 = a731;
t257 = a732;
t251 = t256*t257;
t252 = 0.5;
t242 = t251*t252;
t263 = a1104;
t264 = a1115;
t258 = t263+t264;
a964 = t258;
t259 = a964;
t260 = a962;
a733 = t259;
a734 = t260;
t261 = a733;
t262 = a734;
t243 = t261*t262;
t270 = a1104;
t272 = 2.;
t273 = a1115;
t271 = t272*t273;
t265 = t270+t271;
a965 = t265;
t266 = a965;
t267 = a962;
a735 = t266;
a736 = t267;
t268 = a735;
t269 = a736;
t244 = t268*t269;
t279 = a1104;
t281 = 3.;
t282 = a1115;
t280 = t281*t282;
t274 = t279+t280;
a966 = t274;
t275 = a966;
t276 = a962;
a737 = t275;
a738 = t276;
t277 = a737;
t278 = a738;
t245 = t277*t278;
t288 = a1104;
t290 = 4.;
t291 = a1115;
t289 = t290*t291;
t283 = t288+t289;
a967 = t283;
t284 = a967;
t285 = a962;
a739 = t284;
a740 = t285;
t286 = a739;
t287 = a740;
t246 = t286*t287;
t297 = a1105;
t299 = 3.;
t300 = a1115;
t298 = t299*t300;
t292 = t297-t298;
a968 = t292;
t293 = a968;
t294 = a962;
a741 = t293;
a742 = t294;
t295 = a741;
t296 = a742;
t247 = t295*t296;
t306 = a1105;
t308 = 2.;
t309 = a1115;
t307 = t308*t309;
t301 = t306-t307;
a969 = t301;
t302 = a969;
t303 = a962;
a743 = t302;
a744 = t303;
t304 = a743;
t305 = a744;
t248 = t304*t305;
t315 = a1105;
t316 = a1115;
t310 = t315-t316;
a970 = t310;
t311 = a970;
t312 = a962;
a745 = t311;
a746 = t312;
t313 = a745;
t314 = a746;
t249 = t313*t314;
t319 = a1105;
a971 = t319;
t320 = a971;
t321 = a962;
a747 = t320;
a748 = t321;
t322 = a747;
t323 = a748;
t317 = t322*t323;
t318 = 0.5;
t250 = t317*t318;
t240 = (((((((t242+t243)+t244)+t245)+t246)+t247)+t248)+t249)+t250;
t241 = a1115;
t41 = t240*t241;
t424 = a1140;
t426 = 3.;
t427 = a1151;
t425 = t426*t427;
t332 = t424+t425;
a972 = t332;
t333 = a857;
t334 = a858;
a1092 = t333;
a1093 = t334;
t422 = a1093;
t423 = a1092;
t420 = t422-t423;
t421 = 8.;
t335 = t420/t421;
a1103 = t335;
t349 = a1092;
a973 = t349;
t350 = a973;
t351 = a972;
a749 = t350;
a750 = t351;
t352 = a749;
t353 = a750;
t347 = t352*t353;
t348 = 0.5;
t338 = t347*t348;
t359 = a1092;
t360 = a1103;
t354 = t359+t360;
a974 = t354;
t355 = a974;
t356 = a972;
a751 = t355;
a752 = t356;
t357 = a751;
t358 = a752;
t339 = t357*t358;
t366 = a1092;
t368 = 2.;
t369 = a1103;
t367 = t368*t369;
t361 = t366+t367;
a975 = t361;
t362 = a975;
t363 = a972;
a753 = t362;
a754 = t363;
t364 = a753;
t365 = a754;
t340 = t364*t365;
t375 = a1092;
t377 = 3.;
t378 = a1103;
t376 = t377*t378;
t370 = t375+t376;
a976 = t370;
t371 = a976;
t372 = a972;
a755 = t371;
a756 = t372;
t373 = a755;
t374 = a756;
t341 = t373*t374;
t384 = a1092;
t386 = 4.;
t387 = a1103;
t385 = t386*t387;
t379 = t384+t385;
a977 = t379;
t380 = a977;
t381 = a972;
a757 = t380;
a758 = t381;
t382 = a757;
t383 = a758;
t342 = t382*t383;
t393 = a1093;
t395 = 3.;
t396 = a1103;
t394 = t395*t396;
t388 = t393-t394;
a978 = t388;
t389 = a978;
t390 = a972;
a759 = t389;
a760 = t390;
t391 = a759;
t392 = a760;
t343 = t391*t392;
t402 = a1093;
t404 = 2.;
t405 = a1103;
t403 = t404*t405;
t397 = t402-t403;
a979 = t397;
t398 = a979;
t399 = a972;
a761 = t398;
a762 = t399;
t400 = a761;
t401 = a762;
t344 = t400*t401;
t411 = a1093;
t412 = a1103;
t406 = t411-t412;
a980 = t406;
t407 = a980;
t408 = a972;
a763 = t407;
a764 = t408;
t409 = a763;
t410 = a764;
t345 = t409*t410;
t415 = a1093;
a981 = t415;
t416 = a981;
t417 = a972;
a765 = t416;
a766 = t417;
t418 = a765;
t419 = a766;
t413 = t418*t419;
t414 = 0.5;
t346 = t413*t414;
t336 = (((((((t338+t339)+t340)+t341)+t342)+t343)+t344)+t345)+t346;
t337 = a1103;
t42 = t336*t337;
t520 = a1140;
t522 = 4.;
t523 = a1151;
t521 = t522*t523;
t428 = t520+t521;
a982 = t428;
t429 = a857;
t430 = a858;
a1080 = t429;
a1081 = t430;
t518 = a1081;
t519 = a1080;
t516 = t518-t519;
t517 = 8.;
t431 = t516/t517;
a1091 = t431;
t445 = a1080;
a983 = t445;
t446 = a983;
t447 = a982;
a767 = t446;
a768 = t447;
t448 = a767;
t449 = a768;
t443 = t448*t449;
t444 = 0.5;
t434 = t443*t444;
t455 = a1080;
t456 = a1091;
t450 = t455+t456;
a984 = t450;
t451 = a984;
t452 = a982;
a769 = t451;
a770 = t452;
t453 = a769;
t454 = a770;
t435 = t453*t454;
t462 = a1080;
t464 = 2.;
t465 = a1091;
t463 = t464*t465;
t457 = t462+t463;
a985 = t457;
t458 = a985;
t459 = a982;
a771 = t458;
a772 = t459;
t460 = a771;
t461 = a772;
t436 = t460*t461;
t471 = a1080;
t473 = 3.;
t474 = a1091;
t472 = t473*t474;
t466 = t471+t472;
a986 = t466;
t467 = a986;
t468 = a982;
a773 = t467;
a774 = t468;
t469 = a773;
t470 = a774;
t437 = t469*t470;
t480 = a1080;
t482 = 4.;
t483 = a1091;
t481 = t482*t483;
t475 = t480+t481;
a987 = t475;
t476 = a987;
t477 = a982;
a775 = t476;
a776 = t477;
t478 = a775;
t479 = a776;
t438 = t478*t479;
t489 = a1081;
t491 = 3.;
t492 = a1091;
t490 = t491*t492;
t484 = t489-t490;
a988 = t484;
t485 = a988;
t486 = a982;
a777 = t485;
a778 = t486;
t487 = a777;
t488 = a778;
t439 = t487*t488;
t498 = a1081;
t500 = 2.;
t501 = a1091;
t499 = t500*t501;
t493 = t498-t499;
a989 = t493;
t494 = a989;
t495 = a982;
a779 = t494;
a780 = t495;
t496 = a779;
t497 = a780;
t440 = t496*t497;
t507 = a1081;
t508 = a1091;
t502 = t507-t508;
a990 = t502;
t503 = a990;
t504 = a982;
a781 = t503;
a782 = t504;
t505 = a781;
t506 = a782;
t441 = t505*t506;
t511 = a1081;
a991 = t511;
t512 = a991;
t513 = a982;
a783 = t512;
a784 = t513;
t514 = a783;
t515 = a784;
t509 = t514*t515;
t510 = 0.5;
t442 = t509*t510;
t432 = (((((((t434+t435)+t436)+t437)+t438)+t439)+t440)+t441)+t442;
t433 = a1091;
t43 = t432*t433;
t616 = a1141;
t618 = 3.;
t619 = a1151;
t617 = t618*t619;
t524 = t616-t617;
a992 = t524;
t525 = a857;
t526 = a858;
a1068 = t525;
a1069 = t526;
t614 = a1069;
t615 = a1068;
t612 = t614-t615;
t613 = 8.;
t527 = t612/t613;
a1079 = t527;
t541 = a1068;
a993 = t541;
t542 = a993;
t543 = a992;
a785 = t542;
a786 = t543;
t544 = a785;
t545 = a786;
t539 = t544*t545;
t540 = 0.5;
t530 = t539*t540;
t551 = a1068;
t552 = a1079;
t546 = t551+t552;
a994 = t546;
t547 = a994;
t548 = a992;
a787 = t547;
a788 = t548;
t549 = a787;
t550 = a788;
t531 = t549*t550;
t558 = a1068;
t560 = 2.;
t561 = a1079;
t559 = t560*t561;
t553 = t558+t559;
a995 = t553;
t554 = a995;
t555 = a992;
a789 = t554;
a790 = t555;
t556 = a789;
t557 = a790;
t532 = t556*t557;
t567 = a1068;
t569 = 3.;
t570 = a1079;
t568 = t569*t570;
t562 = t567+t568;
a996 = t562;
t563 = a996;
t564 = a992;
a791 = t563;
a792 = t564;
t565 = a791;
t566 = a792;
t533 = t565*t566;
t576 = a1068;
t578 = 4.;
t579 = a1079;
t577 = t578*t579;
t571 = t576+t577;
a997 = t571;
t572 = a997;
t573 = a992;
a793 = t572;
a794 = t573;
t574 = a793;
t575 = a794;
t534 = t574*t575;
t585 = a1069;
t587 = 3.;
t588 = a1079;
t586 = t587*t588;
t580 = t585-t586;
a998 = t580;
t581 = a998;
t582 = a992;
a795 = t581;
a796 = t582;
t583 = a795;
t584 = a796;
t535 = t583*t584;
t594 = a1069;
t596 = 2.;
t597 = a1079;
t595 = t596*t597;
t589 = t594-t595;
a999 = t589;
t590 = a999;
t591 = a992;
a797 = t590;
a798 = t591;
t592 = a797;
t593 = a798;
t536 = t592*t593;
t603 = a1069;
t604 = a1079;
t598 = t603-t604;
a1000 = t598;
t599 = a1000;
t600 = a992;
a799 = t599;
a800 = t600;
t601 = a799;
t602 = a800;
t537 = t601*t602;
t607 = a1069;
a1001 = t607;
t608 = a1001;
t609 = a992;
a801 = t608;
a802 = t609;
t610 = a801;
t611 = a802;
t605 = t610*t611;
t606 = 0.5;
t538 = t605*t606;
t528 = (((((((t530+t531)+t532)+t533)+t534)+t535)+t536)+t537)+t538;
t529 = a1079;
t44 = t528*t529;
t712 = a1141;
t714 = 2.;
t715 = a1151;
t713 = t714*t715;
t620 = t712-t713;
a1002 = t620;
t621 = a857;
t622 = a858;
a1056 = t621;
a1057 = t622;
t710 = a1057;
t711 = a1056;
t708 = t710-t711;
t709 = 8.;
t623 = t708/t709;
a1067 = t623;
t637 = a1056;
a1003 = t637;
t638 = a1003;
t639 = a1002;
a803 = t638;
a804 = t639;
t640 = a803;
t641 = a804;
t635 = t640*t641;
t636 = 0.5;
t626 = t635*t636;
t647 = a1056;
t648 = a1067;
t642 = t647+t648;
a1004 = t642;
t643 = a1004;
t644 = a1002;
a805 = t643;
a806 = t644;
t645 = a805;
t646 = a806;
t627 = t645*t646;
t654 = a1056;
t656 = 2.;
t657 = a1067;
t655 = t656*t657;
t649 = t654+t655;
a1005 = t649;
t650 = a1005;
t651 = a1002;
a807 = t650;
a808 = t651;
t652 = a807;
t653 = a808;
t628 = t652*t653;
t663 = a1056;
t665 = 3.;
t666 = a1067;
t664 = t665*t666;
t658 = t663+t664;
a1006 = t658;
t659 = a1006;
t660 = a1002;
a809 = t659;
a810 = t660;
t661 = a809;
t662 = a810;
t629 = t661*t662;
t672 = a1056;
t674 = 4.;
t675 = a1067;
t673 = t674*t675;
t667 = t672+t673;
a1007 = t667;
t668 = a1007;
t669 = a1002;
a811 = t668;
a812 = t669;
t670 = a811;
t671 = a812;
t630 = t670*t671;
t681 = a1057;
t683 = 3.;
t684 = a1067;
t682 = t683*t684;
t676 = t681-t682;
a1008 = t676;
t677 = a1008;
t678 = a1002;
a813 = t677;
a814 = t678;
t679 = a813;
t680 = a814;
t631 = t679*t680;
t690 = a1057;
t692 = 2.;
t693 = a1067;
t691 = t692*t693;
t685 = t690-t691;
a1009 = t685;
t686 = a1009;
t687 = a1002;
a815 = t686;
a816 = t687;
t688 = a815;
t689 = a816;
t632 = t688*t689;
t699 = a1057;
t700 = a1067;
t694 = t699-t700;
a1010 = t694;
t695 = a1010;
t696 = a1002;
a817 = t695;
a818 = t696;
t697 = a817;
t698 = a818;
t633 = t697*t698;
t703 = a1057;
a1011 = t703;
t704 = a1011;
t705 = a1002;
a819 = t704;
a820 = t705;
t706 = a819;
t707 = a820;
t701 = t706*t707;
t702 = 0.5;
t634 = t701*t702;
t624 = (((((((t626+t627)+t628)+t629)+t630)+t631)+t632)+t633)+t634;
t625 = a1067;
t45 = t624*t625;
t808 = a1141;
t809 = a1151;
t716 = t808-t809;
a1012 = t716;
t717 = a857;
t718 = a858;
a1044 = t717;
a1045 = t718;
t806 = a1045;
t807 = a1044;
t804 = t806-t807;
t805 = 8.;
t719 = t804/t805;
a1055 = t719;
t733 = a1044;
a1013 = t733;
t734 = a1013;
t735 = a1012;
a821 = t734;
a822 = t735;
t736 = a821;
t737 = a822;
t731 = t736*t737;
t732 = 0.5;
t722 = t731*t732;
t743 = a1044;
t744 = a1055;
t738 = t743+t744;
a1014 = t738;
t739 = a1014;
t740 = a1012;
a823 = t739;
a824 = t740;
t741 = a823;
t742 = a824;
t723 = t741*t742;
t750 = a1044;
t752 = 2.;
t753 = a1055;
t751 = t752*t753;
t745 = t750+t751;
a1015 = t745;
t746 = a1015;
t747 = a1012;
a825 = t746;
a826 = t747;
t748 = a825;
t749 = a826;
t724 = t748*t749;
t759 = a1044;
t761 = 3.;
t762 = a1055;
t760 = t761*t762;
t754 = t759+t760;
a1016 = t754;
t755 = a1016;
t756 = a1012;
a827 = t755;
a828 = t756;
t757 = a827;
t758 = a828;
t725 = t757*t758;
t768 = a1044;
t770 = 4.;
t771 = a1055;
t769 = t770*t771;
t763 = t768+t769;
a1017 = t763;
t764 = a1017;
t765 = a1012;
a829 = t764;
a830 = t765;
t766 = a829;
t767 = a830;
t726 = t766*t767;
t777 = a1045;
t779 = 3.;
t780 = a1055;
t778 = t779*t780;
t772 = t777-t778;
a1018 = t772;
t773 = a1018;
t774 = a1012;
a831 = t773;
a832 = t774;
t775 = a831;
t776 = a832;
t727 = t775*t776;
t786 = a1045;
t788 = 2.;
t789 = a1055;
t787 = t788*t789;
t781 = t786-t787;
a1019 = t781;
t782 = a1019;
t783 = a1012;
a833 = t782;
a834 = t783;
t784 = a833;
t785 = a834;
t728 = t784*t785;
t795 = a1045;
t796 = a1055;
t790 = t795-t796;
a1020 = t790;
t791 = a1020;
t792 = a1012;
a835 = t791;
a836 = t792;
t793 = a835;
t794 = a836;
t729 = t793*t794;
t799 = a1045;
a1021 = t799;
t800 = a1021;
t801 = a1012;
a837 = t800;
a838 = t801;
t802 = a837;
t803 = a838;
t797 = t802*t803;
t798 = 0.5;
t730 = t797*t798;
t720 = (((((((t722+t723)+t724)+t725)+t726)+t727)+t728)+t729)+t730;
t721 = a1055;
t46 = t720*t721;
t812 = a1141;
a1022 = t812;
t813 = a857;
t814 = a858;
a1032 = t813;
a1033 = t814;
t902 = a1033;
t903 = a1032;
t900 = t902-t903;
t901 = 8.;
t815 = t900/t901;
a1043 = t815;
t829 = a1032;
a1023 = t829;
t830 = a1023;
t831 = a1022;
a839 = t830;
a840 = t831;
t832 = a839;
t833 = a840;
t827 = t832*t833;
t828 = 0.5;
t818 = t827*t828;
t839 = a1032;
t840 = a1043;
t834 = t839+t840;
a1024 = t834;
t835 = a1024;
t836 = a1022;
a841 = t835;
a842 = t836;
t837 = a841;
t838 = a842;
t819 = t837*t838;
t846 = a1032;
t848 = 2.;
t849 = a1043;
t847 = t848*t849;
t841 = t846+t847;
a1025 = t841;
t842 = a1025;
t843 = a1022;
a843 = t842;
a844 = t843;
t844 = a843;
t845 = a844;
t820 = t844*t845;
t855 = a1032;
t857 = 3.;
t858 = a1043;
t856 = t857*t858;
t850 = t855+t856;
a1026 = t850;
t851 = a1026;
t852 = a1022;
a845 = t851;
a846 = t852;
t853 = a845;
t854 = a846;
t821 = t853*t854;
t864 = a1032;
t866 = 4.;
t867 = a1043;
t865 = t866*t867;
t859 = t864+t865;
a1027 = t859;
t860 = a1027;
t861 = a1022;
a847 = t860;
a848 = t861;
t862 = a847;
t863 = a848;
t822 = t862*t863;
t873 = a1033;
t875 = 3.;
t876 = a1043;
t874 = t875*t876;
t868 = t873-t874;
a1028 = t868;
t869 = a1028;
t870 = a1022;
a849 = t869;
a850 = t870;
t871 = a849;
t872 = a850;
t823 = t871*t872;
t882 = a1033;
t884 = 2.;
t885 = a1043;
t883 = t884*t885;
t877 = t882-t883;
a1029 = t877;
t878 = a1029;
t879 = a1022;
a851 = t878;
a852 = t879;
t880 = a851;
t881 = a852;
t824 = t880*t881;
t891 = a1033;
t892 = a1043;
t886 = t891-t892;
a1030 = t886;
t887 = a1030;
t888 = a1022;
a853 = t887;
a854 = t888;
t889 = a853;
t890 = a854;
t825 = t889*t890;
t895 = a1033;
a1031 = t895;
t896 = a1031;
t897 = a1022;
a855 = t896;
a856 = t897;
t898 = a855;
t899 = a856;
t893 = t898*t899;
t894 = 0.5;
t826 = t893*t894;
t816 = (((((((t818+t819)+t820)+t821)+t822)+t823)+t824)+t825)+t826;
t817 = a1043;
t810 = t816*t817;
t811 = 0.5;
t47 = t810*t811;
t37 = (((((((t39+t40)+t41)+t42)+t43)+t44)+t45)+t46)+t47;
t38 = a1151;
t27 = t37*t38;
t23 = t26+t27;
a691 = t22;
a692 = t23;
goto h257;
l5:
t912 = a679;
a682 = t912;
t913 = 1;
t914 = 0.;
a685 = t913;
a686 = t914;
h249:
t915 = a685;
t916 = a682;
if (!(t915>t916)) goto l7;
t16 = a686;
goto l8;
l7:
t919 = a685;
t920 = 1;
t917 = t919+t920;
t921 = a686;
t928 = a685;
t929 = a685;
t926 = t928*t929;
t927 = 1.;
t923 = t926*t927;
a687 = t923;
t924 = a687;
t925 = a687;
t922 = t924*t925;
t918 = t921+t922;
a685 = t917;
a686 = t918;
goto h249;
l8:
t12 = t15-t16;
a681 = t12;
t13 = a681;
t14 = a681;
t11 = t13*t14;
t7 = t10+t11;
a679 = t6;
a680 = t7;
goto h241;
l2:
print_flonum(t0);
putc('\n', current_output_port);
return;}
int main(void)
{assert(offsetof(struct{char dummy; char probe;}, probe)==1);
assert(sizeof(int)==4);
assert(sizeof(char*)==4);
assert(sizeof(unsigned int)==4);
assert(sizeof(unsigned int)==4);
assert(sizeof(int)==4);
f0();
return 0;}

Alexey Goldin

Apr 1, 1998

Jeffrey Mark Siskind <qo...@research.nj.nec.com> writes:

> Enclosed is the handwritten C code and the Stalin-generated C code. This was
> generated by the development version of Stalin. Since polyvariance is
> currently broken in the development compiler, I simulated polyvariance
> manually by giving it the enclosed Scheme program. Note that even though that
> Scheme program is horrendous, the code generated is very good. Stalin flattens
> it into a single C function. And even though there are many copies of the
> parameters F1 ... F9, they are all eliminated because they are fictitious.

Thank you, I was using Stalin-0.7


Niels Möller

Apr 6, 1998

Jeffrey Mark Siskind <qo...@research.nj.nec.com> writes:

> (Note that the above C code uses GNU C extensions, i.e. nested functions,
> to simulate downward funargs, i.e. lambdas.)

I don't think it's quite fair to use GNU-C nested functions in a
benchmark on C... They are convenient to use, but the trampoline code
that is generated probably slows down function calls through the
trampoline _significantly_. What figures do you get if you pass
pointers to local data explicitly instead of relying on GNU-C? Without
trampolines, the C compiler still won't eliminate the function calls,
but it will at least use standard C calling conventions, which are
reasonably fast.

Details, if anybody is interested... The GNU-C trampoline code is
implemented differently on different machines, I guess, but on the
single machine where I have looked at it, taking a pointer to a nested
function (if that nested function references variables in the
containing scope) is quite complicated:

First, an executable trampoline function is generated on the
stack(!), and the address of this trampoline function is used. When
the trampoline function is called, the trampoline will copy the
referenced stack frame and call the real function. This is an
interesting hack, but the overhead for calling through the function
pointer is much higher than for ordinary C functions.

/Niels

Jeffrey Mark Siskind

Apr 6, 1998, to Niels Möller

If you look at my previous posting you will see three C programs appended.
The first is a handwritten C program that uses GNU-C nested functions to do
double integration. The second is another handwritten C program to do the same
thing. But it doesn't use nested functions. In fact, it doesn't even simulate
closures, even though the original Scheme program has nested lambdas with free
variables. The reason is that it is a property of that program that those
closures are `single' i.e. only one live copy exists at any point in time. So
the variables that are accessed freely can be allocated as globals rather than
as closure slots. This eliminates the need to allocate and free closure
records. And it eliminates the need to pass around closure records. And it
makes accessing closed-over variables faster since there is no need to
indirect off of a closure pointer to access them. And despite all of this, the
performance only improves from 23.14 CPU sec for the version with GNU-C nested
functions to 14.88 CPU sec for the version without nested functions and
without closure records. The Stalin version takes 1.08 CPU sec which still is
more than 13 times faster than the C version which has no nested functions.

I've enclosed all three versions again below. The first two are the
handwritten C versions. The third is the version produced from Scheme by
Stalin.

Jeff (http://www.neci.nj.nec.com/homepages/qobi)
