I have been challenged to "port" to CL the following (for now
quite inaccurate) approximation of pi/2. Depending on the
speed outcome, CL may get considered for the project that
will be havy on mumeric computations.
C++ code
#include <iostream>
#include <cmath>
#include <ctime>
static double piOver2() {
double sum = std::exp(-std::pow(-10.0, 2.0)) ;
sum += std::exp(-std::pow(10.0, 2)) ;
for (double x = -9.99 ; x < 10.0 ; x += 0.01) {
double v = std::exp(-std::pow(x, 2.0)) ;
sum += v + v ;
}
return sum * 0.005 ;
}
10,000 iterations take about 3s.
I then went through to java (replacing "std::" with "Math.")
and this takes about 9s.
I finally coded the Lisp version
(defun pi/2 ()
(do ((sum
(+ (exp (- (expt -10.0 2.0)))
(exp (- (expt 10.0 2.0)))))
(x -9.99 (+ x 0.01)))
((>= x 10.0) (* sum 0.005))
(incf sum (* 2 (exp (- (expt x 2.0)))))))
But I just can't tell how long 10,000 iterations would
take because I lost patience after 2 minutes.
So I started inserting a few (declare) and (the float)
here and there:
(defun pi/2-opt ()
(declare (optimize (safety 0) (debug 0) (speed 3)))
(declare (float sum))
(declare (float x))
(do ((sum
(+ (exp (- (expt -10.0 2.0)))
(exp (- (expt 10.0 2.0)))))
(x -9.99 (+ x 0.01)))
((>= x 10.0) (* sum 0.005))
(incf sum (the float (* 2 (exp (- (expt x 2.0))))))))
CL-USER 54 > (time (dotimes (i 10000) (pi/2-opt)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
user time = 97.631
system time = 2.084
Elapsed time = 0:02:05
Allocation = 4877356064 bytes
0 Page faults
Calls to %EVAL 120050
NIL
At nearly 30 times slower, I just don't stand a chance.
I can probably push the high-level, DSL and other Lisp features
at around 10x or 5x the speed penalty, but not at 30x.
What did I misunderstand about the use of (declare (optimize ...
and (the float ( ... or anything else that might affect the
speed so adversely?
Just in case this isn't clear, the point is not about getting
pi/2 but getting numeric speed for computations of complexity
similar to that used in this approximation of pi/2.
Many thanks
--
JFB
I get: Error: The value #C(1.1094963563118383E-43 2.7065338742349504E-57)
does not satisfy the type specifier FLOAT.
Perhaps that gives you a hint.
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
> On Sat, 12 Aug 2006 20:38:26 +0200, verec <ve...@mac.com> wrote:
>
> I get: Error: The value #C(1.1094963563118383E-43 2.7065338742349504E-57 )
> does not satisfy the type specifier FLOAT.
>
> Perhaps that gives you a hint.
Hmmm...
CL-USER 55 > (pi/2-opt)
#C(1.7724538 -7.747651E-8)
CL-USER 56 >
Are type-casts supposed to be implementation dependent?
I thought (erroneously?) that (the float XYZ) would
just drop the imaginary part, precisely because that's
the purpose of a cast? Or should I use (coerce ... in
addition?
FYI, I'm using LispWorks 4.4.6 PE on a G4.
Many thanks
--
JFB
1. Note that the call to TIME also tells you that your function was
interpreted, not compiled.
2. I pasted your code in SBCL's REPL. Here's what I got:
; caught WARNING:
; These variables are undefined:
; SUM X
The declares must be in the scope of the declared variables, like so:
(read the standard to know where to put your declarations)
(defun pi/2-opt ()
(declare (optimize (safety 0) (debug 0) (speed 3)))
(do ((sum
(+ (exp (- (expt -10.0 2.0)))
(exp (- (expt 10.0 2.0)))))
(x -9.99 (+ x 0.01)))
((>= x 10.0) (* sum 0.005))
(declare (type single-float x sum))
(incf sum (the float (* 2 (exp (- (expt x 2.0))))))))
10000 iterations take ~2.9 seconds on a P-M 1.6 GHz.
3. I don't know which compiler you intend to use, but python is a bit
moronic when it comes to side-effected variables. Here's a
side-effect-free version that gives the same answer (but as doubles,
not single floats). In this case, 10000 iterations also take ~2.9
seconds, but I prefer this version :)
(defun pi/2 ()
(declare (optimize (speed 3) (safety 0)))
(labels ((inner (x acc)
(declare (type double-float x acc))
(if (>= x 10d0)
acc
(inner (+ 0.01 x)
(+ acc (* 2 (exp (- (expt x 2)))))))))
(* 0.005d0 (inner -9.99d0
(+ (exp (- (expt -10.0d0 2)))
(exp (- (expt 10.0d0 2))))))))
4. It is sometimes useful to compile different versions and then
examine their assembly with DISASSEMBLE. For heavily numeric code, one
doesn't have to understand the innards of the CL implementation, only
normal x86 + x87.
Good Luck,
Paul Khuong
>> (defun pi/2-opt ()
>> (declare (optimize (safety 0) (debug 0) (speed 3)))
>> (declare (float sum))
>> (declare (float x))
>> (do ((sum
>> (+ (exp (- (expt -10.0 2.0)))
>> (exp (- (expt 10.0 2.0)))))
>> (x -9.99 (+ x 0.01)))
>> ((>= x 10.0) (* sum 0.005))
>> (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))
> 1. Note that the call to TIME also tells you that your function was
> interpreted, not compiled.
CL-USER 58 > (compile 'pi/2-opt)
;;;*** Warning in PI/2-OPT: The definition of PI/2-OPT is already compiled.
PI/2-OPT
((PI/2-OPT #<STYLE-WARNING 100B2BDF>))
NIL
It already was compiled :-(
> 2. I pasted your code in SBCL's REPL. Here's what I got:
> ; caught WARNING:
> ; These variables are undefined:
> ; SUM X
> The declares must be in the scope of the declared variables, like so:
>
> (read the standard to know where to put your declarations)
>
> (defun pi/2-opt ()
> (declare (optimize (safety 0) (debug 0) (speed 3)))
> (do ((sum
> (+ (exp (- (expt -10.0 2.0)))
> (exp (- (expt 10.0 2.0)))))
> (x -9.99 (+ x 0.01)))
> ((>= x 10.0) (* sum 0.005))
> (declare (type single-float x sum))
> (incf sum (the float (* 2 (exp (- (expt x 2.0))))))))
Excellent. Thanks. I tried.
CL-USER 59 > (pi/2-opt)
1.3412517E-31
Ooops. definitely incorrect. So I rewrote the declare part
as:
(declare (type float x sum))
CL-USER 60 > (pi/2-opt)
#C(1.7724538 -7.747651E-8)
OK. Back in the game :-)
CL-USER 61 > (time (dotimes (i 10000) (pi/2-opt)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
user time = 99.182
"-(
>
> 10000 iterations take ~2.9 seconds on a P-M 1.6 GHz.
>
> 3. I don't know which compiler you intend to use, but python is a bit
> moronic when it comes to side-effected variables. Here's a
> side-effect-free version that gives the same answer (but as doubles,
> not single floats). In this case, 10000 iterations also take ~2.9
> seconds, but I prefer this version :)
>
> (defun pi/2 ()
> (declare (optimize (speed 3) (safety 0)))
> (labels ((inner (x acc)
> (declare (type double-float x acc))
> (if (>= x 10d0)
> acc
> (inner (+ 0.01 x)
> (+ acc (* 2 (exp (- (expt x 2)))))))))
> (* 0.005d0 (inner -9.99d0
> (+ (exp (- (expt -10.0d0 2)))
> (exp (- (expt 10.0d0 2))))))))
CL-USER 63 > (pi/2-PK)
1.7724538905229454D0
Great. Correct too.
CL-USER 64 > (time (dotimes (i 10000) (pi/2-PK)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-PK))
user time = 43.500
Your version definitevly halves the time required!
That's execellent, but nowhere near the speed you
quote for your compiler/OS combo.
> 4. It is sometimes useful to compile different versions and then
> examine their assembly with DISASSEMBLE. For heavily numeric code, one
> doesn't have to understand the innards of the CL implementation, only
> normal x86 + x87.
I had tried to (disassemble 'pi/2-opt) but nearly fainted at the
sight of 104 lines of assembly :-(
> Good Luck,
Thank you very much! The lesson here for me is that I should
first try another compiler, as I just can't see why a copy and
paste of your version would run 15 times slower on my setup.
(LW 4.4.6 on a G4 @1.5GHz ... the fact that the disassembly shows
104 lines is probably a good hint that no optimization is applied
at all in the PE of LW?)
> Paul Khuong
--
JFB
(defun pi/2-opt ()
(declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
(loop with sum of-type double-float = #.(realpart (+ (exp (- (expt -10.0d0 2.0d0)))
(exp (- (expt 10.0d0 2.0d0)))))
for x of-type double-float from -9.99d0 by 0.01d0
while (< x 10.0d0) do
(incf sum (* 2d0 (exp (- (* (abs x) (abs x))))))
finally (return (* sum 0.005d0))))
Removed some expt stuff, I think it is still correct.
CL-USER 47 > (pi/2-opt)
1.7724538509055174
CL-USER 46 > (time (dotimes (i 10000) (pi/2-opt)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
user time = 1.832
system time = 0.000
Elapsed time = 0:00:02
Allocation = 160984 bytes standard / 112783 bytes conses
0 Page faults
Calls to %EVAL 10035
NIL
CL-USER 44 > (disassemble 'pi/2-opt)
2067FC1A:
0: 55 push ebp
1: 89E5 move ebp, esp
3: 83EC24 sub esp, 24
6: C7042445240000 move [esp], 2445
13: DD05E02A6920 fldl [20692AE0] ; 7.440151952041305E-44
19: DD5DF0 fstpl [ebp-10]
22: DD05C02B6920 fldl [20692BC0] ; -9.99
28: DD55F8 fstl [ebp-8]
31: DD05882C6920 fldl [20692C88] ; 10.0
37: DED9 fcompp
39: DFE0 fnstsw ax
41: 9E sahf
42: 0F8ABF000000 jp L6
48: 0F86B9000000 jbe L6
L1: 54: DD45F0 fldl [ebp-10]
57: DD5DE0 fstpl [ebp-20]
60: DD45F8 fldl [ebp-8]
63: DD55F8 fstl [ebp-8]
66: DD05C00B1920 fldl [20190BC0] ; 0.0
72: DED9 fcompp
74: DFE0 fnstsw ax
76: 9E sahf
77: 0F8AF6000000 jp L7
83: 0F86F0000000 jbe L7
89: DD45F8 fldl [ebp-8]
92: D9E0 fchs
94: DD5DE8 fstpl [ebp-18]
L2: 97: DD45F8 fldl [ebp-8]
100: DD55F8 fstl [ebp-8]
103: DD05C00B1920 fldl [20190BC0] ; 0.0
109: DED9 fcompp
111: DFE0 fnstsw ax
113: 9E sahf
114: 0F8ADC000000 jp L8
120: 0F86D6000000 jbe L8
126: DD45F8 fldl [ebp-8]
129: D9E0 fchs
131: DD5DF0 fstpl [ebp-10]
L3: 134: DD45E8 fldl [ebp-18]
137: DC4DF0 fmull [ebp-10]
140: D9E0 fchs
142: D9E5 fxam
144: DFE0 fnstsw ax
146: 3500010000 xor eax, 100
151: 9E sahf
152: 760D jbe L4
154: 7B23 jnp L5
156: F6C402 testb ah, 2
159: 741E je L5
161: DDD8 fstp st(0)
163: D9EE fldz
165: EB18 jmp L5
L4: 167: D9EA fld2e
169: DEC9 fmulp st(1), st
171: D9C0 fld st(0)
173: D9FC frndint
175: D9C0 fld st(0)
177: D9CA fxch st(2)
179: DEE1 fsubrp st(1), st
181: D9F0 f2xm1
183: D9E8 fld1
185: DEC1 faddp st(1), st
187: D9FD fscale
189: DDD9 fstp st(1)
L5: 191: DD05102D6920 fldl [20692D10] ; 2.0
197: DEC9 fmulp st(1), st
199: DC45E0 faddl [ebp-20]
202: DD5DF0 fstpl [ebp-10]
205: DD45F8 fldl [ebp-8]
208: DD05102C6920 fldl [20692C10] ; 0.01
214: DEC1 faddp st(1), st
216: DD55F8 fstl [ebp-8]
219: DD05882C6920 fldl [20692C88] ; 10.0
225: DED9 fcompp
227: DFE0 fnstsw ax
229: 9E sahf
230: 7A07 jp L6
232: 7605 jbe L6
234: E947FFFFFF jmp L1
L6: 239: DD45F0 fldl [ebp-10]
242: DD05602E6920 fldl [20692E60] ; 0.005
248: DEC9 fmulp st(1), st
250: FF75E4 push [ebp-1C]
253: FF75E0 push [ebp-20]
256: FF75DC push [ebp-24]
259: 8B75E8 move esi, [ebp-18]
262: 8975DC move [ebp-24], esi
265: 8B75EC move esi, [ebp-14]
268: 8975E0 move [ebp-20], esi
271: 8B75F0 move esi, [ebp-10]
274: 8975E4 move [ebp-1C], esi
277: 8B75F4 move esi, [ebp-C]
280: 8975E8 move [ebp-18], esi
283: 8B75F8 move esi, [ebp-8]
286: 8975EC move [ebp-14], esi
289: 8B75FC move esi, [ebp-4]
292: 8975F0 move [ebp-10], esi
295: 8B7500 move esi, [ebp]
298: 8975F4 move [ebp-C], esi
301: 83ED0C sub ebp, C
304: 8B7510 move esi, [ebp+10]
307: 897504 move [ebp+4], esi
310: DD5D0C fstpl [ebp+C]
313: C74508450C0000 move [ebp+8], C45
320: B501 moveb ch, 1
322: C9 leave
323: FF2530671120 jmp [20116730] ; SYSTEM::RAW-FAST-BOX-DOUBLE
L7: 329: DD45F8 fldl [ebp-8]
332: DD5DE8 fstpl [ebp-18]
335: E90DFFFFFF jmp L2
L8: 340: DD45F8 fldl [ebp-8]
343: DD5DF0 fstpl [ebp-10]
346: E927FFFFFF jmp L3
351: 90 nop
352: 90 nop
353: 90 nop
NIL
For the sake of my curiosity, why do you calculate twice the same
number? 10^2 == (-10)^2
> But I just can't tell how long 10,000 iterations would
> take because I lost patience after 2 minutes.
That's very strange indeed. I can confirm it is pretty fast without any
optimization. I'm also working with SBCL. It gets from 5 seconds without
optimization down to 1.1 with speed 3 and anything else to 0.
Maybe there's a flaw in your compiler. Try it on another implementation.
Curiously,
Nowhere man
--
nowhe...@levallois.eu.org
OpenPGP 0xD9D50D8A
(defun pi/2-opt ()
(declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
(loop with sum of-type double-float = #.(realpart (+ (exp (- (expt -10.0d0 2.0d0)))
(exp (- (expt 10.0d0 2.0d0)))))
for x of-type double-float from -9.99d0 below 10.0d0 by 0.01d0
do (incf sum (* 2d0 (exp (- (* x x)))))
finally (return (* sum 0.005d0))))
CL-USER 1 > (time (dotimes (i 10000) (pi/2-opt)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
; Loading fasl file C:\Program Files\Xanalys\LispWorks\lib\4-3-0-0\modules\util\callcoun.fsl
user time = 1.231
system time = 0.000
Elapsed time = 0:00:01
Allocation = 164304 bytes standard / 113113 bytes conses
0 Page faults
Calls to %EVAL 10035
NIL
CL-USER 4 > (disassemble 'pi/2-opt)
; Loading fasl file C:\Program Files\Xanalys\LispWorks\lib\4-3-0-0\modules\util\disass.fsl
; Loading fasl file C:\Program
Files\Xanalys\LispWorks\lib\4-3-0-0\patches\disassembler\0001\0001.fsl
; Loaded public patch DISASSEMBLER 1.1
; Loading fasl file C:\Program Files\Xanalys\LispWorks\lib\4-3-0-0\modules\memory\findptr.fsl
214EBE2A:
0: 55 push ebp
1: 89E5 move ebp, esp
3: 83EC24 sub esp, 24
6: C7042445240000 move [esp], 2445
13: DD05500E6920 fldl [20690E50] ; 7.440151952041305E-44
19: DD5DF0 fstpl [ebp-10]
22: DD05300F6920 fldl [20690F30] ; -9.99
28: DD55F8 fstl [ebp-8]
31: DD05780F6920 fldl [20690F78] ; 10.0
37: DED9 fcompp
39: DFE0 fnstsw ax
41: 9E sahf
42: 7A5C jp L2
44: 775A ja L2
L1: 46: DD45F0 fldl [ebp-10]
49: DD0570116920 fldl [20691170] ; 0.005
55: DEC9 fmulp st(1), st
57: FF75E4 push [ebp-1C]
60: FF75E0 push [ebp-20]
63: FF75DC push [ebp-24]
66: 8B75E8 move esi, [ebp-18]
69: 8975DC move [ebp-24], esi
72: 8B75EC move esi, [ebp-14]
75: 8975E0 move [ebp-20], esi
78: 8B75F0 move esi, [ebp-10]
81: 8975E4 move [ebp-1C], esi
84: 8B75F4 move esi, [ebp-C]
87: 8975E8 move [ebp-18], esi
90: 8B75F8 move esi, [ebp-8]
93: 8975EC move [ebp-14], esi
96: 8B75FC move esi, [ebp-4]
99: 8975F0 move [ebp-10], esi
102: 8B7500 move esi, [ebp]
105: 8975F4 move [ebp-C], esi
108: 83ED0C sub ebp, C
111: 8B7510 move esi, [ebp+10]
114: 897504 move [ebp+4], esi
117: DD5D0C fstpl [ebp+C]
120: C74508450C0000 move [ebp+8], C45
127: B501 moveb ch, 1
129: C9 leave
130: FF2530671120 jmp [20116730] ; SYSTEM::RAW-FAST-BOX-DOUBLE
L2: 136: DD45F8 fldl [ebp-8]
139: DC4DF8 fmull [ebp-8]
142: D9E0 fchs
144: D9E5 fxam
146: DFE0 fnstsw ax
148: 3500010000 xor eax, 100
153: 9E sahf
154: 760D jbe L3
156: 7B23 jnp L4
158: F6C402 testb ah, 2
161: 741E je L4
163: DDD8 fstp st(0)
165: D9EE fldz
167: EB18 jmp L4
L3: 169: D9EA fld2e
171: DEC9 fmulp st(1), st
173: D9C0 fld st(0)
175: D9FC frndint
177: D9C0 fld st(0)
179: D9CA fxch st(2)
181: DEE1 fsubrp st(1), st
183: D9F0 f2xm1
185: D9E8 fld1
187: DEC1 faddp st(1), st
189: D9FD fscale
191: DDD9 fstp st(1)
L4: 193: DD0550106920 fldl [20691050] ; 2.0
199: DEC9 fmulp st(1), st
201: DC45F0 faddl [ebp-10]
204: DD5DF0 fstpl [ebp-10]
207: DD45F8 fldl [ebp-8]
210: DD05C80F6920 fldl [20690FC8] ; 0.01
216: DEC1 faddp st(1), st
218: DD55F8 fstl [ebp-8]
221: DD05780F6920 fldl [20690F78] ; 10.0
227: DED9 fcompp
229: DFE0 fnstsw ax
231: 9E sahf
232: 7A9E jp L2
234: 779C ja L2
236: E93DFFFFFF jmp L1
241: 90 nop
NIL
> Le Sat, 12 Aug 2006 19:38:26 +0100, verec a écrit :
>> static double piOver2() {
>> double sum = std::exp(-std::pow(-10.0, 2.0)) ;
>> sum += std::exp(-std::pow(10.0, 2)) ;
>
> For the sake of my curiosity, why do you calculate twice the same
> number? 10^2 == (-10)^2
Well spotted! I guess this is only a mind-less translation of
mine of the math formula
pi/2 = integral[-10, 10, x](e^-(x^2))
:-(
>> But I just can't tell how long 10,000 iterations would
>> take because I lost patience after 2 minutes.
>
> That's very strange indeed. I can confirm it is pretty fast without any
> optimization. I'm also working with SBCL. It gets from 5 seconds without
> optimization down to 1.1 with speed 3 and anything else to 0.
>
> Maybe there's a flaw in your compiler. Try it on another implementation.
Still on the case :-)
I'll report in due course.
> Curiously,
> Nowhere man
--
JFB
> I thought (erroneously?) that (the float XYZ) would
> just drop the imaginary part, precisely because that's
> the purpose of a cast? Or should I use (coerce ... in
> addition?
THE is not a type-cast, it is a declaration that the value of the
expression is known a priori to be of the specified type. It allows the
compiler to perform type dispatching at compile time rather than at run
time, but it doesn't perform any conversion to ensure that the object is
of that type.
So you should write (the float (realpart xyz)).
--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
> This version is better than the last
>
> (defun pi/2-opt ()
> (declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
> (loop with sum of-type double-float = #.(realpart (+ (exp (- (expt
> -10.0d0 2.0d0)))
> (exp (- (expt
> 10.0d0 2.0d0)))))
> for x of-type double-float from -9.99d0 below 10.0d0 by 0.01d0
> do (incf sum (* 2d0 (exp (- (* x x)))))
> finally (return (* sum 0.005d0))))
>
> CL-USER 1 > (time (dotimes (i 10000) (pi/2-opt)))
> Timing the evaluation of (DOTIMES (I 10000) (PI/2-OPT))
> ; Loading fasl file C:\Program
> Files\Xanalys\LispWorks\lib\4-3-0-0\modules\util\callcoun.fsl
>
> user time = 1.231
> system time = 0.000
> Elapsed time = 0:00:01
> Allocation = 164304 bytes standard / 113113 bytes conses
> 0 Page faults
> Calls to %EVAL 10035
CL-USER 1 > (pi/2-WH)
1.7724538509055174D0
CL-USER 2 > (time (dotimes (i 10000) (pi/2-WH)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-WH))
...
Elapsed time = 0:00:12
...
CL-USER 4 > (pi/2-WH2)
1.7724538509055174D0
CL-USER 5 > (time (dotimes (i 10000) (pi/2-WH2)))
Timing the evaluation of (DOTIMES (I 10000) (PI/2-WH2))
...
Elapsed time = 0:00:12
...
Thanks Wade!
At 12s for the Lisp version (still with the same LW 4.4,6 PE compiler)
I have no doubt that I now can successfully make my case!
(Though, out of curiosity I'm going to try with SBCL and OpenMCL)
Many thanks
--
JFB
> Are type-casts supposed to be implementation dependent?
>
> I thought (erroneously?) that (the float XYZ) would
> just drop the imaginary part, precisely because that's
> the purpose of a cast?
Just to be clear, "the" is *not* a type cast - it is a special form
that acts as a type declaration.
(the float (foo ...)) is a special form telling the
compiler/interpreter that you, the programmer, are certain that the
expression beginning (foo ...) will definitely return a float so the
compiler can treat the return of (foo ...) as a float. If you use this
declaration with an expression that returns something other than a
float you are in undefined territory:
"The consequences are undefined if the values yielded by the form are
not of the type specified by value-type."
(from <http://www.lisp.org/HyperSpec/Body/speope_the.html#the>)
The function cl:float takes a real number and an optional prototype for
float format (e.g., 1.0d0) and returns a float of the same format as
the prototype or a single float if no prototype is supplied:
(float 1.0s0 1.0d0)
=> 1.0d0
(float 33/3 1.0d0)
=> 11.0d0
(float 2)
=> 2.0s0
If you want only the real part of a complex number use the function
cl:realpart:
(realpart #C(1.7724538 -7.747651E-8))
=> 1.7724538
> "The consequences are undefined if the values yielded by the form are
> not of the type specified by value-type."
> (from <http://www.lisp.org/HyperSpec/Body/speope_the.html#the>)
>
[...]
> If you want only the real part of a complex number use the function
> cl:realpart:
>
> (realpart #C(1.7724538 -7.747651E-8))
>
> => 1.7724538
Point taken. Many thanks.
--
JFB
> In article <44de2997$0$639$5a6a...@news.aaisp.net.uk>,
> verec <ve...@mac.com> wrote:
>
>> I thought (erroneously?) that (the float XYZ) would
>> just drop the imaginary part, precisely because that's
>> the purpose of a cast? Or should I use (coerce ... in
>> addition?
>
> THE is not a type-cast, it is a declaration that the value of the
> expression is known a priori to be of the specified type. It allows
> the compiler to perform type dispatching at compile time rather than at
> run time, but it doesn't perform any conversion to ensure that the
> object is of that type.
>
> So you should write (the float (realpart xyz)).
I knew that when with a hint of a doubt, I should head straight to the CLHS :-)
Many Thanks
--
JFB
"On using Common Lisp in scientific computing", by Nicolas Neuss. This
has a pretty good tuturial-level introduction to how to do numerical
optimization in Common Lisp.
http://www.iwr.uni-heidelberg.de/organization/sfb359/PP/Preprint2002-40.ps.gz
"Fast Floating-Point Processing in Common Lisp", by Richard Fateman.
This is a more in-depth treatment.
http://www.cs.berkeley.edu/~fateman/papers/lispfloat.ps
Ken Anderson's Common Lisp Performance page:
http://openmap.bbn.com/~kanderso/performance/
Additionally, here's a listing of math oriented libraries available for
Common Lisp:
http://www.cliki.net/mathematics
Glenn
> Removed some expt stuff, I think it is still correct.
Hmmm. That's a point I overlooked when copying and pasting
your example (as well as PK's.)
The thing is, both of your version are about 10 times
faster than what I intially came up with _precisely_
because you have replaced (expt x 2) with (* x x)
When I plug (expt x 2) back into the loop, I get again
the same dismall performance as with my initial version.
with:
do (incf sum (* 2d0 (exp (- (* x x)))))
12s
with:
do (incf sum (* 2d0 (exp (- (expt x 2.0d0)))))
115s!
but with:
do (incf sum (* 2d0 (exp (- (realpart (expt x 2.0d0))))))
88s
Hmmm. Still scratching my head :-(
--
JFB
Much appreciated. Thanks!
> Glenn
--
JFB
> On 2006-08-12 21:27:36 +0100, Wade Humeniuk
> <whumeniu+...@telus.net> said:
>
>> Removed some expt stuff, I think it is still correct.
>
> Hmmm. That's a point I overlooked when copying and pasting
> your example (as well as PK's.)
>
> The thing is, both of your version are about 10 times
> faster than what I intially came up with _precisely_
> because you have replaced (expt x 2) with (* x x)
>
> When I plug (expt x 2) back into the loop, I get again
> the same dismall performance as with my initial version.
You can write a compiler-macro. Unfortunately, not on CL:EXPT since
this is forbidden, but you can do it this way:
(shadow 'expt)
(declaim (inline expt))
(defun expt (base-number power-number) (cl:expt base-number power-number))
(define-compiler-macro expt (&whole whole base-number power-number)
(if (equal 2 power-number)
(let ((bn (gensym)))
`(let ((,bn ,base-number)) (* ,bn ,bn)))
whole))
Then compare:
(disassemble (compile (defun f (x) (expt x 2))))
(disassemble (compile (defun g (x y) (expt x y))))
So there is no need in modifying the source, keep (expt x 2).
(You can also write a more sophisticated compiler macro, handing other
constant exponents).
--
__Pascal Bourguignon__ http://www.informatimago.com/
"This statement is false." In Lisp: (defun Q () (eq nil (Q)))
> You can write a compiler-macro. Unfortunately, not on CL:EXPT since
> this is forbidden, but you can do it this way:
>
> (shadow 'expt)
> (declaim (inline expt))
> (defun expt (base-number power-number) (cl:expt base-number power-number))
> (define-compiler-macro expt (&whole whole base-number power-number)
> (if (equal 2 power-number)
> (let ((bn (gensym)))
> `(let ((,bn ,base-number)) (* ,bn ,bn)))
> whole))
>
> Then compare:
>
> (disassemble (compile (defun f (x) (expt x 2))))
> (disassemble (compile (defun g (x y) (expt x y))))
Cute :-)
Many thanks
--
JFB
I get consistently *faster* run times with SBCL 0.9.14.30 than with
GCC 4.0.4 using plain C, not C++. (There's no reason C++ would be
faster, is there?)
100,000 iterations on a 2GHz x86:
SBCL
17.729 seconds of real time
17.713108 seconds of user run time
0.004 seconds of system run time
0 page faults and
0 bytes consed.
GCC
real 0m19.948s
user 0m19.861s
sys 0m0.004s
How exactly did you instrument the C/C++ code to time it? As I didn't
want to take much time on it, I just compiled the given code with a main
doing 10,000 iterations with a for loop, and timed it with my shell, but
this includes spawning and shutting down the process...
Ha, wait. Don't mind. I could just change the number of iterations and
then easily calculate the impact it really has.
OK, I'm also significantly faster in Lisp... It took 1.7s even with -O3
for the C++ version (ten times with ten times iterations, so spawning
the process is quite lightweight, in fact), and 1.1s in Lisp for the
following:
(defun pi/2-opt ()
(declare (optimize (debug 0)(space 0)(safety 0)(speed 3)))
(do ((x -9.98 (+ x 0.01))
(v 0.0 (exp (- (expt x 2.0))))
(sum (* 2 (exp (- (expt 10.0 2.0)))) (+ sum (* 2 v))))
((>= x 10.0) (* 0.005 sum))
(declare (type float x v sum))))
I'm impressed...
Quickly,
> (defun pi/2-opt ()
> (declare (optimize (debug 0)(space 0)(safety 0)(speed 3)))
> (do ((x -9.98 (+ x 0.01))
> (v 0.0 (exp (- (expt x 2.0))))
> (sum (* 2 (exp (- (expt 10.0 2.0)))) (+ sum (* 2 v))))
> ((>= x 10.0) (* 0.005 sum))
> (declare (type float x v sum))))
>
> I'm impressed...
So am I :-)
I tested this version sith sbcl 0.9.15 and indeed:
* (time (dotimes (i 10000) (pi/2-optPT)))
Evaluation took:
4.413 seconds of real time
3.753417 seconds of user run time
0.036023 seconds of system run time
0 page faults and
81,920 bytes consed.
NIL
I'm amazed at the difference a compiler makes...
Many thanks to all, for all your time and efforts.
--
JFB
Here's the whole thing:
#include <math.h>
static double pi_over_2 (void)
{
double sum = exp (-pow (-10.0, 2.0)) + exp (-pow (10.0, 2.0));
double x;
for (x = -9.99; x < 10.0; x += 0.01)
sum += 2.0 * exp (-pow (x, 2.0)) ;
return sum * 0.005;
}
int main (void)
{
int i;
for (i = 0; i < 100000; i++)
pi_over_2 ();
return 0;
}
> OK, I'm also significantly faster in Lisp... It took 1.7s even with
> -O3 for the C++ version [...], and 1.1s in Lisp
I didn't see any significant differences between -O2, -O3, and -Os, or
between C and C++, so I guess that's not a factor to consider here.
> (defun pi/2-opt ()
> (declare (optimize (debug 0)(space 0)(safety 0)(speed 3)))
> (do ((x -9.98 (+ x 0.01))
> (v 0.0 (exp (- (expt x 2.0))))
> (sum (* 2 (exp (- (expt 10.0 2.0)))) (+ sum (* 2 v))))
> ((>= x 10.0) (* 0.005 sum))
> (declare (type float x v sum))))
You can also avoid boxing the return value, but I find that it doesn't
improve speed noticeably in this toy program.
(defun pi/2 (box)
(declare (type (simple-array double-float ()) box))
(do (...)
((>= x 10.0d0)
(setf (aref box) (* 0.005d0 sum))
nil)
...))
A one-element vector works just as well, but it's always fun to use a
zero-dimensional array. :) (Perhaps other implementations optimize one
of those better than another.)
> OK, I'm also significantly faster in Lisp... It took 1.7s even with -O3
> for the C++ version (ten times with ten times iterations, so spawning
> the process is quite lightweight, in fact), and 1.1s in Lisp
Compile the C version with -ffast-math and do something to avoid
optimizing out the whole computation (e.g sum the results and print
the sum). It's twice faster than without -ffast-math on my box.
--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Significantly faster, yes, but not by a factor of two on my machine.
SBCL
17.772 seconds of real time
17.745108 seconds of user run time
0.004 seconds of system run time
0 page faults and
0 bytes consed.
GCC
real 0m15.863s
user 0m15.785s
sys 0m0.012s
Old results, without summing the return values:
If you look at the recent thread in this newsgroup with subject "Squeezing
blood from a turnip" you will see a related discussion about getting better
performance in Lisp than C at counting bits in an integer.
What ended up being found is that the Common Lisp built in function logcount
was the fastest way do do it in SBCL, that the fastest C program to do it was
faster, that the implementation of logcount in SBCL was not using the fastest
algorithm that was known in C, and then coincidentally someone contributed a
faster logcount to the SBCL project and that made the Lisp version faster than
the C version.
An important point that you can get from my post to that thread is that you
can write your own compiler implementation of a function like logcount using
the SBCL macro define-vop, which allows you to write the assembler language
implementation of a function. So even if the bottleneck in your program is not
something that is part of the CL language definition such as logcount, you are
able to hand optimize it if it makes sense to do so. And you can evaluate a
define-vop form right at the REPL. You don't have to touch the compiler source
code to in effect make a patch to the the compiler.
If profiling shows that the bottleneck is a function that is part of Common
Lisp, you can give your improvements back to the project, if it is an open
source Lisp.
That means that if you have a very large program, once you have the algorithm
tweaked and you profile to find where the bottleneck is, you can go the extra
step of telling the compiler how to write fully optimized machine language for
a critical function in the inner loop, if there is such a function.
The advantage I have found in using Common Lisp instead of C for highly
optimized code is that it is much easier to make major changes to the
algorithm, which usually is the right way to get big improvements, before
dealing with low level tweaking.
--
Sidney Markowitz
http://www.sidney.com
> Depending on the
> speed outcome, CL may get considered for the project that
> will be havy on mumeric computations.
Why use Lisp for numerical computations ?
patro
Why not?
Curiously,
>> Depending on the
>> speed outcome, CL may get considered for the project that
>> will be havy on mumeric computations.
>
> Why use Lisp for numerical computations ?
I said that the project was going to be heavy on numerical
computations, not that it was only about them.
The basic blocks (financial modelling on time series) are
fairly well defined. That's where the "raw speed" is needed
because the data samples are potentially _huge_, and the
computations not trivial. The math guy doesn't really care
which language is chosen, provided it is "fast enough", and
I'm pushing as hard as I can to avoid having to use Java
for the rest, because Java forces me to build too much
infrastructure/protocols that have to be almost entirely discarded
when the higher level requirements change.
The higher levels (what strategies to apply according to the
results of the models) are a lot less well defined, and it is
my beleif that changing our mind three times along the way about
what we really want is going to be less expensive and faster
in CL (than the daily nightmare of forcing new business requirements
into a three year old Java code base that was not meant for them.)
But in order to be able to change our minds, we need first
to see some results, hence experiment, hence my plea for using
CL
--
JFB
On the MS Windows machine I'm currently using:
Cygwin GCC 3.4.4 takes 19.5secs (with -O3 and --fast-math)
SBCL "kitten of death" takes ~23secs
GCL 2.6.1 takes 59secs
I'm impressed by how close SBCL gets to GCC. I'm also impressed at the
acrobatics needed to stop GCC from removing the calculation altogether
(as you say sum results + print sum).
The 'time' function seems not to work on MS windows SBCL, I imagine
fixing that isn't a priority.
It works in the latest release, but you'll need to build that from
source. Binaries aren't made for all platforms on every release.
--
Juho Snellman
Well, any modern compiler will calculate both values at compile time.
And this "-" will help other people to understand your code.
--
Ivan Boldyrev
| recursion, n:
| See recursion
> On 2006-08-12 20:33:51 +0100, pkh...@gmail.com said:
> > 1. Note that the call to TIME also tells you that your function was
> > interpreted, not compiled.
>
> CL-USER 58 > (compile 'pi/2-opt)
> ;;;*** Warning in PI/2-OPT: The definition of PI/2-OPT is already compiled.
> PI/2-OPT
> ((PI/2-OPT #<STYLE-WARNING 100B2BDF>))
> NIL
>
> It already was compiled :-(
The function PI/2-OPT may have been compiled, but what you tested with
TIME was the following:
(time (dotimes (i 10000) (pi/2-opt)))
and the DOTIMES loop is interpreted, not compiled. What I usually end
up doing to get around that is to write some simple test function like
so:
(defun test (n) (dotimes (i n) (pi/2-opt)))
and then make sure TEST is compiled. With dynamic linking, it also
means that subsequent changes to the tested function can be timed
without needing to recompile the TEST function.
(time (test 10000))
--
Thomas A. Russ, USC/Information Sciences Institute
Interesting. In CMUCL, TIME always compiles the form to be timed:
cmu> (time (dotimes (i 100000000)))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:
; Evaluation took:
; 0.11f0 seconds of real time
; 0.11f0 seconds of user run time
; 0.0f0 seconds of system run time
; 200,529,433 CPU cycles
; 0 page faults and
; 0 bytes consed.
;
NIL
cmu>
-Rob
-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
> On 2006-08-13 23:15:29 +0100, "patro" <pa...@ortap.com> said:
>
>>> Depending on the
>>> speed outcome, CL may get considered for the project that
>>> will be havy on mumeric computations.
>
> I said that the project was going to be heavy on numerical
> computations, not that it was only about them.
>
> The basic blocks (financial modelling on time series) are
> fairly well defined. That's where the "raw speed" is needed
> because the data samples are potentially _huge_, and the
> computations not trivial. The math guy doesn't really care
> which language is chosen, provided it is "fast enough", and
> I'm pushing as hard as I can to avoid having to use Java
> for the rest, because Java forces me to build too much
> infrastructure/protocols that have to be almost entirely discarded
> when the higher level requirements change.
Speaking from experience, it's really not at all about how fast a
language can do one sort or another of numerical floating point
arithmetic. With huge data samples, it's about how the data are
organized and what types of methods are used to access them. We handle
very large time series using Lisp, and are eternally thankful that we
chose it for a language. We do use some routines for numerical
computation that are not Lisp, but they're sure not Java either.
They're carefully coded and optimized and then wrapped in Lisp wrappers.
We have years of experience in using Lisp with (very large) time series
for financial applications. If you'd like more pointers, let me know.
>
> The higher levels (what strategies to apply according to the
> results of the models) are a lot less well defined, and it is
> my beleif that changing our mind three times along the way about
> what we really want is going to be less expensive and faster
> in CL (than the daily nightmare of forcing new business requirements
> into a three year old Java code base that was not meant for them.)
>
> But in order to be able to change our minds, we need first
> to see some results, hence experiment, hence my plea for using
> CL
----== Posted via Newsfeeds.Com - Unlimited-Unrestricted-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
I'd agree with that. The example program given isn't a particularly
good one.
Most programs that use floating point iterate over large data
structures to read the data they need and output the results. The
optimization of these data structures is often the most important part.
In the SPECfp2000 floating point benchmark set there are now few
benchmarks that make great use of the floating point capabilities of
the processor. Most test the capabilities of the memory subsystem to
access regularly structured data. This is because the benchmarks are
built to be typical of fp applications and this is what fp applications
today do.
CL-USER 11 > (defun pi/2-opt4 () ;;; formula for (sqrt pi) not for (/ pi 2) !!
(declare (optimize (speed 3) (safety 0) (debug 0) (float 0)))
(loop with sum of-type double-float = #.(realpart (+ (exp (- (expt
-10.0d0 2.0d0)))
(exp (- (expt
10.0d0 2.0d0)))))
for x of-type double-float from -9.99d0 below 10.0d0 by 0.01d0
do (incf sum (* 2d0 (exp (- (realpart (expt x 2.0d0))))))
finally (return (* sum 0.005d0))))
CL-USER 12 > (pi/2-opt4)
1.7724538509055174D0
CL-USER 13 > (sqrt pi)
1.7724538509055159D0
CL-USER 14 > (/ pi 2)
1.5707963267948966D0
Many thanks again.
--
JFB