I am looking for the numerically most stable, i.e. most accurate
formulation for evaluating the roots
of a quadratic equation (2nd order polynom)
ax^2 + bx + c = 0 .
From school I know that the roots are given by
x_{1,2} = (-b +/- \sqrt{b^2 -4ac}) / 2a (1)
or, alternatively
x_{1,2} = 2c / (-b +/- \sqrt{b^2 -4ac}) . (2)
Since the problem comes from physics, I know that a>0, b>0, c<0 and I am
only interested in the root with the minus sign.
I am having trouble in the frame of the solution of a non-linear system
of equation using a Newton-like method with numerical approximation of
the Jacobian by finite differences. I could trace down convergence
problems to very in-accurate partial derivatives resulting from
in-accurate evaluation of the roots of the quadratic equation.
I have already figured out that formula (1) is inferior to (2) if b>>ac.
I hope sombody can give some general guidelines how to evaluate the
roots with most accuracy in finite precision, depending on the values of
the coefficients.
Thanks in advance for your help,
Michael Buck
IKE, Universitaet Stuttgart
D-70550 Stuttgart, Germany
E-mail: bu...@ike.uni-stuttgart.de
> I am looking for the numerically most stable, i.e. most accurate
> formulation for evaluating the roots
> of a quadratic equation (2nd order polynom)
>
> ax^2 + bx + c = 0 .
Let's reduce it to the solution of
x^2 + 2px + q = 0
for the sake of simplicity.
The problem with the standard formula
x1;2 = -p +- sqrt(pp-q);
are the cases where |q| is very small relative to |pp|. Then for
negative/positive p the expression -p -/+ sqrt(pp-q) will subtract to
values of almost the same magnitude and the relative precision of the
result will vanish.
Using VIETA's identity x1*x2 = q
a superior approach is
(p=q=0 excluded)
(pp-q >= 0)
if (p<0)
x1 = -p + sqrt(pp-q);
else
x1 = -p - sqrt(pp-q);
x2 = q/x1;
Regards
Horst
>I am looking for the numerically most stable, i.e. most accurate
>formulation for evaluating the roots
>of a quadratic equation (2nd order polynom)
Use the code listed below.
Greetings,
Jos
Subroutine quadrc(a,b,c,w1,w2)
! -- Solves a x^2 + b x + c = 0; w1 becomes root with smaller abs. val.
implicit complex*16(a-z)
b2 = -b/2
d = sqrt(b2*b2-a*c)
s = b2+d
s2= b2-d
if(abs(s2) .gt. abs(s)) s=s2
w2 = s/a
w1 = 0
if(c.ne.0) w1=c/s
end
For a comprehensive look at solving the quadratic equation find a copy
of Pat H. Sterbenz's book Floating Point Computation. Section 9.3 (page
246-252 in my copy).
Jeff Lindemuth
li...@coil.com
LakeShore Cryotronics
li...@lakeshore.com
I wasn't going to answer this thread because I figured there would be
many essentially correct solutions posted, but I believe it is always
desirable to give the reasons for the changes you suggest to someone.
First, given that the original statement of the problem was that A and
B were greater than zero and C was less than zero, there is no need
to introduce the COMPLEX data type into his problem. The expression
within the SQRT call will always be positive.
Second, the introduction of B2 saves a little time (and slightly decreases
the chance that B**2 will overflow) but does nothing to increase the
accuracy of the solution. I'm not saying that it's a bad idea: the saved
time is no doubt worth something. But it should be explained that
the modification if for speed.
Third, you use B2*B2 instead of B2**2. This should be
what the compiler does anyway for small integer exponents. But
if it isn't, the single multiply is almost certainly more accurate than
the exp(2*log(B2)) that might usually be the alternative. It's
always a good idea to multiply rather than exponentiate if you
can.
Fourth, the introduction of S and S2 is again mostly a time saver.
The normal solution divides by A to get the first root, and then
inverts that root and multiplies it by C/A to get the second root.
This second division by A essentially produces the S value you
had before. There is possibly a little loss of precision from this,
but not significant in this case (see below).
Fifth, the IF statement to choose that largest value solution as
the principal root and computing the other from it resolves the
issue of "front cancellation" when subtracting nearly equal
values. But again, this should be a moot point in this case
(next paragraph). If he _did_want the second root, this would
be a very important step. Subtracting two (nearly equal) values
which are both the result of calculation is a good way to _promote_
error.
Given that A and B are positive, C is negative, and he only wanted
the (most) negative root, the naive solution is:
R1 = (-B - SQRT(B*B - 4*A*C)) / (2*A)
This is also the largest magnitude root of the two, given the prior
knowledge of the values. In spite of your changing variables,
this is the result your program would produce as well (assuming that
multiplications by 4 and 2 are exact as well as your division by 2).
Since the other root is never needed, the whole discussion above
about S and S2 is not really relevant in this case. Nor is the
issue of selecting the largest root as the principal one. (If he decides
he needs the other root, the statement R2 = C/(A*R1) can be used.
This still avoids the "front cancellation" from the subtract.)
Horst Kraemer also gave a solution. In it, he also did a lot
of substitution of variables. He defined P as B/(2*A) and
Q as C/A. The solution was then given for:
X^2 + 2PX + Q = 0
The solution also involved some time saving and a test for the
largest magnitute root. But, in the end it should again produce the
same solution as R1 above for the case in question.
In the case, as described, if the solution given by the naive
solution is not adequate, you might try using extended precision
(say double the precision that the variables A, B, and C are
declared) for all the intermediate calculations in the solution.
This would probably be about the best you could do (assuming
the extended precision SQRT function was "good to the last ULP")
If the system you're on doesn't have extended precision (or if the
cost of an extended precision SQRT is prohibitive), you might
try:
R1_PRIME = R1 - (A*R1*R1 + B*R1 + C) / (2*A*R1 + B)
This is a Newton's method iteration applied to the quadratic formula.
Assuming that the problem really is a poor approximation of R1 from
the naive formula, this should correct the approximation quite precisely
(assuming that R1 is close at all). It may even be cheaper to do the
naive calculation and the Newton's iteration both in normal precision
than to do a single expression of the naive solution in extended precision.
Applying the various time saving variable substitutions may also help.
Maybe I've confused the issue now, with too much information. But
I really do think that answers should give explanations and not just
code to blindly follow. Anyway, good luck.
--
J. Giles
The reason for introducing S and S2 is not time saving, but to be able
to inspect them, keep the larger (reliable) one and discard the other.
The problem with the "normal" solution is which root to call the "first"
root. The essence can be seen, rewriting the standard roots as:
R1 = {-b/2 + sqrt((b/2)^2-a*c)} / a
R2 = {-b/2 - sqrt((b/2)^2-a*c)} / a
^^^
Which one of these has a cancellation between the terms in braces?
You don't know, since the terms can be anywhere in the complex plane!
It's simple to prove that in (at least) one of the roots, the result is
larger than each of the terms. This is the one that is reliable, the
other one must be computed indirectly. By introducing s and s2, we can
make this choice at the point where the cancellation has occurred. This
will even give a more or less correct solution pair [-c/b, Inf] for a=0.
To see the cancellation, take for instance: a=1e-40, b=1, c=-1.
The small root is then approximately: R1 = 1.0 (see how close you come.)
Change b to -1 and then R2, the other root, is the difficult one.
>> Subroutine quadrc(a,b,c,w1,w2)
>>! -- Solves a x^2 + b x + c = 0; w1 becomes root with smaller abs. val.
>> implicit complex*16(a-z)
>>
>> b2 = -b/2
>> d = sqrt(b2*b2-a*c)
>> s = b2+d
>> s2= b2-d
>> if(abs(s2) .gt. abs(s)) s=s2
>> w2 = s/a
>> w1 = 0
>> if(c.ne.0) w1=c/s
>> end
-- J.B.
Yes, but as I pointed out, you can make the determination of which
root has the largest magnitude without introducing S and S2. They may
(in the complex case) save some time in making this determination. They
certainly (in all cases) save a couple of operations when you're interested
in getting both roots.
For the case in question, the largest magnitude root is determined by
the sign of B, which is also a given - so there's no need to even make
a test. And the user only wants one root, so there's no savings from
the omitted operations.
And yes, as we've both now said, it *is* important to compute the
largest root differently than the smaller one.
--
J. Giles
> Third, you use B2*B2 instead of B2**2. This should be
> what the compiler does anyway for small integer exponents. But
> if it isn't, the single multiply is almost certainly more accurate than
> the exp(2*log(B2)) that might usually be the alternative. It's
> always a good idea to multiply rather than exponentiate if you
> can.
I agree that compilers should do this anyway and all those that I have
just tried appear to do so (except possibly, unoptimised g77).
I agree with the statement about accuracy.
I disagree with the unqualified advice to multiply rather than
exponentiate as I have found this typically encourages students
to write expressions such as
R2 = (A*X-X0)*(A*X-X0) + (B*Y-Y0)*(B*Y-Y0) + (C*Z-Z0)*(C*Z-Z0)
and similar distortions of formula translation.
However, I do preach
y = a0 + x*(a1 + x*(a2 + x*(a3 ... )))
so there is plenty of common ground !
Bob (we...@atm.ox.ac.uk)
Jeffrey Lindemuth wrote in message <36506E...@bronze.coil.com>...
>However, I do preach
> y = a0 + x*(a1 + x*(a2 + x*(a3 ... )))
>so there is plenty of common ground !
The algorithm:
poly = a(n)
do i = n-1, 0, -1
poly = x*poly+a(i)
end do
Has twice the latency of the algorithm:
poly = a(0)
power = x
do i = 1, n-1
poly = poly+a(i)*power
power = power*x
end do
poly = poly+a(n)*power
And by further rearranging the arithmetic one can reduce
the latency even more. Of course if the compiler is
able to unroll the outer loop where the polynomials have
to be calculated the first algorithm has twice the
throughput, so neither method is faster in every case.
It might be preferable, then, to write the polynomial
evaluation in the most readable manner, e.g.:
poly = a(0) + sum(a(1:n)*x**(/ (i, i = 1, n) /))
and let the compiler try to figure out what's best, perhaps
with the aid of compiler options.
> ..., so neither method is faster in every case.
> It might be preferable, then, to write the polynomial
> evaluation in the most readable manner, e.g.:
>
> poly = a(0) + sum(a(1:n)*x**(/ (i, i = 1, n) /))
If we should write it readable, then why the separate a(0)? Clearer
would be:
poly = sum((/( a(i)*x**i, i=0,n )/))
Does this have to do with the x**0 problem, or with other subtleties
of our beloved standard?
- Jos -
----
Jozef R. Bergervoet Electromagnetism and EMC
Philips Research Laboratories, Eindhoven, The Netherlands
E-mail: berg...@natlab.research.philips.com Phone: +31-40-2742403
> The algorithm:
>
> poly = a(n)
> do i = n-1, 0, -1
> poly = x*poly+a(i)
> end do
>
> Has twice the latency of the algorithm:
>
> poly = a(0)
> power = x
> do i = 1, n-1
> poly = poly+a(i)*power
> power = power*x
> end do
> poly = poly+a(n)*power
>
> And by further rearranging the arithmetic one can reduce
> the latency even more. Of course if the compiler is
> able to unroll the outer loop where the polynomials have
> to be calculated the first algorithm has twice the
> throughput, so neither method is faster in every case.
Rewrite the 2nd code as,
poly = a(0)
power = 1
do i = 1, n
power = power*x
poly = poly + a(i)*power
end do
By looking at both "renditions" I can't follow your reasoning that
neither is better in every case. It seems to me that, regardless of
compiler acrobatics, version one is not only more elegant it is also
always better.
--
Dr.B.Voh
-----------------------------------------------
Modeling * Simulation * Analysis
http://www.netcom.com/~essoft
-----------------------------------------------
>By looking at both "renditions" I can't follow your reasoning that
>neither is better in every case. It seems to me that, regardless of
>compiler acrobatics, version one is not only more elegant it is also
>always better.
We are making a typical newsgroup mistake here of discussing algorithms
without benchmarking them. In order to remedy this I offer the
following code:
program polytest
implicit none
integer, parameter :: rk8 = selected_real_kind(15, 300)
integer i, j
integer, parameter :: n = 100
real(rk8) :: a(0:n) = (/ (1+(-1)**i/(i+1.0_rk8), i = 0, n) /)
real(rk8) x, poly, power
real(rk8) t1, t2
call cpu_time(t1)
x = 1
do j = 1, 2000000
poly = a(n)
do i = n-1, 0, -1
poly = x*poly+a(i)
end do
x = x-sign(x, poly)
end do
call cpu_time(t2)
write(*, *) ' x =', x, ' time =', t2-t1
call cpu_time(t1)
x = 1
do j = 1, 2000000
poly = a(0)
power = x
do i = 1, n-1
poly = poly+a(i)*power
power = power*x
end do
poly = poly+a(n)*power
x = x-sign(x, poly)
end do
call cpu_time(t2)
write(*, *) ' x =', x, ' time =', t2-t1
call cpu_time(t1)
x = 1
do j = 1, 2000000
poly = a(0) + sum(a(1:n)*x**(/ (i, i = 1, n) /))
x = x-sign(x, poly)
end do
call cpu_time(t2)
write(*, *) ' x =', x, ' time =', t2-t1
end program polytest
And here are the timing results:
F:\program files\devstudio\James\polytest\AlphaRel>polytest
x = 0.000000000000000E+000 time = 3.24218750000000
x = 0.000000000000000E+000 time = 2.21875000000000
x = 0.000000000000000E+000 time = 36.9921875000000
F:\program files\devstudio\James\polytest\AlphaRel>
Note that the apparently more naive second algorithm actually
did better than the first in this case. If you examine the code,
you will note that I forced the polynomial evaluations to be
sequential so the times are determined by the latency of the
methods (unrolling the outer loop is impossible). In the first
algorithm, the latency is that of a floating multiply and a
floating add, which is a total of 8 clocks on my PC. In the
second algorithm, both the poly and x variables have a latency
of 4 clocks per loop cycle, but now they can run in parallel
so the minimum latency is just 4 clocks per loop cycle. Of
course there is an extra multiplication per loop iteration here
but that is of no consequence here as the processor could issue
4 multiplications in 4 clocks in any case; we are only gobbling
up half its floating multiplication throughput anyway! The
third algorithm's performance I blame on the compiler; it should
be smarter than that.
Michael
>
> We are making a typical newsgroup mistake here of discussing algorithms
> without benchmarking them. In order to remedy this I offer the
> following code:
>
> program polytest
> ...
> ... In the
> second algorithm, both the poly and x variables have a latency
> of 4 clocks per loop cycle, but now they can run in parallel ...
But still the second algorithm seems to take nearly 6 clocks in the loop:
2.2187*533e6 / (2000000*100) = 5.9128 (assuming 533 MHz, right?)
Does this mean there is still something sub-optimal?
I didn't succeed in producing much better code. I tried Michael's
suggestion of splitting odd/even. On a Pentium 200 with Linux PSR f90
(-O3 -funroll-all-loops) it gave :
x = 0. time = 6.48 James' first algorithm
x = 0. time = 7.42 James' second alg.
x = 0. time = 5.59 odd/even separate
So on the Pentium, the last version took about 5.59 clocks to loop. The
odd/even code was simply:
x2 = x**2
poly2 = a(n)
poly1 = a(n-1)
do i = n-2, 2, -2 ! assuming n=even !!!
poly2 = x2*poly2+a(i)
poly1 = x2*poly1+a(i-1)
end do
poly = a(0)+x*(poly1+x*poly2)
I suspect it can be done faster...
-- Jos
>But still the second algorithm seems to take nearly 6 clocks in the loop:
>
> 2.2187*533e6 / (2000000*100) = 5.9128 (assuming 533 MHz, right?)
>
>Does this mean there is still something sub-optimal?
Yes, a couple of things are sub-optimal. I just quickly pasted some code
around the algorithms and didn't take the time to play with any compiler
settings. Also the x = x-sign(x, poly) isn't very intelligent. I was
just trying to ensure that the outer loops couldn't be executed in
parallel (in which case the original Horner algorithm would always win).
Since this is cross-posted to the numerical analysis ng, perhaps
someone there could come up with a fixed-point iteration where the
polynomial is of degree about 100 and the target value is in (0.5, 1.0)
and fairly straightforward to generate so that I could switch to x = poly.
The times for the release version and no extra switches:
Horner: x = 0.000000000000000E+000 time = 3.42968750000000
Naive: x = 0.000000000000000E+000 time = 2.21875000000000
F90 Array: x = 0.000000000000000E+000 time = 37.8046875000000
Horner mod 2: x = 0.000000000000000E+000 time = 1.75000000000000
Horner mod 8: x = 0.000000000000000E+000 time = 6.39062500000000
As above with /fast /arch:ev56 /tune:ev56 /pipeline
Horner: x = 0.000000000000000E+000 time = 3.43750000000000
Naive: x = 0.000000000000000E+000 time = 1.89062500000000
F90 Array: x = 0.000000000000000E+000 time = 39.0000000000000
Horner mod 2: x = 0.000000000000000E+000 time = 1.73437500000000
Horner mod 8: x = 0.000000000000000E+000 time = 5.92187500000000
As above with /transform_loops
Horner: x = 0.000000000000000E+000 time = 3.15625000000000
Naive: x = 0.000000000000000E+000 time = 1.88281250000000
F90 Array: x = 0.000000000000000E+000 time = 40.3906250000000
Horner mod 2: x = 0.000000000000000E+000 time = 1.78125000000000
Horner mod 8: x = 0.000000000000000E+000 time = 3.15625000000000
As above with /unroll:101
Horner: x = 0.000000000000000E+000 time = 3.18750000000000
Naive: x = 0.000000000000000E+000 time = 1.95312500000000
F90 Array: x = 0.000000000000000E+000 time = 33.1562500000000
Horner mod 2: x = 0.000000000000000E+000 time = 1.65625000000000
Horner mod 8: x = 0.000000000000000E+000 time = 1.85156250000000
The Horner mod 8 algorithm is an attempt to get perfect FP pipelining
in the inner loop:
xvec(2) = x**2
xvec(0) = 1
xvec(1) = x
xvec(4) = xvec(2)**2
xvec(3) = x*xvec(2)
x8 = xvec(4)**2
xvec(5) = x*xvec(4)
xvec(6) = xvec(2)*xvec(4)
xvec(7) = xvec(3)*xvec(4)
polyvec = a(n-7:n)
do i = n-8, 12, -8 ! assuming n=4 mod 8 !!!
polyvec = x8*polyvec+a(i-7:i)
end do
poly = sum(a(0:4)*xvec(0:4))+xvec(5)*sum(xvec*polyvec)
But it failed to do an adequate job here. Perhaps if I used F77 syntax
instead of F90 array syntax like the Horner mod 2 algorithm does...
Note that on a Pentium the latency problems of Horner's method aren't
as big a deal as on a 21164 or 21264 because the floating point
multiply latency/throughput ratio (3/2) isn't as big as it is on the
Alphas (4/1).
> poly = a(n)
> do i = n-1, 0, -1
> poly = x*poly+a(i)
> end do
>
> poly = a(0)
> power = x
> do i = 1, n-1
> poly = poly+a(i)*power
> power = power*x
> end do
> poly = poly+a(n)*power
>
> And here are the timing results:
>
> time = 3.24218750000000
> time = 2.21875000000000
>
> Note that the apparently more naive second algorithm actually
> did better than the first in this case. . .
I tried two different F77 compilers -- with & without optimization -- and in
both cases the first algorithm does better (anywhere from 6 to 58 percent).
Then I realized we're not comparing like with like. Shortening the loop by one
cycle in the second case is an "improvement" that can be neutralized by doing
the same for the first case.
Loop overhead may be small, however with your generous 2 million cycle it adds
up and leaves the comparative reasoning for the test numbers questionable.
Perhaps dual processors change the equation and unsettle a common intuition!??
>I tried two different F77 compilers -- with & without optimization -- and
in
>both cases the first algorithm does better (anywhere from 6 to 58 percent).
>Then I realized we're not comparing like with like. Shortening the loop by
one
>cycle in the second case is an "improvement" that can be neutralized by
doing
>the same for the first case.
>Loop overhead may be small, however with your generous 2 million cycle it
adds
>up and leaves the comparative reasoning for the test numbers questionable.
>Perhaps dual processors change the equation and unsettle a common
intuition!??
I'm not exactly sure about what you are saying here -- one has to know
the processor type and speed to be able to make a guess about what the
compiler is doing to the code. On a 486, for example, the straight
Horner algorithm would always be faster because floating point instructions
can't overlap, but on more recent processors floating point operations
can overlap so the second method should be faster if the compiler
schedules instructions for maximum throughput. The third method was
intended to tell the compiler that a polynomial evaluation was
requested and to do the best job it could of optimizing this.
Perhaps a comparison of throughput and latency on a 21164 will help you
to understand the point I am trying to make. The max throughput of this
processor is one floating point add and one floating point multiply per
clock cycle, along with up to two integer operations (note that FP
loads count as integer operations in this context). The latency for
FP adds and multiplies, however, is 4 clock cycles so the Horner
algorithm requires 8 clock cycles * the degree of the polynomial for
completion. That adds up to 800 cycles for evaluation of a 100°
polynomial. In the 8 clock cycles for each add and multiply we could
have issued 8 adds and 8 multiplies if we had ideally pipelined code,
so we are only using 1/8 of the processor's potential FP throughput
with this method!
As an illustration of how wasteful this is, I offer some pseudo-
assembly (F77) code and some actual assembly (ASAXP) code that gets
much closer to optimum throughput. First the driver program (same
as before, more or less, so I only left the declarations and the
..ASM driver loop intact. Well, I changed the polynomial, too, so
that the result wasn't so trivial; I left that in as well).
program polytest
implicit none
integer, parameter :: rk8 = selected_real_kind(15, 300)
integer i, j
integer, parameter :: n = 100
real(rk8) a(0:n)
real(rk8) x, poly, power
real(rk8) t1, t2
real(rk8) x2, poly1, poly2
real(rk8) xvec(0:7), x8, polyvec(0:7)
real(rk8) x0, y0, y1
real(rk8) x3, x4, x5, x6, x7
real(rk8) f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11
external asm100, asm200
real(rk8) asm100, asm200
a = (/ (sqrt(2.0_rk8)**i*(1+(-1)**i/(i+1.0_rk8)), i = 0, n) /)
x0 = (1/sqrt(3.0_rk8)+atan(1.0_rk8))/2
y0 = sum((/(a(i)*x0**i,i=0,n)/))
y1 = sum((/(i*a(i)*x0**(i-1),i=0,n)/))
a(0) = a(0)+x0-0.9_rk8*x0-y0+y1*x0
a(1) = a(1)+0.9_rk8-y1
call cpu_time(t1)
x = 139.0_rk8/204
do j = 1, 2000000
poly = asm100(%val(x), a) ! Use the assembly code
x = poly
end do
call cpu_time(t2)
write(*, *) ' True ASM: x =', x, ' time =', t2-t1
end program polytest
And here is the assembly code (the pseudo-assembly is in the comments)
Note that in the inner loop the instructions are grouped in blocks of
4 with exactly one FP multiply and one FP add in each group; this is
one (of many) requirements the 21164 has for max FP throughput.
..globl ASM100
..set noreorder
..align 4
ASM100:
mult $f16, $f16, $f12 # x2 = x**2
ldt $f10, +2*8($17) # f10 = a(2)
ldt $f11, +4*8($17) # f11 = a(4)
nop
mult $f10, $f16, $f10 # f10 = f10*x
mult $f11, $f16, $f11 # f11 = f11*x
ldt $f21, +1*8($17) # f8 = a(1)
ldt $f19, +3*8($17) # f9 = a(3)
mult $f12, $f12, $f14 # x4 = x2**2
mult $f16, $f12, $f13 # x3 = x*x2
lda $0, +92*8($17) # set address of a(i)
addt $f10, $f21, $f10 # f10 = f10+f8
addt $f11, $f19, $f11 # f11 = f11+f9
or $31, 10, $1 # set loop iteration count
mult $f14, $f14, $f18 # x8 = x4**2
ldt $f0, +1*8($0) # f0 = a(93)
mult $f16, $f14, $f15 # x5 = x*x4
ldt $f1, +2*8($0) # f1 = a(94)
mult $f12, $f14, $f20 # x6 = x2*x4
ldt $f22, +3*8($0) # f2 = a(95)
ldt $f23, +4*8($0) # f3 = a(96)
mult $f13, $f14, $f17 # x7 = x3*x4
ldt $f24, +5*8($0) # f4 = a(97)
unop
ldt $f25, +6*8($0) # f5 = a(98)
mult $f0, $f18, $f0 # f0 = f0*x8
ldt $f26, +7*8($0) # f6 = a(99)
unop
ldt $f27, +8*8($0) # f7 = a(100)
mult $f1, $f18, $f1 # f1 = f1*x8
ldt $f21, -7*8($0) # f8 = a(85)
unop
mult $f22, $f18, $f22 # f2 = f2*x8
ldt $f19, -6*8($0) # f9 = a(86)
mult $f23, $f18, $f23 # f3 = f3*x8
unop
..align 4
loop_top: # do i = 92,20,-8
mult $f24, $f18, $f24 # f4 = f4*x8
addt $f0, $f21, $f0 # f0 = f0+f8
ldt $f21, -5*8($0) # f8 = a(i-5)
unop
mult $f25, $f18, $f25 # f5 = f5*x8
addt $f1, $f19, $f1 # f1 = f1+f9
ldt $f19, -4*8($0) # f9 = a(i-4)
unop
mult $f26, $f18, $f26 # f6 = f6*x8
addt $f22, $f21, $f22 # f2 = f2+f8
ldt $f21, -3*8($0) # f8 = a(i-3)
unop
mult $f27, $f18, $f27 # f7 = f7*x8
addt $f23, $f19, $f23 # f3 = f3+f9
ldt $f19, -2*8($0) # f9 = a(i-2)
unop
mult $f0, $f18, $f0 # f0 = f0*x8
addt $f24, $f21, $f24 # f4 = f4+f8
ldt $f21, -1*8($0) # f8 = a(i-1)
unop
mult $f1, $f18, $f1 # f1 = f1*x8
addt $f25, $f19, $f25 # f5 = f5+f9
ldt $f19, -0*8($0) # f9 = a(i)
subq $0, 8*8 # point to a(i-8)
mult $f22, $f18, $f22 # f2 = f2*x8
addt $f26, $f21, $f26 # f6 = f6+f8
ldt $f21, -7*8($0) # f8 = a(i-15)
subq $1, 1, $1 # update loop counter
mult $f23, $f18, $f23 # f3 = f3*x8
addt $f27, $f19, $f27 # f7 = f7+f9
ldt $f19, -6*8($0) # f9 = a(i-14)
bne $1, loop_top # end do
mult $f24, $f18, $f24 # f4 = f4*x8
addt $f0, $f21, $f0 # f0 = f0+f8
ldt $f21, -5*8($0) # f8 = a(i-5)
unop
mult $f25, $f18, $f25 # f5 = f5*x8
addt $f1, $f19, $f1 # f1 = f1+f9
ldt $f19, -4*8($0) # f9 = a(i-4)
unop
mult $f26, $f18, $f26 # f6 = f6*x8
addt $f22, $f21, $f22 # f2 = f2+f8
ldt $f21, -3*8($0) # f8 = a(i-3)
unop
mult $f27, $f18, $f27 # f7 = f7*x8
addt $f23, $f19, $f23 # f3 = f3+f9
ldt $f19, -2*8($0) # f9 = a(i-2)
unop
mult $f10, $f16, $f10 # f10 = f10*x
addt $f24, $f21, $f24 # f4 = f4+f8
ldt $f21, -1*8($0) # f8 = a(i-1)
unop
mult $f11, $f13, $f11 # f11 = f11*x3
addt $f25, $f19, $f25 # f5 = f5+f9
ldt $f19, -0*8($0) # f9 = a(i)
unop
addt $f26, $f21, $f26 # f6 = f6+f8
ldt $f21, +0*8($17) # f8 = a(0)
addt $f27, $f19, $f27 # f7 = f7+f9
mult $f1, $f16, $f1 # f1 = f1*x
mult $f22, $f12, $f22 # f2 = f2*x2
addt $f10, $f21, $f21 # f8 = f10+f8
mult $f23, $f13, $f23 # f3 = f3*x3
mult $f24, $f14, $f24 # f4 = f4*x4
mult $f25, $f15, $f25 # f5 = f5*x5
addt $f0, $f1, $f0 # f0 = f0+f1
mult $f26, $f20, $f26 # f6 = f6*x6
addt $f21, $f11, $f21 # f8 = f8+f11
mult $f27, $f17, $f27 # f7 = f7*x7
addt $f22, $f23, $f22 # f2 = f2+f3
addt $f24, $f25, $f24 # f4 = f4+f5
addt $f26, $f27, $f26 # f6 = f6+f7
addt $f0, $f22, $f0 # f0 = f0+f2
addt $f24, $f26, $f24 # f4 = f4+f6
addt $f0, $f24, $f0 # f0 = f0+f4
mult $f15, $f0, $f0 # f0 = x5*f0
addt $f0, $f21, $f0 # poly = f0+f8
ret ($26)
As you can see, this code is really horrible and one would prefer
the compiler did it for us, but we weren't so lucky this time.
Horner: x = 0.681374216292852 time = 3.04687500000000
Naive: x = 0.681374216293617 time = 1.82812500000000
F90 Array: x = 0.681374216291147 time = 36.1875000000000
Horner mod 2: x = 0.681374216292852 time = 1.75000000000000
Horner mod 8: x = 0.681374216293065 time = 5.88281250000000
Pseudo-ASM: x = 0.681374216293879 time = 0.859375000000000
True ASM: x = 0.681374216293879 time = 0.578125000000000
You can see that the assembly code took about
0.578125*533.e6/100/2.e6 = 1.54 clock cycles per iteration while the
Horner method used 8.12 clock cycles per iteration. So even though
the compiler does almost a perfect job with Horner's method it's
falling far short of the best we can do (which is about 10% faster
than the assembly code above: unrolling the inner loop completely
eliminates the loop overhead, but that would have been way too
much code to post). The point I'm trying to make here is that even
on a single processor you have to achieve lots of parallelism (here
8X) to get the optimum throughput and that older concepts about
simply minimizing the total instruction count without worrying about
latency are not valid (and have not been for quite a while now).