What you always wanted to know about math copros 2/5

|S| Norbert Juffa

Sep 28, 1992, 10:09:13 AM

Cyrix 83D87 was introduced in 1989, only shortly after the
coprocessors from IIT. It has been the fastest
387 compatible coprocessor in several benchmark
comparisons [1,7,68,69]. It also came out as the
fastest coprocessor in my own tests (see benchmark
results below). Although the Cyrix 83D87 provides
up to 50% more performance than the Intel 387DX
in benchmark comparisons, the speed advantage
over other 387 compatible coprocessors in real
applications is usually much smaller. For example,
in a test using the program 3D-Studio, the Cyrix
83D87 was 6% faster than the Intel 387DX [1].
Besides being the fastest 387 coprocessor, the
83D87 also offers the most accurate transcendental
function results of all coprocessors tested (see
test results below). The new version of the 83D87,
which is sold as 387+ in Europe, even surpasses
the level of accuracy of the original 83D87 design.
Unlike Intel's coprocessors, which use the CORDIC
[18,19] algorithm to compute the transcendental
functions, Cyrix uses polynomial and rational
approximations to the functions. In the past the
CORDIC method has been popular since it requires
only shifts and adds which makes it easy to implement.
It is also reasonably fast. Recently, the cost of
implementing fast floating-point multipliers
has dropped significantly due to the availability of
VLSI, making the use of polynomial and rational
approximations superior to CORDIC for the generation
of transcendental functions [61]. The Cyrix 83D87
uses a fast array multiplier, making its transcendental
functions faster than those of any other 387 compatible
coprocessor. It also uses 75 bits for the mantissa
in intermediate calculations (as opposed to 68 bits
on other coprocessors), making its transcendental
functions more accurate than those of any other
coprocessor or FPU (see results below). The 83D87
and its successor, the 387+, are the 387 'clones'
with the highest degree of compatibility. There
are only very few software and hardware incompatibilities with
the Intel 387DX. These have been documented by
Cyrix [12]. The software differences are caused
by some bugs present in the 387DX that Cyrix fixed
for the 83D87. Unlike the Intel 387DX, the 83D87
(and all other 387 'clones' as well) does not support
asynchronous operation of CPU and coprocessor. There
have also been problems in the past with the CPU -
coprocessor communication, causing the 83D87 to
hang on some machines. The reason was that Cyrix
shaved off a wait state in the communication protocol,
which caused a communications breakdown between the
CPU and the 83D87 for some systems running at 25 MHz
or faster. One notable example of this behavior was
the Intel 302 board. The problem is only rarely
encountered with the current generation of 386
motherboards. It is possible that the problem has
been entirely eliminated in the 387+, the successor
to the 83D87. To reduce power consumption the 83D87
features advanced power saving features. Those
portions of the coprocessor that are not needed
are automatically shut down. If no coprocessor
instructions are being executed, all parts except
the bus interface unit are shut down [12]. Maximal
power consumption of the Cyrix 83D87 at 33 MHz is
1900 mW, typical power consumption at this clock
frequency is 500 mW [15].
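The contrast drawn in this entry between CORDIC (shifts and adds only) and polynomial approximation (which needs a fast multiplier) can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up for this example, multiplying by a power of two stands in for a hardware shift, and plain Taylor coefficients stand in for the tuned polynomial or rational approximations a real coprocessor would use. Nothing here is taken from Intel or Cyrix microcode.

```python
import math

def cordic_sin_cos(theta, iterations=40):
    """Rotation-mode CORDIC sketch; valid for roughly |theta| <= 1.74 rad.

    Each step only adds, subtracts, and scales by 2**-i (a shift in hardware).
    """
    # Table of elementary rotation angles atan(2**-i); a chip would hold it in ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Pre-compensate the constant gain accumulated by the micro-rotations.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x  # (sin, cos)

def poly_sin(x):
    """Polynomial sketch: odd Taylor series up to x**11, evaluated by Horner's rule.

    Every Horner step costs one multiplication, which is why a fast
    multiplier makes this approach attractive.
    """
    coeffs = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(6)]
    x2 = x * x
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x2 + c
    return x * acc

if __name__ == "__main__":
    t = 0.5
    print("CORDIC    :", cordic_sin_cos(t)[0])
    print("polynomial:", poly_sin(t))
    print("math.sin  :", math.sin(t))
```

The CORDIC loop needs one iteration per result bit, while the polynomial needs only a handful of multiply-add steps, so the polynomial wins once multiplication is cheap.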
Cyrix EMC87 is basically a special version of the Cyrix 83D87.
In addition to the normal 387 operating mode, in
which coprocessor-CPU communication is handled through
reserved IO-ports, it also offers a memory-mapped
mode of operation similar to the operation principle
of the Weitek Abacus. To implement the memory mapped
interface, the usual 80x87 architecture has been
slightly expanded with three additional registers
and eleven additional instructions that can only be
used if the memory mapped mode is enabled. Please
note that the EMC87 is *not* compatible with Weitek's
Abacus coprocessor. They both use the same interface
technique (memory mapping) but while the EMC87 uses
the standard 387 instruction set, the Weitek Abacus
coprocessors use a different instruction set of their
own. Like the Weitek Abacus, the EMC87 occupies a
block of memory starting at physical address C0000000h
(the Abacus occupies a memory block of 64 kB, while
the EMC87 uses only 4 kB [77]). It can therefore
only be accessed in the protected or virtual modes
of the 386 CPU. DOS programs can access the EMC87
with the help of DOS extenders or memory managers
like EMM386 which run in protected/virtual mode
themselves. Since the EMC87 also provides the standard
CPU interface via IO-ports, it can be used just like
any other 387 compatible coprocessor and delivers the
same performance as the Cyrix 83D87 in this mode. The
EMC87 even allows mixed use of memory-mapped and
traditional instructions in the same code. Using the
memory mapped mode of the EMC87 provides a significant
speed advantage. The traditional 387 CPU-coprocessor
interface via IO-ports has an overhead of about
14-20 clock cycles. Since the Cyrix 83D87 executes
some operations like addition and multiplication in
much less time, its performance is limited by the
CPU-coprocessor interface. The memory-mapped mode
has much less overhead and allows all coprocessor
instructions to be executed at full speed and with
no penalty. For this reason, Cyrix introduced the
EMC87 in 1990. In a test, the EMC87 at 33 MHz ran
the single precision Whetstone benchmark at 7608
kWhetstones/sec, while the Cyrix 83D87 at 33 MHz had
a speed of only 5049 kWhetstones/sec, an increase
of 50.6% [63]. In another test, the EMC87 ran a
fractal computation at two times the speed of the
Cyrix 83D87 and 2.6 times as fast as an Intel 387DX
[64]. A third test found the EMC87's overall
performance to be 20% higher than the performance of
the Cyrix 83D87 [65]. The Cyrix FasMath EMC87 has
also been sold as Cyrix AutoMATH by Cyrix. The two
chips are 100% identical. Unlike the Cyrix 83D87,
which fits into the 68-pin 387 coprocessor socket,
the EMC87 comes in a 121-pin PGA and requires the
121-pin EMC (Extended Math Coprocessor) socket. Note
that not all boards have such a socket, a notable
exception being IBM's PS/2s, for example. Originally,
Cyrix claimed support for the fast memory mapped mode
of the EMC87 from a lot of software vendors (including
Borland and Microsoft). However, there are only
very few applications that make use of it, among
them Evolution Computing's FastCAD 3D, MicroWay
Inc.'s NDP FORTRAN-386 compiler, Metaware's High-C
compiler version 1.6 and newer, and Intusoft's
Spice [63,73]. Part of the problem in supporting the
memory mapped mode is that one has to reserve one of
the general purpose registers of the CPU to use
memory mapped mode instructions that access memory.
Cyrix has implemented some additional instructions
in the EMC87 that are also available in the 387
compatible mode: FRICHOP, FRINT2, and FRINEAR. These
instructions enable rounding to integer without setting
the rounding mode by manipulating the coprocessor
control word and are intended to make life easier
for compiler writers. The EMC87 is available in 25 and 33
MHz versions. Max. power consumption at 33 MHz is
2000 mW. Cyrix is currently phasing out the EMC87.
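The interface overhead argument in this entry can be made concrete with a little arithmetic. The sketch below is a rough model, not a measurement: the helper names are invented for this example, and the 6-clock execution time is a hypothetical value chosen only to show how a 14-20 clock I/O-port overhead dominates a fast instruction. The Whetstone figures are the ones quoted above from [63].

```python
def interface_share(exec_clocks, overhead_clocks):
    """Fraction of an instruction's total time spent in the CPU-coprocessor interface."""
    return overhead_clocks / (exec_clocks + overhead_clocks)

def percent_gain(new_rate, old_rate):
    """Relative gain of new_rate over old_rate, in percent."""
    return (new_rate / old_rate - 1.0) * 100.0

if __name__ == "__main__":
    exec_clocks = 6  # hypothetical execution time of a fast add or multiply
    for overhead in (14, 20):  # overhead range quoted for the I/O-port interface
        share = interface_share(exec_clocks, overhead)
        print(f"overhead {overhead} clocks: {share:.0%} of total time is interface")
    # Single precision Whetstone results quoted in the text (kWhetstones/sec).
    print(f"EMC87 over 83D87: {percent_gain(7608, 5049):.1f}% faster")
```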
Cyrix 387+ is the successor to the Cyrix 83D87. The name 387+
is only used for European distribution. In other
parts of the world, the second generation 387 'clone'
from Cyrix still goes by the name 83D87. In my tests,
I found the Cyrix 387+ to be about 5 to 10 percent
*slower* than the Cyrix 83D87. However, some
instructions like the square root (FSQRT) now
run at only half the speed at which they ran in
the 83D87 and the transcendental functions show
a 40% drop in performance compared with the 83D87
on the average (see performance results below). I
also found the transcendental functions on the 387+
to be a bit *more* accurate than those implemented
in the 83D87. According to a source with Cyrix [73],
the 387+ was designed to be a smaller and thus
cheaper coprocessor chip that can also run at
higher frequencies than the 83D87. The new design
uses a slower hardware multiplier that needs 6
clock cycles to multiply the floating point mantissa
of an internal precision number, while the multiplier
in the 83D87 takes only 4 clocks to accomplish the
same task. Since the transcendental functions are
generated by polynomial and rational approximations
in Cyrix math coprocessors, this slows them down
quite a bit. The divide/square root logic has also
been changed from the 83D87 design. The original design
used an algorithm that could generate both the
quotient and square root, so the execution times for
these instructions were nearly identical. The algorithm
chosen for the division in the 387+ doesn't allow
the square root to be taken so easily, so it takes
nearly twice as long. The 387+ is available in
versions of up to 40 MHz. In the 387+, the available
argument range for the FYL2XP1 instruction has been
extended from the usual range -1+sqrt(2)/2..sqrt(2)/2,
that is found on all 80x87 coprocessors, to all
floating-point numbers. Also, four additional
instructions have been implemented: FRICHOP (opcode
DD FC), FRINT2 (opcode DB FC), FRINEAR (opcode DF FC),
and FTSTP (opcode D9 E6).
Cyrix 83S87 is the SX version of the Cyrix 83D87. Just as the
Cyrix 83D87 is the fastest 387 compatible coprocessor,
the Cyrix 83S87 is the fastest of the 387SX compatible
coprocessors [1]. Besides being the fastest 387SX
'clone', the Cyrix 83S87 also features the most
accurate transcendental functions. 83S87 chips sold
after about August 1992 use the internals of the
Cyrix 387+, the successor to the original 83D87 [73].
The 83S87 is packaged in a 68-pin PLCC and is available
in 16, 20 and 25 MHz versions. Due to the advanced
power saving features of the Cyrix coprocessor, the
typical power consumption of the 20 MHz version is
about 350 mW [67].
ULSI 83C87 is a 387 'clone' that came out in early 1991, well
after the IIT 3C87 and Cyrix 83D87. Like all clones,
it is somewhat faster than the Intel 387DX. Especially
the basic arithmetic functions are fast, while the
transcendental functions show only a slight speed
improvement over the Intel 387DX (see benchmark
results below). In my tests, the ULSI had the least
accurate transcendental functions. However, the
maximum relative error is still within the limits
set by Intel, so this is probably not an important
issue in all but very few applications. The ULSI
83C87 shows some minor flaws in the tests for IEEE
754 compatibility, but this, too, is unimportant
under typical operating conditions. It is interesting
to note that a ULSI 83S87 manufactured in 92/17
showed fewer errors in the IEEETEST test run [74] than
the ULSI 83C87, manufactured in 91/48, that I used in
my original test. This indicates that ULSI might
have applied some quick fixes to newer revisions
of their math coprocessors. ULSI claims that the
program IEEETEST, which was used to test for IEEE
compatibility, contains many personal interpretations
of the IEEE standard by the program's author and
states that there is no ANSI-certified IEEE-754
compliance test. While this may be true, it
is also a fact that the IEEE test vectors used in
IEEETEST are sort of an industry standard and that
Intel's 387, 486, and RapidCAD chips pass it
without a single failure, as do the coprocessors from
Cyrix. Since the ULSI Math*Co 83C87 fails some of
the tests, it is certainly less than 100% compatible
with Intel's chips, although this will hardly make
any difference in typical operating conditions.
The ULSI 83C87 fails to be compatible with
IEEE-754 in that it does not implement the precision
control feature. While all the internal operations of
80x87 coprocessors are usually done with the maximum
precision available (double extended precision with
64 mantissa bits), the 80x87 also offer the possibility
to force lower precision to be used for the basic
arithmetic functions add, subtract, multiply, divide,
and square root. This feature is required by IEEE-754
for all coprocessors that cannot store results
*directly* to a single or double precision location.
Since the 80x87 coprocessors lack this capability,
they have to implement precision control to provide
correctly rounded single and double precision results
according to the floating-point standard. All 80x87
coprocessors except the ones from ULSI support this
feature. For programs that make use of precision
control, e.g. Interactive UNIX, correct implementation
of the feature may be essential for correct arithmetic
results. Like the other 387 'clones', the 83C87 does
not support asynchronous operation of the CPU and the
coprocessor. This means that the 83C87 always runs at
the full speed of the CPU. The ULSI 83C87 is available
in 20, 25, 33, and 40 MHz versions. The ULSI is
produced in high performance, low power CMOS. Power
consumption at 20 MHz is max. 800 mW (400 mW typical),
at 25 MHz it is max. 1000 mW (500 mW typical), at 33
MHz it is max. 1250 mW (625 mW typical), and at 40 MHz the
ULSI Math*Co 83C87 consumes max. 1500 mW (750 mW
typical) [58]. The 83C87 is packaged in a 68-pin
ceramic PGA. ULSI coprocessors come with a lifetime
warranty. ULSI Systems, Inc. will replace the
coprocessor up to three times free of charge should
it ever fail.
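Why precision control matters for correctly rounded single and double precision results can be shown with exact rational arithmetic. The sketch below is an illustration, not a model of any particular chip: `round_to_bits` is a helper written for this example, and the value `exact` is a hypothetical infinitely precise result picked to expose the effect. Rounding it first to a 64-bit mantissa and then to 24 bits (no precision control) gives a different answer than rounding it once to 24 bits (precision control set to single).

```python
from fractions import Fraction

def round_to_bits(x, bits):
    """Round a positive Fraction to `bits` significant bits (nearest, ties to even)."""
    # Find e such that 2**(bits-1) <= x / 2**e < 2**bits.
    e = x.numerator.bit_length() - x.denominator.bit_length() - bits
    while x / Fraction(2) ** e >= 2 ** bits:
        e += 1
    while x / Fraction(2) ** e < 2 ** (bits - 1):
        e -= 1
    scaled = x / Fraction(2) ** e
    q = scaled.numerator // scaled.denominator  # integer mantissa, truncated
    rem = scaled - q                            # discarded fraction, 0 <= rem < 1
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and q % 2 == 1):
        q += 1
    return q * Fraction(2) ** e

# Hypothetical exact result lying just above the midpoint of two single values.
exact = 1 + Fraction(1, 2 ** 24) + Fraction(1, 2 ** 70)

direct = round_to_bits(exact, 24)                         # one rounding
via_extended = round_to_bits(round_to_bits(exact, 64), 24)  # two roundings

if __name__ == "__main__":
    print("rounded once to 24 bits      :", float(direct))
    print("rounded to 64, then 24 bits  :", float(via_extended))
```

The first rounding to 64 bits discards the tiny term that broke the tie, so the second rounding sees an exact halfway case and rounds the other way.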
ULSI 83S87 is the SX version of the ULSI 83C87 for operation
with an Intel 386SX or an AMD Am386SX CPU. It is
functionally equivalent to the 83C87. To aid low
power laptop designs, the ULSI 83S87 features an
advanced power saving design with a sleep mode and
a standby mode with only minimal power requirements.
Power consumption under normal operating conditions
(dynamic mode) is max. 400 mW at 16 MHz (300 mW
typical), max. 450 mW at 20 MHz (350 mW typical),
and max. 500 mW at 25 MHz (400 mW typical) [58].
The ULSI 83S87 is packaged in a 68-pin PLCC.
C&T 38700DX is the latest entry into the 387 'clone' market.
Originally announced in October, 1991, it has
apparently not been available to end users before
third quarter of 1992, at least here in Germany.
The C&T 38700DX is compatible with the Intel 387DX.
My tests show that compatibility is indeed very good,
even for the more arcane features of the 387DX and
comparable to the coprocessors from Cyrix. Like
the coprocessors from Cyrix and Intel, it passes
the IEEETEST program without a single failure. It
passes, of course, all tests in Chips & Technologies'
own compatibility test program SMDIAG. However, some
of the tests (transcendental functions) in this program
are selected in such a way that the C&T 38700 passes
while the Cyrix 83D87 or Intel RapidCAD fail, so they
are not very useful. There is also a 'bug' in the test
for FSCALE that hides a true bug in the C&T 38700. In
my own speed tests [see below] and those reported in
[1], the C&T 38700DX showed performance at about 90-
100% the level of the Cyrix 83D87, which is the 387
'clone' with the highest performance. For floating
point intensive benchmarks the C&T 38700DX provides up
to 50% more computational power than the Intel 387DX.
However, as with all other 387 compatible coprocessors,
the speed advantage over the Intel 387DX is hardly
measurable in real applications. The accuracy of the
transcendental functions on the C&T 38700DX varies.
Overall accuracy of the transcendental functions is
slightly better than on the Intel 387DX. The SuperMATH
38700DX is implemented in 1.2 micron CMOS with on-chip
power management, which makes for low power consumption.
The 38700DX is packaged in a 68-pin ceramic PGA (pin
grid array) and is available in speeds of 16, 20, 25, 33,
and 40 MHz.
C&T 38700SX is the SX version of the 38700DX and compatible with
the Intel 387SX. It provides performance similar to
the Cyrix 83S87 [1], the 387SX 'clone' with the
highest performance. Compatibility with the Intel
387SX is very good and comparable with the high degree
of compatibility found in the Cyrix 83S87. It
has low power consumption. The SuperMATH 38700SX is
packaged in a 68-pin PLCC (plastic leaded chip carrier)
and available in speeds of 16, 20, and 25 MHz.
Intel RapidCAD is not a coprocessor, strictly speaking, although it
is marketed as one. Rather, it is a CPU replacement.
It is basically an Intel 486DX without the cache and
with a 386 pinout. RapidCAD is delivered as a set of
two chips. RapidCAD-1 goes into the 386 socket and
contains the CPU and FPU, RapidCAD-2 goes into the
coprocessor socket and contains a PAL that generates
the Ferr signal that is normally generated by a
coprocessor and used by the motherboard circuitry to
provide 287 compatible coprocessor exception handling
in 386/387 systems. The RapidCAD instruction set is
compatible with the 386, so it doesn't know the 486
specific instructions like BSWAP. Since the RapidCAD
CPU core is very similar to the 486 CPU core, most of the
register to register instructions execute in the same
number of clock cycles as on the 486. The use of the
386 bus interface causes instructions that access memory
to execute at about the same speed as on the 386. The
integer performance on the RapidCAD is definitely
limited by the low memory bandwidth provided by the
386 bus interface (2 clock cycles per bus cycle)
and the lack of an internal cache. CPU instructions
often execute faster than they can be fetched from
memory, even with a big and fast external cache.
Therefore, the integer performance of the RapidCAD
exceeds that of a 386 by *at most* 35%. This value
was derived by running some programs that use
mostly register-to-register operations and few
memory accesses. This finding is supported by the
SPEC ratings that Intel reports for the 386-33
and the RapidCAD-33. While the 386-33 has a
SPECint of 6.4, the RapidCAD has a SPECint of 7.3
[28], a 14% increase. Note that these tests used
the old (1989) SPEC benchmarks suite. While CPU
instructions often execute in one clock cycle on
the RapidCAD, FPU instructions always take more
than seven clock cycles. They are therefore rarely
slowed down by the low memory bandwidth provided
by the 386 bus interface. My tests show a 70%-100%
performance increase for floating-point intensive
benchmarks (see below) over a 386 based system
using the Intel 387DX math coprocessor. This is
consistent with the SPECfp rating reported by Intel.
The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
the RapidCAD is rated at 6.1 SPECfp at the same
frequency, an 85% increase. This means that a system
that uses the RapidCAD is faster than any 386/387
combination, regardless of the type of 387 used
(Intel 387DX or faster clone). The diagnostic disk
for the RapidCAD also gives some application
performance data for the RapidCAD compared to the
Intel 387DX:

Application            Time w/ 387DX   Time w/ RapidCAD   Speedup

AutoCAD 11                  52 sec           32 sec          63%
AutoShade/Renderman        180 sec          108 sec          67%
Mathematica (Windows)      139 sec          103 sec          35%
SPSS/PC+ 4.01               17 sec           14 sec          21%
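
The percentages quoted in this entry can be reproduced with two one-line formulas. This is a small sketch with function names invented for the example; it assumes the Speedup column is the old run time divided by the new run time, minus one, which is consistent with the rounded figures in the table but is not stated on the diagnostic disk output quoted here.

```python
def speedup_from_times(old_seconds, new_seconds):
    """Percent speedup when a job that took old_seconds now takes new_seconds."""
    return (old_seconds / new_seconds - 1.0) * 100.0

def gain_from_scores(new_score, old_score):
    """Percent gain for higher-is-better ratings such as SPECint or SPECfp."""
    return (new_score / old_score - 1.0) * 100.0

if __name__ == "__main__":
    # (application, seconds with 387DX, seconds with RapidCAD) from the table.
    timings = [
        ("AutoCAD 11", 52, 32),
        ("AutoShade/Renderman", 180, 108),
        ("Mathematica (Windows)", 139, 103),
        ("SPSS/PC+ 4.01", 17, 14),
    ]
    for name, old, new in timings:
        print(f"{name:<22} {speedup_from_times(old, new):5.1f}%")
    # SPEC ratings at 33 MHz quoted in the text.
    print(f"SPECint gain: {gain_from_scores(7.3, 6.4):.1f}%")
    print(f"SPECfp gain : {gain_from_scores(6.1, 3.3):.1f}%")
```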

RapidCAD is available in 25 MHz and 33 MHz versions.
It is distributed through different channels than the
other Intel math coprocessors. Therefore, I have been
unable to obtain a data sheet for it. [78] gives the
typical power consumption of the 33 MHz RapidCAD as
3500 mW, which is the same as for the 33 MHz 486DX.
The RapidCAD-1 chip gets quite hot when operating.
Therefore, I recommend extra cooling for this chip
(see the paragraph below on the 486 for details). The
RapidCAD-1 is packaged in a 132-pin PGA, just like the
80386, and the RapidCAD-2 is packaged in a 68-pin PGA
like an 80387 coprocessor.
Intel 486DX is not a coprocessor. This chip, brought out in
1989, functionally combines the CPU (a heavily pipelined
implementation of the 386 architecture) with an
enhanced 387 (the floating-point unit, FPU) and
8 kB of unified code/data cache on one chip. Of
course, this description is simplified; for a
detailed hardware description, see [52]. The
486DX offers about two to three times the integer
performance of a 386 at the same frequency.
Floating point performance is about three to four
times as high as on the Intel 387DX at the same
clock rate [29]. Since the FPU is on the same
chip as the CPU, the considerable communication
overhead between CPU and coprocessor in a 386/387
system is omitted, letting FPU instructions run
at the full speed permitted by the implementation.
The FPU also takes advantage of the on-chip cache
and the highly pipelined execution unit. Besides
the higher speed, the 486 FPU features more accurate
transcendental functions than the Intel 387DX
coprocessor according to tests run by me (see below).
To achieve better interrupt latency, FPU instructions
with a long execution time have been made abortable
in the case an interrupt occurs during their
execution. The concurrent execution of CPU and
coprocessor instructions typical for 80x86/80x87
systems is still in existence on the 486, but
some FPU instructions like FSIN have nearly no
concurrency with CPU instructions, indicating
that they make heavy use of both CPU and FPU
resources [53, 1]. The 486DX comes in a 168 pin
ceramic PGA (pin grid array). It is available in
25 MHz and 33 MHz versions. Since the end of 1991,
there has also been a 50 MHz version available, produced in
a CHMOS V process (the 25 MHz and 33 MHz are
produced using the CHMOS IV process). Maximum
power consumption is 3500 mW for the 25 MHz 486
(2600 mW typical), 4500 mW for the 33 MHz version
(3500 mW typical), and 5000 mW (4000 mW typical)
for the 50 MHz chip. Due to the considerable amount
of heat produced by these chips, and taking into
consideration the slow air flow provided by the
fan in garden variety PC tower cases, I recommend
an extra fan directly above the CPU for safer
operation. If you measure the surface temperature
of an i486 in a normal tower case without extra
cooling after some time of operation, you may well
come up with something like 80 - 90 degrees Celsius
(that is 176 - 194 degrees Fahrenheit for those not
familiar with metric units) [54,55]. You don't need
the well known and expensive IceCap(tm) to effectively
cool your CPU. A simple fan mounted directly above
the CPU can bring the temperature down to about 50
to 60 degrees Celsius (122 - 140 degrees Fahrenheit)
depending on the room temperature and the temperature
within the PC case (which depends on the total power
dissipation of all the components and the cooling
provided by the fan in the power unit). According
to a simple rule known as Arrhenius' Law, lowering
the temperature by 10 degrees Celsius slows down
chemical reactions by a factor of two, thus lowering
the temperature of your CPU by 30 degrees should
prolong the life of the device by a factor of eight
due to the slower ageing process. If you are reluctant
to add a fan to your system because of the additional
noise, settle for a low-noise fan like those
available from the German manufacturer Pabst (this
is not meant to be an advertisement. I am just the
happy owner of such a fan. Besides that, I have no
connections to the firm).
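The rule of thumb used in this entry (every 10 degrees Celsius of cooling halves the rate of ageing) and the temperature conversions can be written out as a short calculation. This is a sketch of the rule exactly as stated above, with function names chosen for the example; it is not a reliability model for any specific chip.

```python
def lifetime_factor(cooling_celsius, halving_step=10.0):
    """Expected lifetime multiplier if ageing halves per `halving_step` degrees C of cooling."""
    return 2.0 ** (cooling_celsius / halving_step)

def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0

if __name__ == "__main__":
    for c in (50, 60, 80, 90):  # surface temperatures mentioned in the text
        print(f"{c} C = {celsius_to_fahrenheit(c):.0f} F")
    print("lifetime factor for 30 C of cooling:", lifetime_factor(30))
```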
Intel 486DX2 is the name for Intel's latest generation of 486 CPUs.
Using the DX2 suffix instead of simply DX is meant
to be an indicator that these are clock-doubled
versions. A normal 486DX operates at the frequency
provided by the incoming clock signal. A 486DX2
generates a new clock signal from the incoming clock
by means of a PLL (phase locked loop). In the DX2,
this clock signal has twice the frequency of the
incoming clock, hence the name clock-doubler. All
internal parts of the 486DX2 (cache, CPU core, FPU)
run at this higher frequency. Only the bus interface
runs at the normal speed. That way, a 486DX2-50 can
run on a motherboard designed for 25 MHz operation.
Since motherboards for 50 MHz operations are much
harder to design than those for 25 MHz, this makes
a 486DX2-50 system easier to build and cheaper than
a 486DX-50 system. For all operations that don't
access off-chip resources (e.g. register operations)
a 486DX2-50 provides exactly the same performance as
a 486DX-50 and twice the performance of a 486DX-25.
However, since the main memory in a 486DX2-50 system
still operates at 25 MHz, all instructions involving
memory accesses are potentially slower than in a
486DX-50 system, whose memory also runs at 50 MHz.
The internal cache of the 486 mitigates this problem a
bit, but overall performance of a 486DX2-50 is still
lower than that of a 486DX-50, although Intel's
documentation [32] shows this drop to be quite small.
It depends a lot on the code one runs, though. The
nice thing about the 486DX2 is that it allows easy
upgrading of 25 and 33 MHz 486 systems, since the
486DX2 is completely pin-compatible with the 486DX.
Just take out the 486DX and plug in the new 486DX2.
Note that power consumption of the 486DX2-50 equals
that of the 486DX-50 (4000 mW typical), and that the
486DX2-66 exceeds this by about 30%. These chips get
really hot in a standard PC case with no extra cooling.
See the above paragraph for more detailed information
on this problem.
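The effect of clock doubling described in this entry can be approximated with a two-term timing model: work done on-chip is paid for at the core clock, work that goes off-chip is paid for at the bus clock. The sketch below is a deliberate simplification with invented names and a hypothetical workload; it ignores cache hit rates, wait states, and overlap between core and bus activity.

```python
# (core clock in MHz, bus clock in MHz) for the three chips compared in the text.
CHIPS = {
    "486DX-25": (25.0, 25.0),
    "486DX2-50": (50.0, 25.0),
    "486DX-50": (50.0, 50.0),
}

def run_time_us(core_cycles, bus_cycles, core_mhz, bus_mhz):
    """Run time in microseconds: on-chip cycles at core speed plus off-chip cycles at bus speed."""
    return core_cycles / core_mhz + bus_cycles / bus_mhz

if __name__ == "__main__":
    # Hypothetical workloads: pure register code, and code with some memory traffic.
    workloads = {"register only": (1000, 0), "with memory accesses": (1000, 200)}
    for label, (core_cycles, bus_cycles) in workloads.items():
        print(label)
        for chip, (core_mhz, bus_mhz) in CHIPS.items():
            t = run_time_us(core_cycles, bus_cycles, core_mhz, bus_mhz)
            print(f"  {chip:<10} {t:6.1f} us")
```

With no off-chip cycles the 486DX2-50 matches the 486DX-50 and halves the 486DX-25 time; as soon as bus cycles are added it falls between the two.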
Intel 487SX is the coprocessor intended for use in 486SX systems.
The 486SX is basically a 486DX without the floating-
point unit (FPU) [48, 50]. Originally Intel sold
486DXs with a defective FPU as 486SXs but it has
now completely removed the FPU part from the 486SX
mask for mass production. The introduction of the
486SX in 1991 has been viewed mainly as a marketing
'trick' by Intel to take market share from the 386
based systems once AMD became successful with their
Am386 (AMD has taken as much as 40% of the 386 market
due to some superior features such as higher clock
frequency, lower power consumption, and a fully
static design). A 486SX at 20 MHz delivers a bit
less integer performance than a 40 MHz Am386. To add
floating-point capabilities to a 486SX based system,
it would be easiest to swap the 486SX with a 486DX
which includes the FPU. However, Intel has prevented
this easy solution by giving the 486SX a slightly
different pin out [48, 51]. Since only three pins
are assigned differently, clever board manufacturers
have come out with boards that accept anything from
a 486SX-20 to a 486DX2-50 in their CPU socket and
provide a clean upgrade path this way. A set of
three jumpers ensures correct signal assignment to
the pins for either configuration. To upgrade systems
without this feature, one has to buy the 487SX and
put it into the "Performance Upgrade Socket" present
in most 486SX systems. Once the 487SX was available,
it was quickly found out that it is just a normal
486DX with a slightly different pin out [49]. Inserting
the 487SX effectively shuts down the 486SX in the
486SX/487SX system, so the 486SX could be removed
once the 487SX is installed. Since the shut down is
logical, not electrical, the 486SX still uses power
if used with the 487SX, although it is inoperative.
Technically speaking, the solution Intel chose was
the only practical way to provide a 486SX system with
the high level of floating-point performance the
486DX offers. The CPU and FPU have to be on the same
chip, otherwise the FPU can not make use of the cache
on the CPU chip and there would be considerable
overhead in CPU-FPU communication (similar to a
386/387 system), nullifying most of the arithmetic
speedups over the 387. That the 486SX, 487SX, and
486DX are not pin-compatible seems to be purely for
marketing reasons. To upgrade a 486SX based system,
Intel also offers the OverDrive chip, which is just
the same as a 487SX with internal clock doubling. It
also goes into the "Performance Upgrade Socket"
found in 486SX systems. The OverDrive roughly doubles
the performance of a 486SX/487SX based system. For an
explanation of clock doubling, see the description
of the 486DX2 above. Like the 486SX, the 487SX is
available in 20 MHz and 25 MHz versions. At 20 MHz,
the 487SX has a power consumption of max. 4000 mW.
It is available in a 169 pin ceramic PGA (pin grid
array).
Weitek 1167 was the predecessor to the Weitek Abacus 3167 math
coprocessor. It was actually a small printed circuit
board with three chips mounted on it. As opposed to
the Weitek 3167, the 1167 did not have a square root
instruction, instead the square root function was
computed by means of a subroutine in the Weitek
transcendental function library. However, the 1167
did have a mode in which it supported denormals.
The Weitek 3167 and 4167 only implement the 'fast'
mode, in which denormals are not supported. Overall
performance of the 1167 is slightly less than that
of the Weitek 3167.
Weitek 3167 was introduced in 1989 to provide the fastest
floating point performance possible on a 386 based
system at that time. The Weitek Abacus 3167 is not
a real coprocessor, strictly speaking, but rather
a memory mapped peripheral device. The Weitek 3167
was optimized for speed wherever possible. Besides
using the faster memory mapped interface to the CPU
(the 80x87 uses IO-ports), it does not support many
of the features of the 80x87 coprocessors, allowing
all of the chip's resources to be concentrated on
the fast execution of the basic arithmetic operations.
For a more detailed description of the Weitek 3167 see
the first chapter of this document. In benchmark
comparisons, the Weitek 3167 provided up to 2.5 times
the performance of an Intel 387DX coprocessor. For
example, on a 33 MHz 3167 the Whetstone benchmark
performed at 7574 kWhetstones/sec compared with the
3743 kWhetstones/s for the Intel 387DX. Note
however that these are single precision results and
that the Weitek 3167's performance would drop to
about half the stated rate for double precision,
while the value for the Intel 387DX would not change
much. Anyhow, before the advent of the Intel RapidCAD,
the Weitek 3167 usually beat all 387 compatible
coprocessors even for double precision operations
[63,65,69]. For typical applications the advantage
of the Weitek 3167 over the 387 clones is much smaller.
In a benchmark test using AutoDesk's 3D-Studio the
Weitek 3167 performed at 123% of the Intel 387DX's
performance compared with 106% for the Cyrix FasMath
83D87 and 118% for the Intel RapidCAD. The Weitek
Abacus 3167 is packaged in a 121-pin PGA that fits
into an EMC socket provided by most 386 based systems.
It does *not* fit into the normal coprocessor socket
designed to hold a 387 compatible coprocessor in a
68-pin PGA. To get the best of both worlds, one might
want to use a Weitek 3167 and a 387 compatible
coprocessor in the same system. These coprocessors
can coexist in the same system just fine. The only
problem is that most 386 based systems contain only one
coprocessor socket, usually of the EMC (extended math
coprocessor) type. Thus, you can install either a
387 coprocessor or a Weitek 3167, but not both. There
are little daughter boards available though that fit
into the EMC socket and provide two sockets, an EMC
and a standard coprocessor socket. At 25 MHz, the
Weitek 3167 has a power consumption of max. 1750 mW.
At 33 MHz, the max. power consumption is 2250 mW.
Weitek 4167 is a memory mapped coprocessor that has the same
architecture as the 3167 and is designed to provide
486 based systems with the highest floating point
performance available. It executes coprocessor
instructions at three to four times the speed of
the Weitek 3167. Although it is up to 80% faster
than the Intel 486 in some benchmarks [1,69], the
performance advantage for real applications is more
like 10%. The introduction of the 486DX2 processors
has more or less obliterated the need for a Weitek
4167, since the DX2 CPUs provide the same performance
and all the additional features the 80x87 has over
the Weitek Abacus. The Weitek 4167 is packaged in
a 142-pin PGA package that is only slightly smaller
than the 486's package. At 25 MHz, it has a max.
power consumption of 2500 mW [32].

If you are interested in techniques for detecting the different
coprocessors described above, I would like to refer you to my
COMPTEST program. COMPTEST reliably detects the type and clock
frequency of the CPU and coprocessor installed in your machine.
COMPTEST is in the public domain and comes with complete source
code. It is available via anonymous ftp from garbo.uwasa.fi and
additional ftp sites that mirror garbo. The current version is
CTEST257.ZIP; future versions will be called CTEST258,
CTEST259 and so on. COMPTEST can correctly identify all of the
coprocessors described above, with the exception of the Weitek
chips, for which the detection mechanism is not that reliable.

Pricing

Due to recent price cuts by Cyrix and subsequently by Intel
for 387 coprocessors, prices have dropped significantly for all
287 and 387 compatible coprocessors with hardly any price difference
between manufacturers. 387DX compatible coprocessors typically sell
for ~US$ 100 for all speeds except for 40 MHz versions which are
typically ~US$ 130. 387SX compatible coprocessors sell for ~US$ 90
regardless of speed with the exception of the 33 MHz versions, which
are ~US$ 100. The Intel 287XL sells for ~US$ 100, while the IIT 2C87
and Cyrix 82S87 sell for about US$ 70. 8087s may be more expensive,
the price of an 8087-10 being US$ 150. I bought the Intel RapidCAD
for US$ 320 and haven't seen it offered for a better price. I see
the Weitek Abacus 3167-33 being offered for US$ 780 and the 4167-33
being offered for US$ 1100. This price information reflects the
price situation as of 09-17-92. Prices can be expected to drop
slightly in the near future.

If you need high floating-point performance, you should
consider buying a 486 based system rather than a 386 based
system with an additional coprocessor. A 386 motherboard
for 33 MHz operation sells for ~US$ 300; together with the
coprocessor, the cost totals ~US$ 400. A 486-33 ISA board sells
for US$ 650. While the 486-33 system is about 60% more expensive
than the 386/387 system, it also provides 100% more integer and
floating-point performance (twice the performance). If you want to push
your 386 based system to maximum floating-point performance and
can't switch to a 486 based system for some reason, I recommend
the Intel RapidCAD. It is both faster [1] and cheaper than installing
a Weitek Abacus 3167 with your 386, which used to be the highest
performing combination before the RapidCAD came out. Similarly,
the introduction of the 486DX2 clock-doubler chips has obliterated
the need for a Weitek 4167 to get maximum floating-point performance
out of a 486 based system. A 486DX2-66 performs at or above the
performance level of a 33 MHz Weitek 4167, even if the latter
uses single precision rather than double precision. The 486DX2-66
is rated by Intel at 24700 double precision kWhetstones/sec and
3.1 double precision Linpack MFLOPS. Of course, these benchmarks
used the highest performance compilers available. But even with
a Turbo Pascal 6.0 program, I managed to squeeze 1.6 double precision
MFLOPS out of the 486DX2-66 for the LLL benchmark (for a description
of the benchmarks mentioned, see the paragraph on benchmarks below).
Although I haven't yet seen 486DX2-66 processors being offered
to end users for upgrade purposes, I recommend the 486DX2-66
to those who need the highest floating-point performance and are
planning to buy a new PC. The price difference between a 33 MHz
486DX motherboard and a 486DX2-66 motherboard is around US$ 600,
well below the price for the Weitek Abacus 4167.
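
The 386/387 versus 486 comparison at the start of this section
can be checked with a few lines of arithmetic. This is only a
sketch: the prices and the "twice the performance" figure are the
ones quoted above, while the variable and function names are
made up for illustration.

```python
# Prices quoted in the text (US$, September 1992).
PRICE_386_BOARD = 300   # 33 MHz 386 motherboard
PRICE_387 = 100         # 387DX compatible coprocessor
PRICE_486_BOARD = 650   # 486-33 ISA board

# Relative performance as stated in the text: the 486-33 delivers
# twice the integer and floating-point performance of the 386/387.
PERF_386_387 = 1.0
PERF_486 = 2.0


def price_performance(price, perf):
    """Return US$ per unit of relative performance (lower is better)."""
    return price / perf


cost_386 = PRICE_386_BOARD + PRICE_387
extra_cost = (PRICE_486_BOARD - cost_386) / cost_386

print(f"386/387 system: US$ {cost_386}")
print(f"486-33 costs {extra_cost:.1%} more")
print(f"US$ per unit, 386/387: {price_performance(cost_386, PERF_386_387):.0f}")
print(f"US$ per unit, 486-33:  {price_performance(PRICE_486_BOARD, PERF_486):.0f}")
```

With these numbers the 486 board costs 62.5% more but comes out
cheaper per unit of performance.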

Operation

In an 80x86/80x87 system, CPU instructions and coprocessor
instructions are executed concurrently. This means that
the CPU can execute CPU instructions while the coprocessor
executes a coprocessor instruction at the same time. The
concurrency is restricted somewhat by the fact that the
CPU has to aid the coprocessor in certain operations. As
the CPU and the coprocessor are fed from the same instruction
stream and both may operate on the same data, there has to be
a synchronization mechanism between the
CPU and the coprocessor.

In an 8086/8087 or 8088/8087 system, both chips look at the
opcodes coming in from the bus. To do this, both chips have
the same BIU (bus interface unit) and the 8086 BIU sends the
status signals of its prefetch queue to the 8087 BIU. This
assures that both processors always decode the same instructions
in parallel. Since all coprocessor instructions start with the
bit pattern 11011, it is easy for the 8087 to ignore all other
instructions. Likewise the CPU ignores all coprocessor instructions
except if they access memory. In this case, the CPU computes
the address of the LSB (least significant byte) of the memory
operand and does a dummy read. The 8087 then takes the data
from the data bus. If more than one memory access is needed to
load a memory operand, the 8087 requests the bus from the CPU,
generates the consecutive addresses of the operand's bytes
and fetches them from the data bus. After completing the operation,
the 8087 hands bus control back to the CPU. Since 8087 and CPU
are hooked up to the same synchronous bus, they have to run at
the same speed. This means that with the 8087, only synchronous
operation of CPU and coprocessor is possible. Another 8087
coprocessor instruction can only be started if the previous one
has been completed in the NEU (numerical execution unit) of the
8087. To prevent the 8086 from decoding a new coprocessor
instruction while the 8087 is still executing the previous
coprocessor instruction, the following mechanism is used:
Compilers and assemblers automatically generate a WAIT instruction
before each coprocessor instruction. The WAIT instruction tests
the /TEST pin until its input becomes "LOW". In 8086/8087 systems,
the 8086 /TEST pin is connected to the 8087 BUSY pin. As long
as the NEU executes a coprocessor instruction, it forces its
BUSY pin "HIGH". Thus the WAIT instruction in front of every
coprocessor instruction stops the CPU until a still executing
previous coprocessor instruction has finished. The same
synchronization is used before the CPU accesses data that
was written by the coprocessor. A WAIT instruction after the
coprocessor instruction that writes to memory causes the CPU to
stop until the coprocessor has transferred the data to memory,
after which the CPU can safely access the data.
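
The opcode test and the automatic WAIT insertion described above
can be sketched as follows. This is a simplified illustration,
not how any particular assembler is implemented: it assumes each
instruction is given as a separate byte string whose first byte
is the opcode (prefix bytes are ignored), and the function names
are invented for the example.

```python
WAIT = 0x9B  # opcode of the 8086 WAIT instruction


def is_coprocessor_opcode(first_byte):
    """True if the top five bits are 11011 (opcodes D8h..DFh)."""
    return (first_byte >> 3) == 0b11011


def insert_waits(instructions):
    """Put a WAIT in front of every coprocessor instruction.

    `instructions` is a list of bytes objects, one per instruction,
    with the opcode in the first byte.
    """
    out = []
    for ins in instructions:
        if is_coprocessor_opcode(ins[0]):
            out.append(bytes([WAIT]))
        out.append(ins)
    return out


program = [
    bytes([0x90]),        # NOP, handled by the CPU alone
    bytes([0xD9, 0xE8]),  # FLD1, coprocessor instruction
    bytes([0xDE, 0xC1]),  # FADDP ST(1),ST, coprocessor instruction
]
for ins in insert_waits(program):
    print(ins.hex())
```

The NOP passes through unchanged, while each of the two
coprocessor instructions is preceded by a 9Bh byte.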

With the help of an additional chip, the 8087 can also be inter-
faced to the 80186 [36]. The 80186 was the CPU in some PCs (e.g.
from Philips, Siemens) in the 1982/1983 time frame, but with
the introduction of the IBM AT which used the 80286, it lost all
significance for the PC market. The 80C186 (CMOS version of the
80186) nowadays sells as an embedded controller and can be combined
with an 80C187 coprocessor, which is based on the internals of the
Intel 387 [37].

The 80287 CPU-interface is totally different from the solution
used in the 8087. Since the 80286 implements memory protection
via an MMU based on segmentation, it would have been much too
expensive to duplicate the whole protection logic on the coprocessor
for an interface solution similar to the 8087. In an 80286/80287
system, the CPU fetches and stores all opcodes and operands for
the coprocessor. Information is passed through ports F8h - FFh.
As these ports are accessible under program control, care must
be taken not to write to them accidentally, as
this could corrupt the information in the math coprocessor.
The execution unit of the 80287 is practically identical to that
of the 8087, that is, nearly all coprocessor instructions execute
in the same number of clock cycles on both coprocessors. Due to
the additional overhead of the CPU/coprocessor interface (at
least ~40 clock cycles), an 8 MHz 80286/80287 combination can be
slower than an 8086/8087 system running at the same speed for
floating point intensive programs. Additionally, most of the
older 286 boards were configured to run the coprocessor at 2/3
the speed of the CPU, making use of the ability of the 80287
to run asynchronously with the CPU. The 80287 has a CKM pin that
causes the incoming system clock to be divided by three for
the coprocessor if it is tied to ground. The 80286 always
divides the system clock by two internally. Thus the ratio 2/3.
However, when the CKM (ClocK Mode) pin is tied high on the 80287,
it does not divide the CLK input. This feature has been exploited
by the makers of coprocessor speed sockets. These sockets tie
CKM high and supply their own CLK signal with a built-in oscillator,
thereby allowing the 80287 or compatible to run at a much higher
speed than the CPU. With an IIT or Cyrix 287 one can have a
20 MHz coprocessor running with an 8 MHz 80286. Note however that
the floating-point performance in such a configuration does not
scale linearly with the coprocessor clock, since all the data
has to be passed through the much slower CPU. If the coprocessor
executes mostly simple instructions such as addition and multiplication,
doubling the coprocessor clock in a 10 MHz system to 20 MHz does
not show any performance increase at all [24]. The 80C287 by AMD
is a 100% clone of the original Intel 80287, but is produced in
CMOS rather than NMOS like the original Intel chip. This makes for lower
power consumption.

The 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals
of a 387 coprocessor, but are pin-compatible with the original 287.
However, these chips divide the system clock by two internally,
as opposed to three in the original Intel 80287. Since the 80286
also divides the system clock by two, they usually run synchronously
with the CPU. They can also run asynchronously, though.
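
The clock relationships from the last two paragraphs can be
worked out as shown below. This is a sketch only: the dividers
are the ones given in the text, while the 16 MHz system clock
(for a board with an 8 MHz 80286) and the function names are
assumptions chosen for the example.

```python
from fractions import Fraction

CPU_DIVIDER = 2  # the 80286 divides the system clock by two


def coprocessor_clock(system_clk_mhz, fpu_divider):
    """Clock of the coprocessor core for a given internal divider."""
    return Fraction(system_clk_mhz) / fpu_divider


def clock_ratio(fpu_divider):
    """Coprocessor clock as a fraction of the CPU clock."""
    return Fraction(CPU_DIVIDER, fpu_divider)


system_clk = 16  # MHz, hypothetical board with an 8 MHz 80286
chips = [
    ("80287 (CKM tied to ground)", 3),
    ("80287XL / 82S87 / 2C87", 2),
]
for name, divider in chips:
    fpu = coprocessor_clock(system_clk, divider)
    print(f"{name}: {float(fpu):.2f} MHz, ratio {clock_ratio(divider)}")
```

For the original 80287 this gives the 2/3 ratio mentioned above;
the chips with 387 internals come out at the same clock as the CPU.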

The 8086/8087 combination can be characterized as a cooperation of
partners with equal rights, while the 80286/287 is more a
master-slave relationship. This makes synchronization much easier, since
the complete instruction and data flow of the coprocessor goes through
the CPU. Before executing most coprocessor instructions, the 80286
tests its /BUSY pin which is hooked up to the 287 coprocessor and
signals if the 80287 is still executing a previous coprocessor
instruction or has encountered an exception. The 80286 then waits
until the 80287 is not busy before loading the coprocessor instruction
into the coprocessor. Therefore, a WAIT instruction before every
coprocessor instruction is not required. These WAITs are permissible,
but not necessary in 80287 programs. The second form of WAIT
synchronization after the coprocessor has written a memory operand is
still necessary on 286/287 systems.

The coprocessor interface in 80386/80387 systems is very similar to
the one found in 286/287 systems. However, to prevent corruption
of the coprocessor's contents by programming errors, the I/O ports
800000F8h - 800000FFh are used, which are not user accessible. The
interface has been optimized and uses 32-bit transfers. The overhead
of the interface has been reduced to about 16-20 clock cycles. For
some operations on the 387 'clones' that take less than 16 clock
cycles to complete, this effectively limits the execution rate of
coprocessor instructions. The only sensible solution to provide
even higher floating point performance was to integrate the CPU
and coprocessor functionality onto the same chip. This is what
Intel did with the 80486. The FPU in the 486 also benefits from
the instruction pipelining and from the integrated cache.
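
To illustrate how the interface overhead caps the benefit of a
fast execution unit, here is a deliberately crude model. It
assumes the interface overhead acts as a lower bound on the time
per instruction (taken here as 16 clock cycles, the low end of
the range given above). The execution times of 10 and 30 cycles
are hypothetical and not measurements of any real chip.

```python
def effective_cycles(exec_cycles, interface_overhead=16):
    """Cycles per instruction when the interface sets a lower bound."""
    return max(exec_cycles, interface_overhead)


def effective_speedup(slow_exec, fast_exec, interface_overhead=16):
    """Speedup of the faster chip once the interface floor is applied."""
    slow = effective_cycles(slow_exec, interface_overhead)
    fast = effective_cycles(fast_exec, interface_overhead)
    return slow / fast


# Hypothetical example: one chip needs 30 cycles for an operation,
# a faster clone needs only 10.
raw = 30 / 10
seen = effective_speedup(30, 10)
print(f"raw speedup {raw:.2f}x, speedup seen by the program {seen:.3f}x")
```

In this model a threefold advantage in the execution unit shrinks
to less than twofold, because the faster chip cannot be fed
instructions more often than once every 16 cycles.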
