Application             Time w/ 387DX   Time w/ RapidCAD   Speedup
AutoCAD 11                   52 sec          32 sec           63%
AutoShade/Renderman         180 sec         108 sec           67%
Mathematica (Windows)       139 sec         103 sec           35%
SPSS/PC+ 4.01                17 sec          14 sec           21%
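The speedup column can be reproduced from the two timing columns. The following Python sketch only illustrates the arithmetic; the function name speedup_percent is made up for this example and is not part of any benchmark suite. It assumes that speedup means the ratio of the two run times minus one, rounded to the nearest whole percent.

```python
import math

def speedup_percent(time_387dx_sec, time_rapidcad_sec):
    """Speedup of the RapidCAD over the 387DX in percent, rounded half up."""
    ratio = time_387dx_sec / time_rapidcad_sec
    return math.floor((ratio - 1.0) * 100.0 + 0.5)

# Run times in seconds, copied from the table above.
timings = {
    "AutoCAD 11": (52, 32),
    "AutoShade/Renderman": (180, 108),
    "Mathematica (Windows)": (139, 103),
    "SPSS/PC+ 4.01": (17, 14),
}

for name, (t_387dx, t_rapidcad) in timings.items():
    print(f"{name}: {speedup_percent(t_387dx, t_rapidcad)}%")
```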
RapidCAD is available in 25 MHz and 33 MHz versions.
It is distributed through different channels than the
other Intel math coprocessors. Therefore, I have been
unable to obtain a data sheet for it. [78] gives the
typical power consumption of the 33 MHz RapidCAD as
3500 mW, which is the same as for the 33 MHz 486DX.
The RapidCAD-1 chip gets quite hot when operating.
Therefore, I recommend extra cooling for this chip
(see the paragraph below on the 486 for details). The
RapidCAD-1 is packaged in a 132-pin PGA, just like the
80386, and the RapidCAD-2 is packaged in a 68-pin PGA
like an 80387 coprocessor.
Intel 486DX is not a coprocessor. This chip, brought out in
1989, functionally combines the CPU (a heavily pipelined
implementation of the 386 architecture) with an
enhanced 387 (the floating-point unit, FPU) and
8 kB of unified code/data cache on one chip. Of
course, this description is simplified; for a
detailed hardware description, see [52]. The
486DX offers about two to three times the integer
performance of a 386 at the same frequency.
Floating point performance is about three to four
times as high as on the Intel 387DX at the same
clock rate [29]. Since the FPU is on the same
chip as the CPU, the considerable communication
overhead between CPU and coprocessor in a 386/387
system is omitted, letting FPU instructions run
at the full speed permitted by the implementation.
The FPU also takes advantage of the on-chip cache
and the highly pipelined execution unit. Besides
the higher speed, the 486 FPU features more accurate
transcendental functions than the Intel 387DX
coprocessor according to tests run by me (see below).
To achieve better interrupt latency, FPU instructions
with a long execution time have been made abortable
if an interrupt occurs during their
execution. The concurrent execution of CPU and
coprocessor instructions typical for 80x86/80x87
systems still exists on the 486, but
some FPU instructions like FSIN have nearly no
concurrency with CPU instructions, indicating
that they make heavy use of both CPU and FPU
resources [53, 1]. The 486DX comes in a 168-pin
ceramic PGA (pin grid array). It is available in
25 MHz and 33 MHz versions. Since the end of 1991,
a 50 MHz version has also been available, manufactured
in a CHMOS V process (the 25 MHz and 33 MHz versions are
produced using the CHMOS IV process). Maximum
power consumption is 3500 mW for the 25 MHz 486
(2600 mW typical), 4500 mW for the 33 MHz version
(3500 mW typical), and 5000 mW (4000 mW typical)
for the 50 MHz chip. Due to the considerable amount
of heat produced by these chips, and taking into
consideration the slow air flow provided by the
fan in garden variety PC tower cases, I recommend
an extra fan directly above the CPU for safer
operation. If you measure the surface temperature
of an i486 in a normal tower case without extra
cooling after some time of operation, you may well
come up with something like 80 - 90 degrees Celsius
(that is 176 - 194 degrees Fahrenheit for those not
familiar with metric units) [54,55]. You don't need
the well known and expensive IceCap(tm) to effectively
cool your CPU. A simple fan mounted directly above
the CPU can bring the temperature down to about 50
to 60 degrees Celsius (122 - 140 degrees Fahrenheit)
depending on the room temperature and the temperature
within the PC case (which depends on the total power
dissipation of all the components and the cooling
provided by the fan in the power unit). According
to a simple rule known as Arrhenius' Law, lowering
the temperature by 10 degrees Celsius slows down
chemical reactions by a factor of two, thus lowering
the temperature of your CPU by 30 degrees should
prolong the life of the device by a factor of eight
due to the slower ageing process. If you are reluctant
to add a fan to your system because of the additional
noise, settle for a low-noise fan like those
available from the German manufacturer Pabst (this
is not meant to be an advertisement. I am just the
happy owner of such a fan. Besides that, I have no
connections to the firm).
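The temperature figures and the factor of eight quoted above follow from simple arithmetic. The Python sketch below is only an illustration; the function names are invented for this example, and the "factor of two per 10 degrees Celsius" rule is taken from the text as a rough rule of thumb, not as a precise ageing model for any particular chip.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0

def lifetime_factor(temperature_drop_celsius, step_celsius=10.0):
    """Factor by which ageing slows down, assuming (as the rule of thumb
    in the text does) a factor of two for every 10 degrees Celsius."""
    return 2.0 ** (temperature_drop_celsius / step_celsius)

# Surface temperatures mentioned in the text, without and with an extra fan.
for celsius in (80, 90, 50, 60):
    print(f"{celsius} C = {celsius_to_fahrenheit(celsius):.0f} F")

# Cooling the chip by 30 degrees Celsius.
print(f"lifetime factor for a 30 degree drop: {lifetime_factor(30):.0f}")
```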
Intel 486DX2 is the name for Intel's latest generation of 486 CPUs.
Using the DX2 suffix instead of simply DX is meant
to be an indicator that these are clock-doubled
versions. A normal 486DX operates at the frequency
provided by the incoming clock signal. A 486DX2
generates a new clock signal from the incoming clock
by means of a PLL (phase locked loop). In the DX2,
this clock signal has twice the frequency of the
incoming clock, hence the name clock-doubler. All
internal parts of the 486DX2 (cache, CPU core, FPU)
run at this higher frequency. Only the bus interface
runs at the normal speed. That way, a 486DX2-50 can
run on a motherboard designed for 25 MHz operation.
Since motherboards for 50 MHz operation are much
harder to design than those for 25 MHz, this makes
a 486DX2-50 system easier to build and cheaper than
a 486DX-50 system. For all operations that don't
access off-chip resources (e.g. register operations)
a 486DX2-50 provides exactly the same performance as
a 486DX-50 and twice the performance of a 486DX-25.
However, since the main memory in a 486DX2-50 system
still operates at 25 MHz, all instructions involving
memory accesses are potentially slower than in a
486DX-50 system, whose memory also runs at 50 MHz.
The internal cache of the 486 alleviates this problem a
bit, but overall performance of a 486DX2-50 is still
lower than that of a 486DX-50, although Intel's
documentation [32] shows this drop to be quite small.
It depends a lot on the code one runs, though. The
nice thing about the 486DX2 is that it allows easy
upgrading of 25 and 33 MHz 486 systems, since the
486DX2 is completely pin-compatible with the 486DX.
Just take out the 486DX and plug in the new 486DX2.
Note that power consumption of the 486DX2-50 equals
that of the 486DX-50 (4000 mW typical), and that the
486DX2-66 exceeds this by about 30%. These chips get
really hot in a standard PC case with no extra cooling.
See the above paragraph for more detailed information
on this problem.
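The effect of clock doubling can be illustrated with a deliberately crude model: work done on-chip is timed with the core clock, and work that has to go off-chip is timed with the bus clock. The Python sketch below is an assumption made for illustration only. The cycle counts are hypothetical, the function name is invented, and the model ignores the cache, wait states, and any overlap between core and bus activity.

```python
def run_time_us(core_cycles, bus_cycles, core_mhz, bus_mhz):
    """Run time in microseconds (cycles divided by MHz gives microseconds)."""
    return core_cycles / core_mhz + bus_cycles / bus_mhz

# Hypothetical workloads (core cycles, bus cycles); these are not measurements.
REGISTER_ONLY = (1000, 0)
WITH_MEMORY = (1000, 200)

# Core clock and bus clock in MHz for the three chips discussed in the text.
CHIPS = {
    "486DX-25": (25, 25),
    "486DX2-50": (50, 25),
    "486DX-50": (50, 50),
}

for label, workload in (("register only", REGISTER_ONLY),
                        ("with memory", WITH_MEMORY)):
    for chip, (core_mhz, bus_mhz) in CHIPS.items():
        t = run_time_us(*workload, core_mhz, bus_mhz)
        print(f"{label:14s} {chip:10s} {t:5.1f} us")
```

In this model the 486DX2-50 matches the 486DX-50 on the register-only workload and lands between the 486DX-25 and the 486DX-50 as soon as bus cycles are involved, which is the qualitative behavior described above.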
Intel 487SX is the coprocessor intended for use in 486SX systems.
The 486SX is basically a 486DX without the floating-
point unit (FPU) [48, 50]. Originally Intel sold
486DXs with a defective FPU as 486SXs but it has
now completely removed the FPU part from the 486SX
mask for mass production. The introduction of the
486SX in 1991 has been viewed mainly as a marketing
'trick' by Intel to take market share from the 386
based systems once AMD became successful with their
Am386 (AMD has taken as much as 40% of the 386 market
due to some superior features such as higher clock
frequency, lower power consumption, and a fully
static design). A 486SX at 20 MHz delivers a bit
less integer performance than a 40 MHz Am386. To add
floating-point capabilities to a 486SX based system,
it would be easiest to swap the 486SX with a 486DX
which includes the FPU. However, Intel has prevented
this easy solution by giving the 486SX a slightly
different pin out [48, 51]. Since only three pins
are assigned differently, clever board manufacturers
have come out with boards that accept anything from
a 486SX-20 to a 486DX2-50 in their CPU socket and
provide a clean upgrade path this way. A set of
three jumpers ensures correct signal assignment to
the pins for either configuration. To upgrade systems
without this feature, one has to buy the 487SX and
put it into the "Performance Upgrade Socket" present
in most 486SX systems. Once the 487SX was available,
it was quickly discovered that it is just a normal
486DX with a slightly different pin out [49]. Inserting
the 487SX effectively shuts down the 486SX in the
486SX/487SX system, so the 486SX could be removed
once the 487SX is installed. Since the shutdown is
logical, not electrical, the 486SX still uses power
if used with the 487SX, although it is inoperative.
Technically speaking, the solution Intel chose was
the only practical way to provide a 486SX system with
the high level of floating-point performance the
486DX offers. The CPU and FPU have to be on the same
chip, otherwise the FPU cannot make use of the cache
on the CPU chip and there would be considerable
overhead in CPU-FPU communication (similar to a
386/387 system), nullifying most of the arithmetic
speedups over the 387. That the 486SX, 487SX, and
486DX are not pin-compatible seems to be purely for
marketing reasons. To upgrade a 486SX based system,
Intel also offers the OverDrive chip, which is just
the same as a 487SX with internal clock doubling. It
also goes into the "Performance Upgrade Socket"
found in 486SX systems. The OverDrive roughly doubles
the performance of a 486SX/487SX based system. For an
explanation of clock doubling, see the description
of the 486DX2 above. Like the 486SX, the 487SX is
available in 20 MHz and 25 MHz versions. At 20 MHz,
the 487SX has a power consumption of max. 4000 mW.
It is available in a 169-pin ceramic PGA (pin grid
array).
Weitek 1167 was the predecessor to the Weitek Abacus 3167 math
coprocessor. It was actually a small printed circuit
board with three chips mounted on it. As opposed to
the Weitek 3167, the 1167 did not have a square root
instruction; instead, the square root function was
computed by means of a subroutine in the Weitek
transcendental function library. However, the 1167
did have a mode in which it supported denormals.
The Weitek 3167 and 4167 only implement the 'fast'
mode, in which denormals are not supported. Overall
performance of the 1167 is slightly less than that
of the Weitek 3167.
Weitek 3167 was introduced in 1989 to provide the fastest
floating point performance possible on a 386 based
system at that time. The Weitek Abacus 3167 is not
a real coprocessor, strictly speaking, but rather
a memory mapped peripheral device. The Weitek 3167
was optimized for speed wherever possible. Besides
using the faster memory mapped interface to the CPU
(the 80x87 uses IO-ports), it does not support many
of the features of the 80x87 coprocessors, allowing
all of the chip's resources to be concentrated on
the fast execution of the basic arithmetic operations.
For a more detailed description of the Weitek 3167 see
the first chapter of this document. In benchmark
comparisons, the Weitek 3167 provided up to 2.5 times
the performance of an Intel 387DX coprocessor. For
example, on a 33 MHz 3167 the Whetstone benchmark
performed at 7574 kWhetstones/sec compared with
3743 kWhetstones/sec for the Intel 387DX. Note
however that these are single precision results and
that the Weitek 3167's performance would drop to
about half the stated rate for double precision,
while the value for the Intel 387DX would not change
much. Anyhow, before the advent of the Intel RapidCAD,
the Weitek 3167 usually beat all 387 compatible
coprocessors even for double precision operations
[63,65,69]. For typical applications the advantage
of the Weitek 3167 over the 387 clones is much smaller.
In a benchmark test using AutoDesk's 3D-Studio the
Weitek 3167 performed at 123% of the Intel 387DX's
performance compared with 106% for the Cyrix FasMath
83D87 and 118% for the Intel RapidCAD. The Weitek
Abacus 3167 is packaged in a 121-pin PGA that fits
into an EMC socket provided by most 386 based systems.
It does *not* fit into the normal coprocessor socket
designed to hold a 387 compatible coprocessor in a
68-pin PGA. To get the best of both worlds, one might
want to use a Weitek 3167 and a 387 compatible
coprocessor in the same system. These coprocessors
can coexist in the same system just fine. The only problem
is that most 386 based systems contain only one
coprocessor socket, usually of the EMC (extended math
coprocessor) type. Thus, you can install either a
387 coprocessor or a Weitek 3167, but not both. There
are little daughter boards available though that fit
into the EMC socket and provide two sockets, an EMC
and a standard coprocessor socket. At 25 MHz, the
Weitek 3167 has a power consumption of max. 1750 mW.
At 33 MHz, the max. power consumption is 2250 mW.
Weitek 4167 is a memory mapped coprocessor that has the same
architecture as the 3167 and is designed to provide
486 based systems with the highest floating point
performance available. It executes coprocessor
instructions at three to four times the speed of
the Weitek 3167. Although it is up to 80% faster
than the Intel 486 in some benchmarks [1,69], the
performance advantage for real applications is more
like 10%. The introduction of the 486DX2 processors
has more or less obliterated the need for a Weitek
4167, since the DX2 CPUs provide the same performance
and all the additional features the 80x87 has over
the Weitek Abacus. The Weitek 4167 is packaged in
a 142-pin PGA package that is only slightly smaller
than the 486's package. At 25 MHz, it has a max.
power consumption of 2500 mW [32].
If you are interested in techniques for detecting the different
coprocessors described above, I would like to refer you to my
COMPTEST program. COMPTEST reliably detects the type and clock
frequency of the CPU and coprocessor installed in your machine.
COMPTEST is in the public domain and comes with complete source
code. It is available via anonymous ftp from garbo.uwasa.fi and
additional ftp sites that mirror garbo. The current version is
CTEST257.ZIP; future versions will be called CTEST258,
CTEST259 and so on. COMPTEST can correctly identify all of the
coprocessors described above, with the exception of the Weitek
chips, for which the detection mechanism is not that reliable.
Pricing
Due to a recent price slashing by Cyrix and subsequently by Intel
for 387 coprocessors, prices have dropped significantly for all
287 and 387 compatible coprocessors with hardly any price difference
between manufacturers. 387DX compatible coprocessors typically sell
for ~US$ 100 for all speeds except for 40 MHz versions which are
typically ~US$ 130. 387SX compatible coprocessors sell for ~US$ 90
regardless of speed with the exception of the 33 MHz versions, which
are ~US$ 100. The Intel 287XL sells for ~US$ 100, while the IIT 2C87
and Cyrix 82S87 sell for about US$ 70. 8087s may be more expensive,
the price of an 8087-10 being US$ 150. I bought the Intel RapidCAD
for US$ 320 and haven't seen it offered for a better price. I see
the Weitek Abacus 3167-33 being offered for US$ 780 and the 4167-33
being offered for US$ 1100. This price information reflects the
price situation as of 09-17-92. Prices can be expected to drop
slightly in the near future.
If you have a demand for high floating-point performance, you
should consider buying a 486 based system rather than buying
a 386 based system with an additional coprocessor. A 386
motherboard for 33 MHz operation sells for ~US$ 300; together with the
coprocessor, the cost totals ~US$ 400. A 486-33 ISA board sells for
US$ 650. While the 486-33 system is 60% more expensive than the
386/387 system, it also provides 100% more integer and floating-
point performance (twice the performance). If you want to push
your 386 based system to maximum floating-point performance and
can't switch to a 486 based system for some reason, I recommend
the Intel RapidCAD. It is both faster [1] and cheaper than installing
a Weitek Abacus 3167 with your 386, which used to be the highest
performing combination before the RapidCAD came out. Similarly,
the introduction of the 486DX2 clock-doubler chips has obliterated
the need for a Weitek 4167 to get maximum floating-point performance
out of a 486 based system. A 486DX2-66 performs at or above the
performance level of a 33 MHz Weitek 4167, even if the latter
uses single precision rather than double precision. The 486DX2-66
is rated by Intel at 24700 double precision kWhetstones/sec and
3.1 double precision Linpack MFLOPS. Of course, these benchmarks
used the highest performance compilers available. But even with
a Turbo Pascal 6.0 program, I managed to squeeze 1.6 double precision
MFLOPS out of the 486DX2-66 for the LLL benchmark (for a description
of the benchmarks mentioned, see the paragraph on benchmarks below).
Although I haven't yet seen 486DX2-66 processors being offered
to end users for upgrade purposes, I recommend the 486DX2-66
to those that need highest floating-point performance and are
planning to buy a new PC. The price difference between a 33 MHz
486DX motherboard and a 486DX2-66 motherboard is around US$ 600,
well below the price for the Weitek Abacus 4167.
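The 386/387 versus 486 comparison above can be put into numbers. The Python sketch below uses only the prices quoted in this section (as of 09-17-92) and the stated assumption that the 486-33 delivers twice the performance of the 386/387 combination. The function name is invented for this example.

```python
def cost_per_performance(price_usd, relative_performance):
    """Price in US$ per unit of performance (386/387 system = 1.0)."""
    return price_usd / relative_performance

PRICE_386_BOARD = 300   # 386 motherboard for 33 MHz operation
PRICE_387 = 100         # 387DX compatible coprocessor
PRICE_486_BOARD = 650   # 486-33 ISA board

price_386_387 = PRICE_386_BOARD + PRICE_387
price_ratio = PRICE_486_BOARD / price_386_387

print(f"386/387 system:  US$ {price_386_387}")
print(f"486-33 system:   US$ {PRICE_486_BOARD} "
      f"({(price_ratio - 1) * 100:.1f}% more expensive)")
print(f"US$ per unit of performance, 386/387: "
      f"{cost_per_performance(price_386_387, 1.0):.0f}")
print(f"US$ per unit of performance, 486-33:  "
      f"{cost_per_performance(PRICE_486_BOARD, 2.0):.0f}")
```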
Operation
In an 80x86/80x87 system CPU instructions and coprocessor
instructions are executed concurrently. This means that
the CPU can execute CPU instructions while the coprocessor
executes a coprocessor instruction at the same time. The
concurrency is restricted somewhat by the fact that the
CPU has to aid the coprocessor in certain operations. As
the CPU and the coprocessor are fed from the same instruction
stream and both instruction streams may operate on the same
data, there has to be a synchronizing mechanism between the
CPU and the coprocessor.
In an 8086/8087 or 8088/8087 system, both of the chips look at the
opcodes coming in from the bus. To do this, both chips have
the same BIU (bus interface unit) and the 8086 BIU sends the
status signals of its prefetch queue to the 8087 BIU. This
assures that both processors always decode the same instructions
in parallel. Since all coprocessor instructions start with the
bit pattern 11011, it is easy for the 8087 to ignore all other
instructions. Likewise the CPU ignores all coprocessor instructions
except if they access memory. In this case, the CPU computes
the address of the LSB (least significant byte) of the memory
operand and does a dummy read. The 8087 then takes the data
from the data bus. If more than one memory access is needed to
load a memory operand, the 8087 requests the bus from the CPU,
generates the consecutive addresses of the operand's bytes
and fetches them from the data bus. After completing the operation,
the 8087 hands bus control back to the CPU. Since 8087 and CPU
are hooked up to the same synchronous bus, they have to run at
the same speed. This means that with the 8087, only synchronous
operation of CPU and coprocessor is possible. Another 8087
coprocessor instruction can only be started if the previous one
has been completed in the NEU (numerical execution unit) of the
8087. To prevent the 8086 from decoding a new coprocessor
instruction while the 8087 is still executing the previous
coprocessor instruction, the following mechanism is used: The
compilers and assemblers automatically generate a WAIT instruction
before each coprocessor instruction. The WAIT instruction tests
the /TEST pin until its input becomes "LOW". In 8086/8087 systems,
the 8086 /TEST pin is connected to the 8087 BUSY pin. As long
as the NEU executes a coprocessor instruction, it forces its
BUSY pin "HIGH". Thus the WAIT instruction in front of every
coprocessor instruction stops the CPU until a still executing
previous coprocessor instruction has finished. The same
synchronization is used before the CPU accesses data that
was written by the coprocessor. A WAIT instruction after the
coprocessor instruction that writes to memory causes the CPU to
stop until the coprocessor has transferred the data to memory,
after which the CPU can safely access the data.
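The opcode filtering described above is easy to demonstrate. The following Python sketch checks whether the first byte of an instruction starts with the bit pattern 11011, which corresponds to the opcode bytes D8h through DFh. It is only an illustration of the bit test, not a model of the 8087 hardware; the function and constant names are invented for this example.

```python
ESC_PREFIX = 0b11011   # upper five bits of every coprocessor opcode
WAIT_OPCODE = 0x9B     # the WAIT instruction is an ordinary CPU opcode

def is_coprocessor_opcode(first_byte):
    """True if the upper five bits of the opcode byte are 11011."""
    return (first_byte >> 3) == ESC_PREFIX

# Collect all byte values that the test accepts.
escape_opcodes = [b for b in range(256) if is_coprocessor_opcode(b)]
print(" ".join(f"{b:02X}h" for b in escape_opcodes))
print("WAIT is a coprocessor opcode:", is_coprocessor_opcode(WAIT_OPCODE))
```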
With the help of an additional chip, the 8087 can also be inter-
faced to the 80186 [36]. The 80186 was the CPU in some PCs (e.g.
from Philips, Siemens) in the 1982/1983 time frame, but with
the introduction of the IBM AT which used the 80286, it lost all
significance for the PC market. The 80C186 (CMOS version of the
80186) nowadays sells as an embedded controller and can be combined
with a 80C187 coprocessor which is based on the internals of the
Intel 387 [37].
The 80287 CPU-interface is totally different from the solution
used in the 8087. Since the 80286 implements memory protection
via an MMU based on segmentation, it would have been much too
expensive to duplicate the whole protection logic on the coprocessor
for an interface solution similar to the 8087. In a 80286/80287
system, the CPU fetches and stores all opcodes and operands for
the coprocessor. Information is passed through ports F8h - FFh.
As these ports are accessible under program control, care must
be taken to not accidentally perform write operations to them, as
this could corrupt the information in the math coprocessor.
The execution unit of the 80287 is practically identical to that
of the 8087, that is, nearly all coprocessor instructions execute
in the same number of clock cycles on both coprocessors. Due to
the additional overhead of the CPU/coprocessor interface (at
least ~40 clock cycles), an 8 MHz 80286/80287 combination can be
slower than an 8086/8087 system running at the same speed for
floating point intensive programs. Additionally, most of the
older 286 boards were configured to run the coprocessor at 2/3
the speed of the CPU, making use of the ability of the 80287
to run asynchronous with the CPU. The 80287 has a CKM pin that
causes the incoming system clock to be divided by three for
the coprocessor if it is tied to ground. The 80286 always
divides the system clock by two internally. Thus the ratio 2/3.
However, when the CKM (ClocK Mode) pin is tied high on the 80287,
it does not divide the CLK input. This feature has been exploited
by the makers of coprocessor speed sockets. These sockets tie
CKM high and supply their own CLK signal with a built-in oscillator,
thereby allowing the 80287 or compatible to run at a much higher
speed than the CPU. With an IIT or Cyrix 287 one can have a
20 MHz coprocessor running with an 8 MHz 80286. Note however that
the floating-point performance in such a configuration does not
scale linearly with the coprocessor clock, since all the data
has to be passed through the much slower CPU. If the coprocessor
executes mostly simple instructions such as addition and multiplication,
doubling the coprocessor clock in a 10 MHz system to 20 MHz does
not show any performance increase at all [24]. The 80C287 by AMD
is a 100% clone of the original Intel 80287, but is produced in
CMOS, not in NMOS like the original Intel chip. This makes for lower
power consumption.
The 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals
of a 387 coprocessor, but are pin-compatible to the original 287.
However, these chips divide the system clock by two internally,
as opposed to three in the original Intel 80287. Since the 80286
also divides the system clock by two, they usually run synchronously
with the CPU. They can also run asynchronously, though.
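The clock ratios described above follow directly from the divisors. The Python sketch below computes the CPU and coprocessor clocks from the system clock; the 16 MHz system clock is a hypothetical example (it yields an 8 MHz 80286), and the function names are invented for illustration. Exact fractions are used so that the 2/3 ratio comes out without rounding.

```python
from fractions import Fraction

CPU_DIVISOR = 2          # the 80286 always divides the system clock by two
DIVISOR_80287 = 3        # original 80287 with CKM tied to ground
DIVISOR_387_CORE = 2     # 80287XL, Cyrix 82S87, IIT 2C87

def cpu_clock(system_clock_mhz):
    """Clock of the 80286 in MHz for a given system clock."""
    return Fraction(system_clock_mhz) / CPU_DIVISOR

def coprocessor_clock(system_clock_mhz, divisor):
    """Clock of the coprocessor in MHz for a given system clock and divisor."""
    return Fraction(system_clock_mhz) / divisor

system_clock = 16  # hypothetical board, giving an 8 MHz 80286

ratio_80287 = coprocessor_clock(system_clock, DIVISOR_80287) / cpu_clock(system_clock)
ratio_287xl = coprocessor_clock(system_clock, DIVISOR_387_CORE) / cpu_clock(system_clock)

print("CPU clock:          ", float(cpu_clock(system_clock)), "MHz")
print("80287 clock:        ", round(float(coprocessor_clock(system_clock, DIVISOR_80287)), 2), "MHz")
print("80287 to CPU ratio: ", ratio_80287)
print("287XL to CPU ratio: ", ratio_287xl)
```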
The 8086/8087 combination can be characterized as a cooperation of
partners with equal rights, while the 80286/287 is more a
master-slave relationship. This makes synchronization much easier, since
the complete instruction and data flow of the coprocessor goes through
the CPU. Before executing most coprocessor instructions, the 80286
tests its /BUSY pin which is hooked up to the 287 coprocessor and
signals if the 80287 is still executing a previous coprocessor
instruction or has encountered an exception. The 80286 then waits
until the 80287 is not busy before loading the coprocessor instruction
into the coprocessor. Therefore, a WAIT instruction before every
coprocessor instruction is not required. These WAITs are permissible,
but not necessary in 80287 programs. The second form of WAIT
synchronization after the coprocessor has written a memory operand is
still necessary on 286/287 systems.
The coprocessor interface in 80386/80387 systems is very similar to
the one found in 286/287 systems. However, to prevent corruption
of the coprocessor's contents by programming errors, the IO-ports
800000F8h - 800000FFh are used, which are not user accessible. The
interface has been optimized and uses 32-bit transfers. The overhead
of the interface has been reduced to about 16-20 clock cycles. For
some operations on the 387 'clones' that take less than 16 clock
cycles to complete, this effectively limits the execution rate of
coprocessor instructions. The only sensible solution to provide
even higher floating point performance was to integrate the CPU
and coprocessor functionality onto the same chip. This is what
Intel did with the 80486. The FPU in the 486 also benefits from
the instruction pipelining and from the integrated cache.
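How the interface overhead caps the execution rate of fast coprocessor instructions can be sketched with a very simple model. The Python code below assumes that the interface overhead acts as a lower bound on the time between two coprocessor instructions. This is my simplification for illustration, not a description of the actual bus protocol; the instruction timings are hypothetical and the function name is invented.

```python
INTERFACE_OVERHEAD_386 = 16   # lower end of the 16-20 cycles quoted above

def effective_cycles(execution_cycles, overhead_cycles=INTERFACE_OVERHEAD_386):
    """Cycles per instruction if the interface overhead is a lower bound
    (assumed model: the slower of execution and interface wins)."""
    return max(execution_cycles, overhead_cycles)

# Hypothetical execution times in clock cycles, not taken from a data sheet.
for cycles in (8, 12, 16, 30, 100):
    print(f"executes in {cycles:3d} cycles -> "
          f"completes every {effective_cycles(cycles):3d} cycles")
```

Under this assumption, an instruction that executes in 8 cycles and one that executes in 16 cycles complete at the same rate, so making the coprocessor core faster pays off only for the longer instructions.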