
x87 FPU stack overflow


Rick C. Hodgin

Nov 17, 2016, 12:29:46 PM
I've had an interesting discussion in the comp.lang.c group regarding
a Mandelbrot function for which the compiler generates x87 FPU code in
32-bit builds and SIMD code in 64-bit builds. The question was, "How
could the FPU code be more optimized?", as the generated results did
not appear very well optimized.

Here is the original function:

-----[ Begin ]-----
// test.cpp
#include <stdio.h>
#include <time.h>

int mandel(double x, double y, int max_iters)
{
    double a, b, a2;
    int iter;

    a = 0.0;
    b = 0.0;
    for (iter = 0; iter < max_iters; iter++)
    {
        a2 = (a*a - b*b) + x;
        b = 2.0*a*b + y;
        a = a2;
        if (a*a + b*b >= 4.0)
            return iter;
    }

    return max_iters;
}

int main(int argc, char* argv[])
{
    int max;
    double x, y;
    clock_t start;
    clock_t end;

    // Iterate for a while
    start = clock();
    for (x = 2.0, max = 0; x > -3.0; x -= 0.0005)
    {
        for (y = -2.0; y < 2.0; y += 0.0005)
            max += mandel(x, y, 40);
    }
    end = clock();
    printf("Max: %d, time: %lf\n", max, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}

// Visual Studio 2010 on my laptop machine:
// Debug: Max: 366085784, time: 3.630000
// Release: Max: 366085784, time: 2.020000
-----[ End ]-----

The inner mandel() function was the code of interest.

In my custom code example, I was able to keep everything in
registers by using all eight FPU registers, st0..st7. Seeing the
VC++ 2010 compiler's code generation, however, had me thinking it
might be an interesting idea to extend the stx registers beyond
the traditional 8-register barrier into a dedicated memory stack
of specified size.
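For what it's worth, the mandel() loop can be restructured so that only
six values are live at once (a, b, their squares, and x, y), which fits
comfortably within st0..st7. A sketch that carries the squares across
iterations (mandel_opt is an illustrative name, not code from the thread):

```cpp
// Same computation as mandel(), but a*a and b*b are evaluated once per
// iteration and carried into the next one, so only six values are live:
// a, b, a2, b2, x, y -- few enough to keep entirely in st0..st7.
int mandel_opt(double x, double y, int max_iters)
{
    double a = 0.0, b = 0.0, a2 = 0.0, b2 = 0.0;
    for (int iter = 0; iter < max_iters; iter++)
    {
        b = 2.0*a*b + y;     // uses the old a and b
        a = a2 - b2 + x;     // uses last iteration's squares
        a2 = a*a;            // squares carried into the next iteration
        b2 = b*b;
        if (a2 + b2 >= 4.0)
            return iter;
    }
    return max_iters;
}
```

It performs the identical sequence of arithmetic, merely computing each
square once instead of twice, so every point returns the same iteration
count as the original.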

A new overflow area, foverflow, would be added, along with a
register holding the address to overflow stack data to. That
register points to the top of the overflow area and is adjusted
as needed. It would be known as fosp (for FPU Overflow Stack
Pointer).

When more data is pushed onto the FPU stack than will fit into
the 8 registers, it begins overflowing into memory: st7 is
written to the fosp location, and the stack then rotates as
normal. When something is popped back off, the stack rotates back
as normal, and if there is still data in the fosp block, it is
popped back into st7's slot.
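Those push/pop semantics can be modeled in a few lines. A sketch only,
with invented class and method names, assuming the overflow block behaves
as a simple downward extension of the register stack:

```cpp
#include <vector>

// Model of the proposed stack: st0..st7 in "hardware", with st7 spilled
// to a memory block (addressed by fosp) on overflow and refilled on pop.
class OverflowFpuStack
{
    std::vector<double> regs;      // regs.back() is st0, regs.front() is st7
    std::vector<double> overflow;  // the memory block fosp points into
public:
    void fpush(double v)
    {
        if (regs.size() == 8)                  // st7 would be pushed out:
        {
            overflow.push_back(regs.front());  // write st7 to the fosp location
            regs.erase(regs.begin());          // then rotate as normal
        }
        regs.push_back(v);
    }
    double fpop()
    {
        double v = regs.back();                // pop st0; stack rotates back
        regs.pop_back();
        if (!overflow.empty())                 // data still in the fosp block:
        {
            regs.insert(regs.begin(), overflow.back());  // refill st7's slot
            overflow.pop_back();
        }
        return v;
    }
    std::size_t spilled() const { return overflow.size(); }
};
```

Pushing twelve values spills four of them, yet pops still return
everything in LIFO order, exactly as if the stack were twelve registers
deep.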

And by designing the FPU in this way, the extra slots above st7
could be kept in a close cache, making them nearly as fast as the
actual registers themselves. This would reduce the number of
memory reads/writes required, since large numbers of constants
and variables could be injected and then referenced in source
code by their actual slot location.

Taking this idea a step further, it might be desirable to then
be able to mask off a block of registers into non-movable space,
so that when things are pushed or popped they simply remain in
their non-movable space, so they can always be referenced by a
hard fixed location, rather than a floating one.
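The masked-off block can be modeled the same way. Again a sketch with
invented names: pinned slots sit outside the rotating portion, so pushes
and pops never move them.

```cpp
#include <vector>

// Sketch of the masked-off idea: a few slots are pinned at hard, fixed
// locations and never rotate; only the remaining slots act as a stack.
class MaskedFpuStack
{
    std::vector<double> fixed;     // non-movable slots, addressed by index
    std::vector<double> rotating;  // normal rotating push/pop stack
public:
    explicit MaskedFpuStack(std::size_t nfixed) : fixed(nfixed, 0.0) {}
    void set_fixed(std::size_t i, double v) { fixed[i] = v; }
    double get_fixed(std::size_t i) const   { return fixed[i]; }
    void fpush(double v) { rotating.push_back(v); }
    double fpop()
    {
        double v = rotating.back();
        rotating.pop_back();
        return v;
    }
};
```

A constant such as the 4.0 escape bound could be parked in a fixed slot
once and referenced there for the whole computation.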

-----
I was wondering if anybody has any experience with an FPU like
this? It would preclude the necessity of REX-type bytes, and
extend the FPU (and also SIMD and later registers) into an almost
unlimited space with probably up to 32 registers being actively
maintained in the close caches, resulting in much faster access
than even standard L1 cache reads/writes.

Any thoughts?

Best regards,
Rick C. Hodgin

Robert Wessel

Nov 17, 2016, 8:49:58 PM
That was the original goal of the x87 (although with the stack spills
and fills being handled in software). Unfortunately, design defects in
the x87 made that impossible (you basically don't have enough
information to be able to restart all types of operations).

SPARC and IPF do that, for example, for integer registers.


>Taking this idea a step further, it might be desirable to then
>be able to mask off a block of registers into non-movable space,
>so that when things are pushed or popped they simply remain in
>their non-movable space, so they can always be referenced by a
>hard fixed location, rather than a floating one.


Again, SPARC and IPF integer registers have some portion fixed. IPF
can additionally have some registers (both GP and FP) marked as
"rotating", allowing for effective loop unrolling without all that
code duplication.


>-----
>I was wondering if anybody has any experience with an FPU like
>this? It would preclude the necessity of REX-type bytes, and
>extend the FPU (and also SIMD and later registers) into an almost
>unlimited space with probably up to 32 registers being actively
>maintained in the close caches, resulting in much faster access
>than even standard L1 cache reads/writes.
>
>Any thoughts?


Stack based architectures have died out for a reason, they tend to
make good code optimization very difficult, since the top-of-stack is
always a hotspot. The x87 stack was an attempt to fit a reasonable FP
ISA into the very limited opcode space provided by the ESC
instructions. They didn't do a stack because it's a good idea, they
did a stack because you can do a stack ISA with fewer instruction
bits.

Anton Ertl

Nov 18, 2016, 5:10:36 AM
Robert Wessel <robert...@yahoo.com> writes:
>Stack based architectures have died out for a reason, they tend to
>make good code optimization very difficult, since the top-of-stack is
>always a hotspot.

Well, looking at register machine implementations, they actually turn
their register code internally into the same kind of hot-spot code:
they recognize when one instruction writes to a register and the next
reads from it, and optimize this case with forwarding. You can also
have an accumulator or stack machine and make this aspect explicit in
the instruction encoding. Of course, for the cases when you want to
access other data, you would use, e.g., registers or a belt.
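The same expression can be written down both ways. A toy sketch (both
mini-ISAs are invented for illustration) encoding a*a + b*b
register-style and accumulator-style; in the accumulator form, each
load-to-spill sequence is one chain of forwarded values:

```cpp
#include <vector>

struct RegOp { char op; int dst, src1, src2; };  // r[dst] = r[src1] op r[src2]
struct AccOp { char op; int src; };              // every op reads/writes acc

// Register style: both sources named explicitly in every instruction.
double run_reg(std::vector<RegOp> code, std::vector<double> r)
{
    for (const RegOp& i : code)
        r[i.dst] = (i.op == '*') ? r[i.src1] * r[i.src2]
                                 : r[i.src1] + r[i.src2];
    return r[0];
}

// Accumulator style: the write-then-read hot spot is explicit in the
// encoding, since each result feeds straight into the next operation.
double run_acc(std::vector<AccOp> code, std::vector<double> r)
{
    double acc = 0.0;
    for (const AccOp& i : code)
        switch (i.op)
        {
        case '=': acc = r[i.src];  break;  // load: starts a chain
        case 's': r[i.src] = acc;  break;  // spill a partial result
        case '*': acc *= r[i.src]; break;
        case '+': acc += r[i.src]; break;
        }
    return acc;
}
```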

There was a very interesting paper [kim&smith02] that plays with the
idea of organizing the code microarchitecturally into strands of
instructions connected by an accumulator. A later paper
[salverda&zilles07] found disadvantages of that model and provides
insights into what works and what doesn't. I wonder if making the
strands architectural would reduce the problems and provide benefits
over classical register architectures.

@InProceedings{kim&smith02,
author = {Ho-Seop Kim and James E. Smith},
title = {An Instruction Set and Microarchitecture for
Instruction Level Distributed Processing},
crossref = {isca02},
pages = {71--81},
url = {http://www.ece.wisc.edu/~hskim/papers/kimh_ildp.pdf},
annote = {This paper addresses the problems of wide
superscalars with communication across the chip and
the number of write ports in the register file. The
authors propose an architecture (ILDP) with
general-purpose registers and with accumulators
(with instructions only accessing one accumulator
(read and/or write) and one register (read or
write); for the accumulators their death is
specified explicitly in the instructions. The
microarchitecture builds \emph{strands} from
instructions working on an accumulator; a strand
starts with an instruction writing to an accumulator
without reading from it, continues with instructions
reading from (and possibly writing to) the
accumulator and ends with an instruction that kills
the accumulator. Strands are allocated to one out of
eight processing elements (PEs) dynamically (i.e.,
accumulators are renamed). A PE consists of
mainly one ALU data path (but also a copy of the
GPRs and an L1 cache). They evaluated this
architecture by translating Alpha binaries into it,
and comparing their architecture to a 4-wide or
8-wide Alpha implementation; their architecture has
a lower L1 cache latency, though. The performance of
ILDP in clock cycles is competitive, and one can
expect faster clocks for ILDP. The paper also
presents data for other stuff, e.g. general-purpose
register writes, which have to be promoted between
strands and which are relatively few.}
}
@Proceedings{isca02,
title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
year = "2002",
key = "ISCA 29",
}

@InProceedings{salverda&zilles07,
author = {Pierre Salverda and Craig Zilles},
title = {Dependence-Based Scheduling Revisited: A Tale of Two
Baselines},
booktitle = {Sixth Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD 2007)},
year = {2007},
url = {http://www.ece.wisc.edu/~wddd/2007/papers/wddd_01.pdf},
url2 = {http://www-sal.cs.uiuc.edu/~zilles/papers/lanes.wddd-2007.pdf},
annote = {When the authors simulated the dependence-based
scheduling work by Palacharla, Kim, and Smith, they
found 30\% lower IPC than a conventional OoO
machine, whereas the original simulations only found
a 5\% lower IPC. The paper analyses the reasons for
this, and provides a number of insights into how
hardware schedulers, execution engines, and various
features in them interact, and why and how
dependence-based scheduling works. The authors'
simulation had a number of significant differences
from the simulation in the original work: it used a
memory disambiguator, 2-cycle load latency (instead
of 1-cycle), and a better branch predictor. These
changes increase the number of strands available at
the same time, and the 8-lane dependence-based
machine becomes lane-limited (and instruction fetch
stalls waiting for a free lane), so it cannot profit
from the improvements or work around the higher
latency, whereas a conventional OoO machine can. 24
lanes would be required to bring the IPC
disadvantage of the dependence-based machine down to
5\% on the authors' simulator. OTOH, by changing
these parts of their simulation to be like the
original work, the dependence-based scheduling only
had an 11\% IPC disadvantage on an 8-lane machine
(much closer to the original 5\%)}
}

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Rick C. Hodgin

Nov 18, 2016, 6:26:08 AM
On Friday, November 18, 2016 at 5:10:36 AM UTC-5, Anton Ertl wrote:
> Robert Wessel <robert...@yahoo.com> writes:
> >Stack based architectures have died out for a reason, they tend to
> >make good code optimization very difficult, since the top-of-stack is
> >always a hotspot.
>
> Well, looking at register machine implementations, they actually turn
> their register code internally into the same kind of hot-spot code:
> They recognize when one instruction writes to a register and the next
> reads from it, and optimize this case with forwarding. You can also
> have an accumulator or stack machine and make this aspect explicit in
> the instruction encoding. Of course, for the cases when you want to
> access other data, you would use, e.g., registers or a belt.

The same general kind of floating arrangement occurs when the stack
frame/base pointer is omitted, freeing up that extra register for
general-purpose use in a limited-register-set ISA like the 80386 and
later.

The relative position of data on the stack is then tracked and used
to access parameters and local variables relative to the other stack-
based operations taking place.

Ivan Godard

Nov 18, 2016, 2:58:45 PM
All links are broken.

Anton Ertl

Nov 18, 2016, 6:02:02 PM
Ivan Godard <iv...@millcomputing.com> writes:
>All links are broken.

That's deplorable. Googling for the title usually works.

Rick C. Hodgin

Nov 19, 2016, 7:14:42 AM
On Thursday, November 17, 2016 at 8:49:58 PM UTC-5, robert...@yahoo.com wrote:
> Stack based architectures have died out for a reason, they tend to
> make good code optimization very difficult, since the top-of-stack is
> always a hotspot. The x87 stack was an attempt to fit a reasonable FP
> ISA into the very limited opcode space provided by the ESC
> instructions. They didn't do a stack because it's a good idea, they
> did a stack because you can do a stack ISA with fewer instruction
> bits.

Over the past 14+ months I have seriously considered introducing the
stack into my SIMD engine. I spent a great deal of time in the 1990s
learning the x87 FPU. I came to greatly appreciate its stack-based
nature. It allows natural propagation through an expression with a
minimal number of operations, and it's truly elegant when done properly.

I have not yet figured out a good way to do it in the SIMD engine, and
this new consideration of extra registers in memory and a fixed portion
at the top introduces new variables which need to be resolved first.

I have considered introducing single-byte stack prefix opcodes which
indicate what to do with the resulting value once it is computed, such
as changing the top of stack once or twice. It's workable, but I
still need to sort out the details.
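One possible shape for such a prefix, shown only for the single-rotation
cases and with invented names and encoding values, mirrors the existing
FADD st0,st1 / FADDP pair:

```cpp
#include <vector>

// Sketch of the single-byte stack-prefix idea: the prefix on an
// arithmetic op says what happens to the stack once the result is
// computed. Names and encoding values are invented for illustration.
enum StackPrefix
{
    SP_KEEP = 0x00,  // result replaces st0; depth unchanged (like FADD st0,st1)
    SP_POP  = 0x01   // result lands in st1's slot, st0 popped (like FADDP)
};

// st.back() is st0.
void fadd_prefixed(std::vector<double>& st, StackPrefix p)
{
    double r = st[st.size() - 1] + st[st.size() - 2];  // st0 + st1
    if (p == SP_KEEP)
        st.back() = r;       // overwrite st0, keep the operand below
    else
    {
        st.pop_back();       // change the top of stack once
        st.back() = r;       // result now sits where st1 was
    }
}
```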

-----
I do think the rotating stack design of the FPU is appropriate. It's
very complex to get a handle on and master. But I think once you do,
it's really quite a thing of beauty, and I can see why the x87 designers
created it that way. I also realize that once the compiler algorithms
which handle it properly are written, most people won't ever need to
deal with it again, and that those algorithms, while being very
demanding, are not impossible.

The ability to spill stx data to memory-based registers beyond the
normal st0..st7 regs provided for in hardware seems to me to be a
fundamental ability that should be required in any proper stack-FPU
implementation. It is only natural, as generating code in a heavily
constrained register environment is more difficult.

It also seems the ability to block off a fixed portion at the top, so
those values are not rotated with the rest of the stack, is fundamental.

The extra registers fully survive function calls and interrupts, and
can be saved/restored in a modified task state segment (or its
equivalent), thereby surviving even task switches. I don't see a
downside apart from established mental biases or prejudices against a
stack-based FPU.