x86: Why LOOPxx instruction is so slow?

629 views
Skip to first unread message

Tacit

unread,
Mar 9, 2013, 5:01:56 PM3/9/13
to
Well, direct answer is: LOOPcc decodes into many µops and takes many
cycles to execute, unlike DEC + JNZ pair and JrCXZ. So every
optimisation manual says you should not use it. But why? LOOPxx is:
0) if cc is correct — skip everything;
1) DEC rCX, but keep the flags;
2) JNZ, using DEC's unwritten Z-flag.

How hard can it be for microcode to implement? If Agner Fog's tables
are correct: K7-12 — 7 µops and 3-4 cycles, Bulldozer — 1 and 1-2,
Bobcat — 8 & 4, Pentium II/III/M — 11 & 6, Pentium 4 — 4 & 2-4, Core
2 — 11 & 5, Nehalem — 6/11 & 4/7 (for LOOP/LOOP(N)E), Sandy Bridge —
7/11 & 5, Atom — 8 & 8, Nano — ? & 1-3 (as any Jcc).

So, just 2 microarchitectures are able to implement LOOPcc short and
fast. But most fail. Why?

John Levine

unread,
Mar 9, 2013, 5:41:49 PM3/9/13
to
>So, just 2 microarchitectures are able to implement LOOPcc short and
>fast. But most fail. Why?

Probably because they don't care. To use the LOOP instructions, the
count has to be in CX and the target address has to be within 128
bytes of the instruction. It's hardly worth adding an optimizer check
for that, since the equivalent DEC and JNE are plenty fast.





--
Regards,
John Levine, jo...@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly

Tacit

unread,
Mar 9, 2013, 6:40:41 PM3/9/13
to
On 10 мар, 00:41, John Levine <jo...@iecc.com> wrote:
> Probably because they don't care.  To use the LOOP instructions, the
> count has to be in CX and the target address has to be within 128
> bytes of the instruction.  It's hardly worth adding an optimizer check
> for that, since the equivalent DEC and JNE are plenty fast.

Yes, the instruction is mostly useless, but that doesn't explain, why
it takes so many µops.

Robert Wessel

unread,
Mar 10, 2013, 1:31:36 AM3/10/13
to
On Sat, 9 Mar 2013 14:01:56 -0800 (PST), Tacit <tacit...@gmail.com>
wrote:
LOOP became slow on some of the earliest machines (circa 486) when
significant pipelining started to happen, and running any but the
simplest instruction down the pipeline efficiently was technologically
impractical. So LOOP was slow for a number of generations. So nobody
used it. So when it became possible to speed it up, there was no real
incentive to do so, since nobody was actually using it.

Tacit

unread,
Mar 10, 2013, 8:49:33 AM3/10/13
to
On 10 мар, 08:31, Robert Wessel <robertwess...@yahoo.com> wrote:
> LOOP became slow on some of the earliest machines (circa 486) when
> significant pipelining started to happen, and running any but the
> simplest instruction down the pipeline efficiently was technologically
> impractical.

Not just slow — up to 11 µops! Obviously it was done deliberately. If
DEC and JNZ have 1 µop each, there is no reason for engineers to
implement a combination of them in much more complicated form instead
of same 2 µops (3 with condition check).

> So when it became possible to speed it up

I don't see the reason to slow it down so much in the first place.

Anton Ertl

unread,
Mar 10, 2013, 9:02:47 AM3/10/13
to
Tacit <tacit...@gmail.com> writes:
>On 10 =D0=BC=D0=B0=D1=80, 08:31, Robert Wessel <robertwess...@yahoo.com> wr=
>ote:
>> LOOP became slow on some of the earliest machines (circa 486) when
>> significant pipelining started to happen, and running any but the
>> simplest instruction down the pipeline efficiently was technologically
>> impractical.
>
>Not just slow =E2=80=94 up to 11 =C2=B5ops! Obviously it was done deliberat=
>ely.

IIRC LOOP was used in some software for timing loops; there was
(important) software that did not work on CPUs where LOOP was too fast
(this was in the early 90s or so). So CPU makers learned to make LOOP
slow. Apparently this still persists; I don't know if there is still
software around where that makes a difference.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Tacit

unread,
Mar 10, 2013, 10:29:16 AM3/10/13
to
On 10 мар, 15:02, an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> IIRC LOOP was used in some software for timing loops; there was
> (important) software that did not work on CPUs where LOOP was too fast
> (this was in the early 90s or so). So CPU makers learned to make LOOP
> slow. Apparently this still persists; I don't know if there is still
> software around where that makes a difference.

If this is correct and the only reason for keeping it slow -- it's
incredibly stupid.

Robert Wessel

unread,
Mar 10, 2013, 1:44:15 PM3/10/13
to
On Sun, 10 Mar 2013 05:49:33 -0700 (PDT), Tacit
<tacit...@gmail.com> wrote:
The simple instructions were dispatched directly into the pipeline by
the decoder, LOOP was complex, and trapped to microcode.

Robert Wessel

unread,
Mar 10, 2013, 1:45:57 PM3/10/13
to
On Sun, 10 Mar 2013 13:02:47 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Tacit <tacit...@gmail.com> writes:
>>On 10 =D0=BC=D0=B0=D1=80, 08:31, Robert Wessel <robertwess...@yahoo.com> wr=
>>ote:
>>> LOOP became slow on some of the earliest machines (circa 486) when
>>> significant pipelining started to happen, and running any but the
>>> simplest instruction down the pipeline efficiently was technologically
>>> impractical.
>>
>>Not just slow =E2=80=94 up to 11 =C2=B5ops! Obviously it was done deliberat=
>>ely.
>
>IIRC LOOP was used in some software for timing loops; there was
>(important) software that did not work on CPUs where LOOP was too fast
>(this was in the early 90s or so). So CPU makers learned to make LOOP
>slow. Apparently this still persists; I don't know if there is still
>software around where that makes a difference.


That happened, but only on one or two CPUs. I think AMD did that for
Windows compatibility at one point. Whatever slowdown they did would
have been overwhelmed by clock speed increases long ago.

Tacit

unread,
Mar 10, 2013, 2:15:24 PM3/10/13
to
On 10 мар, 19:44, Robert Wessel <robertwess...@yahoo.com> wrote:
> The simple instructions were dispatched directly into the pipeline by
> the decoder, LOOP was complex, and trapped to microcode.

LOOP should not be that complex! It should be just 2 or 3 µops, which
can be handled by complex translator on way 0 of any modern CPU
pipeline, not by microsequencer with µROM. However, it really has up
to 11 µops and is microcoded, and I'm wondering, why there are so many
µops.

John Levine

unread,
Mar 10, 2013, 4:42:53 PM3/10/13
to
>> The simple instructions were dispatched directly into the pipeline by
>> the decoder, LOOP was complex, and trapped to microcode.
>
>LOOP should not be that complex! It should be just 2 or 3 µops, which
>can be handled by complex translator on way 0 of any modern CPU
>pipeline, not by microsequencer with µROM. However, it really has up
>to 11 µops and is microcoded, and I'm wondering, why there are so many
>µops.

Because nobody cares, and it's not worth dedicating whatever small
incremental chunk of silicon would be needed to make it faster.

You can ask this question as many times as you want, but you're not
going to get any new answers.

Paul A. Clayton

unread,
Mar 10, 2013, 5:01:46 PM3/10/13
to
I could imagine a possibly sensible 6-µop version:
virtual_cc = cc;
temp = test (cc);
rCX = rCX - temp; // also setting cc
cc = temp & cc; // assumes branch handling is not
// substantially changed for the sake of LOOP
branch
cc = virtual_cc

I could also imagine allowing an operation after a
branch (kind of like a delayed branch) might have
some issues. The above also assumes the presence
of two scratch registers.

If one is implementing LOOP in a manner that
*minimally* interferes with the design--i.e.,
almost ignoring the existence of the instruction
while designing the core and then implementing it
with microcode--, it might not be *that* incredible
that the microcode is bloated. Without special
support, one might not be able to skip the rest of
the operations if the cc test succeeds (there might
not be any microcode-internal branch instruction--
though x86 microcode might require such?).
(Including a microcode-internal branch might also
be less attractive than converting the control
dependence to a data dependence.)

I agree that 11 µops sounds excessive and I
would be interested to know both how such has
been implemented and why faster alternatives were
not used. (I think the implementation I presented
above is very poor--both in requiring some special
support and in using many µops--, but it is hard
to guess what the cost would be of having the
virtual_cc set by a subtract and AND and used by
the branch [that would avoid to µops].)

I suspect I am just revealing my ignorance and
even my incompetence.

Tacit

unread,
Mar 10, 2013, 6:51:27 PM3/10/13
to
On 10 мар, 23:01, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> I could imagine a possibly sensible 6-µop version:
> virtual_cc = cc;
> temp = test (cc);
> rCX = rCX - temp; // also setting cc
> cc = temp & cc; // assumes branch handling is not
> // substantially changed for the sake of LOOP
> branch
> cc = virtual_cc

Could be much easier:
1) abort if CC is correct (i.e. J(N)Z to end)
2) DEC rCX
3) JNZ to target

In case there are no options for saving Z-flag in temp-flag and use it
for JNZ — add 2 µops for save and restore main flags.

> I could also imagine allowing an operation after a
> branch (kind of like a delayed branch) might have
> some issues.

No problem, «JNZ to target» in microcode is actually ADD rIP,
relative_address. New rIP will be used after microsequence ends.

> Without special
> support, one might not be able to skip the rest of
> the operations if the cc test succeeds (there might
> not be any microcode-internal branch instruction--
> though x86 microcode might require such?).

There are. A lot of exception and denormal handling require early
exit.

> (Including a microcode-internal branch might also
> be less attractive than converting the control
> dependence to a data dependence.)

OK, replace any next µop ofter CC check to conditional. There may be
problems, if only moves and jumps are allowed to be conditional.

> If one is implementing LOOP in a manner that
> *minimally* interferes with the design--i.e.,
> almost ignoring the existence of the instruction
> while designing the core and then implementing it
> with microcode--, it might not be *that* incredible
> that the microcode is bloated.

After over 10 generations of cores there could be some solution for
it. Bulldozer and Nano architects did it.

Robert Wessel

unread,
Mar 11, 2013, 1:23:33 AM3/11/13
to
On Sun, 10 Mar 2013 11:15:24 -0700 (PDT), Tacit
<tacit...@gmail.com> wrote:

>On 10 ???, 19:44, Robert Wessel <robertwess...@yahoo.com> wrote:
>> The simple instructions were dispatched directly into the pipeline by
>> the decoder, LOOP was complex, and trapped to microcode.
>
>LOOP should not be that complex! It should be just 2 or 3 ľops, which
>can be handled by complex translator on way 0 of any modern CPU
>pipeline, not by microsequencer with ľROM. However, it really has up
>to 11 ľops and is microcoded, and I'm wondering, why there are so many
>ľops.


I was responding to your questions as to why it was made so slow in
the first place.

As to why it was not made faster later, mostly nobody caring, not that
it wasn't possible. I expect the x87 FPU instructions to start
undergoing a dramatic slowdown in the next decade too. You have to
support them for compatibility, but there's no meaningful use in 64
bit mode, and as 32 bit applications start to get sparse, there will
be little execution of the x87 instructions at all, and the transistor
and power budget will be better spent on other things.

Tacit

unread,
Mar 11, 2013, 10:18:37 AM3/11/13
to
On 11 мар, 07:23, Robert Wessel <robertwess...@yahoo.com> wrote:
> As to why it was not made faster later, mostly nobody caring, not that
> it wasn't possible.  I expect the x87 FPU instructions to start
> undergoing a dramatic slowdown in the next decade too.

Probably, AMD should have got rid of x87 and LOOPcc completely, when
they've made x86-64 (as they did with BCD-arithmetic, other long
unused instructions, segments and prefixes).

Paul A. Clayton

unread,
Mar 11, 2013, 11:52:29 AM3/11/13
to
On Mar 10, 6:51 pm, Tacit <tacit.mu...@gmail.com> wrote:
> On 10 мар, 23:01, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
[snip]
> Could be much easier:
> 1) abort if CC is correct (i.e. J(N)Z to end)
> 2) DEC rCX
> 3) JNZ to target

I agree that a shorter sequence should be possible,
but I was trying to think of a bloated sequence
that might make sense if *minimal*
microarchitectural adjustments were permitted.

If LOOP never changes the cc, then JNZ would not
work without adding a branch on a virtual cc.
While adding such support may seem trivial, if
the LOOP implementation is not allowed to make
any changes to the internal microarchitecture,
then such (trivial) optimizations may not be
allowed.

I really have no idea what constraints a
microarchitecture might place on a microcode
implementation of LOOP.

[snip]
> No problem, «JNZ to target» in microcode is actually ADD rIP,
> relative_address. New rIP will be used after microsequence ends.

Perhaps such a solution would have added complexity to
the internal microarchitecture since one would not want
that µop to update branch predictors. (Again, one
needs to assume that any added cost is too much.)

Even if I were a processor designer it would be
difficult to guess what constraints the microcode
developers were under.

If a useless, compatibility-only instruction is handed
to the microcode developers, they might reasonably not
be able or willing to suggest minor changes to the
internal microarchitecture to improve such an
instruction. Not only would they rather use their
"change suggestion capital" more productively but the
suggestion of a change for a useless case would reduce
the credibility of other suggestions.

[snip]
> There are. A lot of exception and denormal handling require early
> exit.

Trap to microcode might be somewhat different than
branch to microcode. Again, this is assuming no or
minimal changes to the internal microarchitecture.

[snip]
> After over 10 generations of cores there could be some solution for
> it. Bulldozer and Nano architects did it.

It is clearly not a case of such not being possible.
However, if LOOP is considered compatibility-only,
then it might reasonably be handed off to the
microcode developers. If the microarchitecture
happens to fit a simple/fast microcode implementation
of LOOP, then such might be implemented; but there
might be no reason to adjust the microarchitecture to
improve such an instruction.

The architects behind Nano may have found avoiding
the special casing of LOOP simplified their design
in terms of area or power. Or they may have had
incentives from embedded users to provide a fast
implementation (for code density benefits). Those
are just *WILD* guesses.

If optimization of LOOP fell out of other
optimizations (like fusion of compare and branch),
it might be easier to tweak LOOP into a fast path
instruction than to handle it in microcode even if
the performance of LOOP was unimportant.

I suspect that such decisions are based on
specific details of the implementation. Information
about such details does not seem to be generally
available and interpreting such information would be
beyond the skill level of most people. (I am not a
hardware designer--and have never played one on
television or stayed at a Holiday Inn Express. :-)

Robert Wessel

unread,
Mar 12, 2013, 2:15:30 AM3/12/13
to
On Mon, 11 Mar 2013 07:18:37 -0700 (PDT), Tacit
<tacit...@gmail.com> wrote:
Perhaps. The decimal instructions were truly rarely used, and while
LOOP is slow, it's not in quite that same realm of disuse. x87 was
pretty commonly used.

One of AMDs prime goals was making it easy to move to the new ISA. I
think they could have cleaned up the encoding mess a lot in the
process, but they didn't. It's not like they really saved that much
work for people porting their compiler back ends. Certainly they
could have reclaimed a nice chunk of prime opcode space by pushing the
x87 stuff under a single (initial) opcode.

But I do think AMD missed most of the possible cleanup, and actual 64
bit code suffers considerably for it.

Jolly Good

unread,
Mar 14, 2013, 12:22:11 AM3/14/13
to
On Mar 10, 6:51 pm, Tacit <tacit.mu...@gmail.com> wrote:
> On 10 мар, 23:01, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
>
> > I could imagine a possibly sensible 6-µop version:
> > virtual_cc = cc;
> > temp = test (cc);
> > rCX = rCX - temp; // also setting cc
> > cc = temp & cc; // assumes branch handling is not
> >        // substantially changed for the sake of LOOP
> > branch
> > cc = virtual_cc
>
> Could be much easier:
> 1) abort if CC is correct (i.e. J(N)Z to end)
> 2) DEC rCX
> 3) JNZ to target
>
> In case there are no options for saving Z-flag in temp-flag and use it
> for JNZ — add 2 µops for save and restore main flags.

This also does not cover errors cases. A 32-bit LOOPcc that jumps
outside of CS boundaries or a 64-bit LOOPcc with a destination in non-
canonical form will cause a #GP(0) exception.

#GP(0) is a fault-class exception, meaning that machine state is
restored to as it was before the instruction executed. In this case,
that means that ECX or RCX must not be decremented if the jump would
cause an exception.

Change the DEC RCX into a SUB to a temporary register, add in an
exception check before the final JNZ, MOV temporary register into real
RCX at this point, then do the final JNZ. That is at least 2 more
µops.

Add in the 2 µops for saving and restoring flags and you are already
at 7 µops. We have made very few assumptions about the underlying
architecture and are already of similar size to the real
implementations you originally questioned.

Paul A. Clayton

unread,
Mar 14, 2013, 8:56:32 AM3/14/13
to
On Mar 14, 12:22 am, Jolly Good <jolly.good.po...@gmail.com> wrote:
[snip]
> This also does not cover errors cases. A 32-bit LOOPcc that jumps
> outside of CS boundaries or a 64-bit LOOPcc with a destination in non-
> canonical form will cause a #GP(0) exception.

Exception recovery would be handled by the ordinary
mechanism provided for out-of-order execution.
(Obviously, that would not apply to the current
Atom implementation, but even Atom is moving to
out-of-order execution.)

Stephen Sprunk

unread,
Mar 14, 2013, 12:38:01 PM3/14/13
to
On 12-Mar-13 01:15, Robert Wessel wrote:
> On Mon, 11 Mar 2013 07:18:37 -0700 (PDT), Tacit
> <tacit...@gmail.com> wrote:
>> On 11 ???, 07:23, Robert Wessel <robertwess...@yahoo.com> wrote:
>>> As to why it was not made faster later, mostly nobody caring, not
>>> that it wasn't possible. I expect the x87 FPU instructions to
>>> start undergoing a dramatic slowdown in the next decade too.
>>
>> Probably, AMD should have got rid of x87 and LOOPcc completely,
>> when they've made x86-64 (as they did with BCD-arithmetic, other
>> long unused instructions, segments and prefixes).
>
> Perhaps. The decimal instructions were truly rarely used, and while
> LOOP is slow, it's not in quite that same realm of disuse. x87 was
> pretty commonly used.

OTOH, the ABIs that AMD developed with MS and the Linux community made
SSE pretty much mandatory for FP, so x87 code in 64-bit mode should fall
into the "truly rare" category as well. Why not just remove them as
they did with other "truly rare" instructions (and prefixes)?

> One of AMDs prime goals was making it easy to move to the new ISA.
> I think they could have cleaned up the encoding mess a lot in the
> process, but they didn't. It's not like they really saved that much
> work for people porting their compiler back ends.

I'm not sure if it was compiler complexity. Let's face it, only two x86
compilers really matter (MSVC and GCC), and both would implement
whatever encoding and instruction changes AMD came up with.

If I were in AMD's shoes at the time, I would have worried about being
seen as changing "too much", given the spectacular failure that was
unfolding at Intel.

Also, I don't know how much of the 32-bit mode decoder logic AMD was
able to reuse. Needing entirely separate decoders would surely have
increased the die space required, for something that was at that point
purely speculative. They had no idea if the industry would adopt their
ISA or if it would just waste space that could be applied to something
that would certainly help them in the market, eg. larger caches.

> Certainly they could have reclaimed a nice chunk of prime opcode
> space by pushing the x87 stuff under a single (initial) opcode.

Indeed. The REX prefix consumes an enormous chunk of that opcode space,
but that was done by appropriating the single-byte INC and DEC
instructions, which already had equivalent two-byte forms so nothing was
truly lost.

> But I do think AMD missed most of the possible cleanup, and actual
> 64 bit code suffers considerably for it.

I've thought about how I would encode x86(-64) if starting from a clean
slate, but it's really a moot point now; AMD blew the one opportunity
we've had in the last 30+ years, and it'll probably be at least that
long before we get another one.

I'm not sure how much it really matters, though. Modern chips just
decode the x86 instruction stream into internal RISCy instructions.
That likely wouldn't go away no matter how good the encoding was, unless
most/all CISCy instructions were eliminated too.

The preponderance of prefixes, the ModRM and SIB bytes, etc. really only
affect code size--and probably not by enough to matter in most cases. I
haven't seen any recent comparisons, but code for 32-bit RISC ISAs was
substantially larger than equivalent x86 code, so it's possible that a
new encoding scheme would have made the code size problem even worse
than it already is.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

Ivan Godard

unread,
Mar 14, 2013, 1:58:17 PM3/14/13
to
On 3/14/2013 9:38 AM, Stephen Sprunk wrote:

<snip>

> The preponderance of prefixes, the ModRM and SIB bytes, etc. really only
> affect code size--and probably not by enough to matter in most cases. I
> haven't seen any recent comparisons, but code for 32-bit RISC ISAs was
> substantially larger than equivalent x86 code, so it's possible that a
> new encoding scheme would have made the code size problem even worse
> than it already is.

It's very hard to get meaningful comparison numbers for code size across
ISAs - too much apples to oranges. For example, the Mill is a load/store
architecture like RISCs. That makes the x86 average-instruction size
look bad because the x86 number includes the big fat memory-to-memory
instructions. However, the x86 actually is more compact than the Mill
for these kinds of ops (despite the apparent numbers) because the Mill
needs two or three operations to do what the x86 does in one.

Conversely, the Mill has some very high level CISCy operations that do
in one operation what can take tens of operations on an x86. These two
effects balance out each other - somewhat. But where is the balance
point? Hard to tell.

Then there are family effects - x86-32 has different size numbers than
x86-64, or, for us, the Tin configuration has different numbers than the
Gold configuration.

Wide issue machines, like VLIWs, Itanium, and the Mill, have another
source of variability: bundle overhead, including compulsory no-ops if
the ISA requires them. In codes where there is high natural ILP (or with
software pipelining in any codes) the bundle overhead gets amortized
against more operations, giving better average size.

You might think that the way to avoid these effects is to measure
whole-program sizes. But then you start to get compiler effects.
Different compilers, even for the same target architecture and the same
(apparent) optimization settings, can produce widely different code
sizes - try GCC, ICC, clang, and MS against an x86 to see. A great deal
depends on how enthusiastic the compiler is about loop unrolling, which
can be very visible in the whole-program sizes while completely
meaningless in the instruction sizes.

Ivan

Stephen Sprunk

unread,
Mar 14, 2013, 4:25:00 PM3/14/13
to
On 14-Mar-13 12:58, Ivan Godard wrote:
> On 3/14/2013 9:38 AM, Stephen Sprunk wrote:
>> The preponderance of prefixes, the ModRM and SIB bytes, etc. really
>> only affect code size--and probably not by enough to matter in most
>> cases. I haven't seen any recent comparisons, but code for 32-bit
>> RISC ISAs was substantially larger than equivalent x86 code, so
>> it's possible that a new encoding scheme would have made the code
>> size problem even worse than it already is.
>
> It's very hard to get meaningful comparison numbers for code size
> across ISAs - too much apples to oranges. For example, the Mill is a
> load/store architecture like RISCs. That makes the x86
> average-instruction size look bad because the x86 number includes the
> big fat memory-to-memory instructions. However, the x86 actually is
> more compact than the Mill for these kinds of ops (despite the
> apparent numbers) because the Mill needs two or three operations to
> do what the x86 does in one.

That's one of the big things I missed when I first fell in love with
load-store architectures and wanted to eliminate mem-op, op-mem and
mem-op-mem instructions: the instructions might get a little smaller,
but you'll need more of them, so it's (almost?) always a net loss on
code density.

The average instruction size for x86(-32) code is 2-3 bytes, while
contemporary RISC ISAs had a fixed instruction size of 4 bytes. A RISC
logically requires at least as many instructions as a CISC, so x86 would
obviously win on code density--measurement not required.

(This assumes you've eliminated factors like register pressure and stack
calling conventions, which afflicted x86-32 more than x86-64.)

> Then there are family effects - x86-32 has different size numbers
> than x86-64, or, for us, the Tin configuration has different numbers
> than the Gold configuration.

I would expect average instruction size for x86-64 to be 3-4 bytes, due
to REX, but that should still be a small win vs. traditional RISCs.

OTOH, a smarter encoding retaining the variable-length, two-operand
model of x86-32 wouldn't have increased average instruction size. And
I'm not sure anyone cares about instruction size now anyway.

Ivan Godard

unread,
Mar 14, 2013, 4:53:50 PM3/14/13
to
The best I've found in the lit for x86-64 (sorry, no cite) was 27 bits
over a range of apps, compiler set to optimize space. We do better as
raw numbers (14-17 bits/op), but it's still very much apples to oranges
and the best whole-program level numbers we have suggest perhaps a 25%
gain over x86 rather than the apparent 75%. *Very* "well, depending..."
though. At least the "--optimize space" option on our tool chain is a
no-op - we don't need to do unrolling :-)

>
> OTOH, a smarter encoding retaining the variable-length, two-operand
> model of x86-32 wouldn't have increased average instruction size. And
> I'm not sure anyone cares about instruction size now anyway.

I sure do.

A large fraction of apps important to the market will thrash in the
iCache. That includes the OS. People don't notice because everybody uses
the same cache sizes (power and speed-of-light constraints). Hence users
have nothing to compare against, and they complain about dog programs
rather than dog architectures. One of our most promising patents, just
filed, gives you double the iCache at no increase in power or latency,
or same size at better power/latency. That's on top of the gain from
compact instructions. It matters.

Ivan

Paul A. Clayton

unread,
Mar 14, 2013, 10:36:01 PM3/14/13
to
On Mar 14, 4:53 pm, Ivan Godard <i...@ootbcomp.com> wrote:
> On 3/14/2013 1:25 PM, Stephen Sprunk wrote:
[snip]
>> OTOH, a smarter encoding retaining the variable-length, two-operand
>> model of x86-32 wouldn't have increased average instruction size.  And
>> I'm not sure anyone cares about instruction size now anyway.
>
> I sure do.
>
> A large fraction of apps important to the market will thrash in the
> iCache. That includes the OS. People don't notice because everybody uses
> the same cache sizes (power and speed-of-light constraints). Hence users
> have nothing to compare against, and they complain about dog programs
> rather than dog architectures. One of our most promising patents, just
> filed, gives you double the iCache at no increase in power or latency,
> or same size at better power/latency. That's on top of the gain from
> compact instructions. It matters.

It is not clear to me that a two-fold increase in cache
size would make that much of a performance difference,
and if it did the loss from misprediction recovery (from
greater latency) and extra power would be relatively
small. (Dynamically adjusting Icache size would help
with the power issue when a larger cache is not useful;
changing the latency in cycles would seem to be more
difficult.)

One way to decrease the typical latency of an Icache
might be to allocate blocks of instructions into two
separate sections of equal size with the close
section having lower latency and requests being
directed to both sections at the same time so that
requests directed to the higher latency section
return the result a cycle after the lower latency
section returns its result.

Obviously this would work best with a trace cache
with two-cycle long traces, but such a design
could be used with blocks of instructions. This
is somewhat similar to a branch target instruction
cache (except that it has an exclusion property
with the slower cache, is larger in size, and a
BTIC might be indexed by the branch address rather
than the target address).

It is an application of NUCA with prefetching. I.e.,
some portions of the instruction stream are highly
prefetchable and so can be placed in a higher
latency portion of the cache while other portions
have more critical timing.

Such a partitioned cache might increase power use,
though with a square-root latency rule it might be
possible to specialize the implementation of the
higher latency section to reduce power (if it
would naturally be c. 42% higher latency, increasing
it to 50% [2 and 3 cycle latencies] or 100% [1 and 2
cycle latencies] might allow a reduction in power).
Such a design might also be more friendly to cutting
cache size in half to reduce power under low Icache
demand.

(It might be practical to use variable length "basic
blocks" with right-to-left and left-to-right
fitting [like heap and stack grow together in some
address space arrangements]--not true basic blocks
since such would not end in a control flow operation
or just before an alternate entry point. Such might
facilitate greater flexibility in placing
instructions based on criticality.)

(For some ISAs, it might even be possible to store
less critical information in a more distant portion
of cache. E.g., register names might be most
critical while operation type is less critical [at
least if not needed to distinguish register names]
and execution-used immediates might be least
critical. Such a fracturing of instructions might
be a bit complex. [Something similar was proposed
as a separate cache to allow register renaming to
run ahead on an Icache miss {"Reducing the
Performance Impact of Instruction Cache Misses by
Writing Instructions into the Reservation Stations
Out-of-Order", Stark et al., 1997}. This might also
be loosely related to my suggestion to extract around
16 bits from instructions biased toward taken branches
so that part of the capacity of the BTB would be used
for storage of instruction fragments; Andy Glew
mentioned that someone working on Alpha had the same
basic idea--earlier, of course.])

I think in the 1980s there was research on Icache
bypass (for conflict and capacity issues), and it
seems unfortunate that fetch behavior is not
exploited for allocation in this manner.

It sounds like the patent-pending technique might be
rather interesting.

Ivan Godard

unread,
Mar 15, 2013, 12:14:52 AM3/15/13
to
On 3/14/2013 7:36 PM, Paul A. Clayton wrote:
> On Mar 14, 4:53 pm, Ivan Godard <i...@ootbcomp.com> wrote:
>> On 3/14/2013 1:25 PM, Stephen Sprunk wrote:
> [snip]
>>> OTOH, a smarter encoding retaining the variable-length, two-operand
>>> model of x86-32 wouldn't have increased average instruction size. And
>>> I'm not sure anyone cares about instruction size now anyway.
>>
>> I sure do.
>>
>> A large fraction of apps important to the market will thrash in the
>> iCache. That includes the OS. People don't notice because everybody uses
>> the same cache sizes (power and speed-of-light constraints). Hence users
>> have nothing to compare against, and they complain about dog programs
>> rather than dog architectures. One of our most promising patents, just
>> filed, gives you double the iCache at no increase in power or latency,
>> or same size at better power/latency. That's on top of the gain from
>> compact instructions. It matters.
>
Wow, you cite a ton of interesting ideas. Shame you weren't on our
effort several years ago :-)

I'm going to ignore all the things related to pre-parsing and trace
cache where it cache is not simply a repository of bytes. Those are
interesting but require either being able to to recognize instruction
boundaries in an arbitrary byte sequence, or require post-decoration
techniques where the trace is built as untraced raw instructions are
parsed and executed. I've no argument that the methods work, but they
seem mostly of value when you must deal with legacy codes, and I have
little experience with that.

> It is not clear to me that a two-fold increase in cache
> size would make that much of a performance difference,
> and if it did the loss from misprediction recovery (from
> greater latency) and extra power would be relatively
> small. (Dynamically adjusting Icache size would help
> with the power issue when a larger cache is not useful;
> changing the latency in cycles would seem to be more
> difficult.)
>
It's clear that the extra size is irrelevant in the absence of
thrashing. Many programs spend all their time in tight loops that fit.
The extra cache is an attack on Amdahl's law: if only 10% of execution
is in code that thrashes, and thrashing causes a 10X loss of throughput,
then getting out of thrashing mode doubles the overall performance. Of
course, some programs will thrash in the doubled cache too, and some
can't use it, so the doubling is only valuable in codes within a window
of working set sizes. In my experience that window is populated enough
that I care, YMMV.

As for why existing chips don't make the i$1 bigger, I appeal to Andy or
Mitch to suggest an answer. My understanding is that by and large the
cache is made at least 32K for thrash prevention, and as much bigger as
possible as will not cause an extra pipe stage under the clock and stage
height constraints dictated by other parts of the microarchitecture. I'm
told that power is more determined by wayness and fetch width (i.e.
decode width and/or bank width) than raw size, but size does have a cost.

> One way to decrease the typical latency of an Icache
> might be to allocate blocks of instructions into two
> separate sections of equal size with the close
> section having lower latency and requests being
> directed to both sections at the same time so that
> requests directed to the higher latency section
> return the result a cycle after the lower latency
> section returns its result.

Isn't this just an i$1 pulling from an i$2?

> Obviously this would work best with a trace cache
> with two-cycle long traces, but such a design
> could be used with blocks of instructions. This
> is somewhat similar to a branch target instruction
> cache (except that it has an exclusion property
> with the slower cache, is larger in size, and a
> BTIC might be indexed by the branch address rather
> than the target address).

That's an approach we never considered, in part because we don't keep
track of branches and cannot recognize a branch in pre-decode because of
the short, exposed, pipeline. By the time we know it's a branch we are
already executing it :-) We also can't use a BTB (or rather, don't need
one) for the same reason. It's not clear to me what advantage a BTIC has
over a BTB. Needs thought.

> It is an application of NUCA with prefetching. I.e.,
> some portions of the instruction stream are highly
> prefetchable and so can be placed in a higher
> latency portion of the cache while other portions
> have more critical timing.

But how do you determine which is which? Predicteds go in fast, and slow
is used for post-miss fetch?

> Such a partitioned cache might increase power use,
> though with a square-root latency rule it might be
> possible to specialize the implementation of the
> higher latency section to reduce power (if it
> would naturally be c. 42% higher latency, increasing
> it to 50% [2 and 3 cycle latencies] or 100% [1 and 2
> cycle latencies] might allow a reduction in power).
Don't understand - aren't your numbers backwards (i.e. bigger should be
slower)?
> Such a design might also be more friendly to cutting
> cache size in half to reduce power under low Icache
> demand.

Cutting the number of rows doesn't buy much unless you actually shut it
off for leakage, and I don't want to think about turning it back on.
Cutting wayness (assuming parallel ways) would save some power. But
you'd save nothing in the actual transmission of the bytes, unless you
chopped the physically furthest half and saved a repeater or two. ; I
supposed the physical layout could assign ways by distance so your chop
got both wayness and distance. Or so I understand - remember, I ,ake no
claim of being a gates and volts guy.

> (It might be practical to use variable length "basic
> blocks" with right-to-left and left-to-right
> fitting [like heap and stack grow together in some
> address space arrangements]--not true basic blocks
> since such would not end in a control flow operation
> or just before an alternate entry point. Such might
> facilitate greater flexibility in placing
> instructions based on criticality.)

That's an interesting notion. There was a bunch of academic work
involving compiler placement of code for optimal cache behavior for
various size/wayness/etc. They got decent numbers, but as far as I know
commercial compilers don't do it, because the necessary trace info is
rarely available and the targets vary too much.

> (For some ISAs, it might even be possible to store
> less critical information in a more distant portion
> of cache. E.g., register names might be most
> critical while operation type is less critical [at
> least if not needed to distinguish register names]
> and execution-used immediates might be least
> critical. Such a fracturing of instructions might
> be a bit complex.

Complex hell - downright weird, as weird as a Mill :-) The drawback of
splitting (unless you are using a pre-decoded trace cache) is that
either your fetch granularity must be smaller, or your execution average
granularity (basic block length) must be bigger, or internal
fragmentation will kill you.

[Something similar was proposed
> as a separate cache to allow register renaming to
> run ahead on an Icache miss {"Reducing the
> Performance Impact of Instruction Cache Misses by
> Writing Instructions into the Reservation Stations
> Out-of-Order", Stark et al., 1997}.
I can see where this might be useful if you have register renaming; we
don't, nor have registers to rename.
This might also
> be loosely related to my suggestion to extract around
> 16 bits from instructions biased toward taken branches
> so that part of the capacity of the BTB would be used
> for storage of instruction fragments; Andy Glew
> mentioned that someone working on Alpha had the same
> basic idea--earlier, of course.])

That might be useful if you have a BTB. What did Andy think?

> I think in the 1980s there was research on Icache
> bypass (for conflict and capacity issues), and it
> seems unfortunate that fetch behavior is not
> exploited for allocation in this manner.

Simpler to just double the cache :-) Can you find a cite or two on this?

> It sounds like the patent-pending technique might be
> rather interesting.

The decode paper for CDES goes in Monday, and there's an informal
presentation next month and another at a different venue in May.
Somewhere along then I'll splash it here and you guys can comment on our
baby. Better say good things - it is self evidently the most scrumptious
baby that has ever been :-)

Ivan

Robert Wessel

unread,
Mar 15, 2013, 12:47:24 AM3/15/13
to
On Thu, 14 Mar 2013 11:38:01 -0500, Stephen Sprunk
<ste...@sprunk.org> wrote:

>On 12-Mar-13 01:15, Robert Wessel wrote:
>> On Mon, 11 Mar 2013 07:18:37 -0700 (PDT), Tacit
>> <tacit...@gmail.com> wrote:
>>> On 11 ???, 07:23, Robert Wessel <robertwess...@yahoo.com> wrote:
>>>> As to why it was not made faster later, mostly nobody caring, not
>>>> that it wasn't possible. I expect the x87 FPU instructions to
>>>> start undergoing a dramatic slowdown in the next decade too.
>>>
>>> Probably, AMD should have got rid of x87 and LOOPcc completely,
>>> when they've made x86-64 (as they did with BCD-arithmetic, other
>>> long unused instructions, segments and prefixes).
>>
>> Perhaps. The decimal instructions were truly rarely used, and while
>> LOOP is slow, it's not in quite that same realm of disuse. x87 was
>> pretty commonly used.
>
>OTOH, the ABIs that AMD developed with MS and the Linux community made
>SSE pretty much mandatory for FP, so x87 code in 64-bit mode should fall
>into the "truly rare" category as well. Why not just remove them as
>they did with other "truly rare" instructions (and prefixes)?


I think it would be reasonable to consider existing 32-bit x87 code,
both in assembler, and compilers generating that, when considering
what to support in 64-bit land. As it turn out, there was very little
porting of x87 code to 64-bit mode, but it would have required
considerable bravery to predict that at the time.

>
>> One of AMDs prime goals was making it easy to move to the new ISA.
>> I think they could have cleaned up the encoding mess a lot in the
>> process, but they didn't. It's not like they really saved that much
>> work for people porting their compiler back ends.
>
>I'm not sure if it was compiler complexity. Let's face it, only two x86
>compilers really matter (MSVC and GCC), and both would implement
>whatever encoding and instruction changes AMD came up with.
>
>If I were in AMD's shoes at the time, I would have worried about being
>seen as changing "too much", given the spectacular failure that was
>unfolding at Intel.
>
>Also, I don't know how much of the 32-bit mode decoder logic AMD was
>able to reuse. Needing entirely separate decoders would surely have
>increased the die space required, for something that was at that point
>purely speculative. They had no idea if the industry would adopt their
>ISA or if it would just waste space that could be applied to something
>that would certainly help them in the market, eg. larger caches.


Clearly the design they came up with allows for one decoder to do both
ISAs with only a limited amount of logic. A seriously different ISA
encoding might well have required a separate decoder.

Robert Wessel

unread,
Mar 15, 2013, 12:53:01 AM3/15/13
to
On Thu, 14 Mar 2013 10:58:17 -0700, Ivan Godard <iv...@ootbcomp.com>
wrote:
Measuring the dynamic number of instruction bits (and instructions)
executed would not be so hard, and ought to get you to a fairly good
comparison. At least a lot better than looking at static sizes.
Although things like different amounts of unrolling would still impact
the dynamic size (although to a much lesser degree than the static
size), as would the quality of the optimization.

Ivan Godard

unread,
Mar 15, 2013, 1:09:07 AM3/15/13
to
On 3/14/2013 9:53 PM, Robert Wessel wrote:

> Measuring the dynamic number of instruction bits (and instructions)
> executed would not be so hard, and ought to get you to a fairly good
> comparison. At least a lot better than looking at static sizes.
> Although things like different amounts of unrolling would still impact
> the dynamic size (although to a much lesser degree than the static
> size), as would the quality of the optimization.

Depends on what you want the information for. Thus the dynamic sizes
should be most relevant when sizing the datapaths from cache to decoder,
while the static sizes would be more useful for sizing the cache itself.


Jolly Good

unread,
Mar 15, 2013, 1:34:55 AM3/15/13
to
Exception recovery can be handled easily for single-µop instructions
-- wait until the µop is at the head of the ROB and flush any
speculative state changes if it causes an exception.

I am a little less clear about how this is done for multi-µop
instructions or µcode sequencer output. If the ROB holds µops (K5, P6,
etc.), then you can no longer rely entirely on "has the head of the
ROB caused an exception?" or "have the top retire_width µops in the
ROB caused an exception?" Some notion of the macro instructions must
be kept in the ROB (as would already be implied by having separate
performance events for µop retirement and instruction retirement).
However, I am having difficulty finding any concrete statements about
how modern microarchitectures do this.

"Has any µop within the same instruction as the µop at the head of the
ROB caused an exception?" seems like it would be harder to answer. Not
impossible, nor even completely serial, but there is likely some limit
to the number of µops that can be checked within the retire stage's
cycle time. Nonetheless, it is not free. The retirement width might be
a good estimate of the numbers of µops checked for being instruction
boundaries.

If that is the case, what happens when you have an instruction that
require more µops than the retirement window? A µop in one retirement
chunk may change architectural state and a µop in the next retirement
chunk may then cause an exception. If this is µcode from the
sequencer, I assume it falls on the µcode author to explicitly handle
architected state around µops that could cause faults.

One flaw with the assumptions I've made is that e.g. P6 could complex-
decode an instruction into 4 µops, but could only retire 3 µops per
cycle. 4 µops likely does not give you enough room to deal with
manually working around architected state changes. Maybe you could
only output 4 µops if you were lucky with which could cause
exceptions. Maybe the retirement stage could check more in parallel.
Maybe I am overlooking an obvious answer!

Bruce Evans

unread,
Mar 15, 2013, 9:33:04 AM3/15/13
to
In article <v195k89c9855h8dv6...@4ax.com>,
Robert Wessel <robert...@yahoo.com> wrote:
>On Thu, 14 Mar 2013 11:38:01 -0500, Stephen Sprunk
><ste...@sprunk.org> wrote:
>
>>On 12-Mar-13 01:15, Robert Wessel wrote:
>>> On Mon, 11 Mar 2013 07:18:37 -0700 (PDT), Tacit
>>> <tacit...@gmail.com> wrote:
>>>> On 11 ???, 07:23, Robert Wessel <robertwess...@yahoo.com> wrote:
>>>>> As to why it was not made faster later, mostly nobody caring, not
>>>>> that it wasn't possible. I expect the x87 FPU instructions to
>>>>> start undergoing a dramatic slowdown in the next decade too.
>>>>
>>>> Probably, AMD should have got rid of x87 and LOOPcc completely,
>>>> when they've made x86-64 (as they did with BCD-arithmetic, other
>>>> long unused instructions, segments and prefixes).
>>>
>>> Perhaps. The decimal instructions were truly rarely used, and while
>>> LOOP is slow, it's not in quite that same realm of disuse. x87 was
>>> pretty commonly used.
>>
>>OTOH, the ABIs that AMD developed with MS and the Linux community made
>>SSE pretty much mandatory for FP, so x87 code in 64-bit mode should fall
>>into the "truly rare" category as well. Why not just remove them as
>>they did with other "truly rare" instructions (and prefixes)?

People who want to break extended precision can't do it so easily :-).
x87 code is mandatory in the 64-bit ABI, and its use is not very rare.
It is used for long doubles, and the ABI specifies the details for
long doubles.

An 80-bit extended type is needed much more with a 64-bit ABI than it
was in the 8087 in ~1980 when there were no standard ABIs and the
nonstandard ABIs were barely 16 bits. 80 bits was excessively future-proof
in 1980.

The ABI also has an optional __float128 type, at least in old versions.
When SSE supports this, the x87 can be removed without losing features
or efficiency. But the ABI prevents simply removing it.

>I think it would be reasonable to consider existing 32-bit x87 code,
>both in assembler, and compilers generating that, when considering
>what to support in 64-bit land. As it turn out, there was very little
>porting of x87 code to 64-bit mode, but it would have required
>considerable bravery to predict that at the time.

In the FreeBSD math library, we (I) try to avoid using special x87
code in 64-bit mode, and if possible and not too slow, just use the
compiler's support for long doubles. I also try to avoid using
algorithms that depend on the x87's extra precision to gain efficiency
for double precision routines. The x87 ends up with special x87 code
in 64-bit mode in only the following functions:

remainderl()
sqrtl()
rintl() and related functions
logbl()
remquo()
remquof()
remquol()
scalbn()
scalbnf()
scalbnl()

A few of the medium-weight long double functions are here because the
i387 is quite good at them and they are hard to make efficient in C.
Especially sqrtl(). No heavyweight long double functions like cosl()
are here because the i387 is quite bad at them and they can be made
more efficient and accurate in C, although not easily. logbl() and
scalbn*() shouldn't be here, since the C versions can be made competitive
with them although the currently aren't for scalbn*(). Even the double
precision remquo() is here since SSE2 apparently doesn't support this.
I'm not sure why this doesn't apply to remainder(). remainderl() and
remquol() are here since they are apparently hard to make efficient in C
(I haven't tried for these)

In 32-bit mode, the x87 is used for 48 functions, including for 8
transcendental functions where its use is a pessimization on modern
CPUs and/or loses accuracy on all CPUs. I'm slowly replacing these
uses by the same C code used in 64-bit mode. The x87 runs this C
code at about the same speed as SSE2. I could make it run faster
algorithms specialized for the x87 (depend on its builtin extra
precision so as not to have do extra precision in software) , but
don't have time to maintain the different versions.

>>> One of AMDs prime goals was making it easy to move to the new ISA.
>>> I think they could have cleaned up the encoding mess a lot in the
>>> process, but they didn't. It's not like they really saved that much
>>> work for people porting their compiler back ends.
>>
>>I'm not sure if it was compiler complexity. Let's face it, only two x86
>>compilers really matter (MSVC and GCC), and both would implement
>>whatever encoding and instruction changes AMD came up with.

FreeBSD now uses clang.

Bruce

Paul A. Clayton

unread,
Mar 15, 2013, 11:56:16 AM3/15/13
to
On Mar 15, 12:14 am, Ivan Godard <i...@ootbcomp.com> wrote:
> On 3/14/2013 7:36 PM, Paul A. Clayton wrote:
[snip]
> Wow, you cite a ton of interesting ideas. Shame you weren't on our
> effort several years ago :-)

It was once proposed that I would be useful for
brainstorming.

> I'm going to ignore all the things related to pre-parsing and trace
> cache where it cache is not simply a repository of bytes. Those are
> interesting but require either being able to to recognize instruction
> boundaries in an arbitrary byte sequence, or require post-decoration
> techniques where the trace is built as untraced raw instructions are
> parsed and executed. I've no argument that the methods work, but they
> seem mostly of value when you must deal with legacy codes, and I have
> little experience with that.

Fill-time predecode is a fairly common technique, even
for x86. With fixed-size (aligned) instructions
finding instruction boundaries is trivial and some VLEs
facilitate low overhead boundary determination (e.g.,
S/360's two-bit length indicator in a fixed position of
the first parcel or using one or two bits in each
parcel to indicate end of instruction). Even if the
line must be significantly decoded before cache fill,
this need not greatly hinder performance (some x86
implementations do this to put a stop/continue bit for
each byte to ease later decoding).

[snip]
> It's clear that the extra size is irrelevant in the absence of
> thrashing. Many programs spend all their time in tight loops that fit.
> The extra cache is an attack on Amdahl's law: if only 10% of execution
> is in code that thrashes, and thrashing causes a 10X loss of throughput,
> then getting out of thrashing mode doubles the overall performance. Of
> course, some programs will thrash in the doubled cache too, and some
> can't use it, so the doubling is only valuable in codes within a window
> of working set sizes. In my experience that window is populated enough
> that I care, YMMV.

I was under the impression that the Icache busting codes
tended to be much larger rather than just a little larger.
(I think the current Itanium's have a 512KiB L2 Icache--
even with its low code density that would probably be
greater than 128KiB for x86--and PA-RISC [for which
Itanium is a replacement] used huge L1 Icaches.)

> As for why existing chips don't make the i$1 bigger, I appeal to Andy or
> Mitch to suggest an answer. My understanding is that by and large the
> cache is made at least 32K for thrash prevention, and as much bigger as
> possible as will not cause an extra pipe stage under the clock and stage
> height constraints dictated by other parts of the microarchitecture. I'm
> told that power is more determined by wayness and fetch width (i.e.
> decode width and/or bank width) than raw size, but size does have a cost.

With good branch predictors and instruction buffering,
going from a two cycle Icache to a three cycle Icache
might involve a small penalty (<5%). Unfortunately, one
would pay that penalty even for the codes that are
friendly to a smaller Icache and only benefit for the
codes benefiting from the larger Icache.

[snip]
> Isn't this just an i$1 pulling from an i$2?

Separation of critical and less critical portions
in this way is not just an L2 cache. The index
and way would be the same (or possibly only slightly
different) and only one tag would be checked, at
least for a simple trace cache. This is effectively
a width pipelined Icache; fetching chunk A from the
fast section and A+1 from the slow section but
initiating the fetch at the same time.

(Obviously, in a non-trace cache the next block
might not be sequential, so it might act more like
an L2 that bypasses L1 (requiring tag checks and
using a different index) and a more complex
placement choice might be needed [or just accept
that some taken branches sacrifice performance]. As
noted, this is just NUCA with a recognition of
prefetchability--any access which can be prefetched
can tolerate higher access latency.)

[snip]
> That's an approach we never considered, in part because we don't keep
> track of branches and cannot recognize a branch in pre-decode because of
> the short, exposed, pipeline. By the time we know it's a branch we are
> already executing it :-) We also can't use a BTB (or rather, don't need
> one) for the same reason. It's not clear to me what advantage a BTIC has
> over a BTB. Needs thought.

The main advantage of BTIC over BTB is the next few
instructions are available earlier, avoiding a bubble
due to Icache access latency (in some designs), such
could also reduce power use slightly (a fraction of
instructions being fetched from a small cache) or
increase bandwidth (whether by avoiding a bubble or
fetching from the Icache in parallel). Instruction
buffering (before and/or after decode) and branch
prediction run ahead would reduce the usefulness of
such.

(I had proposed a function entry point cache behind
the L1 Icache with a similar latency-hiding motive,
under the assumption that function calls are major
discontinuities for fetch while sequential prefetch
works fairly well within a function.)

[snip NUCA with criticality placement informed by
prefetchability]
> But how do you determine which is which? Predicteds go in fast, and slow
> is used for post-miss fetch?

My proposal was to have the more predictable fetch
blocks go in the slow portion. Fetch redirects
would be biased toward the fast portion (thus kind
of like a BTIC). Even placement by address might
provide a modest improvement (half [or some fraction
of] the time a redirect might have no bubble). The
branch predictor would predict two blocks for fetching.

With criticality/prefetchability information, placement
might be optimized.

[snip]
> Don't understand - aren't your numbers backwards (i.e. bigger should be
> slower)?

The second portion would be the same size as the first
portion but placed more remotely (thus the square root
rule assumption--which I am guessing would not apply
for smaller wide caches and even with larger caches
layout is cannot have extremely fine granularity).

If the slow portion was twice as large (which might be
useful since many instruction fetches might not be
timing critical--i.e., the address is available early),
one might imagine a layout like:

SSSS S -> "slow" chunk
SFFS F -> "fast" chunk
SFFS

or perhaps (with a three times larger slow portion):

SSSS
SSFFSS
SSFFSS

Note: I am not a hardware person; I am just guessing
based on a vague distance estimate.

[snip]
> Cutting the number of rows doesn't buy much unless you actually shut it
> off for leakage, and I don't want to think about turning it back on.

Turning off sections of cache can be quite reasonable.
This is effectively a specialization of the core to
fit the workload (and energy and performance goals).
The reconfiguration cost would be less than for a
migration to a more appropriate core (and the benefit
would likewise be less) and the area use would be less
than with more numerous more specialized (and less
configurable) cores (and the interconnect should be
simpler).

> Cutting wayness (assuming parallel ways) would save some power. But

For instruction fetch, I think way prediction is
sufficiently accurate that parallel read might not be
necessary. However, since parallel read still seems
to be used (though it is not easy to tell based on
public information), it seems that way prediction is
not good enough.

[snip]
> That's an interesting notion. There was a bunch of academic work
> involving compiler placement of code for optimal cache behavior for
> various size/wayness/etc. They got decent numbers, but as far as I know
> commercial compilers don't do it, because the necessary trace info is
> rarely available and the targets vary too much.

It is kind of frustrating that software is less
aggressively optimized. For cache optimizations, it
probably does not help that there is diversity of
targets even in the same time period, though some
optimizations are "cache oblivious" (I dislike that
term since it implies forgetting that caches exist
rather than forgetting the specifics of cache
implementation, though I can understand the desire
to distinguish more general techniques from cache
aware techniques that exploit details of a
particular cache design.).

[snip]
> Complex hell - downright weird, as weird as a Mill :-) The drawback of
> splitting (unless you are using a pre-decoded trace cache) is that
> either your fetch granularity must be smaller, or your execution average
> granularity (basic block length) must be bigger, or internal
> fragmentation will kill you.

I agree--complex, weird, and constraining. (I was
thinking smaller fetch granularity.)

[snip OoO rename/issue]
> I can see where this might be useful if you have register renaming; we
> don't, nor have registers to rename.

Hmm. Is the Mill a transport-triggered architecture?

(Even a memory-memory architecture like AT&T's CRISP
can have "registers". CRISP used a stack cache for
"registers". One paper on the development of CRISP
suggested the possibility of a global/TLS cache,
which would have made it a bit like Itanium [or
SPARC or AMD 29k {I think}].)

[snip partially merged Icache/BTB]
> That might be useful if you have a BTB. What did Andy think?

He thought it could make sense with fixed length
instructions--obviously if an Alpha architect had
considered such. (I think the idea could be made
to work with a less classic RISC ISA, but the
complexity would be greater [and so it would be
less worthwhile, which might mean a negative value].)

[snip Icache bypass]
> Simpler to just double the cache :-) Can you find a cite or two on this?

It seems most of the work is on bypassing the data
cache (e.g., one PA-RISC implementation used a
per-page classification of non-temporal to avoid
allocation in the huge off-chip L1, providing a
modest L0 [assist?] cache). So far I have not
found any references to bypassing an Icache
(except for prefetching), but I am almost certain
that I read the idea somewhere (though I could be
misremembering and simply applying an optimization
for data accesses to instruction accesses).

(At least one StrongARM implementation had a two
way associative "non-temporal" data cache which
likewise used page annotations to determine
allocation.)

Retaining reuse information for a sufficient
period (with just hardware) might be difficult.
If (single) reuse distance for a cache block is
too great for the capacity of the Icache, it is
perhaps likely to be too great for the L2 cache
capacity (though some extra tracking might be
useful for better replacement choices).

Page-granular marking of non-reuse might be
problematic, though allocating to a special
low-associativity small cache might be practical.
However, such might be almost just a way or
way group preference indicator and not particularly
useful.

[snip]
> The decode paper for CDES goes in Monday, and there's an informal
> presentation next month and another at a different venue in May.
> Somewhere along then I'll splash it here and you guys can comment on our
> baby. Better say good things - it is self evidently the most scrumptious
> baby that has ever been :-)

Rotten eggs and tomatoes do not work very well
over the Internet. :-)

Even if criticism is harsh, I suspect that the
readers here will at least find such interesting
(based on what you have stated already).

Stephen Sprunk

unread,
Mar 15, 2013, 3:08:56 PM3/15/13
to
On 14-Mar-13 23:47, Robert Wessel wrote:
> On Thu, 14 Mar 2013 11:38:01 -0500, Stephen Sprunk
> <ste...@sprunk.org> wrote:
>> On 12-Mar-13 01:15, Robert Wessel wrote:
>>> Perhaps. The decimal instructions were truly rarely used, and
>>> while LOOP is slow, it's not in quite that same realm of disuse.
>>> x87 was pretty commonly used.
>>
>> OTOH, the ABIs that AMD developed with MS and the Linux community
>> made SSE pretty much mandatory for FP, so x87 code in 64-bit mode
>> should fall into the "truly rare" category as well. Why not just
>> remove them as they did with other "truly rare" instructions (and
>> prefixes)?
>
> I think it would be reasonable to consider existing 32-bit x87 code,
> both in assembler, and compilers generating that, when considering
> what to support in 64-bit land. As it turn out, there was very
> little porting of x87 code to 64-bit mode, but it would have
> required considerable bravery to predict that at the time.

Really? Within seconds of understanding they hadn't dropped x87, I
thought they were a bunch of morons. Granted, they had probably made
that decision years before it became public knowledge, so perhaps more
bravery would have been required, but it doesn't take a genius to see
that both Intel and AMD were already moving in the direction of SSE for
FP math.

As for porting of hand-written x87 code, if it was important enough
people would rewrite for SSE. Most of that code was probably first
written back before CPUs and compilers did a decent job, and the minor
performance benefit from porting it (even intact as x87) was not worth
the effort. Also, in most cases where performance still mattered, I
suspect folks had _already_ ported to SSE for the SIMD benefits.

Ivan Godard

unread,
Mar 15, 2013, 3:11:32 PM3/15/13
to
Yes, I know. It doesn't work for us (or at least I never found a way to
make it work) because of the throughput volume we wanted. A fixed 4 byte
encoding makes easy decode, but at a target throughput of 30 ops/cycle
we would be looking at 128-byte fetches, which has real impact on the
rest of the hierarchy and power budget. So we went with a
variable-length ultra-compact encoding.

Sim suggested that the pre-decode works for variable length when you can
amortize the cost because you are executing out of cache (assuming that
you are not forced into predecode for legacy reasons). But we are
already a house afire when in loops; our concern was to the other end of
Amdahl's law, where you are not in a loop, often not in cache at all,
and the misses are killing your average performance. There the predecode
isn't helping because you are in one-time code.

So we wanted a decode with very short latency (for the misses) that
could handle the throughput, and came up with one. That left us with the
question of whether to also have a fastpath using pre-decode where we
were in fact reusing code, loops or whatever.

The answer to that comes down to a design choice that has nothing to do
with decode per se: OOO or IO. I'm as guilty as the next guy of seeing
everything as needing a hammer because I know hammers, and I come out of
the DSP world. My basic biases are to have everything ready exactly when
you need it and stay out of the way. If by accident of career I had
worked on the NetBurst no doubt my biases would be different, but we are
what we are :-)

One good thing about OOO is that it is very good at taking advantage of
early-out. It's generally worth while to have both a fast way and a slow
way to do something, because nobody is waiting. Or rather, everybody is
always waiting. In an IO, DSP-like world everything is statically
clocked and you can't use something that arrives early. More, you have
to buffer the arrival and that complicates things for no gain. I'm not a
gates guy, but the trade off is much like that between asynchronous and
clocked circuits.

So the question is whether we could usefully take advantage of the
early-out that pre-decode would provide. Assume that we fully
pre-decode, so that we have the equivalent of ready-to-execute micros in
the cache and all that's needed is to fetch them: how much faster would
that be?

A full decode from raw bundles for us is three cycles. However, portions
of that bundle are used in the second cycle; you can think of it as if
the bundle has some ops that decode in one cycle for immediate issue.
The issue cycle of remainder data-depends on the issue cycle of those
first ops, independent of decode timing. With some heroics we might be
able to pull the rest tighter by a single cycle, but we haven't bothered
to do so because we would be getting ahead of decode, but in pre-decode
that wouldn't be an issue and we could save a cycle there. Plus
pre-decode would eliminate the first decode stage, so there is a
potential saving of two stages in the pipe in a completely optimal
pre-decode. Two cycles is not to be sneezed at. However, it's only a win
on a miss - if the predictors are doing their job then decode can take
arbitrarily long with no performance impact. I'm sure that Netburst said
the same thing :-)

So now look at the cost. Those pre-decoded micros need to convey the
same information that the undecoded ops do, except that everything has
to be exploded and normalized so that the FUs can use it directly. We
are not the x86 or 360 with short offsets and narrow immediates you have
to program around; our addresses use 32-bit offsets. Besides the
offsets, there are transient data references (i.e. register numbers on a
legacy architecture), and quite a few signal flags depending on how
horizontal the microcode is. Ball-park guess: 64 bits/micro, but only
one micro/macro because our macro is close to the machine for direct
execution. Now remember that 30 ops/cycle target throughput; that's 240
bytes/cycle from the i$ to the FUs. Ouch.

More, if for example we assume a 32k byte i$ then we have a
collision-free capacity of 4k ops, or ~128 cycles worth total. Lots of
loops fit in that, but if we are in a loop then the predictors are
humming and the fastpath decode isn't buying anything. If we are not in
a loop, 128 cycles is a third of a memory access and we will have run
out of the i$ and missed almost immediately.

When we went through the use-cases, the only one that stood out as being
a win for pre-decode was: long running loop with an internal
unpredictable branch that could be neither predicated (if-conversion)
nor speculated. It's a loop, so both sides of the branch are in cache.
It's unpredictable so we frequently miss on the branch. And we can't
hide the miss with our (extensive) predication/speculation capability
(I'll tell you about those when the filings are in. Sigh). Then we will
see those two cycles that the pre-decode would have saved us.

Conclusion: not worth it, neither the power for 240 bytes/cycle nor the
4X hit on cache capacity. YMMV.


> [snip]
>> It's clear that the extra size is irrelevant in the absence of
>> thrashing. Many programs spend all their time in tight loops that fit.
>> The extra cache is an attack on Amdahl's law: if only 10% of execution
>> is in code that thrashes, and thrashing causes a 10X loss of throughput,
>> then getting out of thrashing mode doubles the overall performance. Of
>> course, some programs will thrash in the doubled cache too, and some
>> can't use it, so the doubling is only valuable in codes within a window
>> of working set sizes. In my experience that window is populated enough
>> that I care, YMMV.
>
> I was under the impression that the Icache busting codes
> tended to be much larger rather than just a little larger.
> (I think the current Itanium's have a 512KiB L2 Icache--
> even with its low code density that would probably be
> greater than 128KiB for x86--and PA-RISC [for which
> Itanium is a replacement] used huge L1 Icaches.)

Yes, that's true. Commercial codes in general are i$ busters; both
machines simply have to pay the cost for the bigger cache. I really wish
somebody would do a thesis showing working-set size vs. % of program
curves for different loads and instruction sets. In some markets we may
need to expand the i$ too, to accommodate the much-larger busters.
However, that doesn't change the effective saving from a free doubling:
the free doubling in the encoding turns a required 4X increase to a 2X,
and 2X is cheaper than 4X.

>> As for why existing chips don't make the i$1 bigger, I appeal to Andy or
>> Mitch to suggest an answer. My understanding is that by and large the
>> cache is made at least 32K for thrash prevention, and as much bigger as
>> possible as will not cause an extra pipe stage under the clock and stage
>> height constraints dictated by other parts of the microarchitecture. I'm
>> told that power is more determined by wayness and fetch width (i.e.
>> decode width and/or bank width) than raw size, but size does have a cost.
>
> With good branch predictors and instruction buffering,
> going from a two cycle Icache to a three cycle Icache
> might involve a small penalty (<5%). Unfortunately, one
> would pay that penalty even for the codes that are
> friendly to a smaller Icache and only benefit for the
> codes benefiting from the larger Icache.

Which is why they don't do it. And why doing it for free wins :-)

> [snip]
>> Isn't this just an i$1 pulling from an i$2?
>
> Separation of critical and less critical portions
> in this way is not just an L2 cache. The index
> and way would be the same (or possibly only slightly
> different) and only one tag would be checked, at
> least for a simple trace cache. This is effectively
> a width pipelined Icache; fetching chunk A from the
> fast section and A+1 from the slow section but
> initiating the fetch at the same time.

Hmm. Needs thought. You know anybody who does it that way?


> [snip]
>> That's an approach we never considered, in part because we don't keep
>> track of branches and cannot recognize a branch in pre-decode because of
>> the short, exposed, pipeline. By the time we know it's a branch we are
>> already executing it :-) We also can't use a BTB (or rather, don't need
>> one) for the same reason. It's not clear to me what advantage a BTIC has
>> over a BTB. Needs thought.
>
> The main advantage of BTIC over BTB is the next few
> instructions are available earlier, avoiding a bubble
> due to Icache access latency (in some designs), such
> could also reduce power use slightly (a fraction of
> instructions being fetched from a small cache) or
> increase bandwidth (whether by avoiding a bubble or
> fetching from the Icache in parallel). Instruction
> buffering (before and/or after decode) and branch
> prediction run ahead would reduce the usefulness of
> such.

Yeah. We have run-ahead prediction, so only of value on misses, and
that's when both the BTB and BTIC would have the wrong thing anyway.

> (I had proposed a function entry point cache behind
> the L1 Icache with a similar latency-hiding motive,
> under the assumption that function calls are major
> discontinuities for fetch while sequential prefetch
> works fairly well within a function.)

Calls don't seem to present any more of a problem than regular branches
for us. I suspect that it matters more for predictors that need static
targets because they are pre-executing branches. We do run-ahead and
don't have the branches to look at anyway, even if we could parse them out.

The biggest problem with calls is function pointers; our ISA supports
label pointers too, but they are very rare. I don't understand how
having a separate function-entry cache would be better at handling
function pointers than simply letting the entry line sit in regular i$1
and get treated like a branch. Please explain.

> [snip NUCA with criticality placement informed by
> prefetchability]
>> But how do you determine which is which? Predicteds go in fast, and slow
>> is used for post-miss fetch?
>
> My proposal was to have the more predictable fetch
> blocks go in the slow portion. Fetch redirects
> would be biased toward the fast portion (thus kind
> of like a BTIC). Even placement by address might
> provide a modest improvement (half [or some fraction
> of] the time a redirect might have no bubble). The
> branch predictor would predict two blocks for fetching.
>
> With criticality/prefetchability information, placement
> might be optimized.

OK, if I understand you (not sure), the predictor doesn't actually
predict, it just gives you both addresses with weights. You could keep
the second address in the fast cache with the line of the first
prediction, like a next-like predictor, to let you have bigger tables
for the primary prediction. Of course, that would require that the
second choice is the same regardless of the path that led to the first
choice, which might be false enough to obviate the saving. Nobody uses
next-line predictors any more either.

Whether it paid would depend on how frequently the second (as opposed to
the first or the Nth) was the right path. I'd be concerned about clutter
in the traffic. You'd want quality bits on the second choice, not
bothering with the second fetch if it had proven unreliable.

<snip>

> [snip OoO rename/issue]
>> I can see where this might be useful if you have register renaming; we
>> don't, nor have registers to rename.
>
> Hmm. Is the Mill a transport-triggered architecture?

No, it has regular add/sub/etc opcodes. I found TTA to be very
interesting but not working well with a rich operation set, The Mill has
around 1000 operations at the assembly-language level, although of
course the great majority of those are merely variations on a basic
primitive: shift(l/r)(s/u)(<overflow behavior>), etc.

> (Even a memory-memory architecture like AT&T's CRISP
> can have "registers". CRISP used a stack cache for
> "registers". One paper on the development of CRISP
> suggested the possibility of a global/TLS cache,
> which would have made it a bit like Itanium [or
> SPARC or AMD 29k {I think}].)

Well, we do have specRegs (SPRs) - 40 or so, although only maybe a dozen
are reachable from code as registers; the rest are protected behind
MMIO. But as general registers to hold intermediate results or
fast-access variables, no, no more than a stack machine does.
We of course have cache-bypass load and store available; I think every
machine does. But its an attribute of the operation, not the address.

Ivan

Stephen Sprunk

unread,
Mar 15, 2013, 3:59:00 PM3/15/13
to
On 15-Mar-13 08:33, Bruce Evans wrote:
> In article <v195k89c9855h8dv6...@4ax.com>, Robert
> Wessel <robert...@yahoo.com> wrote:
>> On Thu, 14 Mar 2013 11:38:01 -0500, Stephen Sprunk
>> <ste...@sprunk.org> wrote:
>>> OTOH, the ABIs that AMD developed with MS and the Linux community
>>> made SSE pretty much mandatory for FP, so x87 code in 64-bit mode
>>> should fall into the "truly rare" category as well. Why not just
>>> remove them as they did with other "truly rare" instructions (and
>>> prefixes)?
>
> People who want to break extended precision can't do it so easily
> :-). x87 code is mandatory in the 64-bit ABI, and its use is not very
> rare. It is used for long doubles, and the ABI specifies the details
> for long doubles.

... but we are discussing what AMD could have done differently in said
ABI, so that argument is moot.

> The ABI also has an optional __float128 type, at least in old
> versions. When SSE supports this, the x87 can be removed without
> losing features or efficiency.

IMHO, if extended precision is really that big a deal, they should have
just added 128-bit long doubles to SSE (for both x86-32 and -64) and
dropped x87 entirely from the x86-64 ABI.

I don't think AMD's hardware could do that at the time, but it would
have been worth adding it (or at least faking it via the existing 80-bit
hardware) to remove that as justification for x87.

OTOH, other platforms get by just fine with only double precision.

> But the ABI prevents simply removing it.

Again, that argument is moot.

>>> I'm not sure if it was compiler complexity. Let's face it, only
>>> two x86 compilers really matter (MSVC and GCC), and both would
>>> implement whatever encoding and instruction changes AMD came up
>>> with.
>
> FreeBSD now uses clang.

x86-64 predates even LLVM (Clang's back end) by several years. Clang
itself didn't come until a couple years after LLVM, when the LLVM team
got fed up with GCC's front end.

Terje Mathisen

unread,
Mar 15, 2013, 5:06:43 PM3/15/13
to
Bruce Evans wrote:
> In the FreeBSD math library, we (I) try to avoid using special x87
> code in 64-bit mode, and if possible and not too slow, just use the
> compiler's support for long doubles. I also try to avoid using
> algorithms that depend on the x87's extra precision to gain efficiency
> for double precision routines. The x87 ends up with special x87 code
> in 64-bit mode in only the following functions:
>
> remainderl()
> sqrtl()
> rintl() and related functions
> logbl()
> remquo()
> remquof()
> remquol()
> scalbn()
> scalbnf()
> scalbnl()

OK, so we have the following groups:

rem*, scalbn*, sqrt, log and rint

rem* requires exact division, but scalbn is just exponent manipulation,
isn't it?

Why can't you do that easily in SSE using integer operations on the
exponent field?

sqrt makes sense, since x87 hw can do this at similar effort to FDIV.

log is very easy to get approximately right, but I suspect much harder
to get to sub-1.0 ulp?

Finally, what about rintl?

SSE has directed rounding operations, but only for 32-bit values, right?

Is it really worthwhile to transfer data from SSE to x87 just to convert
a 64-bit value?


> A few of the medium-weight long double functions are here because the
> i387 is quite good at them and they are hard to make efficient in C.
> Especially sqrtl(). No heavyweight long double functions like cosl()
> are here because the i387 is quite bad at them and they can be made
> more efficient and accurate in C, although not easily. logbl() and
> scalbn*() shouldn't be here, since the C versions can be made competitive
> with them although the currently aren't for scalbn*(). Even the double
> precision remquo() is here since SSE2 apparently doesn't support this.
> I'm not sure why this doesn't apply to remainder(). remainderl() and
> remquol() are here since they are apparently hard to make efficient in C
> (I haven't tried for these)

OK, so what about the rounding function?

Inspect the exponent, branch to alternate paths:

a) exp > 63: Overflow
b) exp >= 53: Exact conversion
c) exp >= 0: Check the bits that gets shifted out, handle rounding +
inexact trap

Is the problem that you need to generate a hw trap of the right kind?

>
> In 32-bit mode, the x87 is used for 48 functions, including for 8
> transcendental functions where its use is a pessimization on modern
> CPUs and/or loses accuracy on all CPUs. I'm slowly replacing these
> uses by the same C code used in 64-bit mode. The x87 runs this C
> code at about the same speed as SSE2. I could make it run faster
> algorithms specialized for the x87 (depend on its builtin extra
> precision so as not to have do extra precision in software) , but
> don't have time to maintain the different versions.

That makes a lot of sense.

>>> I'm not sure if it was compiler complexity. Let's face it, only two x86
>>> compilers really matter (MSVC and GCC), and both would implement
>>> whatever encoding and instruction changes AMD came up with.
>
> FreeBSD now uses clang.

The FreeBSD team is doing a great job, I have been using it for all my
ntp stratum 1 servers for many years now. :-)

I also use a laptop with FreeBSD as my home gateway machine, besides
being an ntp pool server (ntp.tmsw.no) it is also my leafnode news
server (i.e. carrying this post) and the ssh gateway to my home network. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Paul A. Clayton

unread,
Mar 15, 2013, 7:19:43 PM3/15/13
to
On Mar 15, 3:11 pm, Ivan Godard <i...@ootbcomp.com> wrote:

[snip]
> Yes, I know. It doesn't work for us (or at least I never found a way to
> make it work) because of the throughput volume we wanted. A fixed 4 byte
> encoding makes easy decode, but at a target throughput of 30 ops/cycle
> we would be looking at 128-byte fetches, which has real impact on the
> rest of the hierarchy and power budget. So we went with a
> variable-length ultra-compact encoding.

With your throughput targets, clean sheet encoding, etc.,
predecoding might be more obviously unattractive.

(30 ops/cycle still seems incredible, which makes
waiting that much more difficult!)

[snip]

(BTW, thanks for the detailed explanation.)

> On 3/15/2013 8:56 AM, Paul A. Clayton wrote:
>> Separation of critical and less critical portions
>> in this way is not just an L2 cache.  The index
>> and way would be the same (or possibly only slightly
>> different) and only one tag would be checked, at
>> least for a simple trace cache.  This is effectively
>> a width pipelined Icache; fetching chunk A from the
>> fast section and A+1 from the slow section but
>> initiating the fetch at the same time.
>
> Hmm. Needs thought. You know anybody who does it that way?

As far as I know, it is only one of my odd ideas. It
is *possible* that the Pentium 4 trace cache used such
a mechanism since (if I recall correctly) it fetched
two half-traces one after the other (3 µops per cycle
with traces containing up to 6 µops--in the initial
implementation); but I have not read anything indicating
that such was the case and it seems a sufficiently odd
and questionable (in terms of utility vs. complexity)
idea that I would be inclined to doubt that it was used.

[snip]

> The biggest problem with calls is function pointers; our ISA supports
> label pointers too, but they are very rare. I don't understand how
> having a separate function-entry cache would be better at handling
> function pointers than simply letting the entry line sit in regular i$1
> and get treated like a branch. Please explain.

Think of it as something like a victim cache with a
frequency of use and reduced (sequential)
prefetchability bias. Function entry points are
frequently used and are not friendly to sequential
prefetch.

(Incidentally, such *might* be used to facilitate BTB
entry compression, possibly with speculation rather
than ensuring continued validity of the pointers to
entries in the function entry point instruction cache.)

When posted here some time ago, the idea was generally
regarded as unprofitable, and I mentioned it as a
point of interest. (I suspect its utility is at best
marginal, but I think it is interesting particularly
in exploiting assumed information with respect to
reuse and prefetchability.)

[snip]
>> My proposal was to have the more predictable fetch
>> blocks go in the slow portion.  Fetch redirects
>> would be biased toward the fast portion (thus kind
>> of like a BTIC).  Even placement by address might
>> provide a modest improvement (half [or some fraction
>> of] the time a redirect might have no bubble).  The
>> branch predictor would predict two blocks for fetching.
>
>> With criticality/prefetchability information, placement
>> might be optimized.
>
> OK, if I understand you (not sure), the predictor doesn't actually
> predict, it just gives you both addresses with weights. You could keep
> the second address in the fast cache with the line of the first
> prediction, like a next-like predictor, to let you have bigger tables
> for the primary prediction. Of course, that would require that the
> second choice is the same regardless of the path that led to the first
> choice, which might be false enough to obviate the saving. Nobody uses
> next-line predictors any more either.

What I was suggesting was something like a next-two-lines
predictor. I admit that I have not really thought through
what would be required (the trace cache use is obvious, for
a regular Icache the management of placement and directing
fetch is less obvious), having only considered the
possibility that "prefetchability" could be exploited.

[snip]

> We of course have cache-bypass load and store available; I think every
> machine does. But its an attribute of the operation, not the address.

For most ISAs, it seems cache behavior is generally
controlled by page attributes. (x86 does have write
combining store instructions, but Itanium seems to be
a bit odd in providing cache behavior hints in load
and store instructions [initially non-temporal in
level N indicators, but, IIRC, the recent Itanium 9500
{Poulson} can somewhat adjust the meaning of the hint
bits].)

For data accesses, page-granular control might be
more-or-less adequate for skipping a level of the
memory hierarchy or even just hinting that the
temporal locality should be more closely monitored
or assumed biased.

For any such cache hint or directive, a significant
problem comes in the characteristics of a cache
not being architected (and sharing or even power
saving potentially altering effective capacity and
associativity).

Side thought: Perhaps a last local use hint might
be useful with the semantic of sending the cache
block to its home or a default location (such as
the last reader [for producer consumer]). This
would not quite be message passing, but would be
*somewhat* similar.

christian.bau

unread,
Mar 15, 2013, 7:59:01 PM3/15/13
to
On Mar 15, 7:08 pm, Stephen Sprunk <step...@sprunk.org> wrote:

> Really?  Within seconds of understanding they hadn't dropped x87, I
> thought they were a bunch of morons.  Granted, they had probably made
> that decision years before it became public knowledge, so perhaps more
> bravery would have been required, but it doesn't take a genius to see
> that both Intel and AMD were already moving in the direction of SSE for
> FP math.

All the x86 64-bit processors are also x86-32 bit processors, and you
surely want to continue running 32 bit code for a while, including 32
bit code that uses x87.

On the other hand, I think on 64 bit I would have removed x87 and MMX
registers and instructions completely.

Now how do you implement 80 bit fp? Choices are:

1. Still in hardware; while 32 bit code fetches operands from x87
registers, 64 bit fetches operands from XMM registers.
2. Emulate FP using integer instructions on 64 bit, trap and emulate
in the interrupt handler on 32 bit code. x87 emulation can be done
quite fast, especially if you have 64 bit integers including 64x64 bit
multiply.

Ivan Godard

unread,
Mar 15, 2013, 9:13:06 PM3/15/13
to
On 3/15/2013 4:19 PM, Paul A. Clayton wrote:
> On Mar 15, 3:11 pm, Ivan Godard <i...@ootbcomp.com> wrote:
>
> [snip]
>> Yes, I know. It doesn't work for us (or at least I never found a way to
>> make it work) because of the throughput volume we wanted. A fixed 4 byte
>> encoding makes easy decode, but at a target throughput of 30 ops/cycle
>> we would be looking at 128-byte fetches, which has real impact on the
>> rest of the hierarchy and power budget. So we went with a
>> variable-length ultra-compact encoding.
>
> With your throughput targets, clean sheet encoding, etc.,
> predecoding might be more obviously unattractive.
>
> (30 ops/cycle still seems incredible, which makes
> waiting that much more difficult!)

That's the target for a high-end family member; bottom-end does 5/cycle
max. We don't know whether we can sustain that rate, and won't for a
while - a lot depends on the fab tech of the day when we actually try to
build one. Heat dissipation seems the most likely issue at this point.
But we know that the limit will not be the decode rate.

> [snip]
>
> (BTW, thanks for the detailed explanation.)
>

> [snip]
>
>> The biggest problem with calls is function pointers; our ISA supports
>> label pointers too, but they are very rare. I don't understand how
>> having a separate function-entry cache would be better at handling
>> function pointers than simply letting the entry line sit in regular i$1
>> and get treated like a branch. Please explain.
>
> Think of it as something like a victim cache with a
> frequency of use and reduced (sequential)
> prefetchability bias. Function entry points are
> frequently used and are not friendly to sequential
> prefetch.

Perhaps that's why it doesn't seem to make much difference for us - we
don't do sequential prefetch.


>> We of course have cache-bypass load and store available; I think every
>> machine does. But its an attribute of the operation, not the address.
>
> For most ISAs, it seems cache behavior is generally
> controlled by page attributes. (x86 does have write
> combining store instructions, but Itanium seems to be
> a bit odd in providing cache behavior hints in load
> and store instructions [initially non-temporal in
> level N indicators, but, IIRC, the recent Itanium 9500
> {Poulson} can somewhat adjust the meaning of the hint
> bits].)

This runs afoul of us having virtual caches - there's no TLB to mark
pages in before you get to the cache. There is a PLB, and I suppose one
could add entries for address ranges (no pages at this point) with
access characteristics, but that hardware orientation is really at
cross-purposes with the process orientation of protection.

All the no-cache code I know of is *very* well aware of what it is
doing, so kicking the issue to the software I don't expect to be an
issue. But the eventual user community may tell me different :-)

> For data accesses, page-granular control might be
> more-or-less adequate for skipping a level of the
> memory hierarchy or even just hinting that the
> temporal locality should be more closely monitored
> or assumed biased.
>
> For any such cache hint or directive, a significant
> problem comes in the characteristics of a cache
> not being architected (and sharing or even power
> saving potentially altering effective capacity and
> associativity).
>
> Side thought: Perhaps a last local use hint might
> be useful with the semantic of sending the cache
> block to its home or a default location (such as
> the last reader [for producer consumer]). This
> would not quite be message passing, but would be
> *somewhat* similar.
>

We've looked at a "release" semantics operation to let the coherence
logic get a head start, but haven't put it in so far. But our multi-core
level stuff is far from complete - we know that what we have will work
with standard MOESI etc. protocols, and we have eliminated false
sharing, but I'm pretty sure there's more gold in that area when we
actually have cores to multi :-)

Bruce Evans

unread,
Mar 16, 2013, 6:31:58 AM3/16/13
to
In article <khvuet$7t4$1...@dont-email.me>,
Stephen Sprunk <ste...@sprunk.org> wrote:
>On 15-Mar-13 08:33, Bruce Evans wrote:
>> In article <v195k89c9855h8dv6...@4ax.com>, Robert
>> Wessel <robert...@yahoo.com> wrote:
>>> On Thu, 14 Mar 2013 11:38:01 -0500, Stephen Sprunk
>>> <ste...@sprunk.org> wrote:
>>>> OTOH, the ABIs that AMD developed with MS and the Linux community
>>>> made SSE pretty much mandatory for FP, so x87 code in 64-bit mode
>>>> should fall into the "truly rare" category as well. Why not just
>>>> remove them as they did with other "truly rare" instructions (and
>>>> prefixes)?
>>
>> People who want to break extended precision can't do it so easily
>> :-). x87 code is mandatory in the 64-bit ABI, and its use is not very
>> rare. It is used for long doubles, and the ABI specifies the details
>> for long doubles.
>
>... but we are discussing what AMD could have done differently in said
>ABI, so that argument is moot.

Not really. They couldn't have dropped x87 support without providing
something that works better (SSE with at least 80-bit precision and
faster operations). Otherwise all algorithms that depend on the
extra precision would break.

>> The ABI also has an optional __float128 type, at least in old
>> versions. When SSE supports this, the x87 can be removed without
>> losing features or efficiency.
>
>IMHO, if extended precision is really that big a deal, they should have
>just added 128-bit long doubles to SSE (for both x86-32 and -64) and
>dropped x87 entirely from the x86-64 ABI.

>I don't think AMD's hardware could do that at the time,

Likely.

>but it would
>have been worth adding it (or at least faking it via the existing 80-bit
>hardware) to remove that as justification for x87.

Apparently it is still not so easy for the hardware, since otherwise it
would have been implemented already.

sparc supports 128-bit long doubles, but it died before it supported
them in hardware. I think it had opcodes for them from the beginning
for 64-bit mode (when was that?). I think their only use is a checkbox
for marketing, since emulating them in software makes 128-bit long
doubles about 3000 times slower than x87 long doubles (for scalar code,
measured in cycle counts, in not very good emulations). Sparc floats
and doubles are only 2 or 3 times slower than in x87 or SSE.

>OTOH, other platforms get by just fine with only double precision.

So does FreeBSD, mostly. In around 1993, just before gcc started
supporting long doubles as 80-bit ones, I made the default rounding
precision in FreeBSD and Linux only 53 bits. This reduces precision
bugs for 64-bit doubles although not for 32-bit floats. It almost
completely breaks 80-bit long doubles (they have extended exponent
range but mostly only 53-bit precision). Linux changed this as soon
as gcc started supporting long doubles, but FreeBSD never did. FreeBSD
is still missing library support for a few long double functions.

>> But the ABI prevents simply removing it.
>
>Again, that argument is moot.

I'm taling about the current ABI. 32-bit mode didn't even really have
one, so changing it would have been relatively easy. Now the 64-bit
mode ABI has been established for 10 years, so it will be hard to change.

Bruce

Bruce Evans

unread,
Mar 16, 2013, 7:50:17 AM3/16/13
to
In article <34fc1a-...@ntp-sure.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Bruce Evans wrote:
>> In the FreeBSD math library, we (I) try to avoid using special x87
>> code in 64-bit mode, and if possible and not too slow, just use the
>> compiler's support for long doubles. I also try to avoid using
>> algorithms that depend on the x87's extra precision to gain efficiency
>> for double precision routines. The x87 ends up with special x87 code
>> in 64-bit mode in only the following functions:
>>
>> remainderl()
>> sqrtl()
>> rintl() and related functions
>> logbl()
>> remquo()
>> remquof()
>> remquol()
>> scalbn()
>> scalbnf()
>> scalbnl()
>
>OK, so we have the following groups:
>
>rem*, scalbn*, sqrt, log and rint
>
>rem* requires exact division, but scalbn is just exponent manipulation,
>isn't it?
>
>Why can't you do that easily in SSE using integer operations on the
>exponent field?

Normalization and handing error cases is painful. The C version uses
this method, and it takes 26 lines of excessively concise code to
handle all the cases.

It's better (for x87 and SSE) to only use bits for constructing the
exponent: use the ordinary multiplication x * 2**n. 2**n could be by
table lookup, but the table would be too large, so construct 2**n by
modifying the exponent of 1.0[FL]. Arrange pipelines so that this is
done early so it is almost free. The multiplication then runs faster
than the special x87 instruction, and gives the same error handling.
Unfortunately, 2**n is not representable for all the needed cases (much
the same cases as when you subtract n from x's exponent), so some
complications and branches are needed. So it still seems hard to beat
the x87 in even an inline C function. I use this method mainly for
exp*(), where it is easier to caclulate 2**n well in advance.

>sqrt makes sense, since x87 hw can do this at similar effort to FDIV.
>
>log is very easy to get approximately right, but I suspect much harder
>to get to sub-1.0 ulp?

log is actually easier to get right in software than sqrt (not very
easy), since perfect rounding in all rounding modes is only required
for sqrt. For log*(), I get 6 extra bits of precision (sub 0.52 ulps
after rounding these bits) about twice as fast on average as the x87.
I haven't fully tested the accuracy of x87 log, but it seems to be
within 1 ulp as documented. That is with 64-bit ulps, so it has more
than 6 extra relative to double precision and is a little more accurate
than my version after rounding.

>Finally, what about rintl?
>
>SSE has directed rounding operations, but only for 32-bit values, right?

Yes, and these are used in 64-bit mode for floats and doubles. The
"related functions" are other long double ones like lrintl().

>Is it really worthwhile to transfer data from SSE to x87 just to convert
>a 64-bit value?

Depends how fast the SSE and x87 conversions are, but in general you
should never convert between SSE and x87 in either direction since
the transfer is inherently slow and causes extra dependencies which
tend to stall pipelines. However, for non-inline functions, passing
the parameters causes similar dependencies anyway. Compilers now
have builtin inlines for some utility functions and might have them
for rint*(). Probably not for scalbn*().

>> A few of the medium-weight long double functions are here because the
>> i387 is quite good at them and they are hard to make efficient in C.
>> ...
>
>OK, so what about the rounding function?
>
>Inspect the exponent, branch to alternate paths:
>
>a) exp > 63: Overflow
>b) exp >= 53: Exact conversion
>c) exp >= 0: Check the bits that gets shifted out, handle rounding +
>inexact trap
>
>Is the problem that you need to generate a hw trap of the right kind?

FreeBSD uses a shifting method for rintl() in case (c).
(rintl() doesn't convert to an integer,
so there is no overflow and only a trap for signaling NaNs.
lrintl() would have to generate an overflow exception and
perhaps it is best implemented as (long)rintl()).
The shifting method is basically rintl(x) = (x + 0x1p63) - 0x1p63
for 64-bit precision. x < 0 and especially negative zero need special
handling. Otherwise this formula does the right thing in all rounding
modes.

In other precisions (and even in long double precision if it is exotic),
There are technical problems with extra precision and double rounding.
These only affect 32-bit mode on x86. FreeBSD still uses the generic
fdlibm/Cygnus routines for rint() and rintf(). These do almost
everything in bits. These have large slow messes to handle the double
rounding and smaller messes to handle the extra precision. Handling
all rounding modes doesn't get any easier when done in bits.

I tried to use the shifting method in all precisions, but forgot about
the double rounding problem and only have buggy versions with the extra
precision part optimized.

The round*() functions are probably similar. These are new in C99 and
subtly different from the rint*() functions. FreeBSD only has unoptimized
C versions of them.

Bruce

Terje Mathisen

unread,
Mar 16, 2013, 1:17:46 PM3/16/13
to
Bruce Evans wrote:
> The shifting method is basically rintl(x) = (x + 0x1p63) - 0x1p63
> for 64-bit precision. x < 0 and especially negative zero need special
> handling. Otherwise this formula does the right thing in all rounding
> modes.

This is actually the method I prefer myself, but with 1.5 as the
constant: This limits maximum mantissa sizes by an additional bit, but
handles both negative and positive numbers.

I.e. not really usable in a generic library routine, but the fastest
approach when you control maximum input sizes.

Mask away the exponent part and you're left with a
>
> In other precisions (and even in long double precision if it is exotic),
> There are technical problems with extra precision and double rounding.
> These only affect 32-bit mode on x86. FreeBSD still uses the generic
> fdlibm/Cygnus routines for rint() and rintf(). These do almost
> everything in bits. These have large slow messes to handle the double
> rounding and smaller messes to handle the extra precision. Handling
> all rounding modes doesn't get any easier when done in bits.
>
> I tried to use the shifting method in all precisions, but forgot about
> the double rounding problem and only have buggy versions with the extra
> precision part optimized.
>
> The round*() functions are probably similar. These are new in C99 and
> subtly different from the rint*() functions. FreeBSD only has unoptimized
> C versions of them.
>
> Bruce
>


Stephen Sprunk

unread,
Mar 16, 2013, 2:26:51 PM3/16/13
to
On 16-Mar-13 05:31, Bruce Evans wrote:
> In article <khvuet$7t4$1...@dont-email.me>, Stephen Sprunk
> <ste...@sprunk.org> wrote:
>> On 15-Mar-13 08:33, Bruce Evans wrote:
>>> People who want to break extended precision can't do it so
>>> easily :-). x87 code is mandatory in the 64-bit ABI, and its use
>>> is not very rare. It is used for long doubles, and the ABI
>>> specifies the details for long doubles.
>>
>> ... but we are discussing what AMD could have done differently in
>> said ABI, so that argument is moot.
>
> Not really. They couldn't have dropped x87 support without
> providing something that works better (SSE with at least 80-bit
> precision and faster operations). Otherwise all algorithms that
> depend on the extra precision would break.

Those algorithms already "broke" on any system other than x86, so having
them "break" on x86-64 wouldn't be that novel.

>>> The ABI also has an optional __float128 type, at least in old
>>> versions. When SSE supports this, the x87 can be removed without
>>> losing features or efficiency.
>>
>> IMHO, if extended precision is really that big a deal, they should
>> have just added 128-bit long doubles to SSE (for both x86-32 and
>> -64) and dropped x87 entirely from the x86-64 ABI.
>
>> I don't think AMD's hardware could do that at the time,
>
> Likely.
>
>> but it would have been worth adding it (or at least faking it via
>> the existing 80-bit hardware) to remove that as justification for
>> x87.
>
> Apparently it is still not so easy for the hardware, since otherwise
> it would have been implemented already.

I'm not so sure about that, at least the faking part. The silicon to
load 80 significant bits (already supported by the x87 hardware) from a
128-bit memory object should be trivial.

If anyone cared about real 128-bit precision, that would have been
relatively easy to add later without changing the ABI.

>>> But the ABI prevents simply removing it.
>>
>> Again, that argument is moot.
>
> I'm taling about the current ABI. 32-bit mode didn't even really
> have one, so changing it would have been relatively easy.

Of course it did:
http://www.sco.com/developers/devspecs/abi386-4.pdf

The copyright stretches from 1990 to 1996, which implies the range of
years in which it was under development.

Granted, that's SysV, but Linux et al used it as well. Microsoft, of
course, had their own ABI, but them doing something nonstandard is not
exactly news.

> Now the 64-bit mode ABI has been established for 10 years, so it
> will be hard to change.

Again, we were discussing what AMD could/should have done differently in
the x86-64 ABI at the time it was first developed, so what was actually
done is moot.

Stephen Sprunk

unread,
Mar 16, 2013, 5:16:13 PM3/16/13
to
On 15-Mar-13 18:59, christian.bau wrote:
> On Mar 15, 7:08 pm, Stephen Sprunk <step...@sprunk.org> wrote:
>> Really? Within seconds of understanding they hadn't dropped x87,
>> I thought they were a bunch of morons. Granted, they had probably
>> made that decision years before it became public knowledge, so
>> perhaps more bravery would have been required, but it doesn't take
>> a genius to see that both Intel and AMD were already moving in the
>> direction of SSE for FP math.
>
> All the x86 64-bit processors are also x86-32 bit processors, and
> you surely want to continue running 32 bit code for a while,
> including 32 bit code that uses x87.
>
> On the other hand, I think on 64 bit I would have removed x87 and
> MMX registers and instructions completely.

That's what I meant.

> Now how do you implement 80 bit fp? Choices are:
>
> 1. Still in hardware; while 32 bit code fetches operands from x87
> registers, 64 bit fetches operands from XMM registers. 2. Emulate FP
> using integer instructions on 64 bit, trap and emulate in the
> interrupt handler on 32 bit code. x87 emulation can be done quite
> fast, especially if you have 64 bit integers including 64x64 bit
> multiply.

Single and double precision FP ops could easily be emulated via SSE, as
could extended precision FP ops if you enhanced SSE to include quad
precision FP. If not, the latter would require integer ops.

There's already a standard for doing x87 emulation in software when no
coprocessor is present, so you wouldn't even need silicon to do it.

That emulation wouldn't be necessary in x86-64 if the ABI explicitly
dropped support for x87/MMX; all new apps would be forced to use SSE
natively for FP.

Paul A. Clayton

unread,
Mar 16, 2013, 7:45:41 PM3/16/13
to
On Mar 15, 12:14 am, Ivan Godard <i...@ootbcomp.com> wrote:
> On 3/14/2013 7:36 PM, Paul A. Clayton wrote:
[snip]
>> I think in the 1980s there was research on Icache
>> bypass (for conflict and capacity issues), and it
>> seems unfortunate that fetch behavior is not
>> exploited for allocation in this manner.
>
> Simpler to just double the cache :-) Can you find a cite or two on this?

I found a 1990 paper ("Power and Performance Tradeoffs
using Various Caching Strategies", R. Iris Bahar et al.)
which explored uses of a small fully associative cache
(victim, non-temporal, speculative, penalty) in
combination with a direct mapped cache. The last method
was only explored for Icache and used a criticality
estimate (instruction and issue buffer fullness) to
choose placement; performance improved 0% to almost 25%
with an median improvement of about 6%.

The speculative method estimated confidence of branches
and data and/or instruction accesses of low confidence
were placed in the side buffer. Performance
improvement for the instruction-only form ranged from
-0.197% to 22.883% (median just under 6%). Interestingly,
the writers mention that random placement into the side
buffer also improved performance (and energy efficiency).

The non-temporal method did not work well in their
configuration because it required more L2 accesses
to store the temporality information. Performance
for integer programs decreased by 3% to 20% and
even fp programs were not universally better. (This
used per cache block tracking not specific to
instruction or data accesses.)

Obviously this study is less applicable to modern
systems--direct mapped 8KiB L1s. (It also did not
use placement optimized code, which _is_ realistic
for current systems.) An 8KiB 2-way Icache (assuming
same 1 cycle latency as direct mapped) had
substantially better performance than speculative
and penalty buffer on all but one benchmark; so it
appears the benefit of bypass is questionable.
There is also the caveat that the benchmarks were
based on SPEC95, so these were probably not very
Icache intensive (though a 16KiB direct mapped
single cycle L1 improved performance by 19% or more
on most of the programs).

(Bypass or more clever replacement might still have
a place even in an L1 Icache. It might also be
interesting to see the effect of L2/L3 bypass,
particularly in a 3 level cache with some degree of
non-inclusion.)

Bruce Evans

unread,
Mar 17, 2013, 1:59:25 AM3/17/13
to
In article <q2me1a-...@ntp-sure.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Bruce Evans wrote:
>> The shifting method is basically rintl(x) = (x + 0x1p63) - 0x1p63
>> for 64-bit precision. x < 0 and especially negative zero need special
>> handling. Otherwise this formula does the right thing in all rounding
>> modes.
>
>This is actually the method I prefer myself, but with 1.5 as the
>constant: This limits maximum mantissa sizes by an additional bit, but
>handles both negative and positive numbers.

You mean 1.5 times the relevant power of 2.

>I.e. not really usable in a generic library routine, but the fastest
>approach when you control maximum input sizes.

FreeBSD uses this method in non-general library routines (e.g., as part
of range reduction for exp2()). When I tried using in rint*(), I just
got broken versions since I had forgotten its limitations, and after
fixing the buggy cases it didn't seem to be much use there.

Bruce

Terje Mathisen

unread,
Mar 17, 2013, 8:14:27 AM3/17/13
to
Bruce Evans wrote:
> In article <q2me1a-...@ntp-sure.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>> This is actually the method I prefer myself, but with 1.5 as the
>> constant: This limits maximum mantissa sizes by an additional bit, but
>> handles both negative and positive numbers.
>
> You mean 1.5 times the relevant power of 2.

Naturally.
>
>> I.e. not really usable in a generic library routine, but the fastest
>> approach when you control maximum input sizes.
>
> FreeBSD uses this method in non-general library routines (e.g., as part
> of range reduction for exp2()). When I tried using in rint*(), I just
> got broken versions since I had forgotten its limitations, and after
> fixing the buggy cases it didn't seem to be much use there.

:-(

Terje

Michael S

unread,
Mar 18, 2013, 6:31:15 PM3/18/13
to
On Mar 16, 8:26 pm, Stephen Sprunk <step...@sprunk.org> wrote:
> On 16-Mar-13 05:31, Bruce Evans wrote:
>
> > In article <khvuet$7t...@dont-email.me>, Stephen Sprunk
Microsoft's ABI was established in 1993. So, if SysV released theirs
in 1996, I am not sure who exactly was doing something nonstandard.

BTW, Microsoft's IA32 ABI does not support 80-bit precision at all.
Probably because Dave Cutler was Alpha bigot.

Stephen Sprunk

unread,
Mar 18, 2013, 7:39:29 PM3/18/13
to
Look at the copyright dates for the SysV ABI again: 1990-1996 means it
was first published in 1990 and last modified in 1996.

> BTW, Microsoft's IA32 ABI does not support 80-bit precision at all.
> Probably because Dave Cutler was Alpha bigot.

... meaning that x87's support for 80-bit precision is inaccessible to
what, 99% of users? There should be no problem removing it, then.

Michael S

unread,
Mar 19, 2013, 8:07:56 AM3/19/13
to
On Mar 19, 1:39 am, Stephen Sprunk <step...@sprunk.org> wrote:
> On 18-Mar-13 17:31, Michael S wrote:
>
>
>
>
>
>
>
>
>
> > On Mar 16, 8:26 pm, Stephen Sprunk <step...@sprunk.org> wrote:
> >> On 16-Mar-13 05:31, Bruce Evans wrote:
> >>> I'm taling about the current ABI.  32-bit mode didn't even really
> >>> have one, so changing it would have been relatively easy.
>
> >> Of course it did:
> >>http://www.sco.com/developers/devspecs/abi386-4.pdf
>
> >> The copyright stretches from 1990 to 1996, which implies the range
> >> of years in which it was under development.
>
> >> Granted, that's SysV, but Linux et al used it as well.  Microsoft,
> >> of course, had their own ABI, but them doing something nonstandard
> >> is not exactly news.
>
> > Microsoft's ABI was established in 1993. So, if  SysV released
> > theirs in 1996, I am not sure who exactly was doing something
> > nonstandard.
>
> Look at the copyright dates for the SysV ABI again: 1990-1996 means it
> was first published in 1990 and last modified in 1996.
>

Still, Microsoft likely had register allocation-related parts of their
IA32 ABI frozen in 1989 if not in 1988.

> > BTW, Microsoft's IA32 ABI does not support 80-bit precision at all.
> > Probably because Dave Cutler was Alpha bigot.
>
> ... meaning that x87's support for 80-bit precision is inaccessible to
> what, 99% of users?  There should be no problem removing it, then.
>

99% percent is way too much.

First, in big IA32 world Win32 probably never had match more than 90%
market share.

Second, even in Win32, 80-bit precision is not supported by Microsoft
compilers ABI, but fully supported by OS (i.e. full x87 state saved/
restored during task switches). Most system Win32 APIs would also
accept x87 in any precision-related state on entry and will not
disturb it upon exit. For those APIs that do disturb x87 state, like
many older parts of DirectX, it is possible to wrap them in manual
save/restore.
So, both in theory and in practice, nothing prevents 3rd party tool
vendors from providing 80-bit precision on Win32 platform. I don't
know how many actually did it, but gcc certainly did, both in cygwin
and in mingw. And non-Microsoft dev tools traditionally had decent
market share on Win32 (less so today).

Third, even with Microsoft compiler, careful developer can utilize
extended precision in C-callable asm routines.

MitchAlsup

unread,
Mar 19, 2013, 1:38:49 PM3/19/13
to
On Tuesday, March 19, 2013 7:07:56 AM UTC-5, Michael S wrote:
>

A new rendition of "If you build it, Someone will prevent them
from comming".

Mitch

Michael S

unread,
Mar 19, 2013, 1:52:28 PM3/19/13
to
Sorry, I didn't understand. Can you elaborate on that?

timca...@aol.com

unread,
Mar 19, 2013, 1:58:19 PM3/19/13
to
I think he was referring to x87 80 bit math.

Intel built it, Win32 ABI prevented people from using it.

- tim

MitchAlsup

unread,
Mar 19, 2013, 11:55:51 PM3/19/13
to
It is the contrapositive of "If you build it they will come.

Intel built it, MS prevented people from using it.

Mitch

Michael S

unread,
Mar 20, 2013, 4:49:50 AM3/20/13
to
ok. I didn't understand because of position of your post in the
thread.

Your reaction is very understandable as a reply to my first post or to
Stephen's followup.
Less understandable as a reaction to my post that tries to argue that,
although Microsoft did not utilize x87 extended precision in their
dev. tools, they actually did not prevent other tool vendors and asm
programmers from using it on Win32. At least, no more than they
prevented them from using other obscure, but potentially usable x86/
x87 features. Like, say, parity flag. Or 18-digit packed BCD integers
that are also rudimentary supported by x87 hardware.

If you are looking for IA32 features that Microsoft prevented people
from using on Win32 platform then x87 extended precision is not a good
example. I'd rather mention dynamically sized/placed segments or to
TSS and its surrounding. Or, for something less brain-dead, call
gates.

Ivan Godard

unread,
Mar 20, 2013, 5:44:19 AM3/20/13