Uses for an exchange instruction?

Paul A. Clayton

unread,

Oct 17, 2012, 7:03:26 PM10/17/12

to

What uses are there for an exchange instruction?

If it is atomic with respect to cache coherence (which for
such a simple operation should not be problematic, I am
guessing), it can be used as a form of test-and-set where
the set and clear values are defined by software. This is
more limited than compare-and-swap but better fits a RISC
design philosophy.

Exchange might also be useful when spilling and filling
values to/from the stack, accomplishing the action in a
single instruction (with a reduction in fetch, decode, and
address generation, address translation, and cache tag
comparison overhead) and allowing the stack frame to be
reduced in size by one word (though debugging would be
made more complex). Even if debugging complexity is not a
problem, it is not clear that there would be much use for
such.

There might be some cases where the update of a variable
is not dependent on the current value and the current
value needs to be read. (Perhaps grabbing the pointer
for a work item and replacing it with a 'null', perhaps
as part of cutting out a slice of a linked list?)

I had initially thought that such might help in reducing
the usefulness of scratch registers (or shadow registers),
but there would still be a need for a thread or processor
thread context local memory pointer for at least the first
exchange target.

So what am I missing? Or was exchange primarily intended
as simply a more flexible test-and-set with any other
uses being an added bonus?

Ivan Godard

unread,

Oct 17, 2012, 7:46:58 PM10/17/12

to

Two questions here, hardware and linguistic, with different answers.

In hardware, XCHG was the Burroughs B-6500 atomic primitive. The
designers thought that exchange would simplify or obviate a lot of locks
compared to other primitives; lockless programming was an understood
goal even 50 years ago.

The B-6500 hardware had core memory (for you whippersnappers, those were
little ferrite rings threaded with sense wires). The tech had the
property that a write to a bit caused the XOR of the value written and
the previous value to appear on the sense lines, which means that an
exchange operation only needs to access the ferrite once and the
operation has the same timing as a plain load.

There were no end of problems with using exchange as an atomic, leading
the the realization that as an atomic the only thing you could do with
it was to implement test-and-set and proceed from there. The basic issue
was ABA aliasing, though I remember Billard and Baurle muttering about
other issues too - I wasn't in that group and didn't know the details.
This of course meant that many many data structures throughout the OS
had to have lock variables added, with much rending of garments.

On the other hand, exchange in a *non-atomic* context is quite useful if
it is reachable from your host language. A lot of performance critical
algorithms run much faster with exchange so long as the data structure
is monothreaded. I vaguely recall the memory part of the system sort
function speeding up by >30% when exchange replaced load/store. Note
that if you don't need to be atomic, a swap-word operation can be used
to swap-arbitrary simply by walking wordwise along the structure.

In software, Mary is the only language I know of that exposed exchange
as a language primitive (as opposed to an intrinsic function). Mary had
no operator precedence, using strict left-to-right execution order
including assignment. Thus "A + B * C =: D;" means load A to a register,
add B, multiply the result times C, and store the product to D.

The Mary exchange operation was ":=:" and was applicable to any types
for which "=:" (store) was applicable. Thus "new :=: old.next =:
new.next" is a linked list insert; very convenient and natural notation
IMO. The compiler used the hardware exchange operation if one existed,
or loads and stores if not, depending on target.

Note that the Mary :=: was *not* an atomic. It was just a notation for a
common idiom, swap, without requiring a scope, a new variable in the
namespace, and the nuisance of maintaining the scratch declaration in
the face of type changes of the value being swapped. Only with "auto"
and C++11 can conventional languages avoid that nuisance in
metaprogrammang, and Mary did a *lot* of metaprogramming.

MitchAlsup

unread,

Oct 17, 2012, 9:01:43 PM10/17/12

to

On Wednesday, October 17, 2012 6:03:26 PM UTC-5, Paul A. Clayton wrote:
> What uses are there for an exchange instruction?

Do yourself a favor and do not attempt to join the orthogonal properties
of swapping data with the property of atomicity.

Excahnge instructions should have the singular property that after the
instruction has been performed that the registers and memory locations
have easily computable bit patterns in them. Atomic stuff has many more
possible bit patterns and many more potential sources of "it no workie".

Given one slot left in the instruction encoding and the ability to put
an exchange instruction in that slot--don't do it--save it for something
more valuable later on.

Mitch

Paul A. Clayton

unread,

Oct 18, 2012, 11:20:00 AM10/18/12

to

On Wednesday, October 17, 2012 7:47:16 PM UTC-4, Ivan Godard wrote:
> On 10/17/2012 4:03 PM, Paul A. Clayton wrote:
>> What uses are there for an exchange instruction?

[snip]

> Two questions here, hardware and linguistic, with different answers.
>
> In hardware, XCHG was the Burroughs B-6500 atomic primitive. The
> designers thought that exchange would simplify or obviate a lot of locks
> compared to other primitives; lockless programming was an understood
> goal even 50 years ago.
>
> The B-6500 hardware had core memory (for you whippersnappers, those were
> little ferrite rings threaded with sense wires). The tech had the
> property that a write to a bit caused the XOR of the value written and
> the previous value to appear on the sense lines, which means that an
> exchange operation only needs to access the ferrite once and the
> operation has the same timing as a plain load.

Neat. (I am almost surprised that they did not [or did they?]
find a use for the XOR "result"; such might reduce the logic used
for test-and-set.)

> There were no end of problems with using exchange as an atomic, leading
> the the realization that as an atomic the only thing you could do with
> it was to implement test-and-set and proceed from there. The basic issue
> was ABA aliasing, though I remember Billard and Baurle muttering about
> other issues too - I wasn't in that group and didn't know the details.

Unlike compare-and-swap, one cannot safely use exchange to implement
something like atomic increment. While a modification by another
thread between the preparatory read and the exchange could be
detected, there is a chance that a thread might use the incorrect
value inserted by the exchange (which was based on an increment of
a stale value). For a join counter this might not be a huge
problem since the thread that inserted the wrong value would know
how many increments had been undone and could keep adding the
number of undone increments until no conflicts were encountered
(I think); since no action is taken until the join total is
reached, temporarily storing a smaller or equal count should not
have an adverse effect.

> This of course meant that many many data structures throughout the OS
> had to have lock variables added, with much rending of garments.
>
> On the other hand, exchange in a *non-atomic* context is quite useful if
> it is reachable from your host language. A lot of performance critical
> algorithms run much faster with exchange so long as the data structure
> is monothreaded. I vaguely recall the memory part of the system sort
> function speeding up by >30% when exchange replaced load/store. Note
> that if you don't need to be atomic, a swap-word operation can be used
> to swap-arbitrary simply by walking wordwise along the structure.

That is somewhat surprising. This would seem to imply that exchanges
were a large fraction of executed instructions. (If MSI coherence
state was used, updating Shared to Modified could be expensive if
memory operation time was not overlapped. In MESI, a single-user
cache line would never need a Shared to Modified update [though
false sharing could force such if each _word_ only had a single
user but a cache line had multiple users].)

In most uses, exchange would be coherence atomic; it is just that
software is guaranteeing atomicity not hardware (e.g., single
threaded program or use of locks).

[snip interesting comments on the Mary programming language]

Now I am saddened not only by how much I do not know but also
by how much good design is not practiced (or how great the
delay is between discovery and broad adoption).

I is often fun to learn, so I do thank you for sharing this bit
of mostly historical information.

Paul A. Clayton

unread,

Oct 18, 2012, 11:57:54 AM10/18/12

to

On Wednesday, October 17, 2012 9:01:43 PM UTC-4, MitchAlsup wrote:
> On Wednesday, October 17, 2012 6:03:26 PM UTC-5, Paul A. Clayton wrote:
>> What uses are there for an exchange instruction?
>
> Do yourself a favor and do not attempt to join the orthogonal properties
> of swapping data with the property of atomicity.
>
> Excahnge instructions should have the singular property that after the
> instruction has been performed that the registers and memory locations
> have easily computable bit patterns in them. Atomic stuff has many more
> possible bit patterns and many more potential sources of "it no workie".

I thought that exchange would be relatively simple to make
coherence atomic. (Admittedly, simple and correct might mean
uselessly slow. Waiting for the store part to be ready to
commit before performing the load might be somewhat simple
but could significantly hurt performance. In a processor using
OoO mechanisms to provide a very strong consistency model,
speculatively performing the load might not add significant
complexity.)

(x86 does distinguish between coherence atomic and merely
interrupt atomic update operations with the LOCK prefix.
Using something like the acquire/release indicator in
Itanium and ARM AArch64 would probably not be appropriate
because at least some uses of coherence atomic exchange
do not need the memory barrier semantics; again a case of
unnecessarily binding orthogonal features.)

IBM did choose to recently make certain update
instructions coherence atomic (if the memory address is
properly aligned) in zSeries (S/360 descendants). This
hints that such artificial binding might not be extremely
expensive (at least if the architecture had a strong
memory consistency model), though such might also be of
greater benefit for the target market of IBM's mainframes
than it would be for most general purpose processors (the
atomic guarantee might allow some software to be unchanged
while still supporting multiprocessor operation).

> Given one slot left in the instruction encoding and the ability to put
> an exchange instruction in that slot--don't do it--save it for something
> more valuable later on.

I was inclined to think that exchange was a less useful
instruction, which was why I asked if there were uses that
I had not thought of which could justify its inclusion in
the M88k. (Ivan Godard already mentioned part of the
attraction for a processor using core memory.)

I admit that I like that exchange facilitates certain
optimizations and avoids certain overheads. In coherence
atomic form, such could replace a three instruction
sequence (ll, sc, branch) and would not have even
theoretical lock-up issues (idiom recognition with short
ll/sc sequences could be useful, but such would be more
expensive [but also more flexible] than specialized
instructions).

Implementing exchange might not be excessively complex
at any scale of performance, but if its use is relatively
limited forcing such complexity on all implementations
could be suboptimal.

I am not entirely convinced on the preciousness of
instruction encodings, especially if one has variable
length encodings. Even in a RISC with 32-bit instructions,
using one of hundreds of minor opcodes for a single major
opcode might not be excessively expensive. Since it seems
to be less useful _and_ can be effectively synthesized
with modest extra overhead, there does not seem to be a
good case for exchange. (Of course, I have no experience
with ISA design much less maintenance, and I have a very
underdeveloped sense of the importance of binary compatibility;
so my perception is not exactly trustworthy in this.)

EricP

unread,

Oct 18, 2012, 11:57:34 AM10/18/12

to

Paul A. Clayton wrote:
> What uses are there for an exchange instruction?

Outside of atomic usages, if you have a non-orthagonal ISA,
either because it uses accumulator(s), or where certain opcodes
only work on certain registers, then you need exchange to get
operands to the right register without a store(s)+load(s).

Eric

Michael S

unread,

Oct 18, 2012, 12:21:09 PM10/18/12

to

On Oct 18, 5:57 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
>
> (x86 does distinguish between coherence atomic and merely
> interrupt atomic update operations with the LOCK prefix.

The statement above is correct for all RMW x86 instructions *except*
xchg.

From Intel manual:
If a memory operand is referenced, the processor’s locking protocol is
automatically implemented for the duration of the exchange operation,
regardless of the presence or absence of the LOCK prefix or of the
value of the IOPL.

Michael S

unread,

Oct 18, 2012, 12:23:58 PM10/18/12

to

On Oct 18, 1:47 am, Ivan Godard <igod...@pacbell.net> wrote:
>
> On the other hand, exchange in a *non-atomic* context is quite useful if
> it is reachable from your host language. A lot of performance critical
> algorithms run much faster with exchange so long as the data structure
> is monothreaded. I vaguely recall the memory part of the system sort
> function speeding up by >30% when exchange replaced load/store.

You mean, on original B6500? Or on later derivatives?

Robert Wessel

unread,

Oct 18, 2012, 12:45:44 PM10/18/12

to

Excepting pre-286 versions of x86, of course, where an unadorned xchg
did not imply a lock.

I've always been baffled by that particular change in the ISA. Xchg to
memory is now always atomic, and very often slow because of that. So
it's less useful in cases where you don't care that it's atomic. And
it's not like there wasn't a mechanism for making it atomic when you
wanted it.

Michael S

unread,

Oct 18, 2012, 1:58:05 PM10/18/12

to

On Oct 18, 6:45 pm, Robert Wessel <robertwess...@yahoo.com> wrote:
> On Thu, 18 Oct 2012 09:21:09 -0700 (PDT), Michael S
>

I'd guess, the reference to IOPL in above statements gives us a hint
on why.
Seems like 80286 designers had a bright idea that all locked
primitives except XCHG should be reserved for the IOpriveledged code
segments. But then they realized that hardware that generates #GP on
all LOCK prefixed instructions except XCHG is much more complex than
the hardware that generates #GP on each appearance of LOCK prefix
itself. The rest follows.

Unlike few people here, I think that x86 ISA in its current IA-32/
AMD64 form mostly makes sense and that even its original 8086 variant
sorta made sense, although obviously less so. But this sentiment does
not extend to the absolute majority of 286-introduced features.

EricP

unread,

Oct 18, 2012, 2:43:55 PM10/18/12

to

Yeah, sometimes atomic is not wanted, just non-interruptable.
Xchg is useful for pushing an item on a list, in a nested
interrupt safe manner, without disabling interrupts.
I use it to build a FIFO to-do list in interrupt handlers.
Because each list is private to a cpu no memory barriers are
required, just a non-interruptable exchange.

Eric

Kalle Olavi Niemitalo

unread,

Oct 18, 2012, 6:02:09 PM10/18/12

to

"Paul A. Clayton" <paaron...@gmail.com> writes:

> What uses are there for an exchange instruction?

I've seen exchange instructions used for obfuscation, making it
harder for humans to understand whether the code wants to load or
store... or both for unrelated values. It was not particularly
effective though.

MitchAlsup

unread,

Oct 18, 2012, 6:22:06 PM10/18/12

to

On Thursday, October 18, 2012 5:01:07 PM UTC-5, Kalle Olavi Niemitalo wrote:
> I've seen exchange instructions used for obfuscation, making it
> harder for humans to understand whether the code wants to load or
> store... or both for unrelated values. It was not particularly
> effective though.

When writing an 8085 driver way back in 1979 I used 4 XTHLs in a row
in order to create a delay between when I wrote to a device register
and when I was allowed to read from the device (subsequent to having
written to the device.) Why XTHL--well it was only 1 byte opcode and
took a few cycles to read and write the RAMs; and I was being pressed
to keep the code as small as possible.

The code did absolutely nothing to machine state other than generate
16 (or was it 32) cycles of delay.

Talk about obscuration.....

Mitch

Tim McCaffrey

unread,

Oct 18, 2012, 7:46:47 PM10/18/12

to

In article <bGVfs.5279$sB6....@newsfe08.iad>,
ThatWould...@thevillage.com says...

Indeed, when I did 8088 asm programs I used XCHG AX,<reg> a fair
amount. It was small (1 byte) and fast (2 cycles).

Of course, I also used XLAT, LOOP(cond), LODSB, STOSB, and other
instructions that Intel/AMD don't want around anymore....

- Tim

Robert Wessel

unread,

Oct 19, 2012, 12:02:55 AM10/19/12

to

Gack. I had forgotten that you needed CPL <= IOPL to issue a lock on
the 286. What moron thought that was a good idea? And to what end?

>Unlike few people here, I think that x86 ISA in its current IA-32/
>AMD64 form mostly makes sense and that even its original 8086 variant
>sorta made sense, although obviously less so. But this sentiment does
>not extend to the absolute majority of 286-introduced features.

I don't really disagree, but it does give us something to gripe about.

Sure it could be cleaner, and use some more registers (largely fixed
in x86-64 mode), and there's a fair bit of legacy crap (x87 for
starters), but I have to agree. x86-32 and x86-64 are not too bad in
general, although they have their little quirks and bits of
non-orthogonality to annoy us. And given the constraints the 8086 was
designed under (the darn thing had only 29,000 transistors), it was a
big (and pretty nice) step up from the 8080. x87 and segmentation are
now largely bad memories, although both could have been much better
with only minor changes.

And at the end of the day, whatever its warts, x86 gives us using
these machines outstanding performance (in any terms), for a pittance.
That excuses almost everything.

Terje Mathisen

unread,

Oct 19, 2012, 2:00:15 AM10/19/12

to

Tim McCaffrey wrote:
> In article <bGVfs.5279$sB6....@newsfe08.iad>,
> ThatWould...@thevillage.com says...
>>
>> Paul A. Clayton wrote:
>>> What uses are there for an exchange instruction?
>>
>> Outside of atomic usages, if you have a non-orthagonal ISA,
>> either because it uses accumulator(s), or where certain opcodes
>> only work on certain registers, then you need exchange to get
>> operands to the right register without a store(s)+load(s).
>>
>> Eric
>
> Indeed, when I did 8088 asm programs I used XCHG AX,<reg> a fair
> amount. It was small (1 byte) and fast (2 cycles).

Small? Yes.

Fast? Yes, but it was 4 cycles, not 2:

On the 8088 every instruction used 4 clocks per bus cycle, including the
opcode transfer.

>
> Of course, I also used XLAT, LOOP(cond), LODSB, STOSB, and other
> instructions that Intel/AMD don't want around anymore....

You used them for the same reason I did: Small code was fast code!
:-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Tim McCaffrey

unread,

Oct 19, 2012, 11:02:29 AM10/19/12

to

In article <gs67l9-...@ntp-sure.tmsw.no>, "terje.mathisenattmsw.no"
says...

>
>Tim McCaffrey wrote:
>> In article <bGVfs.5279$sB6....@newsfe08.iad>,
>> ThatWould...@thevillage.com says...
>>>
>>> Paul A. Clayton wrote:
>>>> What uses are there for an exchange instruction?
>>>
>>> Outside of atomic usages, if you have a non-orthagonal ISA,
>>> either because it uses accumulator(s), or where certain opcodes
>>> only work on certain registers, then you need exchange to get
>>> operands to the right register without a store(s)+load(s).
>>>
>>> Eric
>>
>> Indeed, when I did 8088 asm programs I used XCHG AX,<reg> a fair
>> amount. It was small (1 byte) and fast (2 cycles).
>
>Small? Yes.
>
>Fast? Yes, but it was 4 cycles, not 2:
>
>

Hmm, my July '81 iAPX88 book says XCHG AX,<reg> is 3 cycles, XCHG <reg>,<reg>
is 4. (and XCHG <reg>,<mem16> is 25+EA cycles! Ouch!).

Like you, I understood the execution speed was limited to how fast you could
fetch instruction bytes, but if I was going to use a slow instruction that
allowed the instruction prefetcher to actually work, I tried to take
advantage of it...

- Tim

Terje Mathisen

unread,

Oct 19, 2012, 4:07:49 PM10/19/12

to

Tim McCaffrey wrote:
> In article <gs67l9-...@ntp-sure.tmsw.no>, "terje.mathisenattmsw.no"

>>> Indeed, when I did 8088 asm programs I used XCHG AX,<reg> a fair
>>> amount. It was small (1 byte) and fast (2 cycles).
>>
>> Small? Yes.
>>
>> Fast? Yes, but it was 4 cycles, not 2:
>>
> Hmm, my July '81 iAPX88 book says XCHG AX,<reg> is 3 cycles, XCHG <reg>,<reg>
> is 4. (and XCHG <reg>,<mem16> is 25+EA cycles! Ouch!).

Let's see...that last instruction is probably 4 bytes, right?

Adding in the two bytes read and the two bytes written and I get 8x4=32
cycles as the minimum fetch time for that operation, so even with a 25+
cycle instruction the prefetch buffer will be empty afterwards. :-(

>
> Like you, I understood the execution speed was limited to how fast you could
> fetch instruction bytes, but if I was going to use a slow instruction that
> allowed the instruction prefetcher to actually work, I tried to take
> advantage of it...

IMHO the only common operations that allowed the prefetch buffer to fill
were MUL and DIV.

I used DIV in my prefetch measuring code, i.e. before doing a REP STOSB
which overwrote a bunch of INC REG instructions.

Andy (Super) Glew

unread,

Oct 20, 2012, 1:56:57 PM10/20/12

to Ivan Godard

Thanks for this nugget of history! I was not aware of this property of core.

> There were no end of problems with using exchange as an atomic, leading
> the the realization that as an atomic the only thing you could do with
> it was to implement test-and-set and proceed from there.

Actually you can do some slightly more powerful thing with
atomic-exchange-with-memory, aka atomic fetch-and-store:

You can implement several scalable queue locks,

e.g. Graunke and Thakkar's array based queue lock,
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#graunke_thakkar

e.g. the CLH

and a variant of the famous MCS lock can be implemented with exchange,
although MCS prefers to have compare_and_store (aka compare_and_swap).

> In software, Mary is the only language I know of that exposed exchange
> as a language primitive (as opposed to an intrinsic function). Mary had
> no operator precedence, using strict left-to-right execution order
> including assignment. Thus "A + B * C =: D;" means load A to a register,
> add B, multiply the result times C, and store the product to D.
>
> The Mary exchange operation was ":=:" and was applicable to any types
> for which "=:" (store) was applicable. Thus "new :=: old.next =:
> new.next" is a linked list insert; very convenient and natural notation
> IMO. The compiler used the hardware exchange operation if one existed,
> or loads and stores if not, depending on target.
>
> Note that the Mary :=: was *not* an atomic. It was just a notation for a
> common idiom, swap, without requiring a scope, a new variable in the
> namespace, and the nuisance of maintaining the scratch declaration in
> the face of type changes of the value being swapped. Only with "auto"
> and C++11 can conventional languages avoid that nuisance in
> metaprogrammang, and Mary did a *lot* of metaprogramming.

I learned to program from Dijkstra-era books that had various flavors of
concurrent assignment and other concurrent statement. Some of them had :=:

For example, the following were equivalent

a :=: b
a,b := b,a
a:=b, b:=a

but the comma oriented syntaxes allow more general parallel operations,
such as

# to from
ptr->next := n,
ptr->next->prev := n,
n->prev := ptr,
n->next := ptr->next

for insertion in a doubly linked list.

Not atomic, but allowed the compiler to figure out what temporaries were
necessary.

Ivan Godard

unread,

Oct 20, 2012, 2:20:15 PM10/20/12

to

On 10/20/2012 10:56 AM, Andy (Super) Glew wrote:
> On 10/17/2012 4:46 PM, Ivan Godard wrote:

<snip>

>>
>> The B-6500 hardware had core memory (for you whippersnappers, those were
>> little ferrite rings threaded with sense wires). The tech had the
>> property that a write to a bit caused the XOR of the value written and
>> the previous value to appear on the sense lines, which means that an
>> exchange operation only needs to access the ferrite once and the
>> operation has the same timing as a plain load.
>
> Thanks for this nugget of history! I was not aware of this property of
> core.

A read from core was actually a write of all zero bits. The XOR on the
sense lines was sent to the CPU and written back to the ferrite, so a
read took twice as long as a write or exchange in the module itself.
However, the write back overlapped with the flight time to the CPU, so
from the view of the CPU they were all one-cycle operations.

Ivan Godard

unread,

Oct 20, 2012, 2:35:37 PM10/20/12

to

On 10/20/2012 10:56 AM, Andy (Super) Glew wrote:

<snip>

> I learned to program from Dijkstra-era books that had various flavors of
> concurrent assignment and other concurrent statement. Some of them had :=:
>
> For example, the following were equivalent
>
> a :=: b
> a,b := b,a
> a:=b, b:=a
>
> but the comma oriented syntaxes allow more general parallel operations,
> such as
>
> # to from
> ptr->next := n,
> ptr->next->prev := n,
> n->prev := ptr,
> n->next := ptr->next
>
> for insertion in a doubly linked list.
>
> Not atomic, but allowed the compiler to figure out what temporaries were
> necessary.

Par clauses (your commas) are the best available, more general than
tuples, but don't map well to the underlying machine - they are more
hints to the optimizer. I always wanted to have a concise notation for
{A := X/Y, B := X%Y} that mapped to a hardware divide-with-remainder
operation.

This is especially a notational problem for our Mill, in which hardware
call operations can return more than one result. It works fairly
naturally for Ada and other languages with OUT parameters, but there's
no good way to reach it from C.

Michael S

unread,

Oct 20, 2012, 4:01:50 PM10/20/12

to

On Oct 19, 6:02 am, Robert Wessel <robertwess...@yahoo.com> wrote:
> On Thu, 18 Oct 2012 10:58:05 -0700 (PDT), Michael S
>
>
>
>
>
>
>
>
>

Couple of day ago I encountered that:
http://www.computerhistory.org/collections/accession/102702019
Pretty interesting reading, even if I don't believe that they remember
a a lot of details 25+ years later.
As to 286, they seem to recollect that any 286 design decisions were
driven by scare of "Zilog MMU", supposed reference to z8000
segmentation/memory protection facility. Or, may be, they had sources
telling them about z80000?
So they were scare, which is not the best mood for making all right
decision.

Trying to rationalize, may be, originally they planned to make LOCK
into real instruction, rather than prefix, with UNLOCK being explicit,
rather than implied on the next instruction boundary? Later on they
probably didn't find a useful way for co-existents of LOCK sceme like
that with interrupts so the idea was dropped. But we still can see the
vestige in form of IOPL sensitivity of LOCK prefix.

Terje Mathisen

unread,

Oct 21, 2012, 2:26:40 AM10/21/12

to

Ivan Godard wrote:
> Par clauses (your commas) are the best available, more general than
> tuples, but don't map well to the underlying machine - they are more
> hints to the optimizer. I always wanted to have a concise notation for
> {A := X/Y, B := X%Y} that mapped to a hardware divide-with-remainder
> operation.

Ditto.

>
> This is especially a notational problem for our Mill, in which hardware
> call operations can return more than one result. It works fairly
> naturally for Ada and other languages with OUT parameters, but there's
> no good way to reach it from C.
>

I believe I have seen a C compiler which had an intrinsic that did
expose this, but you needed an address parameter to get back the
remainder, something like

uint64_t dividend;
uint32_t divisor, result, remainder;
...
result = divmod6432(dividend, divisor, &remainder);

where the compiler would presumably figure out that remainder was a
local variable which could live in the natural 64%32 remainder register.

Even without an intrinsic, this is a useful asm library function!

Stephen Sprunk

unread,

Oct 21, 2012, 4:09:38 AM10/21/12

to

On 21-Oct-12 01:26, Terje Mathisen wrote:

> Ivan Godard wrote:
>> This is especially a notational problem for our Mill, in which hardware
>> call operations can return more than one result. It works fairly
>> naturally for Ada and other languages with OUT parameters, but there's
>> no good way to reach it from C.
>
> I believe I have seen a C compiler which had an intrinsic that did
> expose this, but you needed an address parameter to get back the
> remainder, something like
>
> uint64_t dividend;
> uint32_t divisor, result, remainder;
> ...
> result = divmod6432(dividend, divisor, &remainder);
>
> where the compiler would presumably figure out that remainder was a
> local variable which could live in the natural 64%32 remainder register.

It's too bad that this isn't legal:

uint64_t dividend;
uint32_t divisor, quotient, remainder;
struct {uint32_t, uint32_t} divmod6432(uint64_t, uint32_t);
...
{quotient, remainder} = divmod6432(dividend, divisor);

I can think of innumerable cases where it would be useful to return
multiple values without having to mess about with temporary structures
or pass-by-pointer. It would be a nice touch if it were also legal to
call the above function like this:

{quotient,} = divmod6432(dividend, divisor); //discard remainder

or like this:

{,remainder} = divmod6432(dividend, divisor); //discard quotient

Adding this to the language seems a rather obvious request, so I have to
assume that there's some serious obstacle that isn't so obvious?

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

Walter Banks

unread,

Oct 21, 2012, 5:13:00 AM10/21/12

to

This has merit for many normal functions to return multiple values
as a result rather than through argument list pointers. The syntax is
clear and the implementation would not be difficult.

w..

nm...@cam.ac.uk

unread,

Oct 21, 2012, 7:48:12 AM10/21/12

to

In article <5083BC9C...@bytecraft.com>,

>This has merit for many normal functions to return multiple values
>as a result rather than through argument list pointers. The syntax is
>clear and the implementation would not be difficult.

I am sorry, but that is seriously mistaken. All that is true in a
cleaner language, but is completely the opposite of the truth in
the utter mess that C and C++ have become. It was discussed in
WG14 when I was on it, and rejected on those grounds. We didn't
consider using braces rather than parentheses, but I am sure that
I could think of some syntactic nasties with those, especially in
C++. For example, remember that assignment is an operator and
delivers an lvalue, and is less binding than the conditional and
comma operators.

Regards,
Nick Maclaren.

Walter Banks

unread,

Oct 21, 2012, 11:40:00 AM10/21/12

to

nm...@cam.ac.uk wrote:

>
> >> Adding this to the language seems a rather obvious request, so I have to
> >> assume that there's some serious obstacle that isn't so obvious?
> >
> >This has merit for many normal functions to return multiple values
> >as a result rather than through argument list pointers. The syntax is
> >clear and the implementation would not be difficult.
>
> I am sorry, but that is seriously mistaken. All that is true in a
> cleaner language, but is completely the opposite of the truth in
> the utter mess that C and C++ have become. It was discussed in
> WG14 when I was on it, and rejected on those grounds. We didn't
> consider using braces rather than parentheses, but I am sure that
> I could think of some syntactic nasties with those, especially in
> C++. For example, remember that assignment is an operator and
> delivers an lvalue, and is less binding than the conditional and
> comma operators.

Nick,

I was on WG-14 at the same time. I am not advocating a C change
but some form of language linguistics to allow more than one return
value from a function.

There is some precedence for this in the pascal procedures where
a variable may declared in the argument list as a var and values
may be returned that way. The C problem is multiple values may
be returned but only through a pointer.

Walter..

Michael S

unread,

Oct 21, 2012, 12:00:04 PM10/21/12

to

On Oct 21, 5:37 pm, Walter Banks <wal...@bytecraft.com> wrote:

Why returning structure is not good enough?

nm...@cam.ac.uk

unread,

Oct 21, 2012, 12:11:45 PM10/21/12

to

In article <50841750...@bytecraft.com>,

Walter Banks <wal...@bytecraft.com> wrote:
>
>> >> Adding this to the language seems a rather obvious request, so I have to
>> >> assume that there's some serious obstacle that isn't so obvious?
>> >
>> >This has merit for many normal functions to return multiple values
>> >as a result rather than through argument list pointers. The syntax is
>> >clear and the implementation would not be difficult.
>>
>> I am sorry, but that is seriously mistaken. All that is true in a
>> cleaner language, but is completely the opposite of the truth in
>> the utter mess that C and C++ have become. It was discussed in
>> WG14 when I was on it, and rejected on those grounds. We didn't
>> consider using braces rather than parentheses, but I am sure that
>> I could think of some syntactic nasties with those, especially in
>> C++. For example, remember that assignment is an operator and
>> delivers an lvalue, and is less binding than the conditional and
>> comma operators.
>

>I was on WG-14 at the same time. I am not advocating a C change
>but some form of language linguistics to allow more than one return
>value from a function.
>
>There is some precedence for this in the pascal procedures where
>a variable may declared in the argument list as a var and values
>may be returned that way. The C problem is multiple values may
>be returned but only through a pointer.

Sorry - may I please increasing age and degenerating memory?

But my point stands. Pascal does not have the same syntactic
horrors - for example, assignment is a statement (like in
Fortran) - but the real problem is in the semantic horrors,
which are made worse by allowing this. For example

int * a, * b;
{a[5],b[10]} = ({a[10],b[5]} += {3,2});

And, of course, that's a simple case. It was bad enough in C90,
but what C99 and (even worse) C11 have done to the semantics of
this area beggars description. The only simple resolution would
be to create the concept of an assignment statement, but I am
pretty sure that wouldn't solve the issues in C11 or be acceptable
to C++.

In principle, I agree that it's trivial. Lots of languages
permit it - Matlab and Python, to name but two.

Regards,
Nick Maclaren.

Robert Wessel

unread,

Oct 21, 2012, 12:34:08 PM10/21/12

to

Because the two values returned frequently don't always have a
relationship that makes it sensible to put them into a structure.
Consider the most obvious use for such a thing: the ability to return
a value and a result/status code from a function - you'd almost never
really want those two in a structure together.

Ivan Godard

unread,

Oct 21, 2012, 12:50:25 PM10/21/12

to

I agree that tuple syntax (all those { , , , } approaches) is a
syntactic problem in C. More important, it makes code *very* hard to
read and understand - too many balls in the air for mere mortals.

However, what's wrong with OUT parameters?
float divide(float divisor, float dividend,
OUT float remainder);
float X, Y, Z, W;
X = divide(Y, Z, W);
by making the OUT parameters optional (compiler passes an ignored temp)
then you get:
X = divide(Y, Z);
and
; divide(Y, Z, W);
for normal / and %. INOUT is an obvious extension that is widespread in
other languages, as is explicit IN for the documentation benefit.

I don't see any syntactic problems adding these to C. As a notation it's
clumsy because you need a handy L-value, but single-carried-value is a
psychological assumption that is a holdover from the linear scan of
reading text; we don't "read" graphs, only arcs. Hence there doesn't
seem to be anything better than to stash all but one result in a name
where you can pick then up later when you are done with the "prime" result.

Terje Mathisen

unread,

Oct 21, 2012, 1:05:52 PM10/21/12

to

Walter Banks wrote:

>
>
> nm...@cam.ac.uk wrote:
>> I am sorry, but that is seriously mistaken. All that is true in a
>> cleaner language, but is completely the opposite of the truth in
>> the utter mess that C and C++ have become. It was discussed in
>> WG14 when I was on it, and rejected on those grounds. We didn't
>> consider using braces rather than parentheses, but I am sure that
>> I could think of some syntactic nasties with those, especially in
>> C++. For example, remember that assignment is an operator and
>> delivers an lvalue, and is less binding than the conditional and
>> comma operators.
>
> Nick,
>
> I was on WG-14 at the same time. I am not advocating a C change
> but some form of language linguistics to allow more than one return
> value from a function.
>
> There is some precedence for this in the pascal procedures where
> a variable may declared in the argument list as a var and values
> may be returned that way. The C problem is multiple values may
> be returned but only through a pointer.

Wouldn't a struct return solve this syntax problem?

Terje Mathisen

unread,

Oct 21, 2012, 1:08:27 PM10/21/12

to

Robert Wessel wrote:
> On Sun, 21 Oct 2012 09:00:04 -0700 (PDT), Michael S

>> Why returning structure is not good enough?
>
>
> Because the two values returned frequently don't always have a
> relationship that makes it sensible to put them into a structure.
> Consider the most obvious use for such a thing: the ability to return
> a value and a result/status code from a function - you'd almost never
> really want those two in a structure together.

No, but if that struct is declared locally and the two (or more?) return
value components are immediately copied (i.e. moved) to the final
location, then most compilers should be able to detect those moves as
NOPs and get rid of them.

nm...@cam.ac.uk

unread,

Oct 21, 2012, 2:44:03 PM10/21/12

to

In article <k6194d$31m$1...@speranza.aioe.org>,

Ivan Godard <igo...@pacbell.net> wrote:
>
>I agree that tuple syntax (all those { , , , } approaches) is a
>syntactic problem in C. More important, it makes code *very* hard to
>read and understand - too many balls in the air for mere mortals.

I don't have a problem with it in a language with a simpler and
cleaner syntax, but I agree about it in C.

>However, what's wrong with OUT parameters?
> float divide(float divisor, float dividend,
> OUT float remainder);
> float X, Y, Z, W;
> X = divide(Y, Z, W);
>by making the OUT parameters optional (compiler passes an ignored temp)
>then you get:
> X = divide(Y, Z);
>and
> ; divide(Y, Z, W);
>for normal / and %. INOUT is an obvious extension that is widespread in
>other languages, as is explicit IN for the documentation benefit.

As in C++ reference arguments, yes. It would resolve some issues
because all side-effects and sequencing in calculating the lvalues
would be done before calling the function. What it would NOT do
is to provide proper multi-valued function semantics, without
inventing some semantics and semantic mechanism to store the
multiple results only on return. And that is the aspect that I
think is intractable.

None of this is linguistically fundamental, but I am referring to
the specifics of C, where this area is already an ungodly mess.

Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,

Oct 21, 2012, 2:45:27 PM10/21/12

to

In article <bpmdl9-...@ntp-sure.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Robert Wessel wrote:
>> On Sun, 21 Oct 2012 09:00:04 -0700 (PDT), Michael S
>>
>>> Why returning structure is not good enough?
>>
>> Because the two values returned frequently don't always have a
>> relationship that makes it sensible to put them into a structure.
>> Consider the most obvious use for such a thing: the ability to return
>> a value and a result/status code from a function - you'd almost never
>> really want those two in a structure together.
>
>No, but if that struct is declared locally and the two (or more?) return
>value components are immediately copied (i.e. moved) to the final
>location, then most compilers should be able to detect those moves as
>NOPs and get rid of them.

The difference between theory and practice is less in theory than
it is in practice :-(

Otherwise, yes.

Regards,
Nick Maclaren.

Stephen Sprunk

unread,

Oct 21, 2012, 3:25:44 PM10/21/12

to

On 21-Oct-12 06:48, nm...@cam.ac.uk wrote:

>> Stephen Sprunk wrote:
>>> It's too bad that this isn't legal:
>>>
>>> uint64_t dividend;
>>> uint32_t divisor, quotient, remainder;
>>> struct {uint32_t, uint32_t} divmod6432(uint64_t, uint32_t);
>>> ...
>>> {quotient, remainder} = divmod6432(dividend, divisor);
>

> ... It was discussed in WG14 when I was on it, and rejected on those

> grounds. We didn't consider using braces rather than parentheses,
> but I am sure that I could think of some syntactic nasties with
> those, especially in C++. For example, remember that assignment is
> an operator and delivers an lvalue, and is less binding than the
> conditional and comma operators.

That's why I disregarded the (more obvious, to me) parenthesis syntax;
the following already has a defined meaning:

(quotient, remainder) = divmod6432(dividend, divisor);

which, AFAICT, would parse as this:

remainder = divmod6432(dividend, divisor);

except that I'm not sure the LHS would be an lvalue, and even if it were
there would be an obvious type mismatch and the quotient would get lost.

OTOH, using braces results in something that is not a compound statement
because there is no internal ";", so the only logical way to parse it is
as an anonymous struct.

Likewise, I would expect the following naïve implementation of
divmod6432() to work as well:

struct {uint32_t quotient, uint32_t remainder} divmod6432(uint64_t
dividend, uint32_t divisor) {
return {(dividend / divisor), (dividend % divisor)};
}

Again, without the internal ";", there is no valid way to parse this
except as an anonymous struct.

Aside from the parsing issues (and I assume there are other
complications I haven't considered there, as I know it's one of my weak
points), the behavior seems relatively easy to specify since you can
easily translate the above examples to the following equivalent code:

uint64_t dividend;
uint32_t divisor, quotient, remainder;
struct {uint32_t, uint32_t} divmod6432(uint64_t, uint32_t);
...

struct {uint32_t tmp1, uint32_t tmp2} tmp;
tmp = divmod6432(dividend, divisor);
quotient = tmp.tmp1;
remainder = tmp.tmp2;

and:

struct {uint32_t quotient, uint32_t remainder} divmod6432(uint64_t
dividend, uint32_t divisor) {
struct {uint32_t tmp1, uint32_t tmp2} tmp;
tmp.tmp1 = (dividend / divisor);
tmp.tmp2 = (dividend % divisor);
return tmp;
}

It is true one could argue that no truly new functionality is being
added to the language, just syntactic sugar, but this particular sugar
would be incredibly useful--and there are plenty of other examples of
syntactic sugar in C ([], ->, ++, etc).

Stephen Sprunk

unread,

Oct 21, 2012, 3:35:02 PM10/21/12

to

So what if they have no "relationship"? They are immediately extracted
from the temporary structure as soon as the function returns, thanks to
the anonymous struct syntax I proposed.

And, arguably, they do have a "relationship" of sorts, being multiple
return values from the same function invocation.

Stephen Sprunk

unread,

Oct 21, 2012, 3:35:05 PM10/21/12

to

On 21-Oct-12 12:08, Terje Mathisen wrote:
> Robert Wessel wrote:
>> On Sun, 21 Oct 2012 09:00:04 -0700 (PDT), Michael S
>>> Why returning structure is not good enough?
>>
>> Because the two values returned frequently don't always have a
>> relationship that makes it sensible to put them into a structure.
>> Consider the most obvious use for such a thing: the ability to return
>> a value and a result/status code from a function - you'd almost never
>> really want those two in a structure together.
>
> No, but if that struct is declared locally and the two (or more?) return
> value components are immediately copied (i.e. moved) to the final
> location, then most compilers should be able to detect those moves as
> NOPs and get rid of them.

Unfortunately, that would only be the case if the function were inlined.

Many (most?) ABIs specify that if a function returns a struct, the
caller passes a hidden argument with a pointer to a struct on the stack
that receives the return values. You would need to redefine all those
ABIs to allow multiple return values via registers (as with function
arguments), not just one, for it to be efficient in non-inlined cases.

Stephen Sprunk

unread,

Oct 21, 2012, 3:40:31 PM10/21/12

to

On 21-Oct-12 11:11, nm...@cam.ac.uk wrote:
> But my point stands. Pascal does not have the same syntactic
> horrors - for example, assignment is a statement (like in
> Fortran) - but the real problem is in the semantic horrors,
> which are made worse by allowing this. For example
>
> int * a, * b;
> {a[5],b[10]} = ({a[10],b[5]} += {3,2});
>
> And, of course, that's a simple case.

Well, I would never think of writing such a monstrosity, but its meaning
is obviously (to me) equivalent to this:

a[5] = a[10] += 3,
b[10] = b[5] += 2;

except that there is no sequence point between the two, so it would be
undefined if a==b or a and b overlapped in certain ways.

> The only simple resolution would be to create the concept of an
> assignment statement, but I am pretty sure that wouldn't solve the
> issues in C11 or be acceptable to C++.

Why would you need an assignment statement to make anonymous structs work?

Ivan Godard

unread,

Oct 21, 2012, 3:49:18 PM10/21/12

to

If you look at the machine code for a struct return you will discover
that they are returned via a pointer regardless of what it looks like in
the source. The caller passes a pointer to a temporary region, and the
callee fills it in. This means stores to assign the results, and later
loads to use them; painful increases in memory bandwidth and load/store
unit utilization, not to mention the necessary load latency to use the
result after the return. Given:
struct S {int i; ...};
S F1() { ... }
void F2(int) { ... }
F2(F1().i);
there will be at least a load delay between the calls, even on an OOO
machine.

In contrast, a multi-result call in asm just leaves the several results
in registers, where they are immediately usable by the caller. The above
code would have at most a register move delay between the calls. This
practice violates all call ABIs I have ever heard of. It could be used
for intra-module calls where the ABI rules are explicitly waived, but I
have also never seen a compiler that can recognize a multi-result idiom
and use an in-register protocol.

In contrast, Ada and other languages with OUT parameters routinely
return them in registers using an ABI that allows for that usage.
Inter-language calls are restricted to the lowest common denominator of
course, i.e. C protocol.

Terje Mathisen

unread,

Oct 21, 2012, 4:21:11 PM10/21/12

to

Ivan Godard wrote:
> On 10/21/2012 10:05 AM, Terje Mathisen wrote:
>> Wouldn't a struct return solve this syntax problem?
>>
>> Terje
>
> If you look at the machine code for a struct return you will discover
> that they are returned via a pointer regardless of what it looks like in
> the source.

As another poster noted: What about inlining?

IMHO such a tiny piece of code really doesn't make sense unless you
inline it, since the CALL/RETURN overhead (just in code space) is far
larger than the actual DIV opcode which generates both return values.

I.e. I'm really arguing for what's effectively a compiler intrinsic, but
still portable code.

Terje Mathisen

unread,

Oct 21, 2012, 4:23:56 PM10/21/12

to

nm...@cam.ac.uk wrote:
> As in C++ reference arguments, yes. It would resolve some issues
> because all side-effects and sequencing in calculating the lvalues
> would be done before calling the function. What it would NOT do
> is to provide proper multi-valued function semantics, without
> inventing some semantics and semantic mechanism to store the
> multiple results only on return. And that is the aspect that I
> think is intractable.

What I have done previously is to define a pair of 32-bit return values
as a single 64-bit value:

This makes perfect sense in this particular case on x86, since EDX:EAX
is the 64-bit return pair, and the same registers contain the two
results of a 64/32->(32,32) DIV opcode. :-)

Ivan Godard

unread,

Oct 21, 2012, 5:20:12 PM10/21/12

to

On 10/21/2012 1:23 PM, Terje Mathisen wrote:
> nm...@cam.ac.uk wrote:
>> As in C++ reference arguments, yes. It would resolve some issues
>> because all side-effects and sequencing in calculating the lvalues
>> would be done before calling the function. What it would NOT do
>> is to provide proper multi-valued function semantics, without
>> inventing some semantics and semantic mechanism to store the
>> multiple results only on return. And that is the aspect that I
>> think is intractable.
>
> What I have done previously is to define a pair of 32-bit return values
> as a single 64-bit value:
>
> This makes perfect sense in this particular case on x86, since EDX:EAX
> is the 64-bit return pair, and the same registers contain the two
> results of a 64/32->(32,32) DIV opcode. :-)
>
> Terje
>

Until you want to return a double and a status.

div/mod here is merely an easily understood example of multiple return
values. While any given example can klugified into an intrinsic or other
ad hocery, the general case remains.

Ivan Godard

unread,

Oct 21, 2012, 5:34:56 PM10/21/12

to

On 10/21/2012 1:21 PM, Terje Mathisen wrote:
> Ivan Godard wrote:
>> On 10/21/2012 10:05 AM, Terje Mathisen wrote:
>>> Wouldn't a struct return solve this syntax problem?
>>>
>>> Terje
>>
>> If you look at the machine code for a struct return you will discover
>> that they are returned via a pointer regardless of what it looks like in
>> the source.
>
> As another poster noted: What about inlining?
>
> IMHO such a tiny piece of code really doesn't make sense unless you
> inline it, since the CALL/RETURN overhead (just in code space) is far
> larger than the actual DIV opcode which generates both return values.
>
> I.e. I'm really arguing for what's effectively a compiler intrinsic, but
> still portable code.
>
> Terje
>

Then use a longer example: a function that returns both sin and cos is
much cheaper than two calls one for each, and it's not uncommon to want
both. Sure, you can inline - but a correct library function, that deals
with rounding modes, NaNs and infs, etc is a bit bigger than you'd want to.

Or look at the mess that are Unix syscalls and error returns. Some
return zero on failure, some zero on success, some a negative, some a
status, some a null pointer, some don't report at all - and most with
the hidden errno out argument. And all uniformly ignored - tell me the
last time you checked the return value of printf (yes, it can fail).

Contrast a uniform convention in which the result of every call is the
error enum, and any values to be returned are OUT parameters.

We use a library that intercepts every syscall with a macro of the same
nam. The macro checks for and converts error returns to C++ throw, while
still returning the (checked) data value of the original signature. It's
saved my butt more times than I like to admit.

Michael S

unread,

Oct 21, 2012, 7:25:32 PM10/21/12

to

On Oct 21, 9:49 pm, Ivan Godard <igod...@pacbell.net> wrote:
> On 10/21/2012 10:05 AM, Terje Mathisen wrote:
>
>
> If you look at the machine code for a struct return you will discover
> that they are returned via a pointer regardless of what it looks like in
> the source. The caller passes a pointer to a temporary region, and the
> callee fills it in. This means stores to assign the results, and later
> loads to use them;

True for Microsoft calling conventions.
According to my understanding of "System V Application Binary
Interface on AMD64 Architecture Processor", only partially true
(a.k.a. false) on x64 Gnu/Linux, Solaris, FreBSD and, may be, some
other OSes.
System V ABI has rather complex rules that I don't understand
completely. But the bottom line in our case is that
struct ab { uint64_t a,b; } is passed back in RAX and RDX.

> painful increases in memory bandwidth and load/store
> unit utilization, not to mention the necessary load latency to use the
> result after the return.

Even when it is passed on stack, it's not that painful relatively to
the time of 128b/64b division itself.
Memory is practically never involved. Of all today's x86 processors,
even L2 is involved only on Bulldozer.
Load/store unit is, indeed, utilized, but it's no big deal.
Load latency also shouldn't be involved, at least on more advanced
processors, because the case is extremely easy for load-to-store
forwarding hardware.

Ivan Godard

unread,

Oct 21, 2012, 8:41:33 PM10/21/12

to

On 10/21/2012 4:25 PM, Michael S wrote:
> On Oct 21, 9:49 pm, Ivan Godard <igod...@pacbell.net> wrote:
>> On 10/21/2012 10:05 AM, Terje Mathisen wrote:
>>
>>
>> If you look at the machine code for a struct return you will discover
>> that they are returned via a pointer regardless of what it looks like in
>> the source. The caller passes a pointer to a temporary region, and the
>> callee fills it in. This means stores to assign the results, and later
>> loads to use them;

> Even when it is passed on stack, it's not that painful relatively to
> the time of 128b/64b division itself.
> Memory is practically never involved. Of all today's x86 processors,
> even L2 is involved only on Bulldozer.
> Load/store unit is, indeed, utilized, but it's no big deal.
> Load latency also shouldn't be involved, at least on more advanced
> processors, because the case is extremely easy for load-to-store
> forwarding hardware.
>

Good point; the store buffers act in effect like a L0 cache.

Does anyone know the actual timing on a load from decode to availability
for something that has a hard data dependence when the load is satisfied
in the store buffers?

Robert Wessel

unread,

Oct 22, 2012, 1:42:57 AM10/22/12

to

On Sun, 21 Oct 2012 14:35:05 -0500, Stephen Sprunk
<ste...@sprunk.org> wrote:

>On 21-Oct-12 12:08, Terje Mathisen wrote:
>> Robert Wessel wrote:
>>> On Sun, 21 Oct 2012 09:00:04 -0700 (PDT), Michael S
>>>> Why returning structure is not good enough?
>>>
>>> Because the two values returned frequently don't always have a
>>> relationship that makes it sensible to put them into a structure.
>>> Consider the most obvious use for such a thing: the ability to return
>>> a value and a result/status code from a function - you'd almost never
>>> really want those two in a structure together.
>>
>> No, but if that struct is declared locally and the two (or more?) return
>> value components are immediately copied (i.e. moved) to the final
>> location, then most compilers should be able to detect those moves as
>> NOPs and get rid of them.
>
>Unfortunately, that would only be the case if the function were inlined.
>
>Many (most?) ABIs specify that if a function returns a struct, the
>caller passes a hidden argument with a pointer to a struct on the stack
>that receives the return values. You would need to redefine all those
>ABIs to allow multiple return values via registers (as with function
>arguments), not just one, for it to be efficient in non-inlined cases.

While I don't want to quibble of the definition of "many", it's not
uncommon for (very) short structures to be returned in registers. For
example, both MS and GCC using various "fast" calling conventions will
return at least some structures of 8 bytes in a register pair in
x86-32.

Robert Wessel

unread,

Oct 22, 2012, 1:46:44 AM10/22/12

to

On Sun, 21 Oct 2012 19:08:27 +0200, Terje Mathisen <"terje.mathisen at
tmsw.no"> wrote:

>Robert Wessel wrote:
>> On Sun, 21 Oct 2012 09:00:04 -0700 (PDT), Michael S
>>> Why returning structure is not good enough?
>>
>>
>> Because the two values returned frequently don't always have a
>> relationship that makes it sensible to put them into a structure.
>> Consider the most obvious use for such a thing: the ability to return
>> a value and a result/status code from a function - you'd almost never
>> really want those two in a structure together.
>
>No, but if that struct is declared locally and the two (or more?) return
>value components are immediately copied (i.e. moved) to the final
>location, then most compilers should be able to detect those moves as
>NOPs and get rid of them.

True enough, at least for functions that are inlined or have very
short returned structures meeting certain requirements, but it's
certainly not a pretty solution. For many libraries you'd need to
define a whole passel of "return" structures. And the introduction of
otherwise meaningless temporary variables is unlikely to help the
clarity of the code.

Terje Mathisen

unread,

Oct 22, 2012, 2:49:47 AM10/22/12

to

I'm pretty sure that the answer is "that depends on the actual cpu
model", but that the real answer is "less than or equal to an L1 cache hit".

I.e. a read from memory that hits in a store buffer looks like a
single-cycle operation as seen from the surrounding code.

Andy (Super) Glew

unread,

Oct 22, 2012, 2:53:34 AM10/22/12

to Ivan Godard

On all of the machines I am familiar with the store buffers have the
same latency as the L1 cache.

You *could* try to make store-to-load forwarding faster than the L1
cache. But it is a pain to deal with many different latencies.

Moreover, machines have now begin to "registerify" memory:
to treat "store M[A]:=reg1; ... load reg2 := M[A]"
as equivalent to "move reg2 := reg1". Even if reg1 gets overwritten in
between.

(Actually, "store M[A]:=reg1; ... move reg2 := reg1".
Eliding the store is not done so much yet, because it *might* be visible
to another processor.

However, eliding the store if there is another store to the same address
is done.

nm...@cam.ac.uk

unread,

Oct 22, 2012, 4:03:08 AM10/22/12

to

In article <k61j3g$3n8$1...@dont-email.me>,

Stephen Sprunk <ste...@sprunk.org> wrote:
>
>That's why I disregarded the (more obvious, to me) parenthesis syntax;
>the following already has a defined meaning:
>
>(quotient, remainder) = divmod6432(dividend, divisor);
>
>which, AFAICT, would parse as this:
>
>remainder = divmod6432(dividend, divisor);
>
>except that I'm not sure the LHS would be an lvalue, and even if it were
>there would be an obvious type mismatch and the quotient would get lost.

It isn't, and there could be yet another syntactic kludge, just as
there was for the comma operator and function calls in an early version
of the C90 standard. But, yes, that was and is one reason.

>> But my point stands. Pascal does not have the same syntactic
>> horrors - for example, assignment is a statement (like in
>> Fortran) - but the real problem is in the semantic horrors,
>> which are made worse by allowing this. For example
>>
>> int * a, * b;
>> {a[5],b[10]} = ({a[10],b[5]} += {3,2});
>>
>> And, of course, that's a simple case.
>
>Well, I would never think of writing such a monstrosity, but its meaning
>is obviously (to me) equivalent to this:
>
>a[5] = a[10] += 3,
>b[10] = b[5] += 2;
>
>except that there is no sequence point between the two, so it would be
>undefined if a==b or a and b overlapped in certain ways.

Er, that's only C90 and (to a great extent) C99. C11 has changed
all that, in ways which are completely unclear. Inter alia, shared
memory parallelism exposes issues that can be ignored in sequential
code.

>OTOH, using braces results in something that is not a compound statement
>because there is no internal ";", so the only logical way to parse it is
>as an anonymous struct.

Er, no. I would need to study the various standards, but braces are
seriously overused even in C99, and words fail me about the ways that
they can cause chaos in C++. However, even in the simple case, you
are delaying the classification of the contents of the braces until
you reach the closing brace (and possibly the following operator),
which is a new and alien concept to the C syntactic model. It is
still a one-pass parsing model, even if C99 broke that for such
things as whether functions are external or not.

>> The only simple resolution would be to create the concept of an
>> assignment statement, but I am pretty sure that wouldn't solve the
>> issues in C11 or be acceptable to C++.
>
>Why would you need an assignment statement to make anonymous structs work?

Because an assignment expression delivers an lvalue. Such a nonce
struct of references is not and cannot be an lvalue.

Regards,
Nick Maclaren.

Paul A. Clayton

unread,

Oct 22, 2012, 9:36:18 AM10/22/12

to

On Sunday, October 21, 2012 5:34:57 PM UTC-4, Ivan Godard wrote:
[snip]

> Or look at the mess that are Unix syscalls and error returns. Some
> return zero on failure, some zero on success, some a negative, some a
> status, some a null pointer, some don't report at all - and most with
> the hidden errno out argument. And all uniformly ignored - tell me the
> last time you checked the return value of printf (yes, it can fail).
>
> Contrast a uniform convention in which the result of every call is the
> error enum, and any values to be returned are OUT parameters.
>
> We use a library that intercepts every syscall with a macro of the same
> nam. The macro checks for and converts error returns to C++ throw, while
> still returning the (checked) data value of the original signature. It's
> saved my butt more times than I like to admit.

ISTM that it could be useful if error and return values could be
handled cleverly. E.g., on an ISA with condition codes or
predicate registers, it might be useful to place the indication
that there is an error in such a specialized branch-handling
register. Values could be merged/compacted in the implementation--
as often done in the manually generated error/return value
encodings mentioned above--and the implementation might vary among
ISAs (or even platforms), but the programmer would have a consistent
interface.

(Using a condition code to indicate the existence of an error
might allow error test branches to be folded out in the front-end
[not so much for x86, which changes a singular condition code too
frequently, or even ARM, which has a single condition code which
is used somewhat frequently; but for Power or Itanium a specialized
boolean could be relatively persistent at least in some cases, i.e.,
only being modified if there is an error. While error branches are
extremely predictable, removing a branch early may have some
advantages over just correct prediction.)

This would seem to require communicating information about the
validity of variables under different states, the providing of
default values for variables when they are undefined, and
information about valid values (e.g., pointer is 4-byte aligned,
value only needs 31 bits when valid) that would allow packing.

(Such cleverness could probably not be added to C, but it might be
a useful language feature in a theoretical sense. I suspect,
however, that the very modest benefits of such cleverness would
very rarely justify the implementation costs in compilers and
debuggers. [Yes, I seem to have an absurd affection for
complexity--when someone else has to deal with all the
disadvantages!] By the way, does C allow errorno to be register
allocated?)

Joe keane

unread,

Oct 22, 2012, 10:37:31 AM10/22/12

to

In article <06hcl9-...@ntp-sure.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

> uint64_t dividend;
> uint32_t divisor, result, remainder;
> ...
> result = divmod6432(dividend, divisor, &remainder);

The problem is, it looks like a C function call. You are not adding new
syntax to the language.

I wrote a test program, macros versus functions. Note all the code to
store to the temporary and load from it.

-- testit.c

#define COMP_A(X, A) ((A) = (X) + 2)
#define COMP_C(B, C) ((C) = (B) + 2)

static void comp_b(int a, int *bp);

int f(int x)
{
int a;
int b;
int c;

COMP_A(x, a);
{ int t; comp_b(a, &t); b = t; }
COMP_C(b, c);
return c;
}

static void comp_b(int a, int *bp)
{
int b;

b = a + 3;
*bp = b;
}

-- testit.s [cc -S -O2 testit.c]

.file "testit.c"
.text
.p2align 4,,15
.globl f
.type f, @function
f:
.LFB0:
.cfi_startproc
movl 4(%esp), %eax
addl $7, %eax
ret
.cfi_endproc
.LFE0:
.size f, .-f
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",@progbits

bagel

unread,

Oct 23, 2012, 11:27:35 AM10/23/12

to

On 10/21/2012 03:09 AM, Stephen Sprunk wrote:
> It's too bad that this isn't legal:
>
> uint64_t dividend;
> uint32_t divisor, quotient, remainder;
> struct {uint32_t, uint32_t} divmod6432(uint64_t, uint32_t);
> ...
> {quotient, remainder} = divmod6432(dividend, divisor);
>

> I can think of innumerable cases where it would be useful to return
> multiple values without having to mess about with temporary structures
> or pass-by-pointer. It would be a nice touch if it were also legal to
> call the above function like this:
>
> {quotient,} = divmod6432(dividend, divisor); //discard remainder
>
> or like this:
>
> {,remainder} = divmod6432(dividend, divisor); //discard quotient
>
> Adding this to the language seems a rather obvious request, so I have to
> assume that there's some serious obstacle that isn't so obvious?
>

> S

How about this <http://code.google.com/p/esl/>:

proc divmod(a: _uint64, b: _uint32): _uint32, _uint32
{ var div, mod: _uint32;

div, mod = a/b, a%b;
return div, mod;
}

Which generates this for x86-64:

divmod:
movl %esi, %ecx
movq %rdi, %rax
xorl %edx, %edx
divq %rcx
ret

Brian

Terje Mathisen

unread,

Oct 23, 2012, 1:44:04 PM10/23/12

to

bagel wrote:
> How about this <http://code.google.com/p/esl/>:
>
> proc divmod(a: _uint64, b: _uint32): _uint32, _uint32
> { var div, mod: _uint32;
>
> div, mod = a/b, a%b;
> return div, mod;
> }
>
> Which generates this for x86-64:
>
> divmod:
> movl %esi, %ecx
> movq %rdi, %rax
> xorl %edx, %edx
> divq %rcx
> ret

That's broken!

The XOR EDX,EDX means that it is doing a 32/32 division, instead of
64/32 as required. :-(

Stephen Sprunk

unread,

Oct 23, 2012, 2:15:46 PM10/23/12

to

On 23-Oct-12 12:44, Terje Mathisen wrote:

> bagel wrote:
>> divmod:
>> movl %esi, %ecx
>> movq %rdi, %rax
>> xorl %edx, %edx
>> divq %rcx
>> ret
>
> That's broken!
>
> The XOR EDX,EDX means that it is doing a 32/32 division, instead of
> 64/32 as required. :-(

Mixing register sizes obfuscates what's going on here, but it's correct.

DIV RCX divides RDX:RAX by RCX. Since our dividend is 64-bit, we put it
in RAX and clear RDX.

A store to EDX is defined to clear the upper 32 bits of RDX, while the
XOR itself clears the lower 32 bits. One could write XOR RDX, RDX, but
that requires adding a REX prefix for no benefit.

Likewise, a store to ECX is defined to clear the upper 32 bits of RCX,
while the MOV puts the divisor in the lower 32 bits. One could write
MOV RCX, RSI, but that requires adding a REX prefix for no benefit.

Terje Mathisen

unread,

Oct 24, 2012, 1:21:25 AM10/24/12

to

Stephen Sprunk wrote:
> On 23-Oct-12 12:44, Terje Mathisen wrote:
>> bagel wrote:
>>> divmod:
>>> movl %esi, %ecx
>>> movq %rdi, %rax
>>> xorl %edx, %edx
>>> divq %rcx
>>> ret
>>
>> That's broken!
>>
>> The XOR EDX,EDX means that it is doing a 32/32 division, instead of
>> 64/32 as required. :-(
>
> Mixing register sizes obfuscates what's going on here, but it's correct.

Oops, since the first operation was a 32-bit move from ESI to ECX (i.e.
the divisor) and the XOR also used a 32-bit register, I assumed we were
still talking about 32-bit code.

Mea Culpa!

OTOH, as soon as we start using 64-bit registers, the relevant operation
is a 128/64->(64,64) division!

>
> DIV RCX divides RDX:RAX by RCX. Since our dividend is 64-bit, we put it
> in RAX and clear RDX.
>
> A store to EDX is defined to clear the upper 32 bits of RDX, while the
> XOR itself clears the lower 32 bits. One could write XOR RDX, RDX, but
> that requires adding a REX prefix for no benefit.
>
> Likewise, a store to ECX is defined to clear the upper 32 bits of RCX,
> while the MOV puts the divisor in the lower 32 bits. One could write
> MOV RCX, RSI, but that requires adding a REX prefix for no benefit.

OK, I obviously haven't assimilated all the rules for mixing 32-bit and
64-bit operations, mostly due to Microsoft banning inline asm in 64-bit
mode. :-(

Stephen Sprunk

unread,

Oct 24, 2012, 11:43:40 AM10/24/12

to

On 24-Oct-12 00:21, Terje Mathisen wrote:
> Stephen Sprunk wrote:
>> On 23-Oct-12 12:44, Terje Mathisen wrote:
>>> bagel wrote:
>>>> divmod:
>>>> movl %esi, %ecx
>>>> movq %rdi, %rax
>>>> xorl %edx, %edx
>>>> divq %rcx
>>>> ret
>>>
>>> That's broken!
>>>
>>> The XOR EDX,EDX means that it is doing a 32/32 division, instead
>>> of 64/32 as required. :-(
>>
>> Mixing register sizes obfuscates what's going on here, but it's
>> correct.
>
> Oops, since the first operation was a 32-bit move from ESI to ECX
> (i.e. the divisor) and the XOR also used a 32-bit register, I assumed
> we were still talking about 32-bit code.
>
> Mea Culpa!
>
> OTOH, as soon as we start using 64-bit registers, the relevant
> operation is a 128/64->(64,64) division!

Correct. Technically, one could say that's not what is specified, but
it gives the same results for any values that would be meaningful in a
64/32 division, without having to split RAX into EDX:EAX.

>> DIV RCX divides RDX:RAX by RCX. Since our dividend is 64-bit, we
>> put it in RAX and clear RDX.

Note that this only works because DIV is unsigned division; if one were
using IDIV, CQO would be used to sign-extend RAX to RDX:RAX.

(CQO parallels CDQ, which sign-extends EAX to EDX:EAX, and CWD, which
sign-extends AX to DX:AX.)

>> A store to EDX is defined to clear the upper 32 bits of RDX, while
>> the XOR itself clears the lower 32 bits. One could write XOR RDX,
>> RDX, but that requires adding a REX prefix for no benefit.
>>
>> Likewise, a store to ECX is defined to clear the upper 32 bits of
>> RCX, while the MOV puts the divisor in the lower 32 bits. One
>> could write MOV RCX, RSI, but that requires adding a REX prefix for
>> no benefit.
>
> OK, I obviously haven't assimilated all the rules for mixing 32-bit
> and 64-bit operations, mostly due to Microsoft banning inline asm in
> 64-bit mode. :-(

The only rule relevant here is that a 32-bit register is zero-extended
to 64 bits. (MOVSXD does sign extension, if that's what you need.)

This is notable because it is _not_ how partial registers are handled in
16- and 32-bit code, which frequently results in stalls.

Piotr Wyderski

unread,

Oct 24, 2012, 2:34:37 PM10/24/12

to

Michael S wrote:

> Load latency also shouldn't be involved, at least on more advanced
> processors, because the case is extremely easy for load-to-store
> forwarding hardware.

Tell this to the SPARC "architects". Even on an Ultrasparc 4+
putting the loop counter within a memory variable results in
~3 times slowdown of a no-op loop.

Best regards, Piotr

Piotr Wyderski

unread,

Oct 24, 2012, 2:39:31 PM10/24/12

to

Andy (Super) Glew wrote:

> However, eliding the store if there is another store to the same address
> is done.

So, is it, at least in principle, possible for a processor to never
issue a store (as seen from the outside) via continuous elision?

volatile int* p;

for(;;) {

*p = 42;
}

Best regards, Piotr

Joe keane

unread,

Oct 24, 2012, 3:07:36 PM10/24/12

to

In article <k61jjr$vfs$1...@speranza.aioe.org>,

Ivan Godard <igo...@pacbell.net> wrote:
>If you look at the machine code for a struct return you will discover
>that they are returned via a pointer regardless of what it looks like
>in the source.

It is good to follow the ABI because:

a) if you screw up calling functions, your code won't work

b) it is nice to be able to call across languages

c) even with the same language and compiler, it is still a pain in the
rear to have multiple calling conventions

d) even if you write the function in assembler, it is nice to be able to
switch it with a C function

f) if you are that worried about function call overhead, don't call
functions so many times duh

Terje Mathisen

unread,

Oct 24, 2012, 3:53:09 PM10/24/12

to

That sounds exactly right on other cpus as well, doesn't it?

next:
dec ecx
jnz next

must obviously take one cycle/iteration, at least for reasonably large
loop counts.

nextmem:
mov ecx,[counter]
dec ecx
mov [counter],ecx
jnz nextmem

In that code 3 cycles is the absolute minimum, since everything except
the branch is sequentially dependent upon the previous instruction.

Stephen Sprunk

unread,

Oct 25, 2012, 5:11:42 PM10/25/12

to

The SysV ABI for i386 says this:

"If a function returns a structure or union, then the caller provides
space for the return value and places its address on the stack as
argument word zero. In effect, this address becomes a ‘‘hidden’’ first
argument. Having the caller supply the return object’s space allows
re-entrancy.

"A function that returns a structure or union also sets %eax to the
value of the original address of the caller’s area before it returns.
Thus when the caller receives control again, the address of the returned
object resides in register %eax and can be used to access the object."

The SysV ABI for x86-64 is more complicated; it says that structures of
up to "four eightbytes" can be returned via registers, but it only
assigns two registers (%rax and %rdx) to hold such results. It's not
clear where the third and fourth "eightbyte" should go. Structures
larger than "four eightbytes" are handled similarly to i386.

Robert Wessel

unread,

Oct 25, 2012, 8:44:10 PM10/25/12

to

On Thu, 25 Oct 2012 16:11:42 -0500, Stephen Sprunk

>argument word zero. In effect, this address becomes a 荘hidden鋳 first
>argument. Having the caller supply the return object痴 space allows

>re-entrancy.
>
>"A function that returns a structure or union also sets %eax to the

>value of the original address of the caller痴 area before it returns.

>Thus when the caller receives control again, the address of the returned
>object resides in register %eax and can be used to access the object."
>
>The SysV ABI for x86-64 is more complicated; it says that structures of
>up to "four eightbytes" can be returned via registers, but it only
>assigns two registers (%rax and %rdx) to hold such results. It's not
>clear where the third and fourth "eightbyte" should go. Structures
>larger than "four eightbytes" are handled similarly to i386.

My memory may be failing me, but I think that's correct for the system
ABI, but the 8-byte returns in edx/eax is correct for GCC for
functions using __attribute__((fastcall)).

Terje Mathisen

unread,

Oct 26, 2012, 1:05:51 AM10/26/12

to

Stephen Sprunk wrote:
> "A function that returns a structure or union also sets %eax to the

> value of the original address of the callerâ€™s area before it returns.

> Thus when the caller receives control again, the address of the returned
> object resides in register %eax and can be used to access the object."
>
> The SysV ABI for x86-64 is more complicated; it says that structures of
> up to "four eightbytes" can be returned via registers, but it only
> assigns two registers (%rax and %rdx) to hold such results. It's not
> clear where the third and fourth "eightbyte" should go. Structures
> larger than "four eightbytes" are handled similarly to i386.

That is still perfect for 64x64->128 MULs and 128/6->(64,64) DIVs. :-)

MitchAlsup

unread,

Oct 27, 2012, 12:08:19 AM10/27/12

to

On Thursday, October 25, 2012 4:11:43 PM UTC-5, Stephen Sprunk wrote:

In the 88K ABI, we allowed for 4 word return result to be passed back
in registers. This covers a whole lot of structure returns (dynamically)
withour resoring to memory refs.

Mitch

Piotr Wyderski

unread,

Oct 30, 2012, 3:22:19 PM10/30/12

to

Terje Mathisen wrote:

> In that code 3 cycles is the absolute minimum, since everything except
> the branch is sequentially dependent upon the previous instruction.

On x64 it worked with little or no visible difference. On Sparc
it was much slower. It was several years ago, so I don't remember
all the exact reasons for it to be so, but it was my task to find
the troublemaker. And it was no/defunct forwarding network in Sparc.

Best regards, Piotr