
random ISA idea...


BGB

Oct 27, 2011, 7:03:26 PM
dunno about existing technologies, but I was thinking some and had a few
misc ideas (random; I will probably not do much with them, given I am
purely a programmer and not a hardware person).


x86 is nice, but not particularly efficient on small/simple cores (so, it
is good for a main processor, but not so good if one wants lots of small
cores).

ARM works better on small cores, but in its classical form wastes
memory, and in its Thumb/Thumb2 form has gotten a bit nasty looking.
Thumb is also difficult to interpret efficiently on other targets (much
bit-twiddling), and is a little hairy at the ASM level.


so, a few goals:
fairly simple ISA;
compact yet regular representation;
should be possible to implement simply in hardware.

and a few restrictions:
all data values are aligned;
all instructions are aligned;
only a single memory access per operation (load or store).


an imagined design:
CPU has two buses, one for data and one for code, each with a separate
set of address lines (presumably connected to a common memory-space though);
presumably, memory can be accessed in a single clock (cores connected to
high-speed SRAM);
both data and address buses are 64 bits;
like data values, instructions may be 1/2/4/8 bytes, and are kept
aligned much like data;
opcodes are 8 or 16 bits;
opcodes may access a memory operand and a register at the same time;
maybe, x86-like memory addressing may be used [base+scale*index+disp];
there will be 16 registers;
unused opcode spots will be set to nop;
the behavior of a misaligned opcode will be undefined.

regs:
r0-r11: GPRs
r12/rsp: stack-pointer
r13/rbp: base-pointer
r14/rlr: link-register
r15/rip: instruction-pointer.

loads into r15 will encode many forms of 'jmp'.


example basic opcode format:
wwggxxxx: ww=instruction-width, gg=group, xxxx=opcode

ww: 00=8-bit, 01=16-bit, 10=32-bit, 11=64-bit
gg: 00/01/10=basic opcodes (opcode in 'xxxx'); 11=extended opcode,
where 'xxxx' gives the high 4 bits and the following byte gives the low
8 bits, forming a 12-bit opcode number.

note:
0011xxxx: invalid/reserved (128-bit opcodes?...).
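decoding the leading byte would be fairly trivial; a rough C sketch
(helper names invented, field layout as above):

```c
#include <stdint.h>

/* rough decode of the 'wwggxxxx' leading byte (helper names invented) */
static int insn_width(uint8_t b) {      /* ww: 00=1, 01=2, 10=4, 11=8 bytes */
    return 1 << (b >> 6);
}
static int insn_group(uint8_t b) {      /* gg */
    return (b >> 4) & 3;
}
/* gg=00/01/10: opcode is 'xxxx'; gg=11: 'xxxx' holds the high 4 bits
   of a 12-bit opcode number, the following byte the low 8 bits. */
static int insn_opcode(uint8_t b0, uint8_t b1) {
    if (((b0 >> 4) & 3) == 3)
        return ((b0 & 0x0F) << 8) | b1;
    return b0 & 0x0F;
}
```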


8-bit base opcodes:
0=nop;
1=ret ("mov rip, rlr");
2=ret2 (?, "mov rip, [rsp]; add rsp, 8")
...

16-bit base opcodes;
0=nop;
1=mov reg, reg;
2=jmp disp8;
3=jmp reg;
4=add reg, reg;
5=sub reg, reg;
...

so, 16-bit reg/reg form:
01ggxxxx rrrrbbbb

so, 16-bit reg form:
01ggxxxx rrrr0000

32-bit base opcodes;
0=nop;
1=mov reg, [mem];
2=mov [mem], reg;
3=lea reg, [mem] / mov reg, imm24;
4=add reg, [mem];
5=sub reg, [mem];
...

so, 32-bit reg/mem form:
10ggxxxx rrrrmmss bbbbiiii dddddddd dddddddd
10ggxxxx rrrr11ss eeeeeeee dddddddd dddddddd

rrrr=reg;
bbbb=base;
mm=mode (00=[base+sc*index+disp16], 01=[base+disp16], 10=resv, 11=[abs24]);
ss=scale (00=1, 01=2, 10=4, 11=8);
iiii=index;
dddd...=displacement (low 16-bits);
eeee...=displacement (high 8 bits).

for loads/stores, the scale implicitly gives the in-memory operand size
(sign-extended for normal arithmetic operators).
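pulling apart these fields is likewise simple; a rough C sketch (helper
names invented, effective-address shown for the mode-00 form):

```c
#include <stdint.h>

/* sketch of decoding the 'rrrrmmss' and 'bbbbiiii' bytes of the
   32-bit reg/mem form (helper names invented) */
static int f_reg(uint8_t b)   { return b >> 4; }        /* rrrr */
static int f_mode(uint8_t b)  { return (b >> 2) & 3; }  /* mm   */
static int f_scale(uint8_t b) { return 1 << (b & 3); }  /* ss: 1/2/4/8 */
static int f_base(uint8_t b)  { return b >> 4; }        /* bbbb */
static int f_index(uint8_t b) { return b & 0x0F; }      /* iiii */

/* effective address for mode 00: [base+sc*index+disp] */
static int64_t ea(int64_t base, int scale, int64_t index, int64_t disp) {
    return base + (int64_t)scale * index + disp;
}
```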


the 64-bit reg/mem forms would be similar, except supporting bigger
displacements and similar.

possible compound ops:
call mem ("mov rlr, rip; lea rip, [mem]");
call2 mem ("sub rsp, 8; mov [rsp], rip; lea rip, [mem]");
push reg ("sub rsp, 8; mov [rsp], reg");
pop reg ("mov reg, [rsp]; add rsp, 8");
...


dunno, any comments?...

Paul A. Clayton

Oct 27, 2011, 8:12:10 PM
On Oct 27, 6:03 pm, BGB <cr88...@hotmail.com> wrote:
[snip outline of new ISA]
> dunno, any comments?...

A few quick comments.

1. Quoting Mitch Alsup: 'History records that
PC in the GPRs is "overly uniform." '

2. Aligned variable length instructions probably do
not make much sense. Data structure packing has
much more flexibility (order does not have semantic
content); forcing the use of paired smaller instructions
to achieve alignment seems undesirable.

3. I suspect the utility of 8-bit instructions is
relatively limited and that a better choice of
instruction lengths would be 2, 4, 6, and 8 bytes.
(There might be some desire to support 64-bit
immediates, so instructions larger than 8 bytes
might be appropriate.) To maintain density, some
2 byte instructions may do more work (e.g.,
rather than having a 1 byte RETURN, provide a 2 byte
RETURN-AND-ADJUST-STACK-BY-IMMEDIATE).

4. If instructions and data use separate
address spaces, it might be convenient to provide
data reads and stores into instruction space.

5. The encoding of lengths seems suboptimal in
that short instructions suffer a larger percent
overhead. (My guess is that using one encoding
of the first two bits of a 2-byte parcel to
indicate a longer encoding would be better. It
might also be worthwhile to use the last two bits
of the second parcel of a longer instruction to
indicate extension beyond 4 bytes and use the
encoding for instruction length in any third
parcel to indicate an additional 2 or 4 bytes.
Such an encoding may slightly simplify parsing.)

6. It may well be desirable to allow shorter
instructions to address fewer registers, perhaps
even looking ahead to extensions with 64 or 128
registers. It may also be desirable for
register names to be in the same place for all
instructions. (The encoding you present places
mode and scale in the position of the b register
name.) I receive the impression that extracting
register names (particularly for source operands)
is more timing critical than extracting immediates
(except perhaps for branches) and minor opcodes.

(I guess I need to work on defining my own ISA!)


Brett Davis

Oct 27, 2011, 11:10:47 PM
In article
<37c0f9ae-e973-4ac7...@eb8g2000vbb.googlegroups.com>,
"Paul A. Clayton" <paaron...@gmail.com> wrote:

> 3. I suspect the utility of 8-bit instructions is
> relatively limited and that a better choice of
> instruction lengths would be 2, 4, 6, and 8 bytes.
> (There might be some desire to support 64-bit
> immediates, so instructions larger than 8 bytes
> might be appropriate.)

Take a look at how PowerPC creates immediates.
A short constant rotated up to 63 bits and optionally inverted.
Also the related masked bitfield insert/extract.
This covers 95% of the long constants you need.
The last 5% you just build with two opcodes,
not worth it to have instructions for those cases.
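roughly, in C (a sketch of the idea only, not the actual PowerPC
encoding; names invented):

```c
#include <stdint.h>

/* sketch: materialize a 64-bit constant from a short base value,
   a rotate count, and an optional invert (the idea, not the real
   PowerPC instruction format) */
static uint64_t rotl64(uint64_t x, unsigned sh) {
    sh &= 63;
    return sh ? (x << sh) | (x >> (64 - sh)) : x;
}
static uint64_t make_imm(uint16_t base, unsigned sh, int invert) {
    uint64_t v = rotl64((uint64_t)base, sh);
    return invert ? ~v : v;
}
```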

BGB

Oct 28, 2011, 2:51:02 AM
On 10/27/2011 5:12 PM, Paul A. Clayton wrote:
> On Oct 27, 6:03 pm, BGB<cr88...@hotmail.com> wrote:
> [snip outline of new ISA]
>> dunno, any comments?...
>
> A few quick comments.
>
> 1. Quoting Mitch Alsup: 'History records that
> PC in the GPRs is "overly uniform." '
>

possibly, but PC as a GPR makes it a lot easier to handle PIC code.

this part of the design was partly influenced by ARM.



> 2. Aligned variable length instructions probably do
> not make much sense. Data structure packing has
> much more flexibility (order does not have semantic
> content); forcing the use paired smaller instructions
> to achieve alignment seems undesirable.
>

but, it does allow a higher code density than, say, always using 32 or
64 bit units, and is less hairy IMO than either Thumb-like or VLIW
encodings.

a partial advantage is that, to the chip itself, the use of
variable-width instructions can be much the same as with fixed-width
instructions.

say, the bus always pulls in 64 bits, but the low 3-bits mostly serve to
swap around the bytes in the word. also, it ensures that instructions
can always be loaded with a single load operation (vs plain byte-wise
operations which may require multiple memory loads).

internally, it shouldn't be too much different from the case where the
instruction was always fixed-width, since presumably the opcode/...
always ends up in the correct spot in the read-in bus-word, and any
bytes past the end of the instruction are ignored.
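roughly, in C (a sketch; assumes a little-endian host so the opcode
byte lands in bits 0..7 of the shifted word):

```c
#include <stdint.h>
#include <string.h>

/* sketch: fetch one aligned instruction from a 64-bit bus word.
   because instructions never cross an 8-byte boundary, a single load
   always captures the whole instruction; the low 3 PC bits just pick
   the starting byte lane within the word. little-endian host assumed. */
static uint64_t fetch_insn(const uint8_t *mem, uint64_t pc) {
    uint64_t word;
    memcpy(&word, mem + (pc & ~7ULL), 8);   /* one 64-bit bus load */
    unsigned off = (unsigned)(pc & 7);      /* low 3 bits: byte lane */
    return word >> (8 * off);               /* opcode byte in bits 0..7 */
}
```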


> 3. I suspect the utility of 8-bit instructions is
> relatively limited and that a better choice of
> instruction lengths would be 2, 4, 6, and 8 bytes.
> (There might be some desire to support 64-bit
> immediates, so instructions larger than 8 bytes
> might be appropriate.) To maintain density, some
> 2 byte instructions may do more work (e.g.,
> rather than having a 1 byte RETURN provide a 2 byte
> RETURN-AND-ADJUST-STACK-BY-IMMEDIATE)
>

"ret imm" exists on x86, but is not generally used, as caller cleanup is
pretty much standard. this is also what I had assumed.

I have doubts as to whether or not push/pop are needed, since generally
(in a compiler) one will simply adjust the stack frame by a constant
amount and use "mov" instructions.


now, as for full 64-bit loads:
this will probably require an instruction pair (likely two 32-bit
loads), or a load from memory.

granted, ARM essentially has the same issue.

in the vast majority of cases, a single 24 or 48 bit load would be
sufficient.


> 4. If instructions and data use separate
> address spaces, it might be convenient to provide
> data reads and stores into instruction space.
>

I had assumed both would be in the same address space; however,
different buses would be used because, in the absence of local cache
memory, having to perform both code and data operations on the same bus
could limit throughput.

however, it is likely that these cores would be connected to SRAM, which
would be in-turn linked to DRAM, probably with the DRAM using a single
bus for both code and data.

however, there is some uncertainty, as it seems unlikely that a CPU
could process an instruction and request the memory access in a single
clock-cycle, so it is possible that the split bus may not notably help
with performance (absent some form of pipelining).

I was also left considering some issues with IO signalling, which I
realized could probably be handled with a few logic gates and by
suspending internal clock-ticks while waiting for memory to arrive.


> 5. The encoding of lengths seems suboptimal in
> that short instructions suffer a larger percent
> overhead. (My guess is that using one encoding
> of the first two bits of a 2-byte parcel to
> indicate a longer encoding would be better. It
> might also be worthwhile to use the last two bits
> of the second parcel of an longer instruction to
> indicate extension beyond 4 bytes and use the
> encoding for instruction length in any third
> parcel to indicate an additional 2 or 4 bytes.
> Such an encoding may slightly simplify parsing.)
>

potentially, however I was also considering the possibility of direct
interpretation, whereby the ability to simply shove the first byte into
a "switch()" statement and go from there would make the most sense.
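e.g., a toy switch()-based dispatch loop (C sketch; opcode bytes
loosely follow the 16-bit base list above with ww=01/gg=00, i.e.
0x40|op, second byte 'rrrrbbbb'; everything else is invented):

```c
#include <stdint.h>

/* toy direct interpreter: the leading byte of each instruction feeds
   a switch(). opcodes follow the 16-bit base list sketched earlier
   (0x40|op for ww=01, gg=00); anything unknown halts. */
enum { OP_NOP = 0x40, OP_MOV = 0x41, OP_ADD = 0x44, OP_SUB = 0x45 };

static void run(const uint8_t *code, int64_t *r) {
    uint64_t pc = 0;
    for (;;) {
        uint8_t op = code[pc], rb;
        switch (op) {
        case OP_NOP: pc += 2; break;
        case OP_MOV: rb = code[pc+1]; r[rb >> 4]  = r[rb & 15]; pc += 2; break;
        case OP_ADD: rb = code[pc+1]; r[rb >> 4] += r[rb & 15]; pc += 2; break;
        case OP_SUB: rb = code[pc+1]; r[rb >> 4] -= r[rb & 15]; pc += 2; break;
        default:     return;            /* treat as halt for the sketch */
        }
    }
}
```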

given that the length also encodes part of the instruction semantics
and form, it is not simply overhead, as this information would have
been needed either way (in an ISA such as x86, it is spread over a
number of fields).

I was also thinking it is not entirely redundant with alignment, since
it allows instruction forms shorter than the current alignment.


> 6. It may well be desirable to allow shorter
> instructions to address fewer registers, perhaps
> even looking ahead to extensions with 64 or 128
> registers. It may also be desirable for
> register names to be in the same place for all
> instructions. (The encoding you present places
> mode and scale in the position of the b register
> name.) I receive the impression that extracting
> register names (particularly for source operands)
> is more timing critical than extracting immediates
> (except perhaps for branches) and minor opcodes.
>

the 16 and 32-bit forms represent different instruction forms (or
essentially different opcodes), and so presumably should not conflict.

the reason for putting things as they were was so that the "eeee" field
would not be split into multiple locations, which would be a much bigger
issue if it were being run in an interpreter.


> (I guess I need to work on defining my own ISA!)
>

dunno...

I just wrote about something idle I was thinking about.

Terje Mathisen

Oct 28, 2011, 4:54:05 AM
BGB wrote:
> On 10/27/2011 5:12 PM, Paul A. Clayton wrote:
>> On Oct 27, 6:03 pm, BGB<cr88...@hotmail.com> wrote:
>> [snip outline of new ISA]
>>> dunno, any comments?...
>>
>> A few quick comments.
>>
>> 1. Quoting Mitch Alsup: 'History records that
>> PC in the GPRs is "overly uniform." '
>>
>
> possibly, but PC as a GPR makes it a lot easier to handle PIC code.

Only if/when you have data stored alongside all your code:

The main valid usage for this is to have (static/read-only) jump tables
in code space.

As long as you have PC-relative branches and calls, PIC is quite cheap:
You just need a supported method for copying the PC into a GPR, limiting
the real cost to a single more or less dedicated register.

Besides, PIC code is also more easily exploitable code; ASLR, along
with hw to disable execution of read-only data, seems to move in the
opposite direction.
>> 5. The encoding of lengths seems suboptimal in
>> that short instructions suffer a larger percent
>> overhead. (My guess is that using one encoding
>> of the first two bits of a 2-byte parcel to
>> indicate a longer encoding would be better. It
>> might also be worthwhile to use the last two bits
>> of the second parcel of an longer instruction to
>> indicate extension beyond 4 bytes and use the
>> encoding for instruction length in any third
>> parcel to indicate an additional 2 or 4 bytes.
>> Such an encoding may slightly simplify parsing.)
>>
>
> potentially, however I was also considering the possibility of direct
> interpretation, whereby the ability to simply shove the first byte into
> a "switch()" statement and go from there would make the most sense.

This form of emulation is getting more and more rare; it seems like
binary translation is what everyone is doing these days.

Even if you do start out with plain opcode-by-opcode emulation, you only
do this long enough to determine which parts of the code need to be JIT
translated.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stefan Monnier

Oct 28, 2011, 9:17:56 AM
> say, the bus always pulls in 64 bits, but the low 3-bits mostly serve to
> swap around the bytes in the word. also, it ensures that instructions can
> always be loaded with a single load operation (vs plain byte-wise operations
> which may require multiple memory loads).

BTW, I always assumed that newer ISAs would go to fixed-length
instructions of either 64bit or 128bit. Of course, such instructions
would include various operations (à la VLIW), but compared to VLIW, I'd
expect that those operations would depend on each other (e.g. you'd
have 2 ALU ops in the form R4 <= (R1 <op1> R2) <op2> R3).
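such a chained pair could be sketched in C as (encoding fields and
names invented):

```c
#include <stdint.h>

/* sketch of a chained two-op instruction: rd <= (ra op1 rb) op2 rc.
   the encoding/struct here is invented for illustration. */
typedef struct { int op1, op2, ra, rb, rc, rd; } fused_op;
enum { F_ADD, F_SUB, F_AND, F_XOR };

static int64_t alu(int op, int64_t x, int64_t y) {
    switch (op) {
    case F_ADD: return x + y;
    case F_SUB: return x - y;
    case F_AND: return x & y;
    default:    return x ^ y;
    }
}

/* the second ALU op consumes the first op's result, so the two ops
   are dependent (unlike classic VLIW slots) */
static void exec_fused(const fused_op *f, int64_t *r) {
    r[f->rd] = alu(f->op2, alu(f->op1, r[f->ra], r[f->rb]), r[f->rc]);
}
```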


Stefan

Paul A. Clayton

Oct 28, 2011, 11:08:31 AM
On Oct 28, 3:54 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
[snip]
> Besides, PIC code is also more easily exploitable code, ASLR along with
> hw to disable execution of read-only data seems to move in the opposite
> direction.

I assume you meant "disable execution of writable data".

From a quick read of the wikipedia article, it seems that
ASLR techniques do not currently use procedure-level
scrambling. It is not clear how much benefit this would
have relative to the cost (though for personal computers,
even in-storage randomized layout would provide some
additional target diversity). (An extremely complex
system might use garbage collection techniques to remix
in-memory executables, but that is probably going too
far for minimal benefit--and possible loss when the
random layout happens to match one of the targeted
layouts?)

Out of curiosity, would there be any benefit to using the
otherwise unused virtual address bits (x86-64 currently
only uses 48 bits) as a key? This would prevent software
from using a pointer in one 'segment' to trivially
generate a valid address in another 'segment'. A
contiguous structure could not occupy multiple segments,
but much data could be scattered somewhat randomly
without reducing spatial locality of the virtual address
space. My guess is that this would provide negligible
benefit, but I thought I would share it anyway.
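In C terms, the key-in-the-upper-bits part might look like this (bit
positions and key width are made up for illustration):

```c
#include <stdint.h>

/* sketch: stash a key in the otherwise-unused upper bits of a 48-bit
   virtual address, so a pointer in one 'segment' cannot trivially be
   turned into a valid address in another. widths are invented. */
#define VA_BITS 48
#define VA_MASK ((1ULL << VA_BITS) - 1)

static uint64_t tag_ptr(uint64_t va, uint16_t key) {
    return (va & VA_MASK) | ((uint64_t)key << VA_BITS);
}
static uint16_t ptr_key(uint64_t p) { return (uint16_t)(p >> VA_BITS); }
static uint64_t strip_tag(uint64_t p) { return p & VA_MASK; }
```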

Marven Lee

Oct 28, 2011, 11:24:33 AM

Stefan Monnier wrote:
Sounds like Wulf's WM processor? I had only read about it yesterday.
I think I remember reading that two of its registers were FIFOs, one
for memory access and another for something else.

Could a 2 ALU op not fit into 32 bits?

--
Marv


BGB

Oct 28, 2011, 11:48:22 AM
VLIW is an option, but has a few drawbacks.
also, VLIW makes a lot more sense for "higher-end" cores, which can
execute operations in parallel and similar (say, as a few fancy cores
which run the whole computer).

however, VLIW also puts the complexities of CPU-specific
instruction-scheduling issues in the hands of the compiler, which is
not desirable.


for what I was imagining, it would be a simple "one-at-a-time" execution
path (although, as currently imagined I am ending up with around 4 clock
cycles per-instruction, granted this is absent a pipeline, whereas a 4
stage pipeline or similar could probably address this issue).

(but, all this leads to some operations likely still needing more clock
cycles, meaning probably still being stuck with variable cycles per
instruction, hmm...).

almost may as well use a single bus for code and data, since absent a
pipeline it is unlikely the core would need to access both code and data
at once.


the idea would be to keep the cores simple, so that then one can have
large numbers of them (say, ideally, if one could fit 512 or 1024 cores
on a die). as a result, each core would likely lack any sort of
floating-point or SIMD capabilities as well (all real-valued math is
either fixed-point or emulated to save die space...).

the reason for the thing then being 64-bits would be to allow for large
data-sets.

now as for ALU requirements, very possibly the ALU would be used
sequentially. multiple ALUs is also possible, with a larger "generic"
ALU for arithmetic (add/multiply), and maybe a smaller "adder unit"
dedicated to instructions and memory addressing.

possibly, the instruction-pointer update could also be an adder for
the low 4 bits, and an increment-unit for the rest.


granted, yes, this is all hypothetical at the moment.


>
> Stefan

EricP

Oct 28, 2011, 12:04:34 PM
BGB wrote:
> On 10/27/2011 5:12 PM, Paul A. Clayton wrote:
>> On Oct 27, 6:03 pm, BGB<cr88...@hotmail.com> wrote:
>> [snip outline of new ISA]
>>> dunno, any comments?...
>>
>> A few quick comments.
>>
>> 1. Quoting Mitch Alsup: 'History records that
>> PC in the GPRs is "overly uniform." '
>>
>
> possibly, but PC as a GPR makes it a lot easier to handle PIC code.

It depends what you mean by 'PC as a GPR'.
It sounds like you want PC relative addressing,
or some other way to easily access the current PC,
say a MOV Rn, PC instruction which then can be used.
That is all fine (PC rel is best because you can
use a LEA Rn, [const+PC] to do a MOV Rn, PC).

But some architectures, e.g. the VAX, had 16 general registers
and defined R15 as the program counter (and R14 as stack pointer,
R13 as frame pointer, R12 as argument pointer).

Having PC in R15 brings nothing to the table functionally,
and causes major headaches for getting the hardware to run fast
because the update of the register banks is always out of
phase with using the PC to fetch instructions.

>
> this part of the design was partly influenced by ARM.
>
>
>
>> 2. Aligned variable length instructions probably do
>> not make much sense. Data structure packing has
>> much more flexibility (order does not have semantic
>> content); forcing the use paired smaller instructions
>> to achieve alignment seems undesirable.
>>
>
> but, it does allow a higher code density than, say, always using 32 or
> 64 bit units, and is less hairy IMO than either Thumb-like or VLIW
> encodings.
>
> a partial advantage is that, to the chip itself, the use of
> variable-width instructions can be much the same as with fixed-width
> instructions.
>
> say, the bus always pulls in 64 bits, but the low 3-bits mostly serve to
> swap around the bytes in the word. also, it ensures that instructions
> can always be loaded with a single load operation (vs plain byte-wise
> operations which may require multiple memory loads).
>
> internally, it shouldn't be too much different from the case where the
> instruction was always fixed-width, since presumably the opcode/...
> always ends up in the correct spot in the read-in bus-word, and any
> bytes past the end of the instruction are ignored.

I believe you are talking about having instructions that
are variable length but always multiples of, say, 32 bits
naturally aligned.

Sure you can do that, but what does it add and what does it cost?

Variable width instructions are fine.
They were heresy for a while, but they're cool again.

But for the fixed size of 32 bits if you work through the
instruction encodings, you wind up with lots of zeros
padding out the instructions. And zeros are waste.
When you start to compress out the waste you wind up
with variable length, byte aligned instructions.

>> 3. I suspect the utility of 8-bit instructions is
>> relatively limited and that a better choice of
>> instruction lengths would be 2, 4, 6, and 8 bytes.
>> (There might be some desire to support 64-bit
>> immediates, so instructions larger than 8 bytes
>> might be appropriate.) To maintain density, some
>> 2 byte instructions may do more work (e.g.,
>> rather than having a 1 byte RETURN provide a 2 byte
>> RETURN-AND-ADJUST-STACK-BY-IMMEDIATE)
>>
>
> "ret imm" exists on x86, but is not generally used, as caller cleanup is
> pretty much standard. this is also what I had assumed.

Ummm... no. Callee cleanup is widely used.
MS C compiler supports both, and callee is the ABI standard
as that produces the smallest code for the usual case
of routines with a fixed arg list. And it does use RET imm.

>
> I have doubts as to whether or not push/pop are needed, since generally
> (in a compiler) one will simply adjust the stack frame by a constant
> amount and use "mov" instructions.

Six of one, half dozen of the other.
PUSH and POP saves instruction bits by implying a register
specifier and immediate offset, and so is more compact.
But multiple pushes do multiple subtracts and
use more internal resources.

On the other hand, SUB SP, #const, MOV [SP-offset], Rn
can keep the stack nicely aligned, and only does one subtract.

Eric

BGB

Oct 28, 2011, 12:30:57 PM
On 10/28/2011 1:54 AM, Terje Mathisen wrote:
> BGB wrote:
>> On 10/27/2011 5:12 PM, Paul A. Clayton wrote:
>>> On Oct 27, 6:03 pm, BGB<cr88...@hotmail.com> wrote:
>>> [snip outline of new ISA]
>>>> dunno, any comments?...
>>>
>>> A few quick comments.
>>>
>>> 1. Quoting Mitch Alsup: 'History records that
>>> PC in the GPRs is "overly uniform." '
>>>
>>
>> possibly, but PC as a GPR makes it a lot easier to handle PIC code.
>
> Only if/when you have data stored alongside all your code:
>
> The main valid usage for this is to have (static/read-only) jump tables
> in code space.
>

and, loading larger immediate values from memory.

say, one wants to load a full 64-bit register as an immediate, but this
would require a chain of several instructions to load.
so, one can instead output an indirect load, and pull the value from memory.

otherwise, one would need to do like x86-64, and have a dedicated
"[rip+disp]" memory-access form.


> As long as you have PC-relative branches and calls, PIC is quite cheap:
> You just need a supported method for copying the PC into a GPR, limiting
> the real cost to a single more or less dedicated register.
>

well, it is possible.


> Besides, PIC code is also more easily exploitable code, ASLR along with
> hw to disable execution of read-only data seems to move in the opposite
> direction.

security wasn't really a design consideration here.

also, likely, distinguishing between read/write and execute memory would
either come back to using a parallel bus, or giving cores some way to
see the state of the memory (probably known to the MMU, which in this
case would be external to the cores, and probably sit between the local
SRAM and any external DRAM).

it is also more a question of whether one really needs the ability to
mark memory read-only.

(these cores were not really imagined to be used as primary CPUs, rather
more for bulk data processing).


>>> 5. The encoding of lengths seems suboptimal in
>>> that short instructions suffer a larger percent
>>> overhead. (My guess is that using one encoding
>>> of the first two bits of a 2-byte parcel to
>>> indicate a longer encoding would be better. It
>>> might also be worthwhile to use the last two bits
>>> of the second parcel of an longer instruction to
>>> indicate extension beyond 4 bytes and use the
>>> encoding for instruction length in any third
>>> parcel to indicate an additional 2 or 4 bytes.
>>> Such an encoding may slightly simplify parsing.)
>>>
>>
>> potentially, however I was also considering the possibility of direct
>> interpretation, whereby the ability to simply shove the first byte into
>> a "switch()" statement and go from there would make the most sense.
>
> This form of emulation is getting more and more rare, it seems like
> binary translation is what everyone is doing these days.
>
> Even if you do start out with plain opcode-by-opcode emulation, you only
> do this long enough to determine which parts of the code needs to be JIT
> translated.
>

potentially, but interpreters are not entirely dead yet...
(a trouble with JIT compilers is that they are more of a hassle to
implement and maintain, and one may find that performance is "good
enough" with an interpreter).

not too long ago, I tested an interpreter (or emulator) I had for x86
(written in plain C), inside of a virtual ARM box (again emulated, via
QEMU), inside of a VMware session (CPU virtualization).

I also observed that, despite the slowdown (of an emulated ARM
environment), the interpreter for my custom scripting language remained
reasonably responsive.


several layers down, though, performance was not exactly "impressive", but
it all worked (however, I soon ended up using a cross-compiler to
compile the ARM code, as running GCC within the emulated ARM environment
was unbearably slow...).


granted, yes, probably one wouldn't produce/deploy something like this,
but it is relevant for testing (in an absence of existent real HW).


in real-use, probably the existence both of these cores and their
specific ISA would likely be hidden from most software, and probably
some form of more generic high-level bytecode or similar would be used
for targeting them.

BGB

Oct 28, 2011, 1:28:04 PM
On 10/28/2011 9:04 AM, EricP wrote:
> BGB wrote:
>> On 10/27/2011 5:12 PM, Paul A. Clayton wrote:
>>> On Oct 27, 6:03 pm, BGB<cr88...@hotmail.com> wrote:
>>> [snip outline of new ISA]
>>>> dunno, any comments?...
>>>
>>> A few quick comments.
>>>
>>> 1. Quoting Mitch Alsup: 'History records that
>>> PC in the GPRs is "overly uniform." '
>>>
>>
>> possibly, but PC as a GPR makes it a lot easier to handle PIC code.
>
> It depends what you mean by 'PC as a GPR'.
> It sounds like you want PC relative addressing,
> or some other way to easily access the current PC,
> say a MOV Rn, PC instruction which then can be used.
> That is all fine (PC rel is best because you can
> use a LEA Rn, [const+PC] to do a MOV Rn, PC).
>

yeah, this is an idea.


> But some architecture, e.g. the VAX had 16 general registers
> and defined R15 as the program counter (and R14 as stack pointer,
> R13 as frame pointer, R12 as argument pointer).
>
> Having PC in R15 brings nothing to the table functionally,
> and causes major headaches for getting the hardware to run fast
> because the update of the register banks is always out of
> phase with using the PC to fetch instructions.
>

the HW was not really intended to run fast sequentially, but rather,
by keeping the cores very simple, one could have lots of them.

hence, if many instructions take 4-10 clock-cycles or more, it is no
huge loss.


granted, in the idea I had, there would not be register banking either
(each core would only have those registers, with them having callee or
caller-save status). likewise, there would not be protection rings or a
separate client/supervisor state (so, more like the 8086 in this regard).


I had imagined R0-R11 as GPRs, but R12-R15 would be essentially
special-purpose registers.

however, a full 16 GPRs would also make sense.


>>
>> this part of the design was partly influenced by ARM.
>>
>>
>>
>>> 2. Aligned variable length instructions probably do
>>> not make much sense. Data structure packing has
>>> much more flexibility (order does not have semantic
>>> content); forcing the use paired smaller instructions
>>> to achieve alignment seems undesirable.
>>>
>>
>> but, it does allow a higher code density than, say, always using 32 or
>> 64 bit units, and is less hairy IMO than either Thumb-like or VLIW
>> encodings.
>>
>> a partial advantage is that, to the chip itself, the use of
>> variable-width instructions can be much the same as with fixed-width
>> instructions.
>>
>> say, the bus always pulls in 64 bits, but the low 3-bits mostly serve
>> to swap around the bytes in the word. also, it ensures that
>> instructions can always be loaded with a single load operation (vs
>> plain byte-wise operations which may require multiple memory loads).
>>
>> internally, it shouldn't be too much different from the case where the
>> instruction was always fixed-width, since presumably the opcode/...
>> always ends up in the correct spot in the read-in bus-word, and any
>> bytes past the end of the instruction are ignored.
>
> I believe you are talking about having instructions that
> are variable length but always multiples of, say, 32 bits
> naturally aligned.
>

yeah:
1 byte, always byte-aligned;
2 byte, always even-byte aligned;
4 byte, always on a multiple of 4 bytes;
8 byte, always on a multiple of 8 bytes.


> Sure you can do that, but what does it add and what does it cost?
>
> Variable width instructions are fine.
> They were heresy for a while, but they're cool again.
>
> But for the fixed size of 32 bits if you work through the
> instruction encodings, you wind up with lots of zeros
> padding out the instructions. And zeros are waste.
> When you start to compress out the waste you wind up
> with variable length, byte aligned instructions.
>

then one almost may as well just use an x86 subset.

but, anyways, probably a code-generator or assembler could have some
funky logic to try to organize instructions where possible such that
they are compact.

an example would be if the assembler could determine if several
instructions can be swapped around without affecting behavior, and if
this swapping would improve code density.

the reason for the alignment is, granted, that it would allow always
fetching the entire instruction in a single memory access (whereas with
byte-aligned instructions, one may need to access memory multiple times
to pull in the instruction).


>>> 3. I suspect the utility of 8-bit instructions is
>>> relatively limited and that a better choice of
>>> instruction lengths would be 2, 4, 6, and 8 bytes.
>>> (There might be some desire to support 64-bit
>>> immediates, so instructions larger than 8 bytes
>>> might be appropriate.) To maintain density, some
>>> 2 byte instructions may do more work (e.g.,
>>> rather than having a 1 byte RETURN provide a 2 byte
>>> RETURN-AND-ADJUST-STACK-BY-IMMEDIATE)
>>>
>>
>> "ret imm" exists on x86, but is not generally used, as caller cleanup
>> is pretty much standard. this is also what I had assumed.
>
> Ummm... no. Callee cleanup is widely used.
> MS C compiler supports both, and callee is the ABI standard
> as that produces the smallest code for the usual case
> of routines with a fixed arg list. And it does use RET imm.
>

AFAICT, stdcall is mostly "of historical significance" (used for Windows
API functions and similar), but most code produced by the compilers uses
"cdecl", which uses caller-cleanup.

on Linux on x86, cdecl is pretty much the sole calling convention.

also, both Win64 and SysV/AMD64 use caller-cleanup.

by this definition, callee-cleanup is reasonably dead.

either way, I had assumed that such an architecture would probably use
caller-cleanup only, and probably fixed-form function call frames (where
one reserves N words at the bottom of the caller's stack frame to use
for passing arguments to called functions).


>>
>> I have doubts as to whether or not push/pop are needed, since
>> generally (in a compiler) one will simply adjust the stack frame by a
>> constant amount and use "mov" instructions.
>
> Six of one, half dozen of the other.
> PUSH and POP saves instruction bits by implying a register
> specifier and immediate offset, and so is more compact.
> But multiple pushes does multiple subtracts and
> uses more internal resources.
>
> On the other hand, SUB SP, #const, MOV [SP-offset], Rn
> can keep the stack nicely aligned, and only does one subtract.
>

yeah.

as I imagined it, a "push" could be a 2-byte form, whereas a "mov" would
be a 4-byte form.

this could be sufficient to justify the existence of a push instruction.
I guess the biggie issue is mostly how this would affect transistor
count (I am not certain how much impact there would be, given I am not
all that much of a HW person...).

MitchAlsup

unread,
Oct 28, 2011, 2:07:00 PM10/28/11
to
On Thursday, October 27, 2011 6:03:26 PM UTC-5, BGB wrote:
> x86 is nice, but not particular efficient on small/simple cores (so, it
> is good for a main processor, but not so good if one wants lots of small
> cores).

I disagree that x86 cannot be particularly efficient in small/simple cores.
I do agree that x86 has not been particularly efficient in currently implemented small/simple cores.

> so, a few goals:
> fairly simple ISA;
> compact yet regular representation;
> should be possible to implement simply in hardware.
>
> and a few restrictions:
> all data values are aligned;
> all instructions are aligned;
> only a single memory access per operation (load or store).

After architecting machines in 3 different RISC architectures and 2 CISC architectures, and suffering under the dilemma of having software "FIX" alignment issues via fast traps, I have come to the conclusion that hardware is better at fixing alignment problems--especially those MANDATED by programming languages. More recently, with the packing of multiple values into short vectors (MMX, SSE, AltiVec, ...), attempting to require alignment on these things is worse than the uninitiated can imagine.

> an imagined design:
> CPU has two buses, one for data and one for code, each with a separate
> set of address lines (presumably connected to a common memory-space though);
> presumably, memory can be accessed in a single clock (cores connected to
> high-speed SRAM);
> both data and address buses are 64 bits;

I have seen CPU designs for cell phones using 128-bit buses. I think dual 64-bit buses are going to hamper your design.

> like data values, instructions may be 1/2/4/8 bytes, and are kept
> aligned much like data;

Bad move.

> opcodes are 8 or 16 bits;

Probably a bad move.

> opcodes may access a memory operand and a register at the same time;
> maybe, x86-like memory addressing may be used [base+scale*index+disp];
> there will be 16 registers;

Probably a good move.

> unused opcode spots will be set to nop;
> the behavior of a misaligned opcode will be undefined.

ALL UNUSED opCode space MUST be defined as exception generating. Otherwise you will hate yourself in design #2.

> regs:
> r0-r11: GPRs
> r12/rsp: stack-pointer
> r13/rbp: base-pointer
> r14/rlr: link-register
> r15/rip: instruction-pointer.

Allow the software gurus to define register use (excepting for the register the CALL instruction deposits the return address in.)

I agree with Paul A. Clayton that putting the PC in a GPR is "not such a good plan". Since this register is kept in the front end, and used every friggen cycle, having the back end track it constantly and micro-fault the front end when it is used will be found to be problematic in the bussing (wires) between the front end and the back end. Yes, it can be done, but it can also be done just as efficiently (wires+gates) by leaving the PC as the "front end control point" and bussing a constant (the PC at the time of fetch) down the pipe for when it is needed--you win bigger by leaving it out of the GPRs.

> loads into r15 will encode many forms of 'jmp'.

Bad move.

Mitch

BGB

unread,
Oct 28, 2011, 3:21:09 PM10/28/11
to
On 10/28/2011 11:07 AM, MitchAlsup wrote:
> On Thursday, October 27, 2011 6:03:26 PM UTC-5, BGB wrote:
>> x86 is nice, but not particular efficient on small/simple cores (so, it
>> is good for a main processor, but not so good if one wants lots of small
>> cores).
>
> I disagree that x86 cannot be particularly efficient in small/simple cores.
> I do agree that x86 has not been particularly efficient in currently implemented small/simple cores.
>

granted, if x86 can be used, then my idea is admittedly sort of
pointless, as then the obvious answer is "just use x86"...


>> so, a few goals:
>> fairly simple ISA;
>> compact yet regular representation;
>> should be possible to implement simply in hardware.
>>
>> and a few restrictions:
>> all data values are aligned;
>> all instructions are aligned;
>> only a single memory access per operation (load or store).
>
> After architecting machines in 3 different RISC architectures and 2 CISC architectures, and suffering under the dilemma of having software "FIX" alignment issues via fast traps, I have come to the conclusion that hardware is better at fixing alignment problems--especially those MANDATED by programming languages. More recently, with the packing of multiple values into short vectors (MMX, SSE, AltiVec, ...), attempting to require alignment on these things is worse than the uninitiated can imagine.
>

ok, it's just that it was common for "simple" architectures to mandate
strict alignment (with any misaligned access done via tricks, such as
reading byte-by-byte, or reading as a larger aligned unit and doing a
right-shift or similar).
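the byte-wise trick looks something like this in C (assuming little-endian data; hypothetical helper name and example buffer):

```c
#include <assert.h>
#include <stdint.h>

/* Software fallback for a strict-alignment machine: read a (possibly
   misaligned) little-endian 32-bit value one byte at a time. */
static uint32_t load32_bytewise(const uint8_t *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* example: a 32-bit value starting at a misaligned (odd) offset */
static const uint8_t example[5] = {0xAA, 0x78, 0x56, 0x34, 0x12};
```

four byte loads plus shifts/ORs, versus one misaligned load in hardware.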


>> an imagined design:
>> CPU has two buses, one for data and one for code, each with a separate
>> set of address lines (presumably connected to a common memory-space though);
>> presumably, memory can be accessed in a single clock (cores connected to
>> high-speed SRAM);
>> both data and address buses are 64 bits;
>
> I have seen CPU designs for cell phones using 128-bit buses. I think dual 64-bit buses are going to hamper your design.
>

possibly.
for a 128 bit bus though, one likely needs either 128-bit value types,
or cache memory, to make it useful. that or the 128-bit memory is used
partly for pulling in smaller values, which are then shifted/rotated
into place (the rest is discarded).

I guess a relevant question would be the complexity difference between
"exchange on read" and "rotate on read". the "exchange-on-read" idea was
considered mostly as m68k did something like this (a misaligned read
often producing a value with the bytes swapped around IIRC).


>> like data values, instructions may be 1/2/4/8 bytes, and are kept
>> aligned much like data;
>
> Bad move.
>

why so?...

I think it could probably simplify the HW some, which was the main
reason for considering it.


>> opcodes are 8 or 16 bits;
>
> Probably a bad move.
>

not sure why this would be.


at least the "two bits 00/01/10=single byte, 11=dual byte" opcode scheme
was used effectively in my own VM bytecode designs, and has generally
worked fairly well.
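a minimal sketch of that length rule in C (hypothetical function name):

```c
#include <assert.h>
#include <stdint.h>

/* Opcode length under the scheme above: top bits 00/01/10 of the
   first byte mean a one-byte opcode, 11 means a two-byte opcode.
   This yields 192 one-byte opcodes and 64*256 two-byte opcodes. */
static int opcode_bytes(uint8_t b)
{
    return (b >> 6) == 3 ? 2 : 1;
}
```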

the JVM and .NET use 0xFE and 0xFF as escape bytes.
x86 uses a bunch of prefix bytes.


not sure why an 8 or 16 bit opcode would be a big issue.

granted, in my own VM designs, this has often left invalid overlong
encodings for some opcodes, and a value-range hole due to the
implementation (namely a "recursive switch statement" as the main
instruction-handling function).

granted, HW need not be equivalent.


>> opcodes may access a memory operand and a register at the same time;
>> maybe, x86-like memory addressing may be used [base+scale*index+disp];
>> there will be 16 registers;
>
> Probably a good move.
>

yeah.
mostly it is that having to calculate stuff in registers, and thus use
up registers, to perform memory-address calculations, is kind of
annoying, and may not notably simplify the design (but does likely cost
some WRT performance).


>> unused opcode spots will be set to nop;
>> the behavior of a misaligned opcode will be undefined.
>
> ALL UNUSED opCode space MUST be defined as exception generating. Otherwise you will hate yourself in design #2.
>

I meant in terms of alignment.

say:
8-bit op;
nop8; //pad-align
16-bit op.

or:
8-bit op;
nop8; //pad-align
nop16; //pad-align
32-bit op.
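the padding the assembler would insert is just the usual align-up computation; a small C sketch (hypothetical helper, assuming power-of-two instruction sizes):

```c
#include <assert.h>
#include <stdint.h>

/* nop bytes needed so the next instruction, of the given power-of-two
   size, lands on a matching boundary. */
static unsigned pad_for(uint64_t offset, unsigned size)
{
    return (unsigned)((0 - offset) & (size - 1));
}
```

e.g. an 8-bit op at offset 3 followed by a 64-bit op needs 5 bytes of nops.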

nevermind any clock-cycles eaten by executing nops.
presumably, the assembler could try to figure out if it can pack the
instructions effectively, or just otherwise insert nops (code producing
ASM will not need to worry about producing aligning nops, any more than
it will need to worry about how immediates get loaded into registers, or
the specifics of "opr reg, imm" arithmetic-instruction forms).

note: there may be some "opr reg, imm" immediate forms, but in the
"general case" these may likely be implemented internally as "opr reg,
[mem]" operations.


>> regs:
>> r0-r11: GPRs
>> r12/rsp: stack-pointer
>> r13/rbp: base-pointer
>> r14/rlr: link-register
>> r15/rip: instruction-pointer.
>
> Allow the software gurus to define register use (excepting for the register the CALL instruction deposits the return address in.)
>
> I agree with Paul A. Clayton that putting the PC in a GPR is "not such a good plan". Since this register is kept in the front end, and used every friggen cycle, having the back end track it constantly and micro-fault the front end when it is used will be found to be problematic in the bussing (wires) between the front end and the back end. Yes, it can be done, but it can also be done just as efficiently (wires+gates) by leaving the PC as the "front end control point" and bussing a constant (the PC at the time of fetch) down the pipe for when it is needed--you win bigger by leaving it out of the GPRs.
>

maybe?:
r0-r12: GPRs
r13/rbp: base pointer (GPR, special in name and ABI only)
r14/rlr: link-register (special to call)
r15/rsp: stack-pointer (special/assigned use)

rip would then be implicit and managed by processor (and rip-relative
addressing could be a special access mode).


probably function entry will look something like:
push rlr
push rbp
mov rbp, rsp
sub rsp, #
...
mov rsp, rbp
pop rbp
ret2


>> loads into r15 will encode many forms of 'jmp'.
>
> Bad move.

possibly, but I figured it could be a "special case encoding".

fair enough, jmp and call and similar probably need their own dedicated
opcodes.

Paul A. Clayton

unread,
Oct 28, 2011, 3:57:32 PM10/28/11
to
On Oct 28, 1:07 pm, MitchAlsup <MitchAl...@aol.com> wrote:
[snip]

Thanks for providing a professional perspective!

> I disagree that x86 cannot be particularly efficient in small/simple
> cores. I do agree that x86 has not been particularly efficient in
> currently implemented small/simple cores.

But x86 does have some legacy which makes it less than ideal
for small cores while not significantly helping large cores.

Even just the 8KiB of microcode per system would probably
keep it from the ARM Cortex-M0 target areas. (Or is that
'tiny core'?)

I would assume that even by yourself you could develop
an ISA at least 5% better than x86 for small systems without
sacrificing cost/performance for larger systems. Of
course, 5% better is not nearly enough to upset the status
quo.

[snip]
> After architecting machines in 3 different RISC architectures
> and 2 CISC architectures, and suffering under the dilemma of having

M88K was one of the RISCs and x86 one of the CISCs, what
were the others?

> software "FIX" alignment issues via fast traps, I have come to
> the conclusion that hardware is better at fixing alignment
> problems--especially those MANDATED by programming languages.
> More recently, with the packing of multiple values into short
> vectors (MMX, SSE, AltiVec, ...), attempting to require alignment
> on these things is worse than the uninitiated can imagine.

At least one person would argue that packed elements can
be aligned according to the element size not the vector
register size. Of course, from a hardware perspective,
this would not make much difference; loading a byte
aligned vector of 8 byte-sized elements is not any
easier than loading an 8-byte single-element register
that is only byte aligned.

[snip]
> ALL UNUSED opCode space MUST be defined as exception
> generating. Otherwise you will hate yourself in design #2.

Here I very much disagree. Leaving aside some opcode
space for hints (which would be treated as no-ops for
implementations not supporting the hint) can be useful.

In general one does not want to perform software
fix-ups after an illegal opcode exception for hint
operations (which have no Architectural semantic
content).

[snip]
> Allow the software gurus to define register use
> (excepting for the register the CALL instruction
> deposits the return address in.)

So you do not agree with the developers of Alpha that
even that should be flexible? (I have read that being
able to use a different link register might be useful
for coroutines, but I have not taken the time to
understand the reasoning for this.) Of course,
Alpha was *very* orthogonal in its register jump
instruction.

I am inclined to disagree about not having special
registers. Having special registers allows code
density tricks and special handling of some
operations (e.g., Todd Austin's knapsack cache).

As long as the specialness does not excessively
constrain usage (making register allocation
difficult), it may have some benefits.

(I need to think some more before further commenting
in this thread.)

MitchAlsup

unread,
Oct 28, 2011, 5:31:28 PM10/28/11
to
On Friday, October 28, 2011 2:57:32 PM UTC-5, Paul A. Clayton wrote:
> On Oct 28, 1:07 pm, MitchAlsup <Mitch...@aol.com> wrote:
> [snip]
>
> Thanks for providing a professional perspective!
>
> > I disagree that x86 cannot be particularly efficient in small/simple
> > cores. I do agree that x86 has not been particularly efficient in
> > currently implemented small/simple cores.
>
> But x86 does have some legacy which makes it less than ideal
> for small cores which does not significantly help large cores.

The MSR register space and the CPUID space are non-trivial in size.
These are a larger burden than the microcode ROMs themselves.

> Even just the 8KiB of microcode per system would probably
> keep it from the ARM Cortex-M0 target areas. (Or is that
> 'tiny core'?)

I can design and implement a nice x86 core + L1 cache in the size of one or two I/O pads in 32nm. Is that small enough or not?

> I would assume that even by yourself you could develop a
> an ISA at least 5% better than x86 for small systems without
> sacrificing cost/performance/cost for larger systems. Of
> course, 5% better is not nearly enough to upset the status
> quo.

But I cannot afford the $10B to develop the infrastructure--so there is no way to "enter" the market. {Where infrastructure is the stuff required to make use of the new core. Stuff like, compilers, a couple of OSs, Southbridges,...}

> > ALL UNUSED opCode space MUST be defined as exception
> > generating. Otherwise you will hate yourself in design #2.
>
> Here I very much disagree. Leaving aside some opcode
> space for hints (which would be treated as no-ops for
> implementations not supporting the hint) can be useful.
>
> In general one does not want to perform software
> fix-ups after an illegal opcode exception for hint
> operations (which have no Architectural semantic
> content).

I suspect that when you get your second chance, you will hate yourself for the exact reasons illustrated above.

> [snip]
> > Allow the software gurus to define register use
> > (excepting for the register the CALL instruction
> > deposits the return address in.)
>
> So you do not agree with the developers of Alpha that
> even that should be flexible? (I have read that being
> able to use a different link register might be useful
> for coroutines, but I have not taken the time to
> understand the reasoning for this.) Of course,
> Alpha was *very* orthogonal in its register jump
> instruction.

You (I at least) can do coroutines without access to the PC in the GPRs. You just have to do it in a way that does not suffer from the reentrancy problems found in Fortran 1-through-66. This may require a couple of move instructions, and maybe the suppression of callee register save conventions.

No modern programming language allows/supports the notion. So even here, while useful, you won't see the utility count above 0.0% unless you go out of your way to create a benchmark using this feature.

Mitch

MitchAlsup

unread,
Oct 28, 2011, 5:56:12 PM10/28/11
to
On Friday, October 28, 2011 2:21:09 PM UTC-5, BGB wrote:
> On 10/28/2011 11:07 AM, MitchAlsup wrote:
> >> like data values, instructions may be 1/2/4/8 bytes, and are kept
> >> aligned much like data;
> >
> > Bad move.
> >
>
> why so?...

If 1 and 2 and 4 and 8 byte instructions are useful, then why are 3 and 5 and 6 and 7 byte instructions not useful?

> I think it could probably simplify the HW some, which was the main
> reason for considering it.

Getting rid of exceptions ALWAYS simplifies HW; sometimes nearly as much as fixing the architecture so that it is not a problem to begin with.

> >> opcodes are 8 or 16 bits;
> >
> > Probably a bad move.
> >
>
> not sure why this would be.
>
> at least the "two bits 00/01/10=single byte, 11=dual byte" opcode scheme
> was used effectively in my own VM bytecode designs, and has generally
> worked fairly well.

So you have 192 single byte instructions and 64*256 = 16384 two byte instructions.

If you count different address modes as different instructions, x86 has more instructions than what you have provided.

> the JVM and .NET use 0xFE and 0xFF as escape bytes.
> x86 uses a bunch of prefix bytes.

In the prefix space, x86 has fairly flexible use of the bits. A prefix that indicates SSE (versus MMX) opcode decoding can be used to bias branch prediction on an instruction that is not MMX class. Thus, overall, each prefix byte adds a bit (or two) to the opcode decoding space. Some prefixes (REX in particular) add bits to the register specifiers, and other opcode decoding thingamabobs. This is simply more flexible, and when approached as if the byte stream were being passed through a LALR HW parser it is a lot more flexible and expandable. In HW implementations it is not usefully harder than a more fixed-format encoding.
>
> not sure why an 8 or 16 bit opcode would be a big issue.
>
> granted, in my own VM designs, this has often left invalid overlong
> encodings for some opcodes, and a value-range hole due to the
> implementation (namely a "recursive switch statement" as the main
> instruction-handling function).

In my HW LALR parser, I defined tables containing bit patterns. A negative number indicated that more decoding bytes were necessary; part of the value indicated the new state and another part indicated that which had just been decoded. A positive value indicated an entry in a different table that fully identified the instruction. A zero value in the table indicated that the opcode was not in the table (thus it is undefined). Each table has 256 entries and a handful of bits. One applies 5 bytes of the byte stream to 5 sets of tables; (some of) the bits falling out of the tables control multiplexers, other bits are operands to the multiplexers. At the end of the tree of multiplexers, the necessary select lines are driven to control HW state concerning this instruction's decode.
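a toy software analogue of that table scheme (my own invented tables and instruction ids, purely illustrative--not the actual design described above):

```c
#include <assert.h>

/* Each decode state is a 256-entry table: a negative entry names the
   next state, a positive entry names a fully decoded instruction,
   and zero means "undefined opcode". */
enum { NSTATES = 2 };
static int dec_tab[NSTATES][256];

static void init_tables(void)
{
    dec_tab[0][0x01] = 100;  /* a one-byte instruction, id 100 */
    dec_tab[0][0x0F] = -1;   /* escape byte: continue in state 1 */
    dec_tab[1][0x05] = 200;  /* a two-byte instruction, id 200 */
}

/* Returns the instruction id (>0) or 0 if undefined; *len gets the
   number of bytes consumed. */
static int decode(const unsigned char *p, int *len)
{
    int state = 0, i = 0;
    for (;;) {
        int e = dec_tab[state][p[i++]];
        if (e >= 0) { *len = i; return e; }
        state = -e;
    }
}
```

in hardware the table lookups for several byte positions run in parallel and feed a mux tree rather than a loop.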

When one has marker bits to denote instruction boundaries, the above paragraph is simply replicated to the decode width and run in complete parallelism.

> granted, HW need not be equivalent.
>
>
> >> opcodes may access a memory operand and a register at the same time;
> >> maybe, x86-like memory addressing may be used [base+scale*index+disp];
> >> there will be 16 registers;
> >
> > Probably a good move.
> >
>
> yeah.
> mostly it is that having to calculate stuff in registers, and thus use
> up registers, to perform memory-address calculations, is kind of
> annoying, and may not notably simplify the design (but does likely cost
> some WRT performance).

The misaligned support and the [base+scale*index+displacement] addressing mode are what changed my mind about RISC overall. Notice the 88K had [base+scale*index] and [base+displacement] only.

>
> >> unused opcode spots will be set to nop;
> >> the behavior of a misaligned opcode will be undefined.
> >
> > ALL UNUSED opCode space MUST be defined as exception generating. Otherwise you will hate yourself in design #2.
> >
>
> I meant in terms of alignment.
>
> say:
> 8-bit op;
> nop8; //pad-align
> 16-bit op.

This worked out poorly in the CDC 6600, but not so bad in the CDC 6400. It worked out so poorly in the S.E.L. 32/50 and 32/87 that they finally fixed it in the 32/67.

Even when you can make the decoder eat the noops without creating bubbles in the pipe, they still waste a precious resource--the I cache.
>
> or:
> 8-bit op;
> nop8; //pad-align
> nop16; //pad-align
> 32-bit op.

I guarantee that you will hate yourself for doing this on the #2 machine.

Just consider the following case:
8-bit op;
nop8
nop16
nop32
64-bit op;

> nevermind any clock-cycles eaten by executing nops.
> presumably, the assembler could try to figure out if it can pack the
> instructions effectively, or just otherwise insert nops (code producing
> ASM will not need to worry about producing aligning nops, any more than
> it will need to worry about how immediates get loaded into registers, or
> the specifics of "opr reg, imm" arithmetic-instruction forms).
>
> note: there may be some "opr reg, imm" immediate forms, but in the
> "general case" these may likely be implemented internally as "opr reg,
> [mem]" operations.

In my not so humble opinion, you will not enjoy the results of this choice. It causes extra work in the pipeline (an inbound memory ref, which chews up valuable register base pointers), and displaces useful data from the data cache that could be properly positioned in the instruction cache.

Mitch

Terje Mathisen

unread,
Oct 28, 2011, 8:30:05 PM10/28/11
to
MitchAlsup wrote:
> After architecting machines in 3 different RISC architectures and 2
> CISC architectures, and suffering under the dilemma of having
> software "FIX" alignment issues via fast traps, I have come to the
> conclusion that hardware is better at fixing alignment
> problems--especially those MANDATED by programming languages. More
> recently, with the packing of multiple values into short vectors
> (MMX, SSE, AltiVec, ...), attempting to require alignment on these
> things is worse than the uninitiated can imagine.

Yeah, that one took me a few years to figure out, but now I am firmly in
the "Short vectors shall only require item-sized alignment" camp.

This is because otherwise it becomes far too painful to write code that
will tolerate changes in the width of vector regs, and pretty much
impossible to efficiently interface between code written for width A
(say 128 bits) and width B (256 bits).

Paul A. Clayton

unread,
Oct 28, 2011, 8:49:42 PM10/28/11
to
On Oct 28, 4:31 pm, MitchAlsup <MitchAl...@aol.com> wrote:
[snip]
> I can design and implement a nice x86 core + L1 cache in
> the size of one or two I/O pads in 32nm. Is that small enough
> or not?

Well, ARM indicates the base size for the Cortex-M0 in a 40G
process is 0.01 mm-squared (0.04 for Cortex-M4), but that
excludes RAM/cache (and just about everything else). Still
your x86 does sound small.

(Since an iPad is about 460 cm-squared, I just need to know
how large an O-pad is; I can do the division to determine
how large one I/O pad is. Sorry, the pun came to mind and
I did not have anyone here to inflict it on.)

[snip]
> > Here I very much disagree.  Leaving aside some opcode
> > space for hints (which would be treated as no-ops for
> > implementations not supporting the hint) can be useful.
[snip]
> I suspect that when you get your second chance, you
> will hate yourself for the exact reasons illustrated above.

Since my first chance is never going to happen could you
explain why reserving a few opcodes for hints is a bad idea?
Such prevents them from being used for ordinary instructions
but should not otherwise have any architectural effects.

BGB

unread,
Oct 29, 2011, 12:53:36 AM10/29/11
to
On 10/28/2011 2:56 PM, MitchAlsup wrote:
> On Friday, October 28, 2011 2:21:09 PM UTC-5, BGB wrote:
>> On 10/28/2011 11:07 AM, MitchAlsup wrote:
>>>> like data values, instructions may be 1/2/4/8 bytes, and are kept
>>>> aligned much like data;
>>>
>>> Bad move.
>>>
>>
>> why so?...
>
> If 1 and 2 and 4 and 8 byte instructions are useful, then why are 3 and 5 and 6 and 7 byte instructions not useful?
>

because 3/5/6/7 are not power-of-2 aligned, and anything which fits in 3
bytes also goes in 4, and anything which fits in 5/6/7 goes in 8.

it is much like short vs int vs long-long.
although it would be potentially nifty, say, to have 24 bit integers,
they don't really contribute a whole lot, and one will generally pick 16
or 32 bits instead.


>> I think it could probably simplify the HW some, which was the main
>> reason for considering it.
>
> Getting rid of exceptions ALWAYS simplifies HW; sometimes nearly as much as fixing the architecture so that it is not a problem to begin with.
>

possible, but in this case the goal would be to make a "simple"
architecture, rather than a necessarily "good" one.


>>>> opcodes are 8 or 16 bits;
>>>
>>> Probably a bad move.
>>>
>>
>> not sure why this would be.
>>
>> at least the "two bits 00/01/10=single byte, 11=dual byte" opcode scheme
>> was used effectively in my own VM bytecode designs, and has generally
>> worked fairly well.
>
> So you have 192 single byte instructions and 64*256 = 16384 two byte instructions.
>

yep.

I have a variant that extends the pattern, and can encode a 32-bit
opcode number in 5 bytes.

say
xxxxxxxx
11xxxxxx xxxxxxxx
1111xxxx xxxxxxxx xxxxxxxx
111111xx xxxxxxxx xxxxxxxx xxxxxxxx
11111111 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
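in C, the total encoded length under that pattern could be sketched as (hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Total encoded length under the pattern above: each leading pair of
   1-bits in the first byte adds one extension byte (max 5 bytes). */
static int enc_length(uint8_t b)
{
    int n = 1;
    for (int shift = 6; shift >= 0 && ((b >> shift) & 3) == 3; shift -= 2)
        n++;
    return n;
}
```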


but, it is a moot point as, presently, my VM only uses around 540
opcodes (it would likely be a bit more if they were statically-typed,
but most of these opcodes operate on "variant" types, which the
interpreter generally handles as dynamic types and which my JITs have
generally used type-inference on).

some of my other VM/interpreter designs have used a similar scheme,
except they were statically-typed; however, these were never fully
implemented (since statically-typed interpreters with a non-trivial
type-system are considerably more effort to implement).


I also considered it a little moot as:
a statically-typed interpreter is still slow in the naive case (IME,
often the "big switch()" comes to dominate in most simple "switch()"
based interpreters).

a less naive interpreter (or a JIT) renders the difference between
static types and inferred types largely a moot point IME (an interpreter
design which internally converts the bytecode into threaded-code can
easily enough infer the types in the process, having little need for
them to be type-specialized in the bytecode).

hence, my bytecode formats tend to have a little more in common with
MSIL / CIL / .NET ByteCode than with Java ByteCode in this regard.


> If you count different address modes as different instructions, x86 has more instructions than what you have provided.
>

yes, but if one doesn't count the addressing mode, IIRC it was something
like 2100 instruction forms (of 770 mnemonics), which includes base
opcodes + x87 + SSE + partial AVX.

this is going by my assembler listing, where I don't think I really left
out anything significant.


in this case, the arguments for the opcodes are not included (such as
ModRM encodings), since the assembler manages this itself.

parts of my assembler are generated directly from listings, which use a
notation fairly similar to that used in the Intel docs (except for AVX
and XOP where I devised my own more-compact notation).


ARM actually led to considerably more complex logic, because the
encoding rules for ARM and Thumb instructions were far less regular than
for x86 (and led to a reasonably funky listing notation).

I guess it may not have been so bad if people were hand-writing the
logic for emitting each instruction form, but much more awkward with
"generic" logic.


>> the JVM and .NET use 0xFE and 0xFF as escape bytes.
>> x86 uses a bunch of prefix bytes.
>
> In the prefix space, x86 has fairly flexible use of the bits. A prefix that indicates SSE (versus MMX) opcode decoding can be used to bias branch prediction on an instruction that is not MMX class. Thus, overall, each prefix byte adds a bit (or two) to the opcode decoding space. Some prefixes (REX in particular) add bits to the register specifiers, and other opcode decoding thingamabobs. This is simply more flexible, and when approached as if the byte stream were being passed through a LALR HW parser it is a lot more flexible and expandable. In HW implementations it is not usefully harder than a more fixed-format encoding.


yes, however, prefixes could lead to potentially more memory accesses I
would think (for a naive implementation).


I never really got, though, why access to the 16 x 64-bit registers has
not been added to real-mode or 32-bit protected mode.

when I wrote an x86 interpreter, I added them via a special prefix I
called "PREX" which was basically a 2-byte prefix similar to XOP or VEX,
but encoded the same basic information as a REX.

not entirely sure why Intel or AMD couldn't do similar, unless there is
some more subtle architectural reason for why 64-bit regs can't be added
to 32-bit mode (granted, assuming OS support).


>>
>> not sure why an 8 or 16 bit opcode would be a big issue.
>>
>> granted, in my own VM designs, this has often left invalid overlong
>> encodings for some opcodes, and a value-range hole due to the
>> implementation (namely a "recursive switch statement" as the main
>> instruction-handling function).
>
> In my HW LALR parser, I defined tables containing bit patterns. A negative number indicated that more decoding bytes were necessary; part of the value indicated the new state and another part indicated that which had just been decoded. A positive value indicated an entry in a different table that fully identified the instruction. A zero value in the table indicated that the opcode was not in the table (thus it is undefined). Each table has 256 entries and a handful of bits. One applies 5 bytes of the byte stream to 5 sets of tables; (some of) the bits falling out of the tables control multiplexers, other bits are operands to the multiplexers. At the end of the tree of multiplexers, the necessary select lines are driven to control HW state concerning this instruction's decode.
>

hmm...

in my x86 interpreter, I had used something more akin to a regex
matcher. some LUTs were built from the listings, and then the
instruction was matched against any patterns, and the first-found match
was assumed to be correct.

> When one has marker bits to denote instruction boundaries, the above paragraph is simply replicated to the decode width and run in complete parallelism.
>

ok.


>> granted, HW need not be equivalent.
>>
>>
>>>> opcodes may access a memory operand and a register at the same time;
>>>> maybe, x86-like memory addressing may be used [base+scale*index+disp];
>>>> there will be 16 registers;
>>>
>>> Probably a good move.
>>>
>>
>> yeah.
>> mostly it is that having to compute memory addresses in registers (and
>> thus using up registers for them) is kind of annoying, and omitting the
>> addressing modes may not notably simplify the design anyway (but does
>> likely cost some performance).
>
> The misaligned support and the [base+scale*index+displacement] addressing mode are what changed my mind about RISC overall. Notice the 88K had [base+scale*index] and [base+displacement] only.
>

memory addressing seems nice, but I am not sure what the costs/benefits
are of allowing misaligned access (vs not allowing it).

granted, from a programmer POV, having misaligned access is probably
nicer than not having it.
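as a rough illustration of the cost side (a sketch I am adding here, not from the post): a misaligned 64-bit load on hardware that only does aligned accesses has to be assembled from two aligned words, which is roughly the extra work either the hardware or a trap handler ends up doing. assumes little-endian byte order:

```c
#include <stdint.h>
#include <string.h>

/* Emulate a misaligned 64-bit load using only aligned 64-bit
 * accesses: fetch the two aligned words covering the address and
 * splice the result together. Little-endian assumed. */
static uint64_t load_u64_misaligned(const uint8_t *mem, uint64_t addr)
{
    uint64_t a0  = addr & ~7ULL;           /* aligned base of first word */
    unsigned off = (unsigned)(addr & 7);   /* byte offset within it */
    uint64_t lo, hi;

    memcpy(&lo, mem + a0, 8);              /* first aligned access */
    if (off == 0)
        return lo;                         /* aligned case: one access */
    memcpy(&hi, mem + a0 + 8, 8);          /* second aligned access */
    return (lo >> (8 * off)) | (hi << (8 * (8 - off)));
}
```

so the misaligned case costs a second bus access plus a shift/merge, vs one access for the aligned case.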


>>
>>>> unused opcode spots will be set to nop;
>>>> the behavior of a misaligned opcode will be undefined.
>>>
>>> ALL UNUSED opCode space MUST be defined as exception generating. Otherwise you will hate yourself in design #2.
>>>
>>
>> I meant in terms of alignment.
>>
>> say:
>> 8-bit op;
>> nop8; //pad-align
>> 16-bit op.
>
> This worked out poorly in the CDC 6600 but not so badly in the CDC 6400. It worked out so poorly in the S.E.L 32/50 and 32/87 that they finally fixed it in the 32/67.
>
> Even when you can make the decoder eat the noops without creating bubbles in the pipe, they still waste a precious resource--the I cache.

possibly, but I was thinking of the possibility of cores with neither a
pipeline nor (internal) cache (although the external SRAM could likely
be considered cache memory).

granted, the nops would probably eat clock-cycles.


I haven't yet tried to imagine how exactly the MMU and SRAM would work
(I was sort of imagining a bunch of cores on a bus linked up to a common
MMU, which would link IO requests either to SRAM, or somehow pause the
core until the value could be fetched from DRAM).


I am left to wonder some about the possibility of "bus collisions",
where if a common bus is used between a number of cores, two cores could
try to do IO at the same time and then clash.

also possible could be an "IO token", where a line could indicate that a
core wishes to perform IO, and the MMU puts a token on the bus for the
core it is granting the IO request for.

but, sadly, this is a bit far out of my area of expertise...


>>
>> or:
>> 8-bit op;
>> nop8; //pad-align
>> nop16; //pad-align
>> 32-bit op.
>
> I guarantee that you will hate yourself for doing this on the #2 machine.
>
> Just consider the following case:
> 8-bit op;
> nop8
> nop16
> nop32
> 64-bit op;
>

yeah, maybe some sort of multi-stage nop:
8-bit op;
nop56;
64-bit op;

it would be mostly similar to the above, except that the first nop
indicates the number of bytes to be skipped over (or value to add to the
IP/PC).
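the assembler-side padding calculation this implies is trivial (a sketch, assuming instruction widths are powers of two):

```c
#include <stdint.h>

/* Given the current instruction pointer and the width of the next
 * instruction (1/2/4/8 bytes, a power of two), compute how many pad
 * bytes are needed so the instruction lands on its natural alignment.
 * A "skip" nop would then encode this count (i.e., the value to add
 * to the IP/PC). */
static unsigned align_pad(uint64_t ip, unsigned width)
{
    return (unsigned)(-(int64_t)ip & (width - 1));
}
```

e.g. at rip&7=1 before a 64-bit op, this gives 7 pad bytes, which is exactly the nop56 case above.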


>> nevermind any clock-cycles eaten by executing nops.
>> presumably, the assembler could try to figure out if it can pack the
>> instructions effectively, or just otherwise insert nops (code producing
>> ASM will not need to worry about producing aligning nops, any more than
>> it will need to worry about how immediates get loaded into registers, or
>> the specifics of "opr reg, imm" arithmetic-instruction forms).
>>
>> note: there may be some "opr reg, imm" immediate forms, but in the
>> "general case" these may likely be implemented internally as "opr reg,
>> [mem]" operations.
>
> In my not so humble opinion, you will not enjoy the results of this choice. It causes extra work in the pipeline (inbound memory ref: which chews up valuable register basepointers), and displaces useful data from the data cache that could be properly positioned in the instruction cache.
>

fair enough, except that the memory-access cases are likely to be less
common.

also, the sequential speed of each core would be a lower priority than
simplicity, since if the lower transistor count allows, say, 2 cores in
the same space at a modest per-core slowdown, this may be a net
improvement...


granted, this is all well hypothetical, as I am a lone hobbyist
programmer with no access to any sort of chip-fab technology...

Brett Davis

unread,
Oct 29, 2011, 2:02:43 AM10/29/11
to
In article
<2083a986-53a8-4fe4...@g1g2000vbd.googlegroups.com>,
"Paul A. Clayton" <paaron...@gmail.com> wrote:

> On Oct 28, 4:31 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> [snip]
> > > Here I very much disagree.  Leaving aside some opcode
> > > space for hints (which would be treated as no-ops for
> > > implementations not supporting the hint) can be useful.
> [snip]
> > I suspect that when you get your second chance, you
> > will hate yourself for the exact reasons illustrated above.
>
> Since my first chance is never going to happen could you
> explain why reserving a few opcodes for hints is a bad idea?
> Such prevents them from being used for ordinary instructions
> but should not otherwise have any architectural effects.

Sparc has branch hint direction bits which were not used by
the hardware. Compilers had to set those bits to something,
now all legacy code has those bits set wrong, so new Sparc
chips have to ignore the hints.

Terje Mathisen

unread,
Oct 29, 2011, 3:59:57 AM10/29/11
to
The jury is already in on that idea, in the form of 128-bit bundles on
Itanium:

Too little performance, 5 years too late.

nm...@cam.ac.uk

unread,
Oct 29, 2011, 4:43:13 AM10/29/11
to
In article <02urn8-...@ntp6.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>MitchAlsup wrote:
>> After architecting machines in 3 different RISC architectures and 2
>> CISC architectures, and suffering under the dilemma of having software
>> "FIX" alignment issues via fast traps, I have come to the conclusion
>> that hardware is better at fixing alignment problems--especially
>> those MANDATED by programming languages. Of late, with the packing of
>> multiple values into short vectors (MMX, SSE, AltiVec, ...),
>> attempting to require alignment on these things is worse than the
>> uninitiated can imagine.
>
>Yeah, that one took me a few years to figure out, but now I am firmly in
>the "Short vectors shall only require item-sized alignment" camp.
>
>This is because otherwise it becomes far too painful to write code that
>will tolerate changes in the width of vector regs, and pretty much
>impossible to efficiently interface between code written for width A
>(say 128 bits) and width B (256 bits).

Agreed. However, I am also firmly of the camp that any programming
language, interface or specification that mandates element misalignment
is broken. The correct solution is now, as it always has been, to pack
and unpack properly when that is needed, which can usually be done
for free, as the same logic is needed for the checking and tracing.
Er, everybody DOES do those in import and export, don't they? :-)

The other aspect is that fixing up ANYTHING by fast traps the way
they are (usually) filtered through a 'privileged' state is a
disaster area. With the alignment fiasco eliminated, it is easy
to do it cleanly (in theory), but it is a gibbering nightmare to
have to handle an overflow on a misaligned floating-point store,
both halves of which cause separate TLB misses. Oh, in a shared
memory program, with separate threads accessing adjacent locations,
high memory contention, and people wanting the trap-fixup-and-recover
option of IEEE 754 (and similar features for debugger and tracing
support).


Regards,
Nick Maclaren.

Andrew Reilly

unread,
Oct 29, 2011, 5:50:42 AM10/29/11
to
Why are the bits set wrong, if they say something true about the program
code? Is the issue that there were situations where the compiler did not
know the truth about program execution but had to say one way or the
other, because there was no "don't know" hint-bit value?

Cheers,

--
Andrew

Joe keane

unread,
Oct 29, 2011, 10:13:48 AM10/29/11
to
My opinion is that instruction-set encoding is *highly* overrated. Just
pick something that is not entirely stupid. The effect on code size is
not so great, and on performance even less.

If you want 1980s transistor counts (because you want lots of cores),
look at MIPS (or similar); you can twiddle it all you like, but you're
not going to make a big improvement.

Maybe change to 64 bits, but then again maybe not!

If you think about it, 4 GB is not so bad. Of course, you want some way
to access more somehow, but 32-bit pointers are not going to kill you.
And how often do you deal with more than two billion things? Typical
desktops have around 2 GB RAM; if you need more, we're not talking about
some extra instructions, we're talking about using the hard drive.

If you add in embedded stuff (cameras, cell phones) it goes down more;
"mainframes" are more of an anomaly than typical.

I think it has little similarity to a 64 KB limit, which caused the 286
disaster; computers at the time already exceeded the address space, and
gosh is that crampy. We learned something from this, but beware of
false analogy.

The 360 is an improvement over all its successors...

EricP

unread,
Oct 29, 2011, 10:57:01 AM10/29/11
to
Paul A. Clayton wrote:
>
> So you do not agree with the developers of Alpha that
> even that should be flexible? (I have read that being
> able to use a different link register might be useful
> for coroutines, but I have not taken the time to
> understand the reasoning for this.) Of course,
> Alpha was *very* orthogonal in its register jump
> instruction.

I believe the intent of Alpha's flexible link register
was to allow global program optimization and avoid
having to save and restore it to stack.

Eric

BGB

unread,
Oct 29, 2011, 12:19:20 PM10/29/11
to
On 10/29/2011 7:13 AM, Joe keane wrote:
> My opinion is that instruction-set encoding is *highly* overrated. Just
> pick something that is not entirely stupid. The effect on code size is
> not so great, and on performance even less.
>
> If you want 1980s transistor counts (because you want lots of cores),
> look at MIPS (or similar); you can twiddle it all you like, but you're
> not going to make a big improvement.
>

in a general sense, MIPS seems not drastically different from ARM.
although, it does use 3-argument forms, and a cleaner-looking
instruction format (fixed word).


> Maybe change to 64 bits, but then again maybe not!
>
> If you think about it, 4 GB is not so bad. Of course, you want some way
> to access more somehow, but 32-bit pointers are not going to kill you.
> And how often do you deal with more than two billion things? Typical
> desktops have around 2 GB RAM; if you need more, we're not talking about
> some extra instructions, we're talking about using the hard drive.
>

I was imagining for "very large" data sets, namely neural nets.

although only a modest amount of the total net would be active at any
time, the net itself is potentially huge.


also, my desktop has 16GB of RAM, because Windows 7 performs crappy
under load with even 4GB.

XP64 holds up fairly solidly under 4 or 6 GB though.

with 2GB at this point, one's computer gets owned (lots of swapping,
...) from simply trying to use Firefox...


> If you add in embedded stuff (cameras, cell phones) it goes down more;
> "mainframes" are more of an anomaly than typical.
>

I was roughly assuming desktop-level power usages.

although still hypothetical, a type of packaging could be as a card
which uses PCIE or similar, and probably has processor units on it and a
bunch of RAM (say, DDR3 or GDDR or similar).

so, local cores interface via a bus with an MMU, which multiplexes
between the internal SRAM/cache, and the external DDR3.

another unit works to map between the on-card memory and system memory
(via the system bus), and possibly cooperates with the OS (via its
drivers) for sake of handling any swap activity (the on-card RAM
potentially serves as an on-card cache for a larger address region
managed by the OS).

the ability for programs to access the card could be potentially
facilitated either by libraries/APIs, and/or potentially by
memory-mapping parts of the card's virtual address space into the process.


programs/... would be loaded into the card's RAM via the main CPU, and
it would set the card in motion, mostly reading/writing parts of the
data set to know the results.


functionally, it would serve a similar role to current GPGPU type stuff,
except place more emphasis on GPGPU style tasks, and less on graphical
tasks.

so, threading capabilities and integer throughput would be placed as a
higher priority than the ability to perform lots of floating-point or
vector operations.

so, say, if the card has 4096 cores or so, rather than 256 cores (like
in a present video card), it could outperform a GPU if used for general
data-processing tasks (and would not incur as much of a penalty due to
threads branching independently).


> I think it has little similarity to a 64 KB limit, which caused the 286
> disaster; computers at the time already exceeded the address space, and
> gosh is that crampy. We learned something from this, but beware of
> false analogy.
>

developing a 3D engine on the PC, one is already starting to run into
awkward limits with fitting everything into a 32-bit address space
(although at the moment 32-bit code still goes a little faster, and
going to 64-bits greatly increases memory requirements and prevents the
program from running on 32-bit targets).

in a few years, it may be a bit more awkward.
it is likely to be especially awkward if ones' data-set is potentially
larger than the address space (even if the physical RAM is smaller, and
potentially swapping is used...).


> The 360 is an improvement over all its successors...

ok.


but, the IBM360 (?) was never mass-marketed to consumers...

Stefan Monnier

unread,
Oct 29, 2011, 12:54:01 PM10/29/11
to
>>> say, the bus always pulls in 64 bits, but the low 3-bits mostly serve to
>>> swap around the bytes in the word. also, it ensures that instructions can
>>> always be loaded with a single load operation (vs plain byte-wise operations
>>> which may require multiple memory loads).
>> BTW, I always assumed that newer ISAs would go to fixed-length
>> instructions of either 64bit or 128bit. Of course, such instructions
>> would include various operations (à la VLIW), but compared to VLIW, I'd
>> expect that those operations would depend on each other (e.g. you'd
>> have 2 ALU ops in the form R4<= (R1<op1> R2)<op2> R3).
> The jury is already in on that idea, in the form of 128-bit bundles on
> Itanium:

Could be it's the same as Itanium, but the way I see it, it's very
different: the instruction wouldn't have much (if any) parallelism and
would be handled as a single atomic entity even within the OoO core.
So 128bit would probably be too much (end up with too many operations,
and hence too large chunks which become inefficient).

I think of it as "your usual RISC style CPU" but where the general
shape of the main datapath is not "2 inputs and 1 output" but rather
something like "4 inputs and 2 outputs" and where the connection between
those inputs and outputs is via more than just 2 operations.

So your execute stage grows, but the administrative overhead (register
renames, for example) is hopefully reduced. Of course, it does assume
that the available mix of operations (and connections between them)
within instructions is sufficiently flexible that compilers can fill the
instructions with useful operations.


Stefan

BGB

unread,
Oct 29, 2011, 3:30:11 PM10/29/11
to
On 10/27/2011 4:03 PM, BGB wrote:
> dunno about existing technologies, but I was thinking some and had a few
> misc ideas (random, I will probably not do much with them, given I am
> purely a programmer, and not a hardware person).
>

yep, all still hypothetical.

sorry if this is all stupid/boring, it is just me trying to imagine how
all this could work...

so, here is an update for the idea, as it stands...


so, a few idea updates:
basic opcode encoding will be similar (aligned 1/2/4/8 byte opcodes with
optional nop-padding);
probably will keep the separate code and data bus (for timing reasons);
there is no internal per-core cache or pipeline (all execution is
serial, all IO goes to local bus).

the reason for opcodes being power-of-two sized and aligned is that this
ensures that the whole instruction can be read in a single bus
operation, while still being a little more compact than fully fixed-size
instructions (at 32 or 64 bits).


so, probably registers:
R0-R12: GPRs
R13/RBP: base-pointer
R14/RLR: link-register
R15/RSP: stack-pointer

RIP: instruction-pointer, internal register.

opcode encoding:
00xxxxxx, 1-byte forms
01ggxxxx, 2-byte forms
10ggxxxx, 4 byte forms
11gghhxx, 8-byte forms

gg=00/01/10: opcode in low 4 bits (6-bit base-opcode).
gg=11: opcode expands to following byte;
hh: like gg, where if gg=11 and hh=11, 2 bytes encode the opcode (giving
a 20-bit opcode number, reserved).
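the nice property of putting the width in the top two bits is that length decode is trivial; as a sketch:

```c
#include <stdint.h>

/* The top two bits (ww) of the first byte select the instruction
 * length directly, so the whole instruction can always be fetched in
 * one aligned bus read before any further decoding happens. */
static unsigned insn_len(uint8_t first)
{
    return 1u << (first >> 6);   /* ww = 00/01/10/11 -> 1/2/4/8 bytes */
}
```
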

all 1-byte opcodes are single operations (no non-fixed arguments can be
encoded).

2 byte base-opcode forms:
IA: imm8
RR: reg, reg
RI: reg, imm4

4-byte base-opcode forms:
IB: imm24
RM: reg, [mem] ([base+sc*index+disp8])
MR: [mem], reg
RI: reg, imm16

8-byte base-opcode forms:
IB2: imm56
RM2: reg, [mem] ([base+sc*index+disp32])
MR2: [mem], reg
RI2: reg, imm48

IA: 01xxxxxx vvvvvvvv
RR: 01xxxxxx rrrrbbbb
RI: 01xxxxxx rrrrvvvv
RZ: 01xxxxxx rrrr0000

IB: 10xxxxxx vvvvvvvv vvvvvvvv vvvvvvvv
RM: 10xxxxxx rrrrmmss bbbbiiii dddddddd
10xxxxxx rrrr01ss bbbbdddd dddddddd
10xxxxxx rrrr11ss dddddddd dddddddd
MR: same as RM
RJ: 10xxxxxx rrrryyyy vvvvvvvv vvvvvvvv
MZ: 10xxxxxx yyyymmss bbbbiiii dddddddd

vvvv=value
rrrr=reg;
bbbb=base;
iiii=index;
yyyy=constant/reserved (set to 0)
mm=mode (00=[base+sc*index+disp8], 01=[base+disp12], 10=resv,
11=[rip+disp16]);
ss=scale (00=1, 01=2, 10=4, 11=8);

IB2: 11xxxxxx yyyyyyyy zzzzzzzz zzzzzzzz ...
RM2: 11xxxxxx rrrrmmmm bbbbssss iiiiyyyy dddddddd ...
MR2: same as RM2
RI2: 11xxxxxx rrrryyyy vvvvvvvv vvvvvvvv ...

nops:
nop8: 00000000
nop16: 01000000 00000000
nop32: 10000000 00000000 00000000 00000000

nop24 (align: rip&1=1): 01000000 00000000 00000000
nop48 (align: rip&3=2): 10000000 00000000(x5)
nop56 (align: rip&3=1): 10000000 00000000(x6)


example opcodes:

8-bit:
0=nop8;
1=ret (jmp rlr)

16-bit:
0=nop16/nop24;
1=mov/rr
2=jmp/ia
3=jmp/rz
4=add/rr
5=sub/rr
6=add/ri
7=sub/ri
...
16=call/rz
17=push/rz
18=pop/rz
...

32-bit:
0=nop32/nop48/nop56;
1=mov/rm
2=mov/mr
3=lea/rm
4=add/rm
5=sub/rm
6=add/rj
7=sub/rj
8=add/mr
9=sub/mr
...
16=call/mz
...

bus timings:
cores will have a fixed-length timing cycle, and will be staggered by 1
clock cycle (allowing N cores per bus).

cycles:
0, core puts RIP on code-addr bus
1, instruction is on code bus, core begins execution
2, delay (core magic)
3, delay (core-magic)
4, put memory address on data-addr-bus (read/write, if used)
5, memory-cell on data-bus (read)
6, core may put updated cell on bus (write), with a W line set.
7, delay (core may execute alignment-nop)

(I realized a read-modify-write would be needed, as otherwise there is
no good way I can think of to handle small-value writes...).


if needed, a core may perform a longer operation, and use additional
cycles (so, an operation will require N*M cycles to complete).

say, cycles 8-15 or 16-23, which will follow the same form, but the core
may operate internally if needed.


this layout should allow 8 cores on a bus with no clashes.
there will be several status-lines (code-suspend/data-suspend) which
would lock cores if IO can't be completed immediately (requires external
bus IO). possibly, there will also be an "interrupt" line, which would
trigger the core to load/execute an interrupt handler (likely set at cycle 0).

reasoning: absent a local cache, each core would likely need to perform
IO at nearly every instruction. a fixed-cycle should likely give the
highest overall throughput, whereas a more flexible scheme would likely
result in a large amount of clashing.
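the staggering works out like this (a sketch, assuming 8 cores per bus and the 8-cycle frame above):

```c
/* Core k starts its 8-cycle frame at clock k, so it drives the
 * code-address bus when (clock - k) % 8 == 0 and the data-address bus
 * when (clock - k) % 8 == 4. At most one core is in any given phase
 * on each clock, so the slots never collide. */
static int code_bus_driver(unsigned clock)   /* core on code-addr bus */
{
    return (int)(clock % 8);
}

static int data_bus_driver(unsigned clock)   /* core on data-addr bus */
{
    return (int)((clock + 4) % 8);           /* -4 mod 8 == +4 mod 8 */
}
```

so on any clock, one core is fetching an instruction while another (4 slots behind in its frame) is using the data bus, which is why the separate code/data buses never fight.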


the bus will likely be connected to a CCU (Cache-Control-Unit), which
will manage local memory/cache (and will be connected to a different IO
bus, and will have 1 or more cycled core buses). the CCU bus may also
be cycled, but will likely use a merged code/data bus (and maybe a
different bus-signalling mechanism).

at the outer levels, will be an external MMU, which will manage
connecting the CCU's to the external DRAM (and performing any
address-translation).

Bernd Paysan

unread,
Oct 29, 2011, 3:56:27 PM10/29/11
to
Paul A. Clayton wrote:

> (Since an iPad is about 460 cm-squared, I just need to know
> how large an O-pad is; I can do the division to determine
> how large one I/O pad is. Sorry, the pun came to mind and
> I did not have anyone here to inflict it on.)

The size of Tegra's oPad is about the same as the iPad, it's also a 10"
screen, so I think the size of an I/O pad is about 1, dimensionless ;-).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/

Thomas Womack

unread,
Oct 29, 2011, 7:28:00 PM10/29/11
to
In article <j8h1ms$gpd$1...@reader1.panix.com>, Joe keane <j...@panix.com> wrote:
>My opinion is that instruction-set encoding is *highly* overrated. Just
>pick something that is not entirely stupid. The effect on code size is
>not so great, and on performance even less.
>
>If you want 1980s transistor counts (because you want lots of cores),
>look at MIPS (or similar); you can twiddle it all you like, but you're
>not going to make a big improvement.
>
>Maybe change to 64 bits, but then again maybe not!
>
>If you think about it, 4 GB is not so bad. Of course, you want some way
>to access more somehow, but 32-bit pointers are not going to kill you.
>And how often do you deal with more than two billion things? Typical
>desktops have around 2 GB RAM; if you need more, we're not talking about
>some extra instructions, we're talking about using the hard drive.

I think you're working with figures from five years ago; new desktops
nowadays have 4G if you buy the absolute-cheapest and 16G if you
don't. Today's fancier mobile phones have 1G and will have 2G with
the next refresh of the DRAM process.

Tom

Brett Davis

unread,
Oct 29, 2011, 8:23:59 PM10/29/11
to
In article <9h20jh...@mid.individual.net>,
The Sparc branch hint bug was a story I heard but never confirmed.
A quick look at the version 8 Sparc manual shows no hint bit, or obvious
dead bit in the encoding. With 8 updates the bit would have been
recycled by now, if indeed I got the CPU right, or the story right.

Sparc does have an interesting branch delay slot annul bit,
which makes the delay slot part of the else clause, giving more
opportunities to fill that slot with something besides a no-op.

Someone will post on the usefulness of compiler hints on branches.
I had heard that such was futile, but Linux is apparently full of
if(likely()) and if(unlikely()) directives used to set x86 hint prefix bytes.
PowerPC has a single 'y' prediction bit, cleared for normal behavior
and set to reverse that behavior. This kind of blows my Sparc story
out of the water.
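As a sketch, the Linux-style hints mentioned above are built on GCC's __builtin_expect; the compiler mainly uses them to lay the likely path out as the straight-line fall-through (and, on some targets, to set static hint bits). The function below is a made-up usage example, not from the kernel:

```c
/* Typical likely()/unlikely() macros as used in Linux-style code. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int checked_div(int a, int b, int *out)
{
    if (unlikely(b == 0))    /* error path moved off the hot path */
        return -1;
    *out = a / b;
    return 0;
}
```
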

I had thought that none of the CPUs I compiled for had branch hint bits,
but now I am not so sure.

For deep pipelines predicting short forward branches untaken may just
mean the instructions inside the short branch get marked as
predicated, and a false prediction just means the results get thrown
away, while the PC continues on an optimal linear path.
With an average IPC of ~1, those instructions were probably "free".
If you branch hinted over the instructions and were wrong the cost
is a dozen or two cycles to refresh the instruction pipe.

Having branch hints set wrong can be very bad, and setting the branch
hints correctly may not help much. Easy to see that an OoO upgrade
on a CPU could cause havoc on old hint bits.

Andy "Krazy" Glew

unread,
Oct 29, 2011, 8:54:06 PM10/29/11
to
On 10/29/2011 5:23 PM, Brett Davis wrote:

> I had thought that none of the CPUs I compiled for had branch hint bits,
> but now I am not so sure.

http://wiki.osdev.org/X86_Instruction_Encoding

Branch taken/not taken prefixes
Branch hints may be used to lessen the impact of branch misprediction
somewhat. The 'branch taken' hint is a strong hint, while the 'branch
not taken' hint is a weak hint. The branch hints are only supported by
Intel since the Pentium 4. Whether using them on AMD architectures has
any (positive or negative) effect at all is not known.


http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/

The Pentium® 4 Processor introduced new instructions for adding static
hints to branches. It is not recommended that a programmer use these
instructions, as they add slightly to the size of the code and are
static hints only. It is best to use a conditional branch in the manner
that the static predictor expects, rather than adding these branch hints.

In the event that a branch hint is necessary, the following instruction
prefixes can be added before a branch instruction to change the way the
static predictor behaves:

0x3E – statically predict a branch as taken
0x2E – statically predict a branch as not taken

BGB

unread,
Oct 29, 2011, 9:12:19 PM10/29/11
to
yeah.

I have 16GB of DDR3 and spent $100.

a few years back I spent $100 for 4GB of DDR2.

a few more years and maybe 32GB or 64GB of RAM will be in a similar
price range.


although CPUs aren't really getting much faster, RAM and HDDs are still
getting bigger at least...

BGB

unread,
Oct 29, 2011, 9:22:51 PM10/29/11
to
On 10/29/2011 5:54 PM, Andy "Krazy" Glew wrote:
> On 10/29/2011 5:23 PM, Brett Davis wrote:
>
>> I had thought that none of the CPUs I compiled for had branch hint bits,
>> but now I am not so sure.
>
> http://wiki.osdev.org/X86_Instruction_Encoding
>
> Branch taken/not taken prefixes
> Branch hints may be used to lessen the impact of branch misprediction
> somewhat. The 'branch taken' hint is a strong hint, while the 'branch
> not taken' hint is a weak hint. The branch hints are only supported by
> Intel since the Pentium 4. Whether using them on AMD architectures has
> any (positive or negative) effect at all is not known.
>

IIRC, I read before that AMD decided to ignore the hints (under the
reasoning that the branch-predictor could do a better job).


>
> http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/
>
>
> The Pentium® 4 Processor introduced new instructions for adding static
> hints to branches. It is not recommended that a programmer use these
> instructions, as they add slightly to the size of the code and are
> static hints only. It is best to use a conditional branch in the manner
> that the static predictor expects, rather than adding these branch hints.
>
> In the event that a branch hint is necessary, the following instruction
> prefixes can be added before a branch instruction to change the way the
> static predictor behaves:
>
> 0x3E – statically predict a branch as taken
> 0x2E – statically predict a branch as not taken
>

yep.

Paul A. Clayton

unread,
Oct 29, 2011, 9:23:27 PM10/29/11
to
On Oct 29, 7:23 pm, Brett Davis <gg...@yahoo.com> wrote:
[snip]
> Someone will post on the usefulness of compiler hints on branches.
> I had heard that such was futile, but linux is apparently full of
> if(likely()) and if(unlikely()) directives used to set x86 hint prefix bytes.

The directives are probably used for branch alignment not
hints in branch instructions. (Branch alignment takes the
unlikely path out of the fall-through path.) With aggressive
optimizations, this could even improve scheduling by hoisting
post-branch operations from the likely path before the branch
and introducing fix-up code in the unlikely path.

Such directives could also be used to assist the hardware by
favoring its ordinary static branch prediction method. (Static
prediction information might be used in the case of a branch
predictor miss [e.g., if using tagged BTBs with per-address
prediction information] or for biasing the predictor/settling
ties. If the static information was available early, it could
be used with an agree mechanism.) There might even be some
benefit in biasing the global history to improve gshare-type
branch predictors. (A gshare predictor with 9 history bits
could probably have 7 bits using agree-with-static-prediction;
but it is not clear that such would be worth the bother.)
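A toy gshare along the lines discussed (9 history bits as in the example above) is small enough to sketch; this is illustrative only — real predictors add tags, agree bits, and so on:

```c
#include <stdint.h>
#include <string.h>

/* Minimal gshare: a table of 2-bit saturating counters indexed by
 * PC XOR global history. */
#define GH_BITS  9
#define TAB_SIZE (1u << GH_BITS)

typedef struct {
    uint8_t  ctr[TAB_SIZE];  /* 2-bit counters, 0..3; >= 2 predicts taken */
    uint32_t ghist;          /* global taken/not-taken history */
} GShare;

static void gs_init(GShare *g)
{
    memset(g, 0, sizeof *g);
    for (unsigned i = 0; i < TAB_SIZE; i++)
        g->ctr[i] = 1;       /* start weakly not-taken */
}

static unsigned gs_index(const GShare *g, uint32_t pc)
{
    return (pc ^ g->ghist) & (TAB_SIZE - 1);
}

static int gs_predict(const GShare *g, uint32_t pc)
{
    return g->ctr[gs_index(g, pc)] >= 2;   /* 1 = predict taken */
}

static void gs_update(GShare *g, uint32_t pc, int taken)
{
    unsigned i = gs_index(g, pc);
    if (taken  && g->ctr[i] < 3) g->ctr[i]++;
    if (!taken && g->ctr[i] > 0) g->ctr[i]--;
    g->ghist = ((g->ghist << 1) | (taken ? 1u : 0u)) & (TAB_SIZE - 1);
}
```

The agree variant discussed above would store "agrees with static prediction" in the counters instead of "taken", so strongly biased branches that share an index stop destructively aliasing.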

[snip]
> For deep pipelines predicting short forward branches untaken may just
> mean the instructions inside the short branch get marked as
> predicated, and a false prediction just means the results get thrown
> away, while the PC continues in a optimal linear path.

I do not think any processors perform dynamic predication,
though it has been proposed academically.

I seem to recall reading that some processor did cache the
incidentally fetched instructions from the alternate path,
but that does not help that much with deep (and wide)
pipelines.

> With an average IPC of ~1, those instructions were probably "free".

Not exactly free (ROB and rename resources--running out of
these could prevent some independent operations in the
correct path from being visible/executable--, possible cache
pollution, and extra power consumption), which is why one
would want a confidence estimate on the branch prediction.

[snip]
> Having branch hints set wrong can be very bad, and setting the branch
> hints correctly may not help much. Easy to see that a OoO upgrade
> on a CPU could cause havoc on old hint bits.

How would OoO cause havoc on correctly set hints? A
processor can always ignore hints.

Paul A. Clayton

unread,
Oct 29, 2011, 10:21:35 PM10/29/11
to
On Oct 29, 7:54 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]
> http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/

Interesting. I did not know that the Pentium4 had a 16-bit
local history per BTB entry. That seems like a lot of
local history. (This seems like an area where compression--
even somewhat lossy compression--could be useful. Of course,
fixed transition rate/loop predictors probably take out much
of the compressibility of local history.)

Casper H.S. Dik

unread,
Oct 30, 2011, 8:41:30 AM10/30/11
to
Brett Davis <gg...@yahoo.com> writes:

>The Sparc branch hint bug was a story I heard but never confirmed.
>A quick look at the version 8 Sparc manual shows no hint bit, or obvious
>dead bit in the encoding. With 8 updates the bit would have been
>recycled by now, if indeed I got the CPU right, or the story right.

The hint was new in SPARCv9; the branch with prediction was new
(and looking at the manual, there are no "spare bits" in the
branch instruction; it uses all left-over bits for the jump target).
(The old branch instruction has a "displacement" of 22 bits, the newer
one has only 19 bits.)

Note that earlier SPARC processors didn't enforce that reserved bits
in an instruction were 0 (as required), and Solaris makes sure that
code created by broken compilers works.

I haven't heard about that the SPARC hardware ignores the prediction bit.

Casper
--

Brett Davis

unread,
Oct 30, 2011, 5:20:45 PM10/30/11
to
In article <4EACA02E...@SPAM.comp-arch.net>,
"Andy \"Krazy\" Glew" <an...@SPAM.comp-arch.net> wrote:

> On 10/29/2011 5:23 PM, Brett Davis wrote:
>
> > I had thought that none of the CPUs I compiled for had branch hint bits,
> > but now I am not so sure.
>
> http://wiki.osdev.org/X86_Instruction_Encoding
>
> Branch taken/not taken prefixes

While I have spent my life compiling on x86, I mostly never compiled FOR x86.
I am in the embedded market, I have never shipped a x86 app.

Thomas Entner

unread,
Oct 30, 2011, 5:38:23 PM10/30/11
to
On 28 Okt., 16:48, BGB <cr88...@hotmail.com> wrote:
> for what I was imagining, it would be a simple "one-at-a-time" execution
> path (although, as currently imagined I am ending up with around 4 clock
> cycles per-instruction, granted this is absent a pipeline, whereas a 4
> stage pipeline or similar could probably address this issue).
>
> (but, all this leads to some operations likely still needing more clock
> cycles, meaning probably still being stuck with variable cycles per
> instruction, hmm...).
>
> almost may as well use a single bus for code and data, since absent a
> pipeline it is unlikely the core would need to access both code and data
> at once.
>
> the idea would be to keep the cores simple, so that then one can have
> large numbers of them (say, ideally, if one could fit 512 or 1024 cores
> on a die). as a result, each core would likely lack any sort of
> floating-point or SIMD capabilities as well (all real-valued math is
> either fixed-point or emulated to save die space...).
>
> the reason for the thing then being 64-bits would be to allow for large
> data-sets.

Do you intend your design for a very special highly parallel
application, or for general purpose? Esp. if it is for general
purpose, I think the single core performance should be maximized as
much as possible in an "easy" way. So at least it should be fully
pipelined, achieving 1 instruction per cycle for most instructions. (I
guess there is one performance point beyond which every further
increase of performance will cost a lot; I do not know where this
point is: pipelined? super-scalar? out-of-order?) Below this point,
even if you can have twice as many cores at half the performance each,
you will find almost no applications that can really use all cores
effectively, as most tasks are hard to execute in parallel.

Also, I think you will be soon at a point where the CPU-size is
smaller than the caches for the CPU, and the more CPUs you have on a
die, the more cache you will need per CPU, I guess, otherwise the
memory-bandwidth-bottleneck will hit you fully.

Variable length opcodes: I think this is a good idea to get compact
code. But why restrict the length to 1, 2, 4 and 8, and force them to
be aligned? I think 2, 4, 6, 8, 10 would make more sense, and
unaligned instructions will not be a problem for the hardware,
especially when there is just one instruction per clock, while forced
alignment will be a huge penalty for performance, code-size, and
compiler-writers. (I
think this is a solution for something where there is no real
problem.)

Regards,

Thomas

Brett Davis

unread,
Oct 30, 2011, 5:48:52 PM10/30/11
to
In article <4ead45fa$0$6985$e4fe...@news2.news.xs4all.nl>,
Casper H.S. Dik <Caspe...@OrSPaMcle.COM> wrote:

> Brett Davis <gg...@yahoo.com> writes:
>
> >The Sparc branch hint bug was a story I heard but never confirmed.
> >A quick look at the version 8 Sparc manual shows no hint bit, or obvious
> >dead bit in the encoding. With 8 updates the bit would have been
> >recycled by now, if indeed I got the CPU right, or the story right.
>
> The hint was new in SPARCv9; the branch with prediction was new
> (and looking at the manual, there are no "spare bits" in the
> branch instruction; it uses all left-over bits for the jump target.
> The old branch instruction has a "displacement" of 22 bits; the newer
> one has only 19 bits.)

Two new condition bits, and one branch prediction bit, like Power.
There also has to be an instruction set mode bit for this change in
instruction sets to work.
The hardware guys HATE doing this.

> Note that earlier SPARC processors didn't enforce that reserved bits
> in an instruction were 0 (as required) and Solaris makes sure that
> code created by broken compilers works.

This is the source of the rumor, the compiler spewing crap into some
of the reserved bits, and Sparc having to live with that into the future.

Which proves the point of this subthread, all undefined bits need to
throw an exception, otherwise those bits may not be available for use.

BGB

unread,
Oct 30, 2011, 7:35:34 PM10/30/11
to
I was imagining it mostly for a reasonably specialized task:
running neural nets.

neurons have very little direct communication, and operate almost
entirely in parallel.


as noted a number of times already, I am mostly doing a hypothetical
design exercise, and have no real personal experience either with CPU
design nor personal access to chip-fab hardware. (however, I do have
some personal experience with designing/implementing compilers and VMs).


essentially, the thing would be a stream processor, similar to a GPU, but
optimized for a different workload (namely, higher parallelism at lower
numerical throughput, likely working primarily with small integer values).

the idea then is mostly how to best maximize both parallelism and
net-throughput.


> Also, I think you will be soon at a point where the CPU-size is
> smaller than the caches for the CPU, and the more CPUs you have on a
> die, the more cache you will need per CPU, I guess, otherwise the
> memory-bandwidth-bottleneck will hit you fully.
>

each core is likely to have a reasonably small working set.
I had originally imagined having a large number of tiny address spaces
(making each core more or less like an 8080), however this would be a
total pain to work with, and supporting a large address space made more
sense, on the assumption that each core maybe only has a few kB for its
local working set.


> Variable length opcodes: I think this is a good idea to get compact
> code. But why restrict the length to 1, 2, 4 and 8, and force them to
> be aligned? I think 2, 4, 6, 8, 10 would make more sense, and
> unaligned instructions will not be a problem for the hardware,
> especially when there is just one instruction per clock, while forced
> alignment will be a huge penalty for performance, code-size, and
> compiler-writers. (I
> think this is a solution for something where there is no real
> problem.)
>

as noted, the big issue with misaligned instructions is that to work
efficiently, one either needs per-core cache, or may have to perform
multiple IO operations to fetch the instruction.

say, one has 64-bit memory cells, and is fetching a 2-byte instruction.
absent alignment requirements, there is a risk that the instruction
could cross a cell boundary, and thus require 2 bus operations to fetch.


granted, yes, I guess maybe it may make more sense to, rather than
having a "cache control unit" and 8 cores which each take 8 clocks to
execute an instruction, to simply merge all of this into a single core
with a cache and an 8-stage pipeline (at 1-instruction-per-clock), since
in both cases the total throughput would be equivalent.

but, then again, if one has cache and a pipeline, then it renders the
particular instruction coding kind of a moot point... (with a cache, one
almost may as well allow arbitrarily-aligned instructions...).


another consideration was what to do about "microcode".

one idea was that one could make the instructions do one of 2 possible
tasks:
the core executes the instruction directly;
the core converts the instruction into an interrupt, and the interrupt
handler executes the instruction (probably with the instruction placed
into a register).

in this case, the ISA is its own microcode format (provided it is not
recursive...).


an example of such an issue would be whether or not to have an in-core
multiply unit:
an adder unit's size only increases linearly with the number of bits;
a single-clock multiply unit goes up with the square of the bits.

hence, one could consider performing a multiply piecewise, but this
would likely require some sort of microcode.

a function call would be lame, hence using the opcode to directly
trigger an interrupt (and the code being written need not care too much
whether its instructions are being executed directly or via an interrupt
handler).

granted, another possibility would be that a code generator could
generate different code depending on whether or not a multiply
instruction exists.


but, at this point, I am almost left thinking one "may as well" just
have full 386-style cores (albeit maybe using an x86-64 variant instead
of 32-bit x86).

I guess another question would be, assuming a small (presumably
sequential) x86-64 like core, how many cores can one reasonably fit on a
die?...

another relevant question would be the costs of having or omitting SIMD,
or doing other possibilities (such as using scalar FP using the GPRs as
floating-point registers).

say:
mulss eax, edx ;(special instruction form, uses GPRs vs XMM regs)


vs, another option:
just using ARM cores...


or such...

jacko

unread,
Oct 30, 2011, 8:05:23 PM10/30/11
to
In the sense that one instruction does a 3 in 1 out operation, and the choice of sequencing requires a time index to be exceeded for execution at t+1 to occur, parallel is a limit of register locality.

Cheers Jacko

BGB

unread,
Oct 30, 2011, 9:35:59 PM10/30/11
to
On 10/30/2011 5:05 PM, jacko wrote:
> In the sense that one instruction does a 3 in 1 out operation, and the choice of sequencing requires a time index to be exceeded for execution at t+1 to occur, parallel is a limit of register locality.
>

context?...

I have no idea what was above this in Google-Groups or similar, so I
don't know what exactly was being responded to.

MitchAlsup

unread,
Oct 31, 2011, 2:54:07 PM10/31/11
to
On Saturday, October 29, 2011 8:22:51 PM UTC-5, BGB wrote:
> On 10/29/2011 5:54 PM, Andy "Krazy" Glew wrote:
> > Branch taken/not taken prefixes
> > Branch hints may be used to lessen the impact of branch misprediction
> > somewhat. The 'branch taken' hint is a strong hint, while the 'branch
> > not taken' hint is a weak hint. The branch hints are only supported by
> > Intel since the Pentium 4. Whether using them on AMD architectures has
> > any (positive or negative) effect at all is not known.
> >
>
> IIRC, I read before that AMD decided to ignore the hints (under the
> reasoning that the branch-predictor could do a better job).

We measured it, the predictor beat the prefix (quite handily).

Mitch

BGB

unread,
Oct 31, 2011, 5:08:09 PM10/31/11
to
yep, maybe because it can compare how often the branch is taken vs
not-taken under real working loads?...


> Mitch

Andy "Krazy" Glew

unread,
Oct 31, 2011, 8:22:09 PM10/31/11
to
On 10/31/2011 11:54 AM, MitchAlsup wrote:
Consider the following for a contended lock:

...
test_and_test_and_set_spinloop:
read lock into reg
if reg == 1 goto test_and_test_and_set_spinloop
atomic rmw to set lock, old value into reg
if reg == 1 goto test_and_test_and_set_spinloop
// got the lock
...

Yes, the predictor correctly predicts that the loops are taken.

But is that what you really want to have? A guaranteed branch
misprediction as soon as you acquire the lock, lengthening the critical
section?




BGB

unread,
Oct 31, 2011, 9:40:34 PM10/31/11
to
this depends some on the number of threads battling for the lock.
if the number of threads competing for a lock is low, then actual
contention will often be rare (leaving the lock more as a
crash-prevention safety feature).

in this case, the predictor will likely predict that the loop exits quickly.

in the other case, there is no good option, as trying to optimize for
exiting the loop will likely simply make the loop spin slower.

although, I guess maybe the loop going slower will not be a huge issue,
and this may also reduce the number of bus-locking operations (maybe
improving overall system performance?).


or such...

Eric Northup

unread,
Nov 1, 2011, 1:07:22 PM11/1/11
to
On Oct 29, 5:23 pm, Brett Davis <gg...@yahoo.com> wrote:
> Someone will post on the usefulness of compiler hints on branches.
> I had heard that such was futile, but linux is apparently full of
> if(likely()) and if(unlikely()) directives used to set x86 hint prefix bytes.
> PowerPC has a single 'y' prediction bit, cleared for normal behavior
> and set to reverse that behavior. This kind of blows my Sparc story
> out of the water.

My understanding is gcc uses these directives not just to encode
static branch hints, but also to separate hot/cold code. I'm not even
sure if it will use the branch hints on x86.

Stefan Monnier

unread,
Nov 1, 2011, 2:12:22 PM11/1/11
to
> Consider the following for a contended lock:

> ...
> test_and_test_and_set_spinloop:
> read lock into reg
> if reg == 1 goto test_and_test_and_set_spinloop
> atomic rmw to set lock, old value into reg
> if reg == 1 goto test_and_test_and_set_spinloop
> // got the lock
> ...

> Yes, the predictor correctly predicts that the loops are taken.
> But is that what you really want to have? A guaranteed branch misprediction
> as soon as you acquire the lock, lengthening the critical section?

Interesting. So you'd want a strong hint here, which doesn't just mean
"I think it'll usually be not taken" but really "optimize for the
fallthrough case, even if it's rarely taken; just trust me". Because in
the active wait loop, looping faster won't reduce the execution time.

Basically the CPU's branch prediction assumes that which code is
executed doesn't depend on how fast we run, which is not a valid
assumption in the above case.

I wonder how often such strong hints would be useful. Also, is there
some way to trick current CPUs into doing "the right thing"?


Stefan

MitchAlsup

unread,
Nov 1, 2011, 5:40:26 PM11/1/11
to an...@spam.comp-arch.net

On Monday, October 31, 2011 7:22:09 PM UTC-5, Andy Krazy Glew wrote:
> Consider the following for a contended lock:
>
OK, let us consider the contended spin lock:
> ...
> test_and_test_and_set_spinloop:
> read lock into reg
> if reg == 1 goto test_and_test_and_set_spinloop
> atomic rmw to set lock, old value into reg
> if reg == 1 goto test_and_test_and_set_spinloop
> // got the lock

// Scenario:
// the lock is LOCKed
// there are n processors waiting in the spinloop
// the system has reached a point where every waiting thread is
// spinning in their own cache.

test_and_test_and_set_spinloop:
read lock into reg
if reg == 1 goto test_and_test_and_set_spinloop

// In the case where the reg is LOCKed; this processor is, in essence,
// waiting for a coherent write to invalidate the '1' in the lock memory
// location. Allowing the processor to access the <essentially> frozen
// value in the cache faster (or more often) does not usefully improve
// the wait time.
//
// In the case where the reg is unLOCKed; falling through is the best choice
// because taking the branch represents no forward progress.
//
// Thus predicting this branch always not_taken is near optimal. When this
// branch IS predicted not_taken, the pipeline will see the ATOMIC instruction
// and can then adjust branch prediction at a local scale.

atomic rmw to set lock, old value into reg
if reg == 1 goto test_and_test_and_set_spinloop
// got the lock
<do the critical section work>
write 0 into lock

// The owner of the lock writes a 0 into the lock. The write back cache
// tracks the shared nature of the lock holding cache line. The cache issues
// a coherent invalidate to the cache line holding the lock.

// In a fabric based system different nodes see the coherent invalidate at
// different times. The spinning thread stops spinning in the cache (line
// was invalidated) and issues a read memory request to the cache line holding
// the lock. The memory controller sees N read requests to the same cache line.
// In most fabrics these reads are serialized for cache coherence reasons.
// Some memory controllers can use the read buffer to avoid DRAM access latency
// but must still serialize the read requests. In a fabric based system the
// request that gets to the memory controller first should win the lock
// (and does most of the time).

// The processor that gets its read request serviced first sees the memory
// location unLOCKed.

// As this processor sees the locked location, the end of the coherent
// memory transaction arrives back at the memory controller and the second read
// to the memory location begins. By this point one can assume that all other
// waiting processors will have sent their read requests to the memory controller.

// The processor that saw the location unLOCKed then performs an ATOMIC RMW
// to the locked location. This locked request arrives BEHIND the reads already
// pending on the memory controller to the same line.

// Under contention, there are n processors contending on the lock, 1 will
// get the lock when it is unLOCKed and n-1 will go back into the spinning
// state. In order to put each one back into the spinning state each thread
// must observe the lock is in the LOCKed state. Thus the processor that
// "gave the lock up" will cause a cascade of memory activity on the
// interconnect all targeting that one memory location. It will take a
// quadratic ((n-1)*(n-2)) number of memory accesses for the waiting
// processors to achieve the spinloop contained completely in the cache.
// Many times achieving quiescence takes longer than the work inside
// the critical section and all these memory refs slow down the unLOCKing
// of the lock. All of the contention works its way out at memory speed,
// not processor speed.

// If the memory controller were able to understand the situation, it would
// propagate the write over the reads so that the waiting processor threads
// could go back to sleep without seeing the lock in an unlocked state.
// Were the memory controller able to make use of this, only 2 processors
// would see the lock in an unlocked state and fabric traffic would go way down.

// Once again; there is little utility in predicting this branch taken
// and considerable advantage to the lock holding thread to predict it
// not_taken.
//
// If the critical section does only a few memory accesses to complete
// its duties before unlocking the lock, the unlocking of the lock can
// end up stalled behind the cascade of read_with_intent_to_modify
// requests from the atomic instructions of other threads.

Back to prefixing conditional branches:

So, for this particular case, it would be just as easy for the processor to recognize the T&T&S paradigm and alter branch prediction at a local scale as to end up having to prefix branch instructions. That is, the pipeline can deduce the predict_not_taken-ness from the presence of the <already existing> LOCK prefix on the RMW instruction rather than from a prefix in front of the conditional branch instruction(s).

Does any processor actually do this? I have no idea.

Could the processor recognize this paradigm and then power down until it receives a coherent invalidate to the locked cache line? (Rather than spinning in the spin loop, go to sleep until the necessary event at the other side of the cache takes place.) This would save lots of power without having to 'decorate' branch instructions and be essentially "just as fast".

Back to considering spinlocks under contention:

However, it is my contention that allowing more than 2 (or 3) waiting threads to see the lock in the unlocked state is the root of the problem. Prefixing the branch instructions nearby the ATOMIC events is not getting at the root of the ATOMIC event latency or memory overheads.

Mitch

Andy "Krazy" Glew

unread,
Nov 2, 2011, 1:22:30 AM11/2/11
to
On 11/1/2011 2:40 PM, MitchAlsup wrote:
>
> On Monday, October 31, 2011 7:22:09 PM UTC-5, Andy Krazy Glew wrote:
>> Consider the following for a contended lock:
>>
> OK, let us consider the contended spin lock:
>> ...
>> test_and_test_and_set_spinloop:
>> read lock into reg
>> if reg == 1 goto test_and_test_and_set_spinloop
>> atomic rmw to set lock, old value into reg
>> if reg == 1 goto test_and_test_and_set_spinloop
>> // got the lock
>
>
> // In the case where the reg is unLOCKed; falling through is the best choice
> // because taking the branch represents no forward progress.
> //
> // Thus predicting this branch always not_taken is near optimal. When this
> // branch IS predicted not_taken, the pipeline will see the ATOMIC instruction
> // and can then adjust branch prediction at a local scale.
...

>
> So, for this particular case, it would be just as easy for the processor to recognize the T&T&S paradigm and alter branch prediction at a local scale as to end up having to prefix branch instructions. That is, the pipeline can deduce the predict_not_taken-ness from the presence of the <already existing> LOCK prefix on the RMW instruction rather than from a prefix in front of the conditional branch instruction(s).

You have a different definition of easy than I do:

This isn't a peephole optimization. It might almost be a peephole
optimization if the spinloop exit were immediately followed by t



> Could the processor recognize this paradigm and then power down until it receives a coherent invalidate to the locked cache line? (Rather than spinning in the spin loop, go to sleep until the necessary event at the other side of the cache takes place.) This would save lots of power without having to 'decorate' branch instructions and be essentially "just as fast".

By the way, virtual machines also need something similar.


> Back to considering spinlocks under contention:
>
> However, it is my contention that allowing more than 2 (or 3) waiting threads to see the lock in the unlocked state is the root of the problem. Prefixing the branch instructions nearby the ATOMIC events is not getting at the root of the ATOMIC event latency or memory overheads.

Sure, change the software.

How has that worked for you in the past?

Terje Mathisen

unread,
Nov 2, 2011, 4:54:28 AM11/2/11
to
MitchAlsup wrote:
> Back to considering spinlocks under contention:
>
> However, it is my contention that allowing more than 2 (or 3) waiting
> threads to see the lock in the unlocked state is the root of the
> problem. Prefixing the branch instructions nearby the ATOMIC events
> is not getting at the root of the ATOMIC event latency or memory
> overheads.

It seems obvious that in a situation where multiple threads are likely
to be waiting on a TTS lock like this, doing that initial Test, which
seems like a good idea, is instead pretty horrible:

Since the only use for this particular cache line is to grab a lock
value, all accesses will be either Acquire or Release, right?

Simply doing (LOCK) XCHG (or LOCK CMPXCHG) directly would generate a
more expensive bus operation for each attempt, but as soon as the lock
was released by the former owner, the winner would then make immediate
forward progress.

The sad part here is that we have a good analogy to this situation in
regular CSMA/CD Ethernet, if we treat the initial Test part as the
carrier sense in Ethernet: Unless it is currently free/idle, the Ethernet hw is
obliged to go into a randomized exponential backoff loop. It has been
proven theoretically and shown in practice that this is sufficient to
guarantee both stable operation and forward progress.

However, since we don't have the corresponding Ethernet problem where if
two stations try to send (nearly) at once, both fail, we can in fact
attempt to directly grab the lock, and then go into a backoff wait only
when the first attempt fails.

It seems like such an exponential backoff is the distributed sw
equivalent to your much faster hw solution to the problem. :-)

OTOH, most such locking code has probably been written under the
assumption that contention will be _very_ rare, and if it does happen,
the lock will probably become free in an iteration or two.

One point re bus operations:

In a TTS setup where the lock is already owned by somebody else, the
initial Test (read) operation will have to force the current owner to
first write back the cache line with the updated (locked) value, then
change the state to Shared, right?

Immediately following this, when the owner wants to free the lock, the
cache line must go back into Exclusive state, Invalidating the copies in
all other cpus.

Writing a 0 to release it should change the line to Modified and trigger
a writeback whereupon the other cpus can fight it out for who's next.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Piotr Wyderski

unread,
Nov 2, 2011, 9:07:28 AM11/2/11
to
MitchAlsup wrote:

> So you have 192 single byte instructions and 64*256 = 16384 two byte instructions.

If one has the luxury to define a completely new ISA, then
why don't use Huffman-style prefix encoding in order to specify
the instruction length? In the first byte:

0b0xxx xxxx = 128 1-byte instructions
0b10xx xxxx = 16384 2-byte instructions
0b1100 xxxx = 1048576 3-byte instructions
0b1101 xxxx = 268435456 4-byte instructions
0b1110 xxxx = 68719476736 5-byte instructions
0b1111 00 xx = etc.

Parallel decoding should be fairly easy and opcode space
compression could be good. Based on the generated code
statistics one can even imagine a non-linear mapping, i.e.
fill the 128 single-byte slots with the most probable instructions
(together with their operands) without any visible pattern,
then do the same with the 2-byte ones and leave the longer
formats to provide decoded info explicitly. Two small ROMs
will do the job on the most probable path and the rest can
be expected to be rare.

Best regards
Piotr Wyderski

Pedro Pereira

unread,
Nov 2, 2011, 9:17:13 AM11/2/11
to
Piotr Wyderski wrote:

> MitchAlsup wrote:
>
>> So you have 192 single byte instructions and 64*256 = 16384 two byte
>> instructions.
>
> If one has the luxury to define a completely new ISA, then
> why don't use Huffman-style prefix encoding in order to specify
> the instruction length? In the first byte:
>
> 0b0xxx xxxx = 128 1-byte instructions
> 0b10xx xxxx = 16384 2-byte instructions
> 0b1100 xxxx = 1048576 3-byte instructions
> 0b1101 xxxx = 268435456 4-byte instructions
> 0b1110 xxxx = 68719476736 5-byte instructions
> 0b1111 00 xx = etc.

What about a UTF-8 like instruction encoding:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Using this mechanism one can immediately see if a jump
target starts in the middle of an instruction, for example.

>
> Parallel decoding should be fairly easy and opcode space
> compression could be good.

Maybe with a UTF-8 like instruction encoding the parallel
decoding could be made easier.

Pedro Pereira

Piotr Wyderski

unread,
Nov 2, 2011, 9:25:50 AM11/2/11
to
Terje Mathisen wrote:

> Yeah, that one took me a few years to figure out, but now I am firmly in
> the "Short vectors shall only require item-sized alignment" camp.

But if you allow 8-bit items, then you need a full-blown
aligner network. This, in turn, allows you to do pretty much anything,
removing all the alignment constraints. You can also reuse it
for the GP part. Why not sell it that way?

> This is because otherwise it becomes far too painful to write code that
> will tolerate changes in the width of vector regs, and pretty much
> impossible to efficiently interface between code written for width A
> (say 128 bits) and width B (256 bits).

BTDT. :-(

I'm also in the "SIMD-only architecture" camp, i.e. the way
the scalar SSE FP is done. Just give me 256-bit SIMD registers
with scalar-mode abilities to mimic the GP instruction set and
I'll be perfectly happy till the end of my life. :-)

Best regards,
Piotr Wyderski

Piotr Wyderski

unread,
Nov 2, 2011, 9:29:27 AM11/2/11
to
nm...@cam.ac.uk wrote:

> The correct solution is now, as it always has been, to pack
> and unpack properly when that is needed, which can usually be done
> for free

The problem is that repacking rarely comes for free, e.g.
in the raytracing community its cost, or the means to reduce
it, is the source of some of the craziest code I've ever seen.

Best regards
Piotr Wyderski

Terje Mathisen

unread,
Nov 2, 2011, 9:33:33 AM11/2/11
to
Pedro Pereira wrote:
> Piotr Wyderski wrote:
>> If one has the luxury to define a completely new ISA, then
>> why don't use Huffman-style prefix encoding in order to specify
>> the instruction length? In the first byte:
>>
>> 0b0xxx xxxx = 128 1-byte instructions
>> 0b10xx xxxx = 16384 2-byte instructions
>> 0b1100 xxxx = 1048576 3-byte instructions
>> 0b1101 xxxx = 268435456 4-byte instructions
>> 0b1110 xxxx = 68719476736 5-byte instructions
>> 0b1111 00 xx = etc.
>
> What about a UTF-8 like instruction encoding:
>
> 0xxxxxxx
> 110xxxxx 10xxxxxx
> 1110xxxx 10xxxxxx 10xxxxxx
> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>
> Using this mechanism one can immediately see if a jump
> target starts in the middle of an instruction, for example.

Oh no!

This would immediately destroy all opportunities for "Real Programmer"
(TM) style embedding of one instruction inside another. :-(

Today we can still do this, except the performance really sucks because
it causes flushing of any cached decode information, like instruction
boundaries. :-(

>> Parallel decoding should be fairly easy and opcode space
>> compression could be good.
>
> Maybe with a UTF-8 like instruction encoding the parallel
> decoding could be made easier.

I have to admit that this does have some nice properties.

Terje Mathisen

unread,
Nov 2, 2011, 10:15:08 AM11/2/11
to
Piotr Wyderski wrote:
> Terje Mathisen wrote:
>
>> Yeah, that one took me a few years to figure out, but now I am firmly in
>> the "Short vectors shall only require item-sized alignment" camp.
>
> But if you allow 8-bit items, then you need a full-blown
> aligner network. This, in turn, allows you to do pretty much anything,
> removing all the alignment constraints. You can also reuse it
> for the GP part. Why not sell it that way?

If you do have it, why not flaunt it?
>
>> This is because otherwise it becomes far too painful to write code that
>> will tolerate changes in the width of vector regs, and pretty much
>> impossible to efficiently interface between code written for width A
>> (say 128 bits) and width B (256 bits).
>
> BTDT. :-(
>
> I'm also in the "SIMD-only architecture" camp, i.e. the way
> the scalar SSE FP is done. Just give me 256-bit SIMD registers
> with scalar-mode abilities to mimic the GP instruction set and
> I'll be perfectly happy till the end of my life. :-)

Sure, but only if you can do those load/store ops at scalar-size alignment.

MitchAlsup

unread,
Nov 2, 2011, 12:46:35 PM11/2/11
to an...@spam.comp-arch.net
No, in this particular case, I was going to change not the software, but the memory ordering rules for memory accesses where LOCKed activities are ongoing. By allowing the LOCKed RMW's store to pass to the front of the memory requests for that cache line, only 2 or 3 memory requests will have seen the lock in the unLOCKed state; the rest of the threads will go back to sleep after seeing it as LOCKed again.

The other change one makes is that the store part of the Cacheable LOCKed RMW does not become visible outside the thread if the stored value is equal to the value already in the store-to location.

Mitch

MitchAlsup

unread,
Nov 2, 2011, 12:53:09 PM11/2/11
to
On Wednesday, November 2, 2011 3:54:28 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> OTOH, most such locking code has probably been written under the
> assumption that contention will be _very_ rare, and if it does happen,
> the lock will probably become free in an iteration or two.

Which was not a particularly bad assumption when the sw was written, but is becoming more and more so every day.

> One point re bus operations:
>
> In a TTS setup where the lock is already owned by somebody else, the
> initial Test (read) operation will have to force the current owner to
> first write back the cache line with the updated (locked) value, then
> change the state to Shared, right?

Correct, the owner of the lock does not retain the ability to release the lock quickly (i.e. in the Modified or Owned states).

> Immediately following this, when the owner wants to free the lock, the
> cache line must go back into Exclusive state, Invalidating the copies in
> all other cpus.

Causing a cascade of memory requests, few of which are useful in the "making forward progress" problem.

> Writing a 0 to release it should change the line to Modified and trigger
> a writeback whereupon the other cpus can fight it out for who's next.

What is worse is that while the owner of the lock is getting the line to write the lock, all the other threads contending for the lock just jump in front of him, hindering his ability to write the lock he just got. The only saving grace, here, is the architectural allowance to get the lock and write it in the following cycle before "entertaining" new coherent requests from the memory controller.

Mitch

Paul A. Clayton

Nov 2, 2011, 3:41:07 PM
to
On Nov 2, 11:53 am, MitchAlsup <MitchAl...@aol.com> wrote:
> Correct, the owner of the lock does not retain the
> ability to release the lock quickly (i.e. in the
> Modified or Owned states).

Would there be any/sufficient benefit to providing stores
that have something like a "no-write-allocate" decoration?
This would allow the write to quickly pass to a level of
cache nearer the readers and avoid a later invalidating
of the just written cache block.

Weird thought: for a modest number of threads (or thread
groups) would there be any use for a bit vector with an
offset to the bit that is currently active? Declarations
of interest (setting the bit) could be coalesced by
hardware and each thread could determine how long a line
there is by, e.g., rotating the bit vector by the active
offset and counting the set bits 'ahead' of its bit
position. My guess is that better mechanisms have been
developed (this thought was spurred by the thought
of no-write-allocate stores, which could be coalesced).

BGB

Nov 2, 2011, 9:20:16 PM
to
On 11/2/2011 6:25 AM, Piotr Wyderski wrote:
> Terje Mathisen wrote:
>
>> Yeah, that one took me a few years to figure out, but now I am firmly in
>> the "Short vectors shall only require item-sized alignment" camp.
>
> But if you allow 8-bit items, then you need a full-blown
> aligner network. This, in turn, allows you to do pretty much anything,
> removing all the alignment constraints. You can also reuse it
> for the GP part. Why not sell it that way?
>

and, this probably isn't really free either...

requiring alignment (with a wide bus) does have the advantage that it
requires fewer transistors, but at the cost that any need to support
non-aligned units has to be handled at a higher level (such as by whatever
codegen is generating code to target the processor).


I was thinking also some about the "true ISA vs virtual ISA" notion,
whereby the ISA which people target for the processor is not necessarily
the same as the one which runs on the processor, but instead there is an
intermediate JIT step.

for example, the real HW is a simplistic RISC-like architecture, but it
is used to run mostly code which uses the x86 ISA or similar (or some
other VM bytecode).

in the case of alignment, the codegen could try to infer and/or trap
value alignment, initially generating, say, "assume aligned but trap"
forms, eventually concluding that the value is always aligned (switching
to a more efficient encoding), or rarely aligned (and generating more
generic non-aligned logic).


>> This is because otherwise it becomes far too painful to write code that
>> will tolerate changes in the width of vector regs, and pretty much
>> impossible to efficiently interface between code written for width A
>> (say 128 bits) and width B (256 bits).
>
> BTDT. :-(
>
> I'm also in the "SIMD-only architecture" camp, i.e. the way
> the scalar SSE FP is done. Just give me 256-bit SIMD registers
> with scalar-mode abilities to mimic the GP instruction set and
> I'll be perfectly happy till the end of my life. :-)
>

I guess some of this may have to deal with how closely the programmer is
operating, say, with the underlying vector mechanics.


what I was imagining previously probably wouldn't be a SIMD-based
architecture.

in my case, I wasn't really assuming humans would be anywhere near
directly targeting the architecture, but probably there would be another
higher-level ISA or IL, which would in-turn be targeted by high-level
compilers.

programmers, then, would probably write code in a generic HLL (such as C
or C++) which would be compiled first to the main ISA or IL (x86 would
be an ISA option, but the other possibility would be an IL such as
MSIL/CIL or similar).

at the second stage, a JIT would sit between this high-level ISA and the
underlying HW, translating to the specific ISA for the specific HW (in
which case, "simple RISC like" vs VLIW/superscalar vs ... would be left
mostly to the specific domain requirements).


a partial example would be running a JIT-based x86 VM on something like
ARM. with a little work, an OS could itself mostly run on ARM, but then
run x86 based apps, with the apps not really noticing too much that they
are running on ARM.

ARM and x86 are "similar enough" that something like this would
potentially have a fairly modest performance degradation (and apart from
fairly processor-hungry apps, it is potentially the case that the
slowdown may not even be particularly noticeable).

granted, yes, part of the process's address space would likely need to
be used up as a translation cache (say, if 64MB or 256MB or so is
reserved essentially to cache translated instruction-traces or similar,
probably flushing older/rarely-used traces whenever more memory is
needed to translate new traces).

note: "traces" would be chains of instructions which are free of jumps
or similar (either free of all jumps, or free only of conditional jumps).
generally, a trace would be built after landing on a given EIP address.


I had used something vaguely similar to this in my x86 interpreter
(albeit it translated x86 fragments into threaded-code, rather than
native code), this was partly because my instruction-decoder was fairly
slow. in this case, EIP was treated as a key into a hash-table (jumps
also internally generated interrupts, ...).
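a rough sketch of that sort of EIP-keyed trace lookup (all the names here are made up for illustration; this is not my actual interpreter's code):

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// One decoded "trace": a run of pre-decoded handlers (threaded code),
// built after landing on a given EIP and ending at a branch.
using OpFn  = std::function<void(uint32_t& eip)>;
using Trace = std::vector<OpFn>;

struct TraceCache {
    std::unordered_map<uint32_t, Trace> traces;  // keyed on EIP

    // Look up the trace starting at 'eip'; decode it only on a miss,
    // so the (slow) decoder runs once per hot address.
    const Trace& lookup(uint32_t eip,
                        const std::function<Trace(uint32_t)>& decode) {
        auto it = traces.find(eip);
        if (it == traces.end())
            it = traces.emplace(eip, decode(eip)).first;  // slow path
        return it->second;
    }
};
```

a real version would also need cache eviction and invalidation on self-modifying code, which is where most of the hair is.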

I had considered using similar for my VM's interpreter (since it could
outperform the traditional use of a "switch()" statement for opcode
dispatch), but I never got such an interpreter fully implemented (it
started getting hairy/complicated, and was not a huge priority, with me
eventually just staying with my pre-existing VM/interpreter logic).

I could eventually either get around to getting a new JIT written, or at
least migrate to threaded code, but it has been consistently a decent
amount of effort and not a huge priority (since the existing interpreter
is still mostly "fast enough" for what all I am using it for).

there is still some amount of native-code generation (mostly small-scale
/ ad-hoc) in the existing interpreter, only that the main
logic/dispatcher remains under the control of a giant switch statement.

I had recently also thought some about "native-code enhanced
threaded-code interpreter" (basically, threaded-code with ad-hoc
native-code generation thrown on), which could be argued to be "a poor
man's JIT" (and some thoughts for how to try to retrofit it onto the
existing interpreter without having to rewrite the whole damn thing in
the process).

it would probably work, despite maybe being "tacky" vs a more proper
codegen, which would translate whole functions/methods or
compilation-units at a time, and use a proper ABI/"calling conventions"
and register allocation and so on (but all this seems to become much
more complex when faced with dynamic languages, rather than say
"something like C or similar...").

so, I am left still faced with things like ad-hoc threaded code and
translating individual "traces" (and having to thunk over calls to/from
native code, ...).


or such...

BGB

Nov 2, 2011, 10:50:14 PM
to
On 11/2/2011 6:17 AM, Pedro Pereira wrote:
> Piotr Wyderski wrote:
>
>> MitchAlsup wrote:
>>
>>> So you have 192 single byte instructions and 64*256 = 16384 two byte
>>> instructions.
>>
>> If one has the luxury to define a completely new ISA, then
>> why not use Huffman-style prefix encoding in order to specify
>> the instruction length? In the first byte:
>>
>> 0b0xxx xxxx = 128 1-byte instructions
>> 0b10xx xxxx = 16384 2-byte instructions
>> 0b1100 xxxx = 1048576 3-byte instructions
>> 0b1101 xxxx = 268435456 4-byte instructions
>> 0b1110 xxxx = 68719476736 5-byte instructions
>> 0b1111 00 xx = etc.
>
> What about a UTF-8 like instruction encoding:
>
> 0xxxxxxx
> 110xxxxx 10xxxxxx
> 1110xxxx 10xxxxxx 10xxxxxx
> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>
> Using this mechanism one can immediately see if a jump
> target starts in the middle of an instruction, for example.
>

apart from the lower density (only 6 workable bits per byte, ...), this
is an interesting idea. I am guessing that this would be the entire
instruction, rather than just the opcode proper.
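as a sketch, the length of such an instruction can be read off the first byte alone (this just mirrors the quoted scheme above, not anything from a real ISA):

```cpp
#include <cstdint>

// Length of an instruction from its first byte, for the UTF-8-like
// scheme quoted above: 0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, ...
// Continuation bytes (10xxxxxx) never start an instruction, so a jump
// into the middle of one is immediately detectable.
inline bool is_continuation(uint8_t b) { return (b & 0xC0) == 0x80; }

inline int insn_length(uint8_t lead) {
    if ((lead & 0x80) == 0) return 1;               // 0xxxxxxx
    int n = 0;
    for (uint8_t m = 0x80; lead & m; m >>= 1) ++n;  // count leading 1 bits
    return (n >= 2) ? n : 0;  // 0 = continuation byte, not a start
}
```

so length decode is a leading-ones count, which is cheap in hardware.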

another variation of this idea could be to extend this to 16-bit words:
0xxxxxxxyyyyyyyy
110xxxxxyyyyyyyy 10xxxxxxyyyyyyyy
1110xxxxyyyyyyyy 10xxxxxxyyyyyyyy 10xxxxxxyyyyyyyy
11110xxxyyyyyyyy 10xxxxxxyyyyyyyy 10xxxxxxyyyyyyyy 10xxxxxxyyyyyyyy
...

in this case, only 1/2 as many bits are used as overhead (albeit this
disallows single-byte instructions).


an old interpreter of mine had indicated the length of multi-word
instructions:
00xxxxxxyyyyyyyy
01xxxxxxyyyyyyyy xxxxxxxxyyyyyyyy
10xxxxxxyyyyyyyy xxxxxxxxyyyyyyyy xxxxxxxxyyyyyyyy
11xxxxxxyyyyyyyy xxxxxxxxyyyyyyyy xxxxxxxxyyyyyyyy xxxxxxxxyyyyyyyy

this allowed a 14-bit opcode (in the first word), with additional words
as arguments.

I later switched to encoding the opcode in 1 or 2 bytes, and just
leaving the arguments on their own (since it was possible to just look
up the opcode immediates and know what to skip over...).


also possible, 3-state bytes:
0xxxyyyy
11xxyyyy 10xxyyyy
11xxyyyy 10xxyyyy 10xxyyyy
...

or, the above with words.

or (using the MSB as a start/middle byte):
0xxxyyyy
0xxxyyyy 1xxxyyyy
0xxxyyyy 1xxxyyyy 1xxxyyyy
...

sadly, there is no good way I can think of to mix bytes and words and
still preserve the ability to know if one has landed in the
middle of an instruction.


granted, for an "opcode as a single number" scheme, one would need
a way of encoding the length of the opcode field.

possibly (x/y=opcode, z=arguments):
00xxyyyy //6-bit opcode, no args
00xxyyyy 1zzzzzzz //6-bit opcode, 7-bit args
01xxxxxx 10yyyyyy //12-bit opcode, no args
00xxyyyy 1zzzzzzz 1zzzzzzz //6-bit opcode, 14-bit args
01xxxxxx 10yyyyyy 1zzzzzzz //12-bit opcode, 7-bit args
01xxxxxx 11yyyyyy 10yyyyyy //18-bit opcode, no args
...


>>
>> Parallel decoding should be fairly easy and opcode space
>> compression could be good.
>
> Maybe with a UTF-8 like instruction encoding the parallel
> decoding could be made easier.
>

could be...

Brett Davis

Nov 3, 2011, 12:59:50 AM
to
In article <j8rgjk$ol9$1...@news.task.gda.pl>,
I have been pointing out to the hardware guys for a while that ALUs
are infinitely fast, while data is sucked through a cocktail straw.
This results in crazy unpack code, followed by some modest math,
followed by crazy repack code.

Vectors need to load/store on odd addresses because the data is compressed.

Restricting alignment just forces us software guys to need 4 times
as many registers (for unpack/repack) and 8 times as much code.
This is far more inefficient than an odd vector load.

I expect to be able to write a 16 byte vector to an odd address,
and advance 9 bytes and write another vector.

In article <csv7o8-...@ntp6.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

> Piotr Wyderski wrote:
> > Terje Mathisen wrote:
> >
> >> Yeah, that one took me a few years to figure out, but now I am firmly in
> >> the "Short vectors shall only require item-sized alignment" camp.
> >
> > But if you allow 8-bit items, then you need a full-blown
> > aligner network. This, in turn, allows you to do pretty much anything,
> > removing all the alignment constraints. You can also reuse it
> > for the GP part. Why not sell it that way?
>
> If you do have it, why not flaunt it?

Yes, absolutely, completely correct!

Pedro Pereira

Nov 3, 2011, 6:00:15 AM
to
It's not too hard to get 7 bits per byte :-)

Unary coding:
01xxxxxx
001xxxxx 1xxxxxxx
0001xxxx 1xxxxxxx 1xxxxxxx
00001xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
000001xx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
0000001x 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx

Elias gamma coding:
010xxxxx
011xxxxx 1xxxxxxx
00100xxx 1xxxxxxx 1xxxxxxx
00101xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
00110xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
00111xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
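As a sketch, the unary-coded length falls straight out of a leading-zero count of the first byte (illustration only):

```cpp
#include <cstdint>

// For the unary coding above, the number of leading 0 bits in the
// first byte is the total instruction length in bytes:
//   01xxxxxx -> 1, 001xxxxx -> 2, ..., 0000001x -> 6.
// Continuation bytes are 1xxxxxxx, so any byte with its top bit
// clear is known to start an instruction.
inline int unary_insn_length(uint8_t lead) {
    if (lead & 0x80) return 0;  // continuation byte, not a start
    int n = 0;
    for (uint8_t m = 0x80; m && !(lead & m); m >>= 1) ++n;
    return n;
}
```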

Pedro Pereira

Paul A. Clayton

Nov 3, 2011, 8:47:31 AM
to
On Nov 3, 5:00 am, Pedro Pereira <ppere...@grupopie.com> wrote:
[snip Huffman and UTF-8 encoding comment]
> It's not too hard to get 7 bits per byte :-)
>
> Unary coding:
> 01xxxxxx
> 001xxxxx 1xxxxxxx
> 0001xxxx 1xxxxxxx 1xxxxxxx
> 00001xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
> 000001xx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
> 0000001x 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
>
> Elias gamma coding:
> 010xxxxx
> 011xxxxx 1xxxxxxx
> 00100xxx 1xxxxxxx 1xxxxxxx
> 00101xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
> 00110xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
> 00111xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx

One might also consider simply using one bit as
an indicator of the start (or end) of an instruction.
Some x86 processors use a predecoded Icache format with
one bit per byte indicating the start of an instruction.

In part because of this thread, I worked out a variable
length instruction encoding that should be somewhat
easily parsed. It uses two bits to indicate the end of
an instruction: 0b11 indicates "this is not the last
parcel" while 0b00, 0b01, and 0b10 indicate the end of
an instruction. This allows 3/4 of the encodings for
shorter forms. Here is a link to the encoding:

https://sites.google.com/site/paulclaytonplace/computer-stuff/instruction-formats
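A sketch of how parsing might look, assuming 16-bit parcels with the marker in the low two bits (both details are my guesses here; the linked page has the actual formats):

```cpp
#include <cstdint>
#include <vector>

// Split a stream of 16-bit parcels into instruction lengths using the
// scheme described above: low two bits 0b11 mean "this is not the last
// parcel"; 0b00, 0b01, and 0b10 end the instruction.
inline std::vector<int> parcel_lengths(const std::vector<uint16_t>& parcels) {
    std::vector<int> lengths;
    int n = 0;
    for (uint16_t p : parcels) {
        ++n;
        if ((p & 3) != 3) {  // end-of-instruction marker
            lengths.push_back(n);
            n = 0;
        }
    }
    return lengths;
}
```

Note that every parcel carries its own "last/not-last" bit pair, so each 16-bit chunk can be classified independently, which is what makes parallel parsing easy.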


EricP

Nov 3, 2011, 11:51:04 AM
to
MitchAlsup wrote:
> On Wednesday, November 2, 2011 3:54:28 AM UTC-5, Terje Mathisen wrote:
>> MitchAlsup wrote:
>> OTOH, most such locking code has probably been written under the
>> assumption that contention will be _very_ rare, and if it does happen,
>> the lock will probably become free in an iteration or two.
>
> Which was not a particularly bad assumption when the sw was written, but is becoming more and more so every day.

The antisocial behavior of TTS spinlocks has been well known for
a long time - they should not be used if it is likely to have more
than 1 or 2 waiters as there are far more efficient mechanisms.
I doubt you would find any SMP kernel using them outside very
low contention situations.

If there is any chance of more than 1 or 2 waiters
I would suggest considering an MLH spin queue
(also called an LH lock) outlined in section 3 here:

Efficient Software Synchronization on
Large Cache Coherent Multiprocessors (1994)
Peter Magnusson , Anders Landin , Erik Hagersten
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.6490

It has O(N) cost, FIFO ordered, bounded delay, no thundering herd,
and for P processors and S spinlocks uses just P+S cache lines.

Basically the lock builds a single linked list of waiters
each spinning on a separate cache line.
As each waiter uses a separate cache line there is no fighting.
Grants are passed from lock owner to a single waiter in order
of lock request so the grant is FIFO and delay bounded.

There are no try-fail-retry loops at the software level, and
the only possible point of contention is a single AtomicExchange
but with no thundering herd there is negligible chance for
collision at the bus protocol level.

If you work out the bus messages for MLH lock, they are minimal.
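As a sketch of the general shape of such a lock (this is the closely related MCS-style queue lock, not the paper's exact MLH algorithm; see the link for that), each waiter enqueues a node of its own and spins only on that node, so waiters never fight over one cache line and grants are strictly FIFO:

```cpp
#include <atomic>

struct QNode {
    std::atomic<QNode*> next{nullptr};
    std::atomic<bool>   locked{false};
};

struct MCSLock {
    std::atomic<QNode*> tail{nullptr};

    void lock(QNode* me) {
        me->next.store(nullptr, std::memory_order_relaxed);
        QNode* prev = tail.exchange(me, std::memory_order_acq_rel);
        if (prev) {  // someone holds the lock; queue behind them
            me->locked.store(true, std::memory_order_relaxed);
            prev->next.store(me, std::memory_order_release);
            while (me->locked.load(std::memory_order_acquire))
                ;    // spin on our own cache line only
        }
    }

    void unlock(QNode* me) {
        QNode* succ = me->next.load(std::memory_order_acquire);
        if (!succ) {  // no visible successor
            QNode* expect = me;
            if (tail.compare_exchange_strong(expect, nullptr,
                                             std::memory_order_acq_rel))
                return;  // queue empty: lock released
            while (!(succ = me->next.load(std::memory_order_acquire)))
                ;        // successor is mid-enqueue; wait for the link
        }
        succ->locked.store(false, std::memory_order_release);  // FIFO hand-off
    }
};
```

The single point of contention is the one exchange on `tail`; the release is a plain store to the successor's private flag, so there is no thundering herd.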

A related issue is the bus messages that could be invoked
by some of the lock-free algorithms. Some of these could appear
to be acceptable at the software level, but possibly might flood
the bus under contention. Because the coherence protocols are
largely opaque (to me at least), it is impossible to evaluate.
For example, _exactly_ what happens if many processor do an
atomic operation on the same address at the same time.

Eric


Tim McCaffrey

Nov 3, 2011, 12:38:13 PM
to
In article
<30504120.1875.1320183626494.JavaMail.geo-discussion-forums@vbza28>,
Mitch...@aol.com says...
>
>
>On Monday, October 31, 2011 7:22:09 PM UTC-5, Andy Krazy Glew wrote:
>> Consider the following for a contended lock:
>>
>OK, let us consider the contended spin lock:
>> ...
>> test_and_test_and_set_spinloop:
>> read lock into reg
>> if reg == 1 goto test_and_test_and_set_spinloop
>> atomic rmw to set lock, old value into reg
>> if reg == 1 goto test_and_test_and_set_spinloop
>> // got the lock
>
>// Scenario:
>// the lock is LOCKed
>// there are n processors waiting in the spinloop
>// the system has reached a point where every waiting thread is
>// spinning in their own cache.
>
>test_and_test_and_set_spinloop:
> read lock into reg
> if reg == 1 goto test_and_test_and_set_spinloop
>
>// In the case where the reg is LOCKed; this processor is, in essence,
>// waiting for a coherent write to invalidate the '1' in the lock memory
>// location. Allowing the processor to access the <essentially> frozen
>// value in the cache faster (or more often) does not usefully improve
>// the wait time.
>//
>// In the case where the reg is unLOCKed; falling through is the best choice
>// because taking the branch represents no forward progress.
>//
>// Thus predicting this branch always not_taken is near optimal. When this
>// branch IS predicted not_taken, the pipeline will see the ATOMIC instruction
>// and can then adjust branch prediction at a local scale.
>
> atomic rmw to set lock, old value into reg
> if reg == 1 goto test_and_test_and_set_spinloop
>// got the lock
> <do the critical section work>
> write 0 into lock
>
>// The owner of the lock writes a 0 into the lock. The write back cache
>// tracks the shared nature of the lock holding cache line. The cache issues
>// a coherent invalidate to the cache line holding the lock.
>
>// In a fabric based system different nodes see the coherent invalidate at
>// different times. The spinning thread stops spinning in the cache (line
>// was invalidated) and issues a read memory request to the cache line holding
>// the lock. The memory controller sees N read requests to the same cache line.
>// In most fabrics these reads are serialized for cache coherence reasons.
>// Some memory controllers can use the read buffer to avoid DRAM access latency
>// but must still serialize the read requests. In a fabric based system the
>// request that gets to the memory controller first should win the lock
>// (and does most of the time).
>
>// The process that gets its read request serviced first sees the memory
>// location unLOCKed.
>
>// As this processor sees the locked location, the end of the coherent
>// memory transaction arrives back at the memory controller and the second read
>// to the memory location begins. By this point one can assume that all other
>// waiting processors will have sent their read requests to the memory controller.
>
>// The processor that saw the location unLOCKed then performs an ATOMIC RMW
>// to the locked location. This locked request arrives BEHIND the reads already
>// pending on the memory controller to the same line.
>
>// Under contention, there are n processors contending on the lock, 1 will
>// get the lock when it is unLOCKed and n-1 will go back into the spinning
>// state. In order to put each one back into the spinning state each thread
>// must observe the lock is in the LOCKed state. Thus the processor that
>// "gave the lock up" will cause a cascade of memory activity on the
>// interconnect all targeting that one memory location. It will take a
>// quadratic ((n-1)*(n-2)) number of memory accesses for the waiting
>// processors to achieve the spinloop contained completely in the cache.
>// Many times achieving quiescence takes longer than the work inside
>// the critical section and all these memory refs slow down the unLOCKing
>// of the lock. All of the contention works its way out at memory speed,
>// not processor speed.
>
>// If the memory controller were able to understand the situation, it would
>// propagate the write over the reads so that the waiting processor threads
>// could go back to sleep without seeing the lock in an unlocked state.
>// Were the memory controller able to make use of this, only 2 processors
>// would see the lock in an unlocked state and fabric traffic would go way down.
>
>// Once again; there is little utility in predicting this branch taken
>// and considerable advantage to the lock holding thread to predict it
>// not_taken.
>//
>// If the critical section does only a few memory accesses to complete
>// its duties before unlocking the lock, the unlocking of the lock can
>// end up stalled behind the cascade of read_with_intent_to_modify
>// requests from the atomic instruction from other threads.
>
>Back to prefixing conditional branches:
>
>So, for this particular case, it would be just as easy for the processor to recognize the T&T&S paradigm and alter branch prediction at a local scale than to end up having to prefix branch instructions. That is, the pipeline can deduce the predict_not_taken-ness by the presence of the <already existing> LOCK prefix on the RMW instruction rather than a prefix in front of the conditional branch instruction(s).
>
>Does any processor actually do this? I have no idea.
>
>Could the processor recognize this paradigm and then power down until it receives a coherent invalidate to the locked cache line? (Rather than spinning in the spin loop, go to sleep until the necessary event at the other side of the cache takes place.) This would save lots of power without having to 'decorate' branch instructions and be essentially "just as fast".
>
>Back to considering spinlocks under contention:
>
>However, it is my contention that allowing more than 2 (or 3) waiting threads to see the lock in the unlocked state is the root of the problem. Prefixing the branch instructions nearby the ATOMIC events is not getting at the root of the ATOMIC event latency or memory overheads.
>


The above thundering herd problem is why Microsoft introduced in-stack Queued
spinlocks. Each thread spins on its own flag word, when the owner gives up
the lock it locates who is next in the queue and updates the flag word.


- Tim

EricP

Nov 3, 2011, 2:08:37 PM
to
Tim McCaffrey wrote:
>
> The above thundering herd problem is why Microsoft introduced in-stack Queued
> spinlocks. Each thread spins on its own flag word, when the owner gives up
> the lock it locates who is next in the queue and updates the flag word.

Here is a description of the MS in-stack queued spinlock

http://www.dcl.hpi.uni-potsdam.de/research/WRK/2009/12/queued-spinlocks-in-the-wrk/

It creates its single linked list with temporary
data structures on the local stack,
but the ReleaseInStackQueuedSpinlock must synchronize
with linking the list items and may have to
spin-wait for the list to stabilize.

It uses a total of 2 or 3 atomic ops for lock and release,
and seems to touch more variables and moves cache lines around more.

Eric

Chris M. Thomasson

Nov 3, 2011, 10:44:48 PM
to
"EricP" <ThatWould...@thevillage.com> wrote in message
news:_MAsq.18085$t37....@newsfe14.iad...
FWIW, check out this bakery algorithm used to implement a
multi-producer/multi-consumer queue:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/734f7c367f720005

How would this thrash the hardware!

:^o



pseudo-code for mpmc-bounded-queue:
__________________________________________________
template<typename T,
         unsigned long T_depth /* MUST BE POW OF 2 ! */>
struct mpmc_bounded_queue
{
    struct cell_type
    {
        std::atomic<unsigned long> m_state;
        T m_object;
    };


    std::atomic<unsigned long> m_head;
    std::atomic<unsigned long> m_tail;
    cell_type m_buffer[T_depth];


    mpmc_bounded_queue() : m_head(0), m_tail(0)
    {
        // initialize version numbers.
        for (unsigned long i = 0; i < T_depth; ++i)
        {
            m_buffer[i].m_state.store(i, mb_relaxed);
        }
    }


    void push(T const& obj)
    {
        // obtain our head version and cell.
        unsigned long idx = m_head.fetch_add(1, mb_relaxed);
        cell_type& cell = m_buffer[idx & (T_depth - 1U)];

        // wait for it...
        smart_backoff_? backoff;
        while (cell.m_state.load(mb_relaxed) != idx)
            backoff.yield($);
        std::atomic_thread_fence(mb_acquire);

        // GOT IT! Okay, write to the object.
        cell.m_object = obj;

        // We are done; allow a consumer to consume.
        std::atomic_thread_fence(mb_release);
        cell.m_state.store(idx + 1, mb_relaxed);
    }


    void pop(T& obj)
    {
        // obtain our tail version and cell.
        unsigned long idx = m_tail.fetch_add(1, mb_relaxed);
        cell_type& cell = m_buffer[idx & (T_depth - 1U)];

        // wait for it...
        smart_backoff_? backoff;
        while (cell.m_state.load(mb_relaxed) != idx + 1)
            backoff.yield($);
        std::atomic_thread_fence(mb_acquire);

        // GOT IT! Okay, read from the object.
        obj = cell.m_object;

        // We are done; allow a producer to produce.
        std::atomic_thread_fence(mb_release);
        cell.m_state.store(idx + T_depth, mb_relaxed);
    }
};
__________________________________________________


Terje Mathisen

Nov 4, 2011, 6:19:30 AM
to
EricP wrote:
> Here is a description of the MS in-stack queued spinlock
>
> http://www.dcl.hpi.uni-potsdam.de/research/WRK/2009/12/queued-spinlocks-in-the-wrk/
>
>
> It creates its single linked list with temporary
> data structures on the local stack,
> but the ReleaseInStackQueuedSpinlock must synchronize
> with linking the list items and may have to
> spin-wait for the list to stabilize.

That does indeed look like a problem area, the release code looks a lot
like a generic TTS.
>
> It uses a total of 2 or 3 atomic ops for lock and release,
> and seems to touch more variables and moves cache lines around more.

Yeah. :-(

Terje Mathisen

Nov 4, 2011, 6:26:35 AM
to
EricP wrote:
> Efficient Software Synchronization on
> Large Cache Coherent Multiprocessors (1994)
> Peter Magnusson , Anders Landin , Erik Hagersten
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.6490
>
> It has O(N) cost, FIFO ordered, bounded delay, no thundering herd,
> and for P processors and S spinlocks uses just P+S cache lines.

Nice work, and from 1994!

So why are we still seeing lots of high-overhead locking in current code?
:-(

Chris M. Thomasson

Nov 4, 2011, 4:35:46 PM
to
"EricP" <ThatWould...@thevillage.com> wrote in message
news:_MAsq.18085$t37....@newsfe14.iad...
Ouch...


I wonder what you think about my experimental spinlock based on a
distributed bakery algorithm:
__________________________________________________
struct spinlock
{
#define LSDEPTH 32U
#define LSMASK (LSDEPTH - 1U)


    struct lock_state
    {
        std::atomic<unsigned> m_state;
        lock_state() : m_state(0) {}
    };


    std::atomic<unsigned> m_idxseq;
    lock_state m_lstate[LSDEPTH];


    spinlock() : m_idxseq(0) {}


    unsigned lock()
    {
        unsigned idxseq = m_idxseq.fetch_add(1, mb_relaxed);
        lock_state& ls = m_lstate[idxseq & LSMASK];

        linear_backoff backoff;
        while (ls.m_state.load(mb_relaxed) != idxseq)
            backoff.yield();

        std::atomic_thread_fence(mb_acquire);

        return idxseq + 1;
    }


    void unlock(unsigned idxseq)
    {
        std::atomic_thread_fence(mb_release);

        lock_state& ls = m_lstate[idxseq & LSMASK];
        ls.m_state.store(idxseq, mb_relaxed);
    }
};
__________________________________________________


I think it should have a little bit better scalability than a standard,
non-distributed bakery algorithm...

Humm...


EricP

Nov 4, 2011, 4:52:01 PM
to
Chris M. Thomasson wrote:
>
> FWIW, check out this bakery algorithm used to implement a
> multi-producer/multi-consumer queue:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/734f7c367f720005
>
> How would this thrash the hardware!
>
> :^o
>
>
>
> pseudo-code for mpmc-bounded-queue:
> <snip>

The cell_type m_buffer is not cache line aware.
Depending on the size of m_object there can be
multiple cells per line, and cells that straddle lines.

In Pop it assigns a slot for a cpu to wait for a value
where it spin-waits for the object to arrive.
That causes a read_share on the line containing the cell.

<---ReadShare
Reply--->

In Push, if there are multiple cells per line, then
a write will cause invalidates to be broadcast to
each sharer; each sharer acks the invalidate and
re-read_shares the line, and the owner replies to
each processor sharing that line.

E.g.: for 4 line sharers

Invalidate--->
<---Ack
<---Ack
<---Ack
<---Ack
<---ReadShare
Reply--->
<---ReadShare
Reply--->
<---ReadShare
Reply--->
<---ReadShare
Reply--->

Also, in Push, if m_object were some struct
such that the copy in "cell.m_object = obj"
required multiple write operations, then each
write might cause those invalidate/re-read_share rounds.

That might cause a flurry of messages.

Invalidate--->
<---Flurry
Invalidate--->
<---Flurry
Invalidate--->
<---Flurry

It would be best if the array elements were
multiples of the cpu cache line size.

Perhaps you could arrange that cell and the m_state is copied
at the same time with as few write operations as possible.
If memory footprint is no concern then even put the
m_state and m_object in separate cache lines.
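For example, the cells could be padded out to the line size (assuming a 64-byte line; the exact size is machine-dependent, and C++17's std::hardware_destructive_interference_size could be used where available):

```cpp
#include <atomic>

// One queue cell per cache line, per the suggestion above:
// alignas(64) both pads sizeof(padded_cell) up to 64 bytes and
// starts each array element on a cache-line boundary, so no two
// cells share a line and no cell straddles one.
struct alignas(64) padded_cell {
    std::atomic<unsigned long> m_state;
    long m_object;
};

static_assert(sizeof(padded_cell) % 64 == 0,
              "each cell occupies whole cache lines");
static_assert(alignof(padded_cell) == 64,
              "cells start on a cache-line boundary");
```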

Also I'm not sure what happens if there are more producers
or consumers than T_depth but I think it misbehaves
(if 2 producers get assigned the same slot).

Eric

Chris M. Thomasson

Nov 4, 2011, 5:15:51 PM
to
"EricP" <ThatWould...@thevillage.com> wrote in message
news:XhYsq.15944$rF5....@newsfe19.iad...
> Chris M. Thomasson wrote:
>>
>> FWIW, check out this bakery algorithm used to implement a
>> multi-producer/multi-consumer queue:
>>
>> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/734f7c367f720005
>>
>> How would this thrash the hardware!
>>
>> :^o
>>
>>
>> pseudo-code for mpmc-bounded-queue:
>> <snip>
>
> The cell_type m_buffer is not cache line aware.
> Depending on the size of m_objects there can be
> multiple cells per line and line straddles.
>
[...]

Thank you for that detailed information Eric!


> It would be best if the array elements were
> multiples of the cpu cache line size.

Yes, of course you are correct. The code was basically a quick and dirty
illustration of the underlying synchronization algorithm.


> Perhaps you could arrange that cell and the m_state is copied
> at the same time with as few write operations as possible.
> If memory footprint is no concern then even put the
> m_state and m_object in separate cache lines.

Humm... Good idea.


> Also I'm not sure what happens if there are more producers
> or consumers than T_depth but I think it misbehaves
> (if 2 producers get assigned the same slot).

The producers wait for the counter in the slot (cell_type::m_state) to
become equal to the result they got from the fetch-and-add on the global
head counter (mpmc_bounded_queue::m_head). So if there are 4 slots with 5
producers, and the 5th producer gets into slot zero at the same time as
producer 1, it will have to wait for the slot counter to become equal to 4.
The only way that can happen is when a consumer sets the slot counter to 4.
It does this by setting the slot counter to the result it got from
fetch-and-add _plus_ the size of the buffer. So, everything is orderly and
strictly FIFO.


EricP

Nov 4, 2011, 5:40:23 PM
to
Chris M. Thomasson wrote:
> "EricP" <ThatWould...@thevillage.com> wrote in message
> news:_MAsq.18085$t37....@newsfe14.iad...
>> Tim McCaffrey wrote:
>>> The above thundering herd problem is why Microsoft introduced in-stack
>>> Queued spinlocks. Each thread spins on its own flag word, when the owner
>>> gives up the lock it locates who is next in the queue and updates the
>>> flag word.
>> Here is a description of the MS in-stack queued spinlock
>>
>> http://www.dcl.hpi.uni-potsdam.de/research/WRK/2009/12/queued-spinlocks-in-the-wrk/
>>
>> It creates its single linked list with temporary
>> data structures on the local stack,
>> but the ReleaseInStackQueuedSpinlock must synchronize
>> with linking the list items and may have to
>> spin-wait for the list to stabilize.
>>
>> It uses a total of 2 or 3 atomic ops for lock and release,
>> and seems to touch more variables and moves cache lines around more.
>
> Ouch...
>
>
> I wonder what you think about my experimental spinlock based on a
> distributed bakery algorithm:

Assuming atomic<unsigned> is 4 bytes and a cache line is 64
then you will have up to 16 spin-waiters on each cache line.
LSDEPTH = 32 so you would use 2 cache lines then wrap
back to the first one.

As each waiter joins the list, the cost sums up.
While they may be spinning on separate addresses,
as each new waiter joins the list it read_shares the same line.
Then the lock holder releases it and all the line sharers get
invalidated, all must ack the invalidate, all re-read_share.
The next one gets the lock and does the same thing.
So for N waiters it costs N+(N-1)+(N-2)...2+1 = (N^2+N)/2

For 16 waiters
16+15+14...+2+1 = (16^2+16)/2 = 136

or 136 times the cost of a single waiter.
But once you go beyond 32 waiters you wrap back
to the first cache line and start adding more to that.

Which is essentially a TTS spinlock.

Eric

Chris M. Thomasson

unread,
Nov 4, 2011, 6:15:49 PM11/4/11
to
"EricP" <ThatWould...@thevillage.com> wrote in message
news:KZYsq.23170$0s1....@newsfe11.iad...
> Chris M. Thomasson wrote:
[...]
>> I wonder what you think about my experimental spinlock based on a
>> distributed bakery algorithm:
>
> Assuming atomic<unsigned> is 4 bytes and a cache line is 64
> then you will have up to 16 spin-waiters on each cache line.
> LSDEPTH = 32 so you would use 2 cache lines then wrap
> back to the first one.
>
> As each waiter joins the list, the cost sums up.
> While they may be spinning on separate addresses,
> as each new waiter joins the list it read_shares the same line.

[...]

Okay. Would this tweak help out at all Eric?
_________________________________________________________
#define L2_CACHE 64


// assume that spinlock is aligned on a L2_CACHE boundary...
struct spinlock
{
#define LSDEPTH 32U
#define LSMASK (LSDEPTH - 1U)


struct lock_state
{
std::atomic<unsigned> m_state;
char l2_cache_pad[L2_CACHE - sizeof(std::atomic<unsigned>)];
lock_state() : m_state(0) {}
};


std::atomic<unsigned> m_idxseq;
char l2_cache_pad[L2_CACHE - sizeof(std::atomic<unsigned>)];
lock_state m_lstate[LSDEPTH];


spinlock() : m_idxseq(0) {}


unsigned lock()
{
unsigned idxseq = m_idxseq.fetch_add(1, mb_relaxed);
lock_state& ls = m_lstate[idxseq & LSMASK];

linear_backoff backoff;
while (ls.m_state.load(mb_relaxed) != idxseq)
backoff.yield();

std::atomic_thread_fence(mb_acquire);

return idxseq + 1;
}


void unlock(unsigned idxseq)
{
std::atomic_thread_fence(mb_release);

lock_state& ls = m_lstate[idxseq & LSMASK];
ls.m_state.store(idxseq, mb_relaxed);
}
};
_________________________________________________________


When one wants to use the spinlock, she or he simply makes sure that it is
properly aligned on an L2_CACHE boundary. Perhaps something like this:
_________________________________________________________
static unsigned char g_raw_buffer[sizeof(spinlock) + L2_CACHE] = { 0 };
static spinlock* g_spinlock = NULL;


static void spinlock_init()
{
g_spinlock = (spinlock*)ALIGN_PTR(g_raw_buffer, L2_CACHE);
}


// [...]


int main()
{
spinlock_init();

// [...];

return 0;
}
_________________________________________________________


Now, it seems that up to LSDEPTH waiters will each be spinning on a
separate L2 cache line... Right?


Chris M. Thomasson

unread,
Nov 4, 2011, 6:22:35 PM11/4/11
to
"Chris M. Thomasson" <cri...@charter.net> wrote in message
news:MtZsq.18833$Cr1....@newsfe03.iad...
[...]

> When one wants to use the spinlock, she or he simply makes sure that it is
> properly aligned on a L2_CACHE boundary. Perhaps something like this:
> _________________________________________________________
> static unsigned char g_raw_buffer[sizeof(spinlock) + L2_CACHE] = { 0 };
> static spinlock* g_spinlock = NULL;
>
>
> static void spinlock_init()
> {
> g_spinlock = (spinlock*)ALIGN_PTR(g_raw_buffer, L2_CACHE);
> }

Yikes!!! Ummmm.... I should probably use placement new here...
_________________________________________________________
static void spinlock_init()
{
void* aligned_buffer = (void*)ALIGN_PTR(g_raw_buffer, L2_CACHE);
g_spinlock = new (aligned_buffer) spinlock();
}
_________________________________________________________


lol! ;^)


EricP

unread,
Nov 5, 2011, 1:05:12 PM11/5/11
to
Chris M. Thomasson wrote:
> "EricP" <ThatWould...@thevillage.com> wrote in message
> news:KZYsq.23170$0s1....@newsfe11.iad...
>> Chris M. Thomasson wrote:
> [...]
>>> I wonder what you think about my experimental spinlock based on a
>>> distributed bakery algorithm:
>> Assuming atomic<unsigned> is 4 bytes and a cache line is 64
>> then you will have up to 16 spin-waiters on each cache line.
>> LSDEPTH = 32 so you would use 2 cache lines then wrap
>> back to the first one.
>>
>> As each waiter joins the list, the cost sums up.
>> While they may be spinning on separate addresses,
>> as each new waiter joins the list it read_shares the same line.
>
> [...]
>
> Okay. Would this tweak help out at all Eric?

Yeah, that looks better - up to 32 processors.
Above that and yours will share the cache line.

>
> When one wants to use the spinlock, she or he simply makes sure that it is
> properly aligned on a L2_CACHE boundary. Perhaps something like this:

If we are talking about spinlocks then we are talking
about OS kernel code, and spinlocks are either declared
global or located inside an object.

For P processors and S spinlocks,
LH uses P+S cache lines while arrays use P*S.

In the kernel S can be _very_ large.

For global spinlocks the memory footprint might be less of a concern,
but for per-object spinlocks an array approach is just unacceptable.

For example, each kernel thread header has spinlocks guarding
updates to particular data areas. Allocating LSDEPTH cache lines
in each thread header, and many other dynamically allocated
kernel objects, is not going to happen.

The LH spinlock uses just a pointer for the spinlock (queue head).
It uses just P+S cache lines for the whole system,
and it has the same (or similar) bus handshake traffic as
yours except it doesn't wrap around and share at LSDEPTH.

Eric



Chris M. Thomasson

unread,
Nov 7, 2011, 5:20:12 PM11/7/11
to
"EricP" <ThatWould...@thevillage.com> wrote in message
news:q2etq.5519$Uw7....@newsfe02.iad...
> Chris M. Thomasson wrote:
>> "EricP" <ThatWould...@thevillage.com> wrote in message
>> news:KZYsq.23170$0s1....@newsfe11.iad...
>>> Chris M. Thomasson wrote:
>> [...]
>>>> I wonder what you think about my experimental spinlock based on a
>>>> distributed bakery algorithm:
>>> Assuming atomic<unsigned> is 4 bytes and a cache line is 64
>>> then you will have up to 16 spin-waiters on each cache line.
>>> LSDEPTH = 32 so you would use 2 cache lines then wrap
>>> back to the first one.
>>>
>>> As each waiter joins the list, the cost sums up.
>>> While they may be spinning on separate addresses,
>>> as each new waiter joins the list it read_shares the same line.
>>
>> [...]
>>
>> Okay. Would this tweak help out at all Eric?
>
> Yeah, that looks better - up to 32 processors.
> Above that and yours will share the cache line.

Exactly.


>> When one wants to use the spinlock, she or he simply makes sure that it
>> is
>> properly aligned on a L2_CACHE boundary. Perhaps something like this:
>
> If we are talking about spinlocks then we are talking
> about OS kernel code,

Well, not necessarily... IMHO that is! ;^)


> and spinlocks are either declared global or located inside an object.

Yes.


> For P processors and S spinlocks,
> LH uses P+S cache lines while arrays use P*S.
>
> In the kernel S can be _very_ large.
>
> For global spinlocks the memory footprint might be less of a concern,
> but per-object spinlocks an array approach is just unacceptable.

Yes.


> For example, each kernel thread header has spinlocks guarding
> updates to particular data areas. Allocating LSDEPTH cache lines
> in each thread header, and many other dynamically allocated
> kernel objects, is not going to happen.

100% Agreed. My spinlock is a huge fat bastar% compared to the LH spinlock!
Ouch.


> The LH spinlock uses just a pointer for the spinlock (queue head).

The paper "suggests" that LH spinlock will use a "dummy" lock state along
with a pointer that initially points to said dummy object. So that's a
pointer plus the dummy lock state.


> It uses just P+S cache lines lines for the whole system,
> and it has the same (or similar) bus handshake traffic as
> yours except it doesn't wrap around and share at LSDEPTH.

Yup. :^)



There is a way to get LH spinlock "global" state down to a single pointer,
here is some code:
__________________________________________________
struct lh_spinlock_2
{
struct flag
{
std::atomic<unsigned> m_state;
flag() : m_state(0) {}
};


struct thread
{
flag m_flag_1;
flag m_flag_2;
flag* m_cur;
flag* m_prev;
thread() : m_cur(&m_flag_1), m_prev(NULL) {}
};


std::atomic<flag*> m_flag;


lh_spinlock_2() : m_flag(NULL) {}


void lock(thread& t)
{
t.m_cur->m_state.store(1, mb_relaxed);
std::atomic_thread_fence(mb_release);

t.m_prev = m_flag.exchange(t.m_cur, mb_consume);

if (t.m_prev)
{
linear_backoff backoff;
while (t.m_prev->m_state.load(mb_relaxed))
backoff.yield();
}

std::atomic_thread_fence(mb_acquire);
}


void unlock(thread& t)
{
std::atomic_thread_fence(mb_release);
t.m_cur->m_state.store(0, mb_relaxed);

if (! (t.m_cur = t.m_prev))
{
t.m_cur = &t.m_flag_2;
}

t.m_prev = NULL;
}
};
__________________________________________________


BTW Here is M lock:
__________________________________________________
struct m_spinlock
{
struct flag
{
flag* m_next;
std::atomic<unsigned> m_state;
flag() : m_next(NULL), m_state(0) {}
};


struct thread
{
flag* m_flags;
flag* m_cur;
flag* m_prev;

thread()
: m_flags(new flag()), m_cur(new flag()), m_prev(NULL) {}

~thread()
{
while (m_flags)
{
flag* next = m_flags->m_next;
delete m_flags;
m_flags = next;
}

delete m_cur;

RL_ASSERT(! m_prev);
}
};


std::atomic<flag*> m_flag;


m_spinlock() : m_flag(NULL) {}


void lock(thread& t)
{
t.m_cur->m_state.store(1, mb_relaxed);
std::atomic_thread_fence(mb_release);

t.m_prev = m_flag.exchange(t.m_cur, mb_consume);

if (t.m_prev)
{
linear_backoff backoff;
while (t.m_prev->m_state.load(mb_relaxed))
backoff.yield();
}

std::atomic_thread_fence(mb_acquire);
}


void unlock(thread& t)
{
std::atomic_thread_fence(mb_release);

flag* cur = t.m_cur;
if (! m_flag.compare_exchange_strong(
cur,
NULL,
mb_relaxed))
{
t.m_cur->m_state.store(0, mb_relaxed);

if (! (t.m_cur = t.m_prev))
{
if (! (t.m_cur = t.m_flags))
{
t.m_cur = t.m_flags = new flag();
}

t.m_flags = t.m_cur->m_next;
}

else
{
t.m_prev = NULL;
}
}

else if (t.m_prev)
{
t.m_prev->m_next = t.m_flags;
t.m_flags = t.m_prev;
t.m_prev = NULL;
}
}
};
__________________________________________________


Tim McCaffrey

unread,
Nov 8, 2011, 1:46:42 PM11/8/11
to
In article <v7rco8-...@ntp6.tmsw.no>, "terje.mathisenattmsw.no" says...
>
>EricP wrote:
>> Efficient Software Synchronization on
>> Large Cache Coherent Multiprocessors (1994)
>> Peter Magnusson , Anders Landin , Erik Hagersten
>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.6490
>>
>> It has O(N) cost, FIFO ordered, bounded delay, no thundering herd,
>> and for P processors and S spinlocks uses just P+S cache lines.
>
>Nice work, and from 1994!
>
>So why are we still seeing lots of high-overhead locking in current code?
>:-(
>
>Terje

Try implementing it on an x86. That double word exchange is a pain. Also, I
don't get where "I" is supposed to be located. It has to be static, and you
need at least one per processor per lock, so I guess they assume getting
"I" is "free"?

- Tim

EricP

unread,
Nov 8, 2011, 1:52:16 PM11/8/11
to
Chris M. Thomasson wrote:
> "EricP" <ThatWould...@thevillage.com> wrote in message
>> The LH spinlock uses just a pointer for the spinlock (queue head).
>
> The paper "suggests" that LH spinlock will use a "dummy" lock state along
> with a pointer that initially points to said dummy object. So that's a
> pointer plus the dummy lock state.

Yes, total storage is a cache line plus a pointer to it.
I was referring to the size of the field in the object
containing the spinlock, e.g. the thread header.

> There is a way to get LH spinlock "global" state down to a single pointer,
> here is some code:

I'm not sure what you are saying.

I think LH were talking about "global" state just to
illustrate their point. A real implementation would
need to be dynamic. The way I implemented it,
each cpu starts out with a cache line,
and additionally the spinlock points to its own cache line.
That cache line is dynamically allocated from a heap in the
SpinQueueInitially and freed in the SpinQueueFinally routines.

Other than that, our implementations are essentially the same,
except there are some other considerations beyond LH:

1) spinlocks may be nested
2) spinlocks are not necessarily released in the order acquired
3) I wanted both a SpinQueue (mutex) and a
read-share/write-exclusive SpinQueueRW.

1 & 2 imply the currently locked items form a singly linked list
per cpu. Also, Release(spinqueue) needs to scan
the list and release the item for the correct queue.

3 is just a variation where there is also a sharecount.

Eric


EricP

unread,
Nov 8, 2011, 2:16:59 PM11/8/11
to
EricP wrote:
>
> except there are some other considerations beyond LH:

4) Oh and cache line size may be model dependent
which a cache line allocator can take into account.

Eric

Chris M. Thomasson

unread,
Nov 8, 2011, 5:21:14 PM11/8/11
to
"Tim McCaffrey" <timca...@aol.com> wrote in message
news:j9bteh$dhg$1...@USTR-NEWS.TR.UNISYS.COM...
> In article <v7rco8-...@ntp6.tmsw.no>, "terje.mathisenattmsw.no"
> says...
>>
>>EricP wrote:
>>> Efficient Software Synchronization on
>>> Large Cache Coherent Multiprocessors (1994)
>>> Peter Magnusson , Anders Landin , Erik Hagersten
>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.6490
>>>
>>> It has O(N) cost, FIFO ordered, bounded delay, no thundering herd,
>>> and for P processors and S spinlocks uses just P+S cache lines.
>>
>>Nice work, and from 1994!
>>
>>So why are we still seeing lots of high-overhead locking in current code?
>>:-(
[...]

> Try implementing it on a x86. That double word exchange is a pain.

AFAICT, you don't actually need DWCAS to implement the M lock. The
pseudo-code in the paper does indeed seem to suggest that DWCAS is needed
because they are atomically manipulating a processor ID along with a pointer
to a lock flag. However, I noticed that a processor ID is not really
required due to the strict ownership properties of the lock flags. Once a
processor gains ownership of a lock flag, it should be able to use the
pointer to said flag as a unique ID. You can take a look at my
implementation of the M lock in this post:

http://groups.google.com/group/comp.arch/msg/573d91b75f1d5a54


Actually, implementing on an x86 is pretty nice because of the implicit
acquire/release memory barriers. Unfortunately, the LH and M locks have to
set the state of the lock flag to "locked" _before_ they try to get the
lock. This means that you need extra memory barriers in the lock
function; not too cool if you are on something like a PPC...


> Also, I
> don't get where "I" is supposed to be located. It has to be static, and
> you
> need at least one per processor per lock, so I guess they assume getting
> "I" is "free"?

`I' is a pointer to the current lock flag that is exclusively owned by a
processor and stored in a per-processor data structure. A processor
only needs more than one of these if it plans on holding more than
one lock at once...


Terje Mathisen

unread,
Nov 9, 2011, 7:44:55 AM11/9/11
to
Tim McCaffrey wrote:
> In article<v7rco8-...@ntp6.tmsw.no>, "terje.mathisenattmsw.no" says...
>> Nice work, and from 1994!
>>
>> So why are we still seeing lots of high-overhead locking in current code?
>> :-(
>>
>> Terje
>
> Try implementing it on a x86. That double word exchange is a pain. Also, I
> don't get where "I" is supposed to be located. It has to be static, and you
> need at least one per processor per lock, so I guess they assume getting
> "I" is "free"?

If you need to atomically exchange two pairs of values, in separate
cache lines, then you would indeed have a _big_ (bug?) problem on x86.

Assuming this isn't so, i.e. that you can either use CMPXCHG16B to swap a
16-byte block, or pack the two values together into a 64 or 32-bit
element, then it should be OK.

EricP

unread,
Nov 9, 2011, 1:36:20 PM11/9/11
to
Chris M. Thomasson wrote:
>> In article <v7rco8-...@ntp6.tmsw.no>, "terje.mathisenattmsw.no"
>> says...
>>> EricP wrote:
>>>> Efficient Software Synchronization on
>>>> Large Cache Coherent Multiprocessors (1994)
>>>> Peter Magnusson , Anders Landin , Erik Hagersten
>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.6490
>>>>
>>>> It has O(N) cost, FIFO ordered, bounded delay, no thundering herd,
>>>> and for P processors and S spinlocks uses just P+S cache lines.
>>> Nice work, and from 1994!
>>>
>>> So why are we still seeing lots of high-overhead locking in current code?
>>> :-(
>
> Actually, implementing on an x86 is pretty nice because of the implicit
> acquire/release memory barriers. Unfortunately, the LH and M locks have to
> set the state of the lock flag to "locked" _before_ they try to get the
> lock. This means that you need extra memory barriers in the lock
> function; not too cool if you are on something like a PPC...

Why, what's special about PPC?

Note that while the analysis in that paper does
account for local vs global reads and writes,
and the number and type of atomic operations,
it does not take into account the exclusive vs shared
ownership state of a cache line, the number of sharers,
or thundering herd effects in its costs.

The reason TTS is so bad is because all the waiters
share the same line, so each write has many invalidates,
and all try to AtomicXchg at once.
That does not come across in the paper.

In LH the spinlock is touched just once by an AtomicXchg
so it could be transferred directly from the old owner to new one.
The flag line has zero or one sharers.

The M lock requires an atomic op in the release which requires
transfering ownership of the lock line back to the releaser.
Also it had that alloc_flag and free_flag.

I'm not sure what the impact of a memory barrier
is relative to the cost of cache line moves,
but speculating here on the approximate bus traffic for LH,
it looked best overall:

SpinQueueAcquire( )
{
// Inital set of the flag to Busy.
// This could do a single transfer of the modified cache line
// from the single old owner to the new owner
// (and that delay could be minimized if the flag
// line was prefetched-for-write).
//
// That transfer of a modified line might require a main
// memory write back, depending on the bus protocol,
// but one would hope that the line could move from
// P1's L1 cache to P2's without waiting.
//
// Bus:
// ReadForWrite-->
// <--Reply
//
cpu.next->Busy = 1;

// MemBar (if required) flushes the store queue (flag) to L1.
// This stalls until _all_ prior stores complete.
// Other than 'next flag', most should already be in L1.
//
MembarWrite()

// AtomicXchg transfers the spin queue lock cache line
// from a single old owner to the requester.
//
// Bus:
// ReadForWrite-->
// <--Reply
//
prior = AtomicXchg (&spinqueue, cpu.next);

// Wait for ownership.
// Invalidate&Ack only required if lock is currently owned.
//
// Bus:
// ReadShare-->
// <--Reply
// ....
// <--Invalidate
// Ack-->
//
while (prior->Busy)
{ CpuYield(); }

MembarRead ();

// We now own the flag line. Prefetch it.
// This doesn't change the bus traffic, just eliminates the delay.
PrefetchForWrite (prior);
cpu.cur = cpu.next;
cpu.next = prior;
}

SpinQueueRelease ( )
{
// Flush changes to L1
MembarWrite ();

// Release the lock.
// Bus traffic required only if there is a waiter.
//
// Bus:
// Invalidate-->
// <--Ack
// <--ReadShare
// Reply-->
//
cpu.cur->Busy = 0;

}

Eric



Tim McCaffrey

unread,
Nov 9, 2011, 2:43:21 PM11/9/11
to
In article <hMzuq.29868$vg7....@newsfe04.iad>,
ThatWould...@thevillage.com says...
Do you consider in your analysis that the flag ("I") is associated with a
CPU? IOW, there would be an invalidate when it was set, and an invalidate
when it is cleared.

Chris - you really do need a set of "I"s for every lock. Consider a scenario
where:
thread 1 grabs lock A,
thread 2 tries to grab lock A and starts spinning,
thread 1 releases lock A,
thread 1 grabs lock B (using the same I)

I don't know for sure that thread 2 will necessarily see the clearing
of the "I" when thread 1 releases lock A before it grabs lock B. Memory
latency can be as much as a couple orders of magnitude slower than the
processor (worse on NUMA machines).

- Tim

EricP

unread,
Nov 9, 2011, 4:53:40 PM11/9/11
to
Tim McCaffrey wrote:
> In article <hMzuq.29868$vg7....@newsfe04.iad>,
> ThatWould...@thevillage.com says...
>>
>> I'm not sure what the impact of a memory barrier
>> is relative to the cost of cache line moves,
>> but here speculating on the approximate bus traffic for LH
>> it looked best overall:
>>
>
> Do you consider in your analysis that the flag ("I") is associated with a
> CPU? IOW, there would be an invalidate when it was set, and an invalidate
> when it is cleared.

I was assuming that my ReadForWrite message included an Invalidate.
Finding info on the actual protocols for Intel and AMD is difficult.
I'm assuming there are more similarities between Intel's MESIF on
Quickpath and AMD's MOESI on Hypertransport than differences so
that a certain amount of arm waving is legal.

From what I can glean from various public sources
MOESI can hand off a modified line from one
processor to another without write back.

For *I = 1 it would work as follows:

- The flag cache line starts out in P1's cache in a
Owned (modified-shared) state because P1 last wrote it,
and P2's cache in a Shared state because P2 read it.

- When P2 writes to it *I = 1
this starts the upgrade from Share to Exclusive
(probably marks cache line as "exclusive transitioning")
P2 sends a command GETX message to the Home node.

- Each node has multiple inbound and outbound message queues
for Commands and Replies to avoid deadlock.
Your command message gets processed 'in the fullness of time'.

- Home node blocks further state changes to that line.

Home checks its directory to see who has copies.
However it finds just P1 is the owner.

Home replies to P2 with the number of Acks to expect.
Home sends Invalidates to any Sharers (none in this case).
- Sharers would change line state to Invalid and send Ack to P2.
Home forwards GETX message to P1.
Home remembers P2 now has line Exclusive.

- P1 sends line data to P2 and transitions state to Invalid.

- P2 counts down Invalidate Acks from Sharers (none)
and the Data reply from P1.
P2 changes line to Exclusive state.

- P2 sends an Unblock message to Home.

It is not clear to me what happens if the Home receives
another request while the line is blocked. It may be that
it Nacks back to the requester who must retry,
or it can hold the request in a queue (watch out for deadlocks!!!).

Eric



Chris M. Thomasson

unread,
Nov 9, 2011, 5:14:55 PM11/9/11
to
"Tim McCaffrey" <timca...@aol.com> wrote in message
news:j9el4o$4j0$1...@USTR-NEWS.TR.UNISYS.COM...
> In article <hMzuq.29868$vg7....@newsfe04.iad>,
>>
>>Chris M. Thomasson wrote:
[...]
> Chris - you really do need a set of "I"s for every lock. Consider a
> scenario
> where a
> thread 1 grabs lock A,
> thread 2 tries to grab lock A and starts spinning,
> thread 1 releases lock A,
> thread 1 grabs lock B (using the same I)
>
> I don't know for sure that thread 2 will necessarily see the clearing
> of the "I" when thread 1 releases lock A before it grabs lock B. Memory
> latency can be as much as a couple orders of magnitude slower than the
> processor (worse on NUMA machines).

WRT this example, consider keeping the local state within
thread-local-storage for a moment. Something like:
____________________________________________
struct lh_lock_flag
{
int m_flag; // = 0
};

struct thread
{
lh_lock_flag m_dummy;
lh_lock_flag* I; // == &m_dummy
lh_lock_flag* P;
};

struct lh_spinlock
{
lh_lock_flag m_dummy;
lh_lock_flag* L; // == &m_dummy
};
____________________________________________

Keep in mind that `I' is a pointer to a flag. Keep in mind that when a
thread grabs a lock it atomically exchanges ownership of the flag that was
pointed to by the spinlock. For instance, imagine that thread T1 has
ownership of lock flag LF1, and spinlock SL1 has ownership of LF2. When
thread T1 grabs spinlock SL1 it exchanges ownership such that thread T1 now
owns lock flag LF2 and spinlock SL1 now owns lock flag LF1.

In your example, this means that when thread 1 grabs lock B, it will _not_
be using the same lock flag it used to acquire lock A. Therefore, AFAICT,
there is no conflict. Let's see if I can illustrate it some more...

initial state:
__________________________________________
thread_1 owns lock_flag_1 - previous = NULL
thread_2 owns lock_flag_2 - previous = NULL
spinlock_A owns lock_flag_3
spinlock_B owns lock_flag_4
__________________________________________


thread_1 grabs spinlock_A...
which gives us a state of:
__________________________________________
thread_1 owns lock_flag_3 - previous = lock_flag_1
thread_2 owns lock_flag_2
spinlock_A owns lock_flag_1
spinlock_B owns lock_flag_4
__________________________________________


thread 2 tries to grab lock A and starts spinning,
which gives us a state of:
__________________________________________
thread_1 owns lock_flag_3 - previous = lock_flag_1
thread_2 owns lock_flag_1 - previous = lock_flag_2
spinlock_A owns lock_flag_2
spinlock_B owns lock_flag_4

__________________________________________

This means that thread_2 is spinning on lock_flag_1...


thread 1 releases lock A
which gives us a state of:
__________________________________________
thread_1 owns lock_flag_3 - previous = lock_flag_1
thread_2 owns lock_flag_1 - previous = lock_flag_2
spinlock_A owns lock_flag_2
spinlock_B owns lock_flag_4
__________________________________________

This means that thread_1 sets the flag on the previous flag (e.g.,
lock_flag_1) to an unlocked state, thereby waking up thread_2 who is
actively spinning on lock_flag_1.


thread 1 grabs lock B
which gives us a state of:
__________________________________________
thread_1 owns lock_flag_4 - previous = lock_flag_3
thread_2 owns lock_flag_1 - previous = lock_flag_2
spinlock_A owns lock_flag_2
spinlock_B owns lock_flag_3
__________________________________________




AFAICT, there are no conflicts here between any of the lock_flags.


EricP

unread,
Nov 9, 2011, 5:14:14 PM11/9/11
to
EricP wrote:
>
> - P2 counts down Invalidate Acks from Sharers (none)
> and the Data reply from P1.
> P2 changes line to Exclusive state.
^^^^^^^^^
Oops - P2 receives the data from P1 with a modified bit set,
and changes the line state to Modified so that it will later
be written back to main memory.
Eventually the line will get evicted from some cache
and be sent back to Home for write to main memory.

Eric



Chris M. Thomasson

unread,
Nov 9, 2011, 5:26:32 PM11/9/11
to
"EricP" <ThatWould...@thevillage.com> wrote in message
news:hMzuq.29868$vg7....@newsfe04.iad...
> Chris M. Thomasson wrote:
>>> In article <v7rco8-...@ntp6.tmsw.no>, "terje.mathisenattmsw.no"
>>> says...
>>>> EricP wrote:
>>>>> Efficient Software Synchronization on
>>>>> Large Cache Coherent Multiprocessors (1994)
>>>>> Peter Magnusson , Anders Landin , Erik Hagersten
>>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.6490
>>>>>
>>>>> It has O(N) cost, FIFO ordered, bounded delay, no thundering herd,
>>>>> and for P processors and S spinlocks uses just P+S cache lines.
>>>> Nice work, and from 1994!
>>>>
>>>> So why are we still seeing lots of high-overhead locking in current
>>>> code?
>>>> :-(
>>
>> Actually, implementing on an x86 is pretty nice because of the implicit
>> acquire/release memory barriers. Unfortunately, the LH and M locks have
>> to
>> set the state of the lock flag to "locked" _before_ they try to get the
>> lock. This means that you need extra memory barriers in the lock
>> function; not too cool if you are on something like a PPC...
>
> Why, what's special about PPC?

You need a damn `LWSYNC' after you set the lock state to 1 and before the
atomic exchange. You also need a data-dependent acquire barrier after the
atomic exchange and before you spin on anything...

[...]


> I'm not sure what the impact of a memory barrier
> is relative to the cost of cache line moves,
> but here speculating on the approximate bus traffic for LH
> it looked best overall:
>
> SpinQueueAcquire( )
> {
[...]
> prior = AtomicXchg (&spinqueue, cpu.next);

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You need a data-dependent acquire barrier right here, or,
`std::memory_order_consume' in C++11 terms...


Tim McCaffrey

unread,
Nov 9, 2011, 8:59:36 PM11/9/11
to

Thank you both Eric & Chris, I'm starting to "get it" on how these functions
are supposed to work (the syntax errors/typos in the original article don't
help :( ).

I'm trying to see if I can implement these for Windows device drivers, which
implies some constraining conditions (like no thread local storage).

- Tim

EricP

unread,
Nov 10, 2011, 2:14:47 PM11/10/11
to
The LH lock is better than the Win kernel InStackQueuedSpinLock
but the difference would hardly be noticeable.
But if you want to try it anyway....

In the WinNT kernel (and Linux does something similar)
each processor (core) has its own private state information,
stuff like interrupt state, IRQL, DPC list, current thread pointer,
stored in a strut. This Processor State Descriptor (PSD) is
pointed to by the GS segment register, and that can be turned
into normal pointer.

So if you were MS you would stash your
CurrentSpinItem and FreeSpinItem in the kernel Psd.

You don't have access to that so you have to tart one up.

You need an array of cache aligned per-processor structs
in non-paged memory. KeQueryMaximumProcessorCount elements.

You must raise your IRQL to DISPATCH_LEVEL (if not already at it)
to shut off the scheduler, then KeGetCurrentProcessorNumber
gives you your index into that array for each processor.
The spin queue locks should not be used both at DISPATCH
level and in a device interrupt routine (or you must take
great care to ensure an interrupt during a SpinQueueAcquire
or SpinQueueRelease doesn't corrupt your Psd).

The memory for your Psd array and each spin item can be
allocated with ExAllocatePoolWithTag (NonPagedPoolCacheAligned, ...)

The MS compiler intrinsic __cpuid can return the cache line size.

Eric


Tim McCaffrey

unread,
Nov 10, 2011, 6:06:04 PM11/10/11
to
In article <apVuq.32321$Cr1....@newsfe03.iad>,
ThatWould...@thevillage.com says...
>
>Tim McCaffrey wrote:
>> Thank you both Eric & Chris, I'm starting to "get it" on how these
>> functions are supposed to work (the syntax errors/typos in the original
>> article don't help :( ).
>>
>> I'm trying to see if I can implement these for Windows device drivers,
>> which implies some constraining conditions (like no thread local storage).
>>
>> - Tim
>>
>
>The LH lock is better than the Win kernel InStackQueuedSpinLock
>but the difference would hardly be noticeable.
>But if you want to try it anyway....
>

Well, on some of the boxes I have to work with it might make a difference.

Thanks for the other info, although I actually knew it already.

This would probably all be in the form of an experiment anyway; wholesale
replacement of locking code is pretty scary.

- Tim

Tim McCaffrey

unread,
Nov 11, 2011, 2:59:16 PM11/11/11
to
In article <j9hlcs$tvu$1...@USTR-NEWS.TR.UNISYS.COM>, timca...@aol.com says...
Well, here is what I ended up with. I started, kind-of, with M version 1 and
ended up with this.

Big changes:

There is no sharing of flags between locks.

The flags are part of the lock structure (so there is only one thing
to keep track of, this is for ease of programming & reliability more than
anything).

        The "ownership" of a flag stays with a CPU (which is bound to have an
impact on cache-coherence traffic, but I haven't walked it through all the
way).

        And, yes, this is all written for Visual C. It is meant to be
used in a Windows device driver.

        While it compiles, I haven't tested it. I'm not fully convinced that
there are no race conditions. Feel free to critique...



--------------------------------------------

typedef struct _tm_flags
{
    LONG flag;
    KIRQL Old_Irql;
    LONG Pad[14];   // make sure every array element is on a different cache line
} tm_flags;

typedef struct _tm_lock
{
    volatile LONG Ix;
    LONG Pad[15];   // put flags on a separate cache line
    tm_flags flags[1];  // actually one element per processor; see tm_allocatelock
} tm_lock;

void tm_acquire(tm_lock *lock)
{
    ULONG CPU_Num;
    KIRQL Current_IRQL;
    tm_flags* pFlags;
    LONG PrevIx;

    Current_IRQL = KeGetCurrentIrql();
    if (Current_IRQL < DISPATCH_LEVEL)
        KeRaiseIrqlToDpcLevel();
    CPU_Num = KeGetCurrentProcessorNumber();
    pFlags = &lock->flags[CPU_Num];
    pFlags->Old_Irql = Current_IRQL;
    pFlags->flag = 1;

    PrevIx = InterlockedExchange(&(lock->Ix), (LONG)CPU_Num);

    if (PrevIx != -1)
    {
        volatile LONG *pTest = &(lock->flags[PrevIx].flag);
        while (*pTest) ;  // spin
    }
}

void tm_release(tm_lock *lock)
{
    ULONG CPU_Num;
    ULONG Ix;
    tm_flags* pFlags;

    CPU_Num = KeGetCurrentProcessorNumber();
    pFlags = &lock->flags[CPU_Num];

    // If we are the owner, mark lock not-busy
    Ix = InterlockedCompareExchange(&(lock->Ix), -1, CPU_Num);
    pFlags->flag = 0;  // free "local" lock
    if (pFlags->Old_Irql < DISPATCH_LEVEL)
        KeLowerIrql(pFlags->Old_Irql);
}

tm_lock* tm_allocatelock(void)
{
    tm_lock *lock;
    ULONG CPU_Count;

    CPU_Count =
#if NTDDI_VERSION < NTDDI_VISTA
        KeNumberProcessors();
#else
        KeQueryMaximumProcessorCount();
#endif
    AllocateAndZeroMemoryEx(&lock,
        sizeof(tm_lock) + (CPU_Count-1)*sizeof(tm_flags), 'KCMT');
    if (lock) lock->Ix = -1;
    return lock;
}

void tm_freelock(tm_lock **lock)
{
    if (*lock == NULL) return;

    if ((*lock)->Ix != -1)
        KeBugCheckEx(0xBD424242, (ULONG_PTR)*lock, (*lock)->Ix,
                     (*lock)->flags[(*lock)->Ix].flag, 0);

    FreeMemory(*lock);
    *lock = NULL;
}
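For anyone who wants to poke at the algorithm outside a driver, here is a hedged user-mode transliteration using C11 atomics. The per-CPU index becomes a caller-supplied slot number (user code has no stable `KeGetCurrentProcessorNumber` equivalent), the IRQL handling is dropped, and the memory orderings are my reading of what the Interlocked calls provide, not something the original states:

```c
#include <stdatomic.h>

#define LINE   64   /* assumed cache-line size for this sketch */
#define NSLOTS 8    /* stand-in for the processor count */

typedef struct {
    _Alignas(LINE) atomic_int flag;
} mish_flag;

typedef struct {
    _Alignas(LINE) atomic_int ix;  /* slot of the last acquirer, -1 when free */
    mish_flag flags[NSLOTS];
} mish_lock;

void mish_init(mish_lock *l)
{
    atomic_init(&l->ix, -1);
    for (int i = 0; i < NSLOTS; i++)
        atomic_init(&l->flags[i].flag, 0);
}

void mish_acquire(mish_lock *l, int slot)
{
    atomic_store_explicit(&l->flags[slot].flag, 1, memory_order_relaxed);
    /* acq_rel: publishes our flag store and synchronizes with the
       previous acquirer's exchange */
    int prev = atomic_exchange_explicit(&l->ix, slot, memory_order_acq_rel);
    if (prev != -1)
        while (atomic_load_explicit(&l->flags[prev].flag, memory_order_acquire))
            ;  /* spin; a pause hint belongs here */
}

void mish_release(mish_lock *l, int slot)
{
    int expect = slot;
    /* if nobody queued behind us, mark the lock free again */
    atomic_compare_exchange_strong_explicit(&l->ix, &expect, -1,
        memory_order_release, memory_order_relaxed);
    atomic_store_explicit(&l->flags[slot].flag, 0, memory_order_release);
}
```

This inherits whatever hazards the original has; in particular, re-acquiring on the same slot before the successor has observed the flag going back to zero looks like it could wedge both parties, which seems a good argument for modeling it in a race detector as Chris suggests downthread.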

EricP

unread,
Nov 12, 2011, 12:45:09 PM11/12/11
to
Tim McCaffrey wrote:
>
> Well, here is what I ended up with. I started, kind-of, with M version 1 and
> ended up with this.

Let's call it M-ish

>
> Big changes:
>
> There is no sharing of flags between locks.
>
> The flags are part of the lock structure (so there is only one thing
> to keep track of, this is for ease of programming & reliability more than
> anything).
>
> The "ownership" of a flag stays with a CPU (which is bound to have an
> impact on cache-coherence traffic, but I haven't walked it through all
> the way).

Good idea.
However, as the M-ish lock requires an extra InterlockedCompareExchange
on release, it has to move the lock line back to this processor,
so your saving here is eaten up.

>
> And, yes, this is all written for Visual C. It is meant to be
> used in a Windows device driver.
>
> While it compiles, I haven't tested it. I'm not fully convinced that
> there are no race conditions. Feel free to critique...

The major difference with your approach
is you are using an array while the
LH and M locks were using individual cache lines.
The array does simplify your M-ish lock.

Note that we could be talking about
P = 128, S = 10000 and line size = 128 so
(P+S)*LINE = 1,296,384 bytes
P*S*LINE = 163,840,000 bytes
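Those figures are just (P+S)*LINE for the shared per-processor array scheme versus P*S*LINE for one line per (processor, lock) pair; as a sanity check on the arithmetic:

```c
/* Footprint of per-processor flags shared across all locks, versus
   one dedicated cache line per (processor, lock) pair. */
long shared_bytes(long p, long s, long line)  { return (p + s) * line; }
long perpair_bytes(long p, long s, long line) { return p * s * line; }
```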

>
>
>
> --------------------------------------------
>
> typedef struct _tm_flags
> {
>     LONG flag;
>     KIRQL Old_Irql;
>     LONG Pad[14];  // make sure every array element is on a different cache line
> } tm_flags;

Intel System Programmer guide section 8.10.6.7 suggests
"128 byte sector" alignment is required on Netburst.
On Netburst a sector is 2 cache lines.

See Intel System Programmer manual section 8.10.6.7
MANAGEMENT OF IDLE AND BLOCKED CONDITIONS

>
> typedef struct _tm_lock
> {
> volatile LONG Ix;
> LONG Pad[15];// put flags on a separate cache line
> tm_flags flags[1];
> } tm_lock;
>
> void tm_acquire(tm_lock *lock)
> {
> ULONG CPU_Num;
> KIRQL Current_IRQL;
> tm_flags* pFlags;
> LONG PrevIx;
>
> Current_IRQL = KeGetCurrentIrql();
> if (Current_IRQL < DISPATCH_LEVEL)
> KeRaiseIrqlToDpcLevel();
> CPU_Num = KeGetCurrentProcessorNumber();
> pFlags = &lock->flags[CPU_Num];
> pFlags->Old_Irql = Current_IRQL;
> pFlags->flag = 1;
>
> PrevIx = InterlockedExchange(&(lock->Ix), (LONG)CPU_Num);
>
> if (PrevIx != -1)
> {
> volatile LONG *pTest = &(lock->flags[PrevIx].flag);
> while (*pTest) ; // spin

You should spin with the Pause instruction
(intrinsic _mm_pause)
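A minimal shape for that spin, guarded so it still compiles off x86 (the iteration cap is only there to keep the sketch testable; a real lock would spin indefinitely):

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

static void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    _mm_pause();  /* backs off the spin, plays nicer with hyperthreading */
#endif
}

/* Returns 1 if *p reached zero, 0 if we gave up after max_iters. */
int spin_until_zero(volatile int *p, long max_iters)
{
    while (*p) {
        if (max_iters-- <= 0)
            return 0;
        cpu_relax();
    }
    return 1;
}
```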
I don't see any races.

Eric



Tim McCaffrey

unread,
Nov 13, 2011, 3:06:27 PM11/13/11
to
In article <Bhyvq.23187$SW4....@newsfe08.iad>,
ThatWould...@thevillage.com says...
>
>Tim McCaffrey wrote:
>>
>> Well, here is what I ended up with. I started, kind-of, with M version 1 and
>> ended up with this.
>
>Let's call it M-ish
>


>The major difference with your approach
>is you are using an array while the
>LH and M locks were using individual cache lines.
>The array does simplify your M-ish lock.
>
>Note that we could be talking about
>P = 128, S = 10000 and line size = 128 so
>(P+S)*LINE = 1,296,384 bytes
>P*S*LINE = 163,840,000 bytes
>

Yes, I suppose I could add a pointer (in a separate cache line so it can be
shared without additional cache traffic) to an array of shared flags. Then
the programmer has to be careful that in nested lock cases the flags don't
overlap. That gets especially nasty when you need to lock two random
structures of the same type at the same time.

>
>Intel System Programmer guide section 8.10.6.7 suggests
>"128 byte sector" alignment is required on Netburst.
>On Netburst a sector is 2 cache lines.
>
>See Intel System Programmer manual section 8.10.6.7
>MANAGEMENT OF IDLE AND BLOCKED CONDITIONS
>

Yeah, I need to nail down the details of that stuff. The constants above were
mostly placeholders. I could probably tweak it to handle the cache-line size
dynamically as well.
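Querying it dynamically via the CPUID leaf-1 CLFLUSH field (EBX bits 15:8, in 8-byte units) might look like this; the spelling below uses GCC/clang's `__get_cpuid`, with MSVC's `__cpuid` intrinsic being the equivalent, and the 64-byte fallback is my assumption for when the field is unavailable:

```c
/* CLFLUSH line size from CPUID leaf 1: EBX bits 15:8, in 8-byte units. */
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

unsigned cache_line_size(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned a, b, c, d;
    if (__get_cpuid(1, &a, &b, &c, &d)) {
        unsigned r = ((b >> 8) & 0xffu) * 8u;
        if (r)
            return r;
    }
#endif
    return 64;  /* assumed fallback when CPUID doesn't report a size */
}
```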

>
>You should spin with the Pause instruction
>(intrinsic _mm_pause)
>

Right, for hyperthreading environments.
>
>I don't see any races.
>
Good. :)


- Tim

Chris M. Thomasson

unread,
Nov 18, 2011, 4:31:36 PM11/18/11
to
"Tim McCaffrey" <timca...@aol.com> wrote in message
news:j9p803$9d9$1...@USTR-NEWS.TR.UNISYS.COM...
> In article <Bhyvq.23187$SW4....@newsfe08.iad>,
> ThatWould...@thevillage.com says...
>>
>>Tim McCaffrey wrote:
>>>
>>> Well, here is what I ended up with. I started, kind-of, with M version
>>> 1 and ended up with this.
>>
>>Let's call it M-ish
>>
>
>
>>The major difference with your approach
>>is you are using an array while the
>>LH and M locks were using individual cache lines.
>>The array does simplify your M-ish lock.
>>
>>Note that we could be talking about
>>P = 128, S = 10000 and line size = 128 so
>>(P+S)*LINE = 1,296,384 bytes
>>P*S*LINE = 163,840,000 bytes
>>
>
> Yes, I suppose I could add a pointer (in a separate cache line so it can
> be
> shared without additional cache traffic) to an array of shared flags.
> Then
> the programmer has to be careful that in nested lock cases the flags don't
> overlap. That gets especially nasty when you need to lock two random
> structures of the same type at the same time.

FWIW, if you are going to use an array-based lock, perhaps my spinlock based
on a distributed bakery algorithm might be of some interest to you:
You could dynamically allocate the lock array to accommodate the actual
number of processors in the system.


>>I don't see any races.
>>
> Good. :)

I haven't really looked closely at it, but I don't see any obvious race
conditions yet. Perhaps I will model your algorithm in Relacy Race Detector
just to be sure.


Chris M. Thomasson

unread,
Nov 18, 2011, 4:34:17 PM11/18/11
to
"Tim McCaffrey" <timca...@aol.com> wrote in message
news:j9juqj$huo$1...@USTR-NEWS.TR.UNISYS.COM...
> In article <j9hlcs$tvu$1...@USTR-NEWS.TR.UNISYS.COM>, timca...@aol.com
> says...
[...]
> --------------------------------------------
[...]
> void tm_acquire(tm_lock *lock)
> {
[...]
> PrevIx = InterlockedExchange(&(lock->Ix), (LONG)CPU_Num);

Consider using `InterlockedExchangeAcquire()' here...


[...]
> }
>
> void tm_release(tm_lock *lock)
> {
[...]
> Ix = InterlockedCompareExchange(&(lock->Ix), -1, CPU_Num); // If

And using `InterlockedCompareExchangeRelease()' here...

[...]
> }
[...]


;^)
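In portable C11 terms (an analogy, not the Windows API itself), the pairing Chris suggests is an acquire on the entry exchange and a release on the exit compare-exchange:

```c
#include <stdatomic.h>

/* Like InterlockedExchangeAcquire: returns the old value, with
   acquire ordering on the lock-entry path. */
int xchg_acquire(atomic_int *p, int v)
{
    return atomic_exchange_explicit(p, v, memory_order_acquire);
}

/* Like InterlockedCompareExchangeRelease: returns the value observed,
   with release ordering when the swap succeeds. */
int cas_release(atomic_int *p, int expect, int desired)
{
    atomic_compare_exchange_strong_explicit(p, &expect, desired,
        memory_order_release, memory_order_relaxed);
    return expect;  /* unchanged on success, the current value on failure */
}
```

On x86 all the Interlocked variants compile to the same locked instructions anyway, so this is mostly about documenting intent and about weaker architectures.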

