ARM64 and Code Density

Vince Weaver

unread,

Sep 25, 2013, 3:05:53 PM9/25/13

to

Hello

A few years back we had a discussion here on code density, and on my
ICCD'09 paper where I hand-assembled a small program on 20+ architectures
and did a size comparison.
http://www.deater.net/weave/vmwprod/asm/ll/

Recently (in an Embedded Systems class I'm teaching) code density came
up again, in regards to the new arm64/aarch64/armv8 instruction set.

So I've been working on porting the code to the new architecture (as
arm64 is a completely new ISA compared to arm32/thumb/thumb2).

The results are sort of interesting. I was under the impression that
arm64 was going to be a very MIPS-like 32-bit-fixed-insn-width
64-bit RISC ISA, but it has many interesting instructions and
arm32-esque features that do help with code density. You can
write denser code than arm32, but nowhere near as good as thumb2.

So far, the LZSS routine in my code looks like this:
THUMB2: 76 bytes
THUMB: 76 bytes
ARM64: 96 bytes
ARM32: 116 bytes
(for comparison, x86 is 63 bytes)

I'm still working on getting a full binary done, for comparison the
existing versions are:
THUMB2: 925 bytes
THUMB: 937 bytes
ARM32: 1186 bytes

The major difference that helps a lot with ARM64 is finally having
a reasonable size immediate field. The extra 16 registers really
don't help much for this particular benchmark. Some of the new
semi-conditional instructions added to make up for the lack of
full conditional execution help a bit too, especiall TBZ
(test and branch if zero).

What the architecture could really use would be an "increment multiple"
instruction:
inc {r1,r3,r4}
or similar.

So does anyone have any thoughts on where ARM is taking code density?
Does anyone even care about code density anymore?

As an aside, I also coded up support for the newish x86_32 (32-bit ILP
on top of x86_64) Linux code. In the end it didn't buy anything at all,
because the main advantage is lack of 64-bit pointers, and when hand
coding small assembly programs you can cheat and assume your pointers
are below 2^32. There was a size decrease in the executable, but
that was purely because a 32-bit ELF header is smaller than a 64-bit
ELF header.

Vince

hitc...@gmail.com

unread,

Sep 25, 2013, 5:44:07 PM9/25/13

to

A64 seems to be aimed squarely at making it (relatively) simple to implement efficient superscalar OoO cores. A Thumb-3 would be good for code density, but maybe they didn't have finalized designs ready yet? They might be researching better ways to implement a dense instruction set-- maybe even a variable-width bytecode that still decodes efficiently.

I'd be interested in seeing that benchmark for MSP430-- it's a particularly dense instruction set.

Mark Thorson

unread,

Sep 25, 2013, 7:45:58 PM9/25/13

to

Vince Weaver wrote:
>
> What the architecture could really use would be an "increment multiple"
> instruction:
> inc {r1,r3,r4}
> or similar.

Interesting thought. What does the increment?
If it's the ALU, you'd have to run all three
through the ALU. If you have a separate incrementer
or two, some of these increments could be run
through that. Then, you'd be stuck by the number
of register file ports. It's hard to see how these
would be executed simultaneously, unless it's on
the Harvard Mark I in which the registers were
implemented as storage counters that could be
incremented simultaneously. I think the cost would
be too high to go back to "smart" registers. The
Mark I was capable of a high degree of parallelism
because all of the registers could do adds and
subtracts, but this parallelism was almost never
used.

It would seem this is not really an instruction.
It's like a macro for three instructions that gets
unbundled in decode and executed serially. Is this
the most desirable macro? There might be others
that would do better.

This brings up an interesting question. Should
macros for code density be included in the
instruction set? Thumb doesn't do that -- it's
just a subset of the regular instruction set.
I suppose the task management instructions in x86
do that, but Intel's RMX real-time operating system
implements the task management data structures
without using the task management instructions
because simpler instructions are faster. I suppose
the task management instructions make it easier
to write applications which use RMX, if you write
in assembly language.

Pedro Pereira

unread,

Sep 26, 2013, 7:05:46 AM9/26/13

to

Vince Weaver wrote:

[...]

> As an aside, I also coded up support for the newish x86_32 (32-bit ILP
> on top of x86_64) Linux code. In the end it didn't buy anything at all,
> because the main advantage is lack of 64-bit pointers, and when hand
> coding small assembly programs you can cheat and assume your pointers
> are below 2^32. There was a size decrease in the executable, but
> that was purely because a 32-bit ELF header is smaller than a 64-bit
> ELF header.

It should buy space on the D-Cache.

Pedro Pereira

Vince Weaver

unread,

Sep 26, 2013, 11:22:00 AM9/26/13

to

It doesn't help with instruction code-density on my (obviously
a bit contrived) benchmark though.

If you write your x86_64 code carefully you can avoid a lot
of the d-cache overhead by forcing 32-bit values, although I
guess things like pushing function return values on the stack
are going to take up 64-bits unless you open-code the call
instruction.

Vince

Vince Weaver

unread,

Sep 26, 2013, 11:28:27 AM9/26/13

to

On 2013-09-25, hitc...@gmail.com <hitc...@gmail.com> wrote:
> A64 seems to be aimed squarely at making it (relatively) simple to
> implement efficient superscalar OoO cores. A Thumb-3 would be good for
> code density, but maybe they didn't have finalized designs ready yet?
> They might be researching better ways to implement a dense instruction
> set-- maybe even a variable-width bytecode that still decodes
> efficiently.

It's a shame info on instruction set design isn't published more often.
It would be really interesting to hear the considerations they had
and what was rejected before they got to the final design.

> I'd be interested in seeing that benchmark for MSP430-- it's a
> particularly dense instruction set.

I'll add that one to the list, I've been meaning to try a TI chip at some
point, they often seem to have interesting instruction sets.

Vince

--
main(){char O,o[66]="|\n\\/_ ",*I=o+7,l[]="B!FhhBHCWE9C?cJFKET$+h'Iq*chT"
,i=0,_;while(_=l[i++])for(O=0;O++<_>>5;I++)*I=*(I-(_&31));*I=0;puts(o+5);}

Vince Weaver

unread,

Sep 26, 2013, 11:45:08 AM9/26/13

to

On 2013-09-25, Mark Thorson <nos...@sonic.net> wrote:
> Vince Weaver wrote:
>>
>> What the architecture could really use would be an "increment multiple"
>> instruction:
>> inc {r1,r3,r4}
>> or similar.
>
> Interesting thought. What does the increment?
> If it's the ALU, you'd have to run all three
> through the ALU. If you have a separate incrementer
> or two, some of these increments could be run
> through that. Then, you'd be stuck by the number
> of register file ports. It's hard to see how these
> would be executed simultaneously,

I guess that's true, I was thinking as a SW programmer
and not as a HW architect there.

As an aside, I wonder if the arm32/thumb2 compatability
layer on the arm64 chips is mixed in (and so they have
to have a barrel shifter in each ALU just in case),
if they break the arm32 stuff up into micro-ops and
the complex instructions are just slow, or if they
do something like the itanium did with x86 compatability
and just drop a full-featured arm32 chip on the die
for use in such cases.

> It would seem this is not really an instruction.
> It's like a macro for three instructions that gets
> unbundled in decode and executed serially. Is this
> the most desirable macro? There might be others
> that would do better.

True. Just in my particular microbenchmark I came across
a few spots where I needed multiple increments and wasting
an entire 32-bit instruction just to do "add r1,r1,#1"
seems like such overkill. Part of why x86 gets good code
density is the 1-byte inc instruction (although they
dropped that for x86_64).

This actually came up because my arm32 code had some conditionally
executed load instructions, and those load instructions used
post-increment addressing. It's hard to replicate that with the
new arm64 conditional-move-and-operate instructions short of breaking
out the post-increment, but then you waste a huge number of bytes
on address increment instructions.

> This brings up an interesting question. Should
> macros for code density be included in the
> instruction set? Thumb doesn't do that -- it's
> just a subset of the regular instruction set.

I think it would be an interesting idea, but as always the challenge
is picking which density macros to implement. I wonder if anyone
investigated this back when academic code-density research peaked
in the late 90s.

I get the impression industry decided that fixed-width 16-bit
encoding was good enough, and variable-instruction width was
better but usually not worth the complexity.

Andy (Super) Glew

unread,

Sep 26, 2013, 12:12:03 PM9/26/13

to

On 9/26/2013 8:45 AM, Vince Weaver wrote:
> On 2013-09-25, Mark Thorson <nos...@sonic.net> wrote:
>> Vince Weaver wrote:
>>>
>>> What the architecture could really use would be an "increment multiple"
>>> instruction:
>>> inc {r1,r3,r4}
>>> or similar.
>>
>> Interesting thought. What does the increment?
>> If it's the ALU, you'd have to run all three
>> through the ALU. If you have a separate incrementer
>> or two, some of these increments could be run
>> through that. Then, you'd be stuck by the number
>> of register file ports. It's hard to see how these
>> would be executed simultaneously,

Just for fun, I can see a way - to implement this somewhat efficiently:
symbolic renaming. Lazy computation.

I have mentioned in the past about microarchitecture techniques where
the renamer does not just map a logical register to a physical register,
but instead maps a logical register to a formula

E.g. lreg4 --> preg29
inc lreg2
lreg4 -> preg29+1

Therefore
lreg1 --> preg100 + 2
lreg3 --> preg22 +4
lreg4 --> preg300
inc {r1,r3,r4}
lreg1 --> preg100 + 2 + 1
lreg3 --> preg22 + 4 +1
lreg4 --> preg300 + 1

Yes, you have to still do the several increments.
But the constants in the symbolic renamer are often small
(implementation), so increment 3 4 bit constants might be okay.

You would still next extra rename ports,however. Unless you built a CAM
based renamer, e.g. bitmap.

Like I said, just for fun. An all nighter.

--
The content of this message is my personal opinion only. Although I am
an employee (currently Imagination Technologies, which acquired MIPS; in
the past of companies such as Intellectual Ventures/QIPS, Intel, AMD,
Motorola, and Gould), I reveal this only so that the reader may account
for any possible bias I may have towards my employers' products. The
statements I make here in no way represent my employers' positions on
the issue, nor am I authorized to speak on behalf of my employers, past
or present.

Andy (Super) Glew

unread,

Sep 26, 2013, 12:12:45 PM9/26/13

to

On 9/26/2013 8:45 AM, Vince Weaver wrote:

It's called VLIW.

>
> I get the impression industry decided that fixed-width 16-bit
> encoding was good enough, and variable-instruction width was
> better but usually not worth the complexity.
>
> Vince
>

--

tridac

unread,

Sep 26, 2013, 12:31:11 PM9/26/13

to

On 09/26/13 15:28, Vince Weaver wrote:

> I'll add that one to the list, I've been meaning to try a TI chip at some
> point, they often seem to have interesting instruction sets.
>
> Vince
>

You should try comparing the MSP430 with the pdp11 (circa 1969) to see where
they got their ideas from. But yes, a very clean architecture with a
wide variety
of addressing modes. Only problem with the earlier versions was the very
limited
16 bit address space, but maybe they have fixed that by now...

Chris

--
** Remove the meaning of life to reply...

James Harris

unread,

Sep 26, 2013, 1:02:50 PM9/26/13

to

"Vince Weaver" <vi...@pianoman.cluster.toy> wrote in message
news:slrnl48ko9...@pianoman.cluster.toy...

> On 2013-09-25, hitc...@gmail.com <hitc...@gmail.com> wrote:
>> A64 seems to be aimed squarely at making it (relatively) simple to
>> implement efficient superscalar OoO cores. A Thumb-3 would be good for
>> code density, but maybe they didn't have finalized designs ready yet?
>> They might be researching better ways to implement a dense instruction
>> set-- maybe even a variable-width bytecode that still decodes
>> efficiently.
>
> It's a shame info on instruction set design isn't published more often.
> It would be really interesting to hear the considerations they had
> and what was rejected before they got to the final design.
>
>> I'd be interested in seeing that benchmark for MSP430-- it's a
>> particularly dense instruction set.
>
> I'll add that one to the list, I've been meaning to try a TI chip at some
> point, they often seem to have interesting instruction sets.

Speaking of dense instruction sets ... I heard there was a Burroughs
approach which emphasised small code size. I haven't seen their instruction
encoding but maybe it was for one of those mentioned on the following page.

http://en.wikipedia.org/wiki/Burroughs_large_systems

Also, the Transputer design is supposed to be good for writing compact
programs.

http://en.wikipedia.org/wiki/Transputer

For the latter, many instructions can be just one byte long. In fact, at the
machine level *all* instructions are one byte long but there are two prefix
instructions which can be used to extend data values or opcodes.

James

Ivan Godard

unread,

Sep 26, 2013, 1:50:37 PM9/26/13

to

The great majority of B6500 ops were one byte. load ("VALC") and LEA
("NAMC") were two bytes, and branches were three. Hard to beat that.

The Mill runs 14-17 bits per op, depending on family member and actual
code, which is the best we know of for a 64-bit machine.

Ivan

Ivan Godard

unread,

Sep 26, 2013, 1:56:43 PM9/26/13

to

On 9/26/2013 8:45 AM, Vince Weaver wrote:

> On 2013-09-25, Mark Thorson <nos...@sonic.net> wrote:
>> Vince Weaver wrote:
>>>
>>> What the architecture could really use would be an "increment multiple"
>>> instruction:
>>> inc {r1,r3,r4}
>>> or similar.
>>
>> Interesting thought. What does the increment?
>> If it's the ALU, you'd have to run all three
>> through the ALU. If you have a separate incrementer
>> or two, some of these increments could be run
>> through that. Then, you'd be stuck by the number
>> of register file ports. It's hard to see how these
>> would be executed simultaneously,
>
> I guess that's true, I was thinking as a SW programmer
> and not as a HW architect there.
>
> As an aside, I wonder if the arm32/thumb2 compatability
> layer on the arm64 chips is mixed in (and so they have
> to have a barrel shifter in each ALU just in case),
> if they break the arm32 stuff up into micro-ops and
> the complex instructions are just slow, or if they
> do something like the itanium did with x86 compatability
> and just drop a full-featured arm32 chip on the die
> for use in such cases.

The days when a barrel shifter was a significant cost are long gone.

Ivan

Vince Weaver

unread,

Sep 26, 2013, 2:08:51 PM9/26/13

to

On 2013-09-26, Ivan Godard <iv...@ootbcomp.com> wrote:
>
> The days when a barrel shifter was a significant cost are long gone.

I thought barrel shifters were big and slow, especially as your bit
widths got wider. Wasn't that why there wasn't one on the Pentium 4?
Although I guess Pentium 4 can count as ancient history these days.

Vince

Robert Wessel

unread,

Sep 26, 2013, 9:02:47 PM9/26/13

to

On Thu, 26 Sep 2013 15:45:08 +0000 (UTC), Vince Weaver
<vi...@pianoman.cluster.toy> wrote:

>
>As an aside, I wonder if the arm32/thumb2 compatability
>layer on the arm64 chips is mixed in (and so they have
>to have a barrel shifter in each ALU just in case),
>if they break the arm32 stuff up into micro-ops and
>the complex instructions are just slow, or if they
>do something like the itanium did with x86 compatability
>and just drop a full-featured arm32 chip on the die
>for use in such cases.

As an aside, IPF never actually did that - there was certainly
hardware dedicated to x86 emulation in the early iterations, but it
was never complete (for example, most FP stuff went though the normal
IPF FPU), nor was it ever a "full featured" x86 core. No real mode,
you could not boot the processor with x86 code, virtual memory was
built on top of IPF's VM, no virtual x86 mode, almost none of the
privileged stuff was there (IOW, it could in no way run an x86 OS),
etc.

It also didn't last all that long, somewhere in the Madison/Montecito
days that was dropped, and replaced by pure software emulation (which
was actually faster).

Terje Mathisen

unread,

Sep 27, 2013, 2:12:39 AM9/27/13

to

Vince Weaver wrote:
> Hello
>
> A few years back we had a discussion here on code density, and on my
> ICCD'09 paper where I hand-assembled a small program on 20+ architectures
> and did a size comparison.
> http://www.deater.net/weave/vmwprod/asm/ll/

...

> So far, the LZSS routine in my code looks like this:
> THUMB2: 76 bytes
> THUMB: 76 bytes
> ARM64: 96 bytes
> ARM32: 116 bytes
> (for comparison, x86 is 63 bytes)

Have you looked at LZ4?

I really like this algorithm due to the combination of an easy to
explain/implement algorithm, pretty good compression ratios, and
_really_ fast decompression.

The really fun part of lz4 decompression is how it can often run faster
than a pure memcpy() (REP MOVS) block copy operation. :-)

> As an aside, I also coded up support for the newish x86_32 (32-bit ILP
> on top of x86_64) Linux code. In the end it didn't buy anything at all,
> because the main advantage is lack of 64-bit pointers, and when hand
> coding small assembly programs you can cheat and assume your pointers
> are below 2^32. There was a size decrease in the executable, but
> that was purely because a 32-bit ELF header is smaller than a 64-bit
> ELF header.

That's as expected: Since you already have an x86 version, your
algorithm runs well with 6-7 registers, so the 8 extra (which require
longer encodings) doesn't really buy you anything?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stefan Monnier

unread,

Sep 27, 2013, 10:26:08 AM9/27/13

to

>>> As an aside, I also coded up support for the newish x86_32 (32-bit ILP

>> It should buy space on the D-Cache.
> It doesn't help with instruction code-density on my (obviously
> a bit contrived) benchmark though.

I don't think x86_32 intends to increase code density.

> If you write your x86_64 code carefully you can avoid a lot
> of the d-cache overhead by forcing 32-bit values, although I

The problem comes when you use systems where "everything is boxed".
E.g. Python, Javascript, Ruby, etc... objects will typically require that all
object fields need to be large enough to hold a pointer.

Stefan

Andy (Super) Glew

unread,

Sep 27, 2013, 11:23:12 AM9/27/13

to

Barrel shifters were common circa 1997.

Circa 2003 barrel shifters went out of fashion: too big and too slow.
Supplanted by multistage shifters. (Bruce G. at AMD in 2004 was
muttering about bring barrel shifters back.)

I suppose that it is possible that barrel shifters can still be found in
certain slow situations. But I don't think I have run into one recently.

However, I suspect that Ivan, being a software guy, is using the term
"barrel shifter" to mean "any combinatoric circuit that does the full
shift in hardware, possibly including the usual multistage shifter".

Ivan Godard

unread,

Sep 27, 2013, 12:10:41 PM9/27/13

to

On 9/27/2013 8:23 AM, Andy (Super) Glew wrote:
> On 9/26/2013 11:08 AM, Vince Weaver wrote:
>> On 2013-09-26, Ivan Godard <iv...@ootbcomp.com> wrote:
>>>
>>> The days when a barrel shifter was a significant cost are long gone.
>>
>> I thought barrel shifters were big and slow, especially as your bit
>> widths got wider. Wasn't that why there wasn't one on the Pentium 4?
>> Although I guess Pentium 4 can count as ancient history these days.
>>
>> Vince
>>
>
> Barrel shifters were common circa 1997.
>
> Circa 2003 barrel shifters went out of fashion: too big and too slow.
> Supplanted by multistage shifters. (Bruce G. at AMD in 2004 was
> muttering about bring barrel shifters back.)
>
> I suppose that it is possible that barrel shifters can still be found in
> certain slow situations. But I don't think I have run into one recently.
>
> However, I suspect that Ivan, being a software guy, is using the term
> "barrel shifter" to mean "any combinatoric circuit that does the full
> shift in hardware, possibly including the usual multistage shifter".
>
>

Yes I'm a software guy, but you make a distinction I have never seen
before. I'm thinking of a 64-bit shifter done as three stages of 4x1
muxes; https://en.wikipedia.org/wiki/Barrel_shifter has the same idea
only described using 2x1 muxes.

So what's a "barrel shifter" in your lexicon? A bucket brigade?

Andy (Super) Glew

unread,

Sep 27, 2013, 2:12:43 PM9/27/13

to

When I first encountered the term "barrel shifter" it referred to what
is essentially a crossbar network: e.g. input data coming in on the X
axis, wires on the Y axis, with switches connecting one X wire to one Y
wire at the crosspoints. (Typically with another set of output wires on
the X axis connected to each of Y wires, so that in and out would be
along the same axis.)

I think this was the term used by Mead Conway in the classic VLSI book.

This form of shifter is big and slow. Even though, at least
conceptually, each bit flows only through one crosspoint switch, one
gate, the long wires and the loading are quite expensive. (Might be fun
to figure out the logical effort.)

Multistage shift networks were used (a) prior to full custom VLSI, when
you had to build these things out of LSI and MSI components, and (b)
after people realized that long wires were bad even in VLSI. Plus, I
don't think most synthesis tools can produce long wire based designs.

I figured that you and ... whoever complained that Willamette/Pentium 4
did not have a barrel shifter ... were using the term "barrel shifter"
differently, because Willamette *had* a multistage shifter. But
certainly not a barrel shifter in the sense that I used. I think that
Wmt's shifter was fully pipelined - however, it was in the slow clock
domain, so took twice as long as an ALU instruction intrinsically, and
then had the overhead of getting out of and back into the fireball.

I.e. I am fairly sure that Wmt had a multistage shifter, a barrel
shifter in the terminology you use, Ivan. But not a barrel shifter in
the older terminology.

I have long been fascinated by how these terms change definitions over
time. I am fairly certain that at least one big patent lawsuit was
decided in favor of the patent owner, because the terminology drifted in
his favor.

By the way: I personally tend to use the term "barrel shifter" in the
most general sense. the way you do, Ivan. However, the folks at AMD
(who mainly came from DEC) were vociferous in telling me that it meant
something else to them.

Andy (Super) Glew

unread,

Sep 27, 2013, 2:28:11 PM9/27/13

to

This image is of a crossbar - what I call a specific meaning of "barrel
shiftedr".

http://www-comm.cs.shinshu-u.ac.jp/public/comparch/img72.gif

Also this 94 patent:"

http://www.google.com/patents?hl=en&lr=&vid=USPAT5416731&id=u1QmAAAAEBAJ&oi=fnd&dq=barrel+shifter&printsec=abstract#v=onepage&q=barrel%20shifter&f=false

But this 1987 patent uses the term "barrel shifter" for a multistage
shifter:

http://www.google.com/patents?hl=en&lr=&vid=USPAT4829460&id=HdQvAAAAEBAJ&oi=fnd&dq=barrel+shifter+patent+&printsec=abstract#v=onepage&q=barrel%20shifter%20patent&f=false

But it claims that the "conventional prior art" for a barrel shifter is
the sort of thing I was talking about.

I.e. it looks like the specific term barrel shift for a crossbar array
evolved to be applied to the multistage shift networks that accomplish a
similar purpose.

This 2002 paper talks about multistage using muxes, and uses the term
"barrel shifter" in a more general sense.

Pillmeier, Matthew R., Michael J. Schulte, and Eugene G. Walters III.
"Design alternatives for barrel shifters." Proceedings of SPIE—Advanced
Signal Processing Algorithms, Architectures, and Implementations XII
4791 (2002): 436-447.

https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/Shifter_Schulte.pdf

Ivan Godard

unread,

Sep 27, 2013, 2:40:37 PM9/27/13

to

Interesting; I had never heard of using a full crossbar for a barrel
shifter. The crossbar is serious overkill for just a shift; it
implements an arbitrary bit shuffle. I suppose the byte shuffles of the
x86 and other chips use a crossbar too - does anyone implement a full
bit shuffle? Your bit-matrix-multiplies are related too; does that need
a crossbar as well?

BGB

unread,

Sep 27, 2013, 3:05:15 PM9/27/13

to

On 9/27/2013 1:12 AM, Terje Mathisen wrote:
> Vince Weaver wrote:
>> Hello
>>
>> A few years back we had a discussion here on code density, and on my
>> ICCD'09 paper where I hand-assembled a small program on 20+ architectures
>> and did a size comparison.
>> http://www.deater.net/weave/vmwprod/asm/ll/
> ...
>> So far, the LZSS routine in my code looks like this:
>> THUMB2: 76 bytes
>> THUMB: 76 bytes
>> ARM64: 96 bytes
>> ARM32: 116 bytes
>> (for comparison, x86 is 63 bytes)
>
> Have you looked at LZ4?
>
> I really like this algorithm due to the combination of an easy to
> explain/implement algorithm, pretty good compression ratios, and
> _really_ fast decompression.
>
> The really fun part of lz4 decompression is how it can often run faster
> than a pure memcpy() (REP MOVS) block copy operation. :-)
>

yes, this is a good point in-general of byte-based LZ77 variants:
they are fast.

LZ4 looks interesting.

a lot of my own designs (for bytewise LZ77), had usually been built
around value-ranges, such as a range for literal byte-data, and another
range to indicate the start of a run (and its specific format).

example:
0x00-0x7F: raw byte (0-127);
0x80-0x9F: run of raw bytes, followed by 1-32 literal bytes;
0xA0-0xBF: short LZ run (1-32 bytes, 8-bit distance);
0xC0-0xDF: deep LZ run (1-32 bytes, 16-bit distance);
0xE0: RLE (length=byte, dist=1);
0xE1: long LZ run A (length=byte, dist=byte);
0xE2: long LZ run B (length=byte, dist=word);
...

usually, the exact format is fairly specialized to the use case and type
of data though.

block-oriented LZ77 can also be pretty fast, which is basically the same
except using the "block" as a basic unit of literal data.

different variants have different block sizes, ex: 16/32/64/128 bits.

typically, these blocks can be copied as a single unit (such as a CPU
word or SIMD value).

it depends some on the specific data and use-case though whether
something like this makes sense (for example: most of my current uses of
the above strategy have generally been for "fast" video codecs, which
mostly work by copying around blocks of pixels).

timca...@aol.com

unread,

Sep 27, 2013, 3:51:08 PM9/27/13

to

On Friday, September 27, 2013 2:28:11 PM UTC-4, Andy (Super) Glew wrote:
> On 9/27/2013 11:12 AM, Andy (Super) Glew wrote:
>

[stuff about barrel shifters]

The description Ivan used fits with what I learned in school in ~1979.

I also thought it was either Burroughs or CDC that had the original
patent. (So, 1963ish).

- Tim

Terje Mathisen

unread,

Sep 27, 2013, 4:51:41 PM9/27/13

to

BGB wrote:
> On 9/27/2013 1:12 AM, Terje Mathisen wrote:
>> Have you looked at LZ4?
>>
>> I really like this algorithm due to the combination of an easy to
>> explain/implement algorithm, pretty good compression ratios, and
>> _really_ fast decompression.
>>
>> The really fun part of lz4 decompression is how it can often run faster
>> than a pure memcpy() (REP MOVS) block copy operation. :-)
>>
>
> yes, this is a good point in-general of byte-based LZ77 variants:
> they are fast.
>
> LZ4 looks interesting.

As a programmer I liked the way they make life easy for the
decompressor, by specifying that literal data and pointers to previous
blocks will always alternate and that the input will always end with a
block of literal data, i.e. a built-in guard which means that you can
always process a pair of (back_pointer)(literal) blocks between each
test for end_of_input.

Vince Weaver

unread,

Sep 27, 2013, 5:24:46 PM9/27/13

to

On 2013-09-27, Robert Wessel <robert...@yahoo.com> wrote:
> As an aside, IPF never actually did that - there was certainly
> hardware dedicated to x86 emulation in the early iterations, but it
> was never complete (for example, most FP stuff went though the normal
> IPF FPU), nor was it ever a "full featured" x86 core. No real mode,
> you could not boot the processor with x86 code, virtual memory was
> built on top of IPF's VM, no virtual x86 mode, almost none of the
> privileged stuff was there (IOW, it could in no way run an x86 OS),
> etc.

You're right, I was mis-remembering a die shot I had seen that had a
big square chunk labeled "x86" but I found that again and it's actually
labeled "x86 control" and there's a blurb in the paper about how
the x86 emulation uses a lot of the ia64 hardware.

I actually had access to one of those early ia64 machines for a while
and ran a lot of x86 benchmarks on it. It was interesting, as there
was a special hardware performance counter for measuring x86 instructions
and it did as good a job, if not better, than the equivelant counter
on various other x86 machines I had access to at the time.

In any case, I'm still curious how ARM implements things, I thought
that they in theory have experience with heterogeneous cores with the whole
BIG.little thing so maybe they could work out how to put a full arm32
core on there, but I don't think the A7 die shots show anything like that.

Vince

Vince Weaver

unread,

Sep 27, 2013, 5:37:01 PM9/27/13

to

On 2013-09-27, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
> Have you looked at LZ4?
>
> I really like this algorithm due to the combination of an easy to
> explain/implement algorithm, pretty good compression ratios, and
> _really_ fast decompression.

I haven't looked at it before.

Hmm, someone has posted LZ4 decompression code for 8088 that fits in 79 bytes.

My 8086 LZSS code is 58 bytes, although in theory the difference could
be made up by better compression of the output ASCII Art in my benchmark.

I probably wouldn't want to move my benchmark over to LZ4 at this point
because then I'd have to re-optimize all 28 of the other architectures
I've already done and that would take a while, especially as I don't
have a lot of the hardware anymore.

> That's as expected: Since you already have an x86 version, your
> algorithm runs well with 6-7 registers, so the 8 extra (which require
> longer encodings) doesn't really buy you anything?

The code actually works best with 8 registers, but it keeps the 8th on top of
the stack so it works well on x86.

So yes, none of the new features of x86_64 help at all, and losing
the single-byte increments actually hurt a lot.

The x86 code in my benchmark is actually the most highly optimized, and
the only one (besides z80) that anyone besides me has helped with. I'm
sure there's still room for improvement though.

I also made a try at using the new SSE4 string instructions to see if
they could help with the strlen/strcat routines in the code (a lot of
people had suggested this to me) but that doesn't really buy you anything
size-wise as far as I could tell.

Vince

timca...@aol.com

unread,

Sep 27, 2013, 6:27:38 PM9/27/13

to

On Friday, September 27, 2013 5:37:01 PM UTC-4, Vince Weaver wrote:

>
>
> The x86 code in my benchmark is actually the most highly optimized, and
>
> the only one (besides z80) that anyone besides me has helped with. I'm
>
> sure there's still room for improvement though.
>

Yeah, I sent you an e-mail on how to get 2 more bytes out of the lzss, and
perhaps more from the find_string routine...

- Tim

Robert Wessel

unread,

Sep 27, 2013, 7:28:26 PM9/27/13

to

Architecturally, ARM32 registers, and whatnot, are mapped onto the
ARM64 registers. As I read the architectural definition, there's
specific provision for allowing the states to be copied between
hardware units at the transitions between A32 and A64 modes. So at
least theoretically things could be pretty well separated. OTOH, it
looks like A64 provides pretty straight-forward mappings of most A32
instructions (predication is the biggest omission). My guess is that
the intended implementation is to mainly separate the A32/A64 ISA
decode in the front end.

The problem with having truly heterogeneous and separate cores (which
I don't really consider big.LITTLE as having*) is that it hugely
complicates the OS and the transition mechanism. I could see an A32
execution unit, that still used the A64's memory and system management
stuff (at least in the modes where you weren't running A32 mode at the
bottom EL level), but I don't see a reasonable case for two complete
cores - too much duplication, and one will always be idle, not to
mention the complexity of managing the two.

I could see several likely implementations, depending on the required
A32 performance. For maximum performance, you'd need hardware in the
core to support almost all of the A32 features. Most of that has to
there anyway, but a few things, like predication, don't. For a
slightly lower performance A32 mode, you can split most of the
problematic A32 instructions into a couple of A64 micro-ops (for
example, you could turn a predicated instruction into a conditional
branch and an operational instruction). For the lowest requirement,
you could do emulation. And even there, the similarity of the
instruction sets and semantics should make (JIT) binary recompilation
pretty reasonable.

*Sure the implementation of the big and LITTLE cores is very
different, but they look (except for performance) almost identical
from the OS's perspective.

BGB

unread,

Sep 27, 2013, 11:53:50 PM9/27/13

to

On 9/27/2013 3:51 PM, Terje Mathisen wrote:
> BGB wrote:
>> On 9/27/2013 1:12 AM, Terje Mathisen wrote:
>>> Have you looked at LZ4?
>>>
>>> I really like this algorithm due to the combination of an easy to
>>> explain/implement algorithm, pretty good compression ratios, and
>>> _really_ fast decompression.
>>>
>>> The really fun part of lz4 decompression is how it can often run faster
>>> than a pure memcpy() (REP MOVS) block copy operation. :-)
>>>
>>
>> yes, this is a good point in-general of byte-based LZ77 variants:
>> they are fast.
>>
>> LZ4 looks interesting.
>
> As a programmer I liked the way they make life easy for the
> decompressor, by specifying that literal data and pointers to previous
> blocks will always alternate and that the input will always end with a
> block of literal data, i.e. a built-in guard which means that you can
> always process a pair of (back_pointer)(literal) blocks between each
> test for end_of_input.
>

fair enough.

yeah, in more usual variants, it is necessary to check whether one has
literal data, a run, or an end-of-input, for each iteration around the
main loop (as well as maybe sanity checks to avoid running past the ends
of the buffers, ...).

> Terje
>

Bakul Shah

unread,

Sep 28, 2013, 2:44:08 AM9/28/13

to

On 9/27/13 11:28 AM, Andy (Super) Glew wrote:
> I.e. it looks like the specific term barrel shift for a crossbar array evolved to be applied to the
> multistage shift networks that accomplish a similar purpose.

My understanding is similar to Ivan's. BTW, who came up with this term?
Sounds much better than a `shifting circuit'!

Bengt Larsson

unread,

Sep 28, 2013, 8:02:24 AM9/28/13

to

"Andy (Super) Glew" <an...@SPAM.comp-arch.net> wrote:

>I.e. it looks like the specific term barrel shift for a crossbar array
>evolved to be applied to the multistage shift networks that accomplish a
>similar purpose.

I guess I can third this. That is what I thought it meant also.

Terje Mathisen

unread,

Sep 28, 2013, 8:45:22 AM9/28/13

to

Vince Weaver wrote:
> On 2013-09-27, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>
>> Have you looked at LZ4?
>>
>> I really like this algorithm due to the combination of an easy to
>> explain/implement algorithm, pretty good compression ratios, and
>> _really_ fast decompression.
>
> I haven't looked at it before.
>
> Hmm, someone has posted LZ4 decompression code for 8088 that fits in 79 bytes.

Is that the "OldSchool Rambler" or is his code even smaller?

>
> My 8086 LZSS code is 58 bytes, although in theory the difference could

Wow!

> be made up by better compression of the output ASCII Art in my benchmark.
>
> I probably wouldn't want to move my benchmark over to LZ4 at this point
> because then I'd have to re-optimize all 28 of the other architectures
> I've already done and that would take a while, especially as I don't
> have a lot of the hardware anymore.

Sure, I wouldn't expect you to.

>
>> That's as expected: Since you already have an x86 version, your
>> algorithm runs well with 6-7 registers, so the 8 extra (which require
>> longer encodings) doesn't really buy you anything?
>
> The code actually works best with 8 registers, but it keeps the 8th on top of
> the stack so it works well on x86.
>
> So yes, none of the new features of x86_64 help at all, and losing
> the single-byte increments actually hurt a lot.
>
> The x86 code in my benchmark is actually the most highly optimized, and
> the only one (besides z80) that anyone besides me has helped with. I'm
> sure there's still room for improvement though.

"Any program can be reduced by at least one instruction" and "Any
program contains at least one bug" reduces to "Any program can be
reduced to a single instruction, which will be wrong."

Case in point:

Both IBM's IEFBR14 and the MSDOS RET program used to consist of a single
instruction which just returned to the OS.

They had the common bug that the return code/exit code would be whatever
happened to be left in the relevant register by the OS and/or previous
program.

>
> I also made a try at using the new SSE4 string instructions to see if
> they could help with the strlen/strcat routines in the code (a lot of
> people had suggested this to me) but that doesn't really buy you anything
> size-wise as far as I could tell.

Fancy speed optimizations are almost never size optimization, and the
SSE opcodes are all pretty long.

Yes, they can process 8/16/32/64 items with a single instruction, but
the processing loop will be significantly more complicated than a naive
REP SCAS/REP MOVS version.

BGB

unread,

Sep 28, 2013, 1:23:31 PM9/28/13

to

On 9/28/2013 7:45 AM, Terje Mathisen wrote:
> Vince Weaver wrote:
>> On 2013-09-27, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>>
>>> Have you looked at LZ4?
>>>
>>> I really like this algorithm due to the combination of an easy to
>>> explain/implement algorithm, pretty good compression ratios, and
>>> _really_ fast decompression.
>>
>> I haven't looked at it before.
>>
>> Hmm, someone has posted LZ4 decompression code for 8088 that fits in
>> 79 bytes.
>
> Is that the "OldSchool Rambler" or is his code even smaller?
>>
>> My 8086 LZSS code is 58 bytes, although in theory the difference could
>
> Wow!
>

yeah.

me idly remembering trying to fit useful amounts of stuff into a
512-byte bootsector.

some people were off loading a kernel from a filesystem, decompressing
it, setting up P-Mode, and jumping to the kernel, somehow all within a
single boot-loader.

I generally just used the strategy of loading the kernel and a
"second-stage" loader (simpler, less hairy), which would itself handle
any decompression, image loading, and setting up P-Mode (generally at
the time was using FAT12/16/32 and LZ + PE/COFF).

I could probably do it all a little better now, but alas...

it would be hard to beat out just using an existing kernel (such as the
Linux kernel), which at the time, this realization killed the effort
(like, getting it working, only to realize just how far it was behind
being a "real" OS).

yep.

x86 instruction encoding strategy:
throw a bunch of magic prefix bytes at the problem.

had more of the 1-byte space been used for 2-byte opcodes, maybe there
would probably be less overly long instruction forms.

like, a little less:
660FXX/r and F30FXX/r and 660F38XX/r and similar.

> Yes, they can process 8/16/32/64 items with a single instruction, but
> the processing loop will be significantly more complicated than a naive
> REP SCAS/REP MOVS version.
>

I once did a test, and found that past a certain size, on the hardware I
was using at the time (early 2000s), the method used to copy memory
didn't really make much of a difference WRT performance.

so, it was like a simple "REP MOVSB" loop, vs using SSE operations for
copies, vs running an LZ77 decompressor, all running at pretty much
*exactly* the same speed.

with a newer computer with DDR3, there doesn't really seem to be such an
obvious speed-ceiling though (so, the choice of copy operation matters a
little more).

also, the CPU and RAM have gotten sufficiently faster relative to the
HDD that storing data in a compressed form and decompressing in RAM can
be faster (*1).

though, granted, in many cases, a lot of these gains can offset by the
OS's disk-cache, where file-IO is pretty fast if the data happens to be
being cached in RAM by the OS (but is much less likely for large files).

*1: I suspect current high-res video playback is only really possible
due to this: uncompressed 1080p would require more bandwidth than is
available from a typical HDD.

granted, there is a wide range of possibilities:
H.264: compresses well, but is a little expensive to decode;
H.263: compresses slightly worse, but is notably faster;
Theora: seems fairly similar to H.263 as far as I can tell;
...
RPZA: compresses pretty bad, but very fast decoders are possible (*2).

*2: for some use-cases, like animated textures (very short looping
animations), it seems pretty good, but not so great for video (in
general). for video in 1080p or 4k resolutions, disk IO is likely to
kill performance, likely at minimum making entropy coding necessary, but
giving more of an advantage to other formats.

actually, disk-IO is a similar issue for trying to use DDS (trying to
load DDS files is somewhat IO bound). IME, most of the
size/quality/speed tradeoff regarding texture loading goes to a
customized JPEG variant (which also works pretty well for animated
textures, but still isn't really great for video).

Theora does pretty good for video, but isn't really a particularly great
option for animated textures. it is likely a pretty good option for
things like cutscenes or similar though.

or such...

Terje Mathisen

unread,

Sep 28, 2013, 2:30:39 PM9/28/13

to

BGB wrote:
> me idly remembering trying to fit useful amounts of stuff into a
> 512-byte bootsector.

I never did that, but I did spend quite a bit of time size-optimizing
the resident parts of Dos "TSR" programs, in order to leave as much
usable memory as possible.

Case in point:

My replacement for the 60 KB IBM used for Norwegian screen font and kbd
drivers needed something like 696 bytes (which got rounded up to 704),
including the unavoidable DOS overheads.

I also wrote a 1600 byte resident size shared printer driver, i.e. to
allow any printer connected to a PC on the network to be used by anyone:

Those 1600 bytes included HW interrupt drivers for the 18.2 HZ timer and
the network stack as well as any serial or parallel ports, dual 512-byte
network packet buffers, the local stack and all the network housekeeping.

The most fun part: I just sat down one morning with no written design,
just a plan in my head, and started writing both the client and the
server parts.

5 hours later both were done, the asm (client) code had two or three
syntax errors, but as soon as I fixed those so it would assemble,
everything just worked.

Needless to say, this has never happened to me again, but I still
believe that every new program I write will be bugfree and work on the
first attempt.

Programmers (at least old programmers!) needs to be optimists, otherwise
we would have quit a long time ago!

BGB

unread,

Sep 28, 2013, 3:52:07 PM9/28/13

to

a lot of other older programmers I had dealt with had generally been
more of the type that were condescending and looked down on everyone,
got all puffed up about how programming was "serious business" and "time
is money" and whatever, and generally looked down on people who wrote
their own code (vs using whatever was the next new/shiny API or
framework put out by Microsoft, or Oracle).

so, I had sat around being sort of the anathema programmer, mostly
fiddling around with whatever and writing my own code for most things
(in "arcane, low-level" languages, like C), and have thus far generally
lived life not making any money from doing programming stuff, and
generally seeing the importance of standards/conventions/... as more
relative than absolute, me not really caring much for UML diagrams, ...

but, then at the same time, regularly being hated-on by people, and not
really able to get a job anywhere, ... but alas...

never did much serious MS-DOS real-mode development.

when I was doing stuff in DOS, I was mostly still learning stuff early
on, and soon moved to 32-bit development (via compilers with "DOS
extenders", like DJGPP and similar).

most other development has been in 32-bit environments, generally Linux
or Windows.

for example, I was developing my OS project mostly in a Windows 2000 and
later Windows XP environment, using MinGW to compile the kernel. using
PE/COFF "made sense", since this is what MinGW produced, though there
were a few holes in my understanding, and I later noted that my loader
made some assumptions which would not have worked, for example, had I
been using the MS linker.

after getting the OS about as far as having a basic GUI and Ethernet +
TCP/IP networking and similar, I can to the conclusion that basically
there was little hope of being able to compete well with mainstream
OS's, mostly due to the extreme amounts of driver development which
would have been required.

so, the whole OS effort was then mostly abandoned, though a lot of code
had been kept from it and reused in later projects.

a lot of people that wandered off into OS development after I had pretty
much given up on it had mostly been using GRUB+ELF for loading.

EricP

unread,

Sep 28, 2013, 4:32:00 PM9/28/13

to

Andy (Super) Glew wrote:
>
> I.e. it looks like the specific term barrel shift for a crossbar array
> evolved to be applied to the multistage shift networks that accomplish a
> similar purpose.

I had always thought that the term barrel shifter went back to
the TTL MSI days. Originally cpus only had 1 bit shift instructions.
As circuit density increased they added multi-bit one-clock shifters.
The pattern of wires on the micrograph looked like a wire barrel,
and the name stuck, at least for the general public.
The colloquial use of the term came to mean any single clock multi bit
shifter as people would ask 'does it use a barrel shifter' to mean
'does it shift multiple bits in one clock'.

In trying to find an example micrograph I discovered that the term
used in cpus used to be "barrel switch", such as this 1969 patent

Electronic barrel switch for data shifting (1969)
http://www.google.ca/patents/US3610903

and I found that "barrel switch" goes back at least to
this 1899 patent for a parallel switch controlling trains:

Controller for electric-railway cars (1899)
http://www.google.ca/patents/US645767?pg=PA10&dq=%22barrel+switch%22&hl=en&sa=X&ei=1DZHUuz1IsTfyQG4ioFg&ved=0CEAQ6AEwAQ

The OCR for these old patents is a little screwy - the PDFs are better.

So where the "barrel" comes from is still a mystery.

Eric

Ivan Godard

unread,

Sep 28, 2013, 6:14:38 PM9/28/13

to

*Very* cool. Thank you.

Ivan

Bruce Hoult

unread,

Sep 28, 2013, 6:54:22 PM9/28/13

to

On Friday, September 27, 2013 3:45:08 AM UTC+12, Vince Weaver wrote:
> As an aside, I wonder if the arm32/thumb2 compatability
> layer on the arm64 chips is mixed in (and so they have
> to have a barrel shifter in each ALU just in case),

arm64 allows a shift/rotate on one register argument.

> if they break the arm32 stuff up into micro-ops and
> the complex instructions are just slow, or if they

They must do with load/store/push/pop multiple, otherwise there's no reason to leave it out of arm64.

> do something like the itanium did with x86 compatability
> and just drop a full-featured arm32 chip on the die
> for use in such cases.

That is contraindicated by the iPhone 5s running similar 32 bit code at essentially the same speed as 64 bit code.

Stephen Fuld

unread,

Sep 29, 2013, 5:21:04 PM9/29/13

to

On 9/28/2013 1:32 PM, EricP wrote:

snip

> So where the "barrel" comes from is still a mystery.

I have no specific knowledge, but it occurred to me that it might be
related to the barrel in a revolver gun. There might be room for say
six bullets in the barrel, and when you turn it, all bullets move
(shift) one place over. If you turn it six places, all the bullets are
back in their original place.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

EricP

unread,

Sep 29, 2013, 5:35:57 PM9/29/13

to

Stephen Fuld wrote:
> On 9/28/2013 1:32 PM, EricP wrote:
>
> snip
>
>> So where the "barrel" comes from is still a mystery.
>
> I have no specific knowledge, but it occurred to me that it might be
> related to the barrel in a revolver gun. There might be room for say
> six bullets in the barrel, and when you turn it, all bullets move
> (shift) one place over. If you turn it six places, all the bullets are
> back in their original place.

I was thinking along similar lines.
The 1899 patent shows a rotary switch.
A barrel switch may be a rotary switch with one or more disks,
each disk with multiple positions, rotated manually or with
stepper solenoids. The whole thing could be put in a metal tube
or "barrel" for protection.

And one could think of the electronic equivalent
as rotating a selector to the proper position.

Eric

Ivan Godard

unread,

Sep 29, 2013, 7:34:05 PM9/29/13

to

On 9/29/2013 2:21 PM, Stephen Fuld wrote:
> On 9/28/2013 1:32 PM, EricP wrote:
>
> snip
>
>> So where the "barrel" comes from is still a mystery.
>
> I have no specific knowledge, but it occurred to me that it might be
> related to the barrel in a revolver gun. There might be room for say
> six bullets in the barrel, and when you turn it, all bullets move
> (shift) one place over. If you turn it six places, all the bullets are
> back in their original place.
>
>
>
>

Not a revolver, in which the chambers rotate but there is only one
barrel, but maybe a Gatling.

Rob Warnock

unread,

Sep 30, 2013, 1:19:16 AM9/30/13

to

Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
+---------------

| EricP wrote:
| > So where the "barrel" comes from is still a mystery.
|
| I have no specific knowledge, but it occurred to me that it might be

| related to the barrel in a revolver gun. ...
+---------------

I've always [for several decades, at least] visualized a
beer or wine barrel, with machine words being wrapped around
the circumference of the barrel, like the "hoops" at the ends,
and the vertical "staves" being bit lanes between the input
(bottom) and the output (top). If you twist the whole top of
the barrel with respect to the bottom, machine words you put
in the bottom come out "twisted" (rotated) at the top.

And if you build a truncated-cone barrel which is twice the
circumference at the bottom as at the top, you can get the
Am29000 "EXTRACT" instruction, which concatenates two registers
[which need not be sequential], barrel-shifts them by 0-31, and
then extracts a contiguous 32-bit result into the result register.
Which may be why the 29k documentation calls it a "funnel shifter"
rather than a "barrel shifter. ;-} See here:

http://bitsavers.trailing-edge.com/pdf/amd/Am29000/1987_Am29000_Users_Manual.pdf

Section 4.3.4 "FIELD SHIFT UNIT" (pp. 4-17 & 4-18) and the
"EXTRACT" instruction in Section 8.3 "INSTRUCTION FORMATS"
(p. 8-69). [Note that if srcA & srcB are the same, then EXTRACT
is simply a 0-31 bit rotate.]

</29k_nostalgia_mode>

-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <http://rpw3.org/>
San Mateo, CA 94403

Terje Mathisen

unread,

Sep 30, 2013, 3:26:23 AM9/30/13

to

Rob Warnock wrote:
> And if you build a truncated-cone barrel which is twice the
> circumference at the bottom as at the top, you can get the
> Am29000 "EXTRACT" instruction, which concatenates two registers
> [which need not be sequential], barrel-shifts them by 0-31, and
> then extracts a contiguous 32-bit result into the result register.
> Which may be why the 29k documentation calls it a "funnel shifter"
> rather than a "barrel shifter. ;-} See here:
>
> http://bitsavers.trailing-edge.com/pdf/amd/Am29000/1987_Am29000_Users_Manual.pdf
>
> Section 4.3.4 "FIELD SHIFT UNIT" (pp. 4-17 & 4-18) and the
> "EXTRACT" instruction in Section 8.3 "INSTRUCTION FORMATS"
> (p. 8-69). [Note that if srcA & srcB are the same, then EXTRACT
> is simply a 0-31 bit rotate.]

Except for having three instead of just two register specifiers (so the
target is implied to be the first source register), this seems very
similar to the SH[RL]D Shift Right/Left Double x86 instructions

SHRD EAX,EDX,nn

Bakul Shah

unread,

Sep 30, 2013, 4:39:50 AM9/30/13

to

The EXTRACT instruction was handy when you wanted to shift
a contiguous array by an arbitrary amount (e.g bignum*2^N).

As for the "barrel shifter" term, may be it has to do with
a gun barrel. Its striations force a fired bullet into
spinning. Sort of like how an input word comes out of
a barrel shifter rotated. But this is just speculation.

tridac

unread,

Sep 30, 2013, 5:08:52 PM9/30/13

to

On 09/29/13 21:21, Stephen Fuld wrote:

>
> I have no specific knowledge, but it occurred to me that it might be
> related to the barrel in a revolver gun. There might be room for say six
> bullets in the barrel, and when you turn it, all bullets move (shift)
> one place over. If you turn it six places, all the bullets are back in
> their original place.
>
>

Similar knowledge state, but my visualisation was a 2D shift register with
the n bit outputs fed back into the inputs to allow rotates as well. Such a
configuration would allow shifting or transferring bits from any part of the
matrix to any other part by selective use of shift register load and clock
lines. Probably a bit slow for the modern idiom though.

Didn't the 2900 bit slice family have such a facility ?...

Chris

--
** Remove the meaning of life to reply...

Andy (Super) Glew

unread,

Sep 30, 2013, 6:24:32 PM9/30/13

to

I can envisage switches where the input signals are all on the
circumference of a first circular pipe of first wooden disk, and the
outputs on a second, with the cross wiring and switching lying between
the disks. Close up the space between the disks to protect it from the
elements, et voila!

Circuits and switching networks in 3D are fun.

I do not think, however, that the technology used in telegraphy admitted
Wired-OR or the equivalents. Which would suggest that the original
barrel shifters, if my guess is correct, were multilayer switches.

Or... well, the "rings" of the barrel could be both the controls and the
routing between circularly interleaved input and output wires.

--
The content of this message is my personal opinion only. Although I am
an employee (currently Imagination Technologies, which acquired MIPS; in
the past of companies such as Intellectual Ventures/QIPS, Intel, AMD,
Motorola, and Gould), I reveal this only so that the reader may account
for any possible bias I may have towards my employers' products. The
statements I make here in no way represent my employers' positions on
the issue, nor am I authorized to speak on behalf of my employers, past
or present.

Andy (Super) Glew

unread,

Sep 30, 2013, 6:34:40 PM9/30/13

to

On 9/27/2013 11:40 AM, Ivan Godard wrote:
> On 9/27/2013 11:12 AM, Andy (Super) Glew wrote:
>> On 9/27/2013 9:10 AM, Ivan Godard wrote:
>>> On 9/27/2013 8:23 AM, Andy (Super) Glew wrote:
>>>> On 9/26/2013 11:08 AM, Vince Weaver wrote:
>>>>> On 2013-09-26, Ivan Godard <iv...@ootbcomp.com> wrote:

>
> Interesting; I had never heard of using a full crossbar for a barrel
> shifter. The crossbar is serious overkill for just a shift; it
> implements an arbitrary bit shuffle. I suppose the byte shuffles of the
> x86 and other chips use a crossbar too - does anyone implement a full
> bit shuffle? Your bit-matrix-multiplies are related too; does that need
> a crossbar as well?

The crossbar allows arbitrary shuffling.

The problem is getting the control bits into the crossbar.

Shifts allow a wiring pattern where the control flows diagonally across
the crossbar.

There is no good way to get an arbitrary wiring pattern of 1s and 0s to
control individual crosspoints, on a cycle by cycle basis. M-bit wide
input, N-bit wide output, MxN crosspoint bits.

If you are going to do several permutations using the same control, you
can shift the control in serially, perhaps 1 bit wide, or perhaps N
(e.g. N=64 bits wide) at a time. And then store the control at the
crosspoints. Possibly several different permutations.

Cray did a 64x64 Bit Matrix Multiply. Tera a decomposed this - instead
of 64x64, they did narrower, 8x8 or 16x16, but SIMD vector style.

Bit-Matrix-MUltiply conceptually needs a crossbar. You can always split
it up. Permutes split up more easily that a full BMM.

Terje Mathisen

unread,

Oct 1, 2013, 7:11:41 AM10/1/13

to

Andy (Super) Glew wrote:
> Bit-Matrix-MUltiply conceptually needs a crossbar. You can always split
> it up. Permutes split up more easily that a full BMM.

Any, I would guess that in theory you only need ceil(log2(n!)) bits to
specify any possible permute (in the meaning of card shuffle), but all
the implementations I have seen allow duplicates, i.e. a single input
bit/block can be routed to multiple outputs.

In this case you must have at least log2(n)*n control bits, so a 64 bit
version requires 6*64 bits, which would be doable using 1 input reg, 6
control regs and an output reg. :-)

It seems more feasible to do this for SIMD, where a 64-byte (512-bit
wide) byte shuffle can be easily specified using 6 out of each 8-bit
byte in a control reg.

For byte permutes this would work up to 2048 bit wide SIMD registers.

Smaller widths use fewer bits out of each byte, afair Altivec had a pair
of 16-byte source regs, using 5 bits to specify each source?

adaptive...@gmail.com

unread,

Oct 1, 2013, 10:39:24 AM10/1/13

to

Hi Andy-san,

>The problem is getting the control bits into the crossbar

Crossbar can be implemented by multiplexers?
Then it is same form as shifters, I think.

Regards,
S.Takano

Andy (Super) Glew

unread,

Oct 1, 2013, 1:25:27 PM10/1/13

to

So much for permutations, both with and without repetitions.

Bit Matrix Multiply also supports computations that are not
permutations. Although there you end up with the "What BMM do you
want?" AND-OR? AND-XOR? AND-POPCOUNT? Certain of these support
permutations as a special case, where only one bit of the input
contributes to any given bit of the output.

AND-OR BMM can be done with wired logic, if your technology supports
wired logic. Wired logic admits crossbar arrays.

Most other types of BMM, like AND-XOR and AND-POPCOUNT, cannot be done
with wired logic. So hence no crossbars.

Andy (Super) Glew

unread,

Oct 1, 2013, 6:43:55 PM10/1/13

to

On 10/1/2013 10:25 AM, Andy (Super) Glew wrote:
> On 10/1/2013 4:11 AM, Terje Mathisen wrote:
>> Andy (Super) Glew wrote:
>>> Bit-Matrix-MUltiply conceptually needs a crossbar. You can always split
>>> it up. Permutes split up more easily that a full BMM.
>>
>> Any, I would guess that in theory you only need ceil(log2(n!)) bits to
>> specify any possible permute (in the meaning of card shuffle), but all
>> the implementations I have seen allow duplicates, i.e. a single input
>> bit/block can be routed to multiple outputs.
>>
>> In this case you must have at least log2(n)*n control bits, so a 64 bit
>> version requires 6*64 bits, which would be doable using 1 input reg, 6
>> control regs and an output reg. :-)
>>
>> It seems more feasible to do this for SIMD, where a 64-byte (512-bit
>> wide) byte shuffle can be easily specified using 6 out of each 8-bit
>> byte in a control reg.

For bit permutes, all permutations, with repetitions, of a 64 bit input,
can be expressed with 64 * 6 bits.

I.e. a 512 bit wide SIMD register suffices to specify all bit
permutations of 64 bits. Using 6 bits per byte.

Heck, a 512 bit wide SIMD register is sufficient for a concatenate and
permute bits , as one sometimes needs to do for misaligned packed data
of wrong endianness.

But, even though the permutation can be described in an operand, it is
still a pain to connect those bits to the permutation logic.

Mike Stump

unread,

Oct 2, 2013, 11:10:15 PM10/2/13

to

In article <82H1u.130379$m63....@fx13.iad>,

EricP <ThatWould...@thevillage.com> wrote:
>I had always thought that the term barrel shifter went back to
>the TTL MSI days.

>In trying to find an example micrograph I discovered that the term
>used in cpus used to be "barrel switch", such as this 1969 patent
>
>Electronic barrel switch for data shifting (1969)
>http://www.google.ca/patents/US3610903
>
>and I found that "barrel switch" goes back at least to
>this 1899 patent for a parallel switch controlling trains:

>So where the "barrel" comes from is still a mystery.

I have no knowledgable insight to add, but, a random guess, take a
look at cam switches and drum switches. I suspect a barrel switch
came from one of those, but the difference I can't guess at. Maybe
related to just the housing around a drum/cam switch?

mac

unread,

Oct 4, 2013, 1:49:42 PM10/4/13

to

>>> me idly remembering trying to fit useful amounts of stuff into a
>>> 512-byte bootsector.

Code bloat!

80 columns * 12 bits / 36 bit instr = 26 instrs
Booting Honeywell card image, usually from tape or disk.

Bonus for a 27th zero-filled instr.

BGB

unread,

Oct 4, 2013, 5:46:41 PM10/4/13

to

yes, but it probably wasn't dealing with FAT or trying to set-up x86
protected mode, or deal with LZ77 compression and PE/COFF loading.

dealing with all this generally requires several kB of code, so the
usual thing then is loading the kernel and setup initially from the FAT
filesystem in the bootsector, and then doing the decompression, PE/COFF
loading, and setting up protected mode, in the setup program.

some people had also done loading + pmode switch in the bootsector, but
then you generally have to use a raw uncompressed kernel binary, and
forsake "little things" like having paging and interrupts also set up
directly following the pmode switch.

also, the "setup" stub gives a place for BIOS stubs.

MitchAlsup

unread,

Oct 5, 2013, 2:44:24 PM10/5/13

to

On Friday, September 27, 2013 1:40:37 PM UTC-5, Ivan Godard wrote:
> Interesting; I had never heard of using a full crossbar for a barrel
> shifter. The crossbar is serious overkill for just a shift; it
> implements an arbitrary bit shuffle. I suppose the byte shuffles of the
> x86 and other chips use a crossbar too - does anyone implement a full
> bit shuffle? Your bit-matrix-multiplies are related too; does that need
> a crossbar as well?

The reason barrel shifters went out of fassion is that they are:
2*(operand width)
whereas multiplexer shifters are only
(operand width)*(1+log[muxinputs])
wide.

Terje Mathisen

unread,

Oct 5, 2013, 5:52:56 PM10/5/13

to

Huh???

How can (1+log(muxinputs)) (I assume this is log2()?) be significantly
less than 2?

Is the first 2^width (i.e. power instead of mult)?

Joe Pfeiffer

unread,

Oct 5, 2013, 7:35:03 PM10/5/13

to

Weren't barrel shifters n*n?

Andy (Super) Glew

unread,

Oct 5, 2013, 8:49:18 PM10/5/13

to

Care to explain what you mean by "barrel shifter" here, Mitch?

I call shifters that concatenate two N-bit inputs and do a shift of up
to N bits "funnel shifters".

I.e. in my terminology a "funnel shift" is a type of shift, like
arithmetic versus logical shift.

Whereas I use(d) "barrel shifter" to refer to a particular type of
shifting circuit implementation. Compared, e.g. to multiple stages of muxes.

In my terminology, a barrel shift circuit or a mux shift circuit could
be used to implement any of the following shift types:
* right shifts
* arithmetic (sign extended)
* logical (zero extended)
* funnel (concatenated, shifting in bits from another input)
* left shifts
* logical
* rarely, arithmetic (tweaked so that i >> 1 == i / 2)
I suppose that a funnel or concatenated right shift is also possible.

Perhaps you mean something else.

Bill Findlay

unread,

Oct 5, 2013, 11:44:28 PM10/5/13

to

"Andy (Super) Glew" <an...@SPAM.comp-arch.net> wrote:

>
> In my terminology, a barrel shift circuit or a mux shift circuit could be
> used to implement any of the following shift types:
> * right shifts
> * arithmetic (sign extended)
> * logical (zero extended)
> * funnel (concatenated, shifting in bits from another input)
> * left shifts
> * logical
> * rarely, arithmetic (tweaked so that i >> 1 == i / 2)
> I suppose that a funnel or concatenated right shift is also possible.
>

As a matter of interest, do you think x>>2 should round, or have that as an
option?

--
Bill Findlay

Anton Ertl

unread,

Oct 6, 2013, 8:17:40 AM10/6/13

to

"Andy (Super) Glew" <an...@SPAM.comp-arch.net> writes:
>In my terminology, a barrel shift circuit or a mux shift circuit could
>be used to implement any of the following shift types:
> * right shifts
> * arithmetic (sign extended)
> * logical (zero extended)
> * funnel (concatenated, shifting in bits from another input)
> * left shifts
> * logical
> * rarely, arithmetic (tweaked so that i >> 1 == i / 2)

The last one is (a variant of a?) a right shift, not a left shift.
Left shifts are the same for signed and unsigned numbers (at least
with 2's-complement and 1s-complement representations). If you do an
arithmetic (sign-extended) shift right, you get floored division
(round towards -inf); I guess you want to tweak it to be equivalent to
the symmetric division (round towards 0) implemented in most (all?)
CPUs that implement signed division.

BTW, since we have a few real architects here: Given that it is
relatively cheap to go from unsigned division to either floored or
symmetric division in software, why do CPU architects feel the need to
add symmetric division instructions to the architecture? (And Mitch
can also tell us what happened with that in the 88100.:-)

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

MitchAlsup

unread,

Oct 6, 2013, 7:50:13 PM10/6/13

to

On Saturday, October 5, 2013 4:52:56 PM UTC-5, Terje Mathisen wrote:
> How can (1+log(muxinputs)) (I assume this is log2()?) be significantly
> less than 2?

You forgot the (operand size) multiplier on the (1+log2(muxinputs))

The 64-bit*64 position shifter contains 3 layers of 1:4 multiplexers.
The first layer performs shifts in the amounts of 0,1,2,3.
The second layer performs shifts in the amounts of 0,4,8,12.
The third layer performs shifts in the amounts of 0,16,32,48.
Thus a single bit can be shifted any amount from 0 through 63 positions.

The ruting of the bits being shifted will take 64 wires (more or less)
diagonally across the structure. The routing of the 6=log2(64) control
lines is verticle across the structure. Thus these are 64+6 wires wide.

These things are not generally big enough to set the pitch of the data
path; which is generally set by either the register file of the ALU.

Mitch

MitchAlsup

unread,

Oct 6, 2013, 7:55:00 PM10/6/13

to

On Saturday, October 5, 2013 7:49:18 PM UTC-5, Andy (Super) Glew wrote:
> I call shifters that concatenate two N-bit inputs and do a shift of up
> to N bits "funnel shifters".
>
> I.e. in my terminology a "funnel shift" is a type of shift, like
> arithmetic versus logical shift.
>
> Whereas I use(d) "barrel shifter" to refer to a particular type of
> shifting circuit implementation. Compared, e.g. to multiple stages of muxes.

In my world, the difference between a funnel shifter and a barrel shifter
is the input driver. If the input driver drives 2*n bits in, the same barrel
shifter will be operating like a funnel shifter or like a rotator depending
on what the second set of bits contain. Should the second set of bits contain
0's or sign bits, then the shifter is operating like a barrel shifter in either
logical or arithmetic senses. One multiplexer array can be used in any of the 4
shifting duties by a configurable input driver.

In both cases, the multiplexer array is the same circuit.

Mitch

MitchAlsup

unread,

Oct 6, 2013, 7:57:38 PM10/6/13

to

On Saturday, October 5, 2013 10:44:28 PM UTC-5, Bill Findlay wrote:
> As a matter of interest, do you think x>>2 should round, or have that as an
> option?

Only if you can afford 2 cycles to perform the shift.

Mitch

MitchAlsup

unread,

Oct 6, 2013, 7:58:44 PM10/6/13

to

On Sunday, October 6, 2013 7:17:40 AM UTC-5, Anton Ertl wrote:
> BTW, since we have a few real architects here: Given that it is
> relatively cheap to go from unsigned division to either floored or
> symmetric division in software, why do CPU architects feel the need to
> add symmetric division instructions to the architecture? (And Mitch
> can also tell us what happened with that in the 88100.:-)

I'm afraid my memory needs a bigger kick start on symetric division
and the 88100.

Mitch

Bill Findlay

unread,

Oct 6, 2013, 8:00:49 PM10/6/13

to

On 07/10/2013 00:57, in article
8aa2a377-7dbe-4d33...@googlegroups.com, "MitchAlsup"

Or have separate rounded and unrounded arithmetic right shift orders?

--
Bill Findlay
with blueyonder.co.uk;
use surname & forename;

Andy (Super) Glew

unread,

Oct 6, 2013, 11:44:35 PM10/6/13

to

Having right shift round is interesting - Blaine Gaither, a Burroughs
architect, wanted that at Gould. It's often desired in SIMD "average"
operations.

However, because rounding may involve carry propagation, I put it into
the arithmetic class. "Divide by 2^n".

Anyway: you have to have conventional shifts. Folks expect them. In 32
bit instruction sets, there is rarely room for much more. For a wider
instruction set, I would consider it. I am open to many "widget"
instructions, as long as they don't add much to the data path.

Anton Ertl

unread,

Oct 7, 2013, 7:25:56 AM10/7/13

to

MitchAlsup <Mitch...@aol.com> writes:
>I'm afraid my memory needs a bigger kick start on symetric division
>and the 88100.

My memory is dim, too, so I googled a bit and found
<http://gcc.gnu.org/onlinedocs/gcc-3.3.5/gcc/M88K-Options.html>, which
says:

|On the MC88100 processor the signed integer division instruction div)
|traps to the operating system on a negative operand. The operating
|system transparently completes the operation, but at a large cost in
|execution time. By default, when compiling code that might be run on
|an MC88100 processor, GCC emulates signed integer division using the
|unsigned integer division instruction divu), thereby avoiding the
|large penalty of a trap to the operating system.

I assume that div is supposed to perform symmetric division, although
the manual does not actually say so.

Bill Findlay

unread,

Oct 7, 2013, 7:49:06 AM10/7/13

to

On 07/10/2013 04:44, in article 52522E23...@SPAM.comp-arch.net, "Andy

(Super) Glew" <an...@SPAM.comp-arch.net> wrote:

> On 10/5/2013 8:44 PM, Bill Findlay wrote:
>> "Andy (Super) Glew" <an...@SPAM.comp-arch.net> wrote:
>>
>>>
>>> In my terminology, a barrel shift circuit or a mux shift circuit could be
>>> used to implement any of the following shift types:
>>> * right shifts
>>> * arithmetic (sign extended)
>>> * logical (zero extended)
>>> * funnel (concatenated, shifting in bits from another input)
>>> * left shifts
>>> * logical
>>> * rarely, arithmetic (tweaked so that i >> 1 == i / 2)
>>> I suppose that a funnel or concatenated right shift is also possible.
>>>
>>
>> As a matter of interest, do you think x>>2 should round, or have that as an
>> option?
>
> Having right shift round is interesting - Blaine Gaither, a Burroughs
> architect, wanted that at Gould. It's often desired in SIMD "average"
> operations.

The KDF9 arithmetic right shift rounded; arithmetic left shift flagged
overflow as if it were a "* 2^n" operation.

(The shifter had provision for 1-, ..., 7-, 8- and 16-place shifts; unlike a
modern shifter it applied these sequentially to synthesize the desired shift
length, each step taking 1 cycle. With a 48-bit word the maximum legal shift
took 4 clocks.)

> However, because rounding may involve carry propagation, I put it into
> the arithmetic class. "Divide by 2^n".
>
> Anyway: you have to have conventional shifts. Folks expect them. In 32
> bit instruction sets, there is rarely room for much more. For a wider
> instruction set, I would consider it. I am open to many "widget"
> instructions, as long as they don't add much to the data path.

--

timca...@aol.com

unread,

Oct 7, 2013, 11:55:59 AM10/7/13

to

On Sunday, October 6, 2013 8:17:40 AM UTC-4, Anton Ertl wrote:

> The last one is (a variant of a?) a right shift, not a left shift.
> Left shifts are the same for signed and unsigned numbers (at least
> with 2's-complement and 1s-complement representations).

This is not exactly true (although it is almost always implemented this way)
Signed left shift should "stick" the sign bit.

- Tim

Anton Ertl

unread,

Oct 7, 2013, 12:09:37 PM10/7/13

to

The only time this makes a difference is when there is an overflow.
The result is wrong in that case, so why should the sticky sign be
more appropriate?

MitchAlsup

unread,

Oct 7, 2013, 12:43:08 PM10/7/13

to

On Monday, October 7, 2013 6:25:56 AM UTC-5, Anton Ertl wrote:

> MitchAlsup <Mitch...@aol.com> writes:
> My memory is dim, too, so I googled a bit and found
> <http://gcc.gnu.org/onlinedocs/gcc-3.3.5/gcc/M88K-Options.html>, which
> says:
>
> |On the MC88100 processor the signed integer division instruction div)
> |traps to the operating system on a negative operand. The operating
> |system transparently completes the operation, but at a large cost in
> |execution time.

OK, I remember. The FP Adder was performing the integer divide calculation
and had no support for signedness.

So, rather than add complexity, we punted.

Mitch

Anton Ertl

unread,

Oct 8, 2013, 10:48:27 AM10/8/13

to

MitchAlsup <Mitch...@aol.com> writes:
>OK, I remember. The FP Adder was performing the integer divide calculation
>and had no support for signedness.
>
>So, rather than add complexity, we punted.

Which makes me wonder why you added the div instruction in the first
place. And also why you did not document that problem in the 88100
manual.