
Mill instruction density numbers with code sample


Ivan Godard

Apr 15, 2016, 5:07:46 AM
A while ago I asked if readers were interested in Mill details and
statistics, and got a generally positive reaction. There's been some
discussion of instruction set density here, so this note provides an
example of the code now being produced by our very-pre-alpha compiler,
with some statistics on density.

The compiler is a modified clang/llvm feeding our own specializer. The
specializer can be thought of as a freestanding code generator/back end.
The example was compiled by llvm 3.6 at -O3, although the compiler
produces the same output at -O1.

The target in this example is the Gold configuration, a high-end Mill
family member. Essentially the same code would be produced for other
family members, although occupying more instructions on those members
with fewer resources than Gold.

The source code of this test is:
extern int foo(int x) { return x; }
extern int bar(int x) { return x; }
int main(int argc, char** argv) {
    if ((argc & 7) == 0)
        return foo(0);
    else
        return bar(argc - 345);
}

The foo and bar dummy functions were marked no-inline so as to
illustrate the code generated for the main() function; if inlining were
enabled then most of main would have been optimized away.
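As a sketch of that marking (GCC/Clang attribute syntax; the post does not show how the functions were actually annotated, and `select_path` is a hypothetical stand-in for main's body):

```c
/* noinline keeps the two call sites visible in the generated code;
   without it, -O3 would fold both trivial calls away entirely. */
__attribute__((noinline)) int foo(int x) { return x; }
__attribute__((noinline)) int bar(int x) { return x; }

/* same control flow as the example's main() */
int select_path(int argc) {
    if ((argc & 7) == 0)
        return foo(0);
    else
        return bar(argc - 345);
}
```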

The machine code generated is:
F("_Z3fooi");
retn(b0 %0);

F("_Z3bari");
retn(b0 %0);

F("main");
rd(w(0ul)) %4,
con(w(345ul)) %9,
subsx(b2 %0, b0 %9) %7,
eqlb(), andl(b3 %0, 7) %2 %3,
callfl1(b1 %3, "_Z3bari", b2 %7) %8,
calltr1(b2 %3, "_Z3fooi", b5 %4) %5,
pick(b3 %3, b0 %5, b1 %8) %10,
retn(b0 %6);

This code is conAsm, the Mill machine assembly language. Each item that
looks like a function call is an operation. Operations separated by
commas are in the same (wide) instruction and issue together; the
instruction ends with a semicolon. Here main() is a single instruction
comprising nine operations; the bodies of the dummy functions are one
instruction with one operation each.

The specializer has applied if-conversion to remove the if/else control
flow, made the calls conditional, and applied minor optimizations as
part of code generation. The optimizations and scheduling are driven by
the machine-processable specifications of the ISA and the target. There
is no target-specific code in the tool chain.

The encoding takes advantage of Mill operation phasing, which issues the
operations of an instruction over three consecutive cycles. The machine
issues one instruction per cycle, so three instructions overlap. This
permits intra-instruction dataflow, as seen in the example, in addition
to the usual inter-instruction dataflow. In this example, rd and con
would issue in the first cycle; subsx, eqlb, andl, callfl1, and calltr1
would issue in the second; and retn in the third. The pick operation
does not have a normal issue, but logically issues in between the second
and third cycle.

Text of the form "bN" is a belt reference where N is the temporal index
of the referenced datum on the belt. Text of the form "%N" is a comment to
show the reader the linkage between values produced (%N after the
operation) and consumed (%N after the argument); it is not encoded in
the binary.
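The temporal addressing can be modeled as a short FIFO in which b0 is always the most recent drop (a sketch only; real belts have fixed, member-specific lengths, and 8 positions is assumed here):

```c
#define BELT_LEN 8

typedef struct {
    int vals[BELT_LEN];
    int head;            /* index of the most recent drop (b0) */
} Belt;

/* Dropping a value implicitly renumbers everything already on the
   belt: each existing value's temporal index grows by one. */
static void drop(Belt *b, int v) {
    b->head = (b->head + BELT_LEN - 1) % BELT_LEN;
    b->vals[b->head] = v;
}

/* bN: the value dropped N drops ago */
static int belt(const Belt *b, int n) {
    return b->vals[(b->head + n) % BELT_LEN];
}
```

So after dropping 10 and then 20, b0 names 20 and b1 names 10; no operation ever writes a named register.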

The main() function of this example is one instruction, executing in one
cycle, and the called dummy function requires a second cycle, for an
overall program execution time of two cycles. Only one of the two dummy
functions will be called, because the calls are conditional with
opposite senses. The Mill supports inter-run control flow prediction, so
if the test is predictable the whole program will execute without stall.
If the test is missed then mispredict recovery will add another five
cycles to the execution.

On Mini, an extremely low-end Mill family member, the generated code for
the same example is:
F("_Z3fooi");
retn(b0 %0);


F("_Z3bari");
retn(b0 %0);


F("main");
rd(w(345ul)) %9,
subsx(b1 %0, b0 %9) %7;

eqlb(), andl(b2 %0, 7) %2 %3,
callfl1(b1 %3, "_Z3bari", b2 %7) %8;

rd(w(0ul)) %4,
calltr1(b3 %3, "_Z3fooi", b0 %4) %5;

pick(b4 %3, b0 %5, b2 %8) %10,
retn(b0 %6);
requiring four instructions and five cycles, again with a possible
five-cycle mispredict penalty.

The conAsm assembler collects statistics when converting code to binary
for the simulator. The entire program occupies 39 bytes on Gold and 34
on Mini. The average operation size (corresponding to average
instruction size on a narrow machine) is 3.1 bytes. Sizes on other Mill
family members will be similar.

These results are from preliminary and untuned ISA specifications and a
very new and shaky tool chain. We feel that a 10% improvement in density
is likely with further tuning, and 20% is possible.

More technical information about the Mill is available at:
http://millcomputing.com/docs/

Stephen Fuld

Apr 15, 2016, 12:44:22 PM
On 4/15/2016 2:07 AM, Ivan Godard wrote:
> A while ago I asked if readers were interested in Mill details and
> statistics, and got a generally positive reaction. There's been some
> discussion of instruction set density here, so this note provides an
> example of the code now being produced by our very-pre-alpha compiler,
> with some statistics on density.

snip details


>
> The conAsm assembler collects statistics when converting code to binary
> for the simulator. The entire program occupies 39 bytes on Gold and 34
> on Mini.

It takes fewer bytes on a higher end system?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Ivan Godard

Apr 15, 2016, 1:17:13 PM
*More* bytes on a higher-end system. Gold implements in hardware some
operations that are software-emulated on Mini, such as floating point,
so those ops need not be encodable on Mini and opCode fields can be
smaller. Similarly, Mini has a shorter belt needing three bits for a belt
reference, whereas Gold needs five bits. In addition, instruction-level
overhead is higher on Gold, because fields that (for example) give the
byte length of the overall instruction must be larger on Gold.
Offsetting this is that some codes may run out of belt and require
spill/fill ops on Mini that are not needed on Gold with its longer belt,
but that's not applicable in this example.

The difference across the line is not large - about 15% - but it is there.

Stephen Fuld

Apr 15, 2016, 4:13:43 PM
On 4/15/2016 10:17 AM, Ivan Godard wrote:
> On 4/15/2016 9:44 AM, Stephen Fuld wrote:
>> On 4/15/2016 2:07 AM, Ivan Godard wrote:
>>> A while ago I asked if readers were interested in Mill details and
>>> statistics, and got a generally positive reaction. There's been some
>>> discussion of instruction set density here, so this note provides an
>>> example of the code now being produced by our very-pre-alpha compiler,
>>> with some statistics on density.
>>
>> snip details
>>
>>
>>>
>>> The conAsm assembler collects statistics when converting code to binary
>>> for the simulator. The entire program occupies 39 bytes on Gold and 34
>>> on Mini.
>>
>> It takes fewer bytes on a higher end system?
>>
>
> *More* bytes on a higher-end system.


Yes, sorry, that is what I meant. I got the sense of the test wrong. :-(




> Gold implements in hardware some
> operations that are software-emulated on Mini, such as floating point,
> so those ops need not be encodable on Mini and opCode fields can be
> smaller. Similarly, Mini has a shorter belt needing threebits for a belt
> reference, whereas Gold needs five bits. In addition instruction-level
> overhead is higher on Gold, because fields that (for example) give the
> byte length of the overall instruction must be larger on Gold.
> Offsetting this is that some codes may run out of belt and require
> spill/fill ops on Mini that are not needed on Gold with its longer belt,
> but that's not applicable in this example.

Also, doesn't the larger number of instructions in this example mean
more "instruction-level overhead"?


> The difference across the line is not large - about 15% - but it is there.


Interesting. I just never thought about a lower end product being able
to encode the solution to a problem in fewer bits than a higher end
"compatible" product. But in this case, it all makes sense. There just
might be a (modest) code size versus performance trade off that I didn't
expect.

Rick C. Hodgin

Apr 15, 2016, 4:36:34 PM
On Friday, April 15, 2016 at 5:07:46 AM UTC-4, Ivan Godard wrote:
> F("main");
> 0: rd(w(0ul)) %4,
> 1: con(w(345ul)) %9,
> 2: subsx(b2 %0, b0 %9) %7,
> 3: eqlb(), andl(b3 %0, 7) %2 %3,
> 4: callfl1(b1 %3, "_Z3bari", b2 %7) %8,
> 5: calltr1(b2 %3, "_Z3fooi", b5 %4) %5,
> 6: pick(b3 %3, b0 %5, b1 %8) %10,
> 7: retn(b0 %6);

Am I understanding this correctly, that on the Mill's belt you have
a certain number of slots (which looks like 8 here), and that you
encode a single instruction or operation per slot, which then in a
large concatenated expression like this signals the operations to
conduct in all slots simultaneously?

And how would the pick() in slot #6 be able to pick based on the
results of the branched code in slot #4 or #5 based on the prior
result?

It seems like it would stall, or that this entire context would
have to be pushed onto the stack as you then branch to call
_Z3bari or _Z3fooi.

Best regards,
Rick C. Hodgin

Ivan Godard

Apr 15, 2016, 5:29:46 PM

On 4/15/2016 1:13 PM, Stephen Fuld wrote:

<snip>

> Also, doesn't the larger number of instructions in this example mean
> more "instruction-level overhead"?

Yes, it does, which is another countervailing factor. The largest
difference though is the morsel size, three bits on Mini and five on
Gold. By eyeball I count 21 morsels in the program; they are used for
belt references, operand widths, and small constants in various operations.
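A quick cross-check of that count (my arithmetic, not from the post): the morsel-width difference alone accounts for roughly the whole observed gap.

```c
/* Rough size model: extra bytes attributable to wider morsels.
   21 morsels, 5-bit (Gold) vs 3-bit (Mini) encodings, truncated
   to whole bytes. */
int morsel_overhead_bytes(int morsels, int wide_bits, int narrow_bits) {
    return morsels * (wide_bits - narrow_bits) / 8;
}
```

21 × (5 − 3) = 42 bits, about 5 bytes, which matches the observed 39 − 34 = 5 byte difference before the offsetting instruction-level overheads.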

>> The difference across the line is not large - about 15% - but it is
>> there.
>
>
> Interesting. I just never thought about a lower end product being
> able to encode the solution to a problem in fewer bits than a higher
> end "compatible" product. But in this case, it all makes sense.
> There just might be a (modest) code size versus performance trade off
> that I didn't expect.
>
Yes - or a modest saving in the size and cost of structures such as the
icache. Of course, those structures in turn must scale across the line.
This example shows Gold pulling a single ~35 byte instruction,
potentially per cycle, and the example main() function is far from
saturating the Gold issue rate, which can be fetching, decoding and
issuing ~70 bytes per cycle. In contrast the Mini peak rate is ~10 bytes.

Here is the Gold binary generated for this program. Because Mill code
executes in two streams running in opposite directions in memory, the
entry point of a function or EBB (and its entry bytes) is at the label,
flow-side half-instructions are down the left column toward decreasing
addresses, while exu-side half-instructions are down the right column
toward increasing addresses. You may need to widen your display; this
makes even less sense after line wrap :-)
> -----------------------------------------------------------------------------------
> L("_Z3fooi");
> 00000-7 .. .. .. .. 1. .. .. .. <-- | --> 00000-7 .. .. .. .. .0 .. .. ..
> retn(b0);
> 00000-7 01 02 03 04 .. .. .. .. <-- |
> -----------------------------------------------------------------------------------
> L("_Z3bari");
> 00008-f .. 1. .. .. .. .. .. .. <-- | --> 00008-f .. .0 .. .. .. .. .. ..
> retn(b0);
> 00008-f 04 .. .. .. .. .. .. .. <-- |
> 00000-7 .. .. .. .. .. 01 02 03 |
> -----------------------------------------------------------------------------------
> L("main");
> 00018-f .. .. .. 1. .. .. .. .. <-- | --> 00018-f .. .. .. .1 .. .. .. ..
> rd(opAttr::pop0Code(10)), con(w(159)), subsx(b2, b0), eqlb(), andl(b3, 7),
> callfl1(b1, "_Z3bari", b2), calltr1(b2, "_Z3fooi", b5), pick(b3, b0, b1),
> retn(b0);
> 00018-f 02 c3 1e .. .. .. .. .. <-- | --> 00018-f .. .. .. .. 55 18 85 8f
> 00010-7 45 08 52 c8 f0 90 21 03 | 00020-7 03 c6 07 6e df 02 40 6a
> 00008-f .. .. 59 01 11 16 80 08 | 00028-f 03 04 .. .. .. .. .. ..
> -----------------------------------------------------------------------------------


For comparison, here is the Mini binary:
> -----------------------------------------------------------------------------------
> L("_Z3fooi");
> 00000-7 .. .. .. 1. .. .. .. .. <-- | --> 00000-7 .. .. .. .0 .. .. .. ..
> retn(b0);
> 00000-7 01 61 20 .. .. .. .. .. <-- |
> -----------------------------------------------------------------------------------
> L("_Z3bari");
> 00000-7 .. .. .. .. .. .. .. 1. <-- | --> 00000-7 .. .. .. .. .. .. .. .0
> retn(b0);
> 00000-7 .. .. .. .. 01 61 20 .. <-- |
> -----------------------------------------------------------------------------------
> L("main");
> 00010-7 .. .. .. .. .. 2. .. .. <-- | --> 00010-7 .. .. .. .. .. .2 .. ..
> rd(opAttr::pop0Code(355)), subsx(b1, b0);
> | --> 00010-7 .. .. .. .. .. .. 17 c6
> | 00018-f 3e e0 69 .. .. .. .. ..
> eqlb(), andl(b2, 7), callfl1(b1, "_Z3bari", b2);
> 00010-7 0d 80 92 89 40 .. .. .. <-- | --> 00018-f .. .. .. a3 63 77 08 ..
> rd(opAttr::pop0Code(10)), calltr1(b3, "_Z3fooi", b0);
> 00008-f .. .. .. 11 00 96 d9 40 <-- | --> 00018-f .. .. .. .. .. .. .. 15
> | 00020-7 14 1c .. .. .. .. .. ..
> pick(b4, b0, b2), retn(b0);
> 00008-f 00 61 20 .. .. .. .. .. <-- | --> 00020-7 .. .. c5 00 21 .. .. ..
> -----------------------------------------------------------------------------------



For each instruction after the first, the exu and flow parts of that
instruction are not adjacent in memory, so there are two addresses shown
for each instruction, one for the exu half and one for the flow half.

It takes a little getting used to. Fortunately, few people need to look
at binary, and rarely at that.

wolfgang kern

Apr 16, 2016, 4:33:47 AM

Ivan Godard posted:
...
> Yes - or a modest saving in the size and cost of structures such as the
> icache. Of course, those structures in turn must scale across the line.
> This example shows Gold pulling a single ~35 byte instruction,
> potentially per cycle, and the example main() function is far from
> saturating the Gold issue rate, which can be fetching, decoding and
> issuing ~70 bytes per cycle. In contrast the Mini peak rate is ~10 bytes.

> Here is the Gold binary generated for this program. Because Mill code
> executes in two streams running in opposite directions in memory, the
> entry point of a function or EBB (and its entry bytes) is at the label,
> flow-side half-instructions are down the left column toward decreasing
> addresses, while exu-side half-instructions are down the right column
> toward increasing addresses. You may need to widen your display; this
> makes even less sense after line wrap :-)

[binary...]

> For each instruction after the first, the exu and flow parts of that
> instruction are not adjacent in memory, so there are two addresses shown
> for each instruction, one for the exu half and one for the flow half.

> It takes a little getting used to Fortunately, few people need to look
> at binary, and they rarely.

Perhaps I'm the last machine code programmer left. 35 bytes per
instruction seems enough to overwhelm any human brain, but if I interpret
your posted figures correctly then I see a three-byte 'prefix/group' field
followed by several 32/64-bit fields, which may not be too complex for an
experienced hex-coder to remember ;)

thanks for the detailed explanation,
__
wolfgang

Ivan Godard

Apr 16, 2016, 5:39:08 AM
On 4/16/2016 1:27 AM, wolfgang kern wrote:
>
> Ivan Godard posted:

<snip>

> [binary...]
>
>> For each instruction after the first, the exu and flow parts of that
>> instruction are not adjacent in memory, so there are two addresses
>> shown for each instruction, one for the exu half and one for the flow
>> half.
>
>> It takes a little getting used to Fortunately, few people need to look
>> at binary, and they rarely.
>
> Perhaps I'm the last machine code programmer yet. 35 byte per instuction
> seem to overwhelm any human brain, but if I correct interprete your
> posted figures then I see a three byte 'prefix/group'-field followed by
> several 32/64 bit fields which may not be to complex for an experienced
> hex-coder to remember ;)
>
> thanks for the detailed explanation,
> __
> wolfgang

The half-instructions are byte aligned; everything internal to one is
bit aligned. There is a single byte of metadata at the entry point in
the middle of the EBB, between the exu and flow sides, that doesn't
belong to any instruction. The header of each half-instruction (at the
end toward the middle of each half) gives the byte length of that half
and how many slots are occupied in each of the three blocks on that
side. Blocks, and slots within blocks, are encoded outwards from the
header on each side.

The headers and each slot have a unique bit width and encoding,
entropy-optimal for the operation population of that slot, which is
arbitrary and given in the member specification. There are no 32- or
64-bit fields except by coincidence, nor byte-sized or byte-aligned fields
for that matter, within either the operations or the header.

The same software that figures out the entropy-optimal encoding also
generates the source code that when compiled encodes and decodes the
binary, and in a different form creates the Verilog for the hardware
decoder.

None of this encoding is manual. I shudder to think of the idea :-) It
is possible to get a dump of the binary that shows the individual bit
fields and how they map to the operation attributes, but no one but a
hardware verification engineer chasing a bug would care, especially as
many of the fields are composite, containing the cross-product of
attribute value ranges, for improved entropy density.

wolfgang kern

Apr 16, 2016, 7:39:26 AM

Ivan Godard replied:

<snip>

>> [binary...]
>> Perhaps I'm the last machine code programmer yet. 35 byte per instuction
>> seem to overwhelm any human brain, but if I correct interprete your
>> posted figures then I see a three byte 'prefix/group'-field followed by
>> several 32/64 bit fields which may not be to complex for an experienced
>> hex-coder to remember ;)

> The half-instructions are byte aligned; everything internal to one is
> bit aligned. There is a single byte of metadata at the entry point in
> the middle of the EBB, between the exu and flow sides, that doesn't
> belong to any instruction. The header of each half-instruction (at the
> end toward the middle of each half) gives the byte length of that half
> and how many slots are occupied in each of the three blocks on that
> side. Blocks, and slots within blocks, are encoded outwards from the
> header on each side.

> The headers and each slot have a unique bit width and encoding,
> entropy-optimal for the operation population of that slot, which is
> arbitrary and given in the member specification. There are no 32- or
> 64-bit fields except by coincidence, nor byte sized or aligned fields
> either for that matter, within either operations or header.

Oh, I see...

> The same software that figures out the entropy-optimal encoding also
> generates the source code that when compiled encodes and decodes the
> binary, and in a different form creates the Verilog for the hardware
> decoder.

> None of this encoding is manual. I shudder to think of the idea :-) It
> is possible to get a dump of the binary that shows the individual bit
> fields and how they map to the operation attributes, but no one but a
> hardware verification engineer chasing a bug would care, especially as
> many of the field are composite, containing the cross-product of
> attribute value ranges, for improved entropy density.

But with enough information I would be able to write myself a tool
similar to my hex-editable (x86/etc.) disassemblers for your Mill ;)
__
wolfgang

Ivan Godard

Apr 16, 2016, 2:16:28 PM
On 4/16/2016 4:38 AM, wolfgang kern wrote:
>
> Ivan Godard replied:

>> None of this encoding is manual. I shudder to think of the idea :-) It
>> is possible to get a dump of the binary that shows the individual bit
>> fields and how they map to the operation attributes, but no one but a
>> hardware verification engineer chasing a bug would care, especially as
>> many of the field are composite, containing the cross-product of
>> attribute value ranges, for improved entropy density.
>
> But with enough information I would be able to write myself a tool
> similar to my hex-editable (x86/etc..) disassemblers for your mill ;)

Of course - except we already wrote one several years ago, and will be
open-sourcing it with the chips.

Walter Banks

Apr 18, 2016, 8:01:11 PM
> On Mini, an extremely low-end Mill family member, the generated code
> for the same example is: F("_Z3fooi"); retn(b0 %0);
>
>
> F("_Z3bari"); retn(b0 %0);
>
>
> F("main"); rd(w(345ul)) %9, subsx(b1 %0, b0 %9) %7;
>
> eqlb(), andl(b2 %0, 7) %2 %3, callfl1(b1 %3, "_Z3bari", b2 %7) %8;
>
> rd(w(0ul)) %4, calltr1(b3 %3, "_Z3fooi", b0 %4) %5;
>
> pick(b4 %3, b0 %5, b2 %8) %10, retn(b0 %6); requiring four
> instructions and five cycles, again with a possible 5 cycle
> mispredict penalty.




Ivan,

I appreciate you posting this. Any chance you can post something
a little more representative of an actual program? Even a dozen lines or so.

w..

Bruce Hoult

Apr 18, 2016, 9:58:44 PM
> I appreciate you posting this. Any chance you can post something
> a little more representative of an actual program. Even a dozen lines or so.

Indeed. It's great.

I'd love to see something like lzss or lz4 decompression. (or compression, but that's much more code)

Ivan Godard

Apr 18, 2016, 11:34:53 PM
On 4/18/2016 5:01 PM, Walter Banks wrote:
<snip>

> I appreciate you posting this. Any chance you can post something
> a little more representative of an actual program. Even a dozen lines or
> so.

OK.

This is the "parseval" function from the core_util.c file of the EEMBC
benchmarks. The source is below; for include files and context see the
EEMBC suite.

The target is the Gold configuration. The software-pipeline,
load-deferring, and vectorization optimizations are not enabled; they
are still buggy, and even when they work the resulting code has been so
sliced and diced as to be even more incomprehensible than this. You may
notice some questionable code even in this scalar version - they are
glaring to someone familiar with the machine; the tool chain is *very*
pre-alpha.

--------------------------------------------------------------------------
ee_s32 parseval(char *valstring) {
    ee_s32 retval=0;
    ee_s32 neg=1;
    int hexmode=0;
    if (*valstring == '-') {
        neg=-1;
        valstring++;
    }
    if ((valstring[0] == '0') && (valstring[1] == 'x')) {
        hexmode=1;
        valstring+=2;
    }
    /* first look for digits */
    if (hexmode) {
        while (((*valstring >= '0') && (*valstring <= '9')) ||
               ((*valstring >= 'a') && (*valstring <= 'f'))) {
            ee_s32 digit=*valstring-'0';
            if (digit>9)
                digit=10+*valstring-'a';
            retval*=16;
            retval+=digit;
            valstring++;
        }
    } else {
        while ((*valstring >= '0') && (*valstring <= '9')) {
            ee_s32 digit=*valstring-'0';
            retval*=10;
            retval+=digit;
            valstring++;
        }
    }
    /* now add qualifiers */
    if (*valstring=='K')
        retval*=1024;
    if (*valstring=='M')
        retval*=1024*1024;

    retval*=neg;
    return retval;
}

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
F("parseval");
load(b0 %0, 0, b) %3;

nop();
lea(b0 %0, 1) %5;

eqlb(b1 %3, 45) %4,
pick(b0 %4, b1 %5, b3 %0) %7;

load(b0 %7, 1, b) %10;

nop();
lea(b0 %7, 2) %12;

eqlb(b1 %10, 120) %11,
pick(b0 %11, b1 %12, b3 %7) %13;

load(b4 %7, 0, b) %8,
load(b0 %13, 0, b) %14;

nop();
nop();
eqlb(b3 %8, 48) %9,
sub(b4 %8, 48) %63,
sub(b4 %14, 48) %16,
sub(b5 %14, 97) %18,
con(w(18446744073709551615ul)) %1,
brfl(b4 %9, "parseval$5", b14 %47, b15 %48, b16 %50),
pick(b13 %4, b4 %1, b5 %2) %6,
rd(w(1ul)) %2;

sub(b7 %14, 48) %64,
lssub(b3 %16, 10) %17,
lssub(b3 %18, 6) %19,
brfl(b13 %11, "parseval$5", b17 %47, b18 %48, b19 %50);

orl(b2 %17, b1 %19) %20,
brfl(b0 %20, "parseval$4", b22 %38, b22 %39, b22 %65),
rd(w(0ul)) %15;

inner("parseval$3", b12 %14, b1 %15, b14 %13);

L("parseval$4") %38 %39 %65;
eqlb(b2 %39, 75) %40,
shiftl(b2 %38, 10) %41,
pick(b1 %40, b0 %41, b3 %38) %42;

eqlb(b5 %39, 77) %43,
shiftl(b1 %42, 20) %44,
pick(b1 %43, b0 %44, b2 %42) %45;

mulsx(b0 %45, b6 %65) %46;

nop();
retn(b0 %46);

L("parseval$5") %47 %48 %50;
lssub(b3 %50, 10) %51,
brfl(b0 %51, "parseval$4", b5 %38, b5 %39, b5 %65),
rd(w(0ul)) %49;

inner("parseval$6", b2 %47, b1 %49, b3 %48);

L("parseval$3") %21 %22 %23;
load(b2 %23, 1, b) %32;

nop();
widens(b1 %21, w) %25,
shiftl(b3 %22, 4) %30;

subsx(b1 %25, 48) %26,
sub(b3 %32, 48) %33,
sub(b4 %32, 97) %35;

subsx(b4 %25, 87) %28,
gtrsb(b3 %26, 9) %27,
lssub(b3 %33, 10) %34,
lssub(b3 %35, 6) %36,
lea(b8 %23, 1) %24,
pick(b2 %27, b3 %28, b6 %26) %29;

addsx(b1 %29, b9 %30) %31,
orl(b4 %34, b3 %36) %37,
leavefl(b0 %37, "parseval$4", b15 %38, b16 %39, b14 %65);

br("parseval$3", b13 %32, b1 %31, b2 %24);

L("parseval$6") %52 %53 %54;
mulsx(b1 %53, 10) %57,
load(b2 %54, 1, b) %60;

nop();
nop();
widens(b2 %52, w) %56,
sub(b1 %57, 48) %58,
sub(b3 %60, 48) %61,
lea(b4 %54, 1) %55;

add(b2 %58, b3 %56) %59,
lssub(b2 %61, 10) %62,
leavefl(b0 %62, "parseval$4", b9 %38, b10 %39, b8 %65);

br("parseval$6", b7 %60, b1 %59, b2 %55);




Ivan Godard

Apr 18, 2016, 11:37:01 PM
On 4/18/2016 6:58 PM, Bruce Hoult wrote:

> Indeed. It's great.
>
> I'd love to see something like lzss or lz4 decompression. (or compression, but that's much more code)
>
Needs pipelining and vectorization; please be patient.

Terje Mathisen

Apr 19, 2016, 3:12:33 AM
I have written a few LZ4 decoders, with x86 SSE intrinsics I managed to
beat the Google reference version. :-)

The plain/naive version does not need any vector ops until you start
optimizing, i.e. for the (possibly overlapping) memcpy() loops.

My fastest version loads the next 16 bytes (unaligned), then uses a
table lookup based on the offset length N to load an in-vector index
lookup table which I then use to replicate the first N bytes across a
pair of vectors.

This all depends on having fairly fast unaligned load and store byte
vector ops, and on being able to control up to 27 bytes of scratch
buffer space past the end of the output buffer!

(27 because I write 32 bytes in each iteration and the file format
guarantees that the last 5 bytes (minimum) will be a straight copy.)

The alternative is to simply limit the fast decoder to
(buffer_length-GUARD) bytes and then fall back on safe code for the tail.
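That fast-path/guard split can be sketched generically (a hypothetical helper, not Terje's actual code): take the wide path only while a guard band remains past the write position, and fall back to byte-at-a-time for the tail, which also handles short match offsets.

```c
#include <stddef.h>
#include <string.h>

#define GUARD 27  /* worst-case overrun of the wide fast path */

/* LZ4-style match copy: dst trails src by the match offset, and the
   copy may read bytes it wrote earlier (replication). The fast path
   rounds len up to 8-byte chunks, writing up to 7 bytes of slop past
   dst+len, so it runs only inside the guard band and for offsets >= 8. */
static void match_copy(unsigned char *dst, const unsigned char *src,
                       size_t len, unsigned char *buf_end) {
    if (dst + len + GUARD <= buf_end && (size_t)(dst - src) >= 8) {
        for (size_t i = 0; i < len; i += 8)
            memcpy(dst + i, src + i, 8);   /* no self-overlap per chunk */
        return;
    }
    /* safe tail: byte-at-a-time handles offsets < 8 and the buffer end */
    while (len--) *dst++ = *src++;
}
```

Replicating short offsets across a whole vector in one step, as Terje describes, needs the in-vector shuffle table on top of this skeleton.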

On the Mill I have a strong feeling that we'll be able to use the
None/NaR metadata to use (mostly) aligned operations, with None coding
to avoid overwriting either previously written data or space past the
allocated end.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

Apr 19, 2016, 8:22:47 PM
On Monday, April 18, 2016 at 10:34:53 PM UTC-5, Ivan Godard wrote:

I thought it might be fun to see what a RISC instruction set with predication can do to this function: (done by hand so a few typos might be present)

__parseval
MOV Rretval,0
MOV Rneg,1
MOV Rhexmode,0
LDB Rchar,[Rvalstring]
LDB Rchar1,[Rvalstring+1]
CMP Rcmp,Rchar,"-"
PB.EQ Rcmp,0b1010
MOV Rneg,-1
ADD Rvalstring,Rvalstring,1
LDB Rchar,[Rvalstring]
LDB Rchar1,[Rvalstring+1]
CMP Rcmp,Rchar,"0"
CMP Rcmp1,Rchar1,"x"
AND Rcmp2,Rcmp,Rcmp1
PB.EQ Rcmp2,0b1010
MOV Rhexmode,1
ADD Rvalstring,Rvalstring,2
B.NE Rhexmode,if1else
while1top:
LDB Rchar,[Rvalstring]
CMP Rcmp,Rchar,"0"
CMP Rcmp1,Rchar,"9"
CMP Rcmp2,Rchar,"a"
CMP Rcmp3,Rchar,"f"
PB.GE Rcmp, 0b101010101010101001
PB.LE Rcmp1,0b1010101010101001
PB.GE Rcmp2,0b10101010101001
PB.LE Rcmp3,0b101010101001
ADD Rdigit,Rchar,-"0"
PB.GT Rcmp1,0b10101001
ADD Rdigit,Rchar,10-"a"
SLA Rretval,Rretval,4
ADD Rretval,Rretval,Rdigit
BR.A while1top
BR.A qualifiers
if1else:
while2top:
LDB Rchar,[Rvalstring]
CMP Rcmp,Rchar,"0"
CMP Rcmp1,Rchar,"9"
PB.GE Rcmp,0b1010101001
PB.LE Rcmp1,0b10101001
LDA Rretval,[Rretval,Rretval<<2]
LDA Rretval,[Rdigit,Rretval<<1]
ADD Rvalstring,Rvalstring,1
BR.A while2top
qualifiers:
LDB Rchar,[Rvalstring]
CMP Rcmp,Rchar,"K"
PB.EQ Rcmp,0b10
SLA Rretval,Rretval,10
CMP Rcmp,Rchar,"M"
PB.EQ Rcmp,0b10
PC.ne Rneg,0b10
ADD Rretval,-Rretval,0
JMP Rret

Mitch

Ivan Godard

Apr 19, 2016, 9:20:30 PM
On 4/19/2016 5:22 PM, MitchAlsup wrote:
> On Monday, April 18, 2016 at 10:34:53 PM UTC-5, Ivan Godard wrote:
>
> I thought it might be fun to see what a RISC instruction set with
> predication can do to this function: (done by hand so a few typos
> might be present)

<code snip>

I count 52 instructions, or 210 bytes, vs 337 bytes on Gold. Not really
comparable, hand code vs pre-alpha compiler, but shows how much we still
have to do. It would be interesting to see the compiler-produced code
for this function for comparison; I suspect that Mitch's hand code runs
rings around anybody's compiler :-)

The Gold code has 89 ops; the difference between 89 and 52
arises mostly from two causes: explicit nops (16) and speculative code
replication (the current Mill tool chain optimizes for speed not space
and does a lot of speculation to use the machine width). The major
architectural difference seems to be the use of predication (RISC) vs
selection (the pick op on Mill); there are six picks.

Cycle counts cannot be compared; as a statically scheduled exposed pipe
machine, the Mill cycle counts (excluding things like memory misses) can
be read directly from the asm, but not so for a RISC. Some of the Mill
instructions are fairly wide, with as many as nine ops to slow the RISC,
but others have only one op or an opset that would fit in a single issue
in a high-end OOO.

It will be a while before we can supply benchmark timings, and we can't
at all for this particular code due to EEMBC rules.

Are you readers interested in more work-in-process status?

MitchAlsup

Apr 19, 2016, 10:16:20 PM
On Tuesday, April 19, 2016 at 8:20:30 PM UTC-5, Ivan Godard wrote:
> On 4/19/2016 5:22 PM, MitchAlsup wrote:
> > On Monday, April 18, 2016 at 10:34:53 PM UTC-5, Ivan Godard wrote:
> >
> > I thought it might be fun to see what a RISC instruction set with
> > predication can do to this function: (done by hand so a few typos
> > might be present)
>
> <code snip>
>
> I count 52 instructions, or 210 bytes, vs 337 bytes on Gold. Not really
> comparable, hand code vs pre-alpha compiler, but shows how much we still
> have to do. It would be interesting to see the compiler-produced code
> for this function for comparison; I suspect that Mitch's hand code runs
> rings around anybody's compiler :-)

Hi Ivan,

What I primarily wanted to demonstrate was the utility of my predication scheme to get rid of branches and the way the condition-codeless compare instructions can also minimize both branches and other gook centered around && and || conditionals.

Once you get branches out of the way, code scheduling over predication is no different than code scheduling straight line arithmetic. And getting the branches out of the way helps massively.
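The effect Mitch describes shows in miniature on the digit computation from parseval: rewriting the if as a select (which his predication or the Mill's pick expresses directly) removes the branch entirely. A sketch of the transformation, not anyone's compiler output:

```c
/* branchy form, as written in parseval */
int digit_branchy(int c) {
    int digit = c - '0';
    if (digit > 9)
        digit = 10 + c - 'a';
    return digit;
}

/* if-converted: compute both arms speculatively, then select one;
   this lowers to a predicated move or cmov-style select with no
   control flow to schedule around */
int digit_selected(int c) {
    int dec = c - '0';
    int hex = 10 + c - 'a';
    return (dec > 9) ? hex : dec;
}
```

Once in the second form, the two subtractions and the compare can be scheduled like any straight-line arithmetic.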

I suspect that without my predication and CC-less ISA, the code would be 25% longer, minimum.

None of the other features {32-bit and 64-bit immediates and displacements, or stores of immediates with displacements} were necessary here.

Mitch

Ivan Godard

Apr 19, 2016, 11:25:23 PM
On 4/19/2016 7:16 PM, MitchAlsup wrote:

>
> Hi Ivan,
>
> What I primarily wanted to demonstrate was the utility of my
> predication scheme to get rid of branches and the way the
> condition-codeless compare instructions can also minimize both
> branches and other gook centered around && and || conditionals.
>
> Once you get branches out of the way, code scheduling over
> predication is no different than code scheduling straight line
> arithmetic. And getting the branches out of the way helps massively.
>
> I suspect that without my predication and CC-less ISA, the code would
> be 25% longer, minimum.
>
> None of the other features {32-bit and 64-bit immediates and
> displacements, or stores of immediates with displacements} were
> necessary here.
>
> Mitch
>

I agree if-conversion is critical.

LLVM gave me genAsm with 7 basic blocks; that got if-converted down to
five in the Gold code. Your hand code had four. Three is the minimum
possible with two loops, but there is code after the loops so your four
is optimal. Mine had a block between the loops ("main$5") with only one
instruction. In the current Mill architecture, loops are contexts that
you enter with inner() and exit with leave(). An inner() inside a loop
is a nested loop, so to have two consecutive loops at the same level you
must leave the first and enter the second, and that requires a landing
pad in between; hence 5 blocks in the Mill code.

There's no real reason preventing a sort of loop context tailcall, where
a loop exits and immediately enters another as one op. I'll have to
think about that notion.

For us there's less need to eliminate branch ops than on a machine with
a branch predictor. Because a single instruction can contain lots of
branches (Gold has four branch units), of which several may be enabled
but only the first enabled is taken, we actually have a density similar
to your predication and don't have to chain through the branch tree the
way a conventional machine does. So for example a switch using a binary search
tree can go 16 ways in two instructions, which is *much* better than
loading from a jump table, and has no d$1 impact. You could do much the
same with your predication I think.

Of course, all this would be so much easier if you would join and help :-)

Terje Mathisen

unread,
Apr 20, 2016, 5:18:03 AM4/20/16
to
MitchAlsup wrote:
> On Monday, April 18, 2016 at 10:34:53 PM UTC-5, Ivan Godard wrote:
>
> I thought it might be fun to see what a RISC instruction set with predication can do
> to this function: (done by hand so a few typos might be present)

Interesting instruction set. :-)

I followed the logic OK, except I never found the expected

SLA Rretval,Rretval,20

for the 'M' case?

>
> __parseval
> MOV Rretval,0
> MOV Rneg,1
> MOV Rhexmode,0
> LDB Rchar,[Rvalstring]
> LDB Rchar1,[Rvalstring+1]
> CMP Rcmp,Rchar,"-"
> PB.EQ Rcmp,0b1010

The PB.EQ is predicate mask for the following instructions, two bits per
opcode so this says that for the next two instructions you should do
them if the CMP returned Equal, and skip if not, right?

> MOV Rneg,-1
> ADD Rvalstring,Rvalstring,1

> LDB Rchar,[Rvalstring]
> LDB Rchar1,[Rvalstring+1]
> CMP Rcmp,Rchar,"0"
> CMP Rcmp1,Rchar1,"x"
> AND Rcmp2,Rcmp,Rcmp1
> PB.EQ Rcmp2,0b1010

Ditto, skip the next two if '0x' wasn't found.

Since you almost certainly allow unaligned loads I would have expected
the code above to be converted into something like this (assuming H for
16-bit half-words):

LDH Rpair,[Rvalstring]
CMP Rcmp,Rpair,"0x" ;; Possibly "x0"?
PB.EQ Rcmp,0b0101

This would at least save the cycle used to combine the two CMP results.

You could also have used longer predicate tails instead of combining the
compare results but that would probably not have saved anything?

> MOV Rhexmode,1
> ADD Rvalstring,Rvalstring,2

> B.NE Rhexmode,if1else
> while1top:
> LDB Rchar[Rvalstring]
> CMP Rcmp,Rchar,"0"
> CMP Rcmp1,Rchar,"9"
> CMP Rcmp2,Rchar,"a"
> CMP Rcmp3,Rchar,"f"
> PB.GE Rcmp, 0b101010101010101001
> PB.LE Rcmp1,0b1010101010101001
> PB.GE Rcmp2,0b10101010101001
> PB.LE Rcmp3,0b101010101001

The tricky code sequence above depends on having a character set where
'a-f' is using higher codes than '0-9', but since that is OK for both
ASCII and EBCDIC it is probably a safe assumption. :-)

It would have been tempting to combine the two decoding loops into one,
with predicates covering the hex part when decoding decimal, but that
would have required a PB.EQ Rhexmode,0b... to cover the hex part and
either a register which contained 10/16 and a MUL, or extending the
predicate cover to include the hex vs decimal scaling.

Fun stuff!

MitchAlsup

unread,
Apr 20, 2016, 9:51:58 AM4/20/16
to
On Wednesday, April 20, 2016 at 4:18:03 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Monday, April 18, 2016 at 10:34:53 PM UTC-5, Ivan Godard wrote:
> >
> > I thought it might be fun to see what a RISC instruction set with predication can do
> > to this function: (done by hand so a few typos might be present)
>
> Interesting instruction set. :-)
>
> I followed the logic OK, except I never found the expected
>
> SLA Rretval,Rretval,20

Showing why one lets compilers do this sort of thing.......)
>
> for the 'M' case?
>
> >
> > __parseval
> > MOV Rretval,0
> > MOV Rneg,1
> > MOV Rhexmode,0
> > LDB Rchar,[Rvalstring]
> > LDB Rchar1,[Rvalstring+1]
> > CMP Rcmp,Rchar,"-"
> > PB.EQ Rcmp,0b1010
>
> The PB.EQ is predicate mask for the following instructions, two bits per
> opcode so this says that for the next two instructions you should do
> them if the CMP returned Equal, and skip if not, right?

There are 4 choices in the predicate shadow:
00 Don't execute
01 Execute if FALSE
10 Execute if TRUE
11 Execute always

The shift register this stuff gets logged into ORs the TRUE part from predicate to predicate and NANDs the FALSE part. This allows one to construct intricate predicate shadows.

>
> > MOV Rneg,-1
> > ADD Rvalstring,Rvalstring,1
>
> > LDB Rchar,[Rvalstring]
> > LDB Rchar1,[Rvalstring+1]
> > CMP Rcmp,Rchar,"0"
> > CMP Rcmp1,Rchar1,"x"
> > AND Rcmp2,Rcmp,Rcmp1
> > PB.EQ Rcmp2,0b1010
>
> Ditto, skip the next two if '0x' wasn't found.

This converts an && into an & because it uses the same bit position from both compares.
>
> SInce you almost certainly allow unaligned loads I would have expected
> the code above to be converted into something like this (assuming H for
> 16-bit half-words):
>
> LDH Rpair,[Rvalstring]
> CMP Rcmp,Rpair,"0x" ;; Possibly "x0"?
> PB.EQ Rcmp,0b0101
>
> This would at least save the cycle used to combine the two CMP results.

I missed that one, thanks.
>
> You could also have used longer predicate tails instead of combining the
> compare results but that would probably not have saved anything?

The predicate shadow is only 8 instructions long, including additional predicates.
I do this kind of parsing with tables::

while( parsetable[(char=*valstring++)] & NUMBER )
{
    retval = retval*base + (char&0x0F);
}

Much faster.
>
> Fun stuff!

Indeed.

Rick C. Hodgin

unread,
Apr 20, 2016, 10:00:48 AM4/20/16
to
On Wednesday, April 20, 2016 at 9:51:58 AM UTC-4, MitchAlsup wrote:
> I do this kind of parsing with tables::
> while( parsetable[(char=*valstring++)] & NUMBER )
> {
> retval = retval*base+char&0x0F;
> }

What's the "& NUMBER" part for?

Terje Mathisen

unread,
Apr 20, 2016, 4:09:06 PM4/20/16
to
That's the way I did it for many years, except I'd encode the actual
value into the lookup table so that I wouldn't have to worry about the
offset between '9' and 'A' (or 'a').

These days however I'm more interested in figuring out table-less
methods since that allows SIMD vector processing:

Grab the next (aligned?) 8 or 16 bytes, then do parallel compares for
the desired ranges, combining them as needed.

Locate the first non-matching character and use that as the terminator,
then convert the entire block at once. :-)

Lookup tables usually (unless you have good scatter/gather support)
make the algorithm serial.

John Dallman

unread,
Apr 21, 2016, 4:18:26 AM4/21/16
to
In article <nf6lah$t3p$1...@dont-email.me>, iv...@millcomputing.com (Ivan
Godard) wrote:

> Are you readers interested in more work-in-process status?

Definitely.

John

Michael Barry

unread,
Apr 22, 2016, 4:20:25 AM4/22/16
to
On Tuesday, April 19, 2016 at 5:22:47 PM UTC-7, MitchAlsup wrote:
> ...
> I thought it might be fun to see what a RISC instruction set with predication can do to this function: (done by hand so a few typos might be present)
> ... [code snip] ...
>
> Mitch

I tried it for my unfinished 32-bit hobby processor, and parseval
weighed in at 35 machine instructions, occupying 36 words (the
1024*1024 literal needed its own word). Four of those 36 words
were just for restoring the registers it stomped.

Several caveats:
1) It would only "work" for UTF-32 (no native byte addressing).
2) It was hand-optimized, hand-assembled, and hand-tested (no
compiler, assembler, or simulator yet).
3) I am probably the only person in the world weird enough to be
motivated to understand, much less compose, 65m32 machine
language, so plopping the code here without Ivan's permission
would be a bit rude.

It was fun anyway [geek laugh].

Mike B.

Ivan Godard

unread,
Apr 22, 2016, 4:31:37 AM4/22/16
to
Feel free; the source is not mine, it's EEMBC, and is public domain as
far as I know.

Rui Salvaterra

unread,
Apr 22, 2016, 4:39:06 AM4/22/16
to
quarta-feira, 20 de Abril de 2016 às 02:20:30 UTC+1, Ivan Godard escreveu:
>
> Are you readers interested in more work-in-process status?

Yes, please! :)

Bruce Hoult

unread,
Apr 22, 2016, 8:44:19 AM4/22/16
to
On Wednesday, April 20, 2016 at 4:20:30 AM UTC+3, Ivan Godard wrote:
> On 4/19/2016 5:22 PM, MitchAlsup wrote:
> > On Monday, April 18, 2016 at 10:34:53 PM UTC-5, Ivan Godard wrote:
> >
> > I thought it might be fun to see what a RISC instruction set with
> > predication can do to this function: (done by hand so a few typos
> > might be present)
>
> <code snip>
>
> I count 52 instructions, or 210 bytes, vs 337 bytes on Gold. Not really
> comparable, hand code vs pre-alpha compiler, but shows how much we still
> have to do. It would be interesting to see the compiler-produced code
> for this function for comparison; I suspect that Mitch's hand code runs
> rings around anybody's compiler :-)
>
> The Gold code has 89 ops;

And 33 instructions?

Just for interest, I ran the same code through some recent gcc versions I have lying around, using -Os:

powerpc: 212 bytes, 53 instructions
aarch64: 204 bytes, 51 instructions
arm: 176 bytes, 44 instructions
x86_64: 135 bytes, 49 instructions
i386: 130 bytes, 55 instructions
thumb2: 120 bytes, 51 instructions
thumb: 112 bytes, 56 instructions

There can be some vagaries in optimization level, tuning etc. For example, the thumb code is of course valid thumb2 code, so the thumb2 isn't actually the smallest.

Rick C. Hodgin

unread,
Apr 22, 2016, 9:19:08 AM4/22/16
to
FWIW, after Ivan's first post I wrote an i386 version by hand to see what
could be done on x86 by comparison, and was able to do it in 48 instructions
(without spending any extra time trying to optimize it, but just an off-the-
top-of-my-head solution). I could save a few instructions by using i686.

Ivan Godard

unread,
Apr 22, 2016, 12:40:51 PM4/22/16
to
Can you rerun at least a couple of your sample with -O3 (which is in
effect the setting on the code sample I posted)? And if possible post
the PowerPC asm?

Terje Mathisen

unread,
Apr 22, 2016, 2:20:38 PM4/22/16
to
uint16_t parseval(char *str)
{
mov si,[str]
xor di,di ; return value
xor ax,ax
cmp word ptr [si],'x'*256+'0'
je hex
next_decimal:
lodsb
sub al,'0'
jb done
cmp al,9
ja done
lea di,[di+di*4] ; *5
add di,di
add di,ax ; *10+curr
jmp next_decimal

hex:
lodsb
sub al,'0'
jb done
cmp al,9
jbe ok
sub al,'a'-'9'-1
jb done
cmp al,15
ja done
shl di,4 ; *16
add di,ax ; *16+curr
jmp hex

done:
dec si
lodsb
mov cl,20
cmp al,'M'
je shift
mov cl,10
cmp al,'K'
jne skip_shift
shl di,cl
skip_shift:
mov ax,di
}

That's about 40 instructions including the C function overhead (4 ops),
around 75 bytes when I just add up the instruction bytes from memory, so
take that number as a guesstimate...

Rick C. Hodgin

unread,
Apr 22, 2016, 2:27:10 PM4/22/16
to
You've always been a better assembly developer than I am. I've always
thought your assembly skills were something like "my ideal" or "what I
would strive toward were I looking for the best." I've thought that all
along. In fact, when I first saw your tagline about an exercise in
caching back in the day (late 90s??), I remember spending some time
trying to figure out other ways to look at it differently, and couldn't
come up with any. I gave myself the mental assessment at that time that
I was behind you in terms of assembly programming skills, and it is
clear that it remains that way to this day.

Your skills are impressive.

Bruce Hoult

unread,
Apr 22, 2016, 5:46:39 PM4/22/16
to
Sure.

By bytes:

$ ./run.sh -O3 3
thumb2 66 160
i386 65 171
x64 66 215
arm 56 224
thumb 112 224
aarch64 65 260
powerpc 66 264

By instructions:

$ ./run.sh -O3 2
arm 56 224
aarch64 65 260
i386 65 171
powerpc 66 264
thumb2 66 160
x64 66 215
thumb 112 224

In the interests of duplication, my script:

#!/bin/bash
opt=$1
sort=${2:-2}

(for x in powerpc aarch64 arm thumb2 thumb i386 x64;do
thumb=;flags=;isa=$x
case $x in i386|x64) isa=x86;; thumb*) isa=arm; thumb="-mthumb";; esac
case $x in i386) flags="-march=i386 -m32";; arm) v=5;; thumb) v=6;; thumb2) v=7;; esac
case $isa in arm) abi=eabi; flags="-march=armv$v";; esac
case $isa in x86) tc=;; *) tc=${isa}-linux-gnu${abi}-;; esac
echo -n "$x "
${tc}gcc $flags $opt $thumb -c parseval.c -o parseval.o || exit 1
${tc}objdump -d parseval.o | tee parseval.s.$x | perl -ne \
'/^ +([0-9a-f]+):\t([ 0-9a-f]+)/ and ($ins,$adr,$code)=($ins+1,$1,$2);
END{$code=~s/ //g;print $ins," ",hex($adr)+length($code)/2,"\n"}'
done) | sort -n -k$sort

And the PowerPC -O3 code:

00000000 <parseval>:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 39 60 00 01 li r11,1
8: 89 23 00 00 lbz r9,0(r3)
c: 2b 89 00 2d cmplwi cr7,r9,45
10: 41 9e 00 e0 beq cr7,f0 <parseval+0xf0>
14: 2b 89 00 30 cmplwi cr7,r9,48
18: 41 9e 00 78 beq cr7,90 <parseval+0x90>
1c: 39 49 ff d0 addi r10,r9,-48
20: 55 48 06 3e clrlwi r8,r10,24
24: 2b 88 00 09 cmplwi cr7,r8,9
28: 41 9d 00 d8 bgt cr7,100 <parseval+0x100>
2c: 7c 67 1b 78 mr r7,r3
30: 39 00 00 00 li r8,0
34: 60 00 00 00 nop
38: 60 00 00 00 nop
3c: 60 00 00 00 nop
40: 8d 27 00 01 lbzu r9,1(r7)
44: 1d 08 00 0a mulli r8,r8,10
48: 7d 08 52 14 add r8,r8,r10
4c: 39 49 ff d0 addi r10,r9,-48
50: 55 46 06 3e clrlwi r6,r10,24
54: 2b 86 00 09 cmplwi cr7,r6,9
58: 40 9d ff e8 ble cr7,40 <parseval+0x40>
5c: 2b 89 00 4b cmplwi cr7,r9,75
60: 41 9e 00 20 beq cr7,80 <parseval+0x80>
64: 2b 89 00 4d cmplwi cr7,r9,77
68: 40 9e 00 08 bne cr7,70 <parseval+0x70>
6c: 55 08 a0 16 rlwinm r8,r8,20,0,11
70: 7c 68 59 d6 mullw r3,r8,r11
74: 38 21 00 10 addi r1,r1,16
78: 4e 80 00 20 blr
7c: 60 00 00 00 nop
80: 55 08 50 2a rlwinm r8,r8,10,0,21
84: 7c 68 59 d6 mullw r3,r8,r11
88: 38 21 00 10 addi r1,r1,16
8c: 4e 80 00 20 blr
90: 89 23 00 01 lbz r9,1(r3)
94: 39 40 00 00 li r10,0
98: 38 e3 00 01 addi r7,r3,1
9c: 39 00 00 00 li r8,0
a0: 2f 89 00 78 cmpwi cr7,r9,120
a4: 40 9e ff 88 bne cr7,2c <parseval+0x2c>
a8: 60 00 00 00 nop
ac: 60 00 00 00 nop
b0: 8d 27 00 01 lbzu r9,1(r7)
b4: 55 04 20 36 rlwinm r4,r8,4,0,27
b8: 39 49 ff d0 addi r10,r9,-48
bc: 38 c9 ff 9f addi r6,r9,-97
c0: 55 45 06 3e clrlwi r5,r10,24
c4: 2f 8a 00 09 cmpwi cr7,r10,9
c8: 28 85 00 09 cmplwi cr1,r5,9
cc: 2b 06 00 05 cmplwi cr6,r6,5
d0: 40 85 00 08 ble cr1,d8 <parseval+0xd8>
d4: 41 99 ff 88 bgt cr6,5c <parseval+0x5c>
d8: 40 9d 00 08 ble cr7,e0 <parseval+0xe0>
dc: 39 49 ff a9 addi r10,r9,-87
e0: 7d 04 52 14 add r8,r4,r10
e4: 4b ff ff cc b b0 <parseval+0xb0>
e8: 60 00 00 00 nop
ec: 60 00 00 00 nop
f0: 89 23 00 01 lbz r9,1(r3)
f4: 39 60 ff ff li r11,-1
f8: 38 63 00 01 addi r3,r3,1
fc: 4b ff ff 18 b 14 <parseval+0x14>
100: 39 00 00 00 li r8,0
104: 4b ff ff 58 b 5c <parseval+0x5c>

Bruce Hoult

unread,
Apr 22, 2016, 7:23:24 PM4/22/16
to
On Saturday, April 23, 2016 at 4:40:51 AM UTC+12, Ivan Godard wrote:
This is a very interesting sequence in the PowerPC code:

c4: 2f 8a 00 09 cmpwi cr7,r10,9
c8: 28 85 00 09 cmplwi cr1,r5,9
cc: 2b 06 00 05 cmplwi cr6,r6,5
d0: 40 85 00 08 ble cr1,d8 <parseval+0xd8>
d4: 41 99 ff 88 bgt cr6,5c <parseval+0x5c>
d8: 40 9d 00 08 ble cr7,e0 <parseval+0xe0>

Using three compares (I assume probably dispatched in parallel on most) to set three different condition code registers -- a unique PPC feature. And then three branches, one on each of those condition code registers.

Even more intriguing: the first branch branches to the third one!

Thumb2 (and arm, without the "it" and "ite") has a similarly interesting sequence:

6a: 2905 cmp r1, #5
6c: bf88 it hi
6e: 2809 cmphi r0, #9
70: bf8c ite hi
72: 2100 movhi r1, #0
74: 2101 movls r1, #1
76: d8e1 bhi.n 3c <parseval+0x3c>
78: 2100 movs r1, #0

Ivan Godard

unread,
Apr 22, 2016, 10:18:44 PM4/22/16
to
On 4/22/2016 4:23 PM, Bruce Hoult wrote:

>
> This is a very interesting sequence in the PowerPC code:
>
> c4: 2f 8a 00 09 cmpwi cr7,r10,9
> c8: 28 85 00 09 cmplwi cr1,r5,9
> cc: 2b 06 00 05 cmplwi cr6,r6,5
> d0: 40 85 00 08 ble cr1,d8 <parseval+0xd8>
> d4: 41 99 ff 88 bgt cr6,5c <parseval+0x5c>
> d8: 40 9d 00 08 ble cr7,e0 <parseval+0xe0>
>
> Using three compares (I assume probably dispatched in parallel on most) to set three different condition code registers -- a unique PPC feature. And then three branches, one on each of those condition code registers.

There's a PowerPC with three ALUs?

> Even more intriguing: the first branch branches to the third one!

Our code gen produces that when we succeed in hoisting all the ops out
of a basic bloc that has both more than one inbound arc and more than
one outbound arc; I assume other compilers do the same.

I vaguely recall Savard once playing around with 2-in three way branch
op that would address this case, but don't remember the outcome.


Bruce Hoult

unread,
Apr 23, 2016, 6:34:03 AM4/23/16
to
On Saturday, April 23, 2016 at 2:18:44 PM UTC+12, Ivan Godard wrote:
> There's a PowerPC with three ALUs?

2nd gen in 1995 ... PowerPC 604 ... dispatch 4 ins per cycle (and retire up to 6), 2 simple int ALUs, 1 complex int ALU (I think it means multiply/divide as well), FPU, load/store, branch.

So, yeah, I think those three compares could all go in one cycle.

3.6m transistors. Sounds so little now. The 604e added more cache and went really well for the time.

Hmm. The G3, G4, G5 all seem to have gone back to just two integer ALUs but added more load/store. I have no idea about designs since Apple stopped using them.

So, ok, I guess the majority of PPCs used by Apple at least weren't three wide in integer ops. I do remember hand writing some code for 604 that used that though :-)

Quadibloc

unread,
Apr 23, 2016, 9:35:55 AM4/23/16
to
On Friday, April 22, 2016 at 8:18:44 PM UTC-6, Ivan Godard wrote:
> On 4/22/2016 4:23 PM, Bruce Hoult wrote:

> > This is a very interesting sequence in the PowerPC code:
> >
> > c4: 2f 8a 00 09 cmpwi cr7,r10,9
> > c8: 28 85 00 09 cmplwi cr1,r5,9
> > cc: 2b 06 00 05 cmplwi cr6,r6,5
> > d0: 40 85 00 08 ble cr1,d8 <parseval+0xd8>
> > d4: 41 99 ff 88 bgt cr6,5c <parseval+0x5c>
> > d8: 40 9d 00 08 ble cr7,e0 <parseval+0xe0>
> >
> > > Using three compares (I assume probably dispatched in parallel on most) to set three different condition code registers -- a unique PPC feature. And then three branches, one on each of those condition code registers.
>
> There's a PowerPC with three ALUs?

I Googled this. It turns out that the PowerPC has a 32-bit condition register
which is divided into _eight_ four bit fields. Each of those four bit fields is
a set of condition code bits.

The three compare instructions beginning the assembler code example specify
that they're setting condition code groups 7, 1, and 6 respectively.

And the three branches also specify which condition code field is being used by
them.

John Savard

Bruce Hoult

unread,
Apr 23, 2016, 9:54:48 AM4/23/16
to
Ivan asked about ALUs, not condition code registers.

Everyone who's used PPC knows it has eight sets of condition codes :) In the mid 90's I hand wrote code that used even more than three of them. But I don't recall seeing a compiler make this much use of them before.

Compares can set any condition code register, and branches can branch from any set. But arithmetic operations (optionally) set only register zero.

Quadibloc

unread,
Apr 23, 2016, 10:36:03 AM4/23/16
to
On Saturday, April 23, 2016 at 7:54:48 AM UTC-6, Bruce Hoult wrote:

> Ivan asked about ALUs, not condition code registers.

He asked if it had three ALUs because the description of the code sample noted
that it set, and tested, three sets of condition codes. So he had assumed that
each set of condition codes belonged to its own ALU - which, based on how other
architectures work, is perfectly reasonable.

John Savard

Bruce Hoult

unread,
Apr 23, 2016, 10:42:38 AM4/23/16
to
I think Ivan knows PPC architecture a bit better than that :)

His question was because I said all three compares could be done in parallel. And he's right -- most PPCs have only been able to dispatch/execute two integer operations per cycle. They went wider for the 604(e) and 620, but went narrower again later as the bottlenecks on real code were found to be elsewhere.

Niels Jørgen Kruse

unread,
Apr 24, 2016, 5:14:09 AM4/24/16
to
Bruce Hoult <bruce...@gmail.com> wrote:

> His question was because I said all three compares could be done in
> parallel. And he's right -- most PPCs have only been able to
> dispatch/execute two integer operations per cycle. They went wider for
> the 604(e) and 620, but went narrower again later as the bottlenecks on
> real code were found to be elsewhere.

The MPC 7450 had 3 simple integer units.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Anton Ertl

unread,
Apr 24, 2016, 7:11:39 AM4/24/16
to
Bruce Hoult <bruce...@gmail.com> writes:
>On Saturday, April 23, 2016 at 2:18:44 PM UTC+12, Ivan Godard wrote:
>> There's a PowerPC with three ALUs?
>
>2nd gen in 1995 ... PowerPC 604 ... dispatch 4 ins per cycle (and retire up to 6), 2 simple int ALUs, 1 complex int ALU (I think it means multiply/divide as well), FPU, load/store, branch.
>
>So, yeah, I think those three compares could all go in one cycle.
>
>3.6m transistors. Sounds so little now. The 604e added more cache and went really well for the time.
>
>Hmm. The G3, G4, G5 all seem to have gone back to just two integer ALUs but added more load/store.

MPC7447 (later G4): Four integer units (3 simple + 1 complex)

<http://www.nxp.com/products/microcontrollers-and-processors/power-architecture-processors/integrated-host-processors/host-processor:MPC7447>

<http://archive.arstechnica.com/cpu/004/ppc-2/m-ppc-2-1.html> gives an
overview over the G3 and G4 microarchitectures.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Bruce Hoult

unread,
Apr 24, 2016, 9:18:40 AM4/24/16
to
On Sunday, April 24, 2016 at 11:11:39 PM UTC+12, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >On Saturday, April 23, 2016 at 2:18:44 PM UTC+12, Ivan Godard wrote:
> >> There's a PowerPC with three ALUs?
> >
> >2nd gen in 1995 ... PowerPC 604 ... dispatch 4 ins per cycle (and retire up to 6), 2 simple int ALUs, 1 complex int ALU (I think it means multiply/divide as well), FPU, load/store, branch.
> >
> >So, yeah, I think those three compares could all go in one cycle.
> >
> >3.6m transistors. Sounds so little now. The 604e added more cache and went really well for the time.
> >
> >Hmm. The G3, G4, G5 all seem to have gone back to just two integer ALUs but added more load/store.
>
> MPC7447 (later G4): Four integer units (3 simple + 1 complex)

Aha, yes, and the 7445 too, which Apple used in G4 iMacs from 800 MHz - 1.25 GHz.

The 7447a was in Mac Minis from 1.25 - 1.5 GHz. Those boxes went *very* well for the time and I used them as home servers (bought cheaply used) until just a couple of years ago, but somehow I'd stopped looking at the microarchitecture by then and moved on to G5 and Centrino/Core.

Anton Ertl

unread,
Apr 24, 2016, 10:45:55 AM4/24/16
to
Bruce Hoult <bruce...@gmail.com> writes:
>The 7447a was in Mac Minis from 1.25 - 1.5 GHz.

And at least also in the iBook G4 1GHz (I had one, and it actually ran
at 1066MHz).

> Those boxes went *very* well for the time

In my experience they had somewhat worse IPC than Intel/AMD CPUs at
the time, and were not available at such high clock rates. Energy
consumption was good, however. Here's some data:

Latex Benchmark:
CPU/Machine/Software time introduced
Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2) 2.89s 1/2001
iBook G4 12", 1066MHz 7447A, 512KB L2, Debian Sarge GNU/Linux 2.62s 2004*
Athlon (Thunderbird) 800, Abit KT7, PC100-333, RedHat 5.1 2.49s 6/2000
PowerMac G5, 2000MHz PPC970, Gentoo Linux PPC64 1.47s 6/2003
Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit) 0.76s 9/2003

* The 7447 from 2003 could probably also have gone for that clock rate.

Gforth:
siev bubble matrix fib machine and configuration
0.89 1.17 0.58 1.18 PPC7447A (G4) 1066MHz
0.55 0.73 0.32 0.66 Pentium-III 1000 (Coppermine) introduced 3/2000
0.427 0.676 0.303 0.630 PPC970 (G5) 2000MHz; 32-bit executable
0.21 0.33 0.14 0.36 Athlon 64 3200+ (2GHz, 1MB L2)

All results with Gforth 0.6.2 configured with "--enable-force-reg"
(which makes up for a good part for the lack of registers on IA-32 and
the failure of gcc (to this day) to cope with that; hmm, if you
multiply the Intel/AMD results by 2 (typical slowdown without this
option), the PPCs look much more competitive) and compiled with
gcc-2.95. The Athlon 64 result is probably in 32-bit mode.

It looks to me like the 744x/745x were just a few years too late, and
that was deadly in the times of the MHz race.

Bruce Hoult

unread,
Apr 24, 2016, 4:57:40 PM4/24/16
to
On Monday, April 25, 2016 at 2:45:55 AM UTC+12, Anton Ertl wrote:
> In my experience they had somewhat worse IPC than Intel/AMD CPUs at
> the time, and were not available at such high clock rates. Energy
> consumption was good, however. Here's some data:
>
> Latex Benchmark:
> CPU/Machine/Software time introduced
> Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2) 2.89s 1/2001
> iBook G4 12", 1066MHz 7447A, 512KB L2, Debian Sarge GNU/Linux 2.62s 2004*
> Athlon (Thunderbird) 800, Abit KT7, PC100-333, RedHat 5.1 2.49s 6/2000
> PowerMac G5, 2000MHz PPC970, Gentoo Linux PPC64 1.47s 6/2003
> Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit) 0.76s 9/2003

Those look reasonably fair, though a lot comes down to compilers.

> * The 7447 from 2003 could probably also have gone for that clock rate.

Definitely it did, in the 12" and 17" PowerBooks.

My primary laptop in those day was always a Mac. I didn't regard Intel machines as an option until Centrino.

I also always had a honking big Mac desktop, and *also* a series of Intel machines to run Linux. So I had a Pentium Pro 200, then an Athlon 700, than an Athlon XP 3200+.

I found the Pentium Pro 200 a bit faster than a 200 MHz 604e.
The Athlon 700 was a bit faster single core than my 866 G4, but then the G4 had two CPUs, so it was handily faster for most things I cared about.
The Athlon 3200+ was again a bit faster single core than the 2.0 G5. Not as much as you indicate for the (slightly later) Athlon 64. And, again, the G5 had two CPUs, so it won.

In 2004 I was working for a company with a large C++ + OpenGL + Qt app for TV weather graphics. The same code base compiled and ran on all of Mac, Windows, and Linux. The live on air broadcast machines at customers were always Linux, the offline data entry and show preview machines were usually Windows (customer could choose), and 2/3 of the developers preferred to use Mac. The 2.0 G5s built the software quite noticeably faster than the Windows and Linux machines. Alas, I don't remember the specs of those machines, but they would have been the best standard PCs available from Dell.

So, yes, I think it's fair to say that after the Pentium Pro hit, and AMD pressed their advantage when the P4 was a dud, the PowerPC chips were usually a little bit behind on IPC. But the x86 vendors made it very expensive to build a multi-core PC (that's a SERVER, and $$$), while Apple didn't.

Kerr Mudd-John

unread,
May 24, 2016, 10:46:30 AM5/24/16
to
can't we use aam 0x0A ?

> sub al,'0'
> jb done
> cmp al,9
> ja done
> lea di,[di+di*4] ; *5
> add di,di
> add di,ax ; *10+curr
> jmp next_decimal
>
> hex:

don't you want to skip the "0x" part 1st?
e.g. "lodsw"

> lodsb
> sub al,'0'
> jb done
> cmp al,9
> jbe ok
> sub al,'a'-'9'-1
> jb done
> cmp al,15
> ja done
> shl di,4 ; *16
> add di,ax ; *16+curr
> jmp hex
>
> done:
> dec si
> lodsb
or just mov al,[si-1]
> mov cl,20
> cmp al,'M'
> je shift
> mov cl,10
> cmp al,'K'
> jne skip_shift
I think you need a
"shift:"
here
> shl di,cl
> skip_shift:
> mov ax,di
> }
>
> That's about 40 instructions including the C function overhead (4 ops),
> around 75 bytes when I just add up the instruction bytes from memory, so
> take that number as a guesstimate...
>
> Terje
>


--
Bah, and indeed, Humbug

Terje Mathisen

unread,
May 24, 2016, 12:37:17 PM5/24/16
to
Kerr Mudd-John wrote:
> On Fri, 22 Apr 2016 19:20:35 +0100, Terje Mathisen
> <terje.m...@tmsw.no> wrote:
>> uint16_t parseval(char *str)
>> {
>> mov si,[str]
>> xor di,di ; return value
>> xor ax,ax
>> cmp word ptr [si],'x'*256+'0'
>> je hex
>> next_decimal:
>> lodsb
>
> can't we use aam 0x0A ?

Sure. I just never liked to use the hacked versions of the BCD opcodes,
even if they are now all documented as OK.
>
>> sub al,'0'
>> jb done
>> cmp al,9
>> ja done
>> lea di,[di+di*4] ; *5
>> add di,di
>> add di,ax ; *10+curr
>> jmp next_decimal
>>
>> hex:
>
> don't you want to skip the "0x" part 1st?
> e.g. "lodsw"

Good catch.
>
>> lodsb
>> sub al,'0'
>> jb done
>> cmp al,9
>> jbe ok
>> sub al,'a'-'9'-1
>> jb done
>> cmp al,15
>> ja done
>> shl di,4 ; *16
>> add di,ax ; *16+curr
>> jmp hex
>>
>> done:
>> dec si
>> lodsb
> or just mov al,[si-1]

The MOV version would be 3 bytes afair, while DEC + LODSB are both
single-byte opcodes.

>> mov cl,20
>> cmp al,'M'
>> je shift
>> mov cl,10
>> cmp al,'K'
>> jne skip_shift
> I think you need a
> "shift:"
> here

Yes indeed. :-)

>> shl di,cl
>> skip_shift:
>> mov ax,di
>> }
>>
>> That's about 40 instructions including the C function overhead (4
>> ops), around 75 bytes when I just add up the instruction bytes from
>> memory, so take that number as a guesstimate...
>>
>> Terje
>>
>
>


--

wolfgang kern

unread,
May 24, 2016, 1:26:24 PM5/24/16
to

Kerr Mudd-John suggested:
...
> can't we use aam 0x0A ?

Isn't 0x0a the default for AAM anyway? :)

__
wolfgang

already...@yahoo.com

unread,
May 24, 2016, 6:33:21 PM5/24/16
to
On Sunday, April 24, 2016 at 2:11:39 PM UTC+3, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >On Saturday, April 23, 2016 at 2:18:44 PM UTC+12, Ivan Godard wrote:
> >> There's a PowerPC with three ALUs?
> >
> >2nd gen in 1995 ... PowerPC 604 ... dispatch 4 ins per cycle (and retire up to 6), 2 simple int ALUs, 1 complex int ALU (I think it means multiply/divide as well), FPU, load/store, branch.
> >
> >So, yeah, I think those three compares could all go in one cycle.
> >
> >3.6m transistors. Sounds so little now. The 604e added more cache and went really well for the time.
> >
> >Hmm. The G3, G4, G5 all seem to have gone back to just two integer ALUs but added more load/store.
>
> MPC7447 (later G4): Four integer units (3 simple + 1 complex)

The core is e600. There exists pretty good TRM
http://www.nxp.com/files/32bit/doc/ref_manual/E600CORERM.pdf

It looks like 3 CR updates per clock are, indeed, possible.

wolfgang kern

unread,
May 25, 2016, 4:31:37 AM5/25/16
to

Terje Mathisen posted (in part):

> uint16_t parseval(char *str)
...
> lea di,[di+di*4] ; *5
...

it took me a while to figure the opcode for this ;)

my disassembler may not be as smart as your assembler: it treats this
like any other [mem] operand, and for 16-bit mode it shows:

67 8D 3C BF LEA DI,[EDI+EDI*4]

so I may add one line to my ToDo-list.
__
wolfgang

Michael Barry

May 26, 2016, 2:11:24 AM
On Friday, April 22, 2016 at 1:31:37 AM UTC-7, Ivan Godard wrote:
> On 4/22/2016 1:20 AM, Michael Barry wrote:
> > ...
> > 3) I am probably the only person in the world weird enough to be
> > motivated to understand, much less compose, 65m32 machine
> > language, so plopping the code here without Ivan's permission
> > would be a bit rude.
>
> Feel free; the source is not mine, it's EEMBC, and is public domain as
> far as I know.
> ...

0000: parseval: ; if fastcall is assumed,
0000:b4000000 pdy #,a ; *valstring arrives in register a
0001:a0300000 lda #0 ; retval = 0
0002:b2300000 pdx #0 ; neg = false
0003:ba30000a pdu #10 ; base = 10
0004:b8a00000 pdb ,y ; load first char
0005:c830002d cpb #'-'
0006:8217ffff stx [eq] #-1,x ; if leading '-' then set neg to true
0007:a9270001 ldb [eq] 1,y+ ; bump over the '-' and load next char
0008:c8300030 cpb #'0'
0009:a8a70001 ldb [eq] 1,y ; if '0' then peek ahead for an 'x'
000a:c8370078 cpb [eq] #'x'
000b:8a570006 stu [eq] #6,u ; if leading "0x" then bump base to 16
000c:84270002 sty [eq] #2,y ; and bump over the "0x"
000d: whiletop:
000d:a9200000 ldb ,y+ ; load potential digit into b
000e:fc40ffd0 dec #-'0',b ; clear ^C and set ^N if b < '0'
000f:a65afff6 tst [pl] #-10,u ; clear ^N if b > '9'
0010:e83affd9 adb [pl] #'9'-'a'+1 ; correct for b > '9' (no flag effect)
0011:c83a000a cpb [pl] #10 ; clear ^C if '9' < b < 'a'
0012:ca450001 cpu [cs] #1,b ; set ^C if base > digit
0013:ae740003 bcc qualify
0014:6e400000 ldc #,b
0015:72500000 mul #,u ; retval = retval * base + digit
0016:ae70fff6 bra whiletop
0017: qualify:
0017:6e300000 ldc #0
0018:c830004b cpb #'K'
0019:72370400 mul [eq] #1024 ; handle 'K' suffix
001a:c830004d cpb #'M'
001b:72378000 mul [eq] #1024*1024 ; handle 'M' suffix
001c:00100000
001d:a6100000 tst #,x
001e:643b0001 cdd [mi] #1 ; negate retval if necessary
001f:a9e00000 plb
0020:abe00000 plu
0021:a3e00000 plx ; restore stomped registers
0022:a5e00000 ply
0023:afe00000 rts ; return with retval in register a


Mike B.

Terje Mathisen

May 26, 2016, 4:05:48 AM
OK, so your version is actually doing what I suggested, i.e. using common
code for both decimal and hex decoding; good for you!

A similar x86 version could also be a bit shorter than the standard
two-way code:

parseval: ; source in ESI, return value in EAX
xor ecx,ecx ; 2 bytes - accumulator
lea ebx,[ecx+10]; 3 bytes - Shorter way to load base 10!
cmp word ptr [esi],'0' + 'x'*256 ; 5 bytes with 16-bit prefix
jne base_ok ; 2
add bl,6 ; 3
inc esi ; 1 - Skip '0x' prefix
inc esi ; 1
base_ok:
xor eax,eax ; 2
next:
lodsb ; 1
sub al,'0' ; 3
jb done ; 2
cmp al,9 ; 2 or 3
jbe ok ; 2
sub al,'a'-'0' ; 2 or 3
jb done ; 2
add al,10 ; 2 or 3
cmp al,bl ; 2 - input digit >= base?
jae done ; 2
ok:
imul ecx,ebx ; 4 ?
add ecx,eax ; 2
jmp next ; 2
done:
mov eax,ecx ; 2
ret ; 1

I.e. 50+ bytes depending upon the exact encoding of those instructions
where I'm not sure I remember the true count.

Kerr Mudd-John

May 26, 2016, 2:49:51 PM
On Tue, 24 May 2016 17:37:14 +0100, Terje Mathisen
<terje.m...@tmsw.no> wrote:

> Kerr Mudd-John wrote:
[]

>> or just mov al,[si-1]
>
> The MOV version would be 3 bytes afair, while DEC + LODSB are both
> single-byte opcodes.
>
[]

Ah, yes. Sorry.