
still worrying about optimization...


hughag...@nospicedham.gmail.com

Oct 31, 2016, 6:34:52 PM
In the old days of 32-bit x86, the former was always preferred because it was shorter (the 0 gets compiled as a 32-bit literal):
xor ebx, ebx
mov ebx, 0
The problem with the former, however, is that it sets the flags, which can clobber existing flags you need, and (I would expect) this also prevents it from parallelizing with other instructions nearby that set the flags.

What about this?
xor ebx, ebx
mov rbx, 0
Isn't it true nowadays that you don't compile a full-size literal when a 1-byte literal can be sign-extended into full-size? But, we have the REX byte that adds some length, so that has to be considered. I'm really unclear on what the rules are for what machine-code gets generated by the instructions --- this is actually described in the Intel manual, but it is complicated and I haven't learned it yet.

Here is another question that I wonder about:
cmp rbx, 0
test rbx, 0
Which is better? The former is more readable, but in the old days the latter was considered to be much more efficient.

Here is another one:
How much cost is there in doing a JMP (unconditional)? This is always predicted correctly, so there shouldn't be much cost --- the trace-cache doesn't get emptied out and refilled --- OTOH, a new 16-byte paragraph has to be loaded and compiled because the jump destination is not likely to be in the same paragraph as the JMP is. I wonder about this question because quite a lot of my primitives end in DROP --- should I have a JMP to the DROP function, or should I inline the DROP code? Also, is there any difference in speed between a JMP with an 8-bit displacement and a JMP with a 32-bit displacement?

The x86 continues to be mysterious to me --- certainly the most complicated processor that I've ever worked with...

Edward Brekelbaum

Nov 2, 2016, 2:20:28 PM
On Monday, October 31, 2016 at 5:34:52 PM UTC-5, hughag...@nospicedham.gmail.com wrote:
> In the old days of 32-bit x86, the former was always preferred because it was shorter (the 0 gets compiled as a 32-bit literal):
> xor ebx, ebx
> mov ebx, 0
> The problem with the former however, is that it sets the flags which can clobber existing flags you need, and (I would expect) this also prevents it from parallizing with other instructions nearby that set the flags.
>

The "former" is XOR, which is 2 bytes (0x31 /r). The "latter" is MOV which is 5 bytes (B8 +r id). XOR will set flags, while MOV does not.
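To make the size comparison concrete, here are the relevant encodings hand-assembled from the opcode map (a sketch; worth double-checking against a real assembler):

```python
# Hand-assembled x86-64 encodings for the instructions under discussion.
# Byte sequences follow the Intel SDM opcode map.
ENCODINGS = {
    "xor ebx, ebx":  bytes([0x31, 0xDB]),              # 31 /r
    "mov ebx, 0":    bytes([0xBB, 0, 0, 0, 0]),        # B8+r with a full imm32
    "xor rbx, rbx":  bytes([0x48, 0x31, 0xDB]),        # REX.W adds one byte
    "test rbx, rbx": bytes([0x48, 0x85, 0xDB]),        # 85 /r
    "cmp rbx, 0":    bytes([0x48, 0x83, 0xFB, 0x00]),  # 83 /7 ib, sign-extended imm8
}

for asm, enc in ENCODINGS.items():
    print(f"{asm:14} {len(enc)} bytes: {enc.hex(' ')}")
```

Note that `xor ebx, ebx` already zeroes all of RBX in 64-bit mode, so the REX.W byte on `xor rbx, rbx` buys nothing.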

Usually, it is easiest to assume that flags only live to the next instruction. So, you want to pair the op that sets the flags with the op that uses them (CMP; BR). Most modern hardware will then be able to fuse these ops, which should give you better performance.


> What about this?
> xor ebx, ebx
> mov rbx, 0
> Isn't it true nowadays that you don't compile a full-size literal when a 1-byte literal can be sign-extended into full-size? But, we have the REX byte that adds some length, so that has to be considered. I'm really unclear on what the rules are for what machine-code gets generated by the instructions --- this is actually described in the Intel manual, but it is complicated and I haven't learned it yet.

If you specify an 'E' register while in 64 bit mode, the upper 32 bits are cleared. So, yes, clearing operations on 32 bit regs will yield 64 bits of 0.


That said, "MOV EBX, 0" will be 5 bytes (the immediate is matched to the register size); the extending MOVSX and MOVZX take register or memory sources, not immediates.



> Here is another question that I wonder about:
> cmp rbx, 0
> test rbx, 0
> Which is better? The former is more readable, but in the old days the latter was considered to be much more efficient.

Both should be the same. CMP is a subtract which does a little more work than TEST (which is a bitwise and). The difference should not be noticeable in a modern machine (unless you're doing nothing except these ops).
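A small Python model of the flag results (AF and PF omitted; an illustrative sketch, not the full semantics) shows why the operand choice matters: `cmp x, 0` and `test x, x` agree on ZF/SF, whereas `test x, 0` as literally written always reports zero:

```python
MASK = (1 << 64) - 1

def flags_cmp_zero(x):
    """Flags after CMP x, 0: a subtract whose result is discarded.
    Subtracting zero can never borrow (CF) or signed-overflow (OF)."""
    r = x & MASK
    return {"ZF": r == 0, "SF": bool(r >> 63), "CF": False, "OF": False}

def flags_test(x, y):
    """Flags after TEST x, y: a bitwise AND whose result is discarded.
    TEST always clears CF and OF."""
    r = x & y & MASK
    return {"ZF": r == 0, "SF": bool(r >> 63), "CF": False, "OF": False}

for x in (0, 1, 1 << 63, MASK):
    assert flags_cmp_zero(x) == flags_test(x, x)  # cmp x,0 ~ test x,x
    assert flags_test(x, 0)["ZF"] is True         # test x,0 is always "zero"
```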

> Here is another one:
> How much cost is there in doing a JMP (unconditional)? This is always predicted correctly, so there shouldn't be much cost --- the trace-cache doesn't get emptied out and refilled --- OTOH, a new 16-byte paragraph has to be loaded and compiled because the jump destination is not likely to be in the same paragraph as the JMP is. I wonder about this question because quite a lot of my primitives end in DROP --- should I have a JMP to the DROP function, or should I inline the DROP code? Also, is there any difference in speed between a JMP with an 8-bit displacement and a JMP with a 16-bit displacement?
>

Not taken branches are the cheapest (they only consume space in the instruction stream). Always taken branches are next (they take an entry in the BTB).

Determining how instructions will be fed in is enormously complicated and subject to change.

> The x86 continues to be mysterious to me --- certainly the most complicated processor that I've ever worked with...

The best thing is to try both ways and measure the differences. You can develop an intuition, but sometimes reality will surprise you :)


Hope that helps!

James Harris

Nov 2, 2016, 9:06:05 PM
On 31/10/2016 22:30, hughag...@nospicedham.gmail.com wrote:
> In the old days of 32-bit x86, the former was always preferred
> because it was shorter (the 0 gets compiled as a 32-bit literal):
> xor ebx, ebx
> mov ebx, 0
> The problem with the former however, is that it sets the flags
> which can clobber existing flags you need, and (I would expect)
> this also prevents it from parallelizing with other instructions
> nearby that set the flags.

IIRC the six status flags get carried down the pipeline with the renamed
register. Thus if an instruction affects them all or affects none of
them then the operation can take place independently. But an operation
which affects just some of the flags creates a dependency, requiring its
flag changes to be merged with those from the previous instruction.

That's why SUB REG,1 is often faster than DEC REG. The latter is smaller
but needs its flag changes to be merged with the prior flags register.
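The dependency can be sketched in Python (ZF and CF only; hypothetical helper names): SUB produces every flag from its own result, but DEC must carry the previous CF forward, so its flag output depends on whichever instruction last wrote CF:

```python
MASK = 0xFFFFFFFF  # model 32-bit registers

def sub_imm(x, imm, cf_in):
    """SUB reg, imm: rewrites CF from its own borrow; cf_in is irrelevant."""
    r = (x - imm) & MASK
    return r, {"ZF": r == 0, "CF": x < imm}

def dec(x, cf_in):
    """DEC reg: updates ZF (and others) but must preserve the incoming CF."""
    r = (x - 1) & MASK
    return r, {"ZF": r == 0, "CF": cf_in}

_, f_sub = sub_imm(5, 1, True)
_, f_dec = dec(5, True)
assert f_sub["CF"] is False  # SUB's CF comes only from 5 - 1
assert f_dec["CF"] is True   # DEC drags the old CF along -> a flag merge
```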

> What about this?
> xor ebx, ebx
> mov rbx, 0
> Isn't it true nowadays that you don't compile a full-size literal
> when a 1-byte literal can be sign-extended into full-size? But, we
> have the REX byte that adds some length, so that has to be
> considered. I'm really unclear on what the rules are for what
> machine-code gets generated by the instructions --- this is
> actually described in the Intel manual, but it is complicated and
> I haven't learned it yet.
>
> Here is another question that I wonder about:
> cmp rbx, 0
> test rbx, 0
> Which is better? The former is more readable, but in the old days
> the latter was considered to be much more efficient.

Maybe you mean TEST RBX,RBX. If it affects the six status flags
(including invalidating them) then I think either CMP or TEST will be as
fast. But TEST REG,REG will be shorter.

> Here is another one: How much cost is there in doing a JMP
> (unconditional)? This is always predicted correctly, so there
> shouldn't be much cost --- the trace-cache doesn't get emptied out
> and refilled --- OTOH, a new 16-byte paragraph has to be loaded and
> compiled because the jump destination is not likely to be in the same
> paragraph as the JMP is. I wonder about this question because quite a
> lot of my primitives end in DROP --- should I have a JMP to the DROP
> function, or should I inline the DROP code? Also, is there any
> difference in speed between a JMP with an 8-bit displacement and a
> JMP with a 16-bit displacement?
>
> The x86 continues to be mysterious to me --- certainly the most
> complicated processor that I've ever worked with...

Check out the Intel and AMD optimisation guides and Agner Fog's
documents. Frankly, his stuff is more informative.


--
James Harris

Rod Pemberton

Nov 2, 2016, 10:36:13 PM
On Mon, 31 Oct 2016 15:30:03 -0700 (PDT)
hughag...@nospicedham.gmail.com wrote:

> In the old days of 32-bit x86, the former was always preferred
> because it was shorter (the 0 gets compiled as a 32-bit literal):
> xor ebx, ebx
> mov ebx, 0
> The problem with the former however, is that it sets the flags which
> can clobber existing flags you need, and (I would expect) this also
> prevents it from parallizing with other instructions nearby that set
> the flags.

My notes say:

xor ebx, ebx ; destroys flags, not subject to flag stalls
mov ebx, 0h ; preserves flags, but is subject to flag stalls

If you already have a register that is zeroed, you can simply 'mov'
from register to register.

If you have a register that is partially zeroed, you can use 'movzx' or
'movsx' instructions to create a larger sized zero without destroying
flags. If AL or AX is clear, then you can use 'cbw' or 'cwde' to
extend them to AX or EAX. You can even use 'cwd' or 'cdq' to clear DX
or EDX in addition to extending AL or AX into AX or EAX.

IIRC, in general, the x86 doesn't guarantee flags are preserved through
more than one instruction. I.e., you can only be assured that they're
valid for the next instruction, unless you save them with a 'pushf'.
There is usually a table at the end of the processor manuals which shows
what happens to the flags on a per-instruction basis. Many
instructions are listed as having undefined results for various flags
after the instruction executes. This means some flags might not be
preserved through that instruction on certain processors.

> What about this?
> xor ebx, ebx
> mov rbx, 0
> Isn't it true nowadays that you don't compile a full-size literal
> when a 1-byte literal can be sign-extended into full-size? But, we
> have the REX byte that adds some length, so that has to be
> considered. I'm really unclear on what the rules are for what
> machine-code gets generated by the instructions --- this is actually
> described in the Intel manual, but it is complicated and I haven't
> learned it yet.

AFAIK, 'mov' into a 32-bit register doesn't have a sign-extended imm8
form --- the immediate is always the full operand size --- but I'm not
up to date on 64-bit instructions.

For 16-bit/32-bit modes, my notes say:

imm8 supported add, adc, and, cmp, or, sub, sbb, xor
imm8 supported rcl, rcr, rol, ror, sal, sar, shl, shr
imm8 supported shrd, shld
imm8 supported bt, btc, btr, bts, push
reg8/mem8 supported movzx, movsx

You'll have to look up any you're interested in for constraints, e.g.,
'push' is a signed-extension of the imm8.

> Here is another question that I wonder about:
> cmp rbx, 0
> test rbx, 0
> Which is better? The former is more readable, but in the old days the
> latter was considered to be much more efficient.

cmp rbx, 0h ; non-destructive 'sub', destroys flags, no flag stall
test rbx, 0h ; non-destructive 'and', destroys flags, partial flag stall

So, these preserve registers, have no register stalls, but set flags.
'cmp' supports mixed-size operations.

For processors that pair instructions, the instructions you asked
about are pairable: xor, test, cmp, mov.

> Here is another one:
> How much cost is there in doing a JMP (unconditional)? This is always
> predicted correctly, so there shouldn't be much cost --- the
> trace-cache doesn't get emptied out and refilled --- OTOH, a new
> 16-byte paragraph has to be loaded and compiled because the jump
> destination is not likely to be in the same paragraph as the JMP is.
> I wonder about this question because quite a lot of my primitives end
> in DROP --- should I have a JMP to the DROP function, or should I
> inline the DROP code? Also, is there any difference in speed between
> a JMP with an 8-bit displacement and a JMP with a 16-bit displacement?

What? Speak English, please. Assembly language has no DROP
instruction. The Forth programmers on comp.lang.forth may know
about DROP. (Thanks, Google Translate.)

> The x86 continues to be mysterious to me --- certainly the most
> complicated processor that I've ever worked with...

"Do not pass Go. Do not collect $200."


Rod Pemberton

Terje Mathisen

Nov 3, 2016, 9:07:02 AM
Rod Pemberton wrote:
> IIRC, in general, the x86 doesn't guarantee flags are preserved through
> more than one instructions. I.e., you can only be assured that they're
> valid for the next instruction, unless you save them with a 'pushf'.

This is of course wrong!

You can have an arbitrary number of instructions between an opcode which
sets some or all of the flags and a branch which needs those flags, as
long as all those intermediate instructions are documented to NOT modify
the flag(s) you need.

E.g., the shortest loop to copy data from src to dst while calculating
the TCP/IP checksum (carry-wraparound addition) looks like this:

; EDI+ESI -> dst
; ESI -> src
; ECX = # of bytes/4

next32:
mov [edi+esi],eax
adc edx,eax
mov eax,[esi]
lea esi,[esi+4]
dec ecx
jnz next32

The ADC will include the carry from the previous ADC, one iteration
earlier, because MOV and LEA do not modify any flags, and DEC does
not touch the carry flag.

There are of course many other ways to write this; the easiest way to
speed it up is to work with larger registers and accumulate all the
carries in parallel. I.e., using 256-bit SIMD operations you can do 16
16-bit adds at once, then a compare to set a second set of carry
registers to -1 or 0, followed by a subtraction to add this to the
carry register. You can do this 65535 times without any possibility of
overflow, so that's no problem.

At the very end you have to fold the carries into the accumulator and
use horizontal adds to end up with a single 16-bit value.
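For reference, the carry-wraparound sum that the loop accumulates is the RFC 1071 Internet checksum; a Python sketch of the whole computation (not Terje's register layout) looks like this:

```python
def internet_checksum(data: bytes) -> int:
    """Ones'-complement sum of big-endian 16-bit words with end-around
    carry, then a final complement (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"              # pad an odd-length buffer
    total = 0
    for i in range(0, len(data), 2):
        total += int.from_bytes(data[i:i + 2], "big")
    while total >> 16:               # fold carries back in: the
        total = (total & 0xFFFF) + (total >> 16)  # "carry wraparound"
    return ~total & 0xFFFF

# Sample IPv4 header with its checksum field zeroed:
hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(hdr)))
```

Inserting the computed checksum back into the header and summing again yields 0, which is how receivers verify it.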

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Rod Pemberton

Nov 3, 2016, 4:52:33 PM
On Thu, 3 Nov 2016 07:58:13 -0500
Terje Mathisen <terje.m...@nospicedham.tmsw.no> wrote:

> Rod Pemberton wrote:

> > IIRC, in general, the x86 doesn't guarantee flags are preserved
> > through more than one instructions. I.e., you can only be assured
> > that they're valid for the next instruction, unless you save them
> > with a 'pushf'.
>
> This is of course wrong!

Not "in general," it's not. It is in specific constructed situations as
you demonstrated.

The vast majority of x86 instructions which preserve flags are
instructions which are not that useful. They basically consist of
data movement, register loads, looping, and size-conversion
instructions. Most of the instructions that modify values, which
you'll need or that a compiler will generate, will be arithmetic,
binary, test, or shift operations, all of which modify flags. And,
many other instructions are marked as having the results for specific
flags as being undefined, even if the instruction doesn't use that
flag. This means you can't expect that flag to be preserved.
(I constructed lists of these for my own personal use.)

The 6502 is what Hugh was likely thinking about. Preserving flags
through long sequences was "easy" on the 6502. The x86 is not the same
as the 6502 which didn't touch flags the instruction didn't use. The
6502 also didn't change many flags per instruction, unlike the x86
which tends to set or change most of them at once.


Rod Pemberton

James Harris

Nov 3, 2016, 5:22:37 PM
On 03/11/2016 20:43, Rod Pemberton wrote:
> On Thu, 3 Nov 2016 07:58:13 -0500
> Terje Mathisen <terje.m...@nospicedham.tmsw.no> wrote:
>
>> Rod Pemberton wrote:
>
>>> IIRC, in general, the x86 doesn't guarantee flags are preserved
>>> through more than one instructions. I.e., you can only be assured
>>> that they're valid for the next instruction, unless you save them
>>> with a 'pushf'.
>>
>> This is of course wrong!
>
> Not "in general," it's not. It is in specific constructed situations as
> you demonstrated.
>
> The vast majority of x86 instructions which preserve flags are
> instructions which are not that useful. They're basically comprised of
> data movement, register loads, looping, and size conversion
> instructions. Most of the instructions that modify values, which
> you'll need or that a compiler will generate, will be arithmetic,
> binary, test, or shift operations, all of which modify flags.

Not quite all. IIRC one or both of NOT or NEG doesn't set flags. But
that is unusual.

> And,
> many other instructions are marked as having the results for specific
> flags as being undefined, even if the instruction doesn't use that
> flag. This means you can't expect that flag to be preserved.
> (I constructed lists of these for my own personal use.)

Isn't there a table showing flag effects at the back of the 386 manual,
and probably later manuals?

> The 6502 is what Hugh was likely thinking about. Preserving flags
> through long sequences was "easy" on the 6502.

It's odd that you remember it that way. AIR the 6502 was /more/ inclined
to set flags. When I started on an x86 I noticed that register loads
left flags unchanged. The 6502 sets them on register loads.

> The x86 is not the same
> as the 6502 which didn't touch flags the instruction didn't use. The
> 6502 also didn't change many flags per instruction, unlike the x86
> which tends to set or change most of them at once.

True. I think we've discussed this before....


--
James Harris

hughag...@nospicedham.gmail.com

Nov 5, 2016, 10:24:54 PM
On Monday, October 31, 2016 at 3:34:52 PM UTC-7, hughag...@nospicedham.gmail.com wrote:
> What about this?
> xor ebx, ebx
> mov rbx, 0
> Isn't it true nowadays that you don't compile a full-size literal when a 1-byte literal can be sign-extended into full-size? But, we have the REX byte that adds some length, so that has to be considered. I'm really unclear on what the rules are for what machine-code gets generated by the instructions --- this is actually described in the Intel manual, but it is complicated and I haven't learned it yet.

This was a nonsense question because it was the same as the last question.

> Here is another question that I wonder about:
> cmp rbx, 0
> test rbx, 0
> Which is better? The former is more readable, but in the old days the latter was considered to be much more efficient.

This was a typo --- I meant:
cmp rbx, 0
test rbx, rbx

I've made this same typo in programs too...

> Here is another one:
> How much cost is there in doing a JMP (unconditional)? This is always predicted correctly, so there shouldn't be much cost --- the trace-cache doesn't get emptied out and refilled --- OTOH, a new 16-byte paragraph has to be loaded and compiled because the jump destination is not likely to be in the same paragraph as the JMP is. I wonder about this question because quite a lot of my primitives end in DROP --- should I have a JMP to the DROP function, or should I inline the DROP code? Also, is there any difference in speed between a JMP with an 8-bit displacement and a JMP with a 16-bit displacement?

This was a good question.

hughag...@nospicedham.gmail.com

Nov 5, 2016, 10:39:57 PM
On Thursday, November 3, 2016 at 2:22:37 PM UTC-7, James Harris wrote:
> On 03/11/2016 20:43, Rod Pemberton wrote:
> > On Thu, 3 Nov 2016 07:58:13 -0500
> > Terje Mathisen <terje.m...@nospicedham.tmsw.no> wrote:
> >
> >> Rod Pemberton wrote:
> >
> >>> IIRC, in general, the x86 doesn't guarantee flags are preserved
> >>> through more than one instructions. I.e., you can only be assured
> >>> that they're valid for the next instruction, unless you save them
> >>> with a 'pushf'.
> >>
> >> This is of course wrong!
> >
> > Not "in general," it's not. It is in specific constructed situations as
> > you demonstrated.
> >
> > The vast majority of x86 instructions which preserve flags are
> > instructions which are not that useful. They're basically comprised of
> > data movement, register loads, looping, and size conversion
> > instructions. Most of the instructions that modify values, which
> > you'll need or that a compiler will generate, will be arithmetic,
> > binary, test, or shift operations, all of which modify flags.
>
> Not quite all. IIRC one or both of NOT or NEG doesn't set flags. But
> that is unusual.

NOT doesn't set flags --- that is a surprising quirk of the x86.

NEG does set flags.

>
> > And,
> > many other instructions are marked as having the results for specific
> > flags as being undefined, even if the instruction doesn't use that
> > flag. This means you can't expect that flag to be preserved.
> > (I constructed lists of these for my own personal use.)
>
> Isn't there a table showing flag effects at the back of the 386 manual,
> and probably later manuals?

I mentioned this earlier when asking for a book on the "good parts" of x86 assembly-language. The Intel manuals have the information, but it is very difficult to look up (it took me several minutes to find NOT and NEG to answer your question about whether they affect the flags). We could really use a concise description of the x86 that provides basic information such as this, usable as a handy reference.

> > The 6502 is what Hugh was likely thinking about. Preserving flags
> > through long sequences was "easy" on the 6502.
>
> It's odd that you remember it that way. AIR the 6502 was /more/ inclined
> to set flags. When I started on an x86 I noticed that register loads
> left flags unchanged. The 6502 sets them on register loads.

Rod Pemberton doesn't know what he's talking about --- he is not worth responding to.

One of the major problems with the 6502 was that LDA etc. would set the flags --- sometimes I would need to do PHP and PLP to save and restore the flags, to prevent them from getting clobbered --- when I moved to the 8088 I was greatly pleased to find that MOV did not set the flags, which was a much better design.

The 6502 was pretty cool for its day --- it did have some design flaws --- it was designed way back in the 1970s when people didn't know much about assembly-language, so those flaws have to be considered within that context.

Terje Mathisen

Nov 5, 2016, 11:10:00 PM
hughag...@nospicedham.gmail.com wrote:
> This was a typo --- I meant:
> cmp rbx, 0
> test rbx, rbx

Use the shorter version.
>
> I've made this same typo in programs too...
>
>> Here is another one: How much cost is there in doing a JMP
>> (unconditional)? This is always predicted correctly, so there
>> shouldn't be much cost --- the trace-cache doesn't get emptied out
>> and refilled --- OTOH, a new 16-byte paragraph has to be loaded and
>> compiled because the jump destination is not likely to be in the
>> same paragraph as the JMP is. I wonder about this question because
>> quite a lot of my primitives end in DROP --- should I have a JMP to
>> the DROP function, or should I inline the DROP code? Also, is there
>> any difference in speed between a JMP with an 8-bit displacement
>> and a JMP with a 16-bit displacement?
>
> This was a good question.

Any branch, even an unconditional one, is likely to require a Branch
Target Buffer entry, even if it will end up being correctly predicted.

There might be cpu versions which optimize this particular opcode to not
require a BTB, but I would not depend on that.

How large is your DROP word? Using a two-byte branch to reach a common
end point might be reasonable, but if the offset is too large then you
get into the long form branch, at which point the size advantage might
be far lower.
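On the displacement question: in 64-bit mode a direct near JMP has only two forms, rel8 (EB cb, 2 bytes) and rel32 (E9 cd, 5 bytes); there is no rel16 form there. A trivial sketch, assuming the assembler picks the short form whenever the target is in reach:

```python
def jmp_size(displacement: int) -> int:
    """Length in bytes of a direct near JMP (64-bit mode) for a signed
    displacement measured from the end of the JMP, short form preferred."""
    return 2 if -128 <= displacement <= 127 else 5

assert jmp_size(100) == 2   # short form: EB cb
assert jmp_size(4000) == 5  # long form:  E9 cd
```

So a JMP to a far-away _DROP_ costs three extra bytes over the short form, not one.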

hughag...@nospicedham.gmail.com

Nov 6, 2016, 1:27:18 AM
On Saturday, November 5, 2016 at 8:10:00 PM UTC-7, Terje Mathisen wrote:
> Any branch, even an unconditional one, is likely to require a Branch
> Target Buffer entry, even if it will end up being correctly predicted.
>
> There might be cpu versions which optimize this particular opcode to not
> require a BTB, but I would not depend on that.
>
> How large is your DROP word? Using a two-byte branch to reach a common
> end point might be reasonable, but if the offset is too large then you
> get into the long form branch, at which point the size advantage might
> be far lower.

This is it:

func
_drop_: ; a --
lodsd
pop rbx
jmp qword [rdi+rax]
fend

It is pretty short. Isn't it true that on the 64-bit x86 we have 8-bit and 32-bit displacements for Jxx instructions? So, even if _DROP_ is a long ways away, the JMP will still only be three bytes longer than if it were nearby (within about 127 bytes).

Most of my functions are bracketed with the FUNC and FEND macros:

macro FUNC {
local function_start, function_size

if (($-$$) mod paragraph)+(function_size mod paragraph) >= paragraph
align paragraph
end if

function_start = $
start@FUNC equ function_start
size@FUNC equ function_size
}

macro FEND {
size@FUNC = $ - start@FUNC
restore start@FUNC
restore size@FUNC
}

; FUNC and FEND were written by Tomasz Grysztar (http://board.flatassembler.net/topic.php?p=163729#163729).
; These wrap functions and will paragraph-align if necessary to minimize the number of paragraphs occupied by a function.
; According to the Intel Optimization manual, code is decoded into trace-code in paragraph chunks.
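A Python model of the macro's alignment test (hypothetical helper names; `paragraph` is 16 bytes in the real source) may make the heuristic clearer:

```python
PARAGRAPH = 16

def needs_align(offset: int, size: int) -> bool:
    """The FUNC macro's condition: bytes already used in the current
    paragraph plus the function's length mod 16; when together they
    overflow one paragraph, aligning first can save a paragraph."""
    return (offset % PARAGRAPH) + (size % PARAGRAPH) >= PARAGRAPH

def paragraphs_spanned(offset: int, size: int) -> int:
    """How many 16-byte paragraphs a function at `offset` touches."""
    return (offset + size - 1) // PARAGRAPH - offset // PARAGRAPH + 1

# A 10-byte function at offset 12 straddles two paragraphs;
# aligned to 16 it fits in one.
assert needs_align(12, 10) and paragraphs_spanned(12, 10) == 2
assert paragraphs_spanned(16, 10) == 1
```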

I suppose my question actually is: How much speed is lost in a function length going over a paragraph boundary and requiring another paragraph to be loaded and compiled into trace-code?

In the above question, if I JMP to _DROP_ then I certainly require a new paragraph to be loaded and compiled into trace-code --- if I inline the drop code, then I may or may not require a new paragraph to be loaded (depending upon how close I am to the paragraph boundary).

By using JMP _DROP_ I make my functions shorter, so they are less likely to go over a paragraph boundary --- even if only a few bytes are saved nominally, this can actually save almost a whole paragraph because the FUNC for the next function won't have to do an ALIGN 16 at all.

My understanding is that the major speed killer is code-cache thrashing, so the more memory that is saved, the more code will be in the code-cache, and the less code-cache thrashing there will be --- this may make the program faster overall, despite small costs here and there because a new paragraph has to be loaded and compiled into trace-code that would not have been needed if inline code were used instead of JMP termination.

I predict that pretty soon George Neuner will tell me to stop worrying about optimization and just write the program --- worry about optimization later --- his advice is good, of course, but it is difficult for me to not wonder about this optimization stuff and I tend to worry that I'm doing it wrong...

I also predict that pretty soon several people will tell me to test the different versions of my program to determine which is faster. This advice I'm dubious about. Most of the time when I test programs, the results vary a lot --- there is more variance between tests of the same version, than there is between different versions --- this obviously implies that the tests are meaningless.

> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

A lot of people, including myself, have wondered about what your tag-line means. Do you have a webpage somewhere that explains your tag-line?

hughag...@nospicedham.gmail.com

Nov 6, 2016, 1:27:19 AM
Here is another question.

When I was programming the MiniForth, I found that parallelization worked a lot better if data was held in registers for as long as possible. The best thing was to load a register in one instruction, then do some other unrelated instructions, then use that register --- in this case, the instruction that loaded the register would get parallelized with one or more surrounding instructions and hence would have zero cost.

I'm using the same rule-of-thumb now on the x86. For example:

func
_not_: ; a -- NOT-a ; make a proper flag opposite of parameter
mov edx, false ; RDX= false
test rbx, rbx
cmovne rdx, true ; RDX= true if RBX<>0
lodsd
mov rbx, rdx
jmp qword [rdi+rax]
fend

Here the LODSD loads EAX (zero-extending into RAX), which is used in the JMP --- but I put the LODSD earlier in the function in the hopes that it would parallelize better with the surrounding instructions --- I didn't put the LODSD immediately prior to the JMP.

I know this rule-of-thumb was good on the old x86 processors --- when I was working on the MiniForth I read Abrash's Zen book and noticed that he was doing the same kind of optimization on the x86 as I was doing on the MiniForth --- but I wonder if this rule-of-thumb is still good on the new x86 processors.

Philip Lantz

Nov 6, 2016, 2:27:26 AM
Yes, absolutely.

I would write that as:
_not_:
lodsd
test rbx, rbx
setnz bl
movzx ebx, bl
jmp qword [rdi+rax]

(Note that there is no cmovcc r, imm instruction.)
Shouldn't it be cmove rather than cmovne, though (since it is called "not")?

Philip

wolfgang kern

Nov 6, 2016, 4:12:32 AM

hughaguilar wrote:

| This was a typo --- I meant:
| cmp rbx, 0
| test rbx, rbx

these two work differently anyway:

48 85 DB TEST rbx,rbx ;CY=0 OV=0 Aux=? P,Z,S modified

48 83 FB 00 CMP rbx,0 ;all six flags were modified.

so TEST can't be used for <0
__
wolfgang

George Neuner

Nov 6, 2016, 5:27:37 AM
On Sat, 5 Nov 2016 19:26:43 -0700 (PDT),
hughag...@nospicedham.gmail.com wrote:

>The 6502 was pretty cool for its day --- it did have some design flaws
>--- it was designed way back in the 1970s when people didn't know
>much about assembly-language, so those flaws have to be considered
>within that context.

???

At the time the 6502 was designed (circa 1974), computers had been
programmed in textual assembly language for about 20 years, and
programmed in opcodes for decades before that.

They knew plenty about designing instruction sets for programmers.
What they didn't know was much about designing instruction sets for
compilers.

Just because you disagree with the choices they made doesn't mean
there weren't good reasons to make them. Many past accumulator
machines set flags on load - it is an implicit comparison to zero
which (presumably) on average saves more instructions than are wasted
by occasionally having to preserve the flags.

George

Kerr Mudd-John

Nov 6, 2016, 10:12:52 AM
On Sun, 06 Nov 2016 01:26:43 -0100, <hughag...@nospicedham.gmail.com>
wrote:

[]
>
> Rod Pemberton doesn't know what he's talking about --- he is not worth
> responding to.
>
[]

fed up trolling the forth NG?

--
Bah, and indeed, Humbug

Rod Pemberton

Nov 6, 2016, 2:28:06 PM
On Sat, 5 Nov 2016 20:44:10 -0700 (PDT)
hughag...@nospicedham.gmail.com wrote:

[OT for comp.lang.asm.x86]
[follow-ups set to comp.lang.forth]

> I predict that pretty soon George Neuner will tell me to stop
> worrying about optimization and just write the program --- worry
> about optimization later --- his advice is good, of course, but it is
> difficult for me to not wonder about this optimization stuff and I
> tend to worry that I'm doing it wrong...
>
> I also predict that pretty soon several people will tell me to test
> the different versions of my program to determine which is faster.
> This advice I'm dubious about. Most of the time when I test programs,
> the results vary a lot --- there is more variance between tests of
> the same version, than there is between different versions --- this
> obviously implies that the tests are meaningless.
>

Now that everyone has given you an optimized assembly solution for your
DROP issue, I predict that I'll have to ask why so many of your Forth
primitives end with DROP. So, Hugh, WHY do so many of your Forth
primitives end with DROP? I think that's a valid question. Don't you?

I would assume that it would be faster if you eliminated the need for
DROP in the majority of your primitives, even if it causes your other
primitives to be longer. Or, possibly, that it would be better to
rework the inner interpreter or QUIT or NEXT etc. to not need DROP.
Perhaps you could integrate the DROP into NEXT or QUIT etc. There is
most likely some place in your code where the DROP is a "natural fit"
and where the DROP can be optimized away.
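
A hypothetical sketch of that idea, using the register conventions seen elsewhere in this thread (RBX caches top-of-stack, LODSD/JMP is the dispatch); this is an editor's illustration, not Hugh's actual code:

```asm
func
_plus_:                  ; a b -- a+b
        lodsd
        pop  rdx         ; this pop does the work a trailing DROP would do
        add  rbx, rdx    ; RBX caches top-of-stack
        jmp  qword [rdi+rax]
fend
```

Any primitive that directly consumes the cell it would otherwise DROP folds the cost away entirely.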


Rod Pemberton

hughag...@nospicedham.gmail.com
Nov 6, 2016, 5:28:17 PM
Well, I had not been aware of SETxx or MOVZX --- those are definitely worth knowing --- these are the kinds of instructions that I would have liked to have in the "good parts" book so I could learn about them.

You are right --- I had the flag backwards --- thanks for pointing that out.

You are also right that CMOVxx lacks an immediate addressing-mode --- I knew about that --- I had global variables holding the 1 and 0 values, which is a hassle.

hughag...@nospicedham.gmail.com
Nov 6, 2016, 8:58:30 PM
On Sunday, November 6, 2016 at 12:27:26 AM UTC-7, Philip Lantz wrote:
> hughag...@nospicedham.gmail.com wrote:
> > func
> > _not_: ; a -- NOT-a ; make a proper flag opposite of parameter
> > mov edx, false ; RDX= false
> > test rbx, rbx
> > cmovne rdx, true ; RDX= true if RBX<>0
> > lodsd
> > mov rbx, rdx
> > jmp qword [rdi+rax]
> > fend
>
> I would write that as:
> _not_:
> lodsd
> test rbx, rbx
> setnz bl
> movzx ebx, bl
> jmp qword [rdi+rax]
>
> (Note that there is no cmovcc r, imm instruction.)
> Shouldn't it be cmove rather than cmovne, though (since it is called "not")?
>
> Philip

Here is my new version:

func
_not_: ; a -- NOT-a ; make a proper flag opposite of parameter
test rbx, rbx
setz bl ; BL= 1 if zero or 0 if non-zero
lodsd
movzx ebx, bl
shl rbx, 32
jmp qword [rdi+rax]
fend

Note that I'm assuming unity to be 2^32 in Straight Forth (hence the SHL in there).

That SETxx is very useful! Thanks for telling me about it. I had read Chapter 7 of the Intel manual, but missed the paragraph on SETxx --- I had considered MOVZX to be not useful, and I forgot about it --- I still don't know what some of those instructions are used for.

Does it make any difference in regard to parallelization where the LODSD goes? You put it at the beginning, and I put it in the middle. I put it after BL gets set and before BL gets used, because setting and using a register can't parallelize, but the LODSD can parallelize with one or the other. That is what I used to do in MiniForth assembly --- split the setting and using of a register with an instruction that is unrelated.

BTW: For the MiniForth, my assembler would pack instructions together into a single opcode at compile-time --- on the x86, this is done at run-time by the x86 itself --- it seems very weird that the x86 does compilation at run-time; I suppose this technique was used to prevent people from finding out how their trace-code works.

X86 machine-code is like a high-level language that is getting JIT'ed into trace-code (a low level machine-code), but nobody knows what that trace-code looks like.

Philip Lantz
Nov 6, 2016, 11:13:40 PM
I put the lodsd as early as possible in order to get the result as soon as
possible for the jump. The four operations on b have to execute in sequence,
but all four of them can be executed while waiting for the load to complete.

A modern processor parallelizes in a much more general way than you imagine.
The processor initiates the memory read and then keeps executing the following
instructions. It keeps track of all the instructions that are in flight and
retires them in order once the results are available. It can retire multiple
instructions per clock. Four per clock, I think, although this may vary for
different processors. I think some huge number like 128 micro-ops can be in
flight at a time.

I'm sure there are people here that understand this way better than me and
can clarify some details. Most of what I know is from reading the optimization
manual.

Philip

Obligatory disclaimer: I work for Intel, but I do not speak for them.
All opinions and errors herein are my own.

hughag...@nospicedham.gmail.com
Nov 8, 2016, 2:00:12 AM
On Sunday, November 6, 2016 at 9:13:40 PM UTC-7, Philip Lantz wrote:
> hughag...@nospicedham.gmail.com wrote:
> > Does it make any difference in regard to parallelization where the LODSD goes?
> > You put it at the beginning, and I put it in the middle. I put it after BL gets
> > set and before BL gets used, because setting and using a register can't
> > parallelize, but the LODSD can parallelize with one or the other.
>
> I put the lodsd as early as possible in order to get the result as soon as
> possible for the jump. The four operations on b have to execute in sequence,
> but all four of them can be executed while waiting for the load to complete.

Thanks for the intel (so to speak).

I may have been over-thinking the subject, trying to put the LODSD in the perfect spot --- I'll just put it as early as possible as you suggest, which is easier --- apparently, it is a very slow instruction, if 4 others can execute while it is executing.

> A modern processor parallelizes in a much more general way than you imagine.

It was in 1994/1995 that I wrote MFX for the MiniForth --- somewhat dated now --- that is my only experience with parallelization (plus reading Abrash's Zen book at the same time).

I have read in the Intel optimization manual that XCHG is very slow, so I write _SWAP_ like this:

func
_swap_: ; a b -- b a
lodsd
pop rdx
push rbx
mov rbx, rdx
jmp qword [rdi+rax]
fend

Rather than the more obvious:

func
_swap_: ; a b -- b a
lodsd
xchg rbx, [rsp]
jmp qword [rdi+rax]
fend

That one I'm pretty sure of. I wonder though, which _OVER_ is best:

func
_over_: ; a b -- a b a
lodsd
pop rdx
push rdx
push rbx
mov rbx, rdx
jmp qword [rdi+rax]
fend

func
_over_: ; a b -- a b a
lodsd
push rbx
mov rbx, [rsp+8]
jmp qword [rdi+rax]
fend

Here is another more extreme case:

func
_rover_: ; a b c -- a b c a
pop rax
pop rdx
push rdx
push rax
lodsd
mov rbx, rdx
jmp qword [rdi+rax]
fend

func
_rover_: ; a b c -- a b c a
lodsd
push rbx
mov rbx, [rsp+8+8]
jmp qword [rdi+rax]
fend

I read in the Intel optimization manual that juxtaposed PUSH and POP instructions get compiled efficiently. Stack-juggling words such as these are typically a big speed sink in Forth, so this stuff is important. The peephole optimizer will, as much as possible, combine producers (such as OVER etc.) with consumers (such as + etc.). For example:

func
_over_plus_: ; a b -- a b+a
lodsd
add rbx, [rsp]
jmp qword [rdi+rax]
fend

For the most part, the big speed killer is:
jmp qword [rdi+rax]
Combining primitives together reduces how many times this happens, which should help the speed a lot --- this actually may be more important than how the primitives are written --- this implies that I should just make the primitives as short as possible so I can maximize how many primitives fit in the code-cache.
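
For reference, the dispatch sequence in question is only two instructions under the conventions visible in the snippets above (RSI as the Forth instruction pointer, RDI as the token-table base):

```asm
        lodsd                   ; EAX = next 32-bit token, RSI advances
        jmp  qword [rdi+rax]    ; indirect jump through the token table
```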

I read the Intel optimization manual, but I didn't learn much. There are a lot of hints for boosting the speed --- there is no indication as to how important they are relative to each other --- because the hints are typically contradictory to each other, I'm left not knowing any more than I did before I began studying.

Philip Lantz
Nov 8, 2016, 4:30:22 AM
hughag...@nospicedham.gmail.com wrote:
> Philip Lantz wrote:
> > hughag...@nospicedham.gmail.com wrote:
> > > Does it make any difference in regard to parallelization where the LODSD goes?
> > > You put it at the beginning, and I put it in the middle. I put it after BL gets
> > > set and before BL gets used, because setting and using a register can't
> > > parallelize, but the LODSD can parallelize with one or the other.
> >
> > I put the lodsd as early as possible in order to get the result as soon as
> > possible for the jump. The four operations on b have to execute in sequence,
> > but all four of them can be executed while waiting for the load to complete.
>
> Thanks for the intel (so to speak).
>
> I may have been over-thinking the subject, trying to put the LODSD in the
> perfect spot --- I'll just put it as early as possible as you suggest, which
> is easier --- apparently, it is a very slow instruction, if 4 others can
> execute while it is executing.

Yes, any memory access is very slow. Even an L1 cache hit probably takes 4
clocks. And of course those 4 instructions are among the simplest ones and
take one clock each. If the memory access misses all the caches, it may take
hundreds of clocks.


> I have read in the Intel optimization manual that XCHG is very slow, so I
> write _SWAP_ like this:
>
> func
> _swap_: ; a b -- b a
> lodsd
> pop rdx
> push rbx
> mov rbx, rdx
> jmp qword [rdi+rax]
> fend
>
> Rather than the more obvious:
>
> func
> _swap_: ; a b -- b a
> lodsd
> xchg rbx, [rsp]
> jmp qword [rdi+rax]
> fend
>
> That one I'm pretty sure of.

Yes, an xchg that references memory is a locked operation, so it's safe
to say that it is never a good choice unless you need the locked behavior.
Also, even though it looks nice and compact, the second one is actually
the same size as the first one (5 bytes, not counting the jmp).

> I wonder though, which _OVER_ is best:
>
> func
> _over_: ; a b -- a b a
> lodsd
> pop rdx
> push rdx
> push rbx
> mov rbx, rdx
> jmp qword [rdi+rax]
> fend
>
> func
> _over_: ; a b -- a b a
> lodsd
> push rbx
> mov rbx, [rsp+8]
> jmp qword [rdi+rax]
> fend

The first one performs an unnecessary memory write, but that probably
doesn't cost much. (Subsequent instructions don't have to wait for a
write to complete.) I prefer the second one, but I don't actually know
how much [rsp+8] takes, so I could be wrong. In this case, I might
just go with the smaller one. (The first one is 6 bytes; the second
one is 7 bytes, not counting the jmp.)

> Here is another more extreme case:
>
> func
> _rover_: ; a b c -- a b c a
> pop rax
> pop rdx
> push rdx
> push rax
> lodsd
> mov rbx, rdx
> jmp qword [rdi+rax]
> fend
>
> func
> _rover_: ; a b c -- a b c a
> lodsd
> push rbx
> mov rbx, [rsp+8+8]
> jmp qword [rdi+rax]
> fend

The first one performs an unnecessary memory read and two unnecessary
writes, which I think are going to cost more than the complex addressing
mode. (Also you forgot a push rbx.) So I like the second one. But I
don't actually know how much [rsp+16] takes, so I could be wrong. As
above, I might just go with the smaller one. (The first one is 8
bytes, the second one is 7 bytes, not counting the jmp.)

> I read in the Intel optimization manual that PUSH and POP instructions
> juxtaposed get compiled efficiently.

As I understand it, the micro-ops are very efficient, but the memory
accesses still need to be performed. Stack accesses presumably always
hit L1 cache, so you're looking at ~4 clocks per read.

> Stack juggling words such as this
> typically are a big speed sink in Forth, so this stuff is important.
> The peephole optimizer will, as much as possible, combine producers
> (such as OVER etc.) with consumers (such as + etc.). For example:
>
> func
> _over_plus_: ; a b -- a b+a
> lodsd
> add rbx, [rsp]
> jmp qword [rdi+rax]
> fend

Yes, this should help a lot (assuming the cost of the work to do the
optimization doesn't overwhelm the number of times it is executed).

> For the most part, the big speed killer is:
> jmp qword [rdi+rax]
> Combining primitives together reduces how many times this happens,
> which should help the speed a lot --- this actually may be more
> important than how the primitives are written --- this implies that
> I should just make the primitives as short as possible so I can
> maximize how many primitives fit in the code-cache.
>
> I read the Intel optimization manual, but I didn't learn much.
> There are a lot of hints for boosting the speed --- there is no
> indication as to how important they are relative to each other
> --- because the hints are typically contradictory to each other,
> I'm left not knowing any more than I did before I began studying.

You mentioned earlier that you haven't found taking actual measurements
to be helpful, because the variations within a single run exceed the
differences between runs. If that is the case, you really are over-
thinking this. (I trust you've heard of the first two rules of
optimization.) I'm only spending time on this because I enjoy staring
at assembly code and thinking about how it will behave, not because
I think it's actually likely to be helpful to the success of your
project. :-)

Philip

hughag...@nospicedham.gmail.com
Nov 8, 2016, 2:45:57 PM
I had read in the Intel optimization manual that [RSP+nnn] is bad, and that PUSH and POP should be used instead.

It is possible that what they meant was that this messes up RET prediction. That is irrelevant in my case however, because I'm not using CALL and RET --- so I should just go with the [RSP+nnn] version that is both easier to read and has no unneeded memory-access.

Apparently, memory-access continues to be a big speed killer, even with caching.
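
As a concrete case, a primitive that only discards a buried cell needs no memory traffic at all; a sketch in the same style as the snippets upthread (the word name is hypothetical):

```asm
func
_nip_:                   ; a b -- b
        lodsd
        add  rsp, 8      ; discard a without reading or writing it
        jmp  qword [rdi+rax]
fend
```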

> I'm only spending time on this because I enjoy staring
> at assembly code and thinking about how it will behave, not because
> I think it's actually likely to be helpful to the success of your
> project. :-)

Well, my goal is to generate code that executes twice as fast as SwiftForth code (half as fast as VFX), which is the minimum needed for anybody to take the compiler seriously. SwiftForth generates STC (subroutine-threaded code), but it has a very bad optimizer --- it is just inlining the short sub-functions rather than compiling CALLs to them --- the code is very bloated, which results in a lot of code-cache thrashing. I should get better speed with ITC (indirect-threaded code) because most or all of my primitives are in the code-cache. Also, I'm using the 64-bit x86, so I get 16 registers rather than 8, which should reduce memory-access a lot (for example: SwiftForth doesn't hold the local-frame pointer in a register, but uses a global variable instead, which is extremely slow).

Mostly I expect to get better speed because I support data structures. I have "chains" and arrays. I made up the term "chain" because I want to standardize behavior rather than implementation. Typically the chain would be an AVL tree with an UP pointer (to allow LEFT and RITE etc. to work). ANS-Forth doesn't support any data structures, so the implementation of any data-structure must be done by the application-programmer. This usually results in slow code because the compiler doesn't optimize very well. By comparison, my chain and array support code is hand-written assembly-language. Also, ANS-Forth programmers implement data-structures inside of application program code, usually by cut-and-paste-and-tweak of old application programs --- their programs are over-complicated, error-prone and hard to read --- ANS-Forth is a horrible language; it is as bad as COBOL, and for all of the same reasons.

I'm just getting started on the implementation of the compiler. The design took a long time (32 years), but I think I've got something good now. I have a document describing the design if anybody is interested. Implementation is slow going as this is my first 64-bit x86 program. As for whether the project will be a "success" or not, that depends upon your definition of "success." I'm going to give it away for free, so forget about any definition that involves making money. My own definition is that I will save Forth --- ever since ANS-Forth became the standard in 1994, the Forth language has been considered to be a joke --- if my Straight Forth will cause people to take Forth seriously again, like they did prior to ANS-Forth, then I will consider it to be a success.

hughag...@nospicedham.gmail.com
Nov 8, 2016, 4:16:05 PM
On Saturday, November 5, 2016 at 10:27:18 PM UTC-7, hughag...@nospicedham.gmail.com wrote:
> Most of my functions are bracketed with the FUNC and FEND macros:
>
> macro FUNC {
> local function_start, function_size
>
> if (($-$$) mod paragraph)+(function_size mod paragraph) >= paragraph
> align paragraph
> end if
>
> function_start = $
> start@FUNC equ function_start
> size@FUNC equ function_size
> }
>
> macro FEND {
> size@FUNC = $ - start@FUNC
> restore start@FUNC
> restore size@FUNC
> }
>
> ; FUNC and FEND were written by Tomasz Grysztar (http://board.flatassembler.net/topic.php?p=163729#163729).
> ; These wrap functions and will paragraph-align if necessary to minimize the number of paragraphs occupied by a function.
> ; According to the Intel Optimization manual, code is decoded into trace-code in paragraph chunks.

FASM is the only assembler that I know of, for any processor, that has a macro language capable of the FUNC and FEND macros shown above. I am very impressed!
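
For readers who skipped the earlier post, a primitive wrapped in these macros looks like this (an editor's sketch consistent with the snippets upthread):

```asm
func
_drop_:                  ; a --
        lodsd
        pop  rbx         ; new top-of-stack
        jmp  qword [rdi+rax]
fend
```

Before emitting the function, FUNC checks whether the function would straddle a 16-byte paragraph boundary and aligns only when doing so reduces the number of paragraphs the function spans.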

hughag...@nospicedham.gmail.com
Nov 8, 2016, 6:03:07 PM
Rod Pemberton is a troll. Most likely, he knew that LDA etc. on the 6502 set the flags whereas MOV on the x86 doesn't --- he just said it backwards so that somebody would respond to him. He cross-posted to comp.lang.forth so that any responses (none yet!) would show up on both forums --- cross-posting is troll behavior that should have gotten his post deleted by the moderator --- any further discussion of Rod Pemberton should be taken to this thread:
https://groups.google.com/forum/#!topic/alt.lang.asm/ejInq4uRta4

Anyway, in response to George Neuner (not a troll), let me say that the Mostek 6502 (it was 1975 not 1972) was far superior to the Motorola 6800 and Intel 8080 (both 1974). To a large extent, the 6502 kick-started the microprocessor age! It is debatable if the 65c02 was superior to the Zilog Z80 (I liked the 65c02, for whatever my opinion is worth) --- we could debate on this until our hair turns gray, and never get a resolution, so lets not go there...

But I will also say that the 65c02 had design flaws (above I said that it is "superior" not "perfect"). If I could go back in a time-machine and offer some suggestions, these would certainly change the course of computer history dramatically:

1.) Get rid of the indirect-X addressing mode --- it was a waste of chip-space --- nobody ever used it for anything.

2.) Change the jump-indirect to use the Y register rather than the X register (the X register is needed for use as the Forth data-stack pointer) --- token-threaded code will be needed for many languages (Apple Pascal, Promal, etc.) --- this saves a lot of memory, and memory was expensive in the 1970s.

3.) Fix the return-address so it is a valid address (on the 6502 it is off by one, which is a bug not a feature). Also, provide LDS and STS instructions for loading and saving the S register to zero-page --- and fix the S register so that it is a valid address (when saved to zero-page as a low-byte, given 1 as the high-byte, you get a valid zero-page pointer).

4.) Support 128KB (64KB for code and 64KB for data) by having a 17-bit address bus and switching the high bit back and forth as needed during instruction execution (get rid of that set-overflow pin that was never used) --- memory will become less expensive in the 1980s, and people will want bigger programs that work with more data.

Note that #2 above would have made high-level languages realistic in the 1970s when memory was still super-expensive --- this would have made computers useful at a time when (in the actual history) they were just toys --- this is, I think, what George Neuner was saying in his post.

Note that the #3 bug-fixes above open a lot of doors. LDS and STS would have allowed a local-frame pointer in zero-page (pointing into the return-stack) allowing both Forth or C to have local variables, which would have made reentrant code possible, which would have made a multi-tasking OS realistic. We could have also had efficient direct-threaded-code in Forth (every colon word starts with a JSR to DOCOLON, and the DOCOLON pulls the return address from the S stack and stores it in IP which is a zero-page pointer).

Note that #4 above would have made the 6502 competitive for business use with the 8088 (the only reason the 8088 was used, was that it allowed for more memory; it was much slower than the 65c02 for games and other programs requiring little memory) --- CP/M would have expired quickly and without lamentation, and MS-DOS would have never come into existence --- we would have instead gotten a future involving a better OS built on a better processor. :-D

My discussion of the 6502 in this post is off-topic for CLAX, but I don't see this as being a hijacking of our thread in which we are discussing 64-bit x86 optimization (I started the thread, so is it possible for me to hijack my own thread?). Is it true that everybody on CLAX has already had their hair turn gray? Lets have a show of hands --- is there anybody here under the age of 40? --- is there anybody here who doesn't think that the 6502 is the coolest processor of all time?

Frank Kotler
Nov 8, 2016, 6:48:11 PM
hughag...@nospicedham.gmail.com wrote:

...
> Rod Pemberton is a troll.


Bye, Hugh.

Sincerely,
Frank

Terje Mathisen
Nov 9, 2016, 2:44:37 AM
hughag...@nospicedham.gmail.com wrote:
> Note that #4 above would have made the 6502 competitive for business
> use with the 8088 (the only reason the 8088 was used, was that it
> allowed for more memory; it was much slower than the 65c02 for games

That is simply WRONG.

The alternative to the 808x was the 68000, IBM ended up with the 8088
explicitly due to the support for 8-bit bus peripherals, something which
the 68K family didn't support at the time afair?

> and other programs requiring little memory) --- CP/M would have
> expired quickly and without lamentation, and MS-DOS would have never
> come into existence --- we would have instead gotten a future
> involving a better OS built on a better processor. :-D

I.e. the 68K which in its 68020 incarnation was one of the worst HW
optimization targets in modern history due to far too many funky
addressing modes.
>
> My discussion of the 6502 in this post is off-topic for CLAX, but I
> don't see this as being a hijacking of our thread in which we are
> discussing 64-bit x86 optimization (I started the thread, so is it
> possible for me to hijack my own thread?). Is it true that everybody
> on CLAX has already had their hair turn gray? Lets have a show of
> hands --- is there anybody here under the age of 40? --- is there

59.

> anybody here who doesn't think that the 6502 is the coolest processor
> of all time?

Yeah, me.

The Mill is FAR cooler, take a look at http://millcomputing.com/

barry...@nospicedham.yahoo.com
Nov 9, 2016, 4:18:27 AM
On Tuesday, November 8, 2016 at 3:03:07 PM UTC-8, hughag...@nospicedham.gmail.com wrote:
> ...
> Lets have a show of hands --- is there anybody here
> under the age of 40?

50 over here.

> ... is there anybody here who doesn't think that the 6502 is the
> coolest processor of all time?

No argument from me, but you know what they say about opinions ...
I even think that its condition code updating behavior is "perfect".

Thanks, I'll show myself out ...

Mike B.

Robert Wessel
Nov 9, 2016, 5:18:55 AM
On Wed, 9 Nov 2016 08:37:06 +0100, Terje Mathisen
<terje.m...@nospicedham.tmsw.no> wrote:

>hughag...@nospicedham.gmail.com wrote:
>> Note that #4 above would have made the 6502 competitive for business
>> use with the 8088 (the only reason the 8088 was used, was that it
>> allowed for more memory; it was much slower than the 65c02 for games
>
>That is simply WRONG.
>
>The alternative to the 808x was the 68000, IBM ended up with the 8088
>explicitly due to the support for 8-bit bus peripherals, something which
>the 68K family didn't support at the time afair?


The 68000 had no issues with 8-bit peripherals, but it preferred
6800-style ones (although generating an 8080-compatible bus from that,
and then using 8080-style devices, was not very complicated).
Support for 8-bit 6800 devices was an explicit goal for the 68000; the
only real oddity was that such devices would occupy alternate
addresses instead of consecutive ones.

The 68000 did not, however, do a good job of supporting 8-bit memory,
sharing that trait with the 8086, which also did not get selected. The
8-bit-bus 68008 shipped far too late to end up in the IBM PC. And the
68008 was even more memory bound than the 8088 was.

I don't doubt a high degree of compatibility with 8080 code and
hardware was considered a plus. And remember that IBM's original
sales target for the PC was only 250,000 units over 5 years, which
made 8080 compatibility even more desirable.

George Neuner
Nov 9, 2016, 3:34:30 PM
On Tue, 8 Nov 2016 14:56:39 -0800 (PST),
hughag...@nospicedham.gmail.com wrote:

>On Sunday, November 6, 2016 at 3:27:37 AM UTC-7, George Neuner wrote:
>>
>> At the time the 6502 was designed (circa ~1972),
>>
>
>let me say that the Mostek 6502 (it was 1975 not 1972)

The 6502 was made by MOS Technology. Mostek was a different company.
Confusing not just because of the similar names, but because they both
were formed by defectors from Texas Instruments.
https://en.wikipedia.org/wiki/MOS_Technology
https://en.wikipedia.org/wiki/Mostek


The 6502 was released in 1975, but it was designed earlier [though
perhaps not quite as early as '72]. The 6501 and 6502 were designed
concurrently, but at that time, MOS was running near capacity as a
second source for TI chips. Mass production of the 65xx chips meant
abandoning much of that revenue stream, so it took a while to get them
into production.

A lawsuit by Motorola over the 6501 quickly halted production of that
chip. The 6501 and 6502 essentially were the same chip, but the 01
was pin compatible with Motorola's 6800, whereas the 6502 was not. So
the 6502 survived.


>It is debatable if the 65c02 was superior to the Zilog Z80 (I
>liked the 65c02, for whatever my opinion is worth) --- we could
>debate on this until our hair turns gray, and never get a
>resolution, so lets not go there...

I also cut teeth on 65c02. [An Apple //e]

The Z80, and the 8080 before it, arguably was better if you needed a
lot of 16-bit operations. The 65c02 had a (IMO) more well rounded
instruction set, and was noticeably faster when clocked comparably. The
Z80 needed a 4:1 clock advantage to beat the 65c02 on a general mix of
code. [8080 was even slower and needed 6:1]


George

Robert Wessel
Nov 9, 2016, 4:04:33 PM
On Wed, 09 Nov 2016 15:28:25 -0500, George Neuner
<gneu...@nospicedham.comcast.net> wrote:

>The Z80, and the 8080 before it, arguably was better if you needed a
>lot of 16-bit operations. The 65c02 had a (IMO) more well rounded
>instruction set, and was noticably faster when clocked comparably. The
>Z80 needed a 4:1 clock advantage to beat the 65c02 on a general mix of
>code. [8080 was even slower and needed 6:1]


More like 2:1. The 6502 needed a two-phase clock - so a 1MHz 6502
basically had 2 million clock pulses each second.

wolfgang kern
Nov 11, 2016, 5:18:27 AM

Edward Brekelbaum said (in part):

| Usually, it is easiest to assume that flags only live to the next
| instruction.

good advice for those who haven't done their homework yet :)

assembler programmers got a huge advantage over HLL-coders
by making use of flags and write shorter faster code.

part of my old list on condition flags:

------------
Instructions that won't alter any flags:

BSWAP
CALL (all variants)
CBW/CWD/CDQ
CLTS
CMOV
CPUID
ENTER
IN/OUT
INS/OUTS
INVD
INVLPG
JMP (all variants)
Jcc
JCXZ
LAHF
LDS/LES/LSS/LFS/LGS
LEA
LEAVE
LGDT
LIDT
LMSW
LODS
LOOP
LOOPz
LOOPnz
LTR
MOV (all)
MOVS
MOVSX
MOVZX
NOP
NOT
POP (all)
POPA
PUSH (all)
PUSHA
PUSHF
RDMSR
RDPMC
RDTSC
REP
REPZ
REPNZ
RET
RETnn
RETF
RETFnn
SETcc
STOS
STR
SYSEXIT (but see SYSENTER)
WAIT
WBINVD
WRMSR
XCHG
XLAT
SALC

all prefix bytes
and all FPU and SSE instructions except a handful.

----------
Instructions that alter some flags but keep all CC-flags:
(INT routines may alter the pushed flags)

BOUND (IF INT5)
INTxx/INT3/INTO
SYSENTER
----------
Instructions which alter the Zero flag only:

ARPL
LAR
LSL
VERR
VERW
----------
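An example of the advantage described above: because MOV, POP, LEA and the rest of the first list preserve the condition codes, a flag set early can be consumed several instructions later (an editor's sketch; `tos_zero` is a placeholder label):

```asm
        test rbx, rbx         ; set ZF once
        mov  rdx, rbx         ; MOV leaves the flags alone
        pop  rbx              ; so do PUSH and POP
        lea  rsi, [rsi+8]     ; and LEA
        jz   tos_zero         ; still branches on the TEST above
```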

I may update my list and add newer instructions some day...
__
wolfgang

Rod Pemberton
Nov 11, 2016, 6:18:33 AM
On Fri, 11 Nov 2016 11:09:27 +0100
"wolfgang kern" <now...@never.at> wrote:

> Edward Brekelbaum said (in part):

(from comp.lang.asm.x86)
I've also compiled such lists of various instruction features, and over
time it became a large document. This is about 80% of said document.
20% is not posted for various reasons. You'll need fixed width text
font. Also, slight editing and reformatting was done due to Usenet
line-wraps. You may need to unwrap some lines.

wolfgang, the sections with similar headings to yours are somewhat
further down.

<--start-->
x86 instruction information
compiled and authored by Rod Pemberton
----

NOTE: For brevity, most of this document will use 16-bit registers
when there is a choice between either a 16-bit or 32-bit register.
e.g.,
ax - This means ax or eax.

NOTE: In general, this document has no 64-bit coverage.

NOTE: this document assumes some familiarity with x86 instructions.
e.g.,
r/m/16/32 - This means the instruction supports registers and
memory in 16 and 32 bit forms, but not 8 bit forms.
e.g.,
movs,movsb/w/d - This is an abbreviation for movs,movsb,movsw,movsd.


CPU instruction length:
386+ 15 bytes maximum, GP fault generated if exceeded
286 10 bytes maximum
86 no maximum (prefixes may repeat) - base instructions 1 to 6 bytes

registers:
8-bit: al,ah,bl,bh,cl,ch,dl,dh
16-bit: ax,bx,cx,dx,si,di,bp,sp
32-bit: eax,ebx,ecx,edx,esi,edi,ebp,esp

- Note that the 8 byte registers don't correspond with the
8 word or dword registers, but with bytes in the first four
registers only: ax,bx,cx,dx

- This register non-orthogonality affects the use of byte register
instructions together with: si,di,bp,sp, i.e., setcc

segments:
ss,cs,ds,es
fs,gs (386+)

default instruction segment:
"all" to ds: - except for ip, bp, sp, and for string instructions, di
ip to cs:
di to es: - for string instructions only, otherwise ds:
bp,sp to ss:
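
For example (the override prefix byte is simply prepended to the
instruction encoding):

```asm
        mov ax, [bx]          ; DS: assumed
        mov ax, [bp+2]        ; SS: assumed for BP-based addresses
        mov ax, es:[bx]       ; 0x26 prefix forces ES:
        movsb                 ; reads DS:[si], writes ES:[di]
```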

intended register usage:
ax - accumulator - used for calculations
bx - base
- used as a pointer for 16-bit indirect memory access
- used as a base address with the xlat,xlatb instruction
cx - count
- used with looping instructions and loop prefixes
- used with bitwise shift or rotate instructions
dx - data - used as an accumulator extension
si - source index - used with string instructions for reading
di - destination index - used with string instructions for writing
sp - stack pointer - used with stack instructions
bp - base pointer - used as the base address of a stack frame
ip - instruction pointer

intended segment usage:
cs - code segment
ds - data segment
es - extra segment
ss - stack segment
fs - segment
gs - segment

instruction overrides:
ss,cs,ds,es,fs,gs (0x36,0x2E,0x3E,0x26,0x64,0x65)
rep,repne,repnz,repe,repz (0xF2,0xF3)
lock (0xF0)
0x66 - operand size
0x67 - address size
branch hint taken for jcc - 0x3E (ds override)
branch hint not taken for jcc - 0x2E (cs override)
sse etc. (0x66, 0xF2, 0xF3 - op size and rep)

16-bit register addressing modes:
al,cl,dl,bl,ah,ch,dh,bh (8-bit registers)
ax,cx,dx,bx,sp,bp,si,di (16-bit registers)

16-bit memory addressing modes:
ds:[bx+si+(no offset,8-bit offset,16-bit offset)]
ds:[bx+di+(no offset,8-bit offset,16-bit offset)]
ss:[bp+si+(no offset,8-bit offset,16-bit offset)]
ss:[bp+di+(no offset,8-bit offset,16-bit offset)]
ds:[si+(no offset,8-bit offset,16-bit offset)]
ds:[di+(no offset,8-bit offset,16-bit offset)]
ss:[bp+(8-bit offset,16-bit offset)] (NOTE: no 'no offset' form)
ds:[16-bit displacement]
(NOTE:above replaces 'no offset' form of bp+offset)
ds:[bx+(no offset,8-bit offset,16-bit offset)]
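
Illustrating the table: only BX or BP may serve as a base and only SI
or DI as an index, so for example:

```asm
        mov al, [bx+si]       ; valid: base BX, index SI
        mov al, [bp+di+10h]   ; valid: base BP, index DI (SS: by default)
        ; mov al, [bx+cx]     ; invalid: CX cannot be an index
        ; mov al, [si+di]     ; invalid: two indexes, no base
```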

32-bit register addressing modes:
al,cl,dl,bl,ah,ch,dh,bh (8-bit registers)
eax,ecx,edx,ebx,esp,ebp,esi,edi (32-bit registers)

32-bit memory addressing modes:
ds:[eax+(no offset,8-bit offset,32-bit offset)]
ds:[ecx+(no offset,8-bit offset,32-bit offset)]
ds:[edx+(no offset,8-bit offset,32-bit offset)]
ds:[ebx+(no offset,8-bit offset,32-bit offset)]
SIB forms (**) (no esp+offset mode - SIB encoding allows esp+offset)
ss:[ebp+(8-bit offset,32-bit offset)] (NOTE: no 'no offset' form)
ds:[32-bit displacement]
(NOTE: above replaces 'no offset' form of ebp+offset)
ds:[esi+(no offset,8-bit offset,32-bit offset)]
ds:[edi+(no offset,8-bit offset,32-bit offset)]

(**) two SIB forms (scale-index-base byte):

ds/ss:[(reg0a,32-bit displacement)+(reg1*n,none)]
ds/ss:[reg0b+(reg1*n,none)+(8-bit offset,32-bit offset)]

segment is ds, except if reg0a or reg0b is esp or ebp, then it's ss
reg0a=eax,ecx,edx,ebx,esp,esi,edi,32-bit displacement(ebp) (no ebp)
reg0b=eax,ecx,edx,ebx,esp,esi,edi,ebp
reg1=eax,ecx,edx,ebx,ebp,esi,edi,none(esp) (no esp)
n=1,2,4,8
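The two forms above pack into a single byte. As an illustrative sketch (Python is used here only because it is easier to check than raw assembler output), the scale-index-base byte is built like this; register numbers follow the encoding table later in this post:

```python
# SIB byte layout: scale (bits 7-6) | index (bits 5-3) | base (bits 2-0).
# 32-bit register numbers: eax=0 ecx=1 edx=2 ebx=3 esp=4 ebp=5 esi=6 edi=7.
SCALE_BITS = {1: 0, 2: 1, 4: 2, 8: 3}

def sib(base, index, scale):
    """Encode a SIB byte; index=4 (esp's slot) means 'no index'."""
    return (SCALE_BITS[scale] << 6) | (index << 3) | base

# [ebx + esi*4] -> base=3 (ebx), index=6 (esi), scale=4
print(hex(sib(3, 6, 4)))  # 0xb3
```

This is a model of the encoding rule, not a full assembler; the ebp/esp special cases (disp32 base, no index) are handled by the numbering above, not by extra logic.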

xlat address mode:
xlat is equivalent to 'mov al,ds:[bx+al]'
So, xlat effectively adds an additional
address mode for 8086 of 'ds:[bx+al]'.
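Functionally, xlat is nothing more than a one-byte table lookup. A minimal behavioral model (Python for illustration):

```python
def xlat(table, al):
    """Model of xlat: al = ds:[bx + al], i.e., replace al with table[al]."""
    return table[al & 0xFF] & 0xFF

# Classic use: convert a nibble value in al to its ASCII hex digit.
HEXTAB = b"0123456789ABCDEF"
print(chr(xlat(HEXTAB, 10)))  # A
```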

registers by register encoding:
eax(0) ax(0) al(0) es(0)
ecx(1) cx(1) cl(1) cs(1)
edx(2) dx(2) dl(2) ss(2)
ebx(3) bx(3) bl(3) ds(3)
esp(4) sp(4) ah(4) fs(4)
ebp(5) bp(5) ch(5) gs(5)
esi(6) si(6) dh(6)
edi(7) di(7) bh(7)

registers by register group:
eax(0) ax(0) ah(4) al(0)
ecx(1) cx(1) ch(5) cl(1)
edx(2) dx(2) dh(6) dl(2)
ebx(3) bx(3) bh(7) bl(3)
esp(4) sp(4)
ebp(5) bp(5)
esi(6) si(6)
edi(7) di(7)
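The numbers in parentheses above are exactly what goes into the reg and r/m fields of a ModRM byte. A sketch of the register-direct case (mod=11), again in Python so it can be checked:

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def modrm_reg(reg, rm):
    """ModRM byte for register-direct operands: mod=11 | reg | r/m."""
    return (0b11 << 6) | (REG32[reg] << 3) | REG32[rm]

# 'mov ebx,eax' encodes as 89 /r with reg=eax, r/m=ebx, i.e. bytes 89 C3.
print(hex(modrm_reg("eax", "ebx")))  # 0xc3
```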

data movement:
The x86 has three (basic) locations where instructions can load
or store values: registers, stack, and memory.

From the diagram, note that there are instructions which will
move data between:

1) register and register
2) register and memory
3) register and stack
4) stack and memory

mov
register xchg register
+-----------------+
| |
push | | mov
pop | | xchg
mov [sp] | |
| |
+-----------------+
stack push mem memory
pop mem

Note that multiple instructions (using the stack or registers) are
usually required to move data between:

1) memory and memory
2) stack and stack

One exception to this is the movs instruction, which allows
memory-to-memory moves. It has setup overhead because it uses many
hardcoded registers, and a single movs is slow, but it does much
work. When used with a repeat prefix, it is faster than a loop
composed of other instructions.
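The rep movs behavior described above amounts to a hardware memcpy driven by cx, si, and di. A small behavioral model (Python; DF=0, i.e., forward copying, is assumed):

```python
def rep_movsb(mem, si, di, cx):
    """Model of 'rep movsb' with DF=0: copy cx bytes from [si] to [di]."""
    while cx:
        mem[di] = mem[si]                # movsb: one byte ds:[si] -> es:[di]
        si, di, cx = si + 1, di + 1, cx - 1
    return si, di                        # cx has counted down to 0

mem = bytearray(b"hello-----")
rep_movsb(mem, 0, 5, 5)
print(mem)  # bytearray(b'hellohello')
```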

The diagram ignores segment registers and I/O ports.

fastest modes:
1) register,register
2) accumulator,immediate
3) register,immediate

'register,register' instructions:
adc,add,and,cmp,or,sbb,sub,test,xor
mov
arpl,bsf/r,cmpxchg,movsx,movzx,xchg

'accumulator,immediate' (or 'immediate,accumulator') mode instructions:
adc,add,and,cmp,or,sbb,sub,test,xor
in,out

'register,immediate' instructions:
adc,add,and,cmp,or,sbb,sub,test,xor
mov
rcl,rcr,rol,ror,sal,sar,shl,shr

hardcoded accumulator (al,ah,ax) instructions:
adc,add,and,cmp,or,sbb,sub,test,xor (*)-all
lods,stos,scas
in,out
mov(*)
cmpxchg,xchg(*),xlat,xlatb
div,idiv,imul,mul
cbw,cwde,cwd,cdq
aaa,aad,aam,aas,daa,das (64-bit obs.)
lahf,sahf,salc (64-bit obs.)
cpuid (eax, clobbers eax,ebx,ecx,edx)
(*) have a regular form and short form for accumulator

instructions with hardcoded ,0 or ,1 forms:
enter
rcl,rcr,rol,ror,sal,sar,shl,shr

hardcoded register instructions or prefixes (other than accumulator
only):
xlat,xlatb (bx al)
jcxz,jecxz (cx)
rep,repe/z/ne/nz (cx)
loop,loope/z/nz/ne (cx)
rcl,rcr,rol,ror (cl)
sal,sar,shl,shr (cl)
shld,shrd (cl)
in,out (dx ax,al)
cwd,cdq,div,idiv,mul,imul (dx:ax,al)
lds,les,lfs,lgs,lss
mov (for crx,drx,trx - trx are obs.)
push,pusha/ad/f/fd (ss:sp)
pop,popa/ad/f/fd (ss:sp)
cmps,cmpsb/w/d (ds:si es:di)
movs,movsb/w/d (ds:si es:di)
ins,insb/w/d (es:di dx)
outs,outsb/w/d (ds:si dx)
lods,lodsb/w/d (ds:si ax,al)
scas,scasb/w/d (es:di ax,al)
stos,stosb/w/d (es:di ax,al)
enter,leave (bp ss:sp)

string or memory block instructions, with useable rep prefixes:
A) between memory and memory
cmps,cmpsb/w/d (ds:si es:di) - repe/z/ne/nz (cx)
movs,movsb/w/d (ds:si es:di) - rep (cx)
B) between memory and ax
lods,lodsb/w/d (ds:si ax,al) - rep (cx)
scas,scasb/w/d (es:di ax,al) - repe/z/ne/nz (cx)
stos,stosb/w/d (es:di ax,al) - rep (cx)
C) between memory and ports
ins,insb/w/d (es:di dx) - rep (cx)
outs,outsb/w/d (ds:si dx) - rep (cx)

instructions which load or store segment registers:
lds,les,lfs,lgs,lss
mov
push
pop (except cs - cs allowed on 8086/80186 only)
jmp,call,ret,iret (cs)

instructions that support mixed sizes (all imm8 except for movzx/sx):
add,adc,and,cmp,or,sub,sbb,xor r/m/16/32,imm8 (sign-extended)
rcl,rcr,rol,ror,sal,sar,shl,shr r/m/16/32,imm8
shrd,shld r/m/16/32,r/16/32,imm8
bt,btc/r/s r/16/32,imm8
imul r/16/32,imm8 (sign-extended)
imul r/16/32,r/m/16/32,imm8 (sign-extended)
push imm8 (sign-extended)
movzx/sx r/16/32,r/m/8/16 (zero-extended,sign-extended)
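The "(sign-extended)" notes above mean the assembler can emit a one-byte immediate that the CPU widens at execute time. A sketch of that widening rule:

```python
def sign_extend8(imm8, bits=32):
    """Widen an 8-bit immediate the way 'add r/m32, imm8' (83 /0) does."""
    imm8 &= 0xFF
    value = imm8 - 0x100 if imm8 & 0x80 else imm8
    return value & ((1 << bits) - 1)

print(hex(sign_extend8(0xFF)))  # 0xffffffff  (-1 still fits in one byte)
print(hex(sign_extend8(0x05)))  # 0x5
```

This is why small negative constants are as cheap as small positive ones in these instruction forms.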

instructions that support imm8:
add,adc,and,cmp,or,sub,sbb,xor
rcl,rcr,rol,ror,sal,sar,shl,shr
push
mov
test
int
in,out
shld,shrd
bt,btc/r/s
enter
imul

instructions with +r form encoding (bytes in paren):
inc,dec (1)
push,pop (1)
xchg (1)
bswap (2)
mov imm (offset)
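"+r" means the register number is folded into the opcode byte itself, so no ModRM byte is needed. A sketch using the well-known 0x50 base for push r32:

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def push_r32(reg):
    """'+r' encoding: push r32 is the single byte 0x50 + register number."""
    return bytes([0x50 + REG32[reg]])

print(push_r32("ebp").hex())  # 55 -> the familiar 'push ebp' prologue byte
```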

instructions that support 8-bit instruction relative offsets:
jcc,jcxz,jecxz,jmp,loop,loope/z/nz/ne
[not call instruction]

instructions that support 16/32-bit instruction relative offsets:
call,jcc,jmp

instructions which can perform sign-extension:
add,adc,and,cmp,or,sub,sbb,xor r/m/16/32,imm8
cbw,cwde,cwd,cdq
movsx
imul

instructions which can perform zero-extension:
movzx
movd

instruction which requires a sign-extended argument:
idiv (i.e., requires cwd or cdq)
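cwd/cdq exist precisely to produce that sign-extended dividend. A behavioral sketch of cdq feeding idiv:

```python
def cdq(eax):
    """Model of cdq: sign-extend eax into edx:eax (the idiv dividend)."""
    edx = 0xFFFFFFFF if eax & 0x80000000 else 0x00000000
    return edx, eax

edx, eax = cdq(0xFFFFFFFE)           # eax holds -2
print(hex(edx), hex(eax))            # 0xffffffff 0xfffffffe
```

Skipping the cdq (or substituting 'xor edx,edx') is only safe when the dividend is known to be non-negative.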

instructions with r/m/16/32 or r16/32 forms but without r/m/8 or r/8:
push,pop
bsf/r,bt,bts/r/c
jmp,call
bound
lar,lsl
lds,les,lfs,lgs,lss
lea
sldt,smsw
shrd,shld

instruction with r/m8 only form:
setcc

instruction with r32 only form:
bswap

instructions with r/m16 only forms:
arpl
lldt,lmsw
ltr,str,verr/w
sldt,smsw

instructions with r/m32 only forms:
movd
cvtsi2ss/d

instructions with only single operand forms:
neg,not
pop (no push)
mul (no imul)
div,idiv
inc,dec
bswap
setcc
lldt,lmsw
ltr,str,verr/w
sldt,smsw

instructions that do not accept an operand-size override:
lldt,lmsw
ltr,str
sldt
NOTE: only info from 386 manual

instructions (and prefixes) that have an exact one byte form:
segment overrides (es,cs,ss,ds,fs,gs)
operand size override (0x66)
address size override (0x67)
lock,rep,repe/z/ne/nz
push (es,cs,ss,ds)
pop (es,ss,ds)
aaa,aas,daa,das
inc,dec,push,pop r/16/32
pusha/ad/f/fd
popa/ad/f/fd
cmps,cmpsb/w/d
movs,movsb/w/d
ins,insb/w/d
outs,outsb/w/d
lods,lodsb/w/d
scas,scasb/w/d
stos,stosb/w/d
nop,wait,hlt
xchg,xlat,xlatb
cbw,cwde,cwd,cdq
ret,leave,retf,iret,iretd
int 3,into,int1
salc,sahf,lahf
in,out
clc,cld,cli,cmc,stc,std,sti

instructions that have an exact two byte form:
clts
invd,wbinvd
push (fs,gs)
pop (fs,gs)
bswap r/32
cpuid
rdtsc,rdmsr,rdpmc,wrmsr
rsm
syscall,sysret
sysenter,sysexit
emms,femms
ud0,ud1,ud2

instructions which preserve flags:
bound
bswap,xchg,xlat,xlatb
cbw,cdq,cwd,cwde
clts (doesn't modify EFLAGS, but CR0)
cpuid
enter,leave
esc (all FPU instructions)
dec,inc (preserve CF only)
in,ins,lods,stos
invd,invlpg,wbinvd
jcc,cmovcc,fcmovcc,setcc
call,jmp,jcxz,ret
lahf,lds/es/ss/fs/gs
lgdt,lidt,lldt,lmsw
hlt,lock,wait
loop,loope/ne,loopz/nz
ltr,str
monitor,mwait
lea,mov
movs,movsx,movzx
not,nop,ud2
out,outs
pop,popa
push,pusha,pushf
rdmsr,rdpmc,rdtsc,wrmsr
rep,repe/ne
sgdt,sidt,sldt,smsw

instructions which modify flags:
aaa,aad,aam,aas,daa,das
adc,add,sbb,sub,xadd,dec,inc
and,neg,or,xor
arpl,bound
bsf/r,bt,bts/r/c
clc,cmc,cld,cli
cmp,test
cmps,scas
cmpxchg,cmpxchg8b
comisd,comiss,ucomisd,ucomiss
div,idiv,imul,mul
dec (except CF), inc (except CF)
int,into
fcomi,fcomip,fucomi,fucomip
iret
lar,lsl
mov crx/drx/trx (trx are obs.)
popf,popfd,pushf,pushfd
rcl,rcr,rol,ror
rsm
sal,sar,shl,shr,shld,shrd
stc,std,sti
sahf
verr/w

instructions which modify the carry flag:
aaa,aas,daa,das
adc,add,sbb,sub,xadd
and(0),or(0),xor(0)
neg (CF=0 for zero)
clc(0),cmc(x),stc(1)
bt,bts/r/c
cmp,test(0)
cmps,scas
cmpxchg,cmpxchg8b
comisd,comiss,ucomisd,ucomiss
fcomi,fcomip,fucomi,fucomip
imul,mul
iret,popf,rsm,sahf
rcl,rcr,rol,ror
sal,sar,shr,shl
shld,shrd

instructions which preserve the carry flag:
dec,inc

instructions which modify the zero flag:
aad,aam,daa,das
adc,add,sbb,sub,xadd,dec,inc
and,neg,or,xor
cmp,test
cmps,scas
bsf/r
cmpxchg,cmpxchg8b
comisd,comiss,ucomisd,ucomiss
fcomi,fcomip,fucomi,fucomip
iret,popf,rsm,sahf
lar,lsl,arpl
sal,sar,shr,shl
shld,shrd
verr/w
NOTE: neg sets CF=0 for zero (in carry flag section above)

arithmetic instructions which DO NOT update the carry flag:
inc,dec,div,idiv,not

won't generate partial flag stalls:
and,or,xor,add,adc,sub,sbb,cmp,neg

will generate partial flag stalls (when followed by
lahf,sahf,pushf,pushfd):
inc,dec,test
bt,btc/r/s
bsf/r
clc,cld,cli
cmc
stc,std,sti
mul,imul
rcl,rcr,rol,ror,sal,sar,shl,shr

override restricted instructions:
cmps/cmpsb/w/d no override of ES segment
ins/insb/w/d no override of ES segment
movs/movsb/w/d no override of ES segment
scas/scasb/w/d no override of ES segment
stos/stosb/w/d no override of ES segment
mov crx ignores operand size override
monitor no operand size override, no rep/repne/repnz, no lock
popf/fd #UD if v86 with I/O priv. < 3, and op. size override
jcc #GP if invalid address due to operand size override
loop/loopc #UD in RM if address size override
vmcall address size or segment ignored, #UD for operand size
vmlaunch address size or segment ignored, #UD for operand size
vmresume address size or segment ignored, #UD for operand size
vmclear operand size ignored
vmptrld operand size ignored, as with vmclear
vmptrst #UD for operand size
vmread #UD for operand size
vmwrite #UD for operand size
vmxoff #UD for operand size
vmxon operand size ignored
fxrstor #UD for lock override in RM and PM
fxsave #UD for lock override in RM and PM

atomic operations:
lock (atomic override prefix for multiprocessors - asserts lock
signal)
lss,lds,les,lfs,lgs (loads a segment and offset in single
instruction)
mov ss (following instruction executes without interrupt, e.g., mov
esp)
xchg (automatically locks for memory operands)

critical non-atomic operations:
mov crx (NMI and maskable interrupts must be disabled for atomic
usage)

lock atomic override prefix can be used on (when they have memory
operand):
adc,add,sbb,sub,dec,inc
and,neg,not,or,xor
btc,btr,bts
xadd,xchg
cmpxchg,cmpxchg8b

486 serializing instructions:
(forces completion of all prior instructions on out-of-order-execution
CPU's)
iret (non-privileged)
rsm (non-privileged, only in SMM)
wbinvd (privileged)
mov cr0 (privileged)
mov drx (privileged)
lmsw (won't serialize P5+)
exceptions: int xx,int1,int3,into,bound (won't serialize P5+)
branches: call,ret,retf,jmp,jcc (won't serialize P5+)
segment load: mov sr,pop sr,lds/es/fs/gs/ss (won't serialize P5+)

586+ serializing instructions:
(i.e., non-parallel execution for Pent,P4,P6,Xn)
(forces completion of all prior instructions on out-of-order-execution
CPU's)
iret (non-privileged)
cpuid (non-privileged, won't serialize on 486, modifies registers)
rsm (non-privileged, only in SMM)
wbinvd (privileged)
mov crx (privileged, not with cr8, only cr0 for 486)
mov drx (privileged)
invd (won't serialize on 486)
invlpg (won't serialize on 486)
lgdt (won't serialize on 486)
lidt (won't serialize on 486)
lldt (won't serialize on 486)
ltr (won't serialize on 486)
wrmsr (not available on 486)
out (?)

protected mode (PM) setup instructions valid in real mode (RM)
clts,lidt,lgdt

protected mode (PM) setup instructions not valid in real mode (RM)
arpl,lldt,lsl,ltr,sldt,str,verr,verw

memory ordering instructions (P4,P6,Xn):
sfence (non-privileged, not available on 486)
mfence (non-privileged, not available on 486)
lfence (non-privileged, not available on 486)
rdtscp (non-privileged, not available on 486)

AMD64 obsolete instructions (64-bit obs.):
jmp far/call far - uses segment registers
inc/dec - single byte versions now REX prefixes
push/pop cs/ds/es/ss
lds/les
pusha,popa
into
bound
aaa,aas,aad,aam,daa,das
icebp
82h alias for 80h
sysenter,sysexit
arpl
salc
lahf (some cpu's)
sahf (some cpu's)
ss,ds,es

basic flags:
OF(11) Overflow flag (set on signed overflow)
DF(10) Direction Flag (controls increment/decrement of SI/DI in string instructions)
IF(9) Interrupt enable Flag (set recognizes maskable interrupts)
SF(7) Sign Flag (usually set to sign of signed result)
ZF(6) Zero Flag (usually set when result is zero, e.g., when equal)
AF(4) Auxiliary carry Flag (set on nybble carry/borrow)
PF(2) Parity Flag (set for even parity of result's low byte, cleared for odd)
CF(0) Carry Flag (set on unsigned carry/borrow)
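The bit numbers in parentheses can be used directly to test a saved flags value (e.g., one captured with pushf). A small helper, in Python for illustration:

```python
FLAG_BIT = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6,
            "SF": 7, "IF": 9, "DF": 10, "OF": 11}

def flag_set(eflags, name):
    """Test a single flag in a saved EFLAGS value."""
    return bool((eflags >> FLAG_BIT[name]) & 1)

# 0x246 is a commonly seen idle EFLAGS value: IF, ZF, and PF set.
print(flag_set(0x246, "ZF"), flag_set(0x246, "CF"))  # True False
```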

8088/8086:
prefer the AX or AL register for shorter or faster instruction forms
prefer XCHG reg,reg over a MOV reg,reg or PUSH reg; POP reg sequence
use XLAT for tables
use XLAT for an additional address mode, i.e., DS:[BX+AL]
unroll loops, including REP MOVS; this reduces use of CX and avoids
branches

quick guide to instruction timings
----

* pairable
/ partial pairing
F fxch pairable
-f no flag stall
+f preserves flags
+m mixed size
m memory
X 64-bit obsolete
d directpath 1 macrop
d2 directpath 2 macrop
v vectorpath 3+ macrops

timing: P5 486 386

1 cycle
----
nop * 1 1 3 +f d
mov * 1 1 2 +f d
mov m * 1 1 2/4 +f d
push * 1 1 2 +f +m d
pop * 1 4 4 +f d/d2
pop * 1 1 4 +f d/d2
lea * 1 1 2 +f d/v ;add,mul by 1,2,4,8
add * 1 1 2 -f +m d
sub * 1 1 2 -f +m d
and * 1 1 2 -f +m d
or * 1 1 2 -f +m d
xor * 1 1 2 -f +m d
shr * 1 3 3 +m d
shl * 1 3 3 +m d
sar * 1 3 3 +m d
sal * 1 3 3 +m d
cmp * 1 1 2 -f +m d ;sub - no write to register
test * 1 1 1 d ;and - no write to register
inc * 1 1 2 X d
dec * 1 1 2 X d
adc / 1 1 2 -f +m d
sbb / 1 1 2 -f +m d ;x-(x+CF)
ror 1 / 1 3 3 +m d
rol 1 / 1 3 3 +m d
rcr 1 / 1 3 9 +m d
rcl 1 / 1 3 9 +m d
jmpn / 1 3 7+ +f d
jcc / 1 1,3 3,7 +f d
push sr 1 3 2 X +f d2
neg 1 1 2 -f d ;0-x, CF=0 if x=0
not 1 1 2 +f d ;xor imm8 0xFF - sign extended
setcc 1 3 4 +f d
calln 1 3 7+ +f d2
bswap 1 1 - +f d
fld F
fst(p)
fchs F
fabs F
fcom(p)(pp) F
fucom(p)(pp) F
ftst F
fnop
fxch
wait 1 1-3 6+ +f

2 cycle
----
add m * 2 2,3 6,7 +m d
sub m * 2 2,3 6,7 +m d
and m * 2 1,3 6,7 +m d
or m * 2 2,3 6,7 +m d
xor m * 2 2,3 6,7 +m d
cmp m * 2 2 5,6 +m d
test m * 2 1,2 5,5 d
adc m / 2 2,3 6,7 +m d
sbb m / 2 2,3 6,7 +m d
push m 2 4 5 +f +m d2
setcc m 2 4 5 d
cwd 2 3 2 +f d
cdq 2 3 2 +f d
clc 2 2 2 d
stc 2 2 2 d
cmc 2 2 2 d
cld 2 2 2 d
std 2 2 2 d2
lods 2 5 5 +f v
lahf 2 3 2 X +f v
sahf 2 2 3 X d
xchg ax 2 3 3 +f d2
call r 2 5 10+ +f d2
jmp r 2 5 7+ +f d
retn 2 5 10+ +f d
fld m80
fst(p) m32/64
fldz
fld1
fnstcw m16
fincstp
fdecstp
ffree

3 cycle
----
shr m / 3 4 7 +m d
shl m / 3 4 7 +m d
sar m / 3 4 7 +m d
sal m / 3 4 7 +m d
pop m 3 6 5 +f v
inc m 3 3 6 X d
dec m 3 3 6 X d
neg m 3 3 6 d
not m 3 3 6 +f d
ror m 3 4 7 +m d
rol m 3 4 7 +m d
rcr m / 3 4 10 +m v
rcl m / 3 4 10 +m v
xchg 3 3 3 +f d2/v
cbw 3 3 3 +f d
cwde 3 3 3 +f d
stos 3 5 4 +f v
pop sr 3 3 2 X +f v
pushf 3/9 3/4 4 +f v
movsx 3 3 3 +f +m d
movzx 3 3 3 +f +m d
jmpf 3 13 12+ +f v
retn i 3 5 10+ +f d2
leave 3 5 4 d2
fadd(p) F
fsub(p)(r)(rp) F
fmul(p) F
fst(p) m80
fild m

4 cycle
----
popf 4/6 6/9 5 +f v
xlat 4 4 5 +f v
movs 4 7 7 +f v
scas 4 6 7 v
lds X +f
les X +f
lfs X +f
lgs X +f
lss X +f
shr cl +m d
shl cl +m d
sar cl +m d
sal cl +m d
ror cl +m d
rol cl +m d
shld v
shrd v
bt 4 3 3 +m d/v
retf +f d/d2
ficom

5 cycle
----
cmps 5 8 10 v
pusha 5 11 18/24 X +f v
popa 5 9 24 X +f v
shld m v
shrd m v
call m +f v
jmp m +f d/v
retn m +f d/d2
retf i +f d/d2

misc.
----
fdiv(p)(r)(rp) F (fxch pairable)
enter 11+ 14+ 10+ v
rep 8 7 5 (varies by instruction...)
loop 5 6 11+m v
salc
btc/r/s 7 6 6 +m d2/v
rcr cl 7/26 8/31 9/10 +m v
rcl cl 7/26 8/31 9/10 +m v

instructions by generation
----
186 -
SHR/ROT i immediate >1
BOUND
ENTER/LEAVE
INS/INSB/INSW
IMUL r,r,i
OUTS/OUTSB/OUTSW
PUSH i
POPA/POPAD
PUSHA/PUSHAD
286 -
ARPL
CLTS
LAR
LGDT
LIDT
LLDT
LMSW
LOADALL
LSL
LTR
SGDT
SIDT
SLDT
SMSW
STR
VERR
VERW
386 -
MOVZX
MOVSX
IMUL r,r
SHLD
SHRD
BT
BTR
BTS
BTC
BSF
BSR
SETcc
Jcc long-displacement
CDQ
CWDE
IRETD
LFS
LGS
LSS
MOVSD
OUTSD
POPFD
PUSHFD
MOV CRx
MOV TRx
MOV DRx
486 -
BSWAP
CPUID (some)
CMPXCHG
INVD
INVLPG
RSM
WBINVD
XADD
FSTSW AX
87 -
ST(0)-ST(7) registers
FSTCW mem
FLDCW mem
287 -
FSTSW AX
FSETPM
387 -
FCOS
FLDENVD
FNSAVED
FNSTENVD
FPREM1
FRSTORD
FSAVED
FSIN
FSINCOS
FSTENVD
FUCOM
FUCOMP
FUCOMPP
Pent -
CMPXCHG8B
CPUID
RDMSR
WRMSR
RSM
RDTSC
RDPMC (Pent w/MMX only)
MOVCRx
RSLDT
RVTS
SVDC
Pent2 -
SYSENTER
SYSEXIT
PPro -
CMOVcc
FCMOVcc
FCOMV
FCOMI
FCOMIP
FUCOMI
FUCOMIP
RDPMC
UD2
MMX -
MM0-MM7 registers
MOVD
MOVQ
PACKSSDW
PACKSSWB
PACKUSWB
PADDB
PADDW
PADDD
PADDSB
PADDSW
PADDUSB
PADDUSW
PAND
PANDN
PCMPEQB
PCMPEQW
PCMPEQD
PCMPGTB
PCMPGTW
PCMPGTD
PMADDWD
PMULHW
PMULLW
POR
PSLLD
PSLLW
PSLLQ
PSRAD
PSRAW
PSRLW
PSRLD
PSRLQ
PSUBB
PSUBW
PSUBD
PSUBSB
PSUBSW
PSUBUSB
PSUBUSW
PUNPCKHBW
PUNPCKHDQ
PUNPCKHWD
PUNPCKLBW
PUNPCKLWD
PUNPCKLDQ
PXOR
EMMS
SSE -
XMM0-XMM7 registers 64
PREFETCH
SFENCE
FXSAVE
FXRSTOR
MOVNTQ
MOVNTPS
- CVTSI2SS
- CVTSS2SI
- CVTTSS2SI
PSHUFW
PSADBW
PMINUB
PMINSW
PMAXUB
PMAXSW
PMULHUW
PAVGB
PAVGW
PINSRW
PMOVMSKB
SSE2 -
XMM0-XMM7 registers 128
MOVNTI
MOVNTPD
PAUSE
LFENCE
MFENCE
- CVTSD2SI
- CVTSI2SD
- CVTTSD2SI
PADDQ
PSUBQ
PMULUDQ
SSE3 -
FISTTP
LDDQU
MOVDDUP
MOVSHDUP
MOVSLDUP
ADDSUBPS
ADDSUBPD
HADDPS
HADDPD
HSUBPS
HSUBPD
SSSE3 -
PSHUFB
PHADDW
PHADDSW
PHADDD
PMADDUBSW
PHSUBW
PHSUBSW
PHSUBD
PSIGNB
PSIGNW
PSIGND
PMULHRSW
PABSB
PABSW
PABSD
PALIGNR
SSE4A -
- EXTRQ
- INSERTQ
- MOVNTSD
- MOVNTSS
SSE4.1 -
- DPPD
- DPPS
- INSERTPS
- MOVNTDQA
- MPSADBW
- PACKUSDW
- PBLENDW
- BLENDPD
- BLENDPS
- PBLENDVB
- BLENDVPD
- BLENDVPS
- PCMPEQQ
- PEXTRB
- PEXTRW
- PEXTRD
- PEXTRQ
- PHMINPOSUW
- PINSRB
- PINSRD
- PINSRQ
- PMAXSB
- PMAXSD
- PMAXUW
- PMAXUD
- PMINUW
- PMINUD
- PMOVSXBW
- PMOVSXBD
- PMOVSXBQ
- PMOVSXWD
- PMOVSXWQ
- PMOVSXDQ
- PMOVZXBW
- PMOVZXBD
- PMOVZXBQ
- PMOVZXWD
- PMOVZXWQ
- PMOVZXDQ
- PMULDQ
- PMULLD
- PTEST
- ROUNDPD
- ROUNDPS
- ROUNDSD
- ROUNDSS
SSE4.2 -
- CRC32
- PCMPESTRI
- PCMPESTRM
- PCMPGTQ
- PCMPISTRI
- PCMPISTRM
- POPCNT
ABM -
LZCNT
POPCNT
Monitor -
MONITOR
MWAIT
3DNow - AMD only
FEMMS
PAVGUSB
PF2ID
PFACC
PFADD
PFCMPEQ/GT/GE
PFMAX
PFMIN
PFRCP/IT1/IT2
PFRSQRT/IT1
PFSUB
PFSUBR
PI2FD
PMULHRW
PREFETCH
PREFETCH/W
3DNowE - AMD only
PF2IW
PFNACC
PFPNACC
PI2FW
PSWAPD
64bit - extensions not available in 32-bit mode
16 64-bit general registers
16 XMM
8 MMX
8 ST

<--end-->

HTH,


Rod Pemberton

George Neuner

unread,
Nov 11, 2016, 7:48:38 AM11/11/16
to
That may be true, but I'm going by external clock.

The original 2MHz Z80 was much slower than 1MHz 6502 on a general mix
and even was slower for mostly 16-bit code.

The 4MHz Z80 was just a little bit faster on general mix and a good
deal faster on mostly 16-bit code. So 4:1 external clock needed.

George

Robert Wessel

unread,
Nov 14, 2016, 1:05:08 AM11/14/16
to
On Fri, 11 Nov 2016 07:43:50 -0500, George Neuner
<gneu...@nospicedham.comcast.net> wrote:

>On Wed, 09 Nov 2016 14:58:18 -0600, Robert Wessel
><robert...@nospicedham.yahoo.com> wrote:
>
>>On Wed, 09 Nov 2016 15:28:25 -0500, George Neuner
>><gneu...@nospicedham.comcast.net> wrote:
>>
>>>The Z80, and the 8080 before it, arguably was better if you needed a
>>>lot of 16-bit operations. The 65c02 had a (IMO) more well rounded
>>>instruction set, and was noticably faster when clocked comparably. The
>>>Z80 needed a 4:1 clock advantage to beat the 65c02 on a general mix of
>>>code. [8080 was even slower and needed 6:1]
>>
>>
>>More like 2:1. The 6502 needed a two phase clock - so a 1MHz 6502
>>basically had 2 million clock pulses each second.
>
>That may be true, but I'm going by external clock.


Eh. Two 1Mhz clocks, timed in such a way that each generates a pulse
halfway between a pair of pulses on the other clock. Sure sounds like
2MHz to me. But it's mainly a question of semantics. Plenty of
systems just generated the phase-0 and phase-1 clocks from a double
(or higher) rate raw signal. For example, the Apple II divided down
the 14.3MHz system oscillator to generate the two 1MHz phases
(specifically, it divided the 14.3MHz signal by two, and generated the
individual phases on one of the seven remaining clocks in each phase).

Robert Wessel

unread,
Nov 14, 2016, 1:05:09 AM11/14/16
to
On Fri, 11 Nov 2016 11:09:27 +0100, "wolfgang kern" <now...@never.at>
wrote:

>
>Edward Brekelbaum said (in part):
>
>| Usually, it is easiest to assume that flags only live to the next
>| instruction.
>
>good advice for those who haven't done their homework yet :)
>
>assembler programmers got a huge advantage over HLL-coders
>by making use of flags and write shorter faster code.


Certainly there are codes that can use those to good advantage (bignum
arithmetic, for example), as well as many where there's little to be
gained.

wolfgang kern

unread,
Nov 14, 2016, 4:05:20 AM11/14/16
to

Robert Wessel wrote:

>>Edward Brekelbaum said (in part):

>>| Usually, it is easiest to assume that flags only live to the next
>>| instruction.

>>good advice for those who haven't done their homework yet :)

>>assembler programmers got a huge advantage over HLL-coders
>>by making use of flags and write shorter faster code.

> Certainly there are codes that can use those to good advantage (bignum
> arithmetic, for example), as well as many where there's little to be
> gained.

I won't start flames on HLL yet, but whenever I compare my low-level
code with any C-compiled stuff the advantage is more than obvious.

My routines never start with push ebp and never end with a 32-bit
true/false in eax. They return with flag status and allow 8 distinct
results with Cy,S,Z, even though I mainly use only four with Cy and Z.
And it's really easy to set the flags as desired if they aren't set
already.
__
wolfgang

rug...@nospicedham.gmail.com

unread,
Nov 26, 2016, 8:48:11 PM11/26/16
to
Hi,

On Friday, November 11, 2016 at 5:18:33 AM UTC-6, Rod Pemberton wrote:
> On Fri, 11 Nov 2016 11:09:27 +0100
>
> x86 instruction information
> compiled and authored by Rod Pemberton
>
> CPU instruction length:
> 386+ 15 bytes maximum, GP fault generated if exceeded
> 286 10 bytes maximum
> 86 no maximum - instruction size 1 to 4 bytes

I'm pretty sure that the 8086 has a max instruction length of six (6) bytes.

Alexei A. Frounze

unread,
Nov 26, 2016, 10:33:21 PM11/26/16
to
At least 7 bytes on 8086/8088:
26 C7 80 34 12 78 56 mov word [es:bx+si+1234H], 5678H

Alex

Rod Pemberton

unread,
Nov 26, 2016, 11:48:28 PM11/26/16
to
The 8086 has a 6-byte instruction queue. The 8088 has a 4-byte
instruction queue. The 8086 fills two bytes at a time. The 8088 fills
one byte at a time.

Now, since the 8088 executes the same instructions as the 8086, the
queue size doesn't limit instruction length. Also, these early
processors didn't have general protection faults for instructions which
were too long.

AISI, our options are that the instruction set encoding limits the
instruction length, or the instruction decoder limits the instruction
length.

So, we just need a quote from a reputable source to get to the truth.
Do we have some? Yes, we do. (I'm not insulting anyone with an RTFM
as I skim the heck out of them and skip most of them ... ;-)

The first quote is from section "14.7 DIFFERENCES FROM 8086 PROCESSOR"
sub-section "6. Redundant prefixes" on page 14-6 of "386 SX
Microprocessor Programmer's Manual" by Intel, 1989.

"The 386 SX microprocessor sets a limit of 15 bytes on instruction
length. The only way to violate this limit is by putting redundant
prefixes before an instruction. A general-protection exception is
generated if the limit on instruction length is violated. The 8086
processor has no instruction length limit."

The second quote is from section "Appendix D iAPX 86/88 SOFTWARE
COMPATIBILITY CONSIDERATIONS" sub-section "10. Do not Duplicate
Prefixes." on page D-2 of "iAPX 286 Programmer's Reference Manual
including the iAPX 286 Numeric Supplement" by Intel, 1985.

"The iAPX 286 sets an instruction length limit of 10 bytes. The only
way to violate this limit is by duplicating a prefix two or more times
before an instruction. Exception 6 occurs if the instruction length
limit is violated. The iAPX 86 or 88 has no instruction length limit."

Well, there we have it. That's straight from Intel. The 8088 or 8086
has no instruction length limit. The 80286 has 10 byte limit and cause
exception 6 if the length is exceeded. The 80386 has 15 byte limit and
causes a GP exception if the length is exceeded.

I'll make a note to change the 80286 line to this:
" 286 10 bytes maximum, exception 6 generated if exceeded"

So, the many, many webpages on the internet which state the 8086 only
has an 6-byte instruction limit are incorrect, clearly.

HTH,


Rod Pemberton

Terje Mathisen

unread,
Nov 27, 2016, 5:18:45 AM11/27/16
to
Rod Pemberton wrote:
> So, the many, many webpages on the internet which state the 8086 only
> has a 6-byte instruction limit are incorrect, clearly.

You are of course completely correct here, Rod:

The 8088 was perfectly happy decoding & executing a new/redundant ES:
prefix byte every 4 cycles, i.e. at the maximum rate the memory
subsystem could provide the bytes.

The instruction prefetch buffer length was the canonical way to
determine if you had a 8088 or 8086 cpu, but the size of this buffer
(which incidentally almost never did anything on the 8088!) had nothing
to do with the maximum instruction length.

James Harris

unread,
Dec 2, 2016, 4:56:23 PM12/2/16
to
On 06/11/2016 02:26, hughag...@nospicedham.gmail.com wrote:
> On Thursday, November 3, 2016 at 2:22:37 PM UTC-7, James Harris
> wrote:
>> On 03/11/2016 20:43, Rod Pemberton wrote:

...

>>> And, many other instructions are marked as having the results for
>>> specific flags as being undefined, even if the instruction
>>> doesn't use that flag. This means you can't expect that flag to
>>> be preserved. (I constructed lists of these for my own personal
>>> use.)
>>
>> Isn't there a table showing flag effects at the back of the 386
>> manual, and probably later manuals?
>
> I mentioned this earlier, in asking for a book on the "good parts" of
> x86 assembly-language. The Intel manuals have the information, but it
> is very difficult to look up (it took me several minutes to find NOT
> and NEG to answer your last question about whether they affect the
> flags). We could really use a concise description of the x86 that
> provides basic information such as this that could be used as a handy
> reference.

Here's the table I was talking about:

https://pdos.csail.mit.edu/6.828/2006/readings/i386/appb.htm

The whole online hyperlinked manual is excellent.

--
James Harris
