A valid number of prefixes

Tadas

Nov 12, 2009, 2:27:35 PM

Hey guys!

I have a question about instruction prefixes in ia-32/64. I did some
work with software protections and code obfuscation, and I have seen
instruction prefixes used in many cases where their use is reserved,
for example a rep prefix on a jmp instruction. So my question is: how
many prefixes can an instruction actually have? I know the manual says
there can be up to 4 prefixes, each from a different one of the four
groups. However, could there be something like rep rep rep, a segment
override and an address-size override, and then the instruction? Does
a hard limit exist? In practice, an assembler will make sure that the
same prefix is not repeated for an instruction. On the other hand, it
is always possible to inject a byte manually into the binary. Thanks.

Tadas

wolfgang kern

Nov 12, 2009, 4:46:18 PM

"Tadas" asked:

> Hey guys!

Hi,

It seems you mean x86-32/64 CPUs and not IA-64 (Itanium, which is an
exotic Intel architecture).
For x86-32/64: the max. instruction length is 15 bytes for all x86's,
so if you code several (redundant) prefix bytes, the behavior may depend
on the age of your CPU. Very early CPUs took the last prefix as the
valid one, and some even just toggled the meaning on every occurrence;
modern CPUs may ignore doubled or redundant prefix bytes.
In any case, exceeding the max. allowed instruction size (15 bytes)
will result in an exception.
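
A minimal sketch of the length rule, written in Python so it can be run anywhere. It assumes the 15-byte maximum stated above; `count_prefixes` and `fits_length_limit` are hypothetical helper names, and the length of the instruction proper (`core_len`) is supplied by the caller rather than decoded.

```python
# Legacy prefix byte values (LOCK/REP, segment overrides, size overrides).
LEGACY_PREFIXES = {
    0xF0, 0xF2, 0xF3,                    # LOCK, REPNE, REP
    0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65,  # ES, CS, SS, DS, FS, GS overrides
    0x66, 0x67,                          # operand-size, address-size
}
MAX_INSN_LEN = 15  # assumption: the limit quoted in the post above


def count_prefixes(code: bytes) -> int:
    """Count the legacy prefix bytes at the start of `code`."""
    n = 0
    while n < len(code) and code[n] in LEGACY_PREFIXES:
        n += 1
    return n


def fits_length_limit(code: bytes, core_len: int) -> bool:
    """True if prefixes plus the rest of the instruction stay within the limit.

    `core_len` is the length of opcode + ModRM + SIB + displacement +
    immediate; this sketch does not decode it.
    """
    return count_prefixes(code) + core_len <= MAX_INSN_LEN


# Fourteen DS overrides in front of a one-byte NOP: 15 bytes in total.
print(fits_length_limit(bytes([0x3E] * 14 + [0x90]), core_len=1))
# Fifteen of them: 16 bytes in total.
print(fits_length_limit(bytes([0x3E] * 15 + [0x90]), core_len=1))
```

Under these assumptions the number of prefixes is bounded only indirectly: whatever the instruction proper leaves free out of 15 bytes.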

Even though it no longer seems to be required, I always follow the (logical)
prefix-byte order that was recommended for the older 486:
1. REP
2. LOCK
3. SEGover
4. REX
[ASM:] SIB (not a prefix, but must be 'after' all prefix stmt.)
__
wolfgang

Frank Kotler

Nov 12, 2009, 4:47:49 PM

Try it!

I think you will find that the hard limit is on the total instruction
length, not strictly on the number of prefixes (prefixii?). No idea what
the situation is in 64-bit. Do be aware that multiple address/operand
override prefixes will *not* toggle it back!

Best,
Frank

Alexei A. Frounze

Nov 13, 2009, 2:16:41 AM

On Nov 12, 1:47 pm, Frank Kotler <fbkot...@MUNGED.microcosmotalk.com>
wrote:

> Tadas wrote:
> > Hey guys!
>
> > I have this question related to instruction prefixes in ia-32/64. I
> > did some work with software protections and code obfuscations. And I
> > have seen instruction prefixes in many cases
> > where their use is reserved, for example jmp instruction and rep
> > prefix. So my question is, how many prefixes does the instruction can
> > actually have. I know that the manual says there can be up to 4
> > prefixes and each of them has to be from four different groups.
> > However, could there be something, let's say rep rep rep segment
> > override and address override and then the instruction follows? Does
> > the hard limit exists? Practically, if you are using some kind of
> > assembler it will make sure that the same prefix will not repeat more
> > than once for an instruction. On the other hand, there is always a
> > possibility to inject a byte manually into the binary. Thanks.
>
> Try it!
>
> I think you will find that the hard limit is on the total instruction
> length, not strictly on the number of prefixes (prefixii?).

Yep, I think the same.

> No idea what
> the situation is in 64-bit. Do be aware that multiple address/operand
> override prefixes will *not* toggle it back!

And if you have several different segment prefixes, normally the last
one is effective.

Alex

Rod Pemberton

Nov 13, 2009, 4:23:34 AM

"wolfgang kern" <now...@never.at> wrote in message
news:4afc822a$0$4874$9a6e...@unlimited.newshosting.com...
>
> ... the max. instruction length is 15 bytes for all x86's.

Oh, while of little use to the OP, I thought it was:

15 for 386 or later
10 for 286 (no GPF)
no limit for 86

Rod Pemberton

wolfgang kern

Nov 13, 2009, 12:30:46 PM

Rod Pemberton figured:

>> ... the max. instruction length is 15 bytes for all x86's.

> Oh, while of little use to the OP, I thought it was:
>
> 15 for 386 or later
> 10 for 286 (no GPF)
> no limit for 86

Yes Rod, I just forgot to include the museum :)

__
wolfgang

BGB / cr88192

Nov 14, 2009, 3:32:13 PM

"Tadas" <vilkel...@MUNGED.microcosmotalk.com> wrote in message
news:4afc61a7$0$5079$9a6e...@unlimited.newshosting.com...

as others have noted, the usual hard limit is 15 or 16 (I thought 16, but
others are saying 15...).

so, yeah, extra prefixes at peril, they may be ignored, or cause other
effects.
also worth noting is that many of these prefixes are actually used to encode
a lot of newer instructions, such as the REP and OperandSize prefixes being
used to encode many SSE opcodes, ... (so, on a CPU without these opcodes, it
will look like a prefix followed by some other instruction).

as for an assembler, multiple prefixes can be added via db:
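db 0x66, 0x66, 0x66, 0x66, 0x66   ; five redundant operand-size prefixes, emitted as raw bytes
mov dword [ecx+edx*8+0xf00ba5], 0xb0f0b0f0   ; explicit size, since a memory destination with an immediate is otherwise ambiguous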

FWIW, the CS and DS prefixes may validly be used with jumps to essentially
form 'predicted' jumps.

aside:

hmm, actually prefixes could be a weak point for my interpreter, since it
does not interpret/decode prefixes directly, but instead uses pattern
matching (the prefix is part of a pre-formed pattern), and so code using
prefixes weirdly would likely fail to match (actually, checking my listing,
it would end up parsing the prefixes as individual instructions, and then
#UD due to not having logic in place for, say, a REP or DataSz prefix all by
itself...).

a possible fix here could be to make them simply serve as NOP's, but I am
not sure which is better (AKA: ignore prefixes and do as normal, or #UD at
encountering such an attempt, or maybe log a warning and continue, ...).

I would have to check, but putting prefixes in an unusual order, such as:
66 26 67 ...

would also mess up the decoder.

order is:
[SegOvr] [AddrSz] [REP/REPE] [DataSz/66] [REX] [0F] <Opcode> ...

a different ordering will break my decoder, even though I guess the Intel
docs say they can be in any order.

but then the following text states some opcode encodings in terms of their
ordering, hmm...

actually, I think CPUs are also fussy, as I have an Athlon64, and previously
I had some of the prefixes out of order (in my assembler output), and WinDbg
failed to disassemble my ops correctly, as well as the CPU itself not
behaving correctly, but when the order was fixed all went well...

so, I guess in a practical sense the ordering does matter, but it is
confusing...

it may not matter too much though, since this interpreter is not intended as
a general purpose emulator...

> Tadas

Alexei A. Frounze

Nov 14, 2009, 9:22:30 PM

On Nov 14, 12:32 pm, "BGB / cr88192"
<cr88...@munged.microcosmotalk.com> wrote:
...
> actually, I think CPUs are also fussy, as I have an Athlon64, and previously
> I had some of the prefixes out of order (in my assembler output), and WinDbg
> failed to disassemble my ops correctly, as well as the CPU itself not
> behaving correctly, but when the order was fixed all went well...

WinDbg has some bugs w.r.t. prefixes in the disassembler. So far I
haven't seen or heard the CPU depend on the order of prefixes (unless
it's invalid order like REX and then legacy and maybe (don't remember
the details) those multimedia instructions, where the prefix is part
of the opcode).

> so, I guess in a practical sense the ordering does matter, but it is
> confusing...

Can't confirm from what I know.

Alex

H. Peter Anvin

Nov 15, 2009, 2:20:06 AM

On 11/14/2009 06:22 PM, Alexei A. Frounze wrote:
>
> WinDbg has some bugs w.r.t. prefixes in the disassembler. So far I
> haven't seen or heard the CPU depend on the order of prefixes (unless
> it's invalid order like REX and then legacy and maybe (don't remember
> the details) those multimedia instructions, where the prefix is part
> of the opcode).
>

The legacy prefixes can be in any order; that is documented. However,
they fall in four classes, and only one prefix from each class is
meaningful in any one instruction.

-hpa

BGB / cr88192

Nov 15, 2009, 2:21:04 AM

"Alexei A. Frounze" <alexf...@munged.microcosmotalk.com> wrote in message
news:4aff65e6$0$5107$9a6e...@unlimited.newshosting.com...

>
> On Nov 14, 12:32 pm, "BGB / cr88192"
> <cr88...@munged.microcosmotalk.com> wrote:
> ...
>> actually, I think CPUs are also fussy, as I have an Athlon64, and
>> previously I had some of the prefixes out of order (in my assembler
>> output), and WinDbg failed to disassemble my ops correctly, as well as
>> the CPU itself not behaving correctly, but when the order was fixed all
>> went well...
>
> WinDbg has some bugs w.r.t. prefixes in the disassembler. So far I
> haven't seen or heard the CPU depend on the order of prefixes (unless
> it's invalid order like REX and then legacy and maybe (don't remember
> the details) those multimedia instructions, where the prefix is part
> of the opcode).
>

yeah, mostly it was related to SSE instructions, where initially I had
gotten confused about the ordering, and put a few things in the wrong order.
apparently, then, the CPU had been getting confused about which exact
instruction was meant (it didn't crash, it just did the wrong thing).

changing the order made both WinDbg and the processor happy.

so, the ordering is like:
[F2/F3] [66] [REX] [0F] <Opcode>

to do differently (at least for SIMD) seems not to go so well.

it may not matter as much for legacy opcodes, since the prefixes are not
part of the opcode.
however, the DataSz prefix is generally treated as part of the opcode by the
matcher, so having it in a different place would mess up the decoder.

this may be partly because the decoder was based on my disassembler.

as an interesting side effect, a "phantom REX" appears in a few opcodes
which would otherwise be N/E in 32-bit x86 (such as the "rep_movsq"
pseudo-op, which is defined as "F3 48 A5").
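
The byte 48 in that encoding is what ties it to 64-bit mode. A rough sketch, assuming the usual REX layout (high nibble 0100, followed by the W, R, X and B bits); `decode_rex` is a hypothetical helper and not part of any decoder discussed in this thread.

```python
def decode_rex(byte: int):
    """Return the W, R, X, B bits of a REX prefix, or None if not a REX byte.

    REX prefixes occupy the values 0x40..0x4F and exist only in 64-bit mode.
    """
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,  # 64-bit operand size
        "R": (byte >> 2) & 1,  # extends ModRM.reg
        "X": (byte >> 1) & 1,  # extends SIB.index
        "B": byte & 1,         # extends ModRM.rm / SIB.base / opcode register
    }


# The middle byte of "F3 48 A5" (the rep_movsq pseudo-op above).
print(decode_rex(0x48))
```

In 64-bit mode 48 is REX.W, which turns A5 into the quadword move. In 32-bit mode the same byte value is a one-byte dec eax, so F3 48 A5 would not decode as a single instruction there.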

>> so, I guess in a practical sense the ordering does matter, but it is
>> confusing...
>
> Can't confirm from what I know.
>

all the power of edge cases...

> Alex

Alexei A. Frounze

Nov 15, 2009, 4:12:06 AM

On Nov 14, 11:20 pm, "H. Peter Anvin" <h...@MUNGED.microcosmotalk.com>
wrote:

> On 11/14/2009 06:22 PM, Alexei A. Frounze wrote:
>
> > WinDbg has some bugs w.r.t. prefixes in the disassembler. So far I
> > haven't seen or heard the CPU depend on the order of prefixes (unless
> > it's invalid order like REX and then legacy and maybe (don't remember
> > the details) those multimedia instructions, where the prefix is part
> > of the opcode).
>
> The legacy prefixes can be in any order; that is documented. However,
> they fall in four classes, and only one prefix from each class is
> meaningful in any one instruction.
>
> -hpa

I meant legacy prefixes between the REX and opcode - that's not good.

Alex

wolfgang kern

Nov 15, 2009, 6:33:56 AM

Alexei A. Frounze wrote:

> On Nov 14, 12:32pm, "BGB / cr88192" wrote:
> ...
>> actually, I think CPUs are also fussy, as I have an Athlon64, and
>> previously I had some of the prefixes out of order
>> (in my assembler output), and WinDbg failed to disassemble my ops
>> correctly, as well as the CPU itself not behaving correctly,
>> but when the order was fixed all went well...

> WinDbg has some bugs w.r.t. prefixes in the disassembler. So far I
> haven't seen or heard the CPU depend on the order of prefixes (unless
> it's invalid order like REX and then legacy and maybe (don't remember
> the details) those multimedia instructions, where the prefix is part
> of the opcode).

Yeah, most disassemblers are a bit confused by the new meaning of
66/f2/f3 as SSE extensions or other new stuff like PAUSE (f390).

>> so, I guess in a practical sense the ordering does matter, but it is
>> confusing...

> Can't confirm from what I know.

REX seems to be the only one which 'shall immediately precede the opcode';
the others may come in any order.

__
wolfgang

Rod Pemberton

Nov 15, 2009, 11:48:46 AM

"H. Peter Anvin" <h...@MUNGED.microcosmotalk.com> wrote in message
news:4affaba5$0$4939$9a6e...@unlimited.newshosting.com...

>
> The legacy prefixes can be in any order; that is documented. However,
> they fall in four classes,

Which are?

Legacy:
two byte (0x0F)
segment overrides (0x26, 0x36, 0x2E, 0x3E, 0x64, 0x65)
rep (0xF2, 0xF3)
lock (0xF0)
sizing (0x66, 0x67)

Non-legacy:
branch hint (0x2E, 0x3E)
sse etc. (0xF2, 0xF3, 0x66)
rex

Are there more?

> The legacy prefixes can be in any order; that is documented. However,
> they fall in four classes, and only one prefix from each class is
> meaningful in any one instruction.

Are you saying 66h and 67h are both in one class but can't be used together?
(Incorrect.)

Or, are you saying 66h and 67h should be considered to be two different
classes? If so, how do you come up with four classes?

???

Rod Pemberton

BGB / cr88192

Nov 15, 2009, 11:49:21 AM

"wolfgang kern" <now...@never.at> wrote in message

news:4affe723$0$4871$9a6e...@unlimited.newshosting.com...

>
>
> Alexei A. Frounze wrote:
>> On Nov 14, 12:32pm, "BGB / cr88192" wrote:
>> ...
>>> actually, I think CPUs are also fussy, as I have an Athlon64, and
>>> previously I had some of the prefixes out of order
>>> (in my assembler output), and WinDbg failed to disassemble my ops
>>> correctly, as well as the CPU itself not behaving correctly,
>>> but when the order was fixed all went well...
>
>> WinDbg has some bugs w.r.t. prefixes in the disassembler. So far I
>> haven't seen or heard the CPU depend on the order of prefixes (unless
>> it's invalid order like REX and then legacy and maybe (don't remember
>> the details) those multimedia instructions, where the prefix is part
>> of the opcode).
>
> Yeah, most disassemblers are a bit confused by the new meaning of
> 66/f2/f3 as SSE extensions or other new stuff like PAUSE (f390).
>

yeah...

my disassembler and interpreter had mostly ended up treating them all as a
literal part of the opcode, and so largely does not interpret them as
'prefixes' as such, but instead as separate instructions.

in the cases where REP/REPNE was valid as a prefix, which were limited, I
ended up just decoding the whole thing as a compound instruction.

so, "rep_movsb" as a single opcode, not as "rep; movsb".

the exceptions are:
the AddrSz prefix, which is handled specially by the decoder (since it may
affect ModRM syntax/...);
the segment override prefixes, which are handled themselves as instructions
(they set some internal flags, which are handled the next time the ModRM is
resolved...).

so, note that a segment override followed by a non-ModRM instruction could
delay the result until the next ModRM.

fs
nop
imul ecx
mov eax, [0xf00b45]
behaves-as: mov eax, [fs:0xf00b45]

however, the 'DataSz' prefix on 16-bit instructions is treated the same as
the fixed prefix on SIMD instructions (IOW: as part of the pattern to be
matched), and so if it does not immediately precede the REX or opcode, the
instruction will not be recognized.

so, alas, maybe my interpreter is not exactly conformant with the Intel
docs, but oh well, what assemblers I have seen around seem to produce
instructions in the accepted forms.

>>> so, I guess in a practical sense the ordering does matter, but it is
>>> confusing...
>
>> Can't confirm from what I know.
>
> REX seem to be the only one which 'shall immediate precede opcode',
> the others may come in any order.
>

but, it does precede 0F, which is listed as a prefix which is also a part of
the opcode (escape for more opcodes...).

X0F...

and, I can't seem to find any SIMD ops which use both 66 and F3...

66X0F...

since F2 or F3 would change the opcode in question (and so are not valid),
in this case, the order is effectively fixed.

however, I guess a segment override could be stuffed in there:
66 26 0F 10 ... ("movupd xr*, [es:*]")

where this is the case which would at present break my instruction decoder,
which would parse it as:
"a16; es; movups ..."

...
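
One way around the case described above is to collect prefixes into flags first and only then match the opcode. A minimal sketch that handles only opcode 0F 10; it assumes that the last of 66/F2/F3 seen selects the variant, which is a simplification made for this sketch and not documented behaviour. `decode_0f10` is a hypothetical name.

```python
SEGMENT = {0x26: "es", 0x2E: "cs", 0x36: "ss", 0x3E: "ds", 0x64: "fs", 0x65: "gs"}

# Variants of opcode 0F 10, selected by the 66/F2/F3 byte in front of it.
OP_0F10 = {None: "movups", 0x66: "movupd", 0xF3: "movss", 0xF2: "movsd"}


def decode_0f10(code: bytes):
    """Return (mnemonic, segment_override) for a prefixed 0F 10 instruction."""
    seg = None
    simd = None
    i = 0
    # Pass 1: consume prefixes in whatever order they appear, setting flags.
    while i < len(code):
        b = code[i]
        if b in SEGMENT:
            seg = SEGMENT[b]
        elif b in (0x66, 0xF2, 0xF3):
            simd = b  # assumption of this sketch: the last one seen wins
        else:
            break
        i += 1
    # Pass 2: match the opcode against the collected flags.
    if code[i:i + 2] != b"\x0f\x10":
        raise ValueError("this sketch only handles opcode 0F 10")
    return OP_0F10[simd], seg


# The sequence from the post: 66 26 0F 10
print(decode_0f10(bytes([0x66, 0x26, 0x0F, 0x10])))
```

Because the flags are collected before the opcode is matched, 66 26 0F 10 and 26 66 0F 10 give the same result here, which is the property a pattern matcher keyed on a fixed byte order lacks.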

> __
> wolfgang
>
>
>

H. Peter Anvin

Nov 15, 2009, 11:28:36 PM

On 11/15/2009 08:48 AM, Rod Pemberton wrote:
>
>> The legacy prefixes can be in any order; that is documented. However,
>> they fall in four classes, and only one prefix from each class is
>> meaningful in any one instruction.
>
> Are you saying 66h and 67h are both in one class but can't be used together?
> (Incorrect.)
>

No, they are in separate classes.

The classes are:

REP and LOCK prefixes
Segment overrides
OSP (66h)
ASP (67h)

See the Intel Software Development Manual, volume 2A, section 2.1.1.

-hpa

wolfgang kern

Nov 15, 2009, 11:28:54 PM

"BGB / cr88192" wrote:
....

>> REX seem to be the only one which 'shall immediate precede opcode',
>> the others may come in any order.

> but, it does precede 0F, which is listed as a prefix which is also a part
> of
> the opcode (escape for more opcodes...).

Yes, it was once listed as a prefix, but I never saw 0Fh as a prefix byte,
because it's just a marker for a second page of opcodes.
0F is finally listed as part of the opcode now.

> and, I can't seem to find any SIMD ops which use both 66 and F3...
>
> 66X0F...
>
> since F2 or F3 would change the opcode in question (and so are not valid),
> in this case, the order is effectively fixed.
>
> however, I guess a segment override could be stuffed in there:
> 66 26 0F 10 ... ("movupd xr*, [es:*]")
>
> where this is the case which would at present break my instruction
> decoder,
> which would parse it as:
> "a16; es; movups ..."

The byte order is vital for SSE: even though most instructions preceded
by 66/F3/F2 look similar in behavior, these aren't just prefixes here.

__
wolfgang

Rod Pemberton

Nov 16, 2009, 3:26:33 PM

"H. Peter Anvin" <h...@MUNGED.microcosmotalk.com> wrote in message

news:4b00d4f4$0$5092$9a6e...@unlimited.newshosting.com...

>
> On 11/15/2009 08:48 AM, Rod Pemberton wrote:
> >
> >> The legacy prefixes can be in any order; that is documented. However,
> >> they fall in four classes, and only one prefix from each class is
> >> meaningful in any one instruction.
> >
> > > Are you saying 66h and 67h are both in one class but can't be used
> > > together?
> > (Incorrect.)
> >
>
> No, they are in separate classes.
>
> The classes are:
>
> REP and LOCK prefixes
> Segment overrides
> OSP (66h)
> ASP (67h)
>

AMD manuals say there are 5 (five) classes. Intel manuals say there are 4
(four) classes. This difference could affect a disassembly of instructions.

(Isn't this something you should know? You working on NASM and all...
You've demonstrated knowing the Intel manuals. But, you've also
demonstrated _not_ knowing the AMD manuals - a few times now. Is it your
intent for NASM to be Intel only? That's the impression I'm getting from
you.)

> See the Intel Software Development Manual, volume 2A, section 2.1.1.
>

Intel Software Development Manual, volume 2A, section 2.1.1:

2.1.1 Instruction Prefixes
"Instruction prefixes are divided into four groups, each with a set of
allowable prefix codes. For each instruction, one prefix may be used from
each of four groups (Groups 1, 2, 3,4) and be placed in any order"

The four groups Intel lists are lock and rep, segment overrides and branch
hints, operand size, and address size.

> ...

AMD64 Architecture Programmer's Manual, Volume 1A, sections 3.5 and 3.5.1

3.5 Instruction Prefixes

"Instruction prefixes are of two types: REX, legacy."

3.5.1 Legacy Prefixes
"Table 3-7 on page 72 shows the legacy prefixes. These are organized into
five groups, ..."

"The legacy prefixes can appear in any order in the instruction, but only
one prefix from each of the five groups can be used in a single instruction.
The result of using multiple prefixes from a single group is undefined."

The five groups AMD lists are operand size, address size, segment overrides,
lock, and rep.
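
To see where the two groupings disagree, the tables below encode the groups as summarized above. This is a hedged sketch: `INTEL_GROUPS`, `AMD_GROUPS` and `same_group_conflicts` are illustrative names only, and the byte values are the usual legacy prefix encodings.

```python
SEGMENT_OVERRIDES = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65}

# Four groups, as summarized from the Intel manual above.
INTEL_GROUPS = {
    "lock/rep": {0xF0, 0xF2, 0xF3},
    "segment": SEGMENT_OVERRIDES,
    "operand-size": {0x66},
    "address-size": {0x67},
}

# Five groups, as summarized from the AMD manual above.
AMD_GROUPS = {
    "operand-size": {0x66},
    "address-size": {0x67},
    "segment": SEGMENT_OVERRIDES,
    "lock": {0xF0},
    "rep": {0xF2, 0xF3},
}


def same_group_conflicts(prefixes: bytes, grouping: dict) -> list:
    """Names of the groups from which more than one prefix byte was used."""
    conflicts = []
    for name, members in grouping.items():
        if sum(1 for b in prefixes if b in members) > 1:
            conflicts.append(name)
    return conflicts


lock_then_rep = bytes([0xF0, 0xF3])
print(same_group_conflicts(lock_then_rep, INTEL_GROUPS))
print(same_group_conflicts(lock_then_rep, AMD_GROUPS))
```

With these tables, LOCK followed by REP counts as two prefixes from one group under the Intel grouping, but as one prefix from each of two groups under the AMD grouping. A disassembler that flags "more than one prefix per group" would therefore report the two differently.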

HTH,

Rod Pemberton

robert...@munged.microcosmotalk.com

Nov 16, 2009, 6:06:35 PM

On Nov 16, 2:26 pm, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> "H. Peter Anvin" <h...@MUNGED.microcosmotalk.com> wrote in message
> news:4b00d4f4$0$5092$9a6e...@unlimited.newshosting.com...
>
> > On 11/15/2009 08:48 AM, Rod Pemberton wrote:
>
> > >> The legacy prefixes can be in any order; that is documented. However,
> > >> they fall in four classes, and only one prefix from each class is
> > >> meaningful in any one instruction.
>
> > > Are you saying 66h and 67h are both in one class but can't be used
> > > together?
> > > (Incorrect.)
>
> > No, they are in separate classes.
>
> > The classes are:
>
> > REP and LOCK prefixes
> > Segment overrides
> > OSP (66h)
> > ASP (67h)
>
> AMD manuals say there are 5 (five) classes. Intel manuals say there are 4
> (four) classes. This difference could affect a disassembly of instructions.
>
> (Isn't this something you should know? You working on NASM and all...
> You've demonstrated knowing the Intel manuals. But, you've also
> demonstrated _not_ knowing the AMD manuals - a few times now. Is it your
> intent for NASM to be Intel only? That's the impression I'm getting from
> you.)
>
> > See the Intel Software Development Manual, volume 2A, section 2.1.1.
>
> Intel Software Development Manual, volume 2A, section 2.1.1:
>
> 2.1.1 Instruction Prefixes
> "Instruction prefixes are divided into four groups, each with a set of
> allowable prefix codes. For each instruction, one prefix may be used from
> each of four groups (Groups 1, 2, 3,4) and be placed in any order"
>
> The four groups Intel lists are lock and rep, segment overrides and branch
> hints, operand size, and address size.
>
> > ...
>
> AMD64 Architecture Programmer's Manual, Volume 1A, sections 3.5 and 3.5.1
>
> 3.5 Instruction Prefixes
>
> "Instruction prefixes are of two types: REX, legacy."
>
> 3.5.1 Legacy Prefixes
> "Table 3-7 on page 72 shows the legacy prefixes. These are organized into
> five groups, ..."
>
> "The legacy prefixes can appear in any order in the instruction, but only
> one prefix from each of the five groups can be used in a single instruction.
> The result of using multiple prefixes from a single group is undefined."
>
> The five groups AMD lists are operand size, address size, segment overrides,
> lock, and rep.

Of course, since there are no instructions for which both REPxx and
LOCK are valid, having them in separate groups produces a basically
meaningless distinction.

BGB / cr88192

Nov 17, 2009, 1:34:51 AM

"wolfgang kern" <now...@never.at> wrote in message

news:4b00d506$0$5100$9a6e...@unlimited.newshosting.com...

>
> "BGB / cr88192" wrote:
> ....
>>> REX seem to be the only one which 'shall immediate precede opcode',
>>> the others may come in any order.
>
>> but, it does precede 0F, which is listed as a prefix which is also a part
>> of
>> the opcode (escape for more opcodes...).
>
> Yes it were once listed as prefix, but I never saw 0Fh as a prefix byte,
> because it's just a mark for a second page of opcodes.
> Finally 0F is listed as part of the OPcode now.
>

yes, granted I have Intel docs which include 64-bit stuff, and they do list
0F as a prefix, but then later define it as escaping another 1 or 2 opcode
bytes (I think maybe 1 or 2 sections later).

in roughly the same place, it says the prefixes may come in any order, and
then in the next section describes some things which do impose some ordering
on the prefixes.

>> and, I can't seem to find any SIMD ops which use both 66 and F3...
>>
>> 66X0F...
>>
>> since F2 or F3 would change the opcode in question (and so are not
>> valid),
>> in this case, the order is effectively fixed.
>>
>> however, I guess a segment override could be stuffed in there:
>> 66 26 0F 10 ... ("movupd xr*, [es:*]")
>>
>> where this is the case which would at present break my instruction
>> decoder,
>> which would parse it as:
>> "a16; es; movups ..."
>
> the byte order is vital on SSE, even most instructions which were preceded
> by 66/f3/f2 look similar in behave, this aren't just prefixes here.
>

yes, ok...

otherwise I would likely have to do something "funny" with the decoder to
make things work (such as having the decoder have prefixes set flags and
then the matcher checks for flags, or similar...).

but, resolving this, and a few other more subtle issues, would require
alterations to my listings and some amount of changes to the decoder logic.

as is, I have already somewhat altered the logic from what it was in my
disassembler, and I suspect my disassembler has more than a few subtle bugs
(considering what sorts of bugs I fixed). granted, the disassembler is not
exactly a critical feature.

wolfgang kern

Nov 17, 2009, 9:59:49 AM

"BGB / cr88192" wrote:
...

>> Yes it were once listed as prefix, but I never saw 0Fh as a prefix byte,
>> because it's just a mark for a second page of opcodes.
>> Finally 0F is listed as part of the OPcode now.

> yes, granted I have Intel docs which include 64-bit stuff, and they do
> list
> 0F as a prefix, but then later define it as escaping another 1 or 2 opcode
> bytes (I think maybe 1 or 2 sections later).

> in roughly the same place, it says the prefixes may come in any order, and
> then in the next section describes some things which do impose some
> ordering
> on the prefiexes.

:) copy paste from older manuals may lead to such things ...

...
>> the byte order is vital on SSE, even most instructions which were
>> preceded
>> by 66/f3/f2 look similar in behave, this aren't just prefixes here.
>>

> yes, ok...

> otherwise I would likely have to do something "funny" with the decoder to
> make things work (such as having the decoder have prefixes set flags and
> then the matcher checks for flags, or similar...).

> but, resolving this, and a few other more subtle issues, would require
> alterations to my listings and some amount of changes to the decoder
> logic.

> as is, I have already somewhat altered the logic from what it was in my
> disassembler, and I suspect my disassembler has more than a few subtle
> bugs
> (considering what sorts of bugs I fixed). granted, the disassembler is not
> exactly a critical feature.

My disassembler sets an individual flag for each prefix that occurs,
and their meaning depends on the order and on the opcode that follows.
So it easily detects whether it's an SSE variant (or REP, OpSize) or a
SEG override (or a Jcc taken/not-taken hint)
...

I think something similar in reverse could be a solution for compilers too.
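
A small sketch of the "meaning depends on the opcode following" idea, restricted to the 2E/3E bytes. It assumes the usual encodings for conditional jumps (70..7F short, 0F 80..8F near) and the branch-hint reading of 2E/3E mentioned earlier in the thread; `is_jcc` and `interpret_2e_3e` are hypothetical names, not taken from any disassembler described here.

```python
def is_jcc(opcode: bytes) -> bool:
    """True if `opcode` starts with a conditional jump (short or near form)."""
    if len(opcode) >= 1 and 0x70 <= opcode[0] <= 0x7F:
        return True
    return len(opcode) >= 2 and opcode[0] == 0x0F and 0x80 <= opcode[1] <= 0x8F


def interpret_2e_3e(prefix: int, opcode: bytes) -> str:
    """Decide what a recorded 2E/3E prefix means once the opcode is known."""
    if prefix not in (0x2E, 0x3E):
        raise ValueError("only 0x2E and 0x3E are handled in this sketch")
    if is_jcc(opcode):
        return "branch hint: not taken" if prefix == 0x2E else "branch hint: taken"
    return "segment override: cs" if prefix == 0x2E else "segment override: ds"


# 3E in front of a short jz versus 3E in front of a mov with ModRM.
print(interpret_2e_3e(0x3E, bytes([0x74, 0x05])))
print(interpret_2e_3e(0x3E, bytes([0x8B, 0x00])))
```

The prefix byte is only recorded while scanning; the decision about what it means is deferred until the opcode has been read.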

__
wolfgang

BGB / cr88192

Nov 17, 2009, 11:03:35 AM

"wolfgang kern" <now...@never.at> wrote in message

news:4b02ba65$0$4982$9a6e...@unlimited.newshosting.com...

>
>
> "BGB / cr88192" wrote:
> ...
>>> Yes it were once listed as prefix, but I never saw 0Fh as a prefix byte,
>>> because it's just a mark for a second page of opcodes.
>>> Finally 0F is listed as part of the OPcode now.
>
>> yes, granted I have Intel docs which include 64-bit stuff, and they do
>> list
>> 0F as a prefix, but then later define it as escaping another 1 or 2
>> opcode
>> bytes (I think maybe 1 or 2 sections later).
>
>> in roughly the same place, it says the prefixes may come in any order,
>> and
>> then in the next section describes some things which do impose some
>> ordering
>> on the prefiexes.
>
> :) copy paste from older manuals may lead to such things ...
>

yeah, one would have thought Intel to be more professional than this.
then again, I do remember in the past when writing my assembler originally,
encountering a few errors in the listings (I think, some ModRM opcodes
lacking the '/r', ...).

then I made the observation that some of these same errors also existed in
the same places in the AMD docs, which I thought just curious...

it is almost as if, say, hell, they were copying/pasting from each other in
a few places...

> ...
>>> the byte order is vital on SSE, even most instructions which were
>>> preceded
>>> by 66/f3/f2 look similar in behave, this aren't just prefixes here.
>>>
>
>> yes, ok...
>
>> otherwise I would likely have to do something "funny" with the decoder to
>> make things work (such as having the decoder have prefixes set flags and
>> then the matcher checks for flags, or similar...).
>
>> but, resolving this, and a few other more subtle issues, would require
>> alterations to my listings and some amount of changes to the decoder
>> logic.
>
>> as is, I have already somewhat altered the logic from what it was in my
>> disassembler, and I suspect my disassembler has more than a few subtle
>> bugs
>> (considering what sorts of bugs I fixed). granted, the disassembler is
>> not
>> exactly a critical feature.
>
> My disassembler set individual flags for the occurance of any prefix
> and the meaning of 'em depend on the order and the opcode following.
> So it easy detects if it's an SSE-variant(or REP,Opsize) SEG(or Jcc TN/NT)
> ...
>
> I think somthing similar in reverse could be a solution for compilers too.
>

yeah, I could look into something like this.

as-is, my opcode decoder would probably make a much better disassembler than
my current disassembler.
even then, there are still a few misc unaddressed issues (namely, for a few
edge case instructions which accept 'unusual' arguments).

the decoder has both a weak point and a merit though:
it is likely to work more accurately than the disassembler (as in, not botch
up on so many instructions), but also the logic is a lot more complicated
(it decodes opcodes into structs, ...).

however, as is, the structure employed by the current decoder/interpreter
could also be applied to my assembler, mostly in order to allow cleaner and
more generic assembler logic (and, more esoteric possibilities, such as
disassembling code, doing some task, and re-assembling the same code, more
easily allowing for an ASM-level micro-optimizer, ...).

so, I may consider it eventually...

otherwise, I am off implementing sockets (for the interpreter), and
sendmsg/recvmsg are giving me trouble (mostly due to not being sufficiently
documented), and so I may instead resort to sendto/recvfrom first, and maybe
sendmsg/recvmsg as a hack. grr...

or such...

> __
> wolfgang
>
>

Rod Pemberton

Nov 17, 2009, 1:45:46 PM

"robert...@yahoo.com" <robert...@MUNGED.microcosmotalk.com> wrote in
message news:4b01dafb$0$5103$9a6e...@unlimited.newshosting.com...

>
> Of course since there are no instructions for which both REPxx and
> LOCK are valid, having them in a separate groups produces a basically
> meaningless distinction.

As I see it, your statement is false. A valid instruction is one that
complies with the processors' instruction encoding/decoding logic. Many
things can comply. Useless prefix combinations is an example which
complies. Undocumented instructions, like SALC, is another example.
"Bugs," like the F00Fh "bug," is another. An instruction sequence isn't
defined as valid _only_ when it has an _effect_, as you've stated. How do
you explain NOP's? It's both documented and has "no effect."

Rod Pemberton

robert...@munged.microcosmotalk.com

Nov 18, 2009, 1:02:47 AM

On Nov 17, 12:45 pm, "Rod Pemberton" <do_not_h...@nohavenot.cmm>
wrote:
> "robertwess...@yahoo.com" <robertwess...@MUNGED.microcosmotalk.com> wrote in
> message news:4b01dafb$0$5103$9a6e...@unlimited.newshosting.com...
>
> > Of course since there are no instructions for which both REPxx and
> > LOCK are valid, having them in a separate groups produces a basically
> > meaningless distinction.
>
> As I see it, your statement is false. A valid instruction is one that
> complies with the processors' instruction encoding/decoding logic. Many
> things can comply. Useless prefix combinations is an example which
> complies. Undocumented instructions, like SALC, is another example.
> "Bugs," like the F00Fh "bug," is another. An instruction sequence isn't
> defined as valid _only_ when it has an _effect_, as you've stated. How do
> you explain NOP's? It's both documented and has "no effect."

Yep, on rechecking the docs, Intel defines the result of REP prefixes
on non-string instructions as "undefined." OTOH, the list of
instructions which permit a LOCK is quite explicit (and the LOCK
prefix is explicitly defined to cause a #UD if you try to use it with
other than an approved form of one of the allowed instructions - note
that the F00F bug was the CPU not generating the required #UD). IOW, you
cannot put a lock on a STOSB, but the result of a REP on an ADD is
undefined.

I had misremembered that Intel had tightened up the allowable usage
for both prefixes. They only did that for LOCK (on the 8086 LOCK was
not restricted, on the 386 and later, it was - and the 286 was just
weird). Although the specification for REP has changed over the
years: for the 486 it's defined as being "ignored when it is used with
all other non-string instructions". The current docs just declare REP
to be undefined for all non-string instructions. Both the AMD and
Intel docs say that REP "should" be limited to the string
instructions.
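
The distinction above (LOCK outside the allowed list is a guaranteed #UD, REP on a non-string instruction is merely "undefined") can be sketched as a small lookup. This is a rough mnemonic-level model, not a decoder: the instruction lists are an assumption based on the Intel manuals as described in this post, and `classify` and `dest_is_memory` are hypothetical names made up for the illustration.

```python
# Sketch only: mnemonic-level model of the documented LOCK/REP rules.
# The lists below are assumptions taken from the manuals as discussed
# in this thread; check the current documentation before relying on them.
LOCKABLE = {
    "ADD", "ADC", "AND", "BTC", "BTR", "BTS", "CMPXCHG", "CMPXCHG8B",
    "DEC", "INC", "NEG", "NOT", "OR", "SBB", "SUB", "XOR", "XADD", "XCHG",
}
STRING_OPS = {"MOVS", "CMPS", "SCAS", "LODS", "STOS", "INS", "OUTS"}


def _base(mnemonic: str) -> str:
    """Strip a size suffix (STOSB -> STOS) from string instructions."""
    stem = mnemonic[:-1]
    return stem if stem in STRING_OPS else mnemonic


def classify(prefix: str, mnemonic: str, dest_is_memory: bool = True) -> str:
    """Return 'ok', '#UD' or 'undefined' for a prefix/instruction pair."""
    prefix = prefix.upper()
    mnemonic = _base(mnemonic.upper())
    if prefix == "LOCK":
        # LOCK is accepted only on the listed read-modify-write forms
        # with a memory destination; everything else is defined as #UD.
        if mnemonic in LOCKABLE and dest_is_memory:
            return "ok"
        return "#UD"
    if prefix in ("REP", "REPE", "REPNE"):
        # REP on a non-string instruction is documented as undefined,
        # which is weaker than a guaranteed exception.
        return "ok" if mnemonic in STRING_OPS else "undefined"
    raise ValueError(f"not a LOCK/REP prefix: {prefix}")
```

With this model, `classify("LOCK", "STOSB")` gives `'#UD'` while `classify("REP", "ADD")` gives `'undefined'`, which is exactly the asymmetry described above.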

But who said anything about effects?

Tim Roberts

Nov 18, 2009, 3:18:46 AM

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote:

>
>"robert...@yahoo.com" wrote:
>>
>> Of course since there are no instructions for which both REPxx and
>> LOCK are valid, having them in a separate groups produces a basically
>> meaningless distinction.
>
>As I see it, your statement is false. A valid instruction is one that
>complies with the processors' instruction encoding/decoding logic.

Your logic is faulty. There are many instructions which could be decoded
using the normal decoding rules which are not necessarily executable by the
processor.

>An instruction sequence isn't
>defined as valid _only_ when it has an _effect_, as you've stated. How do
>you explain NOP's? It's both documented and has "no effect."

NOP is an interesting example of a similar principle. Based strictly on the
decoding rules, opcode 0x90 should be
xchg eax,eax

However, the processor handles it specially (it doesn't cause eax to become
busy, as it ordinarily would).
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

wolfgang kern

Nov 18, 2009, 9:09:29 AM

"BGB / cr88192" wrote:
...

>> :) copy paste from older manuals may lead to such things ...

> yeah, one would have thought Intel to be more professional than this.
> then again, I do remember in the past when writing my assembler
> originally,
> encountering a few errors in the listings (I think, some ModRM opcodes
> lacking the '/r', ...).

> then I made the observation that some of these same errors also existed in
> the same places in the AMD docs, which I thought just curious...

> it is almost as if, say, hell, they were copying/pasting from each other
> in
> a few places...

Sure, the identical wording shows that they work 'together' 'to gather' info.

...

>> My disassembler sets individual flags for the occurrence of any prefix,
>> and their meaning depends on the order and on the opcode that follows.
>> So it easily detects whether it's an SSE variant (or REP, Opsize), or SEG
>> (or Jcc TN/NT)

A few years ago I started on a code analyser which should find out the
meaning of existing executables and translate it into a flowchart including
all values.

The idea of code conversion between OS's is still in my head.

The disassembler for it has been finished for a long time (demo in HEXTUTOR);
it has static-analysis features with value tracking for registers and memory,
and is prepared to include known external behaviour from hardware and OS.

I couldn't continue because it needs more memory than I had on my old
machines.

But with my new gadgets I have several terabytes of virtual memory beside
8GB RAM, so right after I have upgraded my OS and tools to 64-bit I'll
continue with it and hope to make it work at least for smaller code sizes,
like a few MB.

__
wolfgang

wolfgang kern

Nov 18, 2009, 9:09:42 AM

Robert Wessel wrote:

>> The five groups AMD lists are operand size, address size,

>> segment override, lock, and rep.

> Of course since there are no instructions for which both REPxx and
> LOCK are valid, having them in a separate groups produces a basically
> meaningless distinction.

I'm not sure my memory serves me well here, but I seem to remember
a note (286 clones?) reporting buggy behaviour on an unaligned

LOCK
REP
STOSW

and the workaround was to use the order:

REP
LOCK
STOSW

OTOH, both variants may produce exception #6 (#UD) on newer CPUs anyway.
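
To make the ordering question concrete, here is a minimal sketch of a prefix scanner that accepts the legacy prefixes in any order and sorts them into the five groups quoted from the AMD manual above. The function name `scan_prefixes` is hypothetical, the 15-byte figure is the instruction-length limit mentioned earlier in the thread, and the sketch says nothing about what a given CPU does with repeated or conflicting prefixes.

```python
# Legacy prefix bytes, grouped the way the AMD manual is quoted above.
PREFIX_GROUPS = {
    0x66: "operand-size",
    0x67: "address-size",
    0x26: "segment", 0x2E: "segment", 0x36: "segment",
    0x3E: "segment", 0x64: "segment", 0x65: "segment",
    0xF0: "lock",
    0xF2: "rep", 0xF3: "rep",
}
MAX_INSN_LEN = 15  # total instruction length limit, prefixes included


def scan_prefixes(code: bytes):
    """Split leading legacy prefixes from the rest of an instruction.

    Returns (prefix_bytes, repeated_groups, remaining_bytes).
    """
    prefixes = []
    seen = set()
    repeated = set()
    i = 0
    while i < len(code) and code[i] in PREFIX_GROUPS:
        group = PREFIX_GROUPS[code[i]]
        if group in seen:
            repeated.add(group)  # same group used more than once
        seen.add(group)
        prefixes.append(code[i])
        i += 1
    if i >= MAX_INSN_LEN:
        # No room is left for an opcode inside the length limit.
        raise ValueError("prefixes alone reach the 15-byte limit")
    return prefixes, sorted(repeated), code[i:]
```

Both orderings above scan the same way: `F0 F3 AB` and `F3 F0 AB` each yield one lock prefix, one rep prefix and the STOSW opcode `AB`, so any difference in behaviour has to come from the CPU, not from the encoding rules.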

__
wolfgang

BGB / cr88192

Nov 18, 2009, 11:58:30 AM

"wolfgang kern" <now...@never.at> wrote in message

news:4b040018$0$4865$9a6e...@unlimited.newshosting.com...

>
>
> "BGB / cr88192" wrote:
> ...
>>> :) copy paste from older manuals may lead to such things ...
>
>> yeah, one would have thought Intel to be more professional than this.
>> then again, I do remember in the past when writing my assembler
>> originally,
>> encountering a few errors in the listings (I think, some ModRM opcodes
>> lacking the '/r', ...).
>
>> then I made the observation that some of these same errors also existed
>> in
>> the same places in the AMD docs, which I thought just curious...
>
>> it is almost as if, say, hell, they were copying/pasting from each other
>> in
>> a few places...
>
> Sure, the ident wording show that they work 'together' 'to gather' info.
>

maybe...

this seems like a harder problem than I am trying to address...

> The disassembler for it is already finished since long (demo in HEXTUTOR),
> it got static analyse-features with value-tracking registers and memory,
> and is prepared to include known external behaviour from hardware and OS.
>
> I couldn't continue because it needs more memory than I had on my old
> machines.
>
> But with my new gadgets I have several TerraByte virtual beside 8GB RAM,
> so right after I upgraded my OS and tools to 64-bit I'll continue with it
> and hope to make it work at least for smaller code size like a few MB.
>

can't really comment on this.

my decoder "slightly" raises the level of abstraction, but not
significantly.
it is raised mostly to a similar level as ASM.

a similar set of mechanisms could be used with a codegen though, so I may
look into this.
this could help some with something like "portable ASM".

getting sockets implemented is not going terribly fast though, although
probably because other stuff keeps going on.

> __
> wolfgang
>
>

Rod Pemberton

Nov 18, 2009, 8:31:27 PM

"Tim Roberts" <ti...@MUNGED.microcosmotalk.com> wrote in message
news:4b03ade6$0$4875$9a6e...@unlimited.newshosting.com...

> "Rod Pemberton" <do_no...@nohavenot.cmm> wrote:
> >"robert...@yahoo.com" wrote:
> >>
> >> Of course since there are no instructions for which both REPxx and
> >> LOCK are valid, having them in a separate groups produces a basically
> >> meaningless distinction.
> >
> >As I see it, your statement is false. A valid instruction is one that
> >complies with the processors' instruction encoding/decoding logic.
>
> Your logic is faulty.

Unlikely ...

> There are many instructions which could be decoded
> using the normal decoding rules which are not necessarily executable by the
> processor.
>

True.

You're basically making the same argument as Wessel: executability (or a
useful effect) determines the validity of an instruction encoding. In
either case, that's false as stated above. There is zero correlation
between what a processor can decode and what is desirable, correct, or
documented. Remember the 6502?

Rod Pemberton

wolfgang kern

Nov 19, 2009, 11:05:34 AM

"BGB / cr88192" wrote:

...
>> A few years ago I started with a code-analyser, which should find out

>> the meaning of existing executables and translate it into a flowchart
>> including all values.
>> The idea of code conversion between OS's is still in my head.

> this seems like a harder problem than I am trying to address...

>> The disassembler for it is already finished since long (demo in
>> HEXTUTOR),
>> it got static analyse-features with value-tracking registers and memory,
>> and is prepared to include known external behaviour from hardware and OS.

>> I couldn't continue because it needs more memory than I had on my old
>> machines.

>> But with my new gadgets I have several TerraByte virtual beside 8GB RAM,
>> so right after I upgraded my OS and tools to 64-bit I'll continue with it
>> and hope to make it work at least for smaller code size like a few MB.

> can't really comment on this.

:) I'm not sure if I could fully comment on this idea myself ...yet.

> my decoder "slightly" raises the level of abstraction, but not
> significantly.
> it is raised mostly to a similar level as ASM.

> a similar set of mechanisms could be used with a codegen though, so I may
> look into this.
> this could help some with something like "portable ASM".

> getting sockets implemented is not going terribly fast though, although
> probably because other stuff keeps going on.

Ok, let's see to which extent we can achieve our goals ;)

__
wolfgang

BGB / cr88192

Nov 19, 2009, 12:31:59 PM

"wolfgang kern" <now...@never.at> wrote in message

news:4b056cce$0$4972$9a6e...@unlimited.newshosting.com...

>
>
>
> "BGB / cr88192" wrote:
>
> ...
>>> A few years ago I started with a code-analyser, which should find out
>>> the meaning of existing executables and translate it into a flowchart
>>> including all values.
>>> The idea of code conversion between OS's is still in my head.
>
>> this seems like a harder problem than I am trying to address...
>
>>> The disassembler for it is already finished since long (demo in
>>> HEXTUTOR),
>>> it got static analyse-features with value-tracking registers and memory,
>>> and is prepared to include known external behaviour from hardware and
>>> OS.
>
>>> I couldn't continue because it needs more memory than I had on my old
>>> machines.
>
>>> But with my new gadgets I have several TerraByte virtual beside 8GB RAM,
>>> so right after I upgraded my OS and tools to 64-bit I'll continue with
>>> it
>>> and hope to make it work at least for smaller code size like a few MB.
>
>> can't really comment on this.
>
> :) I'm not sure if I could fully comment this idea, ...yet.
>

ok.

>> my decoder "slightly" raises the level of abstraction, but not
>> significantly.
>> it is raised mostly to a similar level as ASM.
>
>> a similar set of mechanisms could be used with a codegen though, so I may
>> look into this.
>> this could help some with something like "portable ASM".
>
>> getting sockets implemented is not going terribly fast though, although
>> probably because other stuff keeps going on.
>
> Ok, let's see to which extent we can achieve our goals ;)
>

I have 'AF_UNIX' (as in, local sockets), and 'SOCK_DGRAM' (AKA: datagram
sockets).
will need to add stream sockets as well.

AF_INET / AF_INET6 (IPv4 and IPv6) exist in a token sense.

so, sockets present an uncertainty:
consider: a sender keeps sending messages (or writing stream data);
the receiver does not receive any messages or read any data.

the issue is that large amounts of memory would be used, and it is not clear
what the best resolution strategy is...
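
One common way to bound the memory in that situation, offered only as a sketch and not as the strategy for this interpreter, is to cap the number of queued bytes per socket and refuse new datagrams once the cap is reached (a stream socket would instead make the writer wait or accept a short write). The class name `BoundedDatagramQueue` and the byte limit are assumptions made up for the example.

```python
from collections import deque


class BoundedDatagramQueue:
    """Per-socket receive queue with a byte limit (hypothetical helper)."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.queued_bytes = 0
        self.messages = deque()
        self.dropped = 0  # datagrams refused because the queue was full

    def enqueue(self, payload: bytes) -> bool:
        """Queue a datagram; return False and count a drop if it won't fit."""
        if self.queued_bytes + len(payload) > self.max_bytes:
            self.dropped += 1
            return False
        self.messages.append(payload)
        self.queued_bytes += len(payload)
        return True

    def dequeue(self):
        """Return the oldest queued datagram, or None if the queue is empty."""
        if not self.messages:
            return None
        payload = self.messages.popleft()
        self.queued_bytes -= len(payload)
        return payload
```

Dropping is acceptable for datagram sockets because delivery is not guaranteed there anyway; the sender that never gets read simply loses messages instead of consuming unbounded memory.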

> __
> wolfgang
>
>

Richard Russell

Nov 19, 2009, 1:39:14 PM

On 19 Nov, 01:31, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> A valid instruction is one that complies with the processors'
> instruction encoding/decoding logic.

I would agree, but only if one more word is added: "A valid
instruction is one that complies with the processor's DOCUMENTED
instruction encoding/decoding logic". If a particular instruction
isn't documented by the CPU vendor, then in no sense can it be said to
be 'valid'.

After all, even if a particular sample of CPU, from a particular
manufacturing batch, did accept the instruction, there's no guarantee
that another sample of nominally the same CPU would do so. There
might have been a mask change that, whilst fully compatible with all
the documented modes of operation, changes the behaviour with that
particular undocumented encoding.

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

Rod Pemberton

Nov 19, 2009, 8:23:19 PM

"Richard Russell" <ne...@MUNGED.microcosmotalk.com> wrote in message
news:4b0590d1$0$5122$9a6e...@unlimited.newshosting.com...

> On 19 Nov, 01:31, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> > A valid instruction is one that complies with the processors'
> > instruction encoding/decoding logic.
>
> I would agree, but only if one more word is added: "A valid
> instruction is one that complies with the processor's DOCUMENTED
> instruction encoding/decoding logic".

False. That one word changes the entire meaning to be equivalent to what
the other two guys said.

> If a particular instruction
> isn't documented by the CPU vendor, then in no sense can it be said to
> be 'valid'.

Sure it can. I believe you, as well as the other two, have the wrong
perspective. The documentation doesn't determine valid encoding/decoding.
The microprocessor circuitry does. Period. So, one should view things from
the microprocessor's perspective. Apparently, it seems I'm the only one who
thinks so...

> After all, even if a particular sample of CPU, from a particular
> manufacturing batch, did accept the instruction, there's no guarantee
> that another sample of nominally the same CPU would do so.

Irrelevant.

> There
> might have been a mask change that, whilst fully compatible with all
> the documented modes of operation, changes the behaviour with that
> particular undocumented encoding.

True.

(I have no further interest in this thread.)

Rod Pemberton

Alexei A. Frounze

Nov 19, 2009, 10:46:30 PM

On Nov 19, 5:23 pm, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> "Richard Russell" <n...@MUNGED.microcosmotalk.com> wrote in message

>
> news:4b0590d1$0$5122$9a6e...@unlimited.newshosting.com...
>
> > On 19 Nov, 01:31, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> > > A valid instruction is one that complies with the processors'
> > > instruction encoding/decoding logic.
>
> > I would agree, but only if one more word is added: "A valid
> > instruction is one that complies with the processor's DOCUMENTED
> > instruction encoding/decoding logic".
>

> False. That one word changes the entire meaning to be equivalent to what
> the other two guys said.
>
> > If a particular instruction
> > isn't documented by the CPU vendor, then in no sense can it be said to
> > be 'valid'.
>

> Sure it can. I believe you, as well as the other two, have the wrong
> perspective. The documentation doesn't determine valid encoding/decoding.
> The microprocessor circuitry does. Period. So, one should view things from
> the microprocessor's perspective. Apparently, it seems I'm the only one who
> thinks so...

In general it depends...

Documentation may be incorrect, ambiguous or incomplete. I've found
that to be the case many times in the Intel docs (and I believe a few
times in the AMD docs too). In that case the actual CPU behavior fills
in some of the doc gaps.

Also, if you need to support existing (legacy) software as well as you
can in something like a CPU emulator, then you want to expand the
range of what counts as valid beyond what is documented, and emulate as
closely as possible the actual CPU behavior that the software may expect.

In all other cases (especially when you want to be sure you don't take
a dependency on something that's not guaranteed or properly documented),
you should stick with the docs' definition of valid.

Alex

David

Nov 19, 2009, 11:54:55 PM

For those who are too young, there was an old 'loadall' instruction on
the 286. I think it went away in the 386.

On 20 Nov 2009 03:46:30 GMT, "Alexei A. Frounze"
<alexf...@MUNGED.microcosmotalk.com> wrote:

>
>On Nov 19, 5:23 pm, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
>> "Richard Russell" <n...@MUNGED.microcosmotalk.com> wrote in message
>>
>> news:4b0590d1$0$5122$9a6e...@unlimited.newshosting.com...
>>
>> > On 19 Nov, 01:31, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
>> > > A valid instruction is one that complies with the processors'
>> > > instruction encoding/decoding logic.
>>
>> > I would agree, but only if one more word is added: "A valid
>> > instruction is one that complies with the processor's DOCUMENTED
>> > instruction encoding/decoding logic".
>>
>> False. That one word changes the entire meaning to be equivalent to what
>> the other two guys said.
>>
>> > If a particular instruction
>> > isn't documented by the CPU vendor, then in no sense can it be said to
>> > be 'valid'.
>>
>> Sure it can. I believe you, as well as the other two, have the wrong
>> perspective. The documentation doesn't determine valid encoding/decoding.
>> The microprocessor circuitry does. Period. So, one should view things from
>> the microprocessor's perspective. Apparently, it seems I'm the only one who
>> thinks so...

ArarghMai...@not.at.arargh.com

Nov 20, 2009, 6:13:10 AM

On 20 Nov 2009 04:54:55 GMT, David <da...@nowhere.net> wrote:

>
>For those who are too young there was an old 'loadall' instruction on
>the 286. I think it went away in the 386.

It went away with the 486. Some early 386s, or maybe all of them, had
it. But it was implemented differently. Different opcode too, I
think.

<snip>
--
ArarghMail911 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

To reply by email, remove the extra stuff from the reply address.

Alexei A. Frounze

Nov 20, 2009, 6:13:43 AM

On Nov 19, 8:54 pm, David <d...@nowhere.net> wrote:
> For those who are too young there was an old 'loadall' instruction on
> the 286. I think it went away in the 386.

Yep. I no longer had a 286 when I could spell loadall.

Alex

Terje Mathisen

Nov 20, 2009, 1:48:28 PM

Tim Roberts wrote:

> "Rod Pemberton"<do_no...@nohavenot.cmm> wrote:
>> An instruction sequence isn't
>> defined as valid _only_ when it has an _effect_, as you've stated. How do
>> you explain NOP's? It's both documented and has "no effect."
>
> NOP is a interesting example of a similar principle. Based strictly on the
> decoding rules, opcode 0x90 should be
> xchg eax,eax
>
> However, the processor handles it specially (it doesn't cause eax to become
> busy, as it ordinarily would).

NOP is interesting, in that it did indeed mean (and decode as)

XCHG AX,AX

on the original 808x.

I believe this was maintained at least until the ~486 timeframe, but by
the time of the first OoO cpu (P6/PentiumPro), the difference between
touching and not touching a register became crucial, so Intel started
decoding it and handling it as a true NOP, with no side effects at all.

This was also added to the cpu manuals.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

robert...@munged.microcosmotalk.com

Nov 20, 2009, 9:34:56 PM

On Nov 20, 12:48 pm, Terje Mathisen
<Terje.Mathi...@MUNGED.microcosmotalk.com> wrote:
> Tim Roberts wrote:
> > However, the processor handles it specially (it doesn't cause eax to become
> > busy, as it ordinarily would).
>
> NOP is interesting, in that it did indeed mean (and decode as)
>
>         XCHG AX,AX
>
> on the original 808x.
>
> I believe this was maintained at least until the ~486 timeframe, but by
> the time of the first OoO cpu (P6/PentiumPro), the difference between
> touching and not touching a register became crucial, so Intel started
> decoding it and handling it as a true NOP, with no side effects at all.

I tested that in response to a question a few years ago, and on the
'486, running a million iterations of 100 instructions in a row, xchg
ax,ax was 2.7 times faster than xchg bx,bx.

Semi-amusingly, several of the same people were in that thread as in
this one. I'm not sure that's a positive comment on our lives...

http://groups.google.com/group/comp.lang.asm.x86/msg/e385b1e074f939ef?hl=en

Rugxulo

Nov 20, 2009, 9:35:47 PM

Hi,

On Nov 20, 12:48 pm, Terje Mathisen
<Terje.Mathi...@MUNGED.microcosmotalk.com> wrote:
>
> NOP is interesting, in that it did indeed mean (and decode as)
>
>         XCHG AX,AX
>
> on the original 808x.
>
> I believe this was maintained at least until the ~486 timeframe, but by
> the time of the first OoO cpu (P6/PentiumPro), the difference between
> touching and not touching a register became crucial, so Intel started
> decoding it and handling it as a true NOP, with no side effects at all.
>
> This was also added to the cpu manuals.

To quote Madis731 (from FASM's forum):

"Erm, NOP takes 0.5 clocks from Pentium and later.
It takes 0.333 clocks from Pentium III and later AND
it takes 0.25-0.333 (depending on how you schedule) clocks from Core
arch. and later.
So the maximum needed 15-byte alignment takes 5 clock maximum!!! "

http://board.flatassembler.net/topic.php?t=3331

Robert Redelmeier

Nov 21, 2009, 12:51:36 AM

robert...@yahoo.com <robert...@munged.microcosmotalk.com> wrote in part:

> I tested that in response to a question a few years ago, and
> on the '486, running a million iterations of 100 instructions
> in a row, xchg ax,ax was 2.7 times faster than xchg bx,bx.

Not surprising -- xchg ax,ax is one byte while xchg bx,bx is two.
Yes, the 486 has cache, but the second byte still takes time to fetch,
and it keeps the instruction decoder busy with a ModRM byte.

-- Robert

Alexei A. Frounze

Nov 21, 2009, 4:03:39 AM

On Nov 20, 10:48 am, Terje Mathisen
<Terje.Mathi...@MUNGED.microcosmotalk.com> wrote:
> Tim Roberts wrote:
> > "Rod Pemberton"<do_not_h...@nohavenot.cmm> wrote:
> >> An instruction sequence isn't
> >> defined as valid _only_ when it has an _effect_, as you've stated. How do
> >> you explain NOP's? It's both documented and has "no effect."
>
> > NOP is an interesting example of a similar principle. Based strictly on the
> > decoding rules, opcode 0x90 should be
> >      xchg  eax,eax
>
> > However, the processor handles it specially (it doesn't cause eax to become
> > busy, as it ordinarily would).
>
> NOP is interesting, in that it did indeed mean (and decode as)
>
>         XCHG AX,AX
>
> on the original 808x.
>
> I believe this was maintained at least until the ~486 timeframe, but by
> the time of the first OoO cpu (P6/PentiumPro), the difference between
> touching and not touching a register became crucial, so Intel started
> decoding it and handling it as a true NOP, with no side effects at all.
>
> This was also added to the cpu manuals.

Also in 64-bit mode it's important to treat XCHG EAX, EAX as NOP
because if it was a real XCHG EAX, EAX, it would have to zero out the
top 32 bits of RAX as many ALU instructions do with 32-bit operands in
64-bit mode. The REX prefix further complicates matters since EAX has
to be distinguished from R8D. Hence opcode 0x90 isn't always a NOP --
the REX has to be taken into account.
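
As a rough illustration of that last point, the following sketch decodes opcode 90h in 64-bit mode with an optional REX prefix in front of it. It is a toy under stated assumptions: `decode_0x90` is a hypothetical helper, only a single REX byte directly before the opcode is handled, and other prefixes (such as 66h, or F3h which turns 90h into PAUSE) are deliberately left out.

```python
def decode_0x90(code: bytes) -> str:
    """Decode opcode 90h in 64-bit mode, honouring an optional REX prefix."""
    rex = 0
    i = 0
    if 0x40 <= code[i] <= 0x4F:  # REX prefixes occupy 40h..4Fh
        rex = code[i]
        i += 1
    if code[i] != 0x90:
        raise ValueError("not opcode 90h")
    rex_w = bool(rex & 0x08)  # 64-bit operand size
    rex_b = bool(rex & 0x01)  # extends the register field: 0 -> 8
    if not rex_b:
        # Register 0 is (R/E)AX exchanged with itself: treated as NOP,
        # so the upper half of RAX is not cleared.
        return "nop"
    return "xchg r8, rax" if rex_w else "xchg r8d, eax"
```

So `90` and `48 90` both come out as NOP, while `41 90` is a real exchange between R8D and EAX and `49 90` one between R8 and RAX.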

Alex

Terje Mathisen

Nov 22, 2009, 4:30:32 PM

robert...@yahoo.com wrote:
> On Nov 20, 12:48 pm, Terje Mathisen
> <Terje.Mathi...@MUNGED.microcosmotalk.com> wrote:
>> I believe this was maintained at least until the ~486 timeframe, but by
>> the time of the first OoO cpu (P6/PentiumPro), the difference between
>> touching and not touching a register became crucial, so Intel started
>> decoding it and handling it as a true NOP, with no side effects at all.
>
> I tested that in response to a question a few years ago, and on the
> '486, running a million iterations of 100 instructions in a row, xchg
> ax,ax was 2.7 times faster than xchg bx,bx.

Did you also try the 2-byte XCHG AX,AX opcode, i.e. the same as XCHG
BX,BX but with register #0 instead of #3?
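
For reference, the encodings being compared can be generated with a few lines. This is only a sketch of the 16-bit register-to-register forms: the helper names `xchg_short` and `xchg_modrm` are made up here, and the operand-size prefix that a 32-bit code segment would need for 16-bit operands is ignored.

```python
# 16-bit register numbers as used in the opcode and ModRM fields.
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]


def xchg_short(reg: str) -> bytes:
    """One-byte form 90h+r: XCHG AX, reg (reg = AX gives 90h, i.e. NOP)."""
    return bytes([0x90 + REG16.index(reg)])


def xchg_modrm(rm_reg: str, reg: str) -> bytes:
    """Two-byte form 87h /r with a register-direct ModRM byte (mod = 11)."""
    modrm = 0xC0 | (REG16.index(reg) << 3) | REG16.index(rm_reg)
    return bytes([0x87, modrm])
```

`xchg_modrm("AX", "AX")` yields `87 C0`, the two-byte XCHG AX,AX asked about above, and `xchg_modrm("BX", "BX")` yields `87 DB`, so timing those two against each other would separate the effect of instruction length from any special handling of `90`.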

>
> Semi-amusingly, several of the same people were in that thread as in
> this one. I'm not sure that's a positive comment on our lives...

Life? What life? Anyone here got a life? :-)

Terje
PS. I guess I do: Wife & two kids (20 & 18), scout leader, orienteer,
rock climber, xc and alpine skier, snowboarder, windsurfer, kiter.

I'm also a ham (la8nw), plus I do some white water kayak and canoe.

I used to play volleyball while at university, where I also joined the
gymnastics team and did some springboard diving (1m & 3m).

What I don't have is a lot of spare time...

Terje Mathisen

Nov 22, 2009, 4:32:21 PM

Rugxulo wrote:
> To quote Madis731 (from FASM's forum):
>
> "Erm, NOP takes 0.5 clocks from Pentium and later.
> It takes 0.333 clocks from Pentium III and later AND
> it takes 0.25-0.333 (depending on how you schedule) clocks from Core
> arch. and later.
> So the maximum needed 15-byte alignment takes 5 clock maximum!!! "

Except that nobody would use 15 NOPs in a row when there are many longer
instructions that also have zero effects, like

LEA EBX,[EBX+00000000]

and which will execute faster than the same number of NOP bytes.
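
A minimal sketch of that idea, assuming 32-bit code and a small hand-picked table of filler instructions (real assemblers use longer and better-tuned tables): pick the longest filler that still fits until the gap is closed. The table `FILLERS` and the function `pad` are hypothetical names for this example.

```python
# Filler instructions with no architectural effect in 32-bit mode.
FILLERS = {
    1: bytes([0x90]),                                # NOP
    2: bytes([0x66, 0x90]),                          # 66h prefix + NOP
    3: bytes([0x8D, 0x5B, 0x00]),                    # LEA EBX,[EBX+00]
    6: bytes([0x8D, 0x9B, 0x00, 0x00, 0x00, 0x00]),  # LEA EBX,[EBX+00000000]
}


def pad(gap: int) -> list:
    """Return a list of filler instructions whose lengths add up to gap."""
    out = []
    for size in sorted(FILLERS, reverse=True):  # longest filler first
        while gap >= size:
            out.append(FILLERS[size])
            gap -= size
    return out
```

For a 15-byte gap this picks 6 + 6 + 3 bytes, i.e. three instructions instead of fifteen single-byte NOPs; whether that is actually faster on a given CPU is something to measure rather than assume.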

Terje Mathisen

Nov 22, 2009, 4:33:44 PM

wolfgang kern wrote:
> "BGB / cr88192" wrote:
> ....
>>> REX seem to be the only one which 'shall immediate precede opcode',
>>> the others may come in any order.
>
>> but, it does precede 0F, which is listed as a prefix which is also a part
>> of
>> the opcode (escape for more opcodes...).
>

> Yes it were once listed as prefix, but I never saw 0Fh as a prefix byte,

0Fh was indeed a prefix afair, it was used to signal the x87 coprocessor
that what followed was a fp operation.

Later on, 0Fh got used for a lot of things of course, but I believe the
initial idea was to use it to signal an exit from the main cpu into a
secondary chip...

Richard Russell

Nov 23, 2009, 4:25:36 AM

On 18 Nov, 06:02, "robertwess...@yahoo.com" wrote:
> The current docs just declare REP to be undefined for all
> non-string instructions. Both the AMD and Intel docs say
> that REP "should" be limited to the string instructions.

Interestingly, the 'AMD Software Optimization Guide' specifically
recommends the use of the sequence F3 C3 (REP RET) as a 'two byte near-
return instruction' (25112.PDF, section 6.2). However, despite that
document still being available from AMD's own web site, the later
'Software Optimization Guide for AMD Family 10h
Processors' (40546.PDF) instead recommends the use of either 90 C3
(NOP RET) or C2 00 00 (RET 0).

I have no idea whether we are to conclude from this that AMD no longer
considers REP RET to be 'safe', but apparently some compilers do
generate it so hopefully it's accepted by all x86-compatible CPUs.

Richard Russell

Nov 23, 2009, 4:26:15 AM

On 22 Nov, 21:33, Terje Mathisen wrote:
> 0Fh was indeed a prefix afair, it was used to signal the x87 coprocessor
> that what followed was a fp operation.

I've always assumed that on the original 8086/88 opcode 0F would have
attempted to do POP CS, as its bit pattern would suggest.
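
The bit pattern in question can be spelled out in a short sketch: on the 8086 the one-byte segment PUSH and POP opcodes are 000ss110 and 000ss111, with ss selecting ES, CS, SS or DS. The helper names `push_seg` and `pop_seg` are made up for the illustration.

```python
# Two-bit segment register field used by the 8086 one-byte PUSH/POP forms.
SEG = ["ES", "CS", "SS", "DS"]


def push_seg(seg: str) -> int:
    """Opcode for PUSH seg: bit pattern 000ss110."""
    return 0x06 | (SEG.index(seg) << 3)


def pop_seg(seg: str) -> int:
    """Opcode for POP seg: bit pattern 000ss111."""
    return 0x07 | (SEG.index(seg) << 3)
```

Filling in CS (ss = 01) gives `0E` for PUSH CS and `0F` for the slot where POP CS would sit, next to `07`, `17` and `1F` for POP ES, POP SS and POP DS.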

> I believe the initial idea was to use it to signal an exit
> from the main cpu into a secondary chip...

I've not heard that before. There was of course the ESC instruction
(opcodes D8-DF) which was originally documented as 'a mechanism by
which other processors may receive their instructions...' but which
has been subsumed into the coprocessor instructions.

ArarghMai...@not.at.arargh.com

Nov 23, 2009, 4:27:03 AM

On 22 Nov 2009 21:33:44 GMT, Terje Mathisen
<Terje.M...@MUNGED.microcosmotalk.com> wrote:

>
>wolfgang kern wrote:
>> "BGB / cr88192" wrote:
>> ....
>>>> REX seem to be the only one which 'shall immediate precede opcode',
>>>> the others may come in any order.
>>
>>> but, it does precede 0F, which is listed as a prefix which is also a part
>>> of
>>> the opcode (escape for more opcodes...).
>>
>> Yes it were once listed as prefix, but I never saw 0Fh as a prefix byte,
>
>0Fh was indeed a prefix afair, it was used to signal the x87 coprocessor
>that what followed was a fp operation.

It was? On what processor? Not any x86 that I know of. On the
original 808x 0Fh was a POP CS instruction. By the 80286 it was the
first byte of the extended instructions. Not sure which way the 8018x
used it, but I think the same as the 80286.

>Later on, 0Fh got used for a lot of things of course, but I believe the
>initial idea was to use it to signal an exit from the main cpu into a
>secondary chip...

No, AFAIK, 0Fh was never used that way. Opcodes in the 0D8 to
0DF(IIRC) range were the escape to the math coprocessor until the
80486 when the math coprocessor was moved to the processor chip.

Actually, IIRC, they weren't an escape as such, but the main processor
ignored them, while the co-processor which was supposed to be
monitoring bus i-fetch cycles would see them and process them. That
was why you had to use the FWAIT instruction in the main processor, to
ensure that the co-processor had finished the previous instruction.

All of the above AFAIK, and IIRC. It's been years.

Richard Russell

Nov 23, 2009, 9:12:53 AM

On 23 Nov, 09:27, ArarghMail911NOS...@NOT.AT.Arargh.com wrote:
> Not sure which way the 8018x used it, but I think the same as the 80286.

I don't believe the 80186/188 differed from the 8086/88 in the
interpretation of opcode 0F. In the 'iAPX 86/88, 186/188 User's
Manual' it's documented under 'POP seg-reg' as an 'undefined
operation' (CS illegal). None of the 186/188 extensions used it (PUSH
imm, PUSHA/POPA, IMUL dst,src,imm, BOUND, ENTER, LEAVE, INS, OUTS).

> Actually, IIRC, they weren't an escape as such, but the main processor
> ignored them

AIUI the main processor had the responsibility for fetching the
operand bytes (if any), hence you could do ESC 29,[BX+DI+5] or ESC
6,ARRAY[SI] and the necessary fetches would take place even if there
wasn't a coprocessor to execute the instructions.
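
For the curious, the two ESC examples can be assembled by hand with a small sketch. It assumes the classic 8086 layout, where the 6-bit external opcode is split between the low three bits of the D8-DF byte and the reg field of the ModRM byte; `encode_esc` is a hypothetical helper, and the offset 1234h used for ARRAY is invented purely for the example.

```python
def encode_esc(ext_opcode: int, mod: int, rm: int, disp: bytes = b"") -> bytes:
    """Encode ESC ext_opcode, <operand> as 11011xxx mod yyy r/m [disp].

    xxx is the high half and yyy the low half of the 6-bit external opcode.
    """
    if not 0 <= ext_opcode < 64:
        raise ValueError("external opcode must fit in 6 bits")
    first = 0xD8 | (ext_opcode >> 3)
    modrm = (mod << 6) | ((ext_opcode & 0x07) << 3) | rm
    return bytes([first, modrm]) + disp


# ESC 29,[BX+DI+5]: mod = 01 (8-bit displacement), r/m = 001 ([BX+DI])
esc_29 = encode_esc(29, 0b01, 0b001, bytes([0x05]))

# ESC 6,ARRAY[SI]: mod = 10 (16-bit displacement), r/m = 100 ([SI]);
# ARRAY is assumed to live at offset 1234h for this example only.
esc_6 = encode_esc(6, 0b10, 0b100, bytes([0x34, 0x12]))
```

The first comes out as `DB 69 05` and the second as `D8 B4 34 12`; the main CPU computes the effective address from the ModRM byte in both cases, whether or not a coprocessor is listening.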

ArarghMai...@not.at.arargh.com

Nov 23, 2009, 10:32:40 AM

On 23 Nov 2009 14:12:53 GMT, Richard Russell
<ne...@MUNGED.microcosmotalk.com> wrote:

>
>On 23 Nov, 09:27, ArarghMail911NOS...@NOT.AT.Arargh.com wrote:
>> Not sure which way the 8018x used it, but I think the same as the 80286.
>
>I don't believe the 80186/188 differed from the 8086/88 in the
>interpretation of opcode 0F. In the 'iAPX 86/88, 186/188 User's
>Manual' it's documented under 'POP seg-reg' as an 'undefined
>operation' (CS illegal). None of the 186/188 extensions used it (PUSH
>imm, PUSHA/POPA, IMUL dst,src,imm, BOUND, ENTER, LEAVE, INS, OUTS).

Ok. I hadn't bothered to look it up.

>> Actually, IIRC, they weren't an escape as such, but the main processor
>> ignored them
>
>AIUI the main processor had the responsibility for fetching the
>operand bytes (if any), hence you could do ESC 29,[BX+DI+5] or ESC
>6,ARRAY[SI] and the necessary fetches would take place even if there
>wasn't a coprocessor to execute the instructions.

Yes, but I think the main processor just ignored what it had fetched.
That's why I said the co-processor had to monitor the bus in order to
see the instructions.

Terje Mathisen

Nov 23, 2009, 12:21:20 PM

Richard Russell wrote:
> On 22 Nov, 21:33, Terje Mathisen wrote:
>> 0Fh was indeed a prefix afair, it was used to signal the x87 coprocessor
>> that what followed was a fp operation.
>
> I've always assumed that on the original 8086/88 opcode 0F would have
> attempted to do POP CS, as its bit pattern would suggest.
>
>> I believe the initial idea was to use it to signal an exit
>> from the main cpu into a secondary chip...
>
> I've not heard that before. There was of course the ESC instruction
> (opcodes D8-DF) which was originally documented as 'a mechanism by
> which other processors may receive their instructions...' but which
> has been subsumed into the coprocessor instructions.

Mea Culpa!

You're absolutely right, I misremembered this totally.

OTOH, it is more than 20 years since I wrote a disassembler, and a bit
more since I wrote the "perfect" executable text encoder. :-)

Terje
PS. "Perfect" in the sense that I used the minimum possible amount of
self-modification (a single 2-byte backwards branch instruction), and
that the resulting code was self-relocating enough to survive all the
most likely reformatting operations (changing CRLF to just CR (Mac) or
LF (Unix), or turning an entire paragraph into a single line).

Rod Pemberton

Nov 23, 2009, 6:51:49 PM

"Terje Mathisen" <Terje.M...@MUNGED.microcosmotalk.com> wrote in message
news:4b0ac490$0$5098$9a6e...@unlimited.newshosting.com...
>
> ... minimum possible amount of
> self-modification ...

In the rare situations where one would want self-modifying code, e.g., Hugi
size coding competition, what are the good situations or instructions to use
self-modifying code?

Rod Pemberton

Alexei A. Frounze

Nov 24, 2009, 5:17:25 AM

On Nov 23, 3:51 pm, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> "Terje Mathisen" <Terje.Mathi...@MUNGED.microcosmotalk.com> wrote in message
> news:4b0ac490$0$5098$9a6e...@unlimited.newshosting.com...
>
> > ... minimum possible amount of
> > self-modification ...
>
> In the rare situations where one would want self-modifying code, e.g., Hugi
> size coding competition, what are the good situations or instructions to use
> self-modifying code?

Well, CPU emulation and testing of CPU emulation could be one such
application of self-mod code.
Also, I remember people generate code on the fly for the inner loops
of their 3d engines. Instead of having a bunch of fast optimized
subroutines that differ only in minor details (different constants,
slightly different instructions) and instead of having one slow
general-purpose routine they pregenerate a few as necessary and call
them afterwards.
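
A minimal sketch of that idea in C, under stated assumptions: it only emits the machine-code bytes for a routine specialised on one constant ("add eax, imm32 ; ret") into a caller-supplied buffer. The function name is hypothetical, and copying the bytes into executable memory and calling them is platform-specific and left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emit "add eax, imm32 ; ret" with the constant k baked into the code.
 * Needs 6 bytes of space. Returns the number of bytes written.
 * Hypothetical helper; a real generator would emit a whole inner loop. */
static size_t emit_add_const(uint8_t *buf, size_t cap, uint32_t k)
{
    size_t n = 0;
    assert(buf != NULL && cap >= 6);
    buf[n++] = 0x05;                        /* ADD EAX, imm32 */
    for (int i = 0; i < 4; i++)
        buf[n++] = (uint8_t)(k >> (8 * i)); /* immediate, little-endian */
    buf[n++] = 0xC3;                        /* RET */
    return n;
}
```

Calling emit_add_const once per constant gives one short routine per variant, which is the "pregenerate a few as necessary" step; the constant then costs nothing at run time because it is part of the instruction.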

Alex

Steve

Nov 24, 2009, 8:51:39 AM

Richard Russell <ne...@MUNGED.microcosmotalk.com> writes:
>
>On 23 Nov, 09:27, ArarghMail911NOS...@NOT.AT.Arargh.com wrote:
>> Not sure which way the 8018x used it, but I think the same as the 80286.
>
>I don't believe the 80186/188 differed from the 8086/88 in the
>interpretation of opcode 0F. In the 'iAPX 86/88, 186/188 User's
>Manual' it's documented under 'POP seg-reg' as an 'undefined
>operation' (CS illegal). None of the 186/188 extensions used it (PUSH
>imm, PUSHA/POPA, IMUL dst,src,imm, BOUND, ENTER, LEAVE, INS, OUTS).

Hi,

Intel AP-186, 10.2 Instruction Execution Differences between the 8086 and 80186
..

0FH Opcode:

When the opcode 0FH is encountered, the 8086 will
execute a POP CS, while the 80186 will execute an
illegal instruction exception interrupt type 6.

Regards,

Steve N.
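
For a disassembler or emulator this boils down to a per-CPU switch on opcode 0Fh. A minimal sketch in C follows; the enum and function names are invented for illustration, the 8086 and 80186 cases follow the AP-186 text quoted above, and the 80286-and-later case (0Fh as the first byte of the two-byte opcodes) is added from general knowledge rather than from that note.

```c
#include <assert.h>

enum cpu_model { CPU_8086, CPU_80186, CPU_80286_PLUS };

enum op0f_meaning {
    OP0F_POP_CS,          /* 8086/88: executes POP CS */
    OP0F_INVALID_INT6,    /* 80186/188: illegal instruction, interrupt type 6 */
    OP0F_TWO_BYTE_ESCAPE  /* 80286 and later: first byte of a two-byte opcode */
};

/* Classify what a first opcode byte of 0Fh means on a given CPU model. */
static enum op0f_meaning classify_0f(enum cpu_model cpu)
{
    switch (cpu) {
    case CPU_8086:       return OP0F_POP_CS;
    case CPU_80186:      return OP0F_INVALID_INT6;
    case CPU_80286_PLUS: return OP0F_TWO_BYTE_ESCAPE;
    }
    assert(0 && "unknown CPU model");
    return OP0F_INVALID_INT6;
}
```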

Robert Redelmeier

Nov 24, 2009, 10:53:59 AM

Rod Pemberton <do_no...@nohavenot.cmm> wrote in part:

> "Terje Mathisen" <Terje.M...@MUNGED.microcosmotalk.com> wrote in message

>> ... minimum possible amount of self-modification ...
>
> In the rare situations where one would want self-modifying
> code, e.g., Hugi size coding competition, what are the good
> situations or instructions to use self-modifying code?

I suppose if you are set on doing this evil thing,
it is our duty to help you do it in the least bad way?

IIRC, AMD had some guidelines for SMC. The main thing was to
avoid modifying nearby code (same/next cacheline) because this
forced all sorts of dead-slow cache thrashing. The same applies,
to a lesser extent, to any active code. But the forced cache
spill/reload might not be an excessive penalty for write-once,
execute-many. Write-once, execute-once should be avoided.

-- Robert

Richard Russell

Nov 24, 2009, 2:36:43 PM

On 24 Nov, 15:53, Robert Redelmeier <red...@ev1.net.invalid> wrote:
> IIRC, AMD had some guidelines for SMC.

Intel too. See the 'IA-32 Intel Architecture Optimization Reference
Manual'... General Optimization Guidelines... Memory Accesses...
Mixing Code and Data... Self-modifying code (page 2-47 in my copy)
where it says "Software should avoid writing to a code page in the
same 1 KB subpage as that being executed, or fetching code in the same
2 KB subpage as that currently being written".
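
A code generator can check that rule before patching. The sketch below is C and rests on one assumption that the quoted sentence does not spell out: that a "subpage" is a naturally aligned 1 KB or 2 KB block. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a and b fall in the same naturally aligned block of block_size
 * bytes (block_size must be a power of two). */
static bool same_block(uintptr_t a, uintptr_t b, uintptr_t block_size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return (a & ~(block_size - 1)) == (b & ~(block_size - 1));
}

/* Writing at write_addr while executing at exec_addr: same 1 KB subpage? */
static bool smc_write_too_close(uintptr_t write_addr, uintptr_t exec_addr)
{
    return same_block(write_addr, exec_addr, 1024);
}

/* Fetching code at fetch_addr while write_addr is being written:
 * same 2 KB subpage? */
static bool smc_fetch_too_close(uintptr_t fetch_addr, uintptr_t write_addr)
{
    return same_block(fetch_addr, write_addr, 2048);
}
```

If either check returns true, the patch site and the running code are closer than the guideline recommends, so the generated code should be placed further away.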

> I suppose if you are set on doing this evil thing,
> it is our duty to help you do it in the least bad way?

Whether SMC is "evil" is debatable. It's rather hard to avoid in a
JIT compiler!

wolfgang kern

Nov 25, 2009, 10:57:11 AM

Steve mentioned:

>>I don't believe the 80186/188 differed from the 8086/88 in the
>>interpretation of opcode 0F. In the 'iAPX 86/88, 186/188 User's
>>Manual' it's documented under 'POP seg-reg' as an 'undefined
>>operation' (CS illegal). None of the 186/188 extensions used it (PUSH
>>imm, PUSHA/POPA, IMUL dst,src,imm, BOUND, ENTER, LEAVE, INS, OUTS).

> Hi,
>
> Intel AP-186, 10.2 Instruction Execution Differences between the 8086
> and 80186
> ..
>
> 0FH Opcode:
>
> When the opcode 0FH is encountered, the 8086 will
> execute a POP CS, while the 80186 will execute an
> illegal instruction exception interrupt type 6.

Yeah, IIRC the exception 06 handler could then detect 0Fh and treat it as
an ESC, similar to skipping over a redundant, faultily compiled LOCK (F0h).
Perhaps this was what Terje remembered.

__
wolfgang

wolfgang kern

Nov 25, 2009, 10:57:24 AM

Rod Pemberton asked:

>> ... minimum possible amount of self-modification ...

> In the rare situations where one would want self-modifying code, e.g.,
> Hugi size coding competition, what are the good situations or
> instructions to use self-modifying code?

SMC becomes more than handy in all cases where frequently used code needs
only a few changes, at an (often much) lower rate than it is executed.

I use it in the OS-core:
* on screen resolution changes, so I can keep my only one GUI-packet in RAM
because limits, line-size and pages are part of the code (imm-constants
rather than usually slower variables).
* on user-swaps, like above with user-ID and access-rights...
* on external media changes (FD,CD, planned for USB-devices when possible)
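
The "imm-constants rather than variables" idea can be sketched as patching a 32-bit immediate inside an existing code image. This is C, the function name and the example offset are assumptions for illustration, and making the code page writable is platform-specific and not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Overwrite the 32-bit little-endian immediate at imm_offset inside code[].
 * The caller must know where the immediate sits, e.g. offset 1 in a
 * "mov eax, imm32" (B8 imm32) placed at the start of the image. */
static void patch_imm32(uint8_t *code, size_t code_len,
                        size_t imm_offset, uint32_t value)
{
    assert(code != NULL && code_len >= 4 && imm_offset <= code_len - 4);
    for (int i = 0; i < 4; i++)
        code[imm_offset + i] = (uint8_t)(value >> (8 * i));
}
```

For example, with a 6-byte image B8 00 00 00 00 C3, patch_imm32(code, 6, 1, 1024) turns it into "mov eax, 1024 ; ret", so a new line size after a resolution change becomes part of the instruction instead of a memory load.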

Applications can often gain speed whenever a change by SMC makes
sense, because loading a single sector from media takes much more
time than a full TLB flush or cache penalty.

...but debugging/disassembling SMC may be a PITA.
__
wolfgang

Terje Mathisen

Nov 26, 2009, 5:36:36 AM

No general rule, but normally it takes much more code to modify an
instruction than to simply include said opcode from the start.

Where it can save you on size is when you can have a loop of code or
something large which breaks out due to a side-effect which modifies one
of the loop instructions. :-)

Terje

Terje Mathisen

Nov 26, 2009, 5:37:16 AM

Alexei A. Frounze wrote:
> Also, I remember people generate code on the fly for the inner loops
> of their 3d engines. Instead of having a bunch of fast optimized
> subroutines that differ only in minor details (different constants,
> slightly different instructions) and instead of having one slow
> general-purpose routine they pregenerate a few as necessary and call
> them afterwards.

This is the approach Mike Abrash/Tom Forsyth (of RadGameTools.com) used
for their DX7 sw engine, i.e. they built a mini-assembler/optimizer that
would take as input code fragments corresponding to each possible shader
operation and generate pretty close to perfectly optimized straight-line
code for the exact operation wanted.

Yes, this was a _lot_ of work, but for several games the resulting sw
fallback ran faster than on a slow 3d hw. :-)

Terje

Martin Str|mberg

Nov 27, 2009, 6:54:25 AM

Terje Mathisen <Terje.M...@munged.microcosmotalk.com> wrote:

> PS. "Perfect" in the sense that I used the minimum possible amount of
> self-modification (a single 2-byte backwards branch instruction), and
> that the resulting code was self-relocating enough to survive all the
> most likely reformatting operations (changing CRLF to just CR (Mac) or
> LF (Unix), or turning an entire paragraph into a single line).

Why is a minimum amount of SMC better than SMC? If you use it once, you
could just as well use it several times. No?

--
MartinS

Terje Mathisen

Nov 27, 2009, 11:02:35 AM

If you have to ask, you probably don't need (or care about) elegant
code. :-)

In this particular case I wanted to make the initial bootstrap code as
short as possible, simply because that would leave the minimum number of
potential trouble spots.

Terje

BGB / cr88192

Dec 1, 2009, 10:02:48 AM

"Terje Mathisen" <Terje.M...@MUNGED.microcosmotalk.com> wrote in message

news:4b09ade5$0$4938$9a6e...@unlimited.newshosting.com...

>
> Rugxulo wrote:
>> To quote Madis731 (from FASM's forum):
>>
>> "Erm, NOP takes 0.5 clocks from Pentium and later.
>> It takes 0.333 clocks from Pentium III and later AND
>> it takes 0.25-0.333 (depending on how you schedule) clocks from Core
>> arch. and later.
>> So the maximum needed 15-byte alignment takes 5 clock maximum!!! "
>
> Except that nobody would use 15 NOPs in a row when there are many longer
> instructions that also have zero effects, like
>
> LEA EBX,[EBX+00000000]
>
> and which will execute faster than the same number of NOP bytes.
>

or one of the multi-byte NOP forms...

they can be used both as padding, and as a means of code annotation (where
the nop can point to a structure). granted, for many types of annotation, it
would be better to use tables, but then again, such a nop can point to such
a table, so things are good enough IMO...
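
for the padding case, a rough sketch in C of how an assembler might fill a gap with the multi-byte NOP forms. the byte sequences are the 1- to 9-byte forms recommended in Intel's manuals (0F 1F /0 with optional 66 prefix); the function name is made up, and it assumes a CPU that actually supports the 0F 1F form, which older ones do not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Recommended NOP encodings, indexed by (length - 1). */
static const uint8_t nop_forms[9][9] = {
    {0x90},
    {0x66, 0x90},
    {0x0F, 0x1F, 0x00},
    {0x0F, 0x1F, 0x40, 0x00},
    {0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

/* Fill len bytes at out with NOPs, taking the longest form that still
 * fits each time. Returns the number of NOP instructions emitted. */
static size_t pad_with_nops(uint8_t *out, size_t len)
{
    size_t count = 0;
    assert(out != NULL || len == 0);
    while (len > 0) {
        size_t chunk = len > 9 ? 9 : len;
        memcpy(out, nop_forms[chunk - 1], chunk);
        out += chunk;
        len -= chunk;
        count++;
    }
    return count;
}
```

so a worst-case 15-byte gap becomes two instructions (9 + 6 bytes) instead of fifteen single-byte NOPs.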