OT: first assembler instruction...

Rod Pemberton

Aug 24, 2009, 7:09:10 PM

My "assembler" - if you can call it that at this point - just compiled its
first x86 instruction:

.eax _push

It's in a reverse-order, RPN-like syntax due to current parsing
restrictions. In NASM, that's "push eax"... It's written in C, of
course... ;-)

Rod Pemberton

James Harris

Aug 24, 2009, 7:28:55 PM

On 25 Aug, 00:09, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:

> My "assembler" - if you can call it that at this point - just compiled it's
> first x86 instruction:
>
> .eax _push
>
> It's in an reverse order or RPN-ish like syntax due to current parsing
> restrictions.

A novel approach - and it may turn out to be a better one in some
circumstances. It will at least challenge the normal approach. :-)

James

Dave -Turner

Aug 27, 2009, 2:18:42 AM

"push eax" is easier to type, has been used more historically (so many
developers are used to that style of syntax), and is probably more logical
than ".eax _push" ... and on top of that you mention you HAVE to process it
that way at the moment because of - in your own words - "current parsing
restrictions", as opposed to any advantage offered to the developer.

Rod Pemberton

Aug 27, 2009, 3:59:28 AM

"Dave -Turner" <ad...@127.0.0.1> wrote in message
news:lNGdnRVjzaWouAvX...@westnet.com.au...

> "push eax" is easier to type, has been used more historically (so many
> developers are used to that style of syntax),
>

Yeah, I don't like the reverse order in terms of readability. I prefer
NASM's x86 syntax. But, I need my own assembler, of smaller size, in ANSI
C.

> and is probably more logical
> than ".eax _push"

In terms of syntax suitable for humans, yes, I wholeheartedly agree "push
eax" is "more logical". But, this all depends on how one got to the
logic...

Generating and processing assembly isn't limited to humans. And, smaller
application size is still important, which is where the parsing constraints
come from. Unless I can find a low cost way to parse "_push .eax" as easily
as I can parse ".eax _push", it's likely to remain. If you (or James or
BGB) didn't notice, it's a syntax-directed parser, directed by escape
characters. I'm not fully familiar with the CS terms for parsers, but I
think it's LL(0), basically... ? There is no lookahead or lookbehind.
However, the control stream of escape characters and data stream are
interwoven, so I'm not sure whether that complies with LL(0)'s description.
The object follows an escape so the parser doesn't need code to
determine what it is looking at. It "knows" from the escape what type of
object follows each escape. The '.' indicates a register follows, and the
'_' indicates an instruction follows. I have other escape characters to
indicate most of the major x86 assembly components, like constants, labels,
label-references, instruction overrides, etc. However, I've just gotten
started and I haven't coded the instructions which use these features yet.

> ... and on top of that you mention you HAVE to process it
> that way at the moment because of - in your own words - "current parsing
> restrictions", as opposed to any advantage offered to the developer.

The "developer" is another program and, to a far lesser extent, me. Believe
it or not, parsing "eax push" is much simpler than parsing "push eax" in
terms of coding needed. When you parse "eax push", you have everything
needed to encode the instruction by the time you parse the instruction
"push". You have saved the register and can take action on the "push" token
or string immediately. With this reverse order, only a *single* call to
parse the register is needed for the *entire* program. This reduces coding
quite a bit.
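
A minimal sketch of the escape-directed, reverse-order idea described above, not the actual assembler: the function names (lookup_reg, assemble) and the register table are hypothetical, and only '.' (register) and '_' (instruction, here just "push") are handled. It relies on the standard one-byte encoding of "push r32", which is 0x50 plus the register number.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical register table: the array index is the x86 register number. */
static const char *const reg_names[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};

/* Return the register number for name[0..len), or -1 if it is unknown. */
static int lookup_reg(const char *name, size_t len)
{
    int i;
    for (i = 0; i < 8; i++)
        if (strlen(reg_names[i]) == len && memcmp(reg_names[i], name, len) == 0)
            return i;
    return -1;
}

/*
 * Assemble escape-prefixed, reverse-order source into out (capacity cap).
 * The escape character alone decides what the following word is, so no
 * lookahead is needed. Returns the number of bytes emitted, or -1 on error.
 */
static int assemble(const char *src, unsigned char *out, size_t cap)
{
    int reg = -1;                 /* the single saved operand */
    size_t n = 0;

    assert(src != NULL && out != NULL);
    while (*src != '\0') {
        char esc = *src++;
        const char *word = src;
        size_t len;

        if (esc == ' ' || esc == '\t' || esc == '\n')
            continue;             /* whitespace only separates objects */
        while (*src != '\0' && *src != ' ' && *src != '\t' && *src != '\n')
            src++;
        len = (size_t)(src - word);

        if (esc == '.') {         /* '.' means a register follows */
            reg = lookup_reg(word, len);
            if (reg < 0)
                return -1;
        } else if (esc == '_') {  /* '_' means an instruction follows */
            if (len != 4 || memcmp(word, "push", 4) != 0 || reg < 0 || n >= cap)
                return -1;
            out[n++] = (unsigned char)(0x50 + reg);   /* push r32 */
            reg = -1;             /* operand consumed */
        } else {
            return -1;            /* unknown escape character */
        }
    }
    return (int)n;
}
```

Calling assemble(".eax _push", buf, sizeof buf) stores the single byte 0x50 in buf. The register is parsed in exactly one place, and the instruction handler only consumes what was already saved.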

With "push eax", you've got the instruction, but you can't attempt to emit
an instruction until after you've parsed the register, but you need to take
action on the push. If you do so, the routine for the push instruction must
"say": "I need a register before I can emit the correct push instruction.".
This means a function call to parse the register must be within the routine
for the push instruction. With many different instructions to assemble, you
end up with numerous function calls to parse the register throughout your
application. Your other option here is to delay taking action until an EOL
occurs whereby the entire line will have been processed and/or buffered.
Except for portable EOL detection which is usually hidden from you in C's
library, there is nothing wrong with that. That's the way, i.e.,
line-oriented, I write many programs in C. However, when you need a smaller
size application, line I/O is not as well suited as is simple on-the-fly
single-pass character-by-character I/O.

Rod Pemberton

Rui Maciel

Aug 27, 2009, 7:51:39 AM

Rod Pemberton wrote:

<snip/>

>> and is probably more logical
>> than ".eax _push"
>
> In terms of syntax suitable for humans, yes, I wholeheartedly agree "push
> eax" is "more logical". But, this all depends on how one got to the
> logic...
>
> Generating and processing assembly isn't limited to humans. And, smaller
> application size is still important, which is where the parsing
> constraints
> come from. Unless I can find a low cost way to parse "_push .eax" as
> easily
> as I can parse ".eax _push", it's likely to remain.

Is this at all relevant? A parser is usually divided into a lexer/tokenizer, a routine which generates tokens
from a given stream, and a parser, a routine that extracts meaning from a set of tokens. If your parser is
divided that way then I don't see how the order the tokens are generated can have any impact on the size of
the code generated for the lexer and parser.

Adding to that, I believe an {instruction} {parameter}* {end token} grammar will end up making your parser
easier to write and also to read, which means it's easier to expand.

<snip/>

>> ... and on top of that you mention you HAVE to process it
>> that way at the moment because of - in your own words - "current parsing
>> restrictions", as opposed to any advantage offered to the developer.
>
> The "developer" is another program and, to a far lesser extent, me.
> Believe it or not, parsing "eax push" is much simpler than parsing "push
> eax" in
> terms of coding needed. When you parse "eax push", you have everything
> needed to encode the instruction by the time you parse the instruction
> "push". You have saved the register and can take action on the "push"
> token
> or string immediately. With this reverse order, only a *single* call to
> parse the register is needed for the *entire* program. This reduces
> coding quite a bit.

As I see it, the only thing that this may accomplish is avoiding the need for an "end of instruction"
token, which, due to readability concerns, probably will already be a part of the grammar. Does this have
any meaningful impact?

> With "push eax", you've got the instruction, but you can't attempt to emit
> an instruction until after you've parsed the register, but you need to
> take
> action on the push. If you do so, the routine for the push instruction
> must "say": "I need a register before I can emit the correct push
> instruction.". This means a function call to parse the register must be
> within the routine
> for the push instruction. With many different instructions to assemble,
> you end up with numerous functions calls to parse the register throughout
> your
> application.

> Your other option here is to delay taking action until an
> EOL occurs whereby the entire line will have been processed and/or
> buffered.

I believe that either way you always end up having to parse the entire line before being able to generate
the instruction.

> Except for portable EOL detection which is usually hidden from
> you in C's
> library

I don't believe that is true. There may be differences in the way some OSs interpret the EOL token but
that's about it. For example, Unix-like OSs generally interpret the LF character as EOL, while Microsoft,
for whatever reason, chose the CR+LF character sequence for the Windows family of OSs. There are no C
libraries in play.

On the other hand, you don't really need to interpret those characters as your {end token} token. You may
ignore them as you please and then set some other character as that token. For example, you may adopt C's
approach and interpret the ';' character as your own {end token}, which also enables you to write multiple
instructions in a single line. That may be useful for those writing code.

> there is nothing wrong with that. That's the way, i.e.,
> line-oriented, I write many programs in C. However, when you need a
> smaller size application, line I/O is not as well suited as is simple
> on-the-fly single-pass character-by-character I/O.

Why? I don't see how relying on an {end token} forces the parser to be multipass or be any more convoluted.
In fact, I do believe it makes the parser easier to write.

Rui Maciel

BGB / cr88192

Aug 28, 2009, 11:09:42 AM

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message
news:h75eck$7p2$1...@aioe.org...

> "Dave -Turner" <ad...@127.0.0.1> wrote in message
> news:lNGdnRVjzaWouAvX...@westnet.com.au...
>> > My "assembler" - if you can call it that at this point - just compiled
> it's
>> > first x86 instruction:
>> >
>> > .eax _push
>> >
>> > It's in an reverse order or RPN-ish like syntax due to current parsing
>> > restrictions. In NASM, that's "push eax"... It's written in C, of
>> > course... ;-)
>> >
>>
>> "push eax" is easier to type, has been used more historically (so many
>> developers are used to that style of syntax),
>>
>
> Yeah, I don't like the reverse order in terms of readability. I prefer
> NASM's x86 syntax. But, I need my own assembler, of smaller size, in ANSI
> C.
>

yep.

I still stuck with NASM-style syntax though...
parsing is not "that" difficult, since it is mostly:
parse token, peek next token;
dispatch if token is a special character (used for special commands);
dispatch again if second token is special (':', 'equ', 'db', ...);
assume token is an opcode;
read arguments (this part ignores opcode);
call function to emit the opcode (figures out which opcode form to use,
...).

read an argument is like:
  read token;
  is the token a name?
    check if it is a register, otherwise assume symbol;
    return the register for reg, and the symbol for a name.
  is the token '['?
    read token;
    is token numeric?
      add to offset;
      read token;
      if token is '+', read token.
    is token a non-register symbol?
      use as the label;
      read token;
      if token is '+', read token.
    is token a register?
      use register as 'base';
      read token;
      if token is '+', read token.
    is token register?
      use register as index;
      read token;
      if token is '+', read token;
      if token is '*'?
        read token;
        use numeric value as 'scale';
        read token;
        if token is '+', read token;
    is token numeric?
      add to offset;
      read token;
    verify token is ']'.

now, granted, a "generic" syntax could also be used here (such as a full
expression parser), but this would require more complicated handling than
simply filling in a few fields.
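
As a rough sketch of the "fill in a few fields" approach for a bracketed memory operand, assuming a hypothetical struct mem_operand and a character-level scanner rather than a separate tokenizer. Unlike the fixed order outlined above, this version accepts the '+'-separated terms in any order, which is a simplification of mine.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result record: the "few fields" a memory operand fills in. */
struct mem_operand {
    char label[32];   /* symbol name, "" if none */
    int  base;        /* register number, -1 if none */
    int  index;       /* register number, -1 if none */
    int  scale;       /* 1, 2, 4 or 8 */
    long disp;        /* sum of all numeric terms */
};

static const char *const regs[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};

static int reg_number(const char *s, size_t len)
{
    int i;
    for (i = 0; i < 8; i++)
        if (strlen(regs[i]) == len && memcmp(regs[i], s, len) == 0)
            return i;
    return -1;
}

static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Parse "[term + term ...]" into *m. Returns 0 on success, -1 on error. */
static int parse_mem(const char *p, struct mem_operand *m)
{
    assert(p != NULL && m != NULL);
    m->label[0] = '\0';
    m->base = m->index = -1;
    m->scale = 1;
    m->disp = 0;

    p = skip_ws(p);
    if (*p++ != '[')
        return -1;
    for (;;) {
        p = skip_ws(p);
        if (isdigit((unsigned char)*p)) {            /* numeric: add to offset */
            char *end;
            m->disp += strtol(p, &end, 0);
            p = end;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;
            size_t len;
            int r;
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            len = (size_t)(p - start);
            r = reg_number(start, len);
            p = skip_ws(p);
            if (r < 0) {                             /* non-register name: label */
                if (m->label[0] != '\0' || len >= sizeof m->label)
                    return -1;
                memcpy(m->label, start, len);
                m->label[len] = '\0';
            } else if (*p == '*') {                  /* reg*scale: index register */
                char *end;
                long s;
                p = skip_ws(p + 1);
                if (m->index >= 0 || !isdigit((unsigned char)*p))
                    return -1;
                s = strtol(p, &end, 10);
                if (s != 1 && s != 2 && s != 4 && s != 8)
                    return -1;
                m->index = r;
                m->scale = (int)s;
                p = end;
            } else if (m->base < 0) {                /* first plain register: base */
                m->base = r;
            } else if (m->index < 0) {               /* second plain register: index */
                m->index = r;
            } else {
                return -1;
            }
        } else {
            return -1;
        }
        p = skip_ws(p);
        if (*p == '+') { p++; continue; }
        return (*p == ']') ? 0 : -1;                 /* verify closing bracket */
    }
}
```

For "[foo+ebx+esi*4+8]" this fills in label "foo", base 3 (ebx), index 6 (esi), scale 4 and displacement 8.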

read-arguments:
  read an argument;
  peek token;
  if token is ','?
    read token;
    continue.

>> and is probably more logical
>> than ".eax _push"
>
> In terms of syntax suitable for humans, yes, I wholeheartedly agree "push
> eax" is "more logical". But, this all depends on how one got to the
> logic...
>
> Generating and processing assembly isn't limited to humans. And, smaller
> application size is still important, which is where the parsing
> constraints
> come from. Unless I can find a low cost way to parse "_push .eax" as
> easily
> as I can parse ".eax _push", it's likely to remain. If you (or James or
> BGB) didn't notice, it's a syntax-directed parser, directed by escape
> characters. I'm not fully familiar with the CS terms for parsers, but I
> think it's LL(0), basically... ? There is no lookahead or lookbehind.
> However, the control stream of escape characters and data stream are
> interwoven, so I'm not sure it that complies with LL(0)'s description or
> not. The object follows an escape so the parser doesn't need code to
> determine what it is looking at. It "knows" from the escape what type of
> object follows each escape. The '.' indicates a register follows, and the
> '_' indicates an instruction follows. I have other escape characters to
> indicate most of the major x86 assembly components, like constants,
> labels,
> label-references, instruction overrides, etc. However, I've just gotten
> started and I haven't coded the instructions which use these features yet.
>

ok.

I would recommend against the single-function-per-opcode route, as this can
be a horrid amount of work, and does not scale well.

in my case, I had more-or-less transcribed much of the Intel doc's opcode
listings into a text file, and then wrote a tool to process said file
(generates some parts of the assembler machinery, produces internal tables,
...).

>> ... and on top of that you mention you HAVE to process it
>> that way at the moment because of - in your own words - "current parsing
>> restrictions", as opposed to any advantage offered to the developer.
>
> The "developer" is another program and, to a far lesser extent, me.
> Believe
> it or not, parsing "eax push" is much simpler than parsing "push eax" in
> terms of coding needed. When you parse "eax push", you have everything
> needed to encode the instruction by the time you parse the instruction
> "push". You have saved the register and can take action on the "push"
> token
> or string immediately. With this reverse order, only a *single* call to
> parse the register is needed for the *entire* program. This reduces
> coding
> quite a bit.
>

yep, RPN allows simpler handlers, since one does not actually need a full
parser.

read token;
handle token...

> With "push eax", you've got the instruction, but you can't attempt to emit
> an instruction until after you've parsed the register, but you need to
> take
> action on the push. If you do so, the routine for the push instruction
> must
> "say": "I need a register before I can emit the correct push
> instruction.".
> This means a function call to parse the register must be within the
> routine
> for the push instruction. With many different instructions to assemble,
> you
> end up with numerous functions calls to parse the register throughout your
> application. Your other option here is to delay taking action until an
> EOL
> occurs whereby the entire line will have been processed and/or buffered.
> Except for portable EOL detection which is usually hidden from you in C's
> library, there is nothing wrong with that. That's the way, i.e.,
> line-oriented, I write many programs in C. However, when you need a
> smaller
> size application, line I/O is not as well suited as is simple on-the-fly
> single-pass character-by-character I/O.
>

EOL is almost always one of:
CR
CR+LF
LF

the strategy is to convert all these into LF (AKA: '\n'), which is what is
done when reading a file in text mode.

as noted, I also allow ';' to be used to lump multiple opcodes per line
(where it is not allowed to follow whitespace, since if there is any
preceding whitespace, it is assumed to be a comment).

;comment
mov eax, ecx; xor eax, edx ;2 ops, then comment
mov eax, ecx ; xor eax, edx ;1 op, then comment

(granted, this initially caused a slight issue when later adding a
preprocessor, but oh well...).
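
The ';' rule above can be sketched roughly as follows. This is an illustration under my own assumptions, not the actual code: split_statements and is_blank are hypothetical names, the line is split in place, and a ';' at the very start of a line is treated as a comment, as in the first example.

```c
#include <assert.h>
#include <stddef.h>

static int is_blank(char c)
{
    return c == ' ' || c == '\t';
}

/*
 * Split one source line in place into statements.
 * A ';' directly after a non-blank character separates two statements;
 * a ';' at the start of the line or after a blank starts a comment.
 * Pointers to the trimmed statements are stored in stmts (at most max).
 * Returns the number of statements stored.
 */
static size_t split_statements(char *line, char **stmts, size_t max)
{
    char *start = line;
    char *p;
    size_t n = 0;

    assert(line != NULL && stmts != NULL);
    for (p = line; ; p++) {
        /* Classify the current character before the buffer is modified. */
        int at_end  = (*p == '\0' || *p == '\n');
        int comment = (*p == ';') && (p == line || is_blank(p[-1]));
        char *end = p;

        if (!at_end && *p != ';')
            continue;
        while (start < end && is_blank(*start))     /* trim leading blanks */
            start++;
        while (end > start && is_blank(end[-1]))    /* trim trailing blanks */
            end--;
        *end = '\0';                                /* terminate the statement */
        if (end > start && n < max)
            stmts[n++] = start;                     /* skip empty statements */
        if (at_end || comment)
            break;                                  /* rest of line is ignored */
        start = p + 1;
    }
    return n;
}
```

With the three example lines above, this yields 0, 2 and 1 statements respectively.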

OTOH, I also allow C style comments ('/*...*/' and '//...').

or such...

>
> Rod Pemberton
>
>

Rod Pemberton

Aug 29, 2009, 1:45:17 AM

"BGB / cr88192" <cr8...@hotmail.com> wrote in message
news:h78rvo$ghk$1...@news.albasani.net...
>

Good to hear from you.

I really just didn't know where to begin to respond to Rui's post. I'd
swear he read my post in reverse order but seeded his comments in forward
order (Sorry, Rui.). Maybe..., I'll give it another read in a week because
me posting circular explanations will cause confusion.

> EOL is almost always one of:
> CR
> CR+LF
> LF
>

I'm not sure how common CR only is. CR only is rumored to be used for the
older Motorola (pre-Intel) based Mac's. But, I've never noticed it being
used anywhere. The only platform I've used where I do recall a different
EOL was DEC VAX/VMS. But, I'm no longer sure if that was for a specific
text editor, or for the entire platform. If I knew what Stratus VOS used, I
don't recall anymore. It seems EBCDIC also has NL (for CR+LF) in addition
to CR, and LF. EBCDIC uses a different value for LF character than ASCII,
but the same for CR.

> the strategy is to convert all these into LF (AKA: '\n'), which is what is
> done when reading a file in text mode.

Well, the strategy for CR+LF and LF, is to ignore (or delete) the CR... I
mean, I wouldn't convert CR to LF. While that'd be fine for CR only or LF
only, you'd end up with LF+LF for CR+LF, i.e., double-spacing. If you
decided to remove the double-spacing after the fact, how would you
distinguish two valid LF's from an LF+LF conversion?

You really want a single LF for all three cases. When you add the CR case
to CR+LF and LF, you need a slightly different strategy.

Anyway, the strategy I use for all three is to ignore CR, use LF as the
native EOL, but add in one character of look-behind to detect the CR only
case so it can be converted to LF. You ignore CR just as before for the
other two cases, but since you're saving the current character (for
look-behind), when you read the next character from the stream you check 1)
this character for non-LF, and 2) the saved look-behind char for CR. If so,
you convert the CR to the LF _prior_ to processing current char. You can
get the actual EOL byte-sequence, e.g., for C's '\n' escape, by writing it
out as text and reading it in as binary to an "array".
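
A minimal sketch of that look-behind strategy, written as a buffer-to-buffer filter for illustration (normalize_eol is a hypothetical name, and a real single-pass assembler would more likely do this inside its character-reading routine). CR is never copied; a saved CR that is not followed by LF is turned into LF before the current character is processed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Normalize CR, CR+LF and LF line endings to a single LF each.
 * Reads len bytes from in and writes to out, which must have room for
 * len bytes (the output never grows). Returns the number of bytes written.
 */
static size_t normalize_eol(const char *in, size_t len, char *out)
{
    size_t i, n = 0;
    char prev = '\0';               /* one character of look-behind */

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        char c = in[i];

        if (prev == '\r' && c != '\n')
            out[n++] = '\n';        /* CR-only EOL: emit the LF it stood for */
        if (c != '\r')
            out[n++] = c;           /* CR itself is always dropped */
        prev = c;
    }
    if (prev == '\r')
        out[n++] = '\n';            /* CR-only EOL at the very end of input */
    return n;
}
```

Because CR+LF yields exactly one LF, two genuine consecutive line breaks stay distinguishable from a converted pair, which addresses the double-spacing concern raised above.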

Rod Pemberton

BGB / cr88192

Aug 29, 2009, 3:51:37 AM

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message

news:h7af8v$rbd$1...@aioe.org...

> "BGB / cr88192" <cr8...@hotmail.com> wrote in message
> news:h78rvo$ghk$1...@news.albasani.net...
>>
>
> Good to hear from you.
>

lot going on recently...

had taken a trip to Utah to see relatives, then some amount of chaos here,
then classes have started, ...

not done a whole lot of coding recently either as a result...

> I really just didn't know where to begin to respond to Rui's post. I'd
> swear he read my post in reverse order but seeded his comments in forward
> order (Sorry, Rui.). Maybe..., I'll give it another read in a week
> because
> me posting circular explanations will cause confusion.
>

ok.

>> EOL is almost always one of:
>> CR
>> CR+LF
>> LF
>>
>
> I'm not sure how common CR only is. CR only is rumored to be used for the
> older Motorola (pre-Intel) based Mac's. But, I've never noticed it being
> used anywhere. The only platform I've used where I do recall a different
> EOL was DEC VAX/VMS. But, I'm no longer sure if that was for a specific
> text editor, or for the entire platform. If I knew what Stratus VOS used,
> I
> don't recall anymore. It seems EBCDIC also has NL (for CR+LF) in addition
> to CR, and LF. EBCDIC uses a different value for LF character than ASCII,
> but the same for CR.
>

EBCDIC can probably be safely ignored I think...

yep, CR only was for Mac's, I am not sure if/when this changed, or what they
are using now (still CR only, or CR+LF, or LF).

however, if they changed, given what other stuff they are using, it probably
would have been to LF-only I think...

>> the strategy is to convert all these into LF (AKA: '\n'), which is what
>> is
>> done when reading a file in text mode.
>
> Well, the strategy for CR+LF and LF, is to ignore (or delete) the CR... I
> mean, I wouldn't convert CR to LF. While that'd be fine for CR only or LF
> only, you'd end up with LF+LF for CR+LF, i.e., double-spacing. If you
> decided to remove the double-spacing after the fact, how would you
> distinguish two valid LF's from an LF+LF conversion?
>
> You really want a single LF for all three cases. When you add the CR case
> to CR+LF and LF, you need a slightly different strategy.
>
> Anyway, the strategy I use for all three is to ignore CR, use LF as the
> native EOL, but add in one character of look-behind to detect the CR only
> case so it can be converted to LF. You ignore CR just as before for the
> other two cases, but since you're saving the current character (for
> look-behind), when you read the next character from the stream you check
> 1)
> this character for non-LF, and 2) the saved look-behind char for CR. If
> so,
> you convert the CR to the LF _prior_ to processing current char. You can
> get the actual EOL byte-sequence, e.g., for C's '\n' escape, by writing
> it
> out as text and reading it in as binary to an "array".
>

what I mean, is that code recognizes the CR-only and CR+LF cases, and
converts both to a single LF.

another commonly used strategy is to not bother, and instead regard all of
this as "generic whitespace" (finding the end of the line is then limited to
specific operations, such as handling '//', ...).

this is because I tend to use more-or-less line-independent parsers in many
cases (only that in some cases, I implement partial line-sensitivity, such
as using a line-break as a terminator for ASM opcodes, ...).

if I did not care about NASM compatibility, possibly I would go to using ';'
as the primary opcode delimiter, and drop using ';' for comments.

however, as is, I keep things compatible, only that ';' may also be used for
lumping, which would not typically work in NASM and friends (as well as
breaking a case which shouldn't really occur in practice).

in general, some things (such as my compiler), have been gradually
eliminating lumping due to the possibility that it "may" make sense to
support other assemblers (NASM and YASM).

but, alas, there are special cases which are likely to cause problems...

for example:
movzx and movsx have been partially split into 'movsxb/movsxw' and
'movzxb/movzxw', with 'movsx' and 'movzx' being equivalent to 'movsxb' and
'movzxb', mostly due to some internal representational issues within my
assembler (can't generally deal with opcodes with mixed argument sizes).

other minor issues are likely to pop up "here and there"...

for example, my assembler did not originally support "sized generic memory
operands" (of the type common in many x87 ops), and so resorted to size
qualifying the operations ('fld32', 'fstp64', ...).

which although no longer particularly an issue (and I have added the more
generic case), is still an issue that my compiler emits opcodes like this
(though likely only in cases where x87 is used...).

well, that and my compiler emitting lots of internal debugging comments
using C-style block-comments (not a big issue, could be disabled or filtered
easily enough...).

well, if interested, I could supply my opcode-listings table, which is in a
form for which it "should" be relatively easy to write a tool to translate it
into whatever form is needed.

the syntax is "fairly close" to that used in the Intel docs (only that it
includes placeholders for the various prefixes, such as
REX/AddrSize/OpSize/...), and is slightly less verbose in some cases.

the notation for VEX/XOP differs notably from the Intel and AMD forms (it is
a hassle, but is as I see it worthwhile, as the Intel/AMD notation is not
very "ideal" for listings, primarily due to its verbosity...).

granted, the notation I used here is not particularly "readable" but oh
well...

or such...

rand...@earthlink.net

Aug 29, 2009, 2:10:23 PM

On Aug 24, 4:09 pm, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> My "assembler" - if you can call it that at this point - just compiled it's
> first x86 instruction:
>
> .eax _push
>
> It's in an reverse order or RPN-ish like syntax due to current parsing
> restrictions. In NASM, that's "push eax"...

This is the way most FORTH assemblers worked, IIRC.

> It's written in C, of
> course... ;-)

You might want to take a look at the HLA "Back Engine" if you pursue
this development. HLABE (written in HLA, but obviously callable from C
as HLAPARSE.EXE uses it) is an object code formatter and branch
displacement optimizer. It takes raw binary object code plus
conditional jumps, unconditional jumps, and call instructions, and
creates an output object file formatted in COFF and ELF formats (Mach-
O is coming soon). Using HLABE to produce your object files will
automatically make your "assembler" work with Windows, Linux, Mac OS
X, and FreeBSD (with more coming soon).

HLABE does have one current limitation -- it currently supports an
"a.out" memory model (text, data, and bss sections). This is easy
enough to change, though.

As soon as I get Mach-O support added to HLABE, I'm going to put it up
as a separate project on SourceForge. In the meantime, it's part of
the HLA source code. http://webster.cs.ucr.edu
Cheers,
Randy Hyde

rand...@earthlink.net

Aug 29, 2009, 2:17:29 PM

On Aug 27, 12:59 am, "Rod Pemberton" <do_not_h...@nohavenot.cmm>
wrote:

>
> Yeah, I don't like the reverse order in terms of readability. I prefer
> NASM's x86 syntax. But, I need my own assembler, of smaller size, in ANSI
> C.

Why is size an issue? Given the memory found on modern machines, I
can't imagine that saving a few kilobytes would be a big issue these
days.

Use FLEX to generate a C-based lexical analyzer. Use Bison to parse
the code (though Bison is probably overkill for such a project). The
code will be much larger, but the source will be much smaller. And you
don't have to play around with limiting yourself to funny characters
to lead off your lexemes.

>
> With "push eax", you've got the instruction, but you can't attempt to emit
> an instruction until after you've parsed the register, but you need to take
> action on the push. If you do so, the routine for the push instruction must
> "say": "I need a register before I can emit the correct push instruction.".

I'm not sure how this issue is solved using "postfix" notation. All
you've done is change it to "I've got a register, now I need an
instruction before I can emit the correct instruction."

BTW, generating the code for "push eax" is ridiculously simple using
FLEX and Bison. Indeed, writing a bare-bones assembler (no macros,
conditional assembly, or other compile-time language features) would
be nearly trivial except for the fact that there are hundreds and
hundreds of machine instructions to encode. Any individual
instruction, however, is really simple to process using FLEX and
Bison.

Cheers,
Randy Hyde

BGB / cr88192

Aug 29, 2009, 3:41:10 PM

<rand...@earthlink.net> wrote in message
news:3e08caf8-1d6a-4962...@q40g2000prh.googlegroups.com...

On Aug 27, 12:59 am, "Rod Pemberton" <do_not_h...@nohavenot.cmm>
wrote:

>
> Yeah, I don't like the reverse order in terms of readability. I prefer
> NASM's x86 syntax. But, I need my own assembler, of smaller size, in ANSI
> C.

<--

Why is size an issue? Given the memory found on modern machines, I
can't imagine that saving a few kilobytes would be a big issue these
days.

Use FLEX to generate a C-based lexical analyzer. Use Bison to parse
the code (though Bison is probably overkill for such a project). The
code will be much larger, but the source will be much smaller. And you
don't have to play around with limiting yourself to funny characters
to lead off your lexemes.

-->

or, just write a plain tokenizer...

as for using funny characters, well, personally I don't see why it is
necessary, since fixed-form logic and a tokenizer are sufficient to parse
ASM.

>
> With "push eax", you've got the instruction, but you can't attempt to emit
> an instruction until after you've parsed the register, but you need to
> take
> action on the push. If you do so, the routine for the push instruction
> must
> "say": "I need a register before I can emit the correct push
> instruction.".

<--

I'm not sure how this issue is solve using "postfix" notation. All
you've done is change it to "I've got a register, now I need an
instruction before I can emit the correct instruction."

BTW, generating the code for "push eax" is ridiculously simple using
FLEX and Bison. Indeed, writing a bare-bones assembler (no macros,
conditional assembly, or other compile-time language features) would
be nearly trivial except for the fact that there are hundreds and
hundreds of machine instructions to encode. Any individual
instruction, however, is really simple to process using FLEX and
Bison.

-->

judging from my listings (every basic instruction form, including SIMD,
...), it is more like thousands.

granted, many of these could be regarded as redundant, since they would be
the same base opcode with different prefixes, or a different Mod/RM, ...

but, as noted, it is still a fairly close to 1:1 mapping with the Intel and
AMD docs, and the rough count would still seem to be in the thousands...

granted, it would likely be much smaller if one were using only a
486-friendly subset or similar, where I currently count around 773 for basic
opcodes (900 lines, but 773 after I removed any otherwise blank lines), and
about 986 if one includes x87 (1200, but then eliminated blank lines).

it is a good deal larger if SIMD is included, and larger still if one
includes AVX/...

in my case, I make the opcode listing primary, and generate relevant
portions of the assembler based on the listing.

BGB / cr88192

Aug 29, 2009, 3:59:23 PM

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:h7c08p$40m$1...@news.albasani.net...

>
> <rand...@earthlink.net> wrote in message
> news:3e08caf8-1d6a-4962...@q40g2000prh.googlegroups.com...
> On Aug 27, 12:59 am, "Rod Pemberton" <do_not_h...@nohavenot.cmm>
> wrote:
>

<snip>

>
> granted, it would likely be much smaller if one were using only a
> 486-friendly subset or similar, where I currently count around 773 for
> basic opcodes (900 lines, but 773 after I removed any otherwise blank
> lines), and about 986 if one includes x87 (1200, but then eliminated blank
> lines).
>
> it is a good deal larger if SIMD is included, and larger still if one
> includes AVX/...
>
>
> in my case, I make the opcode listing primary, and generate relevant
> portions of the assembler based on the listing.
>

below is the partial listing, but with all tabs replaced with "\t".
this is because the syntax is line-orientated and indentation based (and
because OE messes up tabs...).

notes:

basic listing format is:
opname
forms...
or:
opname form
form...

where form is:
encoding [args|"-" [flags]]

encoding is like this:
0-9 | A-F      hex values (literal byte values)
X              REX prefix
V              address prefix (0x67, 32/64 bit mode)
W              word-size prefix (0x66, 32/64 bit mode)
S              address prefix (16 bit mode / real-mode)
T              word-size prefix (16 bit mode)

i-x and H..L are used, but these are related to my AVX notation.

|r             fixed register (low 3 bits of opcode)
/r             Mod/RM with register
/#             (# is a digit 0-7), Mod/RM with # as the fixed reg field.

ib,iw,id,iq    immediates (byte, word, double, qword)
rb,rw,rd       displacements (relative offsets)
mw,md,mq       fixed memory addresses

is             (AVX) register-as-immediate
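
To illustrate how a tool might decode one encoding string of this notation, here is a rough sketch. The struct op_form, the PF_* and MODRM_* constants and parse_form are all hypothetical names, the AVX letters (i-x, H..L) are not handled, and only one trailing field such as ",ib" is kept; it is not the actual generator tool.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flags for the prefix placeholder letters X, V, W, S, T. */
enum { PF_REX = 1, PF_ADDR = 2, PF_OPSIZE = 4, PF_ADDR16 = 8, PF_OPSIZE16 = 16 };
enum { MODRM_NONE = -2, MODRM_REG = -1 };   /* 0..7 means a fixed /digit */

struct op_form {
    unsigned prefixes;        /* bitmask of PF_* placeholders */
    unsigned char bytes[4];   /* literal opcode bytes */
    int nbytes;
    int modrm;                /* MODRM_NONE, MODRM_REG (/r) or 0..7 (/#) */
    int plus_reg;             /* 1 if "|r": register in low 3 opcode bits */
    char operand[3];          /* trailing field such as "ib", "" if none */
};

static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode an encoding string such as "WX83/2,ib". Returns 0 or -1 on error. */
static int parse_form(const char *s, struct op_form *f)
{
    assert(s != NULL && f != NULL);
    memset(f, 0, sizeof *f);
    f->modrm = MODRM_NONE;

    while (*s != '\0') {
        if      (*s == 'X') { f->prefixes |= PF_REX;      s++; }
        else if (*s == 'V') { f->prefixes |= PF_ADDR;     s++; }
        else if (*s == 'W') { f->prefixes |= PF_OPSIZE;   s++; }
        else if (*s == 'S') { f->prefixes |= PF_ADDR16;   s++; }
        else if (*s == 'T') { f->prefixes |= PF_OPSIZE16; s++; }
        else if (hexval(s[0]) >= 0 && hexval(s[1]) >= 0) {
            if (f->nbytes >= 4)
                return -1;                  /* more literal bytes than we store */
            f->bytes[f->nbytes++] =
                (unsigned char)(hexval(s[0]) * 16 + hexval(s[1]));
            s += 2;
        }
        else if (s[0] == '/' && s[1] == 'r') { f->modrm = MODRM_REG; s += 2; }
        else if (s[0] == '/' && s[1] >= '0' && s[1] <= '7') {
            f->modrm = s[1] - '0';          /* fixed reg field */
            s += 2;
        }
        else if (s[0] == '|' && s[1] == 'r') { f->plus_reg = 1; s += 2; }
        else if (s[0] == ',' && s[1] != '\0' && s[2] != '\0'
                 && f->operand[0] == '\0') {
            f->operand[0] = s[1];           /* two-letter field, e.g. "ib" */
            f->operand[1] = s[2];
            s += 3;
        }
        else return -1;                     /* unsupported notation */
    }
    return f->nbytes > 0 ? 0 : -1;
}
```

For example, "WX83/2,ib" decodes to the W and X placeholders, the single opcode byte 0x83, a Mod/RM byte with fixed reg field 2, and a byte immediate.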

(grabbed from listing):
;args will be a comma separated list
;r8/16/32/64 will name a register
;rm8/16/32/64 will identify the value in the modrm fields
;i8/16/32/64 will define an immediate
;u8/16 unsigned 8/16 bit immediate
;al/ax/eax/rax operand is simply a form of ax
;ar8/ar16/ar32/ar64 relative address (jump, call, ...)
;sr segment register
;cr control register
;dr debug register
;mo16/32/64 fixed memory offset (constant address)
;m generic untyped memory
;fr float/MMX reg
;frm float/MMX reg/memory
;frm16/32/64/80 float reg/memory
;xr SSE XMM reg (untyped)
;xrm SSE XMM reg/memory (untyped)
;xr2 SSE XMM reg (secondary, rm)
;xrv SSE XMM reg (VEX/XOP vvvv)
;xri SSE XMM reg (VEX/XOP immed)
;yr SSE YMM reg (untyped)
;yrm SSE YMM reg/memory (untyped)
;yr2 SSE YMM reg (secondary, rm)
;yrv SSE YMM reg (VEX/XOP vvvv)
;yri SSE YMM reg (VEX/XOP immed)
;
;flag
;newly added, will be a short string indicating the modes for some ops
;will be a comma separated list
;long: long-mode only (unneeded if the op' wouldn't be encoded anyways)
;leg: legacy mode only (invalid in long mode)
;vex: opcode could be confused with a VEX or XOP prefix
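
(A minimal sketch, not the actual tool used here: one way a listing in this format could be read into a table. It assumes the literal "\t" sequences have been turned back into real tabs, that ";" starts a comment, and that a line beginning with a tab continues the previous opname. The names parse_listing and split_encoding are made up for illustration, and the AVX letters mentioned above are not handled.)

```python
import re

def parse_listing(text):
    """Return {opname: [(encoding, args, flags), ...]} for a tab-separated listing."""
    ops = {}
    current = None
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].rstrip()      # drop comments and trailing tabs
        if not line.strip():
            continue
        fields = [f for f in line.split("\t") if f]
        if not line.startswith("\t"):             # an unindented line starts a new opname
            current = fields.pop(0)
            ops.setdefault(current, [])
        if fields and current is not None:
            encoding = fields[0]
            args = fields[1].split(",") if len(fields) > 1 and fields[1] != "-" else []
            flags = fields[2].split(",") if len(fields) > 2 else []
            ops[current].append((encoding, args, flags))
    return ops

def split_encoding(encoding):
    """Split an encoding string into prefix letters, hex bytes, /r, /digit, |r and immediates."""
    return re.findall(r"[XVWST]|[0-9A-F]{2}|/[r0-7]|\|[ri]|[a-z]{2}", encoding)

# The sample uses literal backslash-t, as in the post, and converts it back to tabs.
sample = "aad\\t\\tD50A\n\\t\\tD5,ib\ndec\\t\\tW48|r\\t\\t\\tr16\\t\\tleg\n"
table = parse_listing(sample.replace("\\t", "\t"))
print(table["dec"])
print(split_encoding("WX83/2,ib"))
```

Keeping the parse this dumb is deliberate: everything interesting (prefix selection, Mod/RM generation) can then be driven from the split encoding tokens.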

---
aaa\t\t37
aad\t\tD50A
\t\tD5,ib
aam\t\tD40A
\t\tD4,ib
aas\t\t3F
adc\t\t14,ib\t\t\tal,i8
\t\tX80/2,ib\t\trm8,i8
\t\tWX83/2,ib\t\trm16,i8
\t\tTX83/2,ib\t\trm32,i8
\t\tX83/2,ib\t\trm64,i8
\t\tX12/r\t\t\tr8,rm8
\t\tX10/r\t\t\trm8,r8
\t\tW15,iw\t\tax,i16
\t\tWX81/2,iw\t\trm16,i16
\t\tWX13/r\t\tr16,rm16
\t\tWX11/r\t\trm16,r16
\t\tT15,id\t\teax,i32
\t\tTX13/r\t\tr32,rm32
\t\tTX81/2,id\t\trm32,i32
\t\tTX11/r\t\trm32,r32
\t\tX15,id\t\trax,i32
\t\tX13/r\t\t\tr64,rm64
\t\tX81/2,id\t\trm64,i32
\t\tX11/r\t\t\trm64,r64
add\t\t04,ib\t\t\tal,i8
\t\tX80/0,ib\t\trm8,i8
\t\tWX83/0,ib\t\trm16,i8
\t\tTX83/0,ib\t\trm32,i8
\t\tX83/0,ib\t\trm64,i8
\t\tX02/r\t\t\tr8,rm8
\t\tX00/r\t\t\trm8,r8
\t\tW05,iw\t\tax,i16
\t\tWX81/0,iw\t\trm16,i16
\t\tWX03/r\t\tr16,rm16
\t\tWX01/r\t\trm16,r16
\t\tT05,id\t\teax,i32
\t\tTX81/0,id\t\trm32,i32
\t\tTX01/r\t\trm32,r32
\t\tTX03/r\t\tr32,rm32
\t\tX05,id\t\trax,i32
\t\tX81/0,id\t\trm64,i32
\t\tX03/r\t\t\tr64,rm64
\t\tX01/r\t\t\trm64,r64
and\t\t24,ib\t\t\tal,i8
\t\tX80/4,ib\t\trm8,i8
\t\tWX83/4,ib\t\trm16,i8
\t\tTX83/4,ib\t\trm32,i8
\t\tX83/4,ib\t\trm64,i8
\t\tX20/r\t\t\trm8,r8
\t\tX22/r\t\t\tr8,rm8
\t\tWX25,iw\t\tax,i16
\t\tWX81/4,iw\t\trm16,i16
\t\tWX23/r\t\tr16,rm16
\t\tWX21/r\t\trm16,r16
\t\tT25,id\t\teax,i32
\t\tTX81/4,id\t\trm32,i32
\t\tTX23/r\t\tr32,rm32
\t\tTX21/r\t\trm32,r32
\t\tX25,id\t\trax,i32
\t\tX81/4,id\t\trm64,i32
\t\tX23/r\t\t\tr64,rm64
\t\tX21/r\t\t\trm64,r64
bsf\t\tWX0FBC/r\t\tr16,rm16
\t\tTX0FBC/r\t\tr32,rm32
\t\tX0FBC/r\t\tr64,rm64
bsr\t\tWX0FBD/r\t\tr16,rm16
\t\tTX0FBD/r\t\tr32,rm32
\t\tX0FBD/r\t\tr64,rm64
bswap\t\tX0FC8|r\t\tr32
\t\tX0FC8|r\t\tr64
bt\t\tWX0FA3/r\t\trm16,r16
\t\tTX0FA3/r\t\trm32,r32
\t\tX0FA3/r\t\trm64,r64
\t\tWX0FBA/4,ib\t\trm16,i8
\t\tTX0FBA/4,ib\t\trm32,i8
\t\tX0FBA/4,ib\t\trm64,i8
btc\t\tWX0FBB/r\t\trm16,r16
\t\tTX0FBB/r\t\trm32,r32
\t\tX0FBB/r\t\trm64,r64
\t\tWX0FBA/7,ib\t\trm16,i8
\t\tTX0FBA/7,ib\t\trm32,i8
\t\tX0FBA/7,ib\t\trm64,i8
btr\t\tWX0FB3/r\t\trm16,r16
\t\tTX0FB3/r\t\trm32,r32
\t\tX0FB3/r\t\trm64,r64
\t\tWX0FBA/6,ib\t\trm16,i8
\t\tTX0FBA/6,ib\t\trm32,i8
\t\tX0FBA/6,ib\t\trm64,i8
bts\t\tWX0FAB/r\t\trm16,r16
\t\tTX0FAB/r\t\trm32,r32
\t\tX0FAB/r\t\trm64,r64
\t\tWX0FBA/5,ib\t\trm16,i8
\t\tTX0FBA/5,ib\t\trm32,i8
\t\tX0FBA/5,ib\t\trm64,i8
call\t\tTE8,rd\t\tar32
\t\tWXFF/2\t\trm16
\t\tTXFF/2\t\trm32
\t\tXFF/2\t\t\trm64
call_w\t\tWE8,rw\t\tar16
cbw\t\tW98
cwde\t\tT98
cdqe\t\t4898
clc\t\tF8
cld\t\tFC
cli\t\tFA
clts\t\t0F06
cmc\t\tF5
cmp\t\t3C,ib\t\t\tal,i8
\t\tX80/7,ib\t\trm8,i8
\t\tWX83/7,ib\t\trm16,i8
\t\tTX83/7,ib\t\trm32,i8
\t\tX83/7,ib\t\trm64,i8
\t\tX3A/r\t\t\tr8,rm8
\t\tX38/r\t\t\trm8,r8
\t\tW3D,iw\t\tax,i16
\t\tWX81/7,iw\t\trm16,i16
\t\tWX3B/r\t\tr16,rm16
\t\tWX39/r\t\trm16,r16
\t\tT3D,id\t\teax,i32
\t\tTX81/7,id\t\trm32,i32
\t\tTX3B/r\t\tr32,rm32
\t\tTX39/r\t\trm32,r32
\t\tX3D,id\t\trax,i32
\t\tX81/7,id\t\trm64,i32
\t\tX3B/r\t\t\tr64,rm64
\t\tX39/r\t\t\trm64,r64
cmpsb\t\tA6
cmpsw\t\tWA7
cmpsd\t\tTA7
\t\tF2X0FC2/r,ib\txr,xrm,u8\t;theoretical form
cmpxchg\t\tX0FB0/r\t\trm8,r8
\t\tWX0FB1/r\t\trm16,r16
\t\tTX0FB1/r\t\trm32,r32
\t\tX0FB1/r\t\trm64,r64
cpuid\t\t0FA2
cwd\t\tW99
cdq\t\tT99
cqo\t\t4899
daa\t\t27
das\t\t2F
dec\t\tW48|r\t\t\tr16\t\tleg
\t\tT48|r\t\t\tr32\t\tleg
\t\tXFE/1\t\t\trm8
\t\tWXFF/1\t\trm16
\t\tTXFF/1\t\trm32
\t\tXFF/1\t\t\trm64
div\t\tXF6/6\t\t\trm8
\t\tWXF7/6\t\trm16
\t\tTXF7/6\t\trm32
\t\tXF7/6\t\t\trm64
emms\t\t0F77
enter\t\tC8,iw,ib\t\ti16,i8
hlt\t\tF4
idiv\t\tXF6/7\t\t\trm8
\t\tWXF7/7\t\trm16
\t\tTXF7/7\t\trm32
\t\tXF7/7\t\t\trm64
imul\t\tXF6/5\t\t\trm8
\t\tWXF7/5\t\trm16
\t\tTXF7/5\t\trm32
\t\tXF7/5\t\t\trm64
\t\tWX0FAF/r\t\tr16,rm16
\t\tTX0FAF/r\t\tr32,rm32
\t\tX0FAF/r\t\tr64,rm64
\t\tWX6B/r,ib\t\tr16,i8
\t\tTX6B/r,ib\t\tr32,i8
\t\tX6B/r,ib\t\tr64,i8
\t\tWX69/r,iw\t\tr16,i16
\t\tTX69/r,id\t\tr32,i32
\t\tX69/r,id\t\tr64,i32
\t\tWX6B/r,ib\t\tr16,rm16,i8
\t\tTX6B/r,ib\t\tr32,rm32,i8
\t\tX6B/r,ib\t\tr64,rm64,i8
\t\tWX69/r,iw\t\tr16,rm16,i16
\t\tTX69/r,id\t\tr32,rm32,i32
\t\tX69/r,id\t\tr64,rm64,i32
in\t\tE4,ib\t\t\tal,u8
\t\tWE5,ib\t\tax,u8
\t\tTE5,ib\t\teax,u8
\t\tXE5,ib\t\trax,u8
\t\tEC\t\t\tal,dx
\t\tWED\t\t\tax,dx
\t\tTED\t\t\teax,dx
\t\t48ED\t\t\trax,dx
inc\t\tW40|r\t\t\tr16\t\tleg
\t\tT40|r\t\t\tr32\t\tleg
\t\tXFE/0\t\t\trm8
\t\tWXFF/0\t\trm16
\t\tTXFF/0\t\trm32
\t\tXFF/0\t\t\trm64
insb\t\t6C
insw\t\tW6D
insd\t\tT6D
insq\t\t486D
int\t\tCC\t\t\t3
\t\tCD,ib\t\t\tu8
into\t\tCE
invd\t\t0F08
invlpg\t\t0F01/7\t\tm
iret\t\tCF
iretd\t\tCF
ja\t\t77,rb\t\t\tar8
\t\tW0F87,rw\t\tar16
\t\tT0F87,rd\t\tar32
jae\t\t73,rb\t\t\tar8
\t\tW0F83,rw\t\tar16
\t\tT0F83,rd\t\tar32
jb\t\t72,rb\t\t\tar8
\t\tW0F82,rw\t\tar16
\t\tT0F82,rd\t\tar32
jbe\t\t76,rb\t\t\tar8
\t\tW0F86,rw\t\tar16
\t\tT0F86,rd\t\tar32
jc\t\t72,rb\t\t\tar8
\t\tW0F82,rw\t\tar16
\t\tT0F82,rd\t\tar32
je\t\t74,rb\t\t\tar8
\t\tW0F84,rw\t\tar16
\t\tT0F84,rd\t\tar32
jg\t\t7F,rb\t\t\tar8
\t\tW0F8F,rw\t\tar16
\t\tT0F8F,rd\t\tar32
jge\t\t7D,rb\t\t\tar8
\t\tW0F8D,rw\t\tar16
\t\tT0F8D,rd\t\tar32
jl\t\t7C,rb\t\t\tar8
\t\tW0F8C,rw\t\tar16
\t\tT0F8C,rd\t\tar32
jle\t\t7E,rb\t\t\tar8
\t\tW0F8E,rw\t\tar16
\t\tT0F8E,rd\t\tar32
jna\t\t76,rb\t\t\tar8
\t\tW0F86,rw\t\tar16
\t\tT0F86,rd\t\tar32
jnae\t\t72,rb\t\t\tar8
\t\tW0F82,rw\t\tar16
\t\tT0F82,rd\t\tar32
jnb\t\t73,rb\t\t\tar8
\t\tW0F83,rw\t\tar16
\t\tT0F83,rd\t\tar32
jnbe\t\t77,rb\t\t\tar8
\t\tW0F87,rw\t\tar16
\t\tT0F87,rd\t\tar32
jnc\t\t73,rb\t\t\tar8
\t\tW0F83,rw\t\tar16
\t\tT0F83,rd\t\tar32
jne\t\t75,rb\t\t\tar8
\t\tW0F85,rw\t\tar16
\t\tT0F85,rd\t\tar32
jng\t\t7E,rb\t\t\tar8
\t\tW0F8E,rw\t\tar16
\t\tT0F8E,rd\t\tar32
jnge\t\t7C,rb\t\t\tar8
\t\tW0F8C,rw\t\tar16
\t\tT0F8C,rd\t\tar32
jnl\t\t7D,rb\t\t\tar8
\t\tW0F8D,rw\t\tar16
\t\tT0F8D,rd\t\tar32
jnle\t\t7F,rb\t\t\tar8
\t\tW0F8F,rw\t\tar16
\t\tT0F8F,rd\t\tar32
jno\t\t71,rb\t\t\tar8
\t\tW0F81,rw\t\tar16
\t\tT0F81,rd\t\tar32
jnp\t\t7B,rb\t\t\tar8
\t\tW0F8B,rw\t\tar16
\t\tT0F8B,rd\t\tar32
jns\t\t79,rb\t\t\tar8
\t\tW0F89,rw\t\tar16
\t\tT0F89,rd\t\tar32
jnz\t\t75,rb\t\t\tar8
\t\tW0F85,rw\t\tar16
\t\tT0F85,rd\t\tar32
jo\t\t70,rb\t\t\tar8
\t\tW0F80,rw\t\tar16
\t\tT0F80,rd\t\tar32
jp\t\t7A,rb\t\t\tar8
\t\tW0F8A,rw\t\tar16
\t\tT0F8A,rd\t\tar32
jpe\t\t7A,rb\t\t\tar8
\t\tW0F8A,rw\t\tar16
\t\tT0F8A,rd\t\tar32
jpo\t\t7B,rb\t\t\tar8
\t\tW0F8B,rw\t\tar16
\t\tT0F8B,rd\t\tar32
js\t\t78,rb\t\t\tar8
\t\tW0F88,rw\t\tar16
\t\tT0F88,rd\t\tar32
jz\t\t74,rb\t\t\tar8
\t\tW0F84,rw\t\tar16
\t\tT0F84,rd\t\tar32
jmp\t\tEB,rb\t\t\tar8
\t\tWE9,rw\t\tar16
\t\tTE9,rd\t\tar32
\t\tWXFF/4\t\trm16
\t\tTXFF/4\t\trm32
\t\tXFF/4\t\t\trm64
lahf\t\t9F
lar\t\tWX0F02/r\t\tr16,rm16
\t\tTX0F02/r\t\tr32,rm32
\t\tX0F02/r\t\tr64,rm64
lea\t\tWX8D/r\t\tr16,rm16
\t\tTX8D/r\t\tr32,rm32
\t\tX8D/r\t\t\tr64,rm64
leave\t\tC9
lgdt\t\t0F01/2\t\tm
lidt\t\t0F01/3\t\tm
lldt\t\tX0F00/2\t\trm16
lmsw\t\tX0F01/6\t\trm16
ltr\t\tX0F00/3\t\trm16
lodsb\t\tAC
lodsw\t\tWAD
lodsd\t\tTAD
lodsq\t\t48AD
loop\t\tE2,rb\t\t\tar8
loope\t\tE1,rb\t\t\tar8
loopz\t\tE1,rb\t\t\tar8
loopne\t\tE0,rb\t\t\tar8
loopnz\t\tE0,rb\t\t\tar8
lsl\t\tWX0F03/r\t\tr16,rm16
\t\tTX0F03/r\t\tr32,rm32
\t\tX0F03/r\t\tr64,rm64
mov\t\tVA0,mw\t\tal,mo16
\t\tVWA1,mw\t\tax,mo16
\t\tVTA1,mw\t\teax,mo16
\t\tVTXA1,mw\t\trax,mo16
\t\tVA2,mw\t\tmo16,al
\t\tVWA3,mw\t\tmo16,ax
\t\tVTA3,mw\t\tmo16,eax
\t\tVTXA3,mw\t\tmo16,rax
\t\tSA0,md\t\tal,mo32\tleg
\t\tSWA1,md\t\tax,mo32\tleg
\t\tSTA1,md\t\teax,mo32\tleg
\t\tSTXA1,md\t\trax,mo32\tleg
\t\tSA2,md\t\tmo32,al\tleg
\t\tSWA3,md\t\tmo32,ax\tleg
\t\tSTA3,md\t\tmo32,eax\tleg
\t\tSTXA3,md\t\tmo32,rax\tleg
\t\tX8A/r\t\t\tr8,rm8
\t\tWX8B/r\t\tr16,rm16
\t\tTX8B/r\t\tr32,rm32
\t\tX8B/r\t\t\tr64,rm64
\t\tX88/r\t\t\trm8,r8
\t\tWX89/r\t\trm16,r16
\t\tTX89/r\t\trm32,r32
\t\tX89/r\t\t\trm64,r64
\t\tXB0|r,ib\t\tr8,i8
\t\tWXB8|r,iw\t\tr16,i16
\t\tTXB8|r,id\t\tr32,i32
\t\tXC6/0,ib\t\trm8,i8
\t\tWXC7/0,iw\t\trm16,i16
\t\tTXC7/0,id\t\trm32,i32
\t\tXC7/0,id\t\trm64,i32
\t\tXB8|r,iq\t\tr64,i64
\t\tX8C/r\t\t\trm16,sr
\t\tX8E/r\t\t\tsr,rm16
\t\tX0F22/r\t\tcr,r32
\t\tX0F22/r\t\tcr,r64
\t\tX0F20/r\t\tr32,cr
\t\tX0F20/r\t\tr64,cr
\t\tX0F23/r\t\tdr,r32
\t\tX0F23/r\t\tdr,r64
\t\tX0F21/r\t\tr32,dr
\t\tX0F21/r\t\tr64,dr
movsb\t\tA4
movsw\t\tWA5
movsq\t\t48A5
movsx\t\tWX0FBE/r\t\tr16,rm8
\t\tTX0FBE/r\t\tr32,rm8
\t\tX0FBE/r\t\tr64,rm8
\t\tX0FBF/r\t\tr32,rm16
\t\tX0FBF/r\t\tr64,rm16
\t\tX63/r\t\t\tr64,rm32
movsxb\t\tWX0FBE/r\t\tr16,rm8
\t\tTX0FBE/r\t\tr32,rm8
\t\tX0FBE/r\t\tr64,rm8
movsxw\t\tX0FBF/r\t\tr32,rm16
\t\tX0FBF/r\t\tr64,rm16
movsxd\t\tX63/r\t\t\tr64,rm32
movzx\t\tWX0FB6/r\t\tr16,rm8
\t\tTX0FB6/r\t\tr32,rm8
\t\tX0FB6/r\t\tr64,rm8
\t\tX0FB7/r\t\tr32,rm16
\t\tX0FB7/r\t\tr64,rm16
movzxb\t\tWX0FB6/r\t\tr16,rm8
\t\tTX0FB6/r\t\tr32,rm8
\t\tX0FB6/r\t\tr64,rm8
movzxw\t\tX0FB7/r\t\tr32,rm16
\t\tX0FB7/r\t\tr64,rm16
mul\t\tXF6/4\t\t\trm8
\t\tWXF7/4\t\trm16
\t\tTXF7/4\t\trm32
\t\tXF7/4\t\t\trm64
neg\t\tXF6/3\t\t\trm8
\t\tWXF7/3\t\trm16
\t\tTXF7/3\t\trm32
\t\tXF7/3\t\t\trm64
nop\t\t90
not\t\tXF6/2\t\t\trm8
\t\tWXF7/2\t\trm16
\t\tTXF7/2\t\trm32
\t\tXF7/2\t\t\trm64
or\t\t0C,ib\t\t\tal,i8
\t\tX80/1,ib\t\trm8,i8
\t\tWX83/1,ib\t\trm16,i8
\t\tTX83/1,ib\t\trm32,i8
\t\tX83/1,ib\t\trm64,i8
\t\tX0A/r\t\t\tr8,rm8
\t\tX08/r\t\t\trm8,r8
\t\tWX0D,iw\t\tax,i16
\t\tWX81/1,iw\t\trm16,i16
\t\tWX0B/r\t\tr16,rm16
\t\tWX09/r\t\trm16,r16
\t\tT0D,id\t\teax,i32
\t\tTX81/1,id\t\trm32,i32
\t\tTX0B/r\t\tr32,rm32
\t\tTX09/r\t\trm32,r32
\t\t480D,id\t\trax,i32
\t\tX81/1,id\t\trm64,i32
\t\tX0B/r\t\t\tr64,rm64
\t\tX09/r\t\t\trm64,r64
out\t\tE6,ib\t\t\tu8,al
\t\tWE7,ib\t\tu8,ax
\t\tTE7,ib\t\tu8,eax
\t\tEE\t\t\tdx,al
\t\tWEF\t\t\tdx,ax
\t\tTEF\t\t\tdx,eax
pop\t\tWX58|r\t\tr16
\t\tTX58|r\t\tr32
\t\tX58|r\t\t\tr64
\t\tWX8F/0\t\trm16\t\tvex
\t\tTX8F/0\t\trm32\t\tvex
\t\tX8F/0\t\t\trm64\t\tvex
popa\t\t61
popad\t\t61
popf\t\t9D
popfd\t\t9D
popfq\t\t489D
push\t\tWX50|r\t\tr16
\t\tTX50|r\t\tr32
\t\tX50|r\t\t\tr64
\t\t6A,ib\t\t\ti8
\t\tW68,iw\t\ti16
\t\tT68,id\t\ti32
\t\tWXFF/6\t\trm16
\t\tTXFF/6\t\trm32
\t\tXFF/6\t\t\trm64
\t\t0E\t\t\tcs
\t\t16\t\t\tss
\t\t1E\t\t\tds
\t\t06\t\t\tes
\t\t0FA0\t\t\tfs
\t\t0FA8\t\t\tgs
push_cs\t0E
push_ss\t16
push_ds\t1E
push_es\t06
push_fs\t0FA0
push_gs\t0FA8
pusha\t\t60
pushaw\tW60
pushad\tT60
pushf\t\t9C
pushfw\tW9C
pushfd\tT9C
pushfq\t9C
rcl\t\tXD0/2\t\t\trm8
\t\tXD0/2\t\t\trm8,1
\t\tXD2/2\t\t\trm8,r8
\t\tXC0/2,ib\t\trm8,i8
\t\tWXD1/2\t\trm16
\t\tWXD1/2\t\trm16,1
\t\tWXD3/2\t\trm16,r8
\t\tWXC1/2,ib\t\trm16,i8
\t\tTXD1/2\t\trm32
\t\tTXD1/2\t\trm32,1
\t\tTXD3/2\t\trm32,r8
\t\tTXC1/2,ib\t\trm32,i8
\t\tXD1/2\t\t\trm64
\t\tXD1/2\t\t\trm64,1
\t\tXD3/2\t\t\trm64,r8
\t\tXC1/2,ib\t\trm64,i8
rcr\t\tXD0/3\t\t\trm8
\t\tXD0/3\t\t\trm8,1
\t\tXD2/3\t\t\trm8,r8
\t\tXC0/3,ib\t\trm8,i8
\t\tWXD1/3\t\trm16
\t\tWXD1/3\t\trm16,1
\t\tWXD3/3\t\trm16,r8
\t\tWXC1/3,ib\t\trm16,i8
\t\tTXD1/3\t\trm32
\t\tTXD1/3\t\trm32,1
\t\tTXD3/3\t\trm32,r8
\t\tTXC1/3,ib\t\trm32,i8
\t\tXD1/3\t\t\trm64
\t\tXD1/3\t\t\trm64,1
\t\tXD3/3\t\t\trm64,r8
\t\tXC1/3,ib\t\trm64,i8
rol\t\tXD0/0\t\t\trm8
\t\tXD0/0\t\t\trm8,1
\t\tXD2/0\t\t\trm8,r8
\t\tXC0/0,ib\t\trm8,i8
\t\tWXD1/0\t\trm16
\t\tWXD1/0\t\trm16,1
\t\tWXD3/0\t\trm16,r8
\t\tWXC1/0,ib\t\trm16,i8
\t\tTXD1/0\t\trm32
\t\tTXD1/0\t\trm32,1
\t\tTXD3/0\t\trm32,r8
\t\tTXC1/0,ib\t\trm32,i8
\t\tXD1/0\t\t\trm64
\t\tXD1/0\t\t\trm64,1
\t\tXD3/0\t\t\trm64,r8
\t\tXC1/0,ib\t\trm64,i8
ror\t\tXD0/1\t\t\trm8
\t\tXD0/1\t\t\trm8,1
\t\tXD2/1\t\t\trm8,r8
\t\tXC0/1,ib\t\trm8,i8
\t\tWXD1/1\t\trm16
\t\tWXD1/1\t\trm16,1
\t\tWXD3/1\t\trm16,r8
\t\tWXC1/1,ib\t\trm16,i8
\t\tTXD1/1\t\trm32
\t\tTXD1/1\t\trm32,1
\t\tTXD3/1\t\trm32,r8
\t\tTXC1/1,ib\t\trm32,i8
\t\tXD1/1\t\t\trm64
\t\tXD1/1\t\t\trm64,1
\t\tXD3/1\t\t\trm64,r8
\t\tXC1/1,ib\t\trm64,i8
rdtsc\t\t0F31
ret\t\tC3
\t\tC2,iw\t\t\ti16
retf\t\tCB
\t\tCA,iw\t\t\ti16
sahf\t\t9E
sal\t\tXD0/4\t\t\trm8
\t\tXD0/4\t\t\trm8,1
\t\tXD2/4\t\t\trm8,r8
\t\tXC0/4,ib\t\trm8,i8
\t\tWXD1/4\t\trm16
\t\tWXD1/4\t\trm16,1
\t\tWXD3/4\t\trm16,r8
\t\tWXC1/4,ib\t\trm16,i8
\t\tTXD1/4\t\trm32
\t\tTXD1/4\t\trm32,1
\t\tTXD3/4\t\trm32,r8
\t\tTXC1/4,ib\t\trm32,i8
\t\tXD1/4\t\t\trm64
\t\tXD1/4\t\t\trm64,1
\t\tXD3/4\t\t\trm64,r8
\t\tXC1/4,ib\t\trm64,i8
sar\t\tXD0/7\t\t\trm8
\t\tXD0/7\t\t\trm8,1
\t\tXD2/7\t\t\trm8,r8
\t\tXC0/7,ib\t\trm8,i8
\t\tWXD1/7\t\trm16
\t\tWXD1/7\t\trm16,1
\t\tWXD3/7\t\trm16,r8
\t\tWXC1/7,ib\t\trm16,i8
\t\tTXD1/7\t\trm32
\t\tTXD1/7\t\trm32,1
\t\tTXD3/7\t\trm32,r8
\t\tTXC1/7,ib\t\trm32,i8
\t\tXD1/7\t\t\trm64
\t\tXD1/7\t\t\trm64,1
\t\tXD3/7\t\t\trm64,r8
\t\tXC1/7,ib\t\trm64,i8
sbb\t\t1C,ib\t\t\tal,i8
\t\tX80/3,ib\t\trm8,i8
\t\tWX83/3,ib\t\trm16,i8
\t\tTX83/3,ib\t\trm32,i8
\t\tX83/3,ib\t\trm64,i8
\t\tX1A/r\t\t\tr8,rm8
\t\tX18/r\t\t\trm8,r8
\t\tWX1D,iw\t\tax,i16
\t\tWX81/3,iw\t\trm16,i16
\t\tWX1B/r\t\tr16,rm16
\t\tWX19/r\t\trm16,r16
\t\tT1D,id\t\teax,i32
\t\tTX81/3,id\t\trm32,i32
\t\tTX1B/r\t\tr32,rm32
\t\tTX19/r\t\trm32,r32
\t\t481D,id\t\trax,i32
\t\tX81/3,id\t\trm64,i32
\t\tX1B/r\t\t\tr64,rm64
\t\tX19/r\t\t\trm64,r64
scas\t\tAE\t\t\trm8
\t\tWAF\t\t\trm16
\t\tTAF\t\t\trm32
\t\t48AF\t\t\trm64
scasb\t\tAE
scasw\t\tWAF
scasd\t\tTAF
scasq\t\t48AF
seta\t\tX0F97/0\t\trm8
setae\t\tX0F93/0\t\trm8
setb\t\tX0F92/0\t\trm8
setbe\t\tX0F96/0\t\trm8
setc\t\tX0F92/0\t\trm8
sete\t\tX0F94/0\t\trm8
setg\t\tX0F9F/0\t\trm8
setge\t\tX0F9D/0\t\trm8
setl\t\tX0F9C/0\t\trm8
setle\t\tX0F9E/0\t\trm8
setna\t\tX0F96/0\t\trm8
setnae\tX0F92/0\t\trm8
setnb\t\tX0F93/0\t\trm8
setnbe\tX0F97/0\t\trm8
setnc\t\tX0F93/0\t\trm8
setne\t\tX0F95/0\t\trm8
setng\t\tX0F9E/0\t\trm8
setnge\tX0F9C/0\t\trm8
setnl\t\tX0F9D/0\t\trm8
setnle\tX0F9F/0\t\trm8
setno\t\tX0F91/0\t\trm8
setnp\t\tX0F9B/0\t\trm8
setns\t\tX0F99/0\t\trm8
setnz\t\tX0F95/0\t\trm8
seto\t\tX0F90/0\t\trm8
setp\t\tX0F9A/0\t\trm8
setpe\t\tX0F9A/0\t\trm8
setpo\t\tX0F9B/0\t\trm8
sets\t\tX0F98/0\t\trm8
setz\t\tX0F94/0\t\trm8
sfence\t0FAE/7
sgdt\t\t0F01/0\t\tm
shl\t\tXD0/4\t\t\trm8
\t\tXD0/4\t\t\trm8,1
\t\tXD2/4\t\t\trm8,r8
\t\tXC0/4,ib\t\trm8,i8
\t\tWXD1/4\t\trm16
\t\tWXD1/4\t\trm16,1
\t\tWXD3/4\t\trm16,r8
\t\tWXC1/4,ib\t\trm16,i8
\t\tTXD1/4\t\trm32
\t\tTXD1/4\t\trm32,1
\t\tTXD3/4\t\trm32,r8
\t\tTXC1/4,ib\t\trm32,i8
\t\tXD1/4\t\t\trm64
\t\tXD1/4\t\t\trm64,1
\t\tXD3/4\t\t\trm64,r8
\t\tXC1/4,ib\t\trm64,i8
shld\t\tWX0FA5/r,ib\t\trm16,r16
\t\tTX0FA5/r,ib\t\trm32,r32
\t\tX0FA5/r,ib\t\trm64,r64
\t\tWX0FA5/r,ib\t\trm16,r16,1
\t\tTX0FA5/r,ib\t\trm32,r32,1
\t\tX0FA5/r,ib\t\trm64,r64,1
\t\tWX0FA4/r,ib\t\trm16,r16,i8
\t\tWX0FA5/r\t\trm16,r16,cl
\t\tTX0FA4/r,ib\t\trm32,r32,i8
\t\tTX0FA5/r\t\trm32,r32,cl
\t\tX0FA4/r,ib\t\trm64,r64,i8
\t\tX0FA5/r\t\trm64,r64,cl
shr\t\tXD0/5\t\t\trm8
\t\tXD0/5\t\t\trm8,1
\t\tXD2/5\t\t\trm8,r8
\t\tXC0/5,ib\t\trm8,i8
\t\tWXD1/5\t\trm16
\t\tWXD1/5\t\trm16,1
\t\tWXD3/5\t\trm16,r8
\t\tWXC1/5,ib\t\trm16,i8
\t\tTXD1/5\t\trm32
\t\tTXD1/5\t\trm32,1
\t\tTXD3/5\t\trm32,r8
\t\tTXC1/5,ib\t\trm32,i8
\t\tXD1/5\t\t\trm64
\t\tXD1/5\t\t\trm64,1
\t\tXD3/5\t\t\trm64,r8
\t\tXC1/5,ib\t\trm64,i8
shrd\t\tWX0FAD/r,ib\t\trm16,r16
\t\tTX0FAD/r,ib\t\trm32,r32
\t\tX0FAD/r,ib\t\trm64,r64
\t\tWX0FAD/r,ib\t\trm16,r16,1
\t\tTX0FAD/r,ib\t\trm32,r32,1
\t\tX0FAD/r,ib\t\trm64,r64,1
\t\tWX0FAC/r,ib\t\trm16,r16,i8
\t\tWX0FAD/r\t\trm16,r16,cl
\t\tTX0FAC/r,ib\t\trm32,r32,i8
\t\tTX0FAD/r\t\trm32,r32,cl
\t\tX0FAC/r,ib\t\trm64,r64,i8
\t\tX0FAD/r\t\trm64,r64,cl
sidt\t\t0F01/1\t\tm
sldt\t\t0F00/0\t\trm16
\t\t480F00/0\t\trm64
smsw\t\t0F01/4\t\trm16
\t\t0F01/4\t\trm32
\t\t480F01/4\t\trm64
stc\t\tF9
std\t\tFD
sti\t\tFB
stosb\t\tAA
stosw\t\tWAB
stosd\t\tTAB
str\t\t0F00/1\t\trm16
sub\t\t2C,ib\t\t\tal,i8
\t\tX80/5,ib\t\trm8,i8
\t\tWX83/5,ib\t\trm16,i8
\t\tTX83/5,ib\t\trm32,i8
\t\tX83/5,ib\t\trm64,i8
\t\tX2A/r\t\t\tr8,rm8
\t\tX28/r\t\t\trm8,r8
\t\tW2D,iw\t\tax,i16
\t\tWX81/5,iw\t\trm16,i16
\t\tWX2B/r\t\tr16,rm16
\t\tWX29/r\t\trm16,r16
\t\tT2D,id\t\teax,i32
\t\tTX81/5,id\t\trm32,i32
\t\tTX2B/r\t\tr32,rm32
\t\tTX29/r\t\trm32,r32
\t\t482D,id\t\trax,i32
\t\tX81/5,id\t\trm64,i32
\t\tX2B/r\t\t\tr64,rm64
\t\tX29/r\t\t\trm64,r64
test\t\tA8,ib\t\t\tal,i8
\t\tWA9,iw\t\tax,i16
\t\tTA9,id\t\teax,i32
\t\t48A9,id\t\trax,i32
\t\tXF6/0,ib\t\trm8,i8
\t\tWXF7/0,iw\t\trm16,i16
\t\tTXF7/0,id\t\trm32,i32
\t\tXF7/0,id\t\trm64,i32
\t\tX84/r\t\t\trm8,r8
\t\tWX85/r\t\trm16,r16
\t\tTX85/r\t\trm32,r32
\t\tX85/r\t\t\trm64,r64
ud2\t\t0F0B
wait\t\t9B
fwait\t\t9B
wbinvd\t\t0F09
wrmsr\t\t0F30
xadd\t\tX0FC0/r\t\trm8,r8
\t\tWX0FC1/r\t\trm16,r16
\t\tTX0FC1/r\t\trm32,r32
\t\tX0FC1/r\t\trm64,r64
xchg\t\tWX87C0\t\tax,ax\t\tlong
\t\tTX87C0\t\teax,eax\tlong
\t\tWX90|r\t\tax,r16
\t\tWX90|r\t\tr16,ax
\t\tTX90|r\t\teax,r32
\t\tTX90|r\t\tr32,eax
\t\tX90|r\t\t\trax,r64
\t\tX90|r\t\t\tr64,rax
\t\tX86/r\t\t\trm8,r8
\t\tX86/r\t\t\tr8,rm8
\t\tWX87/r\t\trm16,r16
\t\tWX87/r\t\tr16,rm16
\t\tTX87/r\t\trm32,r32
\t\tTX87/r\t\tr32,rm32
\t\tX87/r\t\t\trm64,r64
\t\tX87/r\t\t\tr64,rm64
xor\t\t34,ib\t\t\tal,i8
\t\tX80/6,ib\t\trm8,i8
\t\tWX83/6,ib\t\trm16,i8
\t\tTX83/6,ib\t\trm32,i8
\t\tX83/6,ib\t\trm64,i8
\t\tX32/r\t\t\tr8,rm8
\t\tX30/r\t\t\trm8,r8
\t\tW35,iw\t\tax,i16
\t\tWX81/6,iw\t\trm16,i16
\t\tWX33/r\t\tr16,rm16
\t\tWX31/r\t\trm16,r16
\t\tT35,id\t\teax,i32
\t\tTX81/6,id\t\trm32,i32
\t\tTX33/r\t\tr32,rm32
\t\tTX31/r\t\trm32,r32
\t\t4835,id\t\trax,i32
\t\tX81/6,id\t\trm64,i32
\t\tX33/r\t\t\tr64,rm64
\t\tX31/r\t\t\trm64,r64

f2xm1\t\tD9F0
fabs\t\tD9E1
fadd\t\tD8C0|i\t\tst0,fr
\t\tDCC0|i\t\tfr,st0
\t\tD8/0\t\t\tfrm32
\t\tDC/0\t\t\tfrm64
fadd32\tD8/0\t\t\tm
fadd64\tDC/0\t\t\tm
faddp\t\tDEC1
\t\tDEC0|i\t\tfr,st0
fbld\t\tDF/4\t\t\tm
fbstp\t\tDF/6\t\t\tm
fchs\t\tD9E0
fclex\t\t9BDBE2
fnclex\tDBE2
fcmovb\tDAC0|i\t\tst0,fr
fcmove\tDAC8|i\t\tst0,fr
fcmovbe\tDAD0|i\t\tst0,fr
fcmovu\tDAD8|i\t\tst0,fr
fcmovnb\tDBC0|i\t\tst0,fr
fcmovne\tDBC8|i\t\tst0,fr
fcmovnbe\tDBD0|i\t\tst0,fr
fcmovnu\tDBD8|i\t\tst0,fr
fcom\t\tD8D1
\t\tD8D0|i\t\tst0,fr
\t\tD8/2\t\t\tfrm32
\t\tDC/2\t\t\tfrm64
fcom32\tD8/2\t\t\tm
fcom64\tDC/2\t\t\tm
fcomp\t\tD8D9
\t\tD8D8|i\t\tst0,fr
\t\tD8/3\t\t\tfrm32
\t\tDC/3\t\t\tfrm64
fcomp32\tD8/3\t\t\tm
fcomp64\tDC/3\t\t\tm
fcompp\tDED9
fcomi\t\tDBF1
\t\tDBF0|i\t\tst0,fr
fcomip\tDFF1
\t\tDFF0|i\t\tst0,fr
fucomi\tDBE9
\t\tDBE8|i\t\tst0,fr
fucomip\tDFE9
\t\tDFE8|i\t\tst0,fr
fcos\t\tD9FF
fdecstp\tD9F6
fdiv\t\tD8F0|i\t\tst0,fr
\t\tDCF8|i\t\tfr,st0
\t\tD8/6\t\t\tfrm32
\t\tDC/6\t\t\tfrm64
fdiv32\tD8/6\t\t\tm
fdiv64\tDC/6\t\t\tm
fdivp\t\tDEF9
\t\tDEF8|i\t\tfr,st0
fdivr\t\tD8F8|i\t\tst0,fr
\t\tDCF0|i\t\tfr,st0
\t\tD8/7\t\t\tfrm32
\t\tDC/7\t\t\tfrm64
fdivr32\tD8/7\t\t\tm
fdivr64\tDC/7\t\t\tm
fdivrp\tDEF1
\t\tDEF0|i\t\tfr,st0
fdup\t\tD9C0
ffree\t\tDDC0|i\t\tfr
fiadd\t\tDA/0\t\t\trm32
\t\tDE/0\t\t\trm16
fiadd32\tDA/0\t\t\tm
fiadd16\tDE/0\t\t\tm
fidiv\t\tDA/6\t\t\trm32
\t\tDE/6\t\t\trm16
fidiv32\tDA/6\t\t\tm
fidiv16\tDE/6\t\t\tm
fidivr\tDA/7\t\t\trm32
\t\tDE/7\t\t\trm16
fidivr16\tDE/7\t\t\tm
fidivr32\tDA/7\t\t\tm
ficom\t\tDA/2\t\t\tfrm32
\t\tDE/2\t\t\tfrm16
ficom16\tDE/2\t\t\tm
ficom32\tDA/2\t\t\tm
ficomp\tDA/3\t\t\tfrm32
\t\tDE/3\t\t\tfrm16
ficomp16\tDE/3\t\t\tm
ficomp32\tDA/3\t\t\tm
fild\t\tDF/0\t\t\trm16
\t\tDB/0\t\t\trm32
\t\tDF/5\t\t\trm64
fild16\tDF/0\t\t\tm
fild32\tDB/0\t\t\tm
fild64\tDF/5\t\t\tm
fimul\t\tDA/1\t\t\tfrm32
\t\tDE/1\t\t\tfrm16
fimul32\tDA/1\t\t\tm
fimul16\tDE/1\t\t\tm
fincstp\tD9F7
finit\t\t9BDBE3
fninit\tDBE3
fist\t\tDF/2\t\t\trm16
\t\tDB/2\t\t\trm32
fist16\tDF/2\t\t\tm
fist32\tDB/2\t\t\tm
fistp\t\tDF/3\t\t\trm16
\t\tDB/3\t\t\trm32
\t\tDF/7\t\t\trm64
fistp16\tDF/3\t\t\tm
fistp32\tDB/3\t\t\tm
fistp64\tDF/7\t\t\tm
fisttp\tDF/1\t\t\trm16
\t\tDB/1\t\t\trm32
\t\tDD/1\t\t\trm64
fisttp16\tDF/1\t\t\tm
fisttp32\tDB/1\t\t\tm
fisttp64\tDD/1\t\t\tm
fisub\t\tDA/4\t\t\trm32
\t\tDE/4\t\t\trm16
fisub32\tDA/4\t\t\tm
fisub16\tDE/4\t\t\tm
fisubr\tDA/5\t\t\trm32
\t\tDE/5\t\t\trm16
fisubr32\tDA/5\t\t\tm
fisubr16\tDE/5\t\t\tm
fld\t\tD9/0\t\t\tfrm32
\t\tDD/0\t\t\tfrm64
\t\tDB/5\t\t\tfrm80
\t\tD9C0|i\t\tfr
fld32\t\tD9/0\t\t\tm
fld64\t\tDD/0\t\t\tm
fld80\t\tDB/5\t\t\tm
fld1\t\tD9E8
fldl2t\tD9E9
fldl2e\tD9EA
fldpi\t\tD9EB
fldlg2\tD9EC
fldln2\tD9ED
fldz\t\tD9EE
fldcw\t\tD9/5\t\t\tm
fldenv\tD9/4\t\t\tm
fmul\t\tD8C8|i\t\tst0,fr
\t\tDCC8|i\t\tfr,st0
\t\tD8/1\t\t\tfrm32
\t\tDC/1\t\t\tfrm64
fmul32\tD8/1\t\t\tm
fmul64\tDC/1\t\t\tm
fmulp\t\tDEC9
\t\tDEC8|i\t\tfr,st0
fnop\t\tD9D0
fpatan\tD9F3
fprem\t\tD9F8
fprem1\tD9F5
fptan\t\tD9F2
frndint\tD9FC
frstor\tDD/4\t\t\tm
fsave\t\t9BDD/6\t\tm
fnsave\tDD/6\t\t\tm
fscale\tD9FD
fsin\t\tD9FE
fsincos\tD9FB
fsqrt\t\tD9FA
fst\t\tDDD0|i\t\tfr
\t\tD9/2\t\t\tfrm32
\t\tDD/2\t\t\tfrm64
fst32\t\tD9/2\t\t\tm
fst64\t\tDD/2\t\t\tm
fstp\t\tDDD8|i\t\tfr
\t\tD9/3\t\t\tfrm32
\t\tDD/3\t\t\tfrm64
\t\tDB/7\t\t\tfrm80
fstp32\tD9/3\t\t\tm
fstp64\tDD/3\t\t\tm
fstp80\tDB/7\t\t\tm
fstcw\t\t9BD9/7\t\tm
fnstcw\tD9/7\t\t\tm
fstenv\t9BD9/6\t\tm
fnstenv\tD9/6\t\t\tm
fstsw\t\t9BDFE0\t\tax
\t\t9BDD/7\t\tm
fnstsw\tDFE0\t\t\tax
\t\tDD/7\t\t\tm
fsub\t\tD8E0|i\t\tst0,fr
\t\tDCE8|i\t\tfr,st0
\t\tD8/4\t\t\tfrm32
\t\tDC/4\t\t\tfrm64
fsub32\tD8/4\t\t\tm
fsub64\tDC/4\t\t\tm
fsubp\t\tDEE9
\t\tDEE8|i\t\tfr,st0
fsubr\t\tD8E8|i\t\tst0,fr
\t\tDCE0|i\t\tfr,st0
\t\tD8/5\t\t\tfrm32
\t\tDC/5\t\t\tfrm64
fsubr32\tD8/5\t\t\tm
fsubr64\tDC/5\t\t\tm
fsubrp\tDEE1
\t\tDEE0|i\t\tfr,st0
ftst\t\tD9E4
fucom\t\tDDE1
\t\tDDE0|i\t\tfr
fucomp\tDDE9
\t\tDDE8|i\t\tfr
fucompp\tDAE9
fxam\t\tD9E5
fxch\t\tD9C9
\t\tD9C8|i\t\tfr
fxrstor\t0FAE/1\t\tm
fxsave\t0FAE/0\t\tm
fxtract\tD9F4
fyl2x\t\tD9F1
fyl2xp1\tD9F9

Maxim S. Shatskih

Aug 29, 2009, 4:36:39 PM

>Use FLEX to generate a C-based lexical analyzer. Use Bison to parse
>the code (though Bison is probably overkill for such a project).

Good advice.

Or, for a novice who wants to play with parsers: code your own state-machine-based lexer and then an LL(1) parser. It's not that complex.

It is always good to start with _proper_ approaches instead of quickly made heuristics. It is an especially good idea support-wise.

A serious parser has no such limits and, with it, "push eax" and "eax push" are the same :-) With a primitive, improperly made parser, future extensibility is trouble.
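
(A minimal sketch of the idea, under stated assumptions: a two-state hand-written lexer feeding a parse step that does not care about operand order. It is not a full LL(1) parser, it only knows single-register push/pop in 32-bit mode, and the names lex and assemble are hypothetical.)

```python
REGISTERS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
             "esp": 4, "ebp": 5, "esi": 6, "edi": 7}
MNEMONICS = {"push": 0x50, "pop": 0x58}   # base opcodes, register number added in

def lex(line):
    """Two-state machine: IDLE between words, WORD while inside one."""
    tokens, word, state = [], "", "IDLE"
    for ch in line + " ":                 # trailing space flushes the last word
        if state == "IDLE":
            if ch.isalnum():
                word, state = ch, "WORD"
        elif ch.isalnum():
            word += ch
        else:
            tokens.append(word.lower())
            state = "IDLE"
    return tokens

def assemble(line):
    """Accept one mnemonic and one register, in either order."""
    tokens = lex(line)
    ops = [t for t in tokens if t in MNEMONICS]
    regs = [t for t in tokens if t in REGISTERS]
    if len(tokens) != 2 or len(ops) != 1 or len(regs) != 1:
        raise ValueError("expected one mnemonic and one register: %r" % line)
    return bytes([MNEMONICS[ops[0]] + REGISTERS[regs[0]]])

print(assemble("push eax").hex())
print(assemble("eax push").hex())
```

Because the lexer produces the same token set for both spellings, the ordering question moves out of the lexer entirely and becomes a policy decision in the parse step.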

--
Maxim S. Shatskih
Windows DDK MVP
ma...@storagecraft.com
http://www.storagecraft.com

Rod Pemberton

Aug 29, 2009, 6:39:20 PM

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:h7ammc$57q$1...@news.albasani.net...

>
> well, if interested, I could supply my opcode-listings table, which is in a
> form which "should" be relatively easy to write a tool to translate it into
> whatever form is needed.
>

Thanks for the offer. But, I've made a few myself from manuals, NASM, nntp
posts to various newsgroups, etc., and have a few more as tables etc. from
other people's projects. However, I don't usually use the list natively. I
usually use it as a basis for my code. While some of the lists are in
mod/rm form, most have been brute-force expanded as much as possible. I
prefer those. While this increases the number of byte sequences for
instructions, it eliminates much encoding and/or decoding of instructions.
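
(A rough sketch of what "brute-force expanded" can mean, assuming 32-bit mode and register-only operands. The function names are made up for illustration and the tables below are not taken from any of the lists mentioned.)

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def expand_plus_r(mnemonic, base_opcode):
    """Expand a 'register in low 3 bits of opcode' form into 8 literal entries."""
    return {"%s %s" % (mnemonic, reg): bytes([base_opcode + i])
            for i, reg in enumerate(REG32)}

def expand_modrm_rr(mnemonic, opcode):
    """Expand an 'rm32,r32' Mod/RM form for register-register use into 64 entries.

    Mod/RM byte: mod=11 (register direct), reg=source, rm=destination.
    """
    table = {}
    for d, dst in enumerate(REG32):
        for s, src in enumerate(REG32):
            table["%s %s, %s" % (mnemonic, dst, src)] = bytes([opcode, 0xC0 | (s << 3) | d])
    return table

table = {}
table.update(expand_plus_r("push", 0x50))
table.update(expand_modrm_rr("mov", 0x89))
print(len(table), table["mov eax, ecx"].hex())
```

The trade-off is visible in the counts: one line of listing becomes 64 table entries, but assembling or disassembling then needs only a lookup.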

RP

Rod Pemberton

Aug 29, 2009, 6:39:34 PM

<rand...@earthlink.net> wrote in message
news:3e08caf8-1d6a-4962...@q40g2000prh.googlegroups.com...

> On Aug 27, 12:59 am, "Rod Pemberton" <do_not_h...@nohavenot.cmm>
> wrote:
> >
> > But, I need my own assembler, of smaller size, in ANSI
> > C.
>
> Why is size an issue? Given the memory found on modern machines, I
> can't imagine that saving a few kilobytes would be a big issue these
> days.

Bootstrapping. It will need to be able to compile on a really simple, and
likely incomplete, C compiler, e.g. SmallC. Or, it will need to be
converted to another simple language.

> Use FLEX to generate a C-based lexical analyzer. Use Bison to parse
> the code (though Bison is probably overkill for such a project).

Did that. Well... a large chunk of it. I started with some incomplete and
out-of-date grammars from the 'net. I'm not sure what I did is perfectly
correct or optimal, since it was my first in-depth experience with
flex/bison. I attempted to handle some of the preprocessor functionality as
well. I'm aware now that this was a mistake.

Anyway, flex/bison and lex/yacc have problems with ambiguous grammars like
C. (So, uh, why'd you recommend C?) Two of the ambiguity problems with C
are the implicit ints used in pre-ANSI C, and typedefs. C's typedefs
actually need another keyword added to the language to parse without
ambiguity. The usual flex/bison solution is a "hack" to pass information
between the lexer and parser. The problem I ran into with these is that
they separate lexing from parsing, when in many instances, things are much
simpler if you can do both when needed.
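
(A toy illustration of that "hack", with many simplifying assumptions: statements are split on ";", a typedef's new name is taken to be its last word, and only the first word of a statement is inspected. The point is only that "T * x;" cannot be classified without the table of typedef names. classify_statements is a hypothetical name.)

```python
import re

BUILTIN_TYPES = {"int", "long", "char", "short", "unsigned", "signed",
                 "float", "double", "void"}

def classify_statements(source):
    """Label each statement as 'typedef', 'declaration' or 'expression'."""
    typedef_names = set()          # the information shared between parser and lexer
    results = []
    for stmt in source.split(";"):
        words = re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", stmt)
        if not words:
            continue
        if words[0] == "typedef":
            typedef_names.add(words[-1])   # parser side: record the new type name
            results.append("typedef")
        elif words[0] in BUILTIN_TYPES or words[0] in typedef_names:
            results.append("declaration")  # lexer side: the name is now a type
        else:
            results.append("expression")
    return results

print(classify_statements("typedef int T; T * x; a * b;"))
print(classify_statements("T * x;"))
```

The same text "T * x;" comes out as a declaration in the first call and as a multiplication in the second, which is the ambiguity in question.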

> The
> code will be much larger, but the source will be much smaller. And you
> don't have to play around with limiting yourself to funny characters
> to lead off your lexemes.

The flex grammar for ANSI and GCC C is 919 lines. The bison/yacc grammar is
785 lines. *AND*, the bison/yacc grammar doesn't have the code to create an
AST... But, the counts do include the preprocessor mistake code. I
guesstimate: down 50 for pp-mistake, up 350 for AST, ~2000 lines. Bison's
output is 2472 + 153 lines of C. Flex's output is 5516 lines of C. ~8141
lines. From what I've seen, most of the freely available C grammars don't
actually lex or parse in compliance with the C specs. They take all sorts
of shortcuts, e.g., lexing 0-9 for all numbers, including octal. One of my
in-progress C parsers is a few hundred lines smaller than those two,
~1500 lines. Admittedly, that doesn't come close to the SmallC limited
implementations of C, but I'm doing some things my way, including some large
switches(), which add "bloat"...
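
(To make the octal shortcut concrete, a small sketch of an integer-constant pattern that follows the C grammar more closely: octal constants may only contain 0-7, and the suffix combinations are spelled out. This covers integer constants only, not floating or character constants, and classify_int is a made-up name.)

```python
import re

INT_CONST = re.compile(
    r"(0[xX][0-9a-fA-F]+|0[0-7]*|[1-9][0-9]*)"      # hex, octal, decimal
    r"([uU](ll|LL|[lL])?|(ll|LL|[lL])[uU]?)?"       # optional u/l/ll suffixes
)

def classify_int(text):
    """Return 'hex', 'octal' or 'decimal', or None if text is not a valid constant."""
    m = INT_CONST.fullmatch(text)
    if m is None:
        return None
    body = m.group(1)
    if body[:2].lower() == "0x":
        return "hex"
    if body.startswith("0"):
        return "octal"          # note: a lone 0 is an octal constant in the C grammar
    return "decimal"

for sample in ["017", "08", "0x1F", "42UL", "0"]:
    print(sample, classify_int(sample))
```

A lexer that simply accepts [0-9]+ would let "08" through and leave the error for later, or never report it at all.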

> > With "push eax", you've got the instruction, but you can't attempt to emit
> > an instruction until after you've parsed the register, but you need to take
> > action on the push. If you do so, the routine for the push instruction must
> > "say": "I need a register before I can emit the correct push instruction.".
>
> I'm not sure how this issue is solved using "postfix" notation. All
> you've done is change it to "I've got a register, now I need an
> instruction before I can emit the correct instruction."

Ok, I think Rui said something similar, but I haven't gotten back to his
post. As said previously, once you've got the instruction from ".eax
_push", you've got everything needed to emit an instruction. You had one
call to detect "eax", one call to detect "push", and now you're ready to go.
In normal order, you don't. You've got one call to detect "push". Once
you've detected "push", it's the optimal time to emit an instruction, e.g.,
"trigger". But, you can't. You must get the remainder of the information
needed. So, you can either delay and use some other "trigger" to emit the
instruction, or you can call the instruction emit routine and then get the
remaining information. The latter needs a function call to the get register
routine for every instruction implemented, i.e., bloat. The former requires
another trigger independent of the instruction, and extra processing, e.g.,
up to EOL or up to a comment character. In the former, you can't just discard
everything after you've got the instruction. You don't know what's there.
If the instruction terminates the information, it's the trigger. Anything
after that is the next instruction or something else. Doing either the
former or latter really nullifies the use of the parser being directed by
syntax, e.g., escape control stream of "._" for ".eax _push". Don't you
think so?

> BTW, generating the code for "push eax" is ridiculously simple using
> FLEX and Bison.

It's even simpler for ".eax _push" in C. The parser only sees "._", the
"control" stream. It's syntax directed by escape characters. All you need
is a switch().
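
(A hedged sketch of that dispatch, written in Python rather than C for brevity. The meanings assumed here, "." introduces a register and "_" introduces an instruction that triggers emission, are inferred from the ".eax _push" example and may not match the real assembler. Only one-byte register forms in 32-bit mode are covered, and assemble_postfix is a hypothetical name.)

```python
REGISTERS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
             "esp": 4, "ebp": 5, "esi": 6, "edi": 7}
OPCODES = {"push": 0x50, "pop": 0x58, "inc": 0x40, "dec": 0x48}

def assemble_postfix(source):
    """Dispatch on the leading escape character of each word."""
    out = bytearray()
    reg = None
    for word in source.split():
        kind, name = word[0], word[1:]
        if kind == ".":                       # operand: just remember it
            reg = REGISTERS[name]
        elif kind == "_":                     # instruction: everything needed is known
            if reg is None:
                raise ValueError("no register before _%s" % name)
            out.append(OPCODES[name] + reg)
            reg = None
        else:
            raise ValueError("unknown escape character in %r" % word)
    return bytes(out)

print(assemble_postfix(".eax _push .ecx _pop").hex())
```

The instruction word is the only point where bytes are emitted, so no end-of-line or comment detection is needed to know when an instruction is complete.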

> Indeed, writing a bare-bones assembler (no macros,
> conditional assembly, or other compile-time language features) would
> be nearly trivial except for the fact that there are hundreds and
> hundreds of machine instructions to encode.

I'm limiting that. I have yet to decide if I'm only going to implement what
I need, or just the basic 386 instructions. I'm not sure about macros or
conditional assembly yet. I use a little bit with NASM. Most of the input
will be the output of another program. So, the functionality of macros or
conditional assembly will likely be shifted "upstream". I'm working on a
number of related projects in an ad hoc fashion: assembler, C compiler,
interpreter. So, things go as they go.

Rod Pemberton

Aug 29, 2009, 6:40:00 PM

"Maxim S. Shatskih" <ma...@storagecraft.com.no.spam> wrote in message
news:h7c3gn$vpa$1...@news.mtu.ru...

> >Use FLEX to generate a C-based lexical analyzer. Use Bison to parse
> >the code (though Bison is probably overkill for such a project).
>
> Good advice.

Depends on the project. C is not good advice. The language has
ambiguities. It's not LALR(1) due to implicit int's or typedef's. Properly
implementing constants and numbers in C requires many rules. There are
other problems, like needing to distinguish "long long" from "long", etc.
You can read my other post to RH for line counts of flex/bison grammars for
C.

> Or, for a novice which wants to play with parsers - code your own
> state-machine-based lexer and then LL(1) parser. Not this complex.

I did a few sample FSM lexers too: one removes C comments, and another
removes C++ comments. The most powerful cleans up C code. It does a number
of things, like correcting indentation, removing extra whitespace, inserting
missing space or brackets. But, the table is hand encoded. So, I suspect
they are larger than needed. I'm not sure how to optimize these or reduce
their size, although I see papers saying that it can be done. Anyway, they
are awkward and cryptic, although they are compact. But, they aren't
powerful enough to implement control flow. They operate on input streams.
So, to implement flow control such as a loop, you'd need a loop to generate
a stream, then the FSM can check the validity of the input, but since you've
got flow control to implement the stream in the first place and a structured
language, there is no point in implementing an FSM. It's just
duplication... I.e., they work best as input filters, such as keypads,
lexers, etc. but not as a compact compilation model.
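
(For readers who have not seen one, a small table-driven FSM of this kind, sketched in Python. It is not the unreleased code and not the FORTH port: it strips only /* ... */ comments, replaces each with one space, and deliberately ignores string and character literals. State and action names are made up.)

```python
CODE, SLASH, BLOCK, STAR = range(4)

def char_class(ch):
    """Map a character to a column of the table: '/', '*', or anything else."""
    return {"/": 0, "*": 1}.get(ch, 2)

# TABLE[state][class] = (next_state, action)
TABLE = {
    CODE:  [(SLASH, None),    (CODE, "emit"), (CODE, "emit")],
    SLASH: [(SLASH, "slash"), (BLOCK, None),  (CODE, "slash_emit")],
    BLOCK: [(BLOCK, None),    (STAR, None),   (BLOCK, None)],
    STAR:  [(CODE, "space"),  (STAR, None),   (BLOCK, None)],
}

def strip_c_comments(text):
    out = []
    state = CODE
    for ch in text:
        state, action = TABLE[state][char_class(ch)]
        if action == "emit":
            out.append(ch)
        elif action == "slash":          # the held '/' was ordinary; keep holding the new one
            out.append("/")
        elif action == "slash_emit":     # the held '/' was ordinary; emit it and this char
            out.append("/" + ch)
        elif action == "space":          # comment closed: it counts as one space
            out.append(" ")
    if state == SLASH:                   # input ended right after a lone '/'
        out.append("/")
    return "".join(out)

print(strip_c_comments("x = a / b; /* divide */ y = 1;"))
```

All of the behaviour lives in TABLE, which is what makes such filters compact but, as noted, hard to read once the table is hand-encoded.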

All are unreleased, but they are based on my port of JV Noble's FSM in FORTH
to C:
http://groups.google.com/group/comp.lang.c/msg/2fa92cf626980f5c?hl=en

Rod Pemberton

BGB / cr88192

Aug 29, 2009, 6:58:32 PM

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message

news:h7cama$u46$1...@aioe.org...

ok.

note that my assembler does not directly use the listing either, rather, it
is typically processed by a tool and used to automatically build parts of
the assembler...

this approach is actually used for many of my bytecode formats as well,
mostly since it makes working with bytecode easier
(encoding/decoding/dumping/...). variants are used for JBC and others as
well...

granted, I do tend to leave opcodes in a mostly command-based form, rather
than fully expanding them.

or such...

>
> RP
>
>

Rod Pemberton

Aug 29, 2009, 8:11:31 PM

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:h7cbqp$khq$1...@news.albasani.net...

>
> "Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message
> news:h7cama$u46$1...@aioe.org...
> > "BGB / cr88192" <cr8...@hotmail.com> wrote in message
> > news:h7ammc$57q$1...@news.albasani.net...
> >>
> >> well, if interested, I could supply my opcode-listings table, which is in a
> >> form which "should" be relatively easy to write a tool to translate it into
> >> whatever form is needed.
> >>
>

> note that my assembler does not directly use the listing either, rather, it
> is typically processed by a tool and used to automatically build parts of
> the assembler...

Yeah, I probably should do that sometime... I was just thinking about a
tool to recode strings into a slightly more compact form. It would take many
strings, say error messages in your OS, and convert them into 1) a list of
unique words, and 2) recode the messages into integer sequences - each
integer representing a text word. However, I'm not likely to use this more
than a few times, and the list is likely to be small, so I might just do it
the hard way. Edit, VI, sort, uniq, edit to "array"-ize words for C.
Encode strings by manual lookup. This has to have been done by somebody
somewhere...
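
A minimal sketch of such a recoder in C, assuming space-separated words and small fixed-size tables. The limits MAX_WORDS and MAX_WLEN and the function names are arbitrary choices for the illustration, not taken from any existing tool.

```c
#include <string.h>

#define MAX_WORDS 64   /* capacity of the word table (arbitrary) */
#define MAX_WLEN  32   /* longest word, including the terminator */

static char words[MAX_WORDS][MAX_WLEN];
static int  n_words;

/* Return the index of the word w[0..len-1], adding it to the table
 * if it is new. Returns -1 if the word is too long or the table is
 * full. */
static int intern(const char *w, size_t len)
{
    int i;

    if (len >= MAX_WLEN)
        return -1;
    for (i = 0; i < n_words; i++)
        if (strlen(words[i]) == len && memcmp(words[i], w, len) == 0)
            return i;
    if (n_words == MAX_WORDS)
        return -1;
    memcpy(words[n_words], w, len);
    words[n_words][len] = '\0';
    return n_words++;
}

/* Recode msg as a sequence of word indices in out[].
 * Returns the number of indices written, or -1 on overflow. */
int encode_message(const char *msg, int *out, int max_out)
{
    int n = 0;

    for (;;) {
        size_t len;
        int id;

        msg += strspn(msg, " ");     /* skip separating spaces */
        len = strcspn(msg, " ");     /* length of the next word */
        if (len == 0)
            break;
        id = intern(msg, len);
        if (id < 0 || n == max_out)
            return -1;
        out[n++] = id;
        msg += len;
    }
    return n;
}

/* Rebuild the text for n indices into buf, joining words with single
 * spaces. Returns 0 on success, -1 on a bad index or a full buffer. */
int decode_message(const int *codes, int n, char *buf, size_t bufsz)
{
    size_t used = 0;
    int i;

    if (bufsz == 0)
        return -1;
    buf[0] = '\0';
    for (i = 0; i < n; i++) {
        size_t len;

        if (codes[i] < 0 || codes[i] >= n_words)
            return -1;
        len = strlen(words[codes[i]]);
        if (used + len + 2 > bufsz)  /* word + separator + terminator */
            return -1;
        if (i > 0)
            buf[used++] = ' ';
        memcpy(buf + used, words[codes[i]], len);
        used += len;
        buf[used] = '\0';
    }
    return 0;
}
```

Encoding "file not found" and then "disk not found" stores "not" and "found" only once; the second message just reuses their indices.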

RP

Nathan Baker

Aug 30, 2009, 2:31:50 PM

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:h7c08p$40m$1...@news.albasani.net...

>
>
> or, just write a plain tokenizer...
>
> using funny characters, well, personally I don't see why it is not that
> necessary, since fixed-form logic and a tokenizer are sufficient to parse
> ASM.
>

This is much easier when the 'token' can be the opcode. Yes, most ASM code
is trivially parsed via tables and simple logic -- there is no reason to
resort to escape sequences or fancy grammar parsers.

Nathan.

BGB / cr88192

Aug 30, 2009, 4:20:43 PM

"Nathan Baker" <nathan...@gmail.com> wrote in message
news:h7egi1$41a$1...@aioe.org...

hmm... I didn't write this so well...

basically, I meant, I didn't see why it was necessary to use fancy chars...

so, yeah, I use a tokenizer, and a simplistic parser which generally parses
ASM syntax (and does not need to resort to either funny syntax, or a
full-featured parser...).

so, my parser does not use ASTs or anything like this, just using tokens to
drive the logic...
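
For illustration only, a token-driven assembler for a tiny subset might look like the following C sketch. It is not the assembler described above; the table layout and the names next_token, lookup, and assemble_line are invented here. It handles only the one-byte "op reg32" forms (inc, dec, push, pop) as encoded in 32-bit mode.

```c
#include <ctype.h>
#include <string.h>

struct name_code { const char *name; int code; };

/* One-byte opcodes that carry a 32-bit register number in their low
 * three bits (32-bit mode encodings). */
static const struct name_code ops[] = {
    { "inc", 0x40 }, { "dec", 0x48 }, { "push", 0x50 }, { "pop", 0x58 }
};

static const struct name_code regs[] = {
    { "eax", 0 }, { "ecx", 1 }, { "edx", 2 }, { "ebx", 3 },
    { "esp", 4 }, { "ebp", 5 }, { "esi", 6 }, { "edi", 7 }
};

/* Copy the next whitespace-delimited token, lowercased, into tok.
 * Returns a pointer just past the token, or NULL if none is left.
 * Overlong tokens are truncated and will simply fail the lookup. */
static const char *next_token(const char *p, char *tok, size_t toksz)
{
    size_t n = 0;

    while (isspace((unsigned char)*p))
        p++;
    if (*p == '\0')
        return NULL;
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < toksz)
            tok[n++] = (char)tolower((unsigned char)*p);
        p++;
    }
    tok[n] = '\0';
    return p;
}

/* Linear table search; returns the code or -1 if name is unknown. */
static int lookup(const struct name_code *t, size_t n, const char *name)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(t[i].name, name) == 0)
            return t[i].code;
    return -1;
}

/* Assemble "<op> <reg32>". Returns the opcode byte, or -1 on error.
 * The tokens drive the logic directly; no tree is built. */
int assemble_line(const char *line)
{
    char tok[16];
    int op, reg;
    const char *p;

    p = next_token(line, tok, sizeof tok);
    if (p == NULL)
        return -1;
    op = lookup(ops, sizeof ops / sizeof ops[0], tok);
    if (op < 0)
        return -1;

    p = next_token(p, tok, sizeof tok);
    if (p == NULL)
        return -1;
    reg = lookup(regs, sizeof regs / sizeof regs[0], tok);
    if (reg < 0)
        return -1;

    if (next_token(p, tok, sizeof tok) != NULL)
        return -1;               /* unexpected trailing token */
    return op + reg;
}
```

With these tables, assemble_line("push eax") yields 0x50, the same byte NASM emits for "push eax" in 32-bit code.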

jonas rudloff

Aug 31, 2009, 3:04:37 AM

When you talk about human-readable stuff, why don't you make an
assembler that works with language like:

this is the beginning.
please increment ecx.
put ecx into eax.
and then, compare eax with the double address of ebx.
if they are not equal, you may jump to the beginning.

That would be much funnier than:
beginning:
inc ecx
mov eax, ecx
cmp eax, [ebx*2]
jne beginning

Alexei A. Frounze

Sep 4, 2009, 6:47:45 PM

On Aug 24, 4:09 pm, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:

I once wrote a simple x86 assembler (with normal NASM-like
syntax) in Perl. It was quite easy with regexps. And once the
prototype works, it's easy to convert it to C or anything else.
Unfortunately, I've lost the file (~50K, a few KLOCs), so if I ever
want to do it again, I'll have to start from scratch.

Alex

Alexei A. Frounze

Sep 4, 2009, 6:57:22 PM

On Aug 27, 12:59 am, "Rod Pemberton" <do_not_h...@nohavenot.cmm>
wrote:

...

> With "push eax", you've got the instruction, but you can't attempt to emit
> an instruction until after you've parsed the register, but you need to take
> action on the push. If you do so, the routine for the push instruction must
> "say": "I need a register before I can emit the correct push instruction.".
> This means a function call to parse the register must be within the routine
> for the push instruction. With many different instructions to assemble, you
> end up with numerous functions calls to parse the register throughout your
> application.

I compiled a list of operand combinations like "Eb, Gb", "Eb, Ib", etc.,
and I parsed instructions in terms of the regexps describing these
patterns. Of course, some substitution had to be done prior to the
parsing (EQUs and defines had to be expanded). And there were a few
ambiguities, which were solved by properly ordering the combinations
and adding a few requirements here and there (mostly a
requirement to specify the size of a memory or immediate operand --
I don't remember the details and would need to look up the instruction
set). Other than that, it's all quite parseable (with regexps
especially) and obviously feasible.
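
To show how ordering the combinations can resolve ambiguities, here is a small C sketch of the same idea using operand-kind bitmasks instead of Perl regexps. It is not a reconstruction of the lost Perl code. It covers only ADD with register and immediate operands (memory operands are left out, so "Eb" and "Ev" accept registers only), and the names classify and match_add are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Operand kinds as bit flags; one operand may belong to several. */
enum { K_AL = 1, K_R8 = 2, K_EAX = 4, K_R32 = 8, K_IMM8 = 16, K_IMM32 = 32 };

struct pattern {
    const char *text;       /* combination in opcode-map notation */
    unsigned dst, src;      /* kinds accepted for each operand */
    unsigned char opcode;   /* first opcode byte for ADD */
};

/* Order matters: the first matching entry wins, so the specific and
 * shorter encodings are listed before the general ones. */
static const struct pattern add_patterns[] = {
    { "AL, Ib",  K_AL,  K_IMM8,  0x04 },
    { "Eb, Ib",  K_R8,  K_IMM8,  0x80 },
    { "Ev, Ib",  K_R32, K_IMM8,  0x83 },
    { "eAX, Iv", K_EAX, K_IMM32, 0x05 },
    { "Ev, Iv",  K_R32, K_IMM32, 0x81 },
    { "Eb, Gb",  K_R8,  K_R8,    0x00 },
    { "Ev, Gv",  K_R32, K_R32,   0x01 }
};

/* Return the set of kinds that the operand text s can be.
 * Immediates count as Ib only when they fit in a signed byte,
 * which is a deliberately conservative simplification. */
static unsigned classify(const char *s)
{
    static const char *const r8[] =
        { "al", "cl", "dl", "bl", "ah", "ch", "dh", "bh" };
    static const char *const r32[] =
        { "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi" };
    size_t i;
    char *end;
    long v;

    for (i = 0; i < 8; i++) {
        if (strcmp(s, r8[i]) == 0)
            return (unsigned)(K_R8 | (i == 0 ? K_AL : 0));
        if (strcmp(s, r32[i]) == 0)
            return (unsigned)(K_R32 | (i == 0 ? K_EAX : 0));
    }
    v = strtol(s, &end, 0);
    if (end == s || *end != '\0')
        return 0;                       /* not a register or number */
    return (unsigned)(K_IMM32 | ((v >= -128 && v <= 127) ? K_IMM8 : 0));
}

/* Return the first ADD combination that fits both operands,
 * or NULL if none does. */
const struct pattern *match_add(const char *dst, const char *src)
{
    unsigned d = classify(dst), s = classify(src);
    size_t i;

    for (i = 0; i < sizeof add_patterns / sizeof add_patterns[0]; i++)
        if ((d & add_patterns[i].dst) && (s & add_patterns[i].src))
            return &add_patterns[i];
    return NULL;
}
```

Here "add al, 5" fits both "AL, Ib" and "Eb, Ib"; because "AL, Ib" comes first in the table, the short form (opcode 0x04) is chosen.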

Alex