How To Write An Assembler

Nick Mudge

Sep 7, 2008, 2:56:24 PM

Hello,
I just wanted to know if anyone knew of any good tutorials that taught
how to write an assembler for x86 machine code. Also if there was any
good tutorial that taught the 86 machine instructions (how the hex
machine code works/is put together etc.). The best I know of is the
Intel Software Developer Manuals, but I was wondering if there might
be any good references/tutorials etc. online.

Alexei A. Frounze

Sep 7, 2008, 8:03:51 PM

If you don't know bits, bytes and different base representations (as
might be inferred from "how the hex machine code is put together"),
you shouldn't be writing an assembler but instead focus on learning
the basics.

Now, the Intel manuals you mentioned will give you the most info on how
instructions are encoded. The AMD ones will work too. In fact, you
should be using both when there's some uncertainty in either. And then
you may want or need to use tools (disassemblers and other assemblers)
to clear up any remaining confusion.

A basic assembler is pretty easy to do. All you need to do is:
1. parse the source file, get an instruction from a line (if any)
2. encode the instruction and emit the bytes of encoded instructions
into a binary file
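
Step 1 above can be sketched in a few lines. This is only a minimal illustration in Python, assuming a hypothetical line format of "[label:] [mnemonic [operands]] [; comment]"; the names LINE_RE and parse_line are made up for this example and real assemblers accept much richer syntax.

```python
import re

# Hypothetical line format: [label:] [mnemonic [operand, operand...]] [; comment]
LINE_RE = re.compile(r"^\s*(?:(\w+):)?\s*(?:(\w+)(?:\s+(.*?))?)?\s*$")

def parse_line(line):
    """Split one source line into (label, mnemonic, operands)."""
    line = line.split(";", 1)[0]          # drop a trailing comment, if any
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"cannot parse: {line!r}")
    label, mnemonic, rest = m.groups()
    operands = [op.strip() for op in rest.split(",")] if rest else []
    return label, (mnemonic.upper() if mnemonic else None), operands

print(parse_line("L1: MOV EAX, 1 ; set result"))
```

Each parsed tuple would then be handed to the encoder (step 2), which is where most of the real work lies.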

However, to do that you will need to take several passes over the code
being compiled/assembled. The reason for that is forward references:
you don't know the address of an instruction until you have reached it
by generating code for all the instructions before it. Example:

CMP EAX, EBX
JNE L1
MOV EAX, 0
JMP L2
L1:
MOV EAX, 1
L2:
RET

Here you can't fully encode JNE L1 until you know how many bytes the
instructions between it and L1 need. The same applies to JMP L2 and
the instructions between it and L2.
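
To make the forward-reference problem concrete, here is a rough two-pass sketch in Python for the example above. It is a toy, not a real encoder: the SIZES table assumes 32-bit mode and that every jump fits the short (rel8) form, and the function names are invented for illustration.

```python
# Toy instruction sizes in bytes, assuming 32-bit mode and that every
# jump fits the short (rel8) form; a real assembler must size each one.
SIZES = {"CMP": 2, "JNE": 2, "MOV": 5, "JMP": 2, "RET": 1}

SOURCE = [
    "CMP EAX, EBX", "JNE L1", "MOV EAX, 0", "JMP L2",
    "L1:", "MOV EAX, 1", "L2:", "RET",
]

def collect_labels(lines, origin=0):
    """Pass 1: walk the source and record the address of every label."""
    labels, addr = {}, origin
    for line in lines:
        if line.endswith(":"):
            labels[line[:-1]] = addr
        else:
            addr += SIZES[line.split()[0]]
    return labels

def resolve_jumps(lines, labels, origin=0):
    """Pass 2: compute rel8 displacements now that all labels are known."""
    out, addr = [], origin
    for line in lines:
        if line.endswith(":"):
            continue
        mnemonic = line.split()[0]
        addr += SIZES[mnemonic]           # address of the NEXT instruction
        if mnemonic in ("JNE", "JMP"):
            target = labels[line.split()[1]]
            out.append((line, target - addr))
    return out

labels = collect_labels(SOURCE)
jumps = resolve_jumps(SOURCE, labels)
print(labels, jumps)
```

The displacement is measured from the end of the jump instruction. If a jump does not fit in 8 bits, it needs a longer form, which changes the sizes and therefore the label addresses, and that is why more than two passes can be necessary.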

Your assembler should also support very basic arithmetic expressions.
Very often there's a need to calculate the distance between data
objects or instructions. This can be used to determine an object's
size. So, your assembler should be able to assemble an instruction
like this:
MOV EAX, L2 - L1
or a data variable declaration like this:
V DD L2 - L1

In fact, it must be able to assemble any instruction with a purely
additive/subtractive expression involving addresses, e.g.:
LEA EAX, L1 + 5
or
LEA EAX, V + (L2 - L1)
and any expression involving a difference of addresses (of course, if
it supports more complex expressions):
MOV EAX, ((L2 - L1) * 5) SHR 1
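
Once label addresses are known, expressions like these can be evaluated against the symbol table. The following is a small sketch in Python that leans on the standard ast module instead of a hand-written parser. The evaluate function, the SHR/SHL rewriting and the sample addresses are assumptions for illustration, and operator precedence follows Python's rules, not those of any particular assembler.

```python
import ast
import operator

# Binary operators this sketch accepts.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.RShift: operator.rshift, ast.LShift: operator.lshift,
}

def evaluate(expr, symbols):
    """Evaluate an assembler expression against a label -> address table."""
    # Assumption: map SHR/SHL keywords onto Python's shift operators.
    text = expr.replace(" SHR ", " >> ").replace(" SHL ", " << ")

    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            return symbols[node.id]       # KeyError means an undefined label
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(text, mode="eval").body)

symbols = {"L1": 11, "L2": 16, "V": 0x100}    # made-up addresses
print(evaluate("((L2 - L1) * 5) SHR 1", symbols))
```

A difference of two addresses in the same section, such as L2 - L1, is a plain constant, which is why it can be scaled and shifted freely.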

You will need to support some form of the ORG operator/macro to tell
the assembler at what rIP value the following instruction is supposed
to be executed, since compiled x86 code generally isn't position
independent. I think some special name for the entry point should also
be supported if the output binary/object file is supposed to support an
arbitrary entry point and it needs to be indicated somewhere in the
file.
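
The effect of ORG on label addresses can be shown with a small Python sketch. The item tuples and the assign_addresses name are hypothetical; the origin of 100h is just an example (it is the offset at which DOS .COM programs are loaded).

```python
def assign_addresses(items, default_origin=0):
    """Map labels to addresses; an ("ORG", n) item resets the location counter."""
    labels, addr = {}, default_origin
    for kind, value in items:
        if kind == "ORG":
            addr = value
        elif kind == "LABEL":
            labels[value] = addr
        elif kind == "BYTES":
            addr += value     # size in bytes of an encoded instruction or data
    return labels

program = [("ORG", 0x100), ("LABEL", "start"), ("BYTES", 3),
           ("LABEL", "msg"), ("BYTES", 13)]

with_org = assign_addresses(program)
without_org = assign_addresses(program[1:])
print(with_org, without_org)
```

The emitted bytes are the same size either way, but every absolute address that ends up in the code (for example the address of msg) depends on the origin.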

A rich assembler has the following features:
1. support for expressions
2. support for all CPU modes
3. macros and special symbols
4. support for data structures
5. support for segments
6. code/data alignment
7. support of intermediate object files as the output format
8. symbolic/debug information generation
9. support for listing files
10. support for referencing of external objects
11. inclusion of text source files and binary data files
12. etc

For a good C programmer familiar with the x86 assembly language, a
basic assembler (not including most of the above list) would probably
take no more than a month to implement and test.

Alex

Nick Mudge

Sep 8, 2008, 11:23:54 AM

Please don't insult me. By hex numbers, what I meant was I want to
learn how the machine instructions are put together and how they are
used together etc. I mentioned hex numbers to differentiate assembly
code, meaning that I didn't want people to tell me to learn an
assembly language, I want to understand it at a lower level, meaning
the numeric values that represent the instructions. Of course you'd
need to know this in order to write an assembler. I haven't found much
that describes the anatomy of the numeric machine instructions.

I didn't think of reading AMD's manuals. I checked it out. Looks
pretty good.
Thanks for the other info.

Alexei A. Frounze

Sep 8, 2008, 2:19:49 PM

I'm sorry. I must have misinterpreted your words.

> By hex numbers, what I meant was I want to
> learn how the machine instructions are put together

They're encoded numerically as described in the manual. They are then
usually placed one after another and executed sequentially in that
order unless there are jumps, calls, rets or exceptions.

> and how they are used together etc.

Just like C operators. The instructions are tiny building blocks,
every one of them doing very little work. You use many of them to do
something more significant and useful than just moving data around
memory and doing arithmetic and bit operations on it.

> I mentioned hex numbers to differentiate assembly
> code, meaning that I didn't want people to tell me to learn an
> assembly language, I want to understand it at a lower level, meaning
> the numeric values that represent the instructions. Of course you'd
> need to know this in order to write an assembler. I haven't found much
> that describes the anatomy of the numeric machine instructions.

I'm not sure what you're looking for. If it's instruction encoding,
it's described in the manuals. Despite needing a lot of text and
tables to describe it, it's actually pretty straightforward. Every
distinct instruction has a unique numerical code that's encoded with a
number of whole or fractional bytes. Many instructions have a common
structure in the encoding, for example, the ModR/M bytes, which say
what operand this instruction operates on.

Example: ADD Eb, Gb.

Eb stands for a byte that can be (depending on the encoding) a memory
location or a general purpose register (AL, AH, etc). Gb is a general
purpose register.
Encoding: "Opcode" byte (that aforementioned unique number), followed
by the ModR/M byte, possibly followed by SIB byte and/or displacement
bytes. Opcode byte value is 000h. This instruction may have optional
prefix bytes before the opcode byte.

The complete encoding for ADD AL, AL would be: 000h (opcode byte),
0C0h (ModR/M byte).
The top 2 bits ("mod") of 0C0h define whether Eb is a memory or register
location (11B is register; 00B, 01B, and 10B are memory). Bits 2
through 0 ("r/m") further define Eb's location (in case of a register,
it's just the register's index, 0 for AL/AX/EAX; in case of a memory
it's more complicated than that, but the idea is the same, mod is used
together with r/m to encode the memory location). Bits 5 through 3
("reg") define Gb location (again, 0 for AL/AX/EAX).
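
The bit layout just described can be checked with a few lines of Python. This sketch only covers the register-to-register form (mod=11B) of ADD Eb, Gb; the function names are made up for the example.

```python
# Register indices used in the 3-bit reg and r/m fields (8-bit registers).
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}

def modrm(mod, reg, rm):
    """Pack the mod (2 bits), reg (3 bits) and r/m (3 bits) fields into a byte."""
    return (mod << 6) | (reg << 3) | rm

def split_modrm(byte):
    """Unpack a ModR/M byte into (mod, reg, rm)."""
    return byte >> 6, (byte >> 3) & 7, byte & 7

def add_eb_gb(dest, src):
    """Encode ADD Eb, Gb (opcode 00h), register-to-register case only."""
    # Eb (the destination) goes into r/m, Gb (the source) goes into reg.
    return bytes([0x00, modrm(0b11, REG8[src], REG8[dest])])

print(add_eb_gb("AL", "AL").hex())   # the ADD AL, AL example from the text
print(split_modrm(0xC0))
```

Memory operands (mod of 00B, 01B or 10B) additionally need the SIB and displacement handling described in the manuals, which this sketch leaves out.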

Now, for XOR Gv, Ev, where Gv=eAX and Ev=eDI you'd have this encoding:
033h, 0C7h (mod=11B, reg=000B, r/m=111B).

If for some reason you're trying to understand why it's 000h for ADD,
033h for XOR and 000B for AL/AX/EAX, it's so because many years ago
Intel decided it should be so. These numbers are obviously related to
the CPU's internal implementation, but you don't need to worry about
them much. At least, you shouldn't care about why 000B encodes
AL/AX/EAX and not CL/CX/ECX.

See the rest in the manuals. Ask questions if something's unclear.

> I didn't think of reading AMD's manuals. I checked it out. Looks
> pretty good.
> Thanks for the other info.

You're welcome. I'm still unsure if I got your questions and estimated
your knowledge correctly.

Alex

Rugxulo

Sep 8, 2008, 4:57:58 PM

Hi,

On Sep 8, 10:23 am, Nick Mudge <spamt...@crayne.org> wrote:
> On Sep 7, 5:03 pm, "Alexei A. Frounze" <spamt...@crayne.org> wrote:
> > On Sep 7, 11:56 am, Nick Mudge <spamt...@crayne.org> wrote:
>
> > > I just wanted to know if anyone knew of any good tutorials that taught
> > > how to write an assembler for x86 machine code. Also if there was any
> > > good tutorial that taught the 86 machine instructions (how the hex
> > > machine code works/is put together etc.).
>

> Please don't insult me. By hex numbers, what I meant was I want to
> learn how the machine instructions are put together and how they are
> used together etc. I mentioned hex numbers to differentiate assembly
> code, meaning that I didn't want people to tell me to learn an
> assembly language, I want to understand it at a lower level, meaning
> the numeric values that represent the instructions. Of course you'd
> need to know this in order to write an assembler. I haven't found much
> that describes the anatomy of the numeric machine instructions.

Well, there are two files in particular that I think might help you.
Granted, they are somewhat outdated, but that's good, right? Less to
worry about (no x86-64, etc). ;-)

http://www.eunet.bg/simtel.net/msdos/asmutl.html

In particular, TA980705.ZIP (an assembler in its own right) has
OPCODE.TXT (talking about octal codes). And the other good one is
DISASM.ZIP (complete 8086/186 disassembly tables).

You won't get more useful info, IMO, unless you look at the sources to
other popular assemblers (NASM, GAS, JWasm, FASM, Octasm, Wolfware,
etc). It depends on how simple or complex you want to make it. I'm
sure you'd get more specific help if you stated what programming
language, OS hosts, output formats, and bits (32, I assume) that you
intend to support.

Rod Pemberton

Sep 8, 2008, 6:28:07 PM

"Nick Mudge" <spam...@crayne.org> wrote in message
news:b4055ea7-157a-4e96...@n38g2000prl.googlegroups.com...

> I just wanted to know if anyone knew of any good tutorials that taught
> how to write an assembler for x86 machine code.

No.

The tables are a bit out of date. It's written in C. But, I've found it
easy to modify:

486dis_c.zip
http://www.eunet.bg/simtel.net/msdos/disasm.html

> Also if there was any
> good tutorial that taught the 86 machine instructions (how the hex
> machine code works/is put together etc.).

For a perspective a bit different from the manuals:
http://groups.google.com/group/alt.lang.asm/msg/2ab864fb18cd8f63

I found a couple of trivial errors in one of his (Mark Hopkins) earlier
posts. I don't know about this one. In general, his earlier post was
extremely accurate once expanded and converted into a form I could check.

Rod Pemberton

Wolfgang Kern

Sep 9, 2008, 2:48:46 AM

I haven't seen any tutorials for creating complete tools.
If you want to know about all CPU instructions, AMD64 VOL 1..5
(pdf: 24592..24594, 24568, 24589) may be what you are looking for.
VOL 3 contains shortcut lists and opcode maps, VOL 4/5 cover SSE.

HEXTUTOR is just a quick lookup reference and was mainly
written to check the output of my disassembler.

A DEMO-version (windoze-PE with a few known bugs) is available:

http://web.utanet.at/schw1285/KESYS/index.htm [codesnips][HEXTUTOR]

__
wolfgang

tin.cans.and.string

Sep 9, 2008, 1:33:01 AM

On Sep 7, 12:56 pm, Nick Mudge <spamt...@crayne.org> wrote:
> Also if there was any good tutorial that taught the 86 machine
> instructions (how the hex machine code works/is put together
> etc.)

The NASM manual has a fairly comprehensive listing of instructions
with details on how they're built in machine code in Appendix B, as
well as a very short introduction to some of the formatting used.

http://home.comcast.net/~fbkotler/nasmdocb.html