Writing a disassembler ?

So and so

unread,

Oct 10, 2008, 10:57:31 AM10/10/08

to

Hey,

I've set myself a goal to write a disassembler, so far I've managed to
understand most of Intel's documentation (for now it's only going to
disassemble x86 code) and I'm about to start writing the basic
skeleton.

The algorithm I had in mind was :
Read N bytes (or the whole files, I'm not sure about it yet)
if this byte is part of a prefix instruction, parse it else continue
to opcode, and so on and so on .

Though, I ran into several "problems" in mind:

1. Which data structure should store the values I read ? A hash table
or a Tree ? Or a combination of both ? (trie) Should the tree be
balanced ? If not, will it cost in efficiency or whether balancing it
will cost in efficiency ?

2. What about invalid instructions ? Should I strip them the moment I
detect they're invalid or should they be stored FFU ?

3. Which data structure should hold the final result of the
disassembled instruction ?

5. Should the disassembler itself be multi threaded or one program
which does everything step-by-step and if it will be multi threaded -
how can I handle or parse different instructions ? or handle
synchronization ?

best regards,
lightfault

Vimal

unread,

Oct 11, 2008, 8:23:13 AM10/11/08

to

Hi,

2008/10/10 So and so <light...@gmail.com>:

> I've set myself a goal to write a disassembler, so far I've managed to
> understand most of Intel's documentation (for now it's only going to
> disassemble x86 code) and I'm about to start writing the basic
> skeleton.

Nice. I have never written any disassembler before, my views are just
from a programmer's point of view.

> The algorithm I had in mind was :
> Read N bytes (or the whole files, I'm not sure about it yet)

You could have a single interface to read a character (read_char())
and let that function take care of buffering up IO and such. It could
be something like:

read_char():
if buffer is empty:
read bufsize amount into the buffer from the file
else
return the character at the current position, position++

> if this byte is part of a prefix instruction, parse it else continue
> to opcode, and so on and so on .

Instruction prefixes can be longer than 1 byte. They can be a max of 4
bytes (I think). So, you would have something like a state machine
here.

> Though, I ran into several "problems" in mind:
>
> 1. Which data structure should store the values I read ? A hash table
> or a Tree ? Or a combination of both ? (trie) Should the tree be
> balanced ? If not, will it cost in efficiency or whether balancing it
> will cost in efficiency ?

A multi-level hash combining different techniques would be the best
bet. For this, it is important to see the distribution of instruction
length across all instructions. Common instructions tend to be smaller
in length, and you would want very fast decoding for this.
Conceptually it's a tree, but the way you implement the tree matters.

For example, you could hash into the first two bytes of a prefix (if
at all it's a prefix):

prefix[1 << 16] = array of some datastructures containing pointers to
further processing.

Try to keep expensive operations like trie-searching at the lowest level.

A good example of multi-level hashing is Linux Kernel Timers, where
timers are hashed on their time of execution in a nice way.

> 5. Should the disassembler itself be multi threaded or one program
> which does everything step-by-step and if it will be multi threaded -
> how can I handle or parse different instructions ? or handle
> synchronization ?

Executables have sections built into them. You could parallelize
disassembling each section! In fact, this is a good example of a data
parallelism where you're exploiting the structure of your data allows
it to be parallelized easily.

In fact, as soon as you an instruction boundary, (not just sections,
but also procedures, or in fixed-length intstruction archs, which
unfortunately Intel isn't) you can independently parse from that
location.

Synchronization wouldn't naturally arise. But, you would want your
other disassembling threads to finish before you resolve names, unless
you're content with:
CALL 0xdeadbeef

instead of:
CALL __function_name

You can do this intelligently, but a 2-pass disassembling would be the
easiest solution. First pass to disassemble code, second to patch up
references.

Hope that helps,
--
Vimal

Jeff Kenton

unread,

Oct 11, 2008, 9:51:43 AM10/11/08

to

I wrote a disassembler for a Motorola 88000 once. It's a RISC
architecture, so it only took me 4 hours. Writing one for an x86
architecture will take you considerably longer.

Read the whole file into an array at once -- memory is cheap.

For my disassembler, I had a table consisting of a mask, an opcode
value, instruction class and a string for the opcode name. For each
instruction I would mask it and compare the value. When I found a
match, I printed the opcode and used the instruction class (and a BIG
switch statement) to decide how to print the operands. Its purpose was
just disassembly to text -- no fancy interface nor understanding of the
code nor attempt to track variable names nor jump targets. Its speed
was only limited by how fast you could printf to the screen.

For Intel x86 disassembly, you will need to handle multibyte opcodes
(mine had fixed size 32 bit instructions). You could probably still do
it with a single table using appropriate multibyte masks and values, but
you might also choose to break it down by instruction length. In that
case, I still suggest using the same scheme -- first prefixes, then
single byte instructions, etc. The instruction class field can be your
guide to whether you need another byte and what to do with it.

For invalid instructions you have the problem of getting out of sync
with the intended instruction stream. It will eventually sync back up,
or else you can try something fancy to figure out what's going on.

The way you represent the disassembled instruction depends on what
you're going to use it for. When you decide that you'll also know what
to do with your invalid instructions.

Don't multi-thread. This a disassembler, not a computer science project.

Good luck, and have fun.

jeff

So and so wrote:
> 1. Which data structure should store the values I read ? A hash table
> or a Tree ? Or a combination of both ? (trie) Should the tree be
> balanced ? If not, will it cost in efficiency or whether balancing it
> will cost in efficiency ?
>
> 2. What about invalid instructions ? Should I strip them the moment I
> detect they're invalid or should they be stored FFU ?
>
> 3. Which data structure should hold the final result of the
> disassembled instruction ?
>
> 5. Should the disassembler itself be multi threaded or one program
> which does everything step-by-step and if it will be multi threaded -
> how can I handle or parse different instructions ? or handle
> synchronization ?

---------------------------------------------------------------------
= Jeff Kenton http://home.comcast.net/~jeffrey.kenton =
---------------------------------------------------------------------

Hans-Peter Diettrich

unread,

Oct 11, 2008, 11:28:36 AM10/11/08

to

So and so schrieb:

> The algorithm I had in mind was :
> Read N bytes (or the whole files, I'm not sure about it yet)
> if this byte is part of a prefix instruction, parse it else continue
> to opcode, and so on and so on .

Imagine how a physical CPU will process instructions. There is no room
for guessing, the decoding of an instruction is a straight forward process.

> Though, I ran into several "problems" in mind:
>
> 1. Which data structure should store the values I read ? A hash table
> or a Tree ? Or a combination of both ? (trie) Should the tree be
> balanced ? If not, will it cost in efficiency or whether balancing it
> will cost in efficiency ?

Whenever you start decoding an instruction, create a data structure for
it. Put instruction prefixes into that structure, for later use.

> 2. What about invalid instructions ? Should I strip them the moment I
> detect they're invalid or should they be stored FFU ?

The separation of instructions from data and garbage is one of the most
annoying tasks in an disassembler. In real life x86 code you also should
be prepared for overlaid instructions, where a jump into another
instruction will result in the execution of a different instruction.
There may exist different interpretations of the same bytes, depending
on the entry points taken.

Even if an instruction decodes fine, for itself, it may be invalidated
later by contextual constraints, like for unreachable instructions or
invalid instructions in the following bytes.

> 3. Which data structure should hold the final result of the
> disassembled instruction ?

For x86 code I used a fixed structure, containing only the essential
information, useful to bypass the decoding of the instruction in later
passes.

More important is the storage of the (possibly) valid instruction
sequences (address ranges), which form the Basic Blocks (BB) in control
flow analysis. You'll have to start from known entry points to code,
adding further entry points from references in immediate or indirect
instruction arguments. Every jump, call or interrupt terminates an BB.
You cannot be sure whether a call or interrupt returns at all, and that
it will always return to the point where the call was taken. Sometimes
the bytes after an call contain subroutine arguments, which are
processed and skipped by the callee, or are used to modify the return
address in some other way. The Microsoft coders have tons of dirty
tricks in their pockets, which can (or shall?) confuse any intruder into
their code. I'd suggest that you start with 32 (or 64?) bit protected
mode code (flat model), which may contain not so many tricky constructs,
as I found in 16 bit and real mode code. Otherwise the Intel segmentitis
will bite you...

After an first pass over all initial and added entry points, you'll
typically end up with many white spots in the segments, which will have
to be explored in subsequent passes, by heuristics or guided by the
user. I strongly suggest that you have a look at IDA, the best
Interactive DisAssembler I've ever seen :-)

Later on you'll have to deal with the data inside an executable. Much
time will be spent in reasoning about the data type of non-code bytes,
which may be [arrays of data structures containing] pointers, and
tracking the data flow between registers, stack and other writeable data
areas. But that's too much for an first approach to the universe of
disassemblers ;-)

> 5. Should the disassembler itself be multi threaded or one program
> which does everything step-by-step and if it will be multi threaded -
> how can I handle or parse different instructions ? or handle
> synchronization ?

I cannot see much use for multiple threads in an disassembler, except
for a separation of a GUI from the disassembler itself. Everything else
will lead to killing dead end threads from other threads, over and over
again, and you'll loose control over the analyzed areas of definitely
known content (code, data, junk). Not to mention self modifying code...

You better prepare to stream the results of your analysis into an file,
allowing to resume the analysis later.

DoDi

Stephen Horne

unread,

Oct 11, 2008, 3:29:55 PM10/11/08

to

On Fri, 10 Oct 2008 16:57:31 +0200, "So and so" <light...@gmail.com>
wrote:

>1. Which data structure should store the values I read ? A hash table
>or a Tree ? Or a combination of both ? (trie) Should the tree be
>balanced ? If not, will it cost in efficiency or whether balancing it
>will cost in efficiency ?

These are associative data structures, associating keys with data. You
might need something like this for referenced addresses, but the set
of _potentially_ referenced addresses will be very densely packed -
I'd probably use a bitvector (or else some kind of mapping that leads
to single-integer mini-bitvectors for 32 or 64-byte address ranges).
Each flag indicates that a valid instruction starts here.

You don't need to associate label-names with addresses unless you
allow the user to rename them - just use a prefix followed by the hex
address, so you don't need a lookup at all.

For parsing, I don't think you need a data structure at all. Just use
a switch/case instruction to interpret each component, selecting by
offset-from-start and masking as needed. Nest the switch statements to
form a simple decision tree.

In fact, even that's probably the hard way. The easy way is to
download a good regular-grammar-parsing code generator such as Ragel
(http://www.complang.org/ragel/). This tool is easily capable of
interpreting binary files. For variable parts of each opcode (register
identifiers etc) just make sure all cases are covered, and pick out
that information in your rule-recognised actions.

Actually covering all those variable-part cases could be fiddly since
there is no support for deriving sets of recognised bytes using
bitwise or arithmetic manipulation of codes that I recall, but you
could always use a scripting language to generate parts of the Ragel
spec.

You might do well to read up on regular grammar handling techniques.
The standard reference is "Introduction to Automata Theory, Languages,
and Computation" by Hopcroft, Motwani and Ullman - which I can't
afford ATM and have never read. But there's plenty to read elsewhere
if you look. Try the "Algorithmic Forays" section from the gamedev.net
site (http://www.gamedev.net/reference/list.asp?categoryid=25).

However, full regular grammar handling would be overkill. Since you
need to identify the instructions, your state machines will have a
tree structure - ie this is another way to design and implement the
decision trees. Ragel may even generate very similar code to that you
would have written, depending on the options you specify.

It's also worth looking at the pattern matching available in
functional languages such as Haskell and Objective Caml.

As far as the data structure is concerned, keep it simple, use a
standard library if possible - and consider that you may not need a
data structure at all since directly interpreting the binary on demand
is very fast.

>5. Should the disassembler itself be multi threaded or one program
>which does everything step-by-step and if it will be multi threaded -
>how can I handle or parse different instructions ? or handle
>synchronization ?

There was once a free version of IDA Pro. It's pretty out of date
though - more recent versions are probably available as time-limited
demos.

I strongly recommend you play a little bit.

IDA Pro does interactive disassembling. It loads a file and displays
it immediately, but also starts disassembling in the background. It
identifies instruction start points automatically by following the
flow of execution, including both following and skipping over
conditional branches for instance. But it also allows the user to
override its decisions, rename labels and so on.

This is the *only* kind of disassembler that I can imagine using
explicit multi-threading. A simple one-pass disassembler is exactly
that - it dumps out the disassembly of each instruction as soon as it
recognises it. A multi-pass disassembler could do better at
recognising code-blocks from backward jumps, spotting how data blocks
are referenced and therefore deciding whether the data is best
represented as byte, word, float, string or whatever, and other
tweaks. Either way, I doubt there's any benefit from multi-threading.
After all, even with multi-pass approaches, each pass needs all the
data from the previous pass, so later passes cannot start until
earlier passes have been completed.

However, a simpler disassembler may be able to exploit implicit
multi-threading of the kind that some compilers can organise for you,
and which I tend to think of as a bit like what processors do to run
instructions out-of-order, but much more so.

If you really need the best possible optimisation for this, I'd
suggest using Glasgow Haskell. It may take some getting used to for
imperitive programmers, but the deep optimisation and so on can pay
off big.

That said, the real bottleneck will almost certainly be the output of
disassembled code anyway. The input binaries are far more compact, and
the parsing should be very fast no matter how naively you implement
it. About the worst thing you could do is have overcomplex data
structures that thrash the virtual memory when disassembling
multi-megabyte binaries - other than that you should have no problem.

I probably have some old disassembler source codes on reference CD
ROMS from the days of DOS and 16 bit Windows. If you want, I can dig
out one or two for you to look at.

Arargh...@arargh.com

unread,

Oct 11, 2008, 6:32:45 PM10/11/08

to

On Fri, 10 Oct 2008 16:57:31 +0200, "So and so" <light...@gmail.com>
wrote:

>I've set myself a goal to write a disassembler, so far I've managed to

>understand most of Intel's documentation (for now it's only going to
>disassemble x86 code) and I'm about to start writing the basic
>skeleton.

That can be a very non-trivial program. If you intend to limit it to
32-bit flat file types like COFF or ELF, it wouldn't be too bad, as in
those one pretty much knows where the code & where the data is, and
the relocations aren't too bad. If, on the other hand, you intend to
handle the full range fo x86 executable file types, it gets real messy
very quickly.

Just decoding a single instruction is no biggy. I wrote a routine for
that years ago.

> ...

All the rest of your questions are good ones. :-) I don't have the
answers.
--
ArarghMail810 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

glen herrmannsfeldt

unread,

Oct 12, 2008, 8:19:10 PM10/12/08

to

So and so wrote:

> I've set myself a goal to write a disassembler, so far I've managed to
> understand most of Intel's documentation (for now it's only going to
> disassemble x86 code) and I'm about to start writing the basic
> skeleton.

> The algorithm I had in mind was :
> Read N bytes (or the whole files, I'm not sure about it yet)
> if this byte is part of a prefix instruction, parse it else continue
> to opcode, and so on and so on .

Disassemblers I have written, and seen written by others, work a
little differently.

First, read in the whole program to memory. That was usually possible
20 or 30 years ago, and should be even more possible today.

In addition to the memory for storing the program, use another array
to store what you know about that part of the program, initialize to zero.
Start at the entry point if known, record in the second array that is
the beginning of an instruction (maybe set to 1). Determine the opcode
and perform operation specific functions as described below. Indicate
in the second array bytes which are part of the instruction, but not
the beginning (maybe with 2). If the instruction was not an unconditional
branch (check the flags in the opcode specific table) continue with
the next instruction, checking to see that it hasn't been processed yet.
(Second array value is still zero).

Otherwise, if there are any addresses on the saved address stack,
remove one and continue from there. If there aren't, print out
the results and stop.

Depending on the flags in the opcode specific table, do some of the
following.

Branch: Add the destination address to the saved address stack.
(Requires knowing the addressing mode for some machines.)

Indirect branch: (Branch address can't be determined, such as
using the contents of a register or indexed by a register.)
Make a note to the user for manual determination.

At startup, you should read a list of addresses to initialize the
branch address stack. (Easiest is to also include the entry point.)

If there are indirect branches, manual searching for the appropriate
addresses (such as a table of addresses) and adding them to the initial
branch address stack is required. Usually only a few iterations of
running the disassembler and adding addresses will find all the actual
code.

The opcode specific table should have the nmemonic, address mode,
instruction length (possibly modified by address mode), enough to
print out the instruction in the appropriate assembler format.

On processors (such as the Z80) that have mostly one byte opcodes
but some two byte opcodes, there should be a separate table for
two byte opcodes, indicated by a flag in the one byte opcode table.

Bytes not identified as instructions should be printed in the
assembler specific form for hex constants, maybe eight or 16
on a line, or up to the next such boundary.

That should work for a large fraction of available byte addressable
machines. It should also work for word addressable machines with
word size a multiple of eight. With a little more work, for other
word sizes.

It should be mostly table driven, with new flags and addressing
modes added as needed.

-- glen

So and so

unread,

Oct 15, 2008, 10:15:25 PM10/15/08

to

Hey all, thanks for all for your replies.
I will likely consider all of them once I'll get to those stages, as
for now - I'm working on the decoder itself, which seems to be the
earliest stage of the disassembler itself, and I'm having a bit
trouble.

I've started building the structure of the decoder, starting from
reading raw opcodes stream(say, from the constructor
it has been yet decided)
I've created five classes, each for a specific "section" within an
instruction as follows:

1. A prefix class, with a byte prefix which identifies which prefix is
it (if at all)
2. An opcode class with byte[2] opcode, and boolean flags for
is_twobyte, is_modrm,has_disp,has_imm
3. A modrm class with a byte modrm and sib, along with boolean flags
for one, two, or four displacements
4. A displacement class with a byte displacement and boolean whether
it's two byte or four byte displacement (if all false it's obviously a
one byte)
5. An immediate class with the same variables as the previous one

and one class (instruction) which gathers them all together and
performs the checks with a result string and a long address
variable(for BB). I'm unsure whether where I should create the opcode
table (one byte opcodes for modrm/imm/disp fields, two byte etc ) and
how I'll represent it, either inside the opcode class or the
instruction. I got really lost in this side of the stream.
It seems that this is the main part of the decoder, afterwards the
representation (to the user screen) or something like it along with a
big while (1) loop goes into, if anyone could reference me or guide me
about this huge table I'd be fond

Bartc

unread,

Oct 16, 2008, 6:15:03 AM10/16/08

to

"So and so" <light...@gmail.com> wrote in message

>
> I've set myself a goal to write a disassembler, so far I've managed to
> understand most of Intel's documentation (for now it's only going to
> disassemble x86 code)

Only x86? That's probably the most complex.

The first docs you need will be a listing of the 256 possible values of the
initial instruction byte. Once you've decoded this, you will know the
instruction or instruction group, and can go from there. Intel docs tend to
be complex, but there are plenty of these lists around from other sources.

> and I'm about to start writing the basic
> skeleton.
>
> The algorithm I had in mind was :
> Read N bytes (or the whole files, I'm not sure about it yet)

This is your first problem. My idea of a disassembler would have a start
address as input not a file (and so the code would reside in memory).

If you have a file, then what sort of file is this: executable, object code,
etc? Then this will have it's own structure you need to decode before you
get to the code.

You also need a way of recognising the end (otherwise disassembling will
continue until you run off the end of memory).

How big are the files you're processing? Is the output going to be
human-readable syntax, or are you doing some analysis of the code? I've
assumed here the former, in which case there is no point in disassembling a
1GB input file as it might take a decade or two for someone to read it...

> if this byte is part of a prefix instruction, parse it else continue
> to opcode, and so on and so on .
>
> Though, I ran into several "problems" in mind:
>
> 1. Which data structure should store the values I read ? A hash table
> or a Tree ? Or a combination of both ? (trie) Should the tree be
> balanced ? If not, will it cost in efficiency or whether balancing it
> will cost in efficiency ?

I don't understand; why do you want to store a linear set of disassembled
instructions in a tree? Usually the output of a disassembler is a textual
representation of the code (and you need to choose which kind of syntax to
display this).

If not converting to text straightaway, you might as well keep as
undisassembled code! That's pretty compact. Perhaps just a series of
pointers to the start and end of code blocks. When you're ready to output,
disassemble each block.

>
> 2. What about invalid instructions ? Should I strip them the moment I
> detect they're invalid or should they be stored FFU ?

The main problem is not invalid instructions, but bytes that are not
instructions: either data or garbage. You will have to devise a way of
dealing with these and finding the start of the next valid instruction.

On x86, a zero byte is the start of a valid (but rare) instruction, but you
will soon recognise which one! And there also a lot of different different
Intel cpu types, I suppose some codes will be valid instructions for some,
but not for others (the last time I did a disassembler was for 186)

What's an FFU?

>
> 3. Which data structure should hold the final result of the
> disassembled instruction ?

See my answer above.

> 5. Should the disassembler itself be multi threaded or one program
> which does everything step-by-step and if it will be multi threaded -
> how can I handle or parse different instructions ? or handle
> synchronization ?

I won't ask why on earth you want to complicate your project this way. I
guess you're an expert in multi-threading and now want to use it for
everything. Perhaps get any disassembler working first then worry about
multi-threading later; if you must...

--
Bartc

rlunger

unread,

Oct 18, 2008, 4:37:06 AM10/18/08

to

FFU is For Future Use (or Far From Us). Yes, x86 opcodes can be
complex, as has been pointed out, but I do believe that it is also the
most widely used set of processors we have ever seen, so writing a
disassembler for the architecture has it's uses. I've found that the
simplest ways of making something work is the best way (I wrote a
disassembler in QuickBASIC a long time ago). Since the hardest part
of your problem is recognizing the opcodes, start there and then worry
about what to do with your data, how to store it or represent it. Do
what feels right with it, and if that doesn't work, try something
else.

Regards,

R. Lunger