If you're just looking to use a disassembler, then objdump is one choice. The disassembler that comes with the nasm assembler is ndisasm. You can also run "debug.exe" in DOS Box on Linux, provided you get a hold of a copy of the program. It also does disassembly, as well as controlled execution; i.e. simulation of the CPU, itself - which is also important, even when doing disassembly, for reasons I'm about to describe.
Fake86 has a cpu emulator. You may be able to hack it into doing disassembly by (a) having it show the instruction instead of simulating it, (b) having it not take conditional jumps or invoke calls, but (instead) stacking the address as a new entry point to do disassembly from (i.e., in effect, taking both branches and encapsulating subroutines), (c) having it stop the current disassembly at an unconditional jump or return, (d) having it accept one, two or more entry points to start with and ideally (e) having it also accept base addresses for data segments, and (f) getting it to do a hex dump of all the areas unprocessed as data or code segments (as these are usually where indirect jumps or calls or indirectly-accessed data segments land into.)
This gets to the other sense of your query: "I want to make a disassembler". The source for ndisasm is available, and it handles many of the descendants of 8086, not just 8086, itself (which seriously clutters it, if all you want is an 8086 or even 80386 disassembler), but it is not self-contained and has a heavy dependency on the rest of the distribution.
Its main talking point is that it uses octal digits for the opcodes - which better fits the 80x86 - as I pointed out on the USENET in 1995 in comp.lang.asm ... and (in fact) nasm's creation was a direct response to that. So, it's potentially more transparent and you may want to keep the source handy as a check and comparison, if you're making your own disassembler.
You could also try to run ndisasm on debug.exe; after stripping out the 0x200-byte .EXE file header, to make it a raw binary, after extracting out the entry point address CS:IP and stack pointer address SS:SP from it (80x86 stacks grow down, so the stack segment is nominally SS:0 to SS:(SP-1)). The EXE for debug.exe has no relocations, so you're okay with that treating the code as raw binary.
But you won't get anything that's clearly recognizable, since the program is self-modifying - more precisely: self-extracting. You'll get a (barely) compressed code image (about 5/6 compression ratio) followed by a loader routine.
You have to run emulation on it, e.g. by running debug.exe on debug.exe to emulate its unpacking routine, to get it to extract itself, and then you dump the unpacked program image and disassemble that. There is a "relocation table" at the end of the loader routine, so it does actually have relocations in it - it's just that they're applied when the program unpacks itself, rather than by the OS when the EXE file is loaded.
And then you've just disassembled a disassembler that also happens to do CPU emulation, like Fake86 does - but only for the 8086. You'll have to make the absolute addresses relative (using the original relocation table as a guide), to make is re-assemblable. Once you do that, you can work on the source. The opcode table is in clear view (if you display it as text) - both when seen in the packed and unpacked versions of debug.exe.
There's also DosDebug up on GitHub. It handles everything up to "80586" (or Pentium") and "80686": it flags a generation "6" for some instructions.; e.g. the conditional "cmov" operations are handled by it, as well as their "fcmov" floating point versions. DosDebug is in 8086 assembly and is best-suited to compile with jwasm. You might be able to run nasm on it, I don't know. I never tried.
I might port the DAS disassembler to the x86, since items (a)-(f) are already incorporated into DAS's design. I've only ever ported it to the 8051, 6800, 6809 and 8080/8085 (and Z80) up to now; but the transition from 8085 to 8086 is relatively small. To that end, I might hack something out of Fake86. That's mostly abandonware, now, since the author replaced it by XTulator, as Fake86 was written when the programmer was relatively new to C. You might also be able to hack something directly out of DosDebug's opcode tables (their "instr.*" files).
In essence, a disassembler is the exact opposite of an assembler. Where an assembler converts code written in an assembly language into binary machine code, a disassembler reverses the process and attempts to recreate the assembly code from the binary machine code.
Since most assembly languages have a one-to-one correspondence with underlying machine instructions, the process of disassembly is relatively straight-forward, and a basic disassembler can often be implemented simply by reading in bytes, and performing a table lookup. Of course, disassembly has its own problems and pitfalls, and they are covered later in this chapter.
Many disassemblers have the option to output assembly language instructions in Intel, AT&T, or (occasionally) HLA syntax. Examples in this book will use Intel and AT&T syntax interchangeably. We will typically not use HLA syntax for code examples, but that may change in the future.
Here we are going to list some commonly available disassembler tools. Notice that there are professional disassemblers (which cost money for a license) and there are freeware/shareware disassemblers. Each disassembler will have different features, so it is up to you as the reader to determine which tools you prefer to use.
Many of the Unix disassemblers, especially the open source ones, have been ported to other platforms, like Windows (mostly using MinGW or Cygwin). Some Disassemblers like otool ([OS X) are distro-specific.
As we have alluded to before, there are a number of issues and difficulties associated with the disassembly process. The two most important difficulties are the division between code and data, and the loss of text information.
Since data and instructions are all stored in an executable as binary data, the obvious question arises: how can a disassembler tell code from data? Is any given byte a variable, or part of an instruction?
The problem wouldn't be as difficult if data were limited to the .data section (segment) of an executable (explained in a later chapter) and if executable code were limited to the .code section of an executable, but this is often not the case. Data may be inserted directly into the code section (e.g. jump address tables, constant strings), and executable code may be stored in the data section (although new systems are working to prevent this for security reasons). AI programs, LISP or Forth compilers may not contain .text and .data sections to help decide, and have code and data interspersed in a single section that is readable, writable and executable, Boot code may even require substantial effort to identify sections. A technique that is often used is to identify the entry point of an executable, and find all code reachable from there, recursively. This is known as "code crawling".
Many interactive disassemblers will give the user the option to render segments of code as either code or data, but non-interactive disassemblers will make the separation automatically. Disassemblers often will provide the instruction AND the corresponding hex data on the same line, shifting the burden for decisions about the nature of the code to the user. Some disassemblers (e.g. ciasdis) will allow you to specify rules about whether to disassemble as data or code and invent label names, based on the content of the object under scrutiny. Scripting your own "crawler" in this way is more efficient; for large programs interactive disassembling may be impractical to the point of being unfeasible.
The general problem of separating code from data in arbitrary executable programs is equivalent to the halting problem. As a consequence, it is not possible to write a disassembler that will correctly separate code and data for all possible input programs. Reverse engineering is full of such theoretical limitations, although by Rice's theorem all interesting questions about program properties are undecidable (so compilers and many other tools that deal with programs in any form run into such limits as well). In practice a combination of interactive and automatic analysis and perseverance can handle all but programs specifically designed to thwart reverse engineering, like using encryption and decrypting code just prior to use, and moving code around in memory.
User defined textual identifiers, such as variable names, label names, and macros are removed by the assembly process. They may still be present in generated object files, for use by tools like debuggers and relocating linkers, but the direct connection is lost and re-establishing that connection requires more than a mere disassembler. Especially small constants may have more than one possible name. Operating system calls (like DLLs in MS-Windows, or syscalls in Unices) may be reconstructed, as their names appear in a separate segment or are known beforehand. Many disassemblers allow the user to attach a name to a label or constant based on his understanding of the code. These identifiers, in addition to comments in the source file, help to make the code more readable to a human, and can also shed some clues on the purpose of the code. Without these comments and identifiers, it is harder to understand the purpose of the source code, and it can be difficult to determine the algorithm being used by that code. When you combine this problem with the possibility that the code you are trying to read may, in reality, be data (as outlined above), then it can be even harder to determine what is going on. Another challenge is posed by modern optimising compilers; they inline small subroutines, then combine instructions over call and return boundaries. This loses valuable information about the way the program is structured.
7fc3f7cf58