Recently I realised that, as part of his 8086 reverse-engineering series, Ken Shirriff had posted online a high-resolution photograph of the 8086 die with the metal layer removed. I had been looking for something like this for some time, in order to extract and disassemble the 8086 microcode. I had previously found very high-resolution photos of the die with the metal layer intact, but only half of the bits of the microcode ROM were readable in those. Ken also posted a high-resolution photograph of the microcode ROM of the 8088, which is very similar but not identical. I was very curious to know what the differences were.
The microcode is partially documented in US patent 4363091. In particular, that patent has source listings for several microcode routines. Within these, there are certain patterns of parts of instructions which I was able to find in the ROM dump. This allowed me to figure out how the bit patterns in the ROM correspond to the operands and opcodes of the microcode instruction set, in a manner similar to cracking a monoalphabetic substitution cipher. My resulting disassembly of the microcode ROM can be found here and the code for my disassembler is on github.
The differences are in the interrupt handling code. I think it comes down to the fact that the 8086 does two special bus accesses to acknowledge an interrupt (one to tell the PIC that it is ready to service the interrupt, the second to fetch the interrupt number for the IRQ that needs to be serviced). These are word-sized accesses for some reason, so the 8088 would break them into four accesses instead of two. This would confuse the PIC, so the 8088 does a single access instead and relies on the BIU to split the access into two. The other changes seem to be fallout related to that.
Mostly. There are differences, however (which added some complexity to the deciphering process). The differences are in the string instructions. For example, the "STS" (STOSB/STOSW) instruction in the patent is:
The arrow isn't a difference - I just put that in my disassembly to emphasize the direction of data movement in the "move" part of the microcode instructions. Likewise, the "F1 1" in the patent listing is the same as the "F1 RPTS" in my disassembly - I have replaced subroutine numbers with names to make it easier to read.
The version in the patent does a check for pending interrupts in the "RPTS" routine, before it processes any iterations of the string. This means that if there is a continuous "storm" of interrupts, the string instruction will make no progress. The version in the CPU corrects this, and checks for interrupts on line 3, after it has done the store, allowing it to progress. This was probably not a situation that was expected to occur in normal operation (in fact, I seem to recall crashing my 8088 and 8086 machines by having interrupts happen too rapidly to be serviced). The change was most likely done to accommodate debugging with the trap flag (which essentially means that there is always an interrupt pending when the trap flag is set). Without this change, code that used the repeated string instructions would not have progressed under the debugger.
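The effect of moving the interrupt check can be illustrated with a toy model. This is only a sketch under stated assumptions: `run_rep_stos`, its parameters and the restart limit are hypothetical names invented for the illustration, an interrupt is assumed to be pending at every check (the "storm" or trap flag case), and an interrupted instruction is assumed to be restarted with whatever count remains.

```python
def run_rep_stos(count, check_before_store, max_restarts=1000):
    """Toy model of a repeated string store under a continuous interrupt storm.

    Hypothetical illustration, not the real microcode. Returns the number of
    times the instruction had to be (re)started before the count reached
    zero, or None if it made no progress within max_restarts attempts.
    """
    restarts = 0
    while count > 0:
        restarts += 1
        if restarts > max_restarts:
            return None  # no forward progress: the instruction is starved
        if check_before_store:
            # Patent-style ordering: the pending interrupt is noticed before
            # any element is stored, so the count is unchanged on restart.
            continue
        # CPU-style ordering: store one element and decrement the count,
        # and only then notice the pending interrupt and restart.
        count -= 1
    return restarts


print(run_rep_stos(5, check_before_store=True))   # None (never completes)
print(run_rep_stos(5, check_before_store=False))  # 5 (one element per restart)
```

In this model the check-after-store ordering completes one iteration per interrupt, which is the behaviour a debugger single-stepping with the trap flag would need.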
The discontinuous instructions were most likely broken up because they had bug fixes making them too long for their original slots. Similarly "POP rmw" appears to have been shortened by at least 3 instructions as there is a gap after it. Moving code around after it's been written (and updating all the far jump/call locations) would probably have been tricky.
There is no microcode for the segment override prefixes (CS:, SS:, DS: and ES:). Nor for the other prefixes (REP, REPNE and LOCK), nor the instructions CLC, STC, CLI, STI, CLD, STD, CMC, and HLT. The "group" opcodes 0xf6, 0xf7, 0xfe and 0xff do not have top level microcode instructions. So none of the instructions with 0xf in the high nybble of the opcode are initially handled by the microcode. Most of these instructions are very simple and probably better done by random logic. HLT is a little surprising - I really thought I'd find a microcode loop for that one since it only seems to check for interrupts every other cycle.
There doesn't appear to be any way for execution to reach these instructions. This code saves AL to tmpa (which doesn't appear to then be used at all) and then does either an interrupt or (if an interrupt is pending) a far call. In the interrupt case it also does a move between a source and a destination that aren't used anywhere else (and hence I have no idea what they are). This makes me wonder if there was at one point a plan for something like an "INT AL" instruction. With the x86 instruction set we ended up with, such a thing has to be done using self-modifying code, a table of INT instructions, or faking the operation of INT in software.
When the WAIT instruction finishes in the non-interrupt case (i.e. by the -TEST pin going active to signal that the 8087 has completed an instruction) the microcode sequence finishes using this sequence:
There is also a bit (shown as "Q" in the listings) which does not have an obvious function for "type 6" (bus IO) operations. This Q bit is only set for "W" (write) operations; in the listing, writes without it are shown in lower case ("w"). There seems to be no pattern as to which writes use this bit. The string move instructions use it, as does the stack push for the flags when an interrupt occurs, and the push of the segment for a far call or interrupt (but not the offset). It would make sense if this bit was used to distinguish between memory and port IO bus accesses, but the CPU seems to have another mechanism for this (most likely the group decode ROM, which I have not decoded as there are too many unknowns about what its inputs and outputs are).
Despite many of the instructions seeming to execute quite ponderously by the standards of later CPUs, the microcode appears to be very tightly written and I didn't find many opportunities for improvement. If the MOVS/LODS opcode was split up into separate microcode routines for LODS and MOVS, the LODS routine could avoid a conditional jump and execute 1 cycle faster. But there is only room for that because of the "POP rmw" shortening, which may have happened quite late in the development cycle (especially if it was a functional bug fix rather than an optimisation - optimisations might not have met the bar at that point).
There may be places where prefetching could be suspended earlier before a jump, but it's not quite so obvious that that would be an optimisation. Especially if the "suspend" operation is synchronous, and waits for the BIU to complete the current prefetch cycle before continuing the microcode program. And especially if that would make the microcode routine longer.
It would of course be possible to make improvements if the random logic is changed as well. The NEC V20 and V30 implement the same instructions at a generally lower number of cycles per instruction, but they have 63,000 transistors instead of 29,000 so probably have a much larger proportion of random logic to microcode.
It does! Using the REP or REPNE prefix with a MUL or IMUL instruction negates the product. Using the REP or REPNE prefix with an IDIV instruction negates the quotient. As far as I know, nobody has discovered these before (or at least documented them).
Signed multiplication and division work by negating negative inputs and then negating the output if exactly one of the inputs was negative. That means that the CPU needs to remember one bit of state (whether or not to negate the output) across the multiplication and division algorithms. But these algorithms use all three temporary registers, and the internal counter, and the ALU (so the bit can't be put in the internal carry flag for example). I was scratching my head about where that bit might be kept. I was also scratching my head about why the multiplication and division algorithms check the F1 ("do we have a REP prefix?") flag. Then I realised that these puzzles cancel each other out - the CPU flips the F1 flag for each negative sign in the multiply/divide inputs! There's already a microcode instruction to check for that, so the 8086's designers just needed to add an instruction to flip it.
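The sign-tracking trick can be modelled in a few lines. This is a behavioural sketch only, not a transcription of the microcode: `imul16`, `to_signed` and the `rep_prefix` parameter are names made up for the illustration, and the model assumes F1 starts out set exactly when a REP or REPNE prefix is present, is flipped once per negative input, and triggers a final negation if it ends up set.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def imul16(a, b, rep_prefix=False):
    """Behavioural model of a 16-bit signed multiply (hypothetical helper).

    Returns the signed 32-bit product as it would appear in DX:AX.
    """
    f1 = rep_prefix          # assumption: F1 initially reflects the prefix
    a = to_signed(a, 16)
    b = to_signed(b, 16)
    if a < 0:                # negate a negative input and flip F1
        a = -a
        f1 = not f1
    if b < 0:
        b = -b
        f1 = not f1
    product = a * b          # unsigned multiply of the magnitudes
    if f1:                   # negate the result if F1 ended up set
        product = -product
    return to_signed(product, 32)


print(imul16(3, -4))                    # -12
print(imul16(3, -4, rep_prefix=True))   # 12
print(imul16(3, 4, rep_prefix=True))    # -12
```

With no prefix the model gives the ordinary signed product; with the prefix the starting state of F1 is inverted, so the product comes out negated.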
I was thinking the microcode instruction might set the F1 flag instead of flipping it - that would mean that you could get a (probably negated) "absolute value" operation (almost) for free with a multiply. But an almost-free negation is pretty good too - REP is a byte cheaper than "NEG AX", and with 16-bit multiplies the savings are even greater (it eliminates a NEG AX / ADC DX, 0 / NEG DX sequence). Still small compared to the multiply, but a savings nonetheless.
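For reference, here is a small model of what the NEG AX / ADC DX, 0 / NEG DX sequence computes: the two's complement negation of the 32-bit value in DX:AX. The helper names `neg16` and `negate_dx_ax` are hypothetical; the only CPU behaviour assumed is that NEG sets the carry flag whenever its operand is non-zero and that ADC adds the carry flag in.

```python
def neg16(x):
    """Model of a 16-bit NEG: returns (result, carry flag)."""
    result = (-x) & 0xFFFF
    carry = 1 if x != 0 else 0   # NEG sets CF unless the operand was zero
    return result, carry


def negate_dx_ax(dx, ax):
    """Negate the 32-bit value DX:AX using NEG AX / ADC DX, 0 / NEG DX."""
    ax, carry = neg16(ax)        # NEG AX
    dx = (dx + carry) & 0xFFFF   # ADC DX, 0
    dx, _ = neg16(dx)            # NEG DX
    return dx, ax


dx, ax = negate_dx_ax(0x0000, 0x000C)   # DX:AX = 12
print(hex(dx), hex(ax))                 # 0xffff 0xfff4, i.e. -12
```

The carry out of NEG AX acts as the borrow into the high word, which is why the ADC has to come before NEG DX.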
I contemplated using this in a demoscene production as another "we break all your emulators" moment, but multiplication and division on the 8086 and 8088 CPUs are sufficiently slow to be of limited use for demos.
The F1ZZ microcode instruction (which controls whether the REPE/REPNE SCAS/CMPS sequences terminate early) is also used in the LOOPE and LOOPNE instructions. Which made me wonder if one of the REP prefixes would also reverse the sense of the test. However, neither prefix seems to have any effect on these instructions.
I've made a new version of the disassembly here incorporating some changes from the comments below. I have transcribed the group ROM, got rid of "NWB", added the RNI flag to W microinstructions, and changed XZC to ADC.
Hi, I know this is quite a bit later, but I'm writing an emulator for the 8088 that interprets the microcode itself. However, I seem to have hit a wall for the shift rm8,cl instructions. How does the CPU determine where in the microcode to jump for these instructions? I vaguely understand the bottom three bits of the address are set to modrm.reg, but that would jump to the middle of a microcode routine, wouldn't it?
Thank you so much! I'm not sure how to submit corrections to the ZIP documentation, but I have two suggestions: the unknown difference between w and W micro-instructions could be that the lower case w ops do not terminate the instruction and W does terminate. The unknown bit seems to indicate a hidden RNI. I'm not 100% sure, but reading through the microcode this seems to hold up. The other suggestion is key.txt line 50 col 78, p could be renamed i to match the bitfield description.