Last year our team had to analyze V8 bytecode. Back then, there were no tools in place to decompile such code and facilitate convenient navigation over it. We decided to try writing a processor module for the Ghidra framework. Thanks to the features of the language used to describe the output instructions, we obtained not only a readable set of instructions, but also a C-like decompiler. This article is a continuation of the series (1, 2) on our Ghidra plugin.
You can see in the documentation that the processor modules for Ghidra are written in SLEIGH, a language derived from the Specification Language for Encoding and Decoding (SLED), which was developed specifically for Ghidra. It translates machine code into p-code (the intermediate language that Ghidra uses to build decompiled code). As a language for describing processor instructions, it has a lot of limitations, although they can be reduced with the p-code injection mechanism implemented as Java code.
The source code of the new processor module is presented on github. This article will review the key concepts that are used in the development of the processor module using pure SLEIGH, with certain instructions as examples. Working with the constant pool, p-code injections, analyzer, and loader will be or have already been reviewed in other articles. Also you can read more about analyzers and loaders in The Ghidra Book: The Definitive Guide.
The work will require an installed Eclipse IDE, to which the plugins that come supplied with Ghidra (GhidraDev and GhidraSleighEditor) need to be added. Then the Ghidra Module Project with the name v8_bytecode is created. The new project contains template files that are important for the processor module; we will modify them to suit our needs.
For a general overview of the files we will need to work with, we can refer to the official documentation or the recently released The Ghidra Book: The Definitive Guide by Chris Eagle and Kara Nance. Here is a description of these files.
After the first launch of your project, a file with the .sla extension will also appear. It is generated based on the SLASPEC file.
Before getting down to developing the processor module, you need to work out what registers there are, how to work with them, with the stack, etc., for the chosen processor or interpreter.
The JSC file that interests us was built using the JavaScript Node.js 8.16.0 runtime environment via bytenode (this module will either come supplied with Node.js or it will need to be installed via npm). In essence, bytenode uses the documented functionality of Node.js to create the compiled file. This is the source code of the function that compiles JSC files from JS:
It is worth noting that SLEIGH is not totally suited for interpretable bytecodes of this kind, so the written processor module has certain limitations. For example, work is defined with no more than 124 rN registers and 125 aX registers. We attempted to resolve this problem using a stack-based communication model with registers, as it was more in line with the overall concept. However, in this instance, the disassembled bytecode was harder to read:
As mentioned above, arguments are passed to the function through the aX registers. In a module, registers must be defined as a contiguous sequence of bytes at offset in some space. As a rule, in such cases, a specially designated space named register is used. However, in theory, there is nothing to stop you from using any other space. When there are a large number of registers performing more or less the same functions, it is simplest not to denote each of them separately, rather you can simply specify the offset with a set of registers in a space and use that to define them. Therefore, we mark the memory region in the register space (space="register") in the tag for registers through which the arguments are passed into functions, at offset 0x14000 (0x14000 is not set in stone; this is just an offset at which the aX registers will be defined in *.slaspec below).
By default, the result of the function call is saved in the accumulator (acc), which needs to be denoted in the tag. For alternative register variations, where the values returned by the functions are stored, we can define the logic when describing instructions. In the tag, we note that the function calls have no impact on the register that stores the stack pointer.
Moving on to the language definition: this is a file with the .ldefs extension. It requires a little information for describing: the byte order (we have le), names of key files (* .sla, * .pspec, and *.cspec), id, and bytecode name, which will be displayed in the list of supported processor modules when importing a file into Ghidra. If we ever need to add a processor module for a file compiled by a version of Node.js that is significantly different from the current one, then we describe it there by creating another tag, as it is done to describe processor families in * .ldefs modules supplied as part of Ghidra.
The situation with the processor specification (file with the .pspec extension) is more complicated in terms of documentation. In this case, you can turn to ready-made solutions within the framework itself or to the processor_spec.rxg file (we have not looked at an option with a full-scale parsing of Ghidra source codes). At the time when the module was written, there was nothing more detailed available. In time, the developers will probably publish official documentation.
The address spaces that we will need to describe the bytecode (we created spaces with the names register and ram) shall be defined by way of define space, while the registers, by way of define register. The offset value in the definition of the registers is not of fundamental importance; what is most important is that they are located on different offsets. The number of bytes that is taken up by the registers is defined by the size parameter. Remember that the information defined here must correspond to calls to similar abstractions and values within *.cspec and the analyzer, if you refer to these registers.
Initially, there is a root instruction table (the identifier instruction is attached to it), all the described constructors with an empty table header become a part of it, and the first identifier of these constructors in the display section is recognized as a mnemonic of the instructions.
Bit fields need to be defined to describe the instruction constructors. Through them, the bits of the program become tied to certain abstractions of the language into which the translation will take place. For example, mnemonics or operands can be such abstractions. The fields are defined during token definition; the syntax of their definition is as follows:
The size of the token (tokenMaxSize) must be a multiple of 8. This might be inconvenient if the operands or nuances for the instruction are coded using a smaller number of bits. On the other hand, this is compensated by the ability to create fields of different sizes, positionally decoding any bits within the size specified by the token. The following conditions must be observed for such fields: startBitNumX and endBitNumX are in the range from 0 to tokenMaxSize-1 inclusive and startBitNumX
Note: even if you use a field or a combination of fields in the bit pattern section that do not cover the full size of the token to which they belong with their bitmasks, the number of bits determined by the token size will still be taken from the program bytes for the operand.
For the AddSmi instruction, in the bit pattern section, in addition to the op field that sets the limit on the opcode, the fields appear from the operand token, divided by a semicolon (;). They will be placed in the display section as operands. Real bits are mapped in the order in which the fields are specified in the bit pattern section. We implement the instruction logic in the semantics actions section, using the documented operations; this is what an interpreter would do when fulfilling these instructions.
This is what the listing window looks like with the described instructions with the PCode field enabled in the Edit the Listing fields menu of the Ghidra listing. The decompiler window will not yet be indicative due to code optimization, so, at this stage, it is worth focusing only on the intermediate p-code operations.
In V8 bytecode, all instructions whose mnemonics start with lda imply loading a value into the accumulator. However, if you want to clearly see the acc register in the bytecode instructions, you can define it for display in the constructor. In this case, it makes no difference in which position in the bit pattern section the acc is located, since it is not a field and does not occupy instruction bits, although it does need to be written in.
The instructions for returning values from the function are implemented using the return keyword and, as already mentioned, the value is returned most often through the accumulator, when calling the function:
The instructions we have described above do not use registers as operands. Now, though, we implement the constructor of the Mul instruction, under which the accumulator is multiplied by the value that is stored in the register.
In the Basic Definitions and Macros of the Preprocessor code section, the registers were already declared, but in order for the necessary registers to be selected depending upon the bits presented in the bytecode, we need to bind their list to the corresponding bitmasks. The kReg field is 8 bits in size. Using the attach variables construction, we sequentially define to which bitmasks, from 0b to 11111111b, the aforementioned registers will correspond as part of the subsequent use of fields from the set list (only kReg in our case) in the constructors. For example, it is evident in this description that the operand coded as 0xfb (11111011b) is interpreted as the r0 register when it is described through kReg.
We encounter conditional and unconditional transitions in the bytecode via relative offsets from the current address. To jump to an address or label in SLEIGH, we use the goto keyword. It is noteworthy for the definition that the bit pattern section uses the kUImm field, although it is not used in its pure form. The value calculated in the disassembly actions section is exported to the display section through the rel identifier. inst_start is a predefined symbol for SLEIGH and contains the address of the current instruction.
7fc3f7cf58