[llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions

161 views
Skip to first unread message

Min-Yih Hsu via llvm-dev

unread,
Dec 5, 2021, 10:34:28 PM12/5/21
to llvm-dev
(This is a long proposal. If you prefer, here is the web version: https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b)

## Background
CodeEmitterGen is a TableGen backend that generates instruction encoder functions for `MCCodeEmitter` from a concise TableGen syntax. It is, however, almost exclusively designed for targets that use fixed-length instructions. It's nearly impossible to use this infrastructure to describe instruction encoding scheme for ISAs with variable-length instructions, like X86 and M68k.

To have a better understanding on this problem, let's look at an example. For a fixed-length instruction ISA, developers write the following TableGen syntax to describe an instruction encoding:
```
class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction {
    bits<32> Inst;

    bits<4> dst;
    bits<16> imm;
    let Inst{31-28} = 0b1011;
    ...
    let Inst{19-16} = dst;
    let Inst{15-0} = imm;
}
```
The `Inst` field tells us the length of an instruction -- 32 bits in this case. Each bit in this field describes the encoded value, which is either a concrete value or symbolic one like `dst` and `imm` in the example above. The `dst` and `imm` variables correspond to the output operand (`$dst`) and the second input operand (`$imm`), respectively. Meaning, the encoder function (generated by CodeEmitterGen) will eventually insert the encoding for these two operands into the right bit ranges (bit 19\~16 for `dst` and 15\~0 for `imm`).

Though this TableGen syntax fits well for fixed-length instructions, it imposes some difficulties to instructions with variable length and memory poerands with complex addressing modes:
  1. The bit width of the `Inst` field is fixed. Though we can declare the field with maximum instruction size in the ISA, it requires extra code to adjust the final instruction size.
  2. Operand encoding can only be placed at fixed bit positions. However, the size of an operand in a variable-length instruction might vary.
  3. In the situation where a single logical operand is consisting of multiple MachineOperand-s / MCOperand-s, the current syntax cannot reference a sub-operand. Which means we can only reference the entire logical operand at places where we actually should put sub-operands. Making the TG code less readable and bring more burden to the operand encoding functions (because they don't know which sub-operand to encode).

In short, we need a more flexible CodeEmitterGen infrastructure for variable-length instructions: describe the instruction encoding in a "position independent" fashion and be able to reference sub-operands with ease.

## Proposal
We propose a new infrastructure, called VarLenCodeEmitterGen, to solve the aforementioned shortcomings. It is consisting of new TableGen syntax and some modifications to the existing CodeEmitterGen TableGen backend.

Suppose we are dealing with an instruction `MyVarInst`:
```
class MyMemOperand<dag sub_ops> : Operand<iPTR> {
    let MIOperandInfo = sub_ops;
}

class MyVarInst<MyMemOperand memory_op> : Instruction {
    let OutOperandList = (outs GR64:$dst);
    let InOperandList  = (ins memory_operand:$src);
}
```
It has the following encoding format:
```
15 8 0
----------------------------------------------------
| Fixed bits | Sub-operand 0 in source operand |
----------------------------------------------------
X 16
----------------------------------------------------
| Sub-operand 1 in source operand |
----------------------------------------------------
X + 4 X + 1
------------------------------------
| Destination register |
------------------------------------
```
We have two different kinds of memory operands:
```
def MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>;
def MemOp16 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>;

def FOO16 : MyVarInst<MemOp16>;
def FOO32 : MyVarInst<MemOp32>;
```
So the size of `FOO16` and `FOO32` will be 36 and 52 bits, respectively.

To express the encoding, first, we modify `MyVarInst` and `MyMemOperand`:
```
class MyMemOperand<dag sub_ops> : Operand<iPTR> {
    let MIOperandInfo = sub_ops;
    dag Base;
    dag Extension;
}

class MyVarInst<MyMemOperand memory_op> : Instruction {
    dag Inst;

    let OutOperandList = (outs GR64:$dst);
    let InOperandList  = (ins memory_op:$src);

    let Inst = (seq
        (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base),
        memory_op.Extension,
        // Destination register
        (operand "$dst", 4)
    );
}
```
Then, we use a slightly different representation for `MemOp16` and `MemOp32`:
```
class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg, i16imm:$offset)> {
    let Base = (operand "$"#op_name#".reg", 8);
    let Extension = (operand "$"#op_name#".offset", 16);
}

class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg, i32imm:$offset)> {
    let Base = (operand "$"#op_name#".reg", 8);
    let Extension = (operand "$"#op_name#".offset", 32);
}

def FOO16 : MyVarInst<MemOp16<"src">>;
def FOO32 : MyVarInst<MemOp32<"src">>;
```

This new TableGen syntax uses `dag` rather than `bits<N>` for the `Inst` field. Allowing instructions to place their operand (and sub-operand) encodings without worrying about the actual bit positions. The new syntax is underpinned by two new DAG operators: `seq` and `operand`.

The `seq` operator sequentially places its arguments -- fragments of encoding -- from LSB to MSB. If the operator is "tagged" by `$dec`, it goes from MSB to LSB instead. The `operand` operator references the encoding of an operand. Its first DAG argument is a string referencing the name of an operand in either `InOperandList` or `OutOperandList` of an instruction. We can also reference an sub-operand using syntax like `$<operand name>.<sub-operand name>`. The second DAG argument for `operand` is the bit width of the encoded operand. The other variant of `operand` is having two arguments instead of one that follow the operand referencing string. More specifically:
```
(operand "$src.reg", 8, 4)
```
In this case, 8 and 4 represents a bit range -- high bit and low bit, respectively -- to the encoded `$src.reg` operand.

Finally, a new sub-component added to the existing CodeEmitterGen TableGen backend, VarLenCodeEmitterGen, will turn the above syntax into a C++ encoder function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses the same mechanism as the fixed-length instruction version (except few details, like it always uses APInt to store the result).

We think the proposed solution has the following advantages:
  - Flexible and versatile in terms of expressing instruction encodings.
  - The TableGen syntax is easy to read, write and understand.
  - Only adds a few new TableGen syntax.
  - Tightly integrated with the existing CodeEmitterGen.

### Previous approaches
Both X86 and M68k -- the only two LLVM targets with variable-length instructions -- are using custom instruction encoders. X86 leverages TSFlags in `MCInst` to carry encoding info. Simply speaking, X86 enumerates and numbers every possible combinations of operands and stores the corresponding index into a segment of TSFlags for an instruction. This approach, of course, requires none trivial amount of workforce to maintain.

M68k, on the other hand, uses an obscured infrastructure called code beads. It is conceptually similar to the VarLenCodeEmitterGen we're proposing here -- concatenating encoding fragments. Except that the syntax is bulky and it uses too many specialized TableGen infrastructures, including a separate TableGen backend, that make the maintainence really really hard.

## Patches
TableGen modifications: https://reviews.llvm.org/D115128

## FAQ
  - Do I need to toggle some flags -- either a command line flag or a TableGen bit field -- to use the new code emitter scheme?
    - No, having a `dag` type `Inst` field will automatically opt-in this new code emitter scheme.
  - Can I adopt this for fixed-length instructions?
    - Absolutely yes. But it's not recommended because CodeEmitterGen can generate more optimal encoder functions for fixed-length instructions. The TableGen syntax of CodeEmitterGen makes more sense for fixed-length instructions, too.
  - Can X86 adopt this infrastructure?
    - Theoritically, yes (In practice? I dunno).
  - What about the disassembler? Can we TableGen-enerate the corresponding disassembling functions?
    - Since we have a structural description of the encoded instruction, it's probably easier to create a disassembler from the new TableGen syntax. But I haven't worked on that yet.

Ricky Taylor via llvm-dev

unread,
Dec 7, 2021, 5:07:56 AM12/7/21
to Min-Yih Hsu, llvm-dev
From an M68k point-of-view, I like this. I'm a little uneasy about the `operand` node though. Is it possible that there will ever be an operand type which needs multiple encodings?

I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like `(seq (flip 0b0101010) (slice my_operand.Base, 4, 8))`, be possible?

As you say, I suspect that implementing an interpreter-style disassembler generator like the fixed length one would be fairly straight-forward (and much better than the M68k disassembler implementation I provided).

Ricky,

Min-Yih Hsu via llvm-dev

unread,
Dec 7, 2021, 6:55:05 AM12/7/21
to Ricky Taylor, llvm-dev

Thanks for the feedbacks!

On Dec 7, 2021, at 6:07 PM, Ricky Taylor <rickyt...@gmail.com> wrote:

From an M68k point-of-view, I like this. I'm a little uneasy about the `operand` node though. Is it possible that there will ever be an operand type which needs multiple encodings?

Do you mean an operand or sub-operand might have different encodings depend on its context / bit positions? I don’t know about other (future) architectures, M68k doesn’t have this issue: Each memory operand is consisting of (multiple) primitive construction(s) like registers and immediate, that is, sub-operands. While a sub-operand might be placed in arbitrary positions, each sub-operand has a fixed encoding.

I can think of a solution for operands / sub-operands whose encodings depend on its context: passing custom encoder function name into `operand`. So if you write:
```
(operand “$foo”, 4, “encodeFoo”)
```
The foo operand at this particular bit position will be encoded by `encodeFoo`.
(The current CodeEmitterGen has a similar stuff — via the `EncoderMethod` field in each operand. But that one is bit-position-invariant. So an operand can only use a single custom encoder regardless of its context / bit-position)
BTW, this again shows the advantage of using DAG for carrying encoding info: It’s easy to augment with new parameters / features.


I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like `(seq (flip 0b0101010) (slice my_operand.Base, 4, 8))`, be possible?

IIUC, you’re proposing to replace`(seq (seq:$dec 0b0101010), (operand “…”, 4, 8))` with the new syntax.

I think it’s a good idea because then we’re not overloading a single DAG operator. It’s trivial to change, too.

Nevertheless, I haven’t implemented `seq:$dec` for fixed bits value so actually you can’t write `(seq:$dec 0b0101010)` (This is not hard to implement though).


As you say, I suspect that implementing an interpreter-style disassembler generator like the fixed length one would be fairly straight-forward (and much better than the M68k disassembler implementation I provided).

Good to hear!

Cheers,
-Min

Ricky Taylor via llvm-dev

unread,
Dec 7, 2021, 9:15:01 AM12/7/21
to Min-Yih Hsu, llvm-dev
On Tue, 7 Dec 2021 at 11:54, Min-Yih Hsu <min...@uci.edu> wrote:

Thanks for the feedbacks!

On Dec 7, 2021, at 6:07 PM, Ricky Taylor <rickyt...@gmail.com> wrote:

From an M68k point-of-view, I like this. I'm a little uneasy about the `operand` node though. Is it possible that there will ever be an operand type which needs multiple encodings?

Do you mean an operand or sub-operand might have different encodings depend on its context / bit positions? I don’t know about other (future) architectures, M68k doesn’t have this issue: Each memory operand is consisting of (multiple) primitive construction(s) like registers and immediate, that is, sub-operands. While a sub-operand might be placed in arbitrary positions, each sub-operand has a fixed encoding.
Yes, I was thinking about the solution in a general sense. For M68k we're good either way. :)
 

I can think of a solution for operands / sub-operands whose encodings depend on its context: passing custom encoder function name into `operand`. So if you write:
```
(operand “$foo”, 4, “encodeFoo”)
```
The foo operand at this particular bit position will be encoded by `encodeFoo`.
(The current CodeEmitterGen has a similar stuff — via the `EncoderMethod` field in each operand. But that one is bit-position-invariant. So an operand can only use a single custom encoder regardless of its context / bit-position)
BTW, this again shows the advantage of using DAG for carrying encoding info: It’s easy to augment with new parameters / features.
This sounds good to me. (And I very much agree, using a DAG for this stuff is very convenient.)
I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like `(seq (flip 0b0101010) (slice my_operand.Base, 4, 8))`, be possible?

IIUC, you’re proposing to replace`(seq (seq:$dec 0b0101010), (operand “…”, 4, 8))` with the new syntax.

I think it’s a good idea because then we’re not overloading a single DAG operator. It’s trivial to change, too.
I guess you could allow custom encoding with something like `(encode "SpecialMode", "$foo")` (which could call `encodeSpecialMode`, or similar).

Either way, I'm pretty satisfied. :)

Min-Yih Hsu via llvm-dev

unread,
Dec 9, 2021, 9:04:42 AM12/9/21
to llvm-dev
FYI: As a preview for this new CodeEmitterGen component, I’ve refactored some of the M68k instructions using VarLenCodeEmitterGen: https://reviews.llvm.org/D115234 

-Min
Reply all
Reply to author
Forward
0 new messages