Looking for examples of a machine instruction decoder implementation in ANTLR


Grigory Rechistov

Mar 16, 2015, 4:03:19 PM3/16/15
to antlr-di...@googlegroups.com
Dear ANTLR experts,

My question is about a rather unusual use of parser generators. While decoding machine instructions in software has a lot in common with parsing "regular" languages, I have found no examples of or attempts at implementing such a decoder with LEX/YACC, ANTLR, or another parser generator. At the same time, I am aware of several specialized DSL translators tailored to creating decoders, and none of them, to my knowledge, is based on a parser generator.

The second step of a machine code interpreter's operation, after fetching a raw word from memory, is to decode it: run a routine that accepts a machine word of a given width and parses it into an intermediate representation that facilitates subsequent simulation of the instruction's semantics. Usually such an IR contains aggregated information about the instruction's function (opcode) and its arguments (operands). A related task is disassembly: translating a machine word into a string of human-readable text.

During decoding, a typical machine instruction of a RISC system (MIPS, ARM, SPARC, Alpha, etc.) is split into several bit fields. The values of those bit fields are tested against a set of checks to determine whether the instruction is known and which arguments are present. The position of some fields may depend on values found in other fields, so decoding is usually done sequentially from the first bit to the last. Machine encodings are guaranteed to be prefix codes, so no ambiguity can arise.
Decoding CISC instructions (such as IA-32, a.k.a. x86) is somewhat harder (in my experience, significantly so), as those architectures have variable-length instructions, but essentially it is the same task.
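To illustrate the RISC case, the bitfield splitting can be sketched in a few lines. This is a simplified Python sketch covering only the MIPS R- and I-type layouts; the field widths follow the MIPS encoding, everything else is for illustration:

```python
def decode_mips(word):
    """Split a 32-bit MIPS word into fields and classify it (simplified)."""
    opcode = (word >> 26) & 0x3F
    if opcode == 0:                      # R-type: function is in the low 6 bits
        return {"fmt": "R",
                "rs": (word >> 21) & 0x1F,
                "rt": (word >> 16) & 0x1F,
                "rd": (word >> 11) & 0x1F,
                "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    # Treat everything else as I-type here: 16-bit immediate in the low half
    return {"fmt": "I",
            "opcode": opcode,
            "rs": (word >> 21) & 0x1F,
            "rt": (word >> 16) & 0x1F,
            "imm": word & 0xFFFF}

# `addu $v0, $a0, $a1` encodes as 0x00851021
ir = decode_mips(0x00851021)
```

Note that this is exactly the "split into bit fields, test the values" process: the opcode field selects the layout, and the remaining bits are extracted according to it.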

It is no surprise that writing a decoder by hand is a tedious and thankless task. Several translators have been written [1-4] to simplify this and adjacent tasks. All of them accept a machine language description in a certain DSL and generate code capable, among other things, of decoding that language.

To my knowledge, none of these tools is based on a parser generator. But what they do is mainly parsing, and I find this strange. I am looking for projects that can disprove my observation. Alternatively, I am trying to formulate a logical explanation to support it. So far I have been able to imagine two possibilities:

1. Machine language parsing is really an unsuitable task for general-purpose parser generators such as ANTLR, due to the low-level nature of the language. Indeed, I cannot imagine a lexer for machine code producing tokens other than '0' and '1'. A scanner built on just these two tokens might turn out to be clumsy, gigantic, or inefficient.

2. Folks working in the system simulation field are not aware of parser generators, or are too conservative to use them and stick to their own tools. Indeed, the very idea of trying to create a machine decoder in ANTLR or a similar tool only recently occurred to me. Until then I had always resorted to manually written decoders or ad-hoc translators.

Before trying to implement a decoder for some simple architecture (like MIPS) using ANTLR, I want to ask more experienced community members whether it is worth trying. Any opinion is welcome! Thank you!

[1] Oliver Schliebusch, Andreas Hoffmann, Achim Nohl, Gunnar Braun, and Heinrich Meyr. Architecture implementation using the machine description language LISA. In: Proceedings of the 2002 Asia and South Pacific Design Automation Conference / VLSI Design (ASP-DAC '02), pp. 239–244.
[2] G. Hadjiyiannis, S. Hanono, and S. Devadas. ISDL: An Instruction Set Description Language for Retargetability. In: Proceedings of the 34th Design Automation Conference (1997), pp. 299–302.
[3] Fredrik Larsson, Peter Magnusson, and Bengt Werner. SimGen: Development of Efficient Instruction Set Simulators. SICS Technical Report R97:03.
[4] Norman Ramsey and Mary F. Fernandez. The New Jersey Machine-Code Toolkit. In: Proceedings of the 1995 USENIX Technical Conference.

Loring Craymer

Mar 16, 2015, 6:39:21 PM3/16/15
to antlr-di...@googlegroups.com
Instruction decode breaks down into two steps:
1.) Extract the bitfield (the bits may not be contiguous) that indicates which decoding to use (an index into a switch statement)
2.) Decode the remaining bits into fields, one of which is an instruction and the rest of which are data (well, there may be a data size field or fields, but they can logically be folded into the instruction). Some of the data may have to be fetched from the instruction stream.
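Those two steps can be sketched as a table-driven decoder. The 8-bit toy ISA below is invented purely for illustration:

```python
# Hypothetical toy ISA: the top 2 bits select the decode table entry (step 1),
# the remaining 6 bits are split into fields per entry (step 2).
DECODE_TABLE = {
    0b00: lambda w: ("ALU",  (w >> 3) & 0x7, w & 0x7),   # two register fields
    0b01: lambda w: ("LOAD", (w >> 3) & 0x7, w & 0x7),   # register, address
    0b10: lambda w: ("JUMP", w & 0x3F),                  # 6-bit jump target
    0b11: lambda w: ("IMM",  (w >> 3) & 0x7, w & 0x7),   # register, constant
}

def decode(word):
    selector = (word >> 6) & 0x3          # step 1: extract the selector bitfield
    return DECODE_TABLE[selector](word)   # step 2: per-selector field layout
```

The switch statement mentioned above is the dictionary lookup; cascading decodes would simply nest another table behind one of the entries.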

Instruction sets are designed for rapid decoding, and lack the complexity of structure that would make them an interesting language problem. Thus the DSL approach: you need only a simple specification language to specify a decoder and labels for instructions and registers. This is an easy problem for parser generators: a grammar for the specification language need be no more than two or three pages.
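As a sketch of how small such a specification language can be, here is a hypothetical pattern-per-line format (the format, the field-letter convention, and the helper names are all invented for illustration, though the two patterns follow the real MIPS encodings) together with a toy translator for it:

```python
import re

# Hypothetical spec format: one instruction per line;
# '0'/'1' are literal bits, runs of a letter name an operand field.
SPEC = """
addu  000000ssssstttttddddd00000100001
lw    100011ssssstttttiiiiiiiiiiiiiiii
"""

def compile_spec(spec):
    """Turn each pattern into (name, mask, match, field positions)."""
    entries = []
    for line in spec.strip().splitlines():
        name, pat = line.split()
        mask = int("".join("1" if c in "01" else "0" for c in pat), 2)
        match = int("".join(c if c in "01" else "0" for c in pat), 2)
        fields = {m.group(1): (len(pat) - m.end(), m.end() - m.start())
                  for m in re.finditer(r"([a-z])\1*", pat)}  # letter -> (shift, width)
        entries.append((name, mask, match, fields))
    return entries

def decode(entries, word):
    """First entry whose literal bits match wins; extract its fields."""
    for name, mask, match, fields in entries:
        if word & mask == match:
            return name, {f: (word >> sh) & ((1 << w) - 1)
                          for f, (sh, w) in fields.items()}
    return None  # undefined instruction
```

The specification grammar here is trivial (a name and a bit pattern per line), which supports the point that the DSL itself is an easy parsing problem.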

The reason that parser generators are not typically used for the DSL implementation is that the decoding problem is simple enough not to need parser generation, or the extra learning step of understanding how to use a parser generator.

--Loring

Grigory Rechistov

Mar 17, 2015, 11:02:46 AM3/17/15
to antlr-di...@googlegroups.com
Hello Loring,

Thank you for your explanation. Still, I tend to disagree with "This is an easy problem for parser generators". While the simplest encodings, such as MIPS's, are a no-brainer and don't even require a DSL to describe them, the more sophisticated an architecture gets, the more effort is required to decode it. I will give just two examples. They also contradict your statement that "a grammar for the specification language need be no more than two or three pages"; in fact, the DSL descriptions for these architectures that I have seen occupy hundreds of kilobytes of hand-written code (and those are inputs to DSL translators, not the results of their operation).

1. Intel Itanium instructions are stored in "bundles" of 3×41 bits, plus 5 bits that define the bundle "template", which affects the meaning of all three instructions in the bundle. Each of these 41-bit instructions must be matched against one of the instruction formats, each of which has its own layout of bit fields to be extracted, checked against constraints, and then transformed into instruction operands and/or instruction suffixes.
There are about 120 instruction formats in the current documentation; to complicate things further, five of them actually state that the length of an instruction should be extended to 82 bits, twice the size of a regular one.
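Splitting such a bundle into its parts is at least mechanical. A minimal Python sketch, assuming the 5-bit template occupies the low bits of the 128-bit bundle and the three 41-bit slots sit above it (the per-slot format matching, which is where the real complexity lives, is omitted):

```python
SLOT_MASK = (1 << 41) - 1   # a 41-bit instruction slot

def split_bundle(bundle):
    """Split a 128-bit IA-64 bundle into its template and three 41-bit slots."""
    template = bundle & 0x1F
    slots = [(bundle >> (5 + 41 * i)) & SLOT_MASK for i in range(3)]
    return template, slots
```

Each returned slot would then go through its own format-matching step, parameterized by the template value.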

2. Intel IA-32 instructions have varying lengths, from 8 to 120 bits. An instruction can have zero or more single-byte prefixes, whose individual bits affect the recognition of the bits that follow. Instructions can also have multi-byte prefixes, again with individual bits giving new meaning to later fields. All of these prefixes must obey a complex set of rules about whether, and in what order, they may follow each other. After this prefix madness come the opcode bytes, one to three of them: the only mandatory part of any instruction. After that, more optional fields may follow: ModRM, SIB, Offset, Immediate. The actual meaning of certain bytes changes with the processor mode; for example, 0x40 is an opcode in 32-bit mode, but a prefix in 64-bit long mode.
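A heavily simplified sketch of just the prefix-scanning phase, showing how the processor mode changes the meaning of a byte (only legacy single-byte prefixes and REX are handled; real decoders deal with far more cases, including the ordering rules):

```python
LEGACY_PREFIXES = {0xF0, 0xF2, 0xF3,                    # LOCK / REPNE / REP
                   0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,  # segment overrides
                   0x66, 0x67}                          # operand/address size

def scan_prefixes(code, long_mode):
    """Peel off prefixes, then return them plus the first opcode byte."""
    i = 0
    prefixes = []
    while code[i] in LEGACY_PREFIXES:
        prefixes.append(code[i])
        i += 1
    rex = None
    if long_mode and 0x40 <= code[i] <= 0x4F:   # REX exists only in 64-bit mode
        rex = code[i]
        i += 1
    return prefixes, rex, code[i]   # first of up to three opcode bytes

# The same bytes decode differently depending on the mode:
scan_prefixes(bytes([0x66, 0x40, 0x90]), long_mode=True)   # 0x40 is a REX prefix
scan_prefixes(bytes([0x66, 0x40, 0x90]), long_mode=False)  # 0x40 is the opcode
```

In grammar terms, `long_mode` acts exactly like the context-sensitive predicate that a parser-generator-based decoder would need.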

Despite the high entanglement of these instruction sets, they are still prefix codes and do not require unbounded lookahead. For a single instruction set of such complexity (and I have to note that others, such as modern ARM, IBM S/390, etc., are only marginally simpler than these two) there should be a way to write a grammar in ANTLR capable of telling correct instructions from undefined ones, given that specialized DSLs for this already exist.

As in my original letter, I am not sure whether such a grammar would 1) generate an efficient parser and 2) be easy to write and maintain over time. The lack of publications on the matter means that either parser generators are indeed unsuitable for parsing low-level code, or nobody has ever tried it.



--
You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/THyWEnb0bzM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Best regards,
Grigory Rechistov

Terence Parr

Mar 17, 2015, 1:06:56 PM3/17/15
to antlr-di...@googlegroups.com
Hi Grigory, I have not attempted to manually decode real instruction sets; well, at least not since the 6502 ;)   You would have to use a lot of semantic predicates in the grammar to determine what is valid next based upon bit fields. In the end, there may not be a very good grammar for how the bits are laid out. Or at least it might be highly context-sensitive, in which case it might be better just to build your own state machine that knows how to decode these instructions.

 I believe that there are open source emulators for ARM and such that you might be able to look at.
Ter

Loring Craymer

Mar 18, 2015, 4:57:04 AM3/18/15
to antlr-di...@googlegroups.com, grigory....@phystech.edu
Grigory--

Yes, there are processors which have cascading decodes--that adds hierarchy, but little complexity to a DSL grammar.  Similarly, VLIW processors add parallelism to the decode process--another form of hierarchy.  The hardware decode process is still fundamentally table-driven and avoids lookahead beyond fetching the "token" for the next decode step.  As long as the DSL focuses on specifying the decode engine and labels for the decoded fields, this is not a hard problem although individual specifications can get messy--the more so if the specification writer does not think through the decode process.  The specification for a particular CPU may still be large, but that is a function of the decode design for that particular processor--or the specification writer's failure to do sufficient analysis of the ISA.

The fact that a grammar for a specification language is not large does not mean that this is an inappropriate problem for ANTLR, but it does explain why you do not find DSLs implemented in ANTLR or another parser generator:  the DSL writers had a problem to solve that appeared straightforward (only one token of lookahead, even if tokens are of variable size).  Recursive descent parsers are easy to write for such problems, and most programmers don't use parser generators because they haven't had occasion to learn how to use one.

--Loring

Eric Vergnaud

Mar 18, 2015, 11:45:23 AM3/18/15
to antlr-di...@googlegroups.com
Hi,

I would add to this that disassembling machine code does not require any kind of syntax checking or validation
(this is based on the assumption that you are parsing 'real' machine code, as opposed to human-produced bytes).
Using a tool designed primarily to check syntax and validate input does not sound like an effective approach.
While ANTLR is a great tool for parsing asm and, with some help, generating machine code, I would not use it to do the reverse.

Just my 2P.

Eric

Grigory Rechistov

Mar 19, 2015, 10:37:36 AM3/19/15
to antlr-di...@googlegroups.com
Dear all,

I thank everyone for your opinions; they have made things clearer for me. I guess that, in order to empirically support my observations and your arguments, I'll ask an undergraduate student to implement a medium-complexity decoder with ANTLR or a similar tool and then report on the experience gained. Ideally I want to work towards a generalized solution for producing machine instruction decoders.

