Dear ANTLR experts,
My question is about a rather unusual use of parser generators. While decoding machine instructions in software has a lot in common with parsing "regular" languages, I have found no examples of, or attempts at, implementing such a decoder in LEX/YACC, ANTLR or another parser generator. At the same time, I am aware of several specialized DSL translators tailored for creating decoders, and none of them, to my knowledge, is based on a parser generator.
The second step of a machine code interpreter's operation, after fetching a raw word from memory, is to decode it: run a routine that accepts a machine word of a given width and parses it into an intermediate representation (IR) that facilitates the subsequent simulation of the instruction's semantics. Usually such an IR contains aggregated information about the instruction's function (opcode) and arguments (operands). A similar task is disassembly: translating a machine word into a string of human-readable text.
During decoding, a typical machine instruction of a RISC system (MIPS, ARM, SPARC, Alpha, etc.) is split into several bit fields. The values of the bit fields are tested against a set of checks to determine whether the instruction is known and which arguments are present. The position of some fields may depend on values found in other fields, so decoding is usually done sequentially from the first bit to the last. Machine encodings are guaranteed to be a prefix code, so no ambiguous situation can arise.
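To make the field-splitting step concrete, here is a minimal hand-written sketch in Python for a handful of MIPS32 instructions. The names `bits`, `Decoded` and `decode`, as well as the tiny lookup tables, are my own illustrative choices and not taken from any of the tools mentioned in this post; a real decoder would cover far more encodings and would also check fields such as `shamt`.

```python
from dataclasses import dataclass


def bits(word: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive, bit 0 = least significant) of word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


@dataclass
class Decoded:
    """Toy intermediate representation: a mnemonic plus named operands."""
    mnemonic: str
    operands: dict


# Deliberately incomplete tables covering only a few MIPS32 encodings.
R_TYPE_FUNCT = {0x20: "add", 0x22: "sub", 0x24: "and", 0x25: "or"}
I_TYPE_OPCODE = {0x08: "addi", 0x23: "lw", 0x2B: "sw"}


def decode(word: int) -> Decoded:
    opcode = bits(word, 31, 26)
    rs = bits(word, 25, 21)
    rt = bits(word, 20, 16)
    if opcode == 0:
        # For opcode 0 the actual operation is selected by the funct field,
        # so the meaning of the low bits depends on the value of the high bits.
        funct = bits(word, 5, 0)
        if funct in R_TYPE_FUNCT:
            rd = bits(word, 15, 11)
            return Decoded(R_TYPE_FUNCT[funct], {"rd": rd, "rs": rs, "rt": rt})
    elif opcode in I_TYPE_OPCODE:
        imm = bits(word, 15, 0)
        if imm & 0x8000:  # sign-extend the 16-bit immediate
            imm -= 0x10000
        return Decoded(I_TYPE_OPCODE[opcode], {"rt": rt, "rs": rs, "imm": imm})
    raise ValueError(f"unknown instruction word {word:#010x}")
```

For example, `decode(0x00221820)` yields the IR for `add $3, $1, $2`, and `decode(0x2008FFFF)` yields `addi` with `rt` = 8, `rs` = 0 and an immediate of -1.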
Decoding CISC instructions (such as IA32, a.k.a. x86) is somewhat (in my experience, significantly) harder, because the instructions have variable length, but essentially it is the same process.
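The variable-length aspect can be sketched as follows. The Python below handles only four IA32 encodings (`nop`, `ret`, `jmp rel8` and `mov r32, imm32`), assumes 32-bit mode and ignores prefixes and ModR/M bytes entirely; `decode_one` and `decode_all` are hypothetical names chosen for this illustration. The point is only that the first byte determines how many more bytes belong to the instruction.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_one(code: bytes, offset: int = 0):
    """Decode one instruction at offset; return (text, length_in_bytes)."""
    op = code[offset]
    if op == 0x90:
        return "nop", 1
    if op == 0xC3:
        return "ret", 1
    if op == 0xEB:
        # jmp rel8: one opcode byte followed by a signed 8-bit displacement.
        (rel,) = struct.unpack_from("<b", code, offset + 1)
        return f"jmp short {rel:+d}", 2
    if 0xB8 <= op <= 0xBF:
        # mov r32, imm32: the register number is encoded in the opcode byte,
        # followed by a little-endian 32-bit immediate.
        (imm,) = struct.unpack_from("<I", code, offset + 1)
        return f"mov {REG32[op - 0xB8]}, {imm:#x}", 5
    raise ValueError(f"unsupported opcode byte {op:#04x}")


def decode_all(code: bytes):
    """Decode a byte string sequentially, advancing by each decoded length."""
    offset, out = 0, []
    while offset < len(code):
        text, length = decode_one(code, offset)
        out.append(text)
        offset += length
    return out
```

Calling `decode_all(bytes([0xB8, 0x2A, 0, 0, 0, 0x90, 0xEB, 0xFE, 0xC3]))` walks over instructions of 5, 1, 2 and 1 bytes in turn.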
Unsurprisingly, writing a decoder by hand is a tedious and thankless task. Several translators have been written [1-4] to simplify this and related tasks. All of them accept a machine language description in a dedicated DSL and generate code capable of, among other things, decoding that language.
To my knowledge, none of these tools is based on a parser generator, yet what they do is mainly parsing. I find this strange. I am looking for projects that disprove my observation or, failing that, for a logical explanation of it. So far I have come up with two possibilities:
1. Machine language parsing is really an unsuitable task for general-purpose parser generators, such as ANTLR, due to the "low-levelness" of the language. Indeed, I cannot imagine a lexer for machine code producing tokens other than '0' and '1'. A parser working with just these two tokens might turn out to be clumsy, gigantic, or inefficient.
2. Folks working in the system simulation field are not aware of parser generators or are too conservative to use them, sticking to their own tools. Indeed, the very idea of writing a machine decoder in ANTLR or another tool occurred to me only recently. Until then I had always resorted to manually written decoders or ad-hoc translators.
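To illustrate what I mean in the first possibility, here is a rough Python sketch of a "lexer" that emits only '0' and '1' tokens, with a hand-written recursive-descent-style parser on top of it. This is not ANTLR output and says nothing about how ANTLR would perform; `lex`, `BitParser` and `parse_instr` are names I made up, and the two-alternative grammar in the comment is a hypothetical fragment of MIPS.

```python
def lex(word: int, width: int = 32):
    """'Lexer': turn a machine word into '0'/'1' tokens, most significant bit first."""
    return list(format(word, f"0{width}b"))


class BitParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def lookahead(self, literal: str) -> bool:
        """True if the next tokens spell out literal (a string of 0s and 1s)."""
        return self.tokens[self.pos:self.pos + len(literal)] == list(literal)

    def match(self, literal: str) -> None:
        """Consume the tokens of literal or fail."""
        if not self.lookahead(literal):
            raise SyntaxError(f"expected {literal} at bit position {self.pos}")
        self.pos += len(literal)

    def field(self, n: int) -> int:
        """Consume n tokens and return them as an unsigned integer."""
        chunk = self.tokens[self.pos:self.pos + n]
        if len(chunk) != n:
            raise SyntaxError("unexpected end of word")
        self.pos += n
        return int("".join(chunk), 2)


def parse_instr(tokens):
    # Hypothetical grammar fragment, one token per bit:
    #   instr : '0' '0' '1' '0' '0' '0' reg reg imm16   // addi
    #         | '1' '0' '0' '0' '1' '1' reg reg imm16   // lw
    p = BitParser(tokens)
    for literal, name in (("001000", "addi"), ("100011", "lw")):
        if p.lookahead(literal):
            p.match(literal)
            rs, rt, imm = p.field(5), p.field(5), p.field(16)
            return name, rs, rt, imm
    raise SyntaxError("no alternative matches")
```

Even in this toy form, recognizing a single 6-bit opcode takes six tokens of lookahead, and every operand is a run of single-bit tokens that has to be reassembled into a number, which hints at why such a grammar could become unwieldy.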
Before trying to implement a decoder for some simple architecture (like MIPS) using ANTLR, I want to ask more experienced community members whether it is worth trying. Any opinion is welcome! Thank you!
[1] Oliver Schliebusch, Andreas Hoffmann, Achim Nohl, Gunnar Braun,
and Heinrich Meyr. Architecture implementation using the machine
description language LISA. In: ASP-DAC '02: Proceedings of the 2002
Asia and South Pacific Design Automation Conference / VLSI Design
(2002), pp. 239–244.
[2] G. Hadjiyiannis, S. Hanono, and S. Devadas. ISDL: An Instruction
Set Description Language for Retargetability. In: Proceedings of
the 34th Design Automation Conference (1997), pp. 299–302.
[3] Fredrik Larsson, Peter Magnusson, and Bengt Werner. SimGen:
Development of Efficient Instruction Set Simulators. SICS
Technical Report R97:03.
[4] Norman Ramsey and Mary F. Fernandez. The New Jersey
Machine-Code Toolkit. In: Proceedings of the USENIX 1995
Technical Conference.