[Haskell-cafe] Idiomatic way to parse bytecodes


Gautier DI FOLCO

Aug 29, 2015, 7:12:11 PM
to haskell-cafe
Hello all,

I have some bytecodes to parse and I'm looking for the most idiomatic way to do it.
It's mostly JVM Bytecode; long story short, I have some embedded JVM Bytecodes:
 * one part is specification-defined, from 0x00 to 0xB8, and differs from the regular JVM Bytecode
 * the other part is proprietary and changes often

I have two main goals:
 * print the meaning of each bytecode (something more explicit than its hexadecimal value)
 * do some analysis on it: spot patterns, apply Markov models, and so on.

The issue with JVM Bytecode in general is that not all bytecodes are created equal: some require parameters, some modify the length of the next bytecode, etc.
I'm not looking for a way to parse bytecode quickly, or for a library that does it for me.
I'm looking for a simple/composable/idiomatic way to do it.
So if you have any thoughts, papers, articles or even code snippets (part of a library or not) done in the Haskell way, please share them :)

Thanks in advance for your help.

William Yager

Aug 29, 2015, 8:22:07 PM
to Gautier DI FOLCO, haskell-cafe
Based on what you've said, I would define a Serial or Binary instance for single bytecodes, and then use this to write a function :: ByteString -> [Bytecode].

Then you can convert raw JVM bytecode into some convenient Haskell data, and do whatever processing you want on it.
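A minimal sketch of that approach using Data.Binary.Get (the Bytecode type and the opcode subset here are illustrative placeholders, not a full JVM instruction set):

```haskell
import Data.Binary.Get (Get, getWord8, isEmpty, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Illustrative subset: astore takes a one-byte index, astore_0/1 take none.
data Bytecode
    = Astore Word8
    | Astore_0
    | Astore_1
    deriving (Eq, Show)

-- One instruction: read the opcode, then however many operand bytes it needs.
getBytecode :: Get Bytecode
getBytecode = do
    op <- getWord8
    case op of
        0x19 -> Astore <$> getWord8
        0x4B -> pure Astore_0
        0x4C -> pure Astore_1
        _    -> fail ("unknown opcode: " ++ show op)

-- Run the single-instruction parser until the input is exhausted.
getBytecodes :: Get [Bytecode]
getBytecodes = do
    done <- isEmpty
    if done then pure [] else (:) <$> getBytecode <*> getBytecodes

decodeAll :: BL.ByteString -> [Bytecode]
decodeAll = runGet getBytecodes
```

Variable-length instructions fall out naturally here, because each opcode's case consumes exactly the operand bytes it needs before the next instruction is read.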

--Will


_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe


Kim-Ee Yeoh

Aug 29, 2015, 11:47:27 PM
to Gautier DI FOLCO, haskell-cafe

On Sun, Aug 30, 2015 at 6:12 AM, Gautier DI FOLCO <gautier...@gmail.com> wrote:
 
I have two main goals:
 * print the meaning of each bytecode (something more explicit than its hexadecimal value)
 * do some analysis on it: spot patterns, apply Markov models, and so on.

Work backwards, a piece at a time.

The output of parsing is a piece of structured data. So you need to design your data type. Start with the bare minimum. Write the pretty printing code. What do the type signatures of your analysis functions look like?

E.g. "Apply markov model" is way too vague.

The key thing is to iterate and improve on the design of your data type.

Once the data type looks good, the implementation of parsing becomes clear as day.
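To make that concrete, here is a bare-minimum starting point in the spirit of the advice above; the constructors and the bigram analysis are placeholders I've picked for illustration, not a committed design:

```haskell
import Data.Word (Word8)
import qualified Data.Map as M

-- Bare-minimum data type; grow it as the analyses demand.
data Bytecode = Astore Word8 | Astore_0 | Astore_1
    deriving (Eq, Ord, Show)

-- Pretty printing: a more explicit rendering than the raw hex value.
pretty :: Bytecode -> String
pretty (Astore n) = "astore " ++ show n
pretty Astore_0   = "astore_0"
pretty Astore_1   = "astore_1"

-- One concrete analysis signature: counting adjacent opcode pairs,
-- the raw material for a first-order Markov model.
bigramCounts :: [Bytecode] -> M.Map (Bytecode, Bytecode) Int
bigramCounts bs = M.fromListWith (+) [ (p, 1) | p <- zip bs (drop 1 bs) ]
```

Writing even this much forces the design questions: does an operand belong inside the constructor, and what shape do the analyses need the data in?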

-- Kim-Ee

Gautier DI FOLCO

Aug 30, 2015, 6:35:34 PM
to Kim-Ee Yeoh, haskell-cafe
Hello,

Thanks for your answer.
I think I misled you in expressing my needs.
I don't have any representation issues; I'm just looking for a good design for the parsing phase.
Here are some naive iterative attempts:
0. We only have one bytecode per instruction, so a simple Map is enough:
import Data.Word
import Data.Maybe
import qualified Data.Map as H
type Byte = Word8

data ByteCodeSimpleInstruction =
      Astore_0 -- 0x4B
    | Astore_1 -- 0x4C
    | Astore_2 -- 0x4D
    | Astore_3 -- 0x4E
    deriving (Eq, Show)

sampleSI :: [Byte]
sampleSI = [0x4B, 0x4C, 0x4D, 0x4E]

convertTableSI :: H.Map Byte ByteCodeSimpleInstruction
convertTableSI = H.fromList $ [
      (0x4B, Astore_0),
      (0x4C, Astore_1),
      (0x4D, Astore_2),
      (0x4E, Astore_3)
    ]

parserSI :: [Byte] -> [ByteCodeSimpleInstruction]
parserSI = map $ fromJust . flip H.lookup convertTableSI

1. We add instructions with different lengths; I'm "forced" to introduce pattern matching, and it makes me sad:
data ByteCodeVariableLengthInstruction =
      Astore' Byte -- 0x19
    | Astore_0' -- 0x4B
    | Astore_1' -- 0x4C
    | Astore_2' -- 0x4D
    | Astore_3' -- 0x4E
    deriving (Eq, Show)

sampleVLI :: [Byte]
sampleVLI = [0x4B, 0x4C, 0x19, 0x2A, 0x4D, 0x4E]

parserVLI :: [Byte] -> [ByteCodeVariableLengthInstruction]
parserVLI b = case b of
    (0x19:p:n) -> (Astore' p):parserVLI n
    (0x4B:n)   -> Astore_0':parserVLI n
    (0x4C:n)   -> Astore_1':parserVLI n
    (0x4D:n)   -> Astore_2':parserVLI n
    (0x4E:n)   -> Astore_3':parserVLI n
    []         -> []

2. We add instructions that change the next instruction's length; we are "forced" to add different parsing strategies:
data ByteCodeVariableLengthParameterInstruction =
      Astore'' ByteParameter -- 0x19
    | Astore_0'' -- 0x4B
    | Astore_1'' -- 0x4C
    | Astore_2'' -- 0x4D
    | Astore_3'' -- 0x4E
    | Wide'' -- 0xDD
    deriving (Eq, Show)

data ByteParameter =
      Simple Byte
    | Double Byte Byte
    deriving (Eq, Show)

sampleVLPI :: [Byte]
sampleVLPI = [0x4B, 0x4C, 0xDD, 0x19, 0xA2, 0x2A, 0x4D, 0x4E]

parserVLPI :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPI = parserVLPISimple

parserVLPISimple :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPISimple b = case b of
    (0x19:p:n) -> (Astore'' (Simple p)):parserVLPISimple n
    (0x4B:n)   -> Astore_0'':parserVLPISimple n
    (0x4C:n)   -> Astore_1'':parserVLPISimple n
    (0x4D:n)   -> Astore_2'':parserVLPISimple n
    (0x4E:n)   -> Astore_3'':parserVLPISimple n
    (0xDD:n)   -> Wide'':parserVLPIDouble n
    []         -> []

parserVLPIDouble :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPIDouble b = case b of
    (0x19:p:q:n) -> (Astore'' (Double p q)):parserVLPISimple n
    _            -> parserVLPISimple b


I feel these tactics are too "ad hoc" and I'm looking for a higher-order way to do it.
I don't know if I'm clearer now.

Thanks in advance for your help.
Regards.

PS: For the Markov part, Markov models will help me spot the most common sequences of bytecodes. Then I'll be able to create proprietary bytecodes that are the most likely to decrease my code's size.

Richard A. O'Keefe

Aug 30, 2015, 11:44:01 PM
to Gautier DI FOLCO, haskell-cafe

On 31/08/2015, at 10:35 am, Gautier DI FOLCO <gautier...@gmail.com> wrote:
> Thanks for your answer.
> I think I have misled you in the expression of my needs.
> I don't have any representation issues, I'm just looking for a good design for the parsing phase.

Depending on what you mean by "parsing",
a very simple answer may suffice.

Use a TABLE-DRIVEN approach.

Let me take a Smalltalk byte code as an example.
Instructions fall into a number of groups.
For example, we have
push_local <number>
where <number> might be encoded in 2 bytes,
in 1 byte, or it might be implied in the opcode.
So we'd have something like

data Operator
    = Push_Local
    | Store_Local
    | ...

data Operand
    = None
    | Implied Int
    | One_Byte
    | Two_Byte
    | ...

decode :: Array Int (Operator, Operand)
decode = array (0,255) [
    (0, (Push_Local, Implied 0)),
    (1, (Push_Local, Implied 1)),
    ...
    (8, (Push_Local, One_Byte)),
    (9, (Push_Local, Two_Byte)),
    ...

Or even make it a function:

decode :: Int -> (Operator,Operand)
decode 0 = ...
...
decode 255 = ...

then for example

extract :: Operand -> Bytes -> Int -> (Simple_Operand, Int)
extract None b i = (No_Operand, i)
extract (Implied n) b i = (Int_Operand n, i)
extract (One_Byte) b i = (Int_Operand (byte b i), i+1)
extract (Two_Byte) b i = (Int_Operand (byte b i * 256 + byte b (i+1)),i+2)
...

Ideally, you'd switch to a different variant of the instruction set
by simply changing the table.
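Here is a self-contained sketch of that table-driven idea; the opcode numbers follow the Smalltalk example above, and I've used a Map rather than an Array for brevity, so treat it as an illustration rather than the exact code:

```haskell
import Data.Word (Word8)
import qualified Data.Map as M

data Operator = Push_Local | Store_Local deriving (Eq, Show)
data Operand  = None | Implied Int | One_Byte | Two_Byte deriving (Eq, Show)

-- The table: everything instruction-set-specific lives here, so
-- switching instruction sets means swapping only this definition.
table :: M.Map Word8 (Operator, Operand)
table = M.fromList
    [ (0, (Push_Local, Implied 0))
    , (1, (Push_Local, Implied 1))
    , (8, (Push_Local, One_Byte))
    , (9, (Push_Local, Two_Byte))
    ]

-- The driver: generic over whatever operand shapes the table declares.
run :: [Word8] -> [(Operator, Int)]
run [] = []
run (op : rest) = case M.lookup op table of
    Nothing             -> error ("unknown opcode " ++ show op)
    Just (o, None)      -> (o, 0) : run rest
    Just (o, Implied n) -> (o, n) : run rest
    Just (o, One_Byte)  -> case rest of
        b : rest' -> (o, fromIntegral b) : run rest'
        _         -> error "truncated operand"
    Just (o, Two_Byte)  -> case rest of
        b1 : b2 : rest' -> (o, fromIntegral b1 * 256 + fromIntegral b2) : run rest'
        _               -> error "truncated operand"
```

The ad-hoc pattern matching from the earlier attempts collapses into one generic driver plus a declarative table per instruction set.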

When you have "portmanteau instructions", it gets a bit trickier.
For example, Smalltalk systems typically have

Store_Local <index> -- store the TOS in that variable
Pop -- remove the TOS

combined in

Pop_Into_Local <index> -- store TOS into variable and remove it

> PS: For the Markov part, Markov models will help me to spot the most common sequences of bytecodes. Then I'll be able to create proprietary byte that will be the most likely to decrease my code's size.

There's a Xerox blue-and-white report. Ah, here we go.
http://www.textfiles.com/bitsavers/pdf/xerox/parc/techReports/CSL-82-2_An_Analysis_of_a_Mesa_Instruction_Set.pdf
Mesa was an "industrial strength Pascal" running on the Alto and
the D-machines. The author collected dynamic frequencies of
byte pairs. "It is hard to learn much of general interest by
examining individual instruction frequencies of a small set of
programs, since those statistics may highlight idiosyncratic
behavior of the programs. For example, only three of the top eight
most frequently executed instructions in both VlsiCheck and the
compiler are the same."

There's also the companion paper from ASPLOS'82,
"Static Analysis of the Mesa Instruction Set". R.Sweet, J.Sandman.

The obvious question is WHY you want to compress your code size
in this way. That report (if only I could remember title or
author, sigh) pointed out that STATIC instruction (and operand)
frequencies and DYNAMIC instruction (and operand) frequencies
can be quite different.

A fair bit of fiddling has gone on in the Smalltalk world, even
though practically everyone started with something like the Blue
Book model (which is still probably *the* reference for how byte
code systems work). Those people are trying to balance *compact*
encoding against *efficient* decoding.

If you just want to reduce size you might think in terms of
just compressing everything, with decompression of code that's
actually executed as "Level 0 JIT".