MLIR for binary code analysis

Ajay Kumar

Jul 29, 2020, 3:32:49 PM

to MLIR

Hi, I am new to MLIR and currently learning it. Most of the work with MLIR seems to be going towards compiler infrastructure, but I would like to use the same MLIR concepts for binary code analysis.

In this regard, I have a few questions and would appreciate your suggestions.

1. Sorry if I am wrong, but what I understood from the tutorial is that the Toy language is converted into an AST, and MLIR is emitted from the AST. When you say multi-level IR, are there explicit levels within MLIR, like level-1, level-2, etc.?
2. Once MLIR is emitted, what is it used for next? Is the current state of MLIR work the conversion from MLIR to LLVM IR? If so, can we use that LLVM IR for recompilation (if I use MLIR for binary code analysis)?
3. I have disassembled assembly instructions for an ELF binary (obtained from an RE tool like IDA Pro). If I feed in these disassembled instructions in place of the Toy language, can I keep the remaining steps, i.e. Toy language/assembly instructions --> AST --> MLIR --> LLVM IR?

Thanks in advance.

Mehdi AMINI

Jul 29, 2020, 7:33:12 PM

to Ajay Kumar, MLIR

Hi,

This mailing list is mostly intended for MLIR inside TensorFlow; the general MLIR discussion forum is https://llvm.discourse.group/c/mlir/31. That said, see the answers to your questions inline:

On Wed, Jul 29, 2020 at 12:32 PM Ajay Kumar <akma...@gmail.com> wrote:

Hi, I am new to MLIR and currently learning it. Most of the work with MLIR seems to be going towards compiler infrastructure, but I would like to use the same MLIR concepts for binary code analysis.
In this regard, I have a few questions and would appreciate your suggestions.
Sorry if I am wrong, but what I understood from the tutorial is that the Toy language is converted into an AST, and MLIR is emitted from the AST. When you say multi-level IR, are there explicit levels within MLIR, like level-1, level-2, etc.?

There isn't any "numbering" of the levels: MLIR mainly allows you to create your own abstraction levels, so there can't really be an absolute numbering.

Each compilation flow has its own concept of levels, which are generally implicit and associated with the dialects defined and involved.

In the context of the tutorial, the levels would be (from the highest to the lowest):

- the Toy dialect, which models the semantics of the programming language as closely as possible.

- the linalg dialect, which describes some dense computations in a structured way.

- the affine dialect, which is slightly lower than linalg in that it describes, more generally, any kind of affine relationship.

- the std dialect, for general scalar/memory manipulation.

- the llvm dialect, which is the last level before we hand over to LLVM for CPU codegen.

Note that other uses of MLIR can involve levels much lower than LLVM IR, for example https://github.com/llvm/circt
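
To make the idea of implicit, dialect-defined levels more concrete, here is a minimal Python sketch. It does not use MLIR's real API: the `Op` class, the `PATTERNS` table, and the `lower` helper are hypothetical, and the one-to-one rewrites (for example `toy.add` to `std.addf`) are a deliberate simplification of what real lowerings do. The point is only that the "level" of a program is whatever set of dialects its operations currently belong to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A toy stand-in for an operation: a dialect, a name, and operand names."""
    dialect: str
    name: str
    operands: tuple = ()

    def __str__(self):
        return f"{self.dialect}.{self.name}({', '.join(self.operands)})"


# Hypothetical one-to-one rewrite table (an assumption for illustration only;
# real lowerings usually expand one op into several lower-level ops).
PATTERNS = {
    ("toy", "add"): ("std", "addf"),
    ("toy", "mul"): ("std", "mulf"),
    ("std", "addf"): ("llvm", "fadd"),
    ("std", "mulf"): ("llvm", "fmul"),
}


def lower(ops, source_dialect):
    """Rewrite every op of source_dialect via PATTERNS; leave other ops untouched."""
    out = []
    for op in ops:
        if op.dialect == source_dialect:
            new_dialect, new_name = PATTERNS[(op.dialect, op.name)]
            out.append(Op(new_dialect, new_name, op.operands))
        else:
            out.append(op)
    return out


def dialects_in(ops):
    """The current 'level' of a program is just the set of dialects it uses."""
    return {op.dialect for op in ops}


program = [Op("toy", "add", ("%a", "%b")), Op("toy", "mul", ("%0", "%c"))]
after_toy = lower(program, "toy")    # toy -> std
after_std = lower(after_toy, "std")  # std -> llvm
print([str(op) for op in after_std])
```

Nothing in the sketch assigns a number to a level; the ordering comes only from which rewrites are applied and in which order, which is the sense in which levels are specific to each compilation flow.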

Once MLIR is emitted, what is it used for next? Is the current state of MLIR work the conversion from MLIR to LLVM IR? If so, can we use that LLVM IR for recompilation (if I use MLIR for binary code analysis)?

I have disassembled assembly instructions for an ELF binary (obtained from an RE tool like IDA Pro). If I feed in these disassembled instructions in place of the Toy language, can I keep the remaining steps, i.e. Toy language/assembly instructions --> AST --> MLIR --> LLVM IR?

I'm lacking some context on what you're trying to model here. In particular, if you decompile binaries, you have to ask whether there is some structure you need to model for which LLVM IR can't be emitted directly. MLIR just allows you to create such a model and progressively analyze/transform your representation to converge towards your target IR.
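
As a rough illustration of converging from a machine-level representation towards a target IR, the following Python sketch parses one disassembled instruction into an op of a hypothetical `x86` dialect and lowers it to LLVM-IR-like text. Everything here is an assumption made for the example: the dialect name, the helper functions, the choice to model each register as an `i64` memory slot, and the fact that CPU flags are ignored. Only register-to-register `add`, `sub`, and `xor` are handled.

```python
def parse_instruction(line):
    """Split a line such as 'add rax, rbx' into an op name and operand list."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [o.strip() for o in rest.split(",") if o.strip()]
    # "x86." is a hypothetical dialect prefix, not an existing MLIR dialect.
    return {"op": "x86." + mnemonic, "operands": operands}


# LLVM opcodes for the few mnemonics handled in this sketch.
LLVM_OPCODE = {"x86.add": "add", "x86.sub": "sub", "x86.xor": "xor"}


def lower_to_llvm_like(inst):
    """Lower a two-operand register instruction to LLVM-IR-like text.

    Assumption: each register is modelled as an i64 slot in memory, and
    side effects on CPU flags are not modelled at all.
    """
    dst, src = inst["operands"]
    opcode = LLVM_OPCODE[inst["op"]]
    return [
        f"%t0 = load i64, i64* %{dst}",
        f"%t1 = load i64, i64* %{src}",
        f"%t2 = {opcode} i64 %t0, %t1",
        f"store i64 %t2, i64* %{dst}",
    ]


inst = parse_instruction("add rax, rbx")
llvm_lines = lower_to_llvm_like(inst)
print("\n".join(llvm_lines))
```

The flag updates dropped here are one example of structure that a direct translation loses; an intermediate level could keep them explicit until an analysis shows they are unused.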

--

Mehdi

Ajay Kumar

Jul 30, 2020, 12:02:21 AM

to Mehdi AMINI, MLIR

Hi Mehdi,

Thanks so much for your response. I am fully convinced by your answer to my first question about MLIR.

- the llvm dialect which is the last level before we hand over to LLVM for the CPU codegen.

Do you mean that the llvm dialect is the last level, and that it corresponds to LLVM IR but is still represented within MLIR? Is my understanding correct?

--

In the third question, I am trying to model the disassembled assembly instructions of x86-64 static ELF binaries (something like https://docs.binary.ninja/dev/bnil-llil.html), which I obtained from disassembler tools such as objdump, Capstone, IDA Pro, etc. The disassembled output has the following structure:

- registers

- instructions

- functions

- data and segment sections, etc., of the static ELF binary.
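
The structure listed above could be captured, before any IR is emitted, in a small container such as the following Python sketch. All class and field names are hypothetical, the register set is a small illustrative subset of x86-64, and `dump` only produces a human-readable listing, not MLIR's textual format.

```python
from dataclasses import dataclass, field

# Illustrative subset of x86-64 general-purpose registers (an assumption).
REGISTERS = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"}


@dataclass
class Instruction:
    address: int
    mnemonic: str
    operands: list = field(default_factory=list)


@dataclass
class Function:
    name: str
    instructions: list = field(default_factory=list)

    def registers_used(self):
        """Collect operands that are plain register names."""
        return {o for i in self.instructions for o in i.operands if o in REGISTERS}


@dataclass
class Section:
    name: str
    address: int
    size: int


@dataclass
class Binary:
    functions: list = field(default_factory=list)
    sections: list = field(default_factory=list)

    def dump(self):
        """Render a readable, pseudo-assembly-like listing of the model."""
        lines = []
        for s in self.sections:
            lines.append(f"section {s.name} @ {s.address:#x} size={s.size}")
        for f in self.functions:
            lines.append(f"func {f.name}:")
            for i in f.instructions:
                text = f"  {i.address:#x}: {i.mnemonic} {', '.join(i.operands)}"
                lines.append(text.rstrip())
        return "\n".join(lines)


main = Function("main", [
    Instruction(0x401000, "mov", ["rax", "rbx"]),
    Instruction(0x401003, "ret"),
])
binary = Binary(functions=[main], sections=[Section(".text", 0x401000, 4)])
print(binary.dump())
```

A model like this plays the role of the AST in the Toy tutorial: it is the input from which a first, machine-level dialect would be emitted.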

I was trying to model the above information by first generating an AST and then emitting MLIR from it. Traditionally, a single IR for binary code analysis and recompilation (from LLVM IR) has been widely studied in the area of reverse engineering, and LLVM IR is used by many approaches (https://github.com/lifting-bits/mcsema#comparison-with-other-machine-code-to-llvm-bitcode-lifters). My interest, however, is in using a multi-level IR (MLIR) in place of a single IR.

At this stage, I would like to know whether my approach is feasible: generating a multi-level IR and then recompiling it to runnable target binary code, following the existing MLIR concepts. What is your suggestion?

Lastly, is it possible to read the content of the MLIR (in something like a pseudo-assembly form)?