Using MLIR for symbolic analysis

363 views
Skip to first unread message

Peteris Erins

unread,
Apr 28, 2019, 6:33:04 PM4/28/19
to MLIR
Hi There,

I'm excited about the prospect of potentially using MLIR for symbolic execution on blockchain, but my understanding of MLIR is not yet good enough to fully evaluate this use case. The practical problem I'm facing is the following.

Problem description

We want to build a tool that takes a smart contract (i.e., a program) and can:

1) interpret it
2) convert it into a set of formulae for symbolic execution (the formulas would be expressions for an SMT solver like Z3). For example a program (a) -> { x = a; x += 1; return x } could be turned into these formulas { x1 = a; x2 = x1 + 1; return = x2; }

This is how I would be interested in using MLIR for this problem:

Diagram

AST -> (a) Smart contract dialect -> (b) IR for ops that are easy to model with SMT solver -> LLVM dialect
                                                                                                                                               -> SMT formulas

- Create an MLIR dialect (call it (a)) for the smart contract programming language/bytecode in SSA form
- Create another dialect (call it (b)) that is closer to what SMT solvers can understand
- Implement conversion (a) -> (b)
- Get (1) by writing a (b) -> LLVM dialect converter
- Get (2) by implementing a rewriter from (b) to the SMT solver outside of MLIR or even build the SMT solver as a dialect (c) using MLIR

Why use MLIR in this case?

- Get debugging, analysis, verification tools and a type system for free / almost using MLIR as a general framework for building an IR
- LLVM already implemented so easier to write an interpreter for (1)
- By virtue of leveraging the common language (b), it's easier to achieve correctness for (1) and (2)
- If any optimizations are needed, they are easy to implement as rewrites

Where I see challenges/questions

- Can MLIR deal with a re-definition of "state" for a dialect? For example, the K framework for formal semantics supports the introduction of new state objects called "cells". Smart contract platforms like Ethereum have a specific architecture with their own memory, storage, that doesn't map directly to traditional computers. Is MLIR relying on low-level state model of machine RAM, etc. as implicit in LLVM, or does it support broader concepts of "state"?
- If not is there a way to "hack" this?
- Another way of asking this is: could MLIR be a good way to describe the canonical semantics of a programming language?
- Is MLIR powerful enough to express concepts like mappings and sets?
- What is the status/coverage of Python bindings right now?
- What sorts of debugging, analysis, tools are already working/planned at this point/

Appreciate your insight.

Best
Peteris

Mehdi AMINI

unread,
Apr 28, 2019, 7:11:13 PM4/28/19
to Peteris Erins, MLIR
On Sun, Apr 28, 2019 at 3:33 PM Peteris Erins <peteri...@gmail.com> wrote:
Hi There,

I'm excited about the prospect of potentially using MLIR for symbolic execution on blockchain, but my understanding of MLIR is not yet good enough to fully evaluate this use case. The practical problem I'm facing is the following.

Problem description

We want to build a tool that takes a smart contract (i.e., a program) and can:

1) interpret it
2) convert it into a set of formulae for symbolic execution (the formulas would be expressions for an SMT solver like Z3). For example a program (a) -> { x = a; x += 1; return x } could be turned into these formulas { x1 = a; x2 = x1 + 1; return = x2; }

This is how I would be interested in using MLIR for this problem:

Diagram

AST -> (a) Smart contract dialect -> (b) IR for ops that are easy to model with SMT solver -> LLVM dialect
                                                                                                                                               -> SMT formulas

- Create an MLIR dialect (call it (a)) for the smart contract programming language/bytecode in SSA form
- Create another dialect (call it (b)) that is closer to what SMT solvers can understand
- Implement conversion (a) -> (b)
- Get (1) by writing a (b) -> LLVM dialect converter
- Get (2) by implementing a rewriter from (b) to the SMT solver outside of MLIR or even build the SMT solver as a dialect (c) using MLIR

That seems like a good approach to me (without knowing enough about the domain specific problem).
 

Why use MLIR in this case?

- Get debugging, analysis, verification tools and a type system for free / almost using MLIR as a general framework for building an IR
- LLVM already implemented so easier to write an interpreter for (1)
- By virtue of leveraging the common language (b), it's easier to achieve correctness for (1) and (2)
- If any optimizations are needed, they are easy to implement as rewrites 

Yes that seems like a good summary: we hope to have MLIR mature into a complete framework to build your own IR and set of analyses/transformations, but also create a set of dialects that can be re-used as much as possible.
 

Where I see challenges/questions

- Can MLIR deal with a re-definition of "state" for a dialect? For example, the K framework for formal semantics supports the introduction of new state objects called "cells". Smart contract platforms like Ethereum have a specific architecture with their own memory, storage, that doesn't map directly to traditional computers. Is MLIR relying on low-level state model of machine RAM, etc. as implicit in LLVM, or does it support broader concepts of "state"?

Each dialect its own expectations: you can define your own type system with your domain specific semantic. Some concept in MLIR like memref I believe are making some assumption about the memory, but you don't have to use these types or any of the standard operations in your dialect. 

 
- If not is there a way to "hack" this?
- Another way of asking this is: could MLIR be a good way to describe the canonical semantics of a programming language?
- Is MLIR powerful enough to express concepts like mappings and sets?

I'm not sure what kind of mapping and sets you have in mind here, do you have any example to illustrate?

-- 
Mehdi
 
- What is the status/coverage of Python bindings right now?
- What sorts of debugging, analysis, tools are already working/planned at this point/

Appreciate your insight.

Best
Peteris

--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/a768c2b4-5b53-454f-bd14-5860c77fe4c5%40tensorflow.org.

Peteris Erins

unread,
Apr 29, 2019, 12:59:52 AM4/29/19
to MLIR
Hi Mehdi,

Thank you for your response.

In terms of mappings, one thing to model in the Ethereum Virtual Machine is its storage. It's "hard drive" is a hash map from 256-bit accounts to hash maps of 256-bit keys and 256-bit values, i.e., map<uint256, map<uint256, uint256>>. There are some opcodes in the Ethereum Virtual Machine that modify this storage map.

Here are the semantics (expressed in K framework) of an operation that modifies storage called SSTORE:

    syntax BinStackOp ::= "SSTORE"
 // ------------------------------
    rule <k> SSTORE INDEX NEW => . ... </k>
         <id> ACCT </id>
         <account>
           <acctID> ACCT </acctID>
           <storage> ... (INDEX |-> (CURR => NEW)) ... </storage>
           <origStorage> ORIGSTORAGE </origStorage>
           ...
         </account>
         <refund> R => R +Int Rsstore(SCHED, NEW, CURR, #lookup(ORIGSTORAGE, INDEX)) </refund>
         <schedule> SCHED </schedule>
The way K framework works is that there is a general, custom definition of global state and each operation is a rewrite of that state. As you see, there are several modifications to the global state that are specialized to this virtual machine: the stack is modified, the underlying storage is modified, and a gas refund is potentially allocated.

I'd like to understand:

- How does MLIR support expressing novel types of virtual machine state that are non-trivial to map to existing memory architecture. In particular, given MLIR is SSA by default, I wonder if the global state would therefore have to be implicit in every single function/block as a custom type/argument?

func simple_f(ethereum_global_state, funct_arguments):
 ethereum_global_state2 = ...
 return ethereum_global_state2, func_return_values

- In this case, would it be fair to say that MLIR is like a "general framework for implementing IRs that will eventually be executed on a standard architecture". So any representation of "semantics" in MLIR would not be direct/mathematically clean, it would represent how a given virtual machine/language could be translated for execution on standard computers. So it would be defining a language indirectly by detailing how it should eventually be lowered to machine level code.

- What types of data structures are encodable? EVM is a stack-based VM. Can we express a stack? Can we express a hash table like above?

Hope this clarifies further how I'm essentially attempting to hijack MLIR to apply to a different virtual machine architecture and how that could work.

Best
Peteris

Mehdi AMINI

unread,
Apr 29, 2019, 1:27:34 PM4/29/19
to Peteris Erins, MLIR
On Sun, Apr 28, 2019 at 9:59 PM Peteris Erins <peteri...@gmail.com> wrote:
Hi Mehdi,

Thank you for your response.

In terms of mappings, one thing to model in the Ethereum Virtual Machine is its storage. It's "hard drive" is a hash map from 256-bit accounts to hash maps of 256-bit keys and 256-bit values, i.e., map<uint256, map<uint256, uint256>>. There are some opcodes in the Ethereum Virtual Machine that modify this storage map.

Here are the semantics (expressed in K framework) of an operation that modifies storage called SSTORE:

    syntax BinStackOp ::= "SSTORE"
 // ------------------------------
    rule <k> SSTORE INDEX NEW => . ... </k>
         <id> ACCT </id>
         <account>
           <acctID> ACCT </acctID>
           <storage> ... (INDEX |-> (CURR => NEW)) ... </storage>
           <origStorage> ORIGSTORAGE </origStorage>
           ...
         </account>
         <refund> R => R +Int Rsstore(SCHED, NEW, CURR, #lookup(ORIGSTORAGE, INDEX)) </refund>
         <schedule> SCHED </schedule>
The way K framework works is that there is a general, custom definition of global state and each operation is a rewrite of that state. As you see, there are several modifications to the global state that are specialized to this virtual machine: the stack is modified, the underlying storage is modified, and a gas refund is potentially allocated.

I'd like to understand:

- How does MLIR support expressing novel types of virtual machine state that are non-trivial to map to existing memory architecture. In particular, given MLIR is SSA by default, I wonder if the global state would therefore have to be implicit in every single function/block as a custom type/argument?

func simple_f(ethereum_global_state, funct_arguments):
 ethereum_global_state2 = ...
 return ethereum_global_state2, func_return_values

That is if you want your IR to be "functional", but nothing prevents you to have side-effecting operations.



- In this case, would it be fair to say that MLIR is like a "general framework for implementing IRs that will eventually be executed on a standard architecture".

I don't think there is a notion of "standard architecture". Contrary to LLVM, MLIR does not have any assumption baked-in I think. The goal is to make is as pluggable/customizable as possible.
It is true that some dialects are more closer to the machine than other and will necessarily make assumptions about a virtual machine that matches "standard architectures", the lowest level of abstraction likely being LLVM. But that does not preclude from having dialects with higher level abstractions.
 
So any representation of "semantics" in MLIR would not be direct/mathematically clean, it would represent how a given virtual machine/language could be translated for execution on standard computers. So it would be defining a language indirectly by detailing how it should eventually be lowered to machine level code.

Again, what you describe here seems to match the approach what one would have to take with LLVM. But the main use-case we are leveraging MLIR for internally is TensorFlow (which is modelling a dataflow graph) and we're mapping it on custom accelerators for deep learning (which aren't exactly "standard computers").
 

- What types of data structures are encodable? EVM is a stack-based VM. Can we express a stack? Can we express a hash table like above?

The type system being extensible, I hope you can express a stack or hash-table as "first-class" type in your dialect. 

The main thing is that your dialect semantic would be opaque to MLIR, so you have to build the analyses for your dialect that will model the semantics you want.

Have you watched the tutorial from EuroLLVM? https://llvm.org/devmtg/2019-04/talks.html#Tutorial_1
We tried to show how to model various level of abstractions there. If you feel that it is still unclear let me know as we're likely gonna evolve the tutorial in the coming months and maybe we can do a better job to showcase how MLIR can help the kind of use-cases you have.

Best,

-- 
Mehdi
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.

Peteris Erins

unread,
Apr 29, 2019, 4:41:43 PM4/29/19
to MLIR
Hi Mehdi,

Thanks for the explanation - this is very encouraging.

I did go through the tutorial, which was great. It was clear how MLIR could help me build an optimized compiler. I've not been exposed to LLVM so that's what made it difficult to play out this use case. If helpful at all, I can separately send some feedback on what could have made it clearer for me!

Best
Peteris

On Sunday, April 28, 2019 at 11:33:04 PM UTC+1, Peteris Erins wrote:

ma...@numforge.co

unread,
Apr 30, 2019, 4:04:43 AM4/30/19
to MLIR
Hi Peteris,

I'm part of the Status' Nimbus team, an Eth1 and Eth2 client, and I've implemented the EVM and I'm also following the discussion on which VM to use for Ethereum 2.
I'm assuming you want to do symbolic execution for Ethereum specifically (and not any blockchain VM).

In the current state, I expect eWASM, to be used which is a 64-bit VM based on WebAssembly with Ethereum primitives on top.
The design document is available here and a C++ implementation is available here. Discussion happens on Gitter in the eWASM.

eWASM would probably be a much nicer to implement symbolic execution and as it's a 64-bit VM it will be much closer to the hardware than the EVM which is 256-bit.
Furthermore, WASM -> MLIR toolings woud probably be useful beyond the Ethereum community.

Also one of the main priorities of eWASM is the ease of formal verification and the KEVM team who formally specified the EVM is providing valuable inputs.

Kind regards,
Mamy

Peteris Erins

unread,
Apr 30, 2019, 6:40:37 AM4/30/19
to MLIR
Hi Mamy,

Great to hear. I'm very much looking forward to eWASM. It'd be nice to have MLIR dialects for WASM and eWASM and I wouldn't be surprised if the former popped up rather quickly.

For MLIR, I think there are some interesting questions we need to answer before being able to express EVM/eWASM: what does a canonical SSA form look like (e.g., we probably need some phantom operations just to capture gas costs), how to deal with CALL family of opcodes (which means you don't have all program source code up-front). More generally the byte code -> IR approach may not be what MLIR was designed for and will require some cleverness at the least :).

Best regards
Peteris Erins
Reply all
Reply to author
Forward
0 new messages