[LLVMdev] how to transform elf binary to llvm IR?

慕冬亮

Jul 17, 2015, 3:11:53 AM
to llv...@cs.uiuc.edu
I want to transform an ELF binary to LLVM IR, and do some instrumentation based on LLVM.
Is there any tool that can do the transformation?
Thanks in advance.
    
    - mudongliang

Suprateeka R Hegde

Jul 17, 2015, 3:48:43 AM
to 慕冬亮, LLVM Dev

It's not that easy. Check out projects like MCSEMA.

--
Supra

_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Mayur Pandey

Jul 17, 2015, 5:47:59 AM
to Suprateeka R Hegde, LLVM Dev
mcsema is one such tool (open source). It currently supports translating x86 and x86_64 machine code to LLVM IR. You can find more details about which instructions it supports at https://github.com/trailofbits/mcsema
--
Thanx & Regards
Mayur Pandey


 
 

慕冬亮

Jul 17, 2015, 10:20:41 AM
to Mayur Pandey, LLVM Dev
I have seen this tool, but I could not work out how to translate x86_64 machine code to LLVM IR on Linux.
I see the x86_64 demos, but I did not find any relevant ones.

John Criswell

Jul 17, 2015, 11:45:07 AM
to 慕冬亮, llv...@cs.uiuc.edu
On 7/17/15 2:09 AM, 慕冬亮 wrote:
> I want to transform elf binary to llvm IR, and do some instrumentation based on llvm.
> Is there any tool which can do the transformation?

There is a tool called Revgen that might do what you need. Revgen can translate native code to LLVM IR, but I'm not sure whether it can translate the LLVM IR back to native code for execution.

There is also s2e which does dynamic translation from binary code to LLVM IR; it should be able to run the code after instrumentation.

IIRC, both come from George Candea's group at EPFL.  A quick Google search should help you find the code.

Regards,

John Criswell


-- 
John Criswell
Assistant Professor
Department of Computer Science, University of Rochester
http://www.cs.rochester.edu/u/criswell

Shuai Wang

Jul 17, 2015, 12:23:49 PM
to 慕冬亮, llv...@cs.uiuc.edu
This is not an easy task, and I believe there is NO (open-source) tool that can fully solve this problem statically. Correct me if I am wrong.

It would be more helpful if you could provide details about what you want to do: static or dynamic translation? A stripped binary, or a binary with symbol information?
Which compiler are you working with?

Check out the papers below if you are interested.





Shuai



mats petersson

Jul 17, 2015, 12:49:29 PM
to Shuai Wang, LLVM Developers Mailing List
For every level of translation [in terms of "human readable -> machine code translation", not someone translating a literary work from one language to another - although often some subtle details are lost here too], a little bit of the semantic meaning is lost. This means that you can almost never completely reconstruct the code in original form from the machine-code, or the C-code from the LLVM IR, or the C++ code from the output of something like cfront (the original C++ -> C translator), or the original Pascal code from a Pascal to C compiler, etc.

It is, at least sometimes, possible to reconstruct something that can then be "compiled" [in quotes as it's a loose term in this discussion] again from the binary file, but it's often lacking some of the original subtlety. And there are certainly cases where the original code is very hard to derive from the machine code. I played with a "symbolic disassembler" many years back, and on "well-behaved code" it would reconstruct assembly code that could be recompiled, but it struggled with, for example, switch statements that had become a PC-relative jump table: once the code was modified, it couldn't figure out where the jumps were supposed to land. That is just one example.
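
To make the jump-table problem concrete, here is a minimal sketch in C. It assumes a compiler that lowers the dense switch in `apply_switch` to an indirect jump through a table (many do at higher optimization levels, but this is not guaranteed). `apply_table` is a hand-written stand-in for what the machine code then effectively looks like; all names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Source-level view: a dense switch over small integers. */
int apply_switch(int op, int x)
{
    switch (op) {
    case 0:  return x + 1;
    case 1:  return x * 3;
    case 2:  return x - 7;
    case 3:  return x / 2;
    case 4:  return -x;
    default: return 0;
    }
}

/* Hand-written approximation of the lowered form: the case bodies
 * become separate targets, and their addresses live in a data table. */
static int op_inc(int x)    { return x + 1; }
static int op_triple(int x) { return x * 3; }
static int op_sub7(int x)   { return x - 7; }
static int op_half(int x)   { return x / 2; }
static int op_neg(int x)    { return -x; }

static int (*const handlers[])(int) = {
    op_inc, op_triple, op_sub7, op_half, op_neg
};

int apply_table(int op, int x)
{
    size_t count = sizeof handlers / sizeof handlers[0];

    if (op < 0 || (size_t)op >= count)
        return 0;            /* the "default" case */
    return handlers[op](x);  /* indirect transfer; targets are data */
}

/* Both forms compute the same results for every op, in and out of range. */
void check_equivalence(void)
{
    for (int op = -1; op <= 6; op++)
        for (int x = -10; x <= 10; x++)
            assert(apply_switch(op, x) == apply_table(op, x));
}
```

A tool that only sees the second form has to recover the table's location and length from the bounds check, and has to rewrite every entry if the code is moved. Get either wrong and the recompiled program jumps somewhere else.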


I'm pretty sure it's possible to, at least as a human, write code that is nearly impossible to translate back to a higher level language. And modern compilers may not use the same types of obfuscation, but they will certainly produce code that is complex, hard to follow and not using obvious instructions for some particular purpose.

--
Mats

Shuai Wang

Jul 17, 2015, 1:35:17 PM
to mats petersson, LLVM Developers Mailing List
Hello Mats,

I am sorry, but I didn't fully get your point. Things have actually been moving forward, and recent research has (marginally) solved some of the obstacles raised earlier.

I have been working on related reverse engineering topics for a while and, according to my review, there is no open-source tool that can fully solve this challenge,
even for binaries generated from well-written C programs by a widely used compiler (32-bit gcc, with no optimization). We can discuss more by email if you would like.

You might want to check the papers I listed in the previous email, which discuss several issues in translating binaries into LLVM IR,
and also some recent research papers on disassembly itself.


Sincerely,
Shuai

mats petersson

Jul 17, 2015, 2:55:54 PM
to Shuai Wang, LLVM Developers Mailing List
Shuai: I think we are agreeing - I was just saying that it's very difficult, but in a different way than how you were saying it. A large part of the difficulty is that there is "more information" in the higher level description of code, than there is in the lower level, and the compiler/translator "removes" (some of) that information when compiling. A very simple example is:

    struct ab
    {
        int a;
        float b;
    };

    struct ab a;

    foo(a);

will look (in most compilers) the same as

    int a;
    float b;

    foo(a, b);

Debug information and symbols can of course help here, but if the code doesn't have that, then there isn't any way to tell `foo(int, float)` from `foo(struct ab)` as a signature. So whilst it MAY be possible to recreate the code at a higher level that is functionally equivalent, a lot of the "helping" features in the high-level language will go missing because the information was "removed" by the compiler.
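
A compilable sketch of the same point, in C. The names `foo_struct`, `foo_scalars` and `check_layout` are made up for illustration, and the layout checks assume a common ABI where `int` and `float` are both 4 bytes with 4-byte alignment. Whether the two call sequences come out byte-identical depends on the target's calling convention, so treat this as an illustration of the lost type information rather than a statement about every platform.

```c
#include <assert.h>
#include <stddef.h>

struct ab {
    int a;
    float b;
};

/* Takes the aggregate by value. */
float foo_struct(struct ab v)
{
    return (float)v.a + v.b;
}

/* Takes the same two values as separate scalars. */
float foo_scalars(int a, float b)
{
    return (float)a + b;
}

/* The struct adds no bytes beyond its two members (assumed ABI, see above),
 * so nothing in the data itself marks it as an aggregate. */
void check_layout(void)
{
    assert(offsetof(struct ab, a) == 0);
    assert(offsetof(struct ab, b) == sizeof(int));
    assert(sizeof(struct ab) == sizeof(int) + sizeof(float));
}
```

Without debug information, a lifter looking at either function sees an integer and a float arriving and being added. The name `struct ab` and the fact that the two values belong together are not recorded anywhere in the binary.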

--
Mats

慕冬亮

Jul 18, 2015, 12:54:13 AM
to Shuai Wang, llv...@cs.uiuc.edu
What we want to do is statically transform a binary (one with symbol information) to LLVM IR.
I will then instrument the code at the LLVM IR level.
The compiler will probably be LLVM clang.

Mayur Pandey

Jul 18, 2015, 4:00:47 PM
to 慕冬亮, llv...@cs.uiuc.edu
Hi mudongliang,

I have tried the mcsema tool on x86_64 binaries. It works well for small programs. I am now trying it on one large project and am getting an error, which I am trying to resolve.

To use mcsema, there are two main tools:
bin_descend: recovers the CFG from the binary (in Google Protocol Buffer format)
cfg_to_bc: converts the CFG produced in the previous step to LLVM IR

Tell me what problem you ran into while using this tool. I might be able to help.

Joshua Cranmer 🐧

Jul 18, 2015, 10:26:52 PM
to llv...@cs.uiuc.edu
On 7/17/2015 2:09 AM, 慕冬亮 wrote:
> I want to transform elf binary to llvm IR, and do some instrumentation
> based on llvm.
> Is there any tool which can do the transformation?

It sounds like what you want to do is some form of binary translation,
and, quite frankly, LLVM is going to be a poor choice. LLVM is designed
to be a compiler IR, and its optimizations rely on source-level hinting
information that is irrevocably lost when converted to machine code.
While there do exist several projects that can do some conversion from
machine code to IR (Dagger, Fracture, MCSema), none of them are
sufficiently robust (to my knowledge). In comparison to projects whose
raison d'être is binary translation (e.g., Valgrind, Pin), you're not
going to see sufficient value-add in using LLVM to outweigh the fact
that you're using a very non-robust solution.

If you really want to use LLVM, I'd advise using clang to compile the
C/C++ code and do instrumentation passes within the clang compilation
process. I would not advise trying to do instrumentation via decompiling
binaries to LLVM IR.
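
As a rough illustration of compile-time instrumentation, here is a minimal sketch in C. It does not use an LLVM pass. It swaps in the simpler `-finstrument-functions` option that both GCC and Clang provide, which makes the compiler insert calls to the two `__cyg_profile_func_*` hooks at every function entry and exit. The counters and the `square` and `report` functions are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

static unsigned long enter_count;
static unsigned long exit_count;

/* Hooks called by compiler-inserted code when the file is built with
 * -finstrument-functions. They must not be instrumented themselves,
 * otherwise they would recurse. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    enter_count++;
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    exit_count++;
}

/* An ordinary function; instrumented only if the flag is given. */
int square(int x)
{
    return x * x;
}

/* Prints the counters; both stay 0 when built without the flag. */
__attribute__((no_instrument_function))
void report(void)
{
    assert(exit_count <= enter_count);
    printf("entered %lu, exited %lu\n", enter_count, exit_count);
}
```

Built with something like `clang -finstrument-functions file.c`, every call to `square` bumps both counters; built without the flag, the hooks are never called. A custom LLVM pass gives finer control over what gets instrumented, but the principle is the same: the instrumentation is added while the compiler still has the source-level information.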

--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
