Incorporate static analysis to symbolic execution in S2E

25 views
Skip to first unread message

Yingtong Liu

unread,
Nov 22, 2021, 2:11:24 PM11/22/21
to S2E Developer Forum

Hi all,

I recently encountered some problems with doing static analysis in S2E. Part of my project requires me to use some external whole-program static analysis to direct the symbolic execution in S2E. 

As far as I know, the LLVM instructions lifted in S2E are quite different from the LLVM instructions compiled from the source code. Usually, those LLVM instructions compiled from source code are used by some static analysis tools to do whole-program static analysis like pointer analysis.

I wonder if there is already any static analysis directly implemented in S2E on the LLVM instructions S2E generated? If not, I guess the only way to combine external static analysis and symbolic execution in S2E is through some LLVM instruction/basic block matching. But I don't actually know how hard it can be and how to use S2E to do this. Could you please point me to some existing relevant resources/plugins in S2E that are related to this if there are any? Any possible suggestions are also welcomed. 

Thank you! 

Vitaly Chipounov

unread,
Nov 22, 2021, 2:32:03 PM11/22/21
to s2e...@googlegroups.com, Yingtong Liu

Hi,

In order to do static analysis, you need to translate your entire binaries to LLVM first. You have two main options for this:

1. Use Revgen [1]. It's a static translator built from S2E's LLVM translator. Since the translator is shared with the symbolic execution engine, you will get similar LLVM code to what S2E generates dynamically. Unfortunately, Revgen is not being maintained anymore and the quality of its output is lagging. Moreover, static analysis passes will have a hard time looking at its output. Finally, we plan to remove the LLVM translator altogether, because it's very slow [3], so avoid depending on it too much.

2. Use Trailofbits McSema [2]. I would suggest you use this tool, as this is a state-of-the-art, actively supported, translator. The quality of its lifting is of higher quality than that of Revgen and your static analysis passes will work better on it.

Your static analysis passes could then feed information to S2E, e.g., which paths to explore first. You will have to write your own plugins for that. There aren't any generic ones because it's highly specific to the type of analysis you want to do. Keep in mind that S2E is a dynamic analysis tool by design, so use it for applications that benefit of dynamic analysis.

Vitaly

[1] http://s2e.systems/docs/Tutorials/Revgen/Revgen.html
[2] https://blog.trailofbits.com/2017/03/14/mcsema-im-liftin-it/
[3] https://github.com/S2E/s2e-env/issues/178

--
--
You received this message because you are a member of the S2E Developer Forum.
To post to this group, send email to s2e...@googlegroups.com
To unsubscribe from this group, send email to s2e-dev+u...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/s2e-dev

---
You received this message because you are subscribed to the Google Groups "S2E Developer Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to s2e-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/s2e-dev/9e2c1850-51d7-49dc-b465-a48aab19df14n%40googlegroups.com.

Yingtong Liu

unread,
Nov 22, 2021, 3:28:40 PM11/22/21
to Vitaly Chipounov, s2e...@googlegroups.com
Hi Vitaly,

Thanks for your reply and all the useful resources!

It seems like using McSema to do the lifting will output LLVM instructions that are more friendly for some static analysis tools. My only concern about using it instead of Revgen is the incompatibility between McSema generated LLVM instructions and the LLVM instructions executed in S2E. I wonder how different you would expect McSema generated LLVM Instructions and S2E translator generated LLVM instructions? Is it possible to do a one-on-one matching for them? 

For my purpose, I need to do an LLVM instruction-level matching, i.e, when the S2E executes LLVM ins1 symbolically, match it with the same instruction in a static analysis graph, then the symbolic executor would decide to switch to another state or not based some criteria. 

Yeah, I guess writing my own plugins is necessary. I also agree that dynamic analysis is what S2E is mainly good at. What I am trying to do is use static analysis which has a whole-program view to benefit the state scheduling of s2e's state scheduler, which I think maybe interesting to experiment on. 

Vitaly Chipounov

unread,
Nov 23, 2021, 12:11:56 PM11/23/21
to Yingtong Liu, s2e...@googlegroups.com

Hi,

On 11/22/21 9:28 PM, Yingtong Liu wrote:

It seems like using McSema to do the lifting will output LLVM instructions that are more friendly for some static analysis tools. My only concern about using it instead of Revgen is the incompatibility between McSema generated LLVM instructions and the LLVM instructions executed in S2E. I wonder how different you would expect McSema generated LLVM Instructions and S2E translator generated LLVM instructions? Is it possible to do a one-on-one matching for them?

Revgen's output is quite different from S2E's output because Revgen tries to eliminate useless instructions as much as possible during lifting. So the concern is the same as with McSema: you won't have obvious one-to-one matching in either case.


For my purpose, I need to do an LLVM instruction-level matching, i.e, when the S2E executes LLVM ins1 symbolically, match it with the same instruction in a static analysis graph, then the symbolic executor would decide to switch to another state or not based some criteria.

Instead of matching LLVM instructions, it would be better to match x86 binary instructions. The tool that you use to generate LLVM code should embed information about the original x86 instructions into the translated bitcode. This way, your static analysis can map any LLVM instruction back to the binary's program counter. It can then communicate these program counters to S2E plugins. This would allow the highest flexibility for state switching, i.e., the S2E plugin will be able to switch states on any instruction rather than just on those that touch symbolic data. If you need to manipulate symbolic data, do it at instruction boundaries, i.e., either before or after an instruction is fully executed. S2E provides a stable API to instrument code at guest instruction granularity, but does not have anything to instrument individual LLVM instructions.

Vitaly

Yingtong Liu

unread,
Nov 23, 2021, 11:22:14 PM11/23/21
to Vitaly Chipounov, s2e...@googlegroups.com
Hi Vitaly,

Yeah, I agree with matching x86 binary instructions may be the right way to go. Thank you for your suggestion. This is really helpful!
Reply all
Reply to author
Forward
0 new messages