Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Making C compiler generate obfuscated code

561 views
Skip to first unread message

Dennis Yurichev

unread,
Dec 7, 2010, 10:07:12 AM12/7/10
to
Hi.

About my little attempt to hack Tiny C compiler's codegenerator so it
produces obfuscated code:
http://blogs.conus.info/node/58

Paul Biggar

unread,
Dec 9, 2010, 12:53:37 PM12/9/10
to

I'd be interested to know how much of this added code could be removed
by a binary- or assembly-level optimizer, and it's affect on the
reversibility of result?
--
Paul Biggar
paul....@gmail.com

Joshua Cranmer

unread,
Dec 9, 2010, 4:53:59 PM12/9/10
to

From a decompilation aspect, it's not that difficult of obfuscated
code. Just by removing redundant assignments, knowing nothing else, I
produce this assembly:

a proc near

var_CD500B = byte ptr -0CD500Bh
arg_0 = dword ptr 8
arg_4 = dword ptr 0Ch
arg_1D364BDE = byte ptr 1D364BE6h

nop
push ebp
mov ebp, esp
sub esp, 0
nop

loc_800001F:
mov ebx, 9EF81F3Eh
mov eax, 0FD6D5D47h
sub ebx, edx
mov eax, [ebp+arg_4] ; *
shl eax, 2 ; *
mov ecx, [ebp+arg_0]
adc ecx, ecx
mov ecx, eax
or ecx, eax
mov ebx, 52A74759h
xor edx, edx
jnz short loc_800001F
mov ecx, [ebp+arg_0] ; *
add ecx, eax ; *
mov ebx, 0DAB5E429h
lea edx, [ebx+75A1EF29h]
or edx, edx
mov eax, ecx ; *
jmp $+5
leave
pop ebx
jmp ebx
a endp


Half the code is gone just by noticing that these values aren't being
used, with extremely liberal values of "used". If i notice that the jnz
branch is never taken, I get even shorter:
mov eax, [ebp+arg_4] ; *
shl eax, 2 ; *
mov ecx, [ebp+arg_0] ; *
add ecx, eax ; *
mov ebx, 0DAB5E429h
lea edx, [ebx+75A1EF29h]
or edx, edx
mov eax, ecx ; *

(I could combine the three obfuscated instructions into one, but I don't
want to do that by hand.)

If I wanted to deobfuscate this code, all I have to do is first run it
through an optimizer, and one simple enough to be written as a course
project at that. If you really want to obfuscate the code, it's better
to modify the control flow graph as opposed to inserting random
do-nothing code.

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

Torben Ægidius Mogensen

unread,
Dec 15, 2010, 5:23:49 AM12/15/10
to
Joshua Cranmer <Pidg...@gmail.com> writes:
> If I wanted to deobfuscate this code, all I have to do is first run it
> through an optimizer, and one simple enough to be written as a course
> project at that. If you really want to obfuscate the code, it's better
> to modify the control flow graph as opposed to inserting random
> do-nothing code.

Indeed. I wsa not impressed by the approach: It made the code a lot
larger and a lot slower and the obfuscation was easy to remove.
Ideally, obfuscated code should only be marginally larger and slower
than "normal" code, but extremely difficult to reverse-engineer.

Some of the least readable code I have seen has been code where every
trick in the book was used to make it as short as possible, so using
extreme optimisaton tricks is probably a much better obfuscator than
inserting random code -- even random control-structure code.

Some common optimisations do, indeed, make the code less readable:
Strength reduction, loop scheduling, array blocking, common
subexpression elimination and so on. So using these aggressively
would work well. But there will be code where such optimsations do
not apply. Here, you could use some of the following tricks:

- Split a variable x into two variables p and q, such that x=p+q.
Modifications to x are made into modifications of p _or_ q. Uses of
x (except those for self-modification such as x=x+1) are replaced by
uses of p+q. This will make the code somewhat larger and slower, but
not by much. Common subexpression elimination can remove some of the
p+q additions without increasing readability.

- Replace two otherwise unrelated variables x and y by p and q such
that p=x+y and q=x-y (so x==(p+q)/2 and y==(p-q)/2).

- A variable x holding small non-negative integers is replaced by a
variable p = 2^x. Additions and subtractions of constants to x are
made into shifts. Using x is more tricky, though, unlesss there is a
"count leading zeroes" instruction.


Torben

glen herrmannsfeldt

unread,
Dec 16, 2010, 3:52:13 AM12/16/10
to
Torben Fgidius Mogensen <tor...@diku.dk> wrote:
(snip on object code obfuscating compilers)

> Indeed. I wsa not impressed by the approach: It made the code a lot
> larger and a lot slower and the obfuscation was easy to remove.
> Ideally, obfuscated code should only be marginally larger and slower
> than "normal" code, but extremely difficult to reverse-engineer.

It would seem that one could divide the code into blocks at
unconditional branches, and then move the blocks around randomly.
Though again, it might not be hard to unobfuscate. I do remember
stories about compilers with optimization algorithms based on
random numbers and the natural difficulty with reverse engineering.

> Some of the least readable code I have seen has been code where every
> trick in the book was used to make it as short as possible, so using
> extreme optimisaton tricks is probably a much better obfuscator than
> inserting random code -- even random control-structure code.

That is, assuming such tricks are allowed. As I understand it
(possibly incorrectly) the exception model in Java makes it
difficult to do many optimizations that traditionally would
have been done. Moving statements outside of loops, for one.

> Some common optimisations do, indeed, make the code less readable:
> Strength reduction, loop scheduling, array blocking, common
> subexpression elimination and so on. So using these aggressively
> would work well. But there will be code where such optimsations do
> not apply. Here, you could use some of the following tricks:

If they are predictable, though, it should be possible to write
a program to undo them. Easier if the obfuscator is open source.

> - Split a variable x into two variables p and q, such that x=p+q.
> Modifications to x are made into modifications of p _or_ q. Uses of
> x (except those for self-modification such as x=x+1) are replaced by
> uses of p+q. This will make the code somewhat larger and slower, but
> not by much. Common subexpression elimination can remove some of the
> p+q additions without increasing readability.

Even better if you can find natural rearrangements of variables
that satisfy this goal.

> - Replace two otherwise unrelated variables x and y by p and q such
> that p=x+y and q=x-y (so x==(p+q)/2 and y==(p-q)/2).

Probably not so bad for integers, but not good for floating point.

-- glen

Joshua Cranmer

unread,
Dec 16, 2010, 10:23:45 AM12/16/10
to
On 12/15/2010 05:23 AM, Torben Fgidius Mogensen wrote:
> Some of the least readable code I have seen has been code where every
> trick in the book was used to make it as short as possible, so using
> extreme optimisaton tricks is probably a much better obfuscator than
> inserting random code -- even random control-structure code.

I'm not sure what work has been done on deoptimization (perhaps anyone
with the Hex-Rays decompiler could tell us?), but some of the
optimization techniques seem relatively easy to reverse.

From a purely theoretical standpoint, obfuscation that adds
non-executing code is going to be less difficult to reverse engineer
than obfuscation that does the same thing, just... less obviously.


--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

[A good optimizer should make code much more obscure, since it throws
away everything not necessary to produce a correct result. Really good
optimizers like the ones in embedded linkers do gross tricks like combining
constants that happen to have the same bit pattern even though they mean
different things. -John]

Joshua Cranmer

unread,
Dec 16, 2010, 7:45:40 PM12/16/10
to
On 12/16/2010 03:52 AM, glen herrmannsfeldt wrote:
> Torben Fgidius Mogensen<tor...@diku.dk> wrote:
> (snip on object code obfuscating compilers)
>
>> Indeed. I wsa not impressed by the approach: It made the code a lot
>> larger and a lot slower and the obfuscation was easy to remove.
>> Ideally, obfuscated code should only be marginally larger and slower
>> than "normal" code, but extremely difficult to reverse-engineer.
>
> It would seem that one could divide the code into blocks at
> unconditional branches, and then move the blocks around randomly.
> Though again, it might not be hard to unobfuscate. I do remember
> stories about compilers with optimization algorithms based on
> random numbers and the natural difficulty with reverse engineering.

My decompiler worked by turning the disassembled code into a graph
(also removing unconditional jumps in the process), so the actual
location of hunks did not make when whit of difference.

> That is, assuming such tricks are allowed. As I understand it
> (possibly incorrectly) the exception model in Java makes it
> difficult to do many optimizations that traditionally would have
> been done. Moving statements outside of loops, for one.

Java does make guarantees about apparent execution order, but to the
degree that you know that statements are not going to cause "side
effects", you can reorder stuff.

Martin Ward

unread,
Dec 17, 2010, 5:00:42 AM12/17/10
to
On Thursday 16 Dec 2010 at 15:23, Joshua Cranmer <Pidg...@gmail.com> wrote:
> I'm not sure what work has been done on deoptimization (perhaps anyone
> with the Hex-Rays decompiler could tell us?), but some of the
> optimization techniques seem relatively easy to reverse.
>
> From a purely theoretical standpoint, obfuscation that adds
> non-executing code is going to be less difficult to reverse engineer
> than obfuscation that does the same thing, just... less obviously.

A major theoretical result in this area is the paper "On the
(Im)possibility of Obfuscating Programs" by Boaz Barak et al,
published in: CRYPTO '01 Proceedings of the 21st Annual International
Cryptology Conference on Advances in Cryptology. Boaz Barak gives a
non-technical description of the results and their meaning here:

http://www.math.ias.edu/~boaz/Papers/obf_informal.html

The FermaT program transformation system can interactively transform
code into semantically equivalent forms: including restructuring,
simplification, dataflow analysis , SSA construction, slicing and, in
simple cases, abstraction to a specification. The FermaT engine and
graphical front end runs on Windows and Linux and can be downloaded
from here:

http://www.gkc.org.uk/fermat.html

or:

http://www.cse.dmu.ac.uk/~mward/fermat.html

FermaT's primary commercial application is migrating assembler to
structured and maintainable C and COBOL, so the "deobfuscation"
transformations are geared towards removing the "clever tricks"
introduced by assembler programmers to save a byte here and a cycle
there: or just because they were

--
Martin

STRL Reader in Software Engineering and Royal Society Industry Fellow
mar...@gkc.org.uk http://www.cse.dmu.ac.uk/~mward/ Erdos number: 4
G.K.Chesterton web site: http://www.cse.dmu.ac.uk/~mward/gkc/
Mirrors: http://www.gkc.org.uk and http://www.gkc.org.uk/gkc

glen herrmannsfeldt

unread,
Dec 18, 2010, 1:23:36 AM12/18/10
to
Martin Ward <mar...@gkc.org.uk> wrote:
(snip)

> FermaT's primary commercial application is migrating assembler to
> structured and maintainable C and COBOL, so the "deobfuscation"
> transformations are geared towards removing the "clever tricks"
> introduced by assembler programmers to save a byte here and a cycle
> there: or just because they were

One of the tricks I remember from the 8 bit microprocessor days was to
put an instruction in the two bytes of a load instruction instead of a
branch around the instruction.

I used to have a table driven disassembler for 8080, Z80, and then
later 6809 code that would follow along from the entry point, keeping
track of conditional branch destinations on a stack. When it reached
an unconditional branch it would take an entry off the stack. It
keeps flags for each byte as the first byte (opcode), absolute
address, relative address, or, if not part of any instruction, data.
I suppose one would eventually learn enough tricks though.

Along the line of such tricks, there are stories of Apple II ROM
cloners claiming that so many such tricks were used that the Apple
implementation was the only one possible in the available number of
bytes. Then, since it was the only one possible, it should not
protected against cloning.

-- glen

Rob Warnock

unread,
Dec 18, 2010, 5:30:50 AM12/18/10
to
Martin Ward <mar...@gkc.org.uk> wrote:
+---------------

| A major theoretical result in this area is the paper "On the
| (Im)possibility of Obfuscating Programs" by Boaz Barak et al,
| published in: CRYPTO '01 Proceedings of the 21st Annual International
| Cryptology Conference on Advances in Cryptology. Boaz Barak gives a
| non-technical description of the results and their meaning here:
|
| http://www.math.ias.edu/~boaz/Papers/obf_informal.html
+---------------

See also more recent results mentioned here:

http://www.wisdom.weizmann.ac.il/~oded/p_obfuscate.html

In particular, there's a 2010 revised version of the CRYPTO'01
full paper here (54 pages, 0.5 MB PDF):

http://www.wisdom.weizmann.ac.il/~oded/PS/obf4.pdf

Some snippets from the abstract:

Our main result is that, even under very weak formalizations
of the above intuition, obfuscation is impossible.
...
We extend our impossibility result in a number of ways,
including even obfuscators that (a) are not necessarily
computable in polynomial time, (b) only approximately
preserve the functionality, and (c) only need to work
for very restricted models of computation (TC0). We also
rule out several potential applications of obfuscators,
by constructing "unobfuscatable" signature schemes,
encryption schemes, and pseudorandom function families.

-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Hans-Peter Diettrich

unread,
Dec 16, 2010, 9:07:36 AM12/16/10
to
Torben Fgidius Mogensen schrieb:

> Some of the least readable code I have seen has been code where every
> trick in the book was used to make it as short as possible, so using
> extreme optimisaton tricks is probably a much better obfuscator than
> inserting random code -- even random control-structure code.

Best obfuscation can be found in (older) Windows system code. Short
jumps have been replaced by dummy instructions (CMP), with other jumps
going to the argument of that instruction. What assembly code should a
disassembler generate from such ambiguous constructs?

Also nice are vararg-type subroutine parameters inlined after the call
statement, so that the location of the next statement after the call
is unknown. Even worse when multiple possible return adresses (jump
tables) are part of these parameters. Such obfuscation requires to
decode and emulate the called subroutine, before the location of the
next statement is known.

In practice such interruptions of the control flow make automatic
disassembling almost impossible. Instead a good *interactive*
disassembler is required (as I was writing when I came across above
tricks), and time consuming manual intervention and analysis is
required with almost every break in the control flow. The mix of data
and instructions not only makes it impossible to generate an assembler
listing, but also hides the use of memory locations (variables or
constants), with pointers embedded in the inlined parameter
blocks. Now tell me how a decompiler or other analysis tool should
deal with such constructs, when already the automatic separation of
code and data is impossible.

DoDi

Torben Ægidius Mogensen

unread,
Dec 20, 2010, 6:53:51 AM12/20/10
to
Hans-Peter Diettrich <DrDiet...@aol.com> writes:


> In practice such interruptions of the control flow make automatic
> disassembling almost impossible. Instead a good *interactive*
> disassembler is required (as I was writing when I came across above
> tricks), and time consuming manual intervention and analysis is
> required with almost every break in the control flow. The mix of data
> and instructions not only makes it impossible to generate an assembler
> listing, but also hides the use of memory locations (variables or
> constants), with pointers embedded in the inlined parameter
> blocks. Now tell me how a decompiler or other analysis tool should
> deal with such constructs, when already the automatic separation of
> code and data is impossible.

Using jump tables and the like is, indeed, going to make unobfuscation
hard. Especially if the tables change dynamically.

You might be able to get around this by symbolic execution: You start
with a state description which allows arbitrary values of variables.
You then symbolically execute and update the state description as you go
along. At any test, you split the state description into two new state
descriptions and continue symbolic execution on two individual paths.
Whenever symbolic execution reaches a previously-visited program point
with a state description equivalent to what was seen before, you make a
loop. To keep the set of state descriptions finite, you apply
generalisation when a state description gets too complicated.

This process is similar to online partial evaluation, which also uses
state descriptions and generalisation.

That said, unobfuscation can never be perfect: Equivalence of programs
is undecidable, so it is in theory possible to make a program so
obfuscated that no automatic process can recover the original.

But what if you know the obfuscation method? Assuming that the
obfuscation method is polynomic, deobfuscation is at worst NP-hard, so
it is decidable. But it can be so intractable that it doesn't matter.

Torben

George Neuner

unread,
Dec 21, 2010, 1:30:32 AM12/21/10
to
On Sat, 18 Dec 2010 06:23:36 +0000 (UTC), glen herrmannsfeldt
<g...@ugcs.caltech.edu> wrote:

>Along the line of such tricks, there are stories of Apple II ROM
>cloners claiming that so many such tricks were used that the Apple
>implementation was the only one possible in the available number of
>bytes. Then, since it was the only one possible, it should not
>protected against cloning.

I've heard these claims about the Apple II ROM for thirty years and I
think they are bogus. I learned assembly programming on a IIe and it
was impossible to get much done without a good understanding of the
ROM routines. Many subroutines had multiple entry points and many
were routinely reused from different contexts, but I don't believe the
Apple's ROM was any more complex than any other computer ROM of that
era (or indeed any 8-bit device today).

The major stumbling block to cloning the Apple IIe without licensing
the ROM code was Applesoft FP BASIC, which was implemented using a
bytecoded 16-bit virtual machine interpreter called Sweet16.
Approximately 5KB of Sweet16 code was included in the 16KB Apple IIe
ROM. The cloners couldn't copy the Sweet16 code, nor could they
figure out how to fit a non-infringing work-alike BASIC into the same
address space.

There were a few other issues with the ROM code for the custom disk
controller and video (in particular the hi-res graphics IIRC). Here
the problem was not having Wozniak's hardware ... work-alike code for
OTS video display and disk controllers couldn't be fit into the same
address space.

There eventually were a number of legal Apple IIe and //c clones, but,
AFAIK, all of them had much larger ROMs than the machine they copied
and had to use bank-switching and address shadowing to present a
compatible memory map for existing Apple programs.


YMMV but I don't consider implementing a virtual machine to be a
"trick". Wozniak created Sweet16 to make programming easier ... it
ultimately saved some ROM space, but it was never intended to deceive
anyone ... the Sweet16 interpreter and the fact that it was used was
documented in the Apple ][ Integer BASIC manual. I don't recall if
Sweet16 ever was mentioned in the FP BASIC manual, but FP BASIC was an
extension of Integer BASIC and so, ISTM, anyone interested should have
inferred that Sweet16 had been used in FP BASIC's implementation as
well.

George

George Neuner

unread,
Dec 21, 2010, 12:25:16 PM12/21/10
to
On Mon, 20 Dec 2010 12:53:51 +0100, tor...@diku.dk (Torben Fgidius
Mogensen) wrote:

>Equivalence of programs is undecidable, ...

I'm not sure that's true ... at least in theory. Turing equivalence
guarantees that any program can be expressed in the lambda calculus,
and the Church-Rosser theorem proves that two lambda expressions which
reduce to the same expression are equivalent.

In practice, though, I agree that deciding equivalence is intractable.

>... so it is in theory possible to make a program so


>obfuscated that no automatic process can recover the original.

Without meta information, such as exists in Java and .NET binaries, it
is already difficult to accurately reconstruct source for any but the
simplest of programs.

>But what if you know the obfuscation method? Assuming that the
>obfuscation method is polynomic, deobfuscation is at worst NP-hard, so
>it is decidable. But it can be so intractable that it doesn't matter.

I don't think knowing the method will help because any decent
implementation likely will have multiple ways to obfuscate any
particular construct and the selection will be psuedo-random.

George

Walter Banks

unread,
Dec 21, 2010, 1:40:01 PM12/21/10
to
George Neuner wrote:

> There were a few other issues with the ROM code for the custom disk
> controller and video (in particular the hi-res graphics IIRC). Here
> the problem was not having Wozniak's hardware ... work-alike code for
> OTS video display and disk controllers couldn't be fit into the same
> address space.

Woz's disk controller was very clever. It used a simple state machine
to encode the data on and off the disk. At one point I wired up a
3 1/2 drive for my own use that read and write apple ][ disks (and
higher density that could not be read on an apple). At the time I
was impressed how well thought out Woz's design was. In my "one of"
I did something similar and used a 2708 eprom to store the jump
tables for a state machine.

> YMMV but I don't consider implementing a virtual machine to be a
> "trick". Wozniak created Sweet16 to make programming easier ... it
> ultimately saved some ROM space, but it was never intended to deceive
> anyone ... the Sweet16 interpreter and the fact that it was used was
> documented in the Apple ][ Integer BASIC manual. I don't recall if
> Sweet16 ever was mentioned in the FP BASIC manual, but FP BASIC was an
> extension of Integer BASIC and so, ISTM, anyone interested should have
> inferred that Sweet16 had been used in FP BASIC's implementation as
> well.

Woz wrote a Byte article documenting the Sweet16 code and use.
I don't think the FP basic used the Sweet 16 library. Sweet 16 was
used in the integer basic.

He also wrote a floating point package for the apple. At the time not all
6502's had an unconditional branch which meant that Woz used conditional
branches even when one was not needed. This single "feature" obfuscated
the sources almost to the point of almost being unreadable.

http://www.6502.org/source/floats/wozfp3.txt

All the best of the season,

Walter Banks
--
Byte Craft Limited
http://www.bytecraft.com
[Fascinating though this is, it doesn't have much to do with compilers
any more. -John]

glen herrmannsfeldt

unread,
Dec 21, 2010, 2:20:28 PM12/21/10
to
Torben Fgidius Mogensen <tor...@diku.dk> wrote:
(snip)

> Using jump tables and the like is, indeed, going to make
> unobfuscation hard. Especially if the tables change dynamically.

Yes for automated systems, but maybe not with a human in the loop.

If, for example, the target code is an interpreter then there is
likely a jump table for processing different statements. Finding that
table, then, tells you more than finding a bunch of conditional
branches.

> You might be able to get around this by symbolic execution: You start
> with a state description which allows arbitrary values of variables.

Well, you do have to find the end of the table, which often isn't hard
done by hand. (You can see when the addresses stop looking like
addresses. Also, there might be a conditional test on the table
offset.)

(snip)


> This process is similar to online partial evaluation, which also uses
> state descriptions and generalisation.

> That said, unobfuscation can never be perfect: Equivalence of programs
> is undecidable, so it is in theory possible to make a program so
> obfuscated that no automatic process can recover the original.

Again, it helps to have a human in the loop, especially one who
know the art of the design of similar programs.

One of my first, larger, disassemblies (for personal use) was the
BASIC interpreter for the TRS-80 Color Computer, MC6809 code.
I was working on it for a while without finding the main loop
that reads statement codes and branches. It turned out that such
loop is loaded into RAM at startup, and executes there. When
disassembling ROM, one doesn't always expect execution in RAM.

Anyway, I believe that mostly automated but with a human in the
loop is the best way to do disassembly and deobfuscation.

-- glen

Martin Ward

unread,
Dec 22, 2010, 6:30:48 AM12/22/10
to
On Tuesday 21 Dec 2010 at 17:25, George Neuner <gneu...@comcast.net> wrote:
> >Equivalence of programs is undecidable, ...
>
> I'm not sure that's true ... at least in theory. Turing equivalence
> guarantees that any program can be expressed in the lambda calculus,
> and the Church-Rosser theorem proves that two lambda expressions which
> reduce to the same expression are equivalent.

Any program which takes no input and produces no output can only
do one of two things:

(a) Terminate.
(b) Not terminate.

So such a program is equivalent to either SKIP or ABORT.
If this equivalence were decidable, then termination would also be decidable.
But termination is undecidable (http://en.wikipedia.org/wiki/Halting_problem).

--
Martin

STRL Reader in Software Engineering and Royal Society Industry Fellow
mar...@gkc.org.uk http://www.cse.dmu.ac.uk/~mward/ Erdos number: 4

[Termination is undecidable in general given an arbitrary program and
arbitrary input, but frequently decidable for a specific program and
specific input. I'd think a program that takes no input would be
pretty easy to analyze. -John]

Hans-Peter Diettrich

unread,
Dec 22, 2010, 11:12:06 AM12/22/10
to
Torben Fgidius Mogensen schrieb:

>> In practice such interruptions of the control flow make automatic
>> disassembling almost impossible. Instead a good *interactive*
>> disassembler is required (as I was writing when I came across above
>> tricks), and time consuming manual intervention and analysis is
>> required with almost every break in the control flow. The mix of data
>> and instructions not only makes it impossible to generate an assembler
>> listing, but also hides the use of memory locations (variables or
>> constants), with pointers embedded in the inlined parameter
>> blocks. Now tell me how a decompiler or other analysis tool should
>> deal with such constructs, when already the automatic separation of
>> code and data is impossible.
>
> Using jump tables and the like is, indeed, going to make unobfuscation
> hard. Especially if the tables change dynamically.

In the observed cases the presence of jump tables was unknown, and also
the structure and size of the data block, that follows the call
instruction :-(

> You might be able to get around this by symbolic execution: You start
> with a state description which allows arbitrary values of variables.

Then you'll end up with a tree of states, in the best case, and a graph
(with loops and knots) in the worst case.


> But what if you know the obfuscation method? Assuming that the
> obfuscation method is polynomic, deobfuscation is at worst NP-hard, so
> it is decidable. But it can be so intractable that it doesn't matter.

It may be possible to crack algorithmic obfuscation, but odds are bad
with the encountered "handmade" obfuscation. The intended (and achieved)
effect was optimization (almost for smaller size), and the resulting
obfuscation only was a side effect.

Even if one can produce equivalent assembler code, with some tricks
(macros...) for data structures with multiple meanings (instructions in
instruction arguments...), that code will remain hard to understand -
and that's the primary goal of every obfuscation. Who will be able to
tell the *purpose* of a state machine or other automaton, given only its
implementation?

More unobfuscation problems come into mind, like the use of modified
external code, maybe only different versions of (shared) standard
libraries. When some code relies on the values returned from such
external subroutines, and the precise implementation in a specific
library version, the entire environment (version of the OS and all
installed libraries) has to be taken into account.

DoDi

George Neuner

unread,
Dec 23, 2010, 4:08:06 PM12/23/10
to
On Wed, 22 Dec 2010 11:30:48 +0000, Martin Ward <mar...@gkc.org.uk>
wrote:

>On Tuesday 21 Dec 2010 at 17:25, George Neuner <gneu...@comcast.net> wrote:
>> >Equivalence of programs is undecidable, ...
>>
>> I'm not sure that's true ... at least in theory. Turing equivalence
>> guarantees that any program can be expressed in the lambda calculus,
>> and the Church-Rosser theorem proves that two lambda expressions which
>> reduce to the same expression are equivalent.
>
>Any program which takes no input and produces no output can only
>do one of two things:
>
>(a) Terminate.
>(b) Not terminate.
>
>So such a program is equivalent to either SKIP or ABORT.
>If this equivalence were decidable, then termination would also be decidable.
>But termination is undecidable (http://en.wikipedia.org/wiki/Halting_problem).

It is true that termination, in general, is undecidable ... but that
doesn't invalidate Church-Rosser equivalence.

The lambda calculus is based on term rewriting. Church-Rosser
equivalence is based on pattern matching of lambda terms. By applying
the rewrite rules for term expansion, substitution and reduction, if
two different terms can be rewritten to be in the same pattern form -
modulo variable names - then they are considered equivalent. It
doesn't matter what the terms "mean" or what they might do if
"executed".

"Execution" in LC is a process of continually applying the rewrite
rules to the "program" term until no more reductions can be found (a
halt) or until a fixed point (a cycle) is observed.

George

Torben Ægidius Mogensen

unread,
Jan 4, 2011, 6:19:18 AM1/4/11
to
George Neuner <gneu...@comcast.net> writes:


> It is true that termination, in general, is undecidable ... but that
> doesn't invalidate Church-Rosser equivalence.
>
> The lambda calculus is based on term rewriting. Church-Rosser
> equivalence is based on pattern matching of lambda terms. By applying
> the rewrite rules for term expansion, substitution and reduction, if
> two different terms can be rewritten to be in the same pattern form -
> modulo variable names - then they are considered equivalent. It
> doesn't matter what the terms "mean" or what they might do if
> "executed".

There is no conflict here: Equivalence of lambda terms is undecidable.

Equivalence (more specifically beta equivalence), as you say, is defined
by saying that two terms t1 and t2 are equivalent if there is a term t3
such that t1 => t3 and t2 => t3, where => is the (reflective and
transitive) reduction relation. But if t1 or t2 both are without normal
forms, you can not decide their equivalence by comparing a finite number
of terms. Hence, equivalence is only semi-decidable.

There are various other ways to define equivalence of lambda terms
that attempt to equivalate more terms, such as all terms that have no
head normal form or weak head normal form. But these are also
undecidable. You can define equivalences that equate fewer terms and
which are decidable, but these are not useful as program equivalence.
Alpha equivalence and eta equivalence springs to mind.

Torben

George Neuner

unread,
Jan 6, 2011, 3:45:23 PM1/6/11
to
On Tue, 04 Jan 2011 12:19:18 +0100, tor...@diku.dk (Torben Fgidius
Mogensen) wrote:

>George Neuner <gneu...@comcast.net> writes:
>
>
>> It is true that termination, in general, is undecidable ... but that
>> doesn't invalidate Church-Rosser equivalence.
>>
>> The lambda calculus is based on term rewriting. Church-Rosser
>> equivalence is based on pattern matching of lambda terms. By applying
>> the rewrite rules for term expansion, substitution and reduction, if
>> two different terms can be rewritten to be in the same pattern form -
>> modulo variable names - then they are considered equivalent. It
>> doesn't matter what the terms "mean" or what they might do if
>> "executed".
>
>There is no conflict here: Equivalence of lambda terms is undecidable.

I think it depends on which calculus you're working in.


>Equivalence (more specifically beta equivalence), as you say, is defined
>by saying that two terms t1 and t2 are equivalent if there is a term t3
>such that t1 => t3 and t2 => t3, where => is the (reflective and
>transitive) reduction relation. But if t1 or t2 both are without normal
>forms, you can not decide their equivalence by comparing a finite number
>of terms. Hence, equivalence is only semi-decidable.

Actually I'm talking about alpha equivalence: e.g., \x.x <==> \y.y
Identical form modulo variable names.

>There are various other ways to define equivalence of lambda terms
>that attempt to equivalate more terms, such as all terms that have no
>head normal form or weak head normal form. But these are also
>undecidable. You can define equivalences that equate fewer terms and
>which are decidable, but these are not useful as program equivalence.
>Alpha equivalence and eta equivalence springs to mind.

AFAIK, for the _untyped_ lambda calculus, head normal forms are not an
issue and you can prove or disprove alpha equivalence of forms by
recursive substitution and applicative order reduction until no more
substitutions or reductions are possible and then examining the
resultant forms.


Happy to be proven wrong though. References?
George

0 new messages