Obfuscation in the .NET universe

Paul

Jun 13, 2002, 2:36:20 PM
Obfuscation in the .NET Universe
Paul Tyma
www.preemptive.com
Dotfuscator(tm)

This paper is an experience report on 5 years of one developer
creating obfuscation products for the Java and now .NET environments.
It tells you what we did (what we learned) and what we're doing now.
Over that time I've talked to hundreds of obfuscator users and
written code to meet their needs. My apologies if things sound a bit
like an advertisement (I'm not a marketing guy, I'm a developer); it's
only because, like most programmers, I'm passionate about the software
I create and I've applied my experience to meet the needs of our
customers.

Some Background: My name is Paul Tyma. I was the lead architect on
PreEmptive Solutions' DashO line of obfuscation/optimization Java
products and am an engine developer on the new Dotfuscator line of
.NET obfuscators. DashO (named after the gcc compiler's command line
option for optimization) sprang from my Ph.D. research and started
life as a bytecode optimizer, not an obfuscator. (For purposes of this
article, remember DashO is Java, Dotfuscator is .NET.) Back in the day
when Java was interpreted, a bytecode optimizer could make significant
gains. Unfortunately (for us), JIT technology quickly advanced, and
some JITs broke when trying to run optimized code. The worst part was
that not all JITs broke, only naively written ones (actually
Microsoft's JVM ran anything we threw at it). But if we released code
that didn't run on some JITs, we would look like the bad guys, not the
JIT company.

So, DashO removed the bytecode-optimization features and retained its
ability to obfuscate identifier names and prune unused code. It also
kept design-level optimization that didn't interfere with JITs. DashO
has enjoyed wild success. In the beginning, we had several important
competitors. DashO has prevailed and remains the only real enterprise
obfuscation solution in the Java space. Sun runs their JDK SSL
libraries through DashO, RSA runs their encryption libs, and then
there are Pointbase, InstallAnywhere, IBM; the list goes on.

Dotfuscator is taking the reverse path; that is, it's starting with
obfuscation as a goal, not optimization. The hot topic in the .NET
space is protecting code. We know how to do that; we were doing it
long before .NET was public. Unlike Java, .NET will have only a few
important VMs we must appease, allowing us to do better code
protection and optimization.

Our .NET obfuscator, named Dotfuscator, includes identifier renaming,
control-flow obfuscation, string encryption, and unused
type/method/field removal. Having learned from our mistakes with
DashO, we know the issues customers come across, and we created
Dotfuscator as a complete obfuscation package from the get-go.

As an aside, we also learned something important about product naming.
Although "DashO" is a darn fine name for an optimizer that only geeks
would understand, it wasn't a great name for the public. A lot of
people didn't get it right off, and being clever but misunderstood
isn't a grand thing. We spent too much time saying "DashO: application
size reducer, optimizer, obfuscator". You might like the name
"Dotfuscator" or you might not; either way, it's pretty plain what the
application does - a "dot net obfuscator". We don't have to scramble
to fit in a lot of words to tell people what it does (and programmers
generally know what obfuscation is now; they didn't when Java came
out).

+ Obfuscation

"When obfuscation is outlawed, only outlaws will sifjdifdm wofiefiemf
eifm."

Obfuscation is the technology of shrouding the facts. It's not
encryption, but in the context of .NET (or Java) code, it might be
better. Early in Java's life, several companies produced encrypting
class loaders to fully encrypt Java classes. Decryption was done
just in time, prior to execution. Although this made classes
completely unreadable, this methodology suffered from a classic
encryption flaw: it needed to keep the decryption key with the
encrypted data. Therefore, an automated utility could be created to
decrypt the code and put it out to disk. Once that happens, the fully
unencrypted, unobfuscated code is in plain view.

As another comparison, we could compare encryption to locking a
six-item meal into a lockbox. Only the intended diner (i.e. the VM)
has the key, and we don't want anyone else to know what they're going
to eat. Unfortunately, if someone can pick the lock (or find the key
hidden on the bottom of the box), the food is in plain view.
Obfuscation works more like putting the six-item meal into a blender
and sending it to the diner in a baggie. Sure, everyone can see the
food in transit, but besides a lucky pea or some beef-colored goop,
they don't know what the original meal was. The diner still gets the
intended delivery, and it still provides the same nutritional value as
it did before (luckily, VMs aren't picky on taste). The trick of an
obfuscator is to confuse observers while still giving VMs the same
delivery.

Without argument, obfuscation (or even encryption) is not 100 percent
protection. Even compiled C++ can be disassembled. If a hacker is
perseverant enough, they can find the meaning of your code. The goal
of obfuscation is certainly to knock out the 90 percent of hackers who
aren't willing to go the extra step. From there, the returns on
investment dissipate. It now takes exponentially more effort to thwart
progressively fewer decompilers. Although we were told decompilers
would rename unprintable identifier names (to something readable), in
truth, not too many did. Far fewer ever even attempted to untangle
control flow. The goal became to stop all casual hackers and as many
serious hackers as possible up until some level of
return-on-investment. That level varied between customer requirements
and new decompiler releases.

+ Identifier Renaming

As expected of any good obfuscator, Dotfuscator renames all program
identifiers to small meaningless names. With DashO, we played around
with creating clever renames (unprintables, etc.) but ended up just
renaming to small, alphabetic letters. Instead of clever names, we
invented and patented an algorithm called "overload induction" that
has been in use in DashO since its inception. Overload induction works
by identifying colliding sets of methods across inheritance
hierarchies and renaming such sets according to some enumeration (e.g.
the alphabet). Because separate colliding sets are identified and the
enumeration starts at the beginning each time, method overloading is
induced on a grand scale. (Note that Dotfuscator can overload-induct
fields too, based solely upon their type, but the algorithm is far
less interesting since fields cannot be overridden.)
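
To make the idea concrete, here is a minimal sketch of the renaming step in Java. It is an illustration under simplifying assumptions, not the patented algorithm: it looks at a single class, ignores inheritance entirely (so there are no colliding sets to compute across a hierarchy), and treats two methods as colliding only when their parameter lists match. The class name OverloadInductionSketch, the shortName enumeration, and the descriptor strings are all made up for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class OverloadInductionSketch {

    // Enumerates short names: a, b, ..., z, aa, ab, ...
    static String shortName(int index) {
        StringBuilder sb = new StringBuilder();
        for (int i = index; i >= 0; i = i / 26 - 1) {
            sb.insert(0, (char) ('a' + i % 26));
        }
        return sb.toString();
    }

    // Each descriptor looks like "getName()" or "setName(String)".
    // Returns a map from the old descriptor to the renamed descriptor.
    static Map<String, String> rename(List<String> descriptors) {
        // For every parameter list, the short names already handed out.
        Map<String, Set<String>> takenBySignature = new HashMap<>();
        Map<String, String> renames = new LinkedHashMap<>();
        for (String descriptor : descriptors) {
            String signature = descriptor.substring(descriptor.indexOf('('));
            Set<String> taken =
                    takenBySignature.computeIfAbsent(signature, k -> new HashSet<>());
            // Restart the enumeration at "a" for every method, so that
            // methods with different parameter lists end up sharing a name.
            int i = 0;
            while (taken.contains(shortName(i))) {
                i++;
            }
            taken.add(shortName(i));
            renames.put(descriptor, shortName(i) + signature);
        }
        return renames;
    }

    public static void main(String[] args) {
        List<String> methods =
                List.of("getName()", "setName(String)", "getAge()", "setAge(int)");
        rename(methods).forEach((from, to) -> System.out.println(from + " -> " + to));
    }
}
```

With these four methods, three of them come out named "a" (a(), a(String), a(int)) and only getAge() needs a second name, b(). A plain one-to-one renamer would have used four distinct names.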

This effect is far stronger than normal one-to-one renaming for
several reasons. I'm amused by claims I've read that overload
induction provides no obfuscation value. Firstly, overload induction
raises the 90% casual hacker number listed above. It takes more work
to undo overload induction than plain renaming, so fewer people will
go to the trouble. It does protect more programs. Secondly, in order
to undo overload induction effectively, a decompiler needs to
implement overload induction itself (ironically, violating our patent
in the process). That's a lot of work and "college-kid" free
decompilers rarely stay motivated to get that far. Alternatively, a
simpler renaming scheme can be used to undo it to a lesser degree.
Regardless of the tack taken, overload induction is provably
irreversible. The best overload induction undo-er will come out with a
different number of unique methods than the original source code
contained. It cannot be undone all the way because overload induction
destroys original overloading relationships. In undone overload
induction, there will be no overloaded methods. If we assume the grand
designers of OO technology implemented overloaded methods as a way of
creating "more readable code", then by virtue of removing that
ability, the code has less information in it than before.

Apart from obfuscation, overload induction also reduces the final
program size of obfuscated code. Because of its heavy reuse of
identifier names, it saves significant space. Up to 10% of the size
savings in DashO'd programs was directly attributable to renaming (5%
because it was OI renaming).

We have released Dotfuscator Community Edition free for
non-commercial use. This is not crippleware; it's a full-fledged
renaming obfuscator that incorporates our overload induction renaming
system. It's the strongest renamer in the business, and it's free. We
offer this because although renaming is effective, it really doesn't
stand up in the enterprise or if you're releasing a significant
product. We've seen this difference in the Java space. Not only will
serious customers need stronger protection, they can't live without
the features surrounding renaming that we offer in our professional
edition.

Our Professional version adds incremental obfuscation, unused-code
pruning, control flow obfuscation, and more. Before you buy an
obfuscator, see what you're getting. Plenty of our competitors are
releasing just a renamer and charging plenty for it. Heck, we're
giving that away for free.

+ Incremental Obfuscation

Our customers in the Java space quickly told us they needed this
feature. The problem was that they distributed their obfuscated code
and their customers found a bug in their product. They wanted to issue
just a patch to fix their customers' problems, but because of
obfuscation this wasn't always possible. Fixing bugs in their code
would often create or delete classes, methods or fields. This action
caused subsequent obfuscation runs to rename things slightly
differently. Unfortunately, how and what was renamed differently was a
mystery.

Dotfuscator (and DashO) includes incremental obfuscation to combat
that problem. Dotfuscator creates a map file to tell you how it did
renaming. So if you get a stack trace from your customer, you can
match it to the mapfile and find out what unobfuscated class your bug
is in (obviously mapfiles are to be treated as confidential by you).
However, that same mapfile can be used as input to Dotfuscator on
subsequent runs to dictate that renames used previously should be used
again wherever possible.

So, if you release your product and then patch a few classes,
Dotfuscator can be run in such a way as to mimic its previous renaming
scheme. That way, you can issue just the patched classes to your
customers.
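
The role of the map file can be sketched in a few lines of Java. This is only a toy model under stated assumptions: identifiers are plain strings, the map is an in-memory Map rather than a file, and names of deleted identifiers are kept reserved so a shipped, unpatched class can never see an old name pointing at something new. None of the names below (IncrementalRenameSketch, rename, originalNameOf) come from Dotfuscator or DashO.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IncrementalRenameSketch {

    // Enumerates short names: a, b, ..., z, aa, ab, ...
    static String shortName(int index) {
        StringBuilder sb = new StringBuilder();
        for (int i = index; i >= 0; i = i / 26 - 1) {
            sb.insert(0, (char) ('a' + i % 26));
        }
        return sb.toString();
    }

    // Renames identifiers, reusing every rename recorded in previousMap.
    // Pass an empty map for the very first release.
    static Map<String, String> rename(List<String> names, Map<String, String> previousMap) {
        Map<String, String> result = new LinkedHashMap<>();
        // Reserve all previously issued names, even for deleted identifiers.
        Set<String> taken = new HashSet<>(previousMap.values());
        for (String name : names) {
            if (previousMap.containsKey(name)) {
                result.put(name, previousMap.get(name));
            }
        }
        // Only identifiers that are new in this release get fresh names.
        int next = 0;
        for (String name : names) {
            if (result.containsKey(name)) {
                continue;
            }
            while (taken.contains(shortName(next))) {
                next++;
            }
            taken.add(shortName(next));
            result.put(name, shortName(next));
        }
        return result;
    }

    // Reverse lookup, e.g. for decoding a customer's stack trace.
    static String originalNameOf(String obfuscated, Map<String, String> map) {
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (e.getValue().equals(obfuscated)) {
                return e.getKey();
            }
        }
        return null; // not an obfuscated name we issued
    }

    public static void main(String[] args) {
        Map<String, String> v1 = rename(List.of("Lexer", "Parser", "Token"), Map.of());
        List<String> patched = List.of("Helper", "Lexer", "Parser", "Token");
        System.out.println("fresh run:       " + rename(patched, Map.of()));
        System.out.println("incremental run: " + rename(patched, v1));
        System.out.println("stack frame b is " + originalNameOf("b", v1));
    }
}
```

In the fresh run, adding Helper shifts every existing class to a new letter, which is exactly the mystery described above. In the incremental run, Lexer, Parser and Token keep a, b and c, and only Helper gets a new name (d).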


+ String Encryption

Dotfuscator implements string encryption with decryption at runtime.
As I mentioned before, any encryption (or specifically decryption)
done at runtime is inherently insecure. That is, a smart hacker can
eventually break it, but for strings present in customer code, we
found it worthwhile. Effectively, we apply a simple encryption
algorithm to whichever strings in your application you choose.
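
As a rough picture of what "a simple encryption algorithm" with runtime decryption can look like, here is a small Java sketch. It uses a repeating-key XOR purely as a stand-in; the post does not say which algorithm Dotfuscator applies, and the key, class name, and method names here are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StringEncryptionSketch {

    // Assumption for the sketch: a fixed key that ships inside the program.
    // This is exactly why the scheme only deters casual readers.
    private static final byte[] KEY = {0x5A, 0x13, 0x7C, 0x2E};

    // XOR is its own inverse, so one routine both encrypts and decrypts.
    private static byte[] xor(byte[] data) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ KEY[i % KEY.length]);
        }
        return out;
    }

    // Done once by the obfuscator; the result replaces the string literal.
    static byte[] encrypt(String plain) {
        return xor(plain.getBytes(StandardCharsets.UTF_8));
    }

    // Done by the program every time the string is needed.
    static String decrypt(byte[] stored) {
        return new String(xor(stored), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stored = encrypt("Invalid License Key");
        System.out.println("stored bytes: " + Arrays.toString(stored));
        System.out.println("at runtime:   " + decrypt(stored));
    }
}
```

The stored bytes no longer contain the readable message, so a text search over the binary comes up empty, while the program still prints "Invalid License Key" when it needs to.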

Let's face it, if a hacker wants to get into your code, he doesn't
blindly start searching renamed types. He probably does a nice grep on
"Invalid License Key" which points him right to the type where license
handling is performed. Searching on strings is incredibly easy. String
Encryption raises the bar for the casual hacker and deters that many
more non-serious hackers. This algorithm incurs a tiny performance
penalty at runtime but as with pretty much everything else in
Dotfuscator, this option is fully configurable.

+ Control Flow Obfuscation

Control flow obfuscation is a strong form of protection, but comes at
a cost. Whereas renaming and pruning can actually speed up execution,
control flow obfuscation can degrade it.

The most popular academic form of control flow obfuscation comes in
the form of Opaque Predicates. (See Dr. Christian Collberg's website
for many good references including
http://www.cs.arizona.edu/~collberg/Research/Publications/CollbergThomborsonLow98a/).
Collberg introduced the terms used to evaluate obfuscating transforms,
including potency, resilience, cost, and stealth (foundational stuff).
The term control-flow obfuscation is a bit unfortunate because it's so
broad. Any form of introduction or reduction of control flow is
technically obfuscation.
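
For readers who haven't met the term, an opaque predicate is a condition whose outcome the obfuscator knows in advance but which is not obvious to someone reading the code. A minimal Java sketch follows; it relies on the fact that the product of two consecutive integers is always even (which still holds under Java's wrap-around int arithmetic). The method names are made up, and this is a textbook-style illustration rather than anything Dotfuscator is stated to do.

```java
final class OpaquePredicateSketch {

    // Always true: x * (x + 1) is a product of consecutive integers, so its
    // lowest bit is 0. Overflow wraps modulo 2^32 and does not change parity.
    static boolean alwaysTrue(int x) {
        return ((x * (x + 1)) & 1) == 0;
    }

    // The original, straightforward method.
    static int max(int a, int b) {
        return a > b ? a : b;
    }

    // Same behavior, but the real work hides behind the opaque predicate
    // and a bogus branch is added that can never execute.
    static int maxObfuscated(int a, int b) {
        if (alwaysTrue(a ^ b)) {
            return a > b ? a : b;
        }
        return a + b; // dead code, present only to mislead a reader
    }

    public static void main(String[] args) {
        System.out.println(max(3, 9) + " " + maxObfuscated(3, 9));
        System.out.println(max(-4, -8) + " " + maxObfuscated(-4, -8));
    }
}
```

Both calls print the same pair of values, because the extra branch never runs. The cost mentioned above is visible even here: the obfuscated version evaluates an extra multiplication and test on every call.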

Dotfuscator 1.1 (professional edition) is released with a strong
control flow obfuscation system in place. Here is a quick example of
what I mean:

BEFORE:
// Code snippet copyright 2000, Microsoft Corp, from the WordCount.cs sample app
public int CompareTo(Object o) {
    int n = occurrences - ((WordOccurrence)o).occurrences;
    if (n == 0) {
        n = String.Compare(word, ((WordOccurrence)o).word);
    }
    return (n);
}

This is a pretty straightforward code snippet. Now, after obfuscation,
it decompiles to the following (using the Exemplar or "Anakrino"
decompiler, btw - all others I tried crashed):

public virtual int a(object A_0) {
int local0;
int local1;

local0 = this.a - (c) A_0.a;
if (local0 != 0)
goto i0;
goto i1;
while (true) {
return local1;
i0: local1 = local0;
}
i1: local0 = System.String.Compare(this.b, (c) A_0.b);
goto i0;
}

I love that. (especially the while (true) return..).
The version after control flow obfuscation is a complete mess and it
came from a simple code example (one of Microsoft's distributed code
examples).
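
Java has no goto, but the jumbled flow above can be transcribed into a loop around a switch to check that it really computes the same thing as the original. The sketch below does that; the states mirror the labels i0 and i1 in the decompiler output. It is only a hand translation for illustration: the WordOccurrence class is a stand-in written for this example, and Java's String.compareTo is used in place of .NET's String.Compare, so the two are analogous rather than identical.

```java
final class JumbledFlowSketch {

    // Stand-in for the WordOccurrence type from the WordCount sample.
    static final class WordOccurrence {
        final String word;
        final int occurrences;

        WordOccurrence(String word, int occurrences) {
            this.word = word;
            this.occurrences = occurrences;
        }

        // The readable original.
        int compareToOriginal(WordOccurrence o) {
            int n = occurrences - o.occurrences;
            if (n == 0) {
                n = word.compareTo(o.word);
            }
            return n;
        }

        // The decompiled flow, with each goto target turned into a state.
        int compareToJumbled(WordOccurrence o) {
            int local0 = 0;
            int local1 = 0;
            int state = 0;
            while (true) {
                switch (state) {
                    case 0: // method entry
                        local0 = occurrences - o.occurrences;
                        state = (local0 != 0) ? 1 : 2;
                        break;
                    case 1: // label i0
                        local1 = local0;
                        state = 3;
                        break;
                    case 2: // label i1
                        local0 = word.compareTo(o.word);
                        state = 1;
                        break;
                    default: // top of the while (true) loop
                        return local1;
                }
            }
        }
    }

    public static void main(String[] args) {
        WordOccurrence a = new WordOccurrence("apple", 3);
        WordOccurrence b = new WordOccurrence("pear", 3);
        System.out.println(a.compareToOriginal(b) == a.compareToJumbled(b)); // prints true
    }
}
```

Tracing the states by hand gives the same path as the decompiled listing: entry, then either straight to i0 or via i1 to i0, then the return.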

The interesting part is that I grabbed several decompilers in order to
make that example. When I fed them our control-flow-obfuscated code,
all of the decompilers crashed like mad. In fact, it took me 5 tries
with different methods until finally one of the decompilers actually
produced output and didn't crash. Our control flow algorithms do a
pretty good job of messing up code while still maintaining complete
execution consistency and code verification.

Keep in mind that the above code looks majorly messed up, but that's
only to the decompiler. We're running thousands of classes through
control-flow obfuscation that won't decompile anymore but run and
verify 100%.

+ Breakage

In the early days of Java obfuscation, we started an arms race with
the decompiler companies. We set out to crash their decompilers, not
just screw up decompilation. That turned out to be an impossible task
for 2 reasons. One, decompiler guys were darn smart and fixed the junk
we gave them pretty fast. Two, there were more Java VMs than there
were decompilers, so it became very difficult to find obfuscations
that broke all decompilers but didn't break some primitive VM
(remember, even one broken VM, regardless of whose fault it was,
became our problem).

In the .NET space, things look a bit different. First off, our control
flow obfuscator is already breaking decompilers out of the box. This
is a strong testament to the primitiveness of the existing
decompilers. However, eventually, someone is going to need a Ph.D.
dissertation and some better ones will come along.

We literally have "algorithms in wait" with which we'll be expanding
Dotfuscator in future releases. I was pretty sure we'd never be able
to break all .NET decompilers, but with the strength our control-flow
obfuscation already has, now I'm not so sure. We'll be implementing
algorithms as needed.

+ Optimization

Dotfuscator (unlike DashO) currently includes nothing specifically for
code optimization. However, the opportunity is far greater since we
have a very limited number of VMs and can target performance
transforms that enhance those VMs.

This is inevitably a positive direction for Dotfuscator. Optimization
algorithms can inherently obfuscate too, so we'll see.


+ Pruning

I've never seen customer reaction to any feature like I've
consistently seen with pruning. When a customer hears it, there's
often a "big deal" attitude. After they try it, their disposition does
a 180 and they claim they absolutely can't live without it.

It seems odd that unused-code removal can actually do anything – I
mean, who writes code they don't use? Well, the answer is all of us.
What's more, we all use libraries and types written by other people
that were written to be reusable. Reusable code implies there is
contingent code that handles many cases – however, in any given
application, you typically only use one or two of those many cases.
Pruning figures that out and rips out all the unused code (from
compiled IL, we never touch source).
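
At its core, this kind of pruning is a reachability question: start from the entry points, follow every call, and whatever is never reached can go. Here is a minimal Java sketch of that idea. It assumes the call graph has already been extracted (by hand here, as a map of made-up method names); a real tool works on compiled code and has to deal with virtual dispatch, fields, and types, none of which is modeled.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class PruningSketch {

    // Everything transitively callable from the given roots.
    static Set<String> reachable(Map<String, List<String>> callGraph,
                                 Collection<String> roots) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            String method = work.pop();
            if (!seen.add(method)) {
                continue; // already visited
            }
            for (String callee : callGraph.getOrDefault(method, List.of())) {
                work.push(callee);
            }
        }
        return seen;
    }

    // Every known method that is NOT reachable, i.e. safe to remove.
    static Set<String> prunable(Map<String, List<String>> callGraph,
                                Collection<String> roots) {
        Set<String> all = new TreeSet<>(callGraph.keySet());
        for (List<String> callees : callGraph.values()) {
            all.addAll(callees);
        }
        all.removeAll(reachable(callGraph, roots));
        return all;
    }

    public static void main(String[] args) {
        // Hypothetical application using one corner of a reusable library.
        Map<String, List<String>> graph = Map.of(
                "App.main", List.of("Json.parse"),
                "Json.parse", List.of("Json.readToken"),
                "Json.readToken", List.of(),
                "Json.prettyPrint", List.of("Json.indent"),
                "Json.indent", List.of(),
                "Json.toXml", List.of());
        System.out.println(prunable(graph, List.of("App.main")));
    }
}
```

Here the application only ever parses, so the pretty-printer, its helper, and the XML converter are all reported as removable: half of the six methods in the graph.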

Pruning's most visible result is the reduction in size of the
executable. For many applications that are distributed on CD-ROM, the
size of the application isn't often a serious worry. However, more and
more applications are involving a networked/distributed component or
written for embedded systems. In those cases, every byte counts.

The size reduction caused by pruning is literally staggering. Some
customers have reported to us that they received a 70% size reduction
of their executable. I imagine those customers use large third-party
libraries that were heavily trimmed. In our tests, we see a solid 40%
reduction using DashO (this is pruning, renaming, and general metadata
reduction). Our sample size for Dotfuscator isn't as mature as
DashO's, but the results are looking similar (there's no reason they
shouldn't; it's very much the same organization and theory). This size
reduction in the executable is of course related to size reduction of
individual types and, subsequently, individual objects. Therefore,
pruned programs tend to run in less memory too.

The report generated after pruning is also seriously telling of what
code you have lying around that you aren't using. As with the rest
of Dotfuscator, pruning is fully configurable. You can turn it on or
off and configure it down to the method level. A classic question
arises in pruning (and renaming) of dynamically loaded types and use
of reflection. Given that we do static analysis, there simply isn't
any way to detect how dynamic language features will be used at
runtime (heck, you could have a user type in a method name from a
prompt). The exact same issues were in the Java space and turned out
to be a minimal problem. We provide significant customization features
to allow the programmer to specify which types or methods are used
dynamically. You can tell pruning or renaming (or control flow) to
leave given methods/types/fields alone. The level of customization is
as deep as you want to get. In other words, we haven't met an
application in the Java or .NET space that couldn't be run through our
products.
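
The reflection problem is easy to reproduce in a few lines of Java. In the sketch below the method name lives in a string that only exists at runtime, which is precisely what static analysis cannot follow. Plugin and RenamedPlugin are hypothetical classes written for the example; RenamedPlugin simulates by hand what a renamer would produce if nobody told it to leave greet() alone.

```java
import java.lang.reflect.Method;

final class ReflectionKeepSketch {

    // The class as the programmer wrote it.
    public static final class Plugin {
        public String greet() {
            return "hello";
        }
    }

    // The same class after a renamer has shortened the method name.
    public static final class RenamedPlugin {
        public String a() {
            return "hello";
        }
    }

    // Looks up a no-argument public method by name and calls it.
    // Returns null when no method of that name exists.
    static Object callByName(Object target, String methodName) {
        try {
            Method m = target.getClass().getMethod(methodName);
            return m.invoke(target);
        } catch (NoSuchMethodException e) {
            return null; // the name the caller asked for is gone
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Imagine "greet" came from a config file or a user prompt.
        String requested = "greet";
        System.out.println(callByName(new Plugin(), requested));        // hello
        System.out.println(callByName(new RenamedPlugin(), requested)); // null
    }
}
```

Excluding greet() from renaming (and from pruning, since nothing calls it statically) is what keeps the first call working after obfuscation.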

+ Details (ILASM)

I find it odd to address this, but it seems one of our competitors
(the one who was charging $2500 per developer on your project for his
renamer before he realized that was ridiculous – heh) is spreading
misinformation. Below any obfuscation system is the logic that breaks
apart the assemblies and constructs objects out of the types, methods,
fields, etc. From there, the obfuscator can operate on those objects
and do its work.

The initial reaction (and how we did it in DashO) was to parse the
files ourselves. The format is well documented in .NET (as it was in
Java) and this is pretty simple parsing grunt-work. Now, any parsing
of binary files is, by definition, a bits-and-bytes coding job assured
to give you hours of trivial debugging. In our discussions with
Microsoft, we discussed the option of allowing their already-built
infrastructure to take care of this layer for us.

This sounded pretty great. With Java, it seems that every nontrivial
release of the JDK involves them changing the class file format in
some small way. This sends us into some quick coding to find their
changes, implement them, and get patches to our customers (the Java
file format spec is published, but not in real time).

In contrast, Microsoft publishes the grammar for their ILDASM output.
In addition, we've got direct lines to the guys at MS doing ILASM and
more. We're planning to have updated grammars for ILDASM output before
it's released. By relying on ILDASM, we're relying on Microsoft for
that abstraction (which saves us a ton of bugs) and completely
removing our need to "dig through" assemblies generated by new
releases to figure out which byte changed. This option has worked
great for us and sped up our development as expected (almost like
relying on an API library). Any changes to the assembly formats will
be transparent to us since the ILDASM program will be updated to the
new format implicitly.

In addition, I've seen claims that using ILDASM somehow causes
obfuscators to introduce bugs into your code. I have no idea where
this falsehood came from, but bugs are created by bugs, not by using
existing levels of abstraction to get the job done.

Our goal is always to output absolutely, positively 100% verifiable
(peverify) and execution equivalent code. If we fall short of that
goal, we have a bug and we fix it. We've been doing this 6 years, we
know about bugs in obfuscators and have a zero tolerance for them. We
have run and have customers who are running extremely large
applications through Dotfuscator with no problems.

I hope this clears up some issues with how obfuscation works for us in
the Java and now .NET worlds.

Paul Tyma
ptyma at preemptive dot com
www.preemptive.com
Dotfuscator(tm)
