Obfuscation (software)

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 83.30.35.33 (talk) at 19:53, 28 September 2006 (haakon you idiot, if you don't know what it is don't remove it). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.


Obfuscated code is source code that is (usually intentionally) very hard to read and understand. Some languages are more prone to obfuscation than others; C, C++ and Perl are most often cited as easily obfuscated languages. Macro preprocessors are often used to create hard-to-read code by masking the standard language syntax and grammar from the main body of code. The term shrouded code has also been used.

There are also programs known as obfuscators that may operate on source code, object code, or both, for the purpose of deterring reverse engineering.

Recreational obfuscation

Code is sometimes obfuscated deliberately for recreational purposes. There are programming contests which reward the most creatively obfuscated code: the International Obfuscated C Code Contest, Obfuscated Perl Contest, International Obfuscated Ruby Code Contest and Obfuscated PostScript Contest.

There are many varieties of interesting obfuscations ranging from simple keyword substitution, use/non-use of whitespace to create artistic effects, to clever self-generating or heavily compressed programs.
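As a minimal sketch of keyword substitution with the C preprocessor, the fragment below defines the same function twice: once in ordinary C and once with the language syntax masked by macros. The macro names (begin, end, unless, forever, gives) and the function names are invented for this illustration and are not taken from any contest entry.

```c
/* Hypothetical macro names, chosen for illustration only. */
#define begin {
#define end }
#define unless(c) if (!(c))
#define forever for (;;)
#define gives return

/* Plain version: sum of the integers 1..n. */
int sum_plain(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}

/* The same computation with the C syntax hidden behind the macros above. */
int sum_masked(int n)
begin
    int total = 0, i = 0;
    forever begin
        unless (i < n) gives total;
        total += ++i;
    end
end
```

Both functions return the same value for every non-negative n; only the surface syntax differs, which is what makes this kind of obfuscation cheap to apply and also cheap to undo by running the preprocessor.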

Short obfuscated Perl programs that print "Just another Perl hacker" or a similar phrase are often found in the signatures of Perl programmers.

Examples

Take this infamous example from Internet lore:


#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1)
  :0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);}

Although unintelligible at first glance, it is a legal C program which, when compiled and run, generates the 12 verses of The Twelve Days of Christmas. All the strings required for the poem are embedded in the code in an encoded form. The code then iterates through the 12 days, printing what each verse needs.
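The encoding can be read off the final string in the listing: it is 62 characters long, and the expression putchar(31[a]) prints the character 31 positions after the one that matched. The following is a readable sketch of that substitution step only, not a rewrite of the whole program. The function names are hypothetical, and restricting the search to the first 31 characters is an assumption of this sketch.

```c
#include <string.h>

/* The 62-character key from the listing above: a character found at
   index i among the first 31 stands for the character at index i + 31. */
static const char KEY[] =
    "!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry";

/* Decode one character. Characters not present in the first half of the
   key are passed through unchanged (an assumption of this sketch). */
char decode_char(char c)
{
    const char *p = memchr(KEY, c, 31);
    return p ? p[31] : c;
}

/* Decode src into dst, which must hold at least strlen(src) + 1 bytes. */
void decode(const char *src, char *dst)
{
    while (*src)
        *dst++ = decode_char(*src++);
    *dst = '\0';
}
```

Applied to the start of the long encoded string, decode turns "@n'+,#'" into "On the " and "*{}w+" into "first"; the '/' characters in the encoded text separate the individual phrases.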

Another example is a program whose source listing was formatted to resemble an empty tic-tac-toe board. Each pass through the program modified the source code to show a turn in the game, to be executed for the next move.

Yet another example is this short program that generates mazes of arbitrary length:


char*M,A,Z,E=40,J[40],T[40];main(C){for(*J=A=scanf(M="%d",&C);
--            E;             J[              E]             =T
[E   ]=  E)   printf("._");  for(;(A-=Z=!Z)  ||  (printf("\n|"
)    ,   A    =              39              ,C             --
)    ;   Z    ||    printf   (M   ))M[Z]=Z[A-(E   =A[J-Z])&&!C
&    A   ==             T[                                  A]
|6<<27<rand()||!C&!Z?J[T[E]=T[A]]=E,J[T[A]=A-Z]=A,"_.":" |"];}

Note the shape of the corridors in the program. Modern C compilers place string literals in read-only memory, so the program's attempt to overwrite one fails. This can be avoided by changing the first line to


char M[3],A,Z,E=40,J[40],T[40];main(C){for(*J=A=scanf("%d",&C);

or, with older versions of gcc (the GNU Compiler Collection), by using the flag -fwritable-strings, which was removed in GCC 4.0.

Obfuscation by code morphing

Code morphing differs from other types of obfuscation in that it transforms code at the level of CPU instructions rather than at the source level. The x86 instruction set is redundant: the same operation can be carried out by many different instruction sequences. A code-morphing protector breaks the protected code into individual instructions or small groups of instructions and replaces them with other sequences that produce the same end result.
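The idea of replacing an operation with a different sequence that gives the same result can be illustrated at source level. The sketch below is only an analogy written in C: real code morphing rewrites machine instructions, and the two identities used here were chosen for this illustration rather than taken from any particular protector. All function names are hypothetical.

```c
#include <stdint.h>

/* Straightforward addition. */
uint32_t add_plain(uint32_t x, uint32_t y)
{
    return x + y;
}

/* Equivalent sequence: XOR adds the bits without carries, and the
   AND shifted left by one supplies the carries. */
uint32_t add_morphed(uint32_t x, uint32_t y)
{
    return (x ^ y) + ((x & y) << 1);
}

/* Straightforward subtraction. */
uint32_t sub_plain(uint32_t x, uint32_t y)
{
    return x - y;
}

/* Equivalent sequence: in modular arithmetic, x - y equals x + ~y + 1. */
uint32_t sub_morphed(uint32_t x, uint32_t y)
{
    return x + ~y + 1u;
}
```

Each pair computes the same value for all 32-bit inputs, including those that wrap around. A morphing tool would apply many such substitutions repeatedly, so that the resulting instruction stream bears little resemblance to the compiler's output.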

Code morphing is a multilevel technology containing hundreds of unique code transformation patterns. In addition, it includes a special layer that transforms some instructions into virtual machine instructions (similar to P-code). Code morphing turns binary code into a form that no longer resembles normally compiled code and is intended to hide the execution logic of the protected code.

There is no concept of code decryption in this approach. Protected code blocks are always in an executable state and are executed in their transformed form. The original code is not retained, and restoring it is claimed to be an NP-hard problem.

The weak point of such a scheme is that it significantly increases the size of a program and reduces its speed. However, an author protecting an application usually does not need to transform all of its code. It is enough to protect only the critical parts, such as those responsible for serial number verification, trial expiration dates, and other evaluation restrictions. The rest of the application code remains intact and runs at its original speed.

Below is a code sample generated by Delphi and a partial (the full listing contains over 500 instructions) listing of the transformed code.

Source code:


writeln('Test OK');

After compilation:

mov eax, [$004092ec]
mov edx, $00408db4
call  @Write0LString
call  @WriteLn
call  @_IOTest

After the code transformation (partial):

db 3
add al, $30
xlat
call +$000025b2
jmp +$00000eec
call +$00000941
or al, $4a
scasd
call -$304ffbe9
rol eax, $14
mov edi, [ebx]
jmp +$00001738
mov ebx, eax
shr ebx, $03
push ebx
jmp +$0001b5e
call -$000001eb
jmp +$00003203
jmp +$00005df8
call +$00000910
adc dh, ah
fmul st(7)
adc [eax], al
les eax, [ecx+$0118bfc0]
stosb

Obfuscation tools

A wide variety of tools exist to perform or assist with code obfuscation. These include experimental research tools created by academics, hobbyist tools, commercial products written by professionals, and open-source software.

Software obfuscation tools include specialized obfuscators that demonstrate a relatively limited technique, more general obfuscators which attempt a more thorough obfuscation, and combined-function tools which obfuscate code as part of a larger goal such as software licensing enforcement.

Obfuscation and information-hiding

One definition of "code obfuscation" is a set of transformations on a program that preserve the same black-box specification while making the internals difficult to reverse-engineer. There turn out to be many such transformations.

For example, languages such as Java, C#, and Lisp keep much of a program's symbol table within the compiled output. One common obfuscation is to rename every class from something descriptive, like "Encryption_Index", to a meaningless sequence such as "rb". The class methods can be renamed to a(), b(), and so on.
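A renaming pass needs a supply of short, meaningless names. One simple way to produce them, shown here as a sketch and not as the scheme of any particular obfuscator, is to number the identifiers and write each number in bijective base 26 using lowercase letters. The function name short_name is hypothetical.

```c
#include <stddef.h>

/* Write the index-th short name into out:
   0 -> "a", 25 -> "z", 26 -> "aa", 27 -> "ab", and so on.
   out must hold at least 8 bytes. */
void short_name(unsigned index, char *out)
{
    char tmp[8];
    size_t n = 0;
    unsigned long long k = (unsigned long long)index + 1; /* 1-based */

    /* Produce letters from least significant to most significant. */
    while (k > 0 && n < sizeof tmp - 1) {
        k--;
        tmp[n++] = (char)('a' + k % 26);
        k /= 26;
    }

    /* Reverse them into the output buffer. */
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
}
```

Under this numbering the name "rb" from the example above is what the identifier with index 469 would receive. A real renamer would also have to skip reserved words and keep a map from old names to new ones so that every reference is updated consistently.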

When writing source code, programmers generally create a great deal of structure, according to rules from structured programming, OOP, and other methodologies. Compilers tend to propagate this structure into compiled code. The job of a good obfuscator is to destroy as much as possible of the structure that makes a program human-readable.
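One widely described way of destroying structure is control-flow flattening, in which loops and branches are replaced by a single dispatcher that jumps between numbered blocks. The text above does not name this technique; it is used here only as an illustrative sketch, with hypothetical function names and arbitrarily chosen state numbers.

```c
/* Structured version: greatest common divisor by Euclid's algorithm. */
unsigned gcd_plain(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Flattened version: the loop is gone. Each case is one basic block
   that does its work and then only records which block runs next. */
unsigned gcd_flat(unsigned a, unsigned b)
{
    unsigned r = 0;
    int state = 2;                 /* entry block */

    for (;;) {
        switch (state) {
        case 0: a = b; b = r; state = 2; break;      /* end of loop body */
        case 1: return a;                            /* exit block */
        case 2: state = (b != 0) ? 3 : 1; break;     /* loop condition */
        case 3: r = a % b; state = 0; break;         /* start of loop body */
        }
    }
}
```

Both functions return the same result, but in the flattened version the order of the blocks in the source no longer reflects the order in which they execute, so the loop has to be reconstructed by tracing the state variable.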

Uses for obfuscation

Makes reverse engineering more difficult

Even when a language is compiled to an executable or bytecode file, someone may choose to run a decompiler which converts these files back into human-readable form (generally without comments). This could help them understand whatever lies hidden within the source code, against the wishes of the code's creator. Obfuscation serves to increase the difficulty of decompilation, usually forcing someone who wants that information to use more costly forms of reverse engineering.

However, some forms of obfuscation can be easily defeated. For example, some websites obfuscate their JavaScript to prevent copying or modification of the code. This can often be defeated quickly by inspecting the DOM of the page, which can reveal the JavaScript code and remove some of the confusion, although scrambled variable names can still make the code extremely hard to understand.

Minimizes code size

Obfuscation usually breaks down structures which make programs modular and maintainable. This has the pleasant side-effect of reducing code size in many cases. For example, in languages that ship a symbol table with the executable code, simply renaming variables to shorter names can save a great deal of space in the resulting code footprint. This is a crucial consideration if code size must be kept to a minimum, as with code that must be sent over a network or embedded into a small device.

Concealment of evidence

Spammers frequently use obfuscated JavaScript or HTML code in spam messages. The obfuscated message, when displayed by an HTML-capable e-mail client, appears as a reasonably normal message -- albeit with obnoxious JavaScript behaviors such as spawning pop-up windows. However, when the source is viewed, the obfuscations make it far more difficult for investigators to discern where the links go, or what the JavaScript code does.

Dealers in spamming software have sold JavaScript obfuscators for the purpose of confounding investigators. Some of the techniques use JavaScript's dynamic nature -- a piece of code is stored as an encrypted string, which is decrypted and evaluated. This may be done several times. Other techniques include insertion of dummy code, as well as dummy HTML links to legitimate pages.
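The layering described above can be sketched in C, with the caveat that C has no equivalent of JavaScript's eval, so the sketch stops at recovering the hidden string. The per-layer transformation (XOR with a key byte, then add the position) and all names are assumptions made for this illustration; they do not describe any specific spam tool.

```c
#include <stddef.h>

/* Apply one layer: XOR each byte with key, then add its position. */
void encode_layer(unsigned char *buf, size_t len, unsigned char key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)((buf[i] ^ key) + i);
}

/* Undo one layer: subtract the position, then XOR with key. */
void decode_layer(unsigned char *buf, size_t len, unsigned char key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)((unsigned char)(buf[i] - i) ^ key);
}

/* Wrap buf in nkeys layers, applied in the order the keys are given. */
void wrap(unsigned char *buf, size_t len,
          const unsigned char *keys, size_t nkeys)
{
    for (size_t k = 0; k < nkeys; k++)
        encode_layer(buf, len, keys[k]);
}

/* Peel the layers off again, last applied first. */
void unwrap(unsigned char *buf, size_t len,
            const unsigned char *keys, size_t nkeys)
{
    while (nkeys > 0)
        decode_layer(buf, len, keys[--nkeys]);
}
```

In the JavaScript setting each decoded layer would itself be a small program that decodes and evaluates the next one, so an investigator has to run or emulate every stage to reach the final code.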

Disadvantages of obfuscation

One layer of security

No obfuscator known today provides any guarantee on the difficulty of reverse engineering, and this seems to be an inherent limitation (see, for example, this paper). Thus, obfuscators do not provide a level of security comparable to modern encryption schemes, and where security is of high importance they should be used in tandem with other measures.

Debugging

Obfuscated code is extremely difficult to debug. Variable names no longer make sense, and the structure of the code itself is likely to be modified beyond recognition. This generally forces developers to maintain two builds: one with the original, unobfuscated source code that can be easily debugged, and another for release. While both builds should be tested to make sure they behave identically, the second build can generally be constructed easily and reliably from the first by an obfuscation tool.

This limitation does not apply to intermediate-language (Java, C#, etc.) obfuscators, which generally work on compiled assemblies rather than on source code.

Portability

Obfuscated code often depends on particular characteristics of the platform and compiler, making it difficult to manage if either changes.

Defective obfuscators

Occasionally an obfuscator may be buggy in a way that is difficult to reproduce. For binary obfuscators, there is little one can do except find or create a newer version, or adjust the inputs to the obfuscator until it works. Source code obfuscators are often buggy because most are built using simple string-munging tools that fail to account for all the complexities of the source language syntax. Reliable source code obfuscators tend to use true language parsers to ensure that all the syntax is properly handled.
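The difference between string munging and syntax-aware processing can be seen in a small sketch. The first function below replaces every occurrence of a name as plain text; the second recognises just two kinds of token, identifiers and double-quoted string literals, which is far short of a real parser but enough to show the point. Both function names are hypothetical, comments and character literals are deliberately ignored, and the caller is assumed to supply a large enough output buffer.

```c
#include <ctype.h>
#include <string.h>

static int is_ident(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Naive munging: replace every textual occurrence of `from`, even inside
   longer identifiers and string literals. `from` must be non-empty. */
void rename_naive(const char *src, const char *from, const char *to, char *dst)
{
    size_t flen = strlen(from), tlen = strlen(to);
    while (*src) {
        if (flen > 0 && strncmp(src, from, flen) == 0) {
            memcpy(dst, to, tlen);
            dst += tlen;
            src += flen;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

/* Token-aware renaming: string literals are copied verbatim and only
   whole identifiers equal to `from` are replaced. */
void rename_tokens(const char *src, const char *from, const char *to, char *dst)
{
    size_t flen = strlen(from), tlen = strlen(to);
    while (*src) {
        if (*src == '"') {                      /* copy a string literal */
            *dst++ = *src++;
            while (*src && *src != '"') {
                if (*src == '\\' && src[1])     /* keep escaped characters */
                    *dst++ = *src++;
                *dst++ = *src++;
            }
            if (*src)
                *dst++ = *src++;                /* closing quote */
        } else if (is_ident(*src)) {            /* one whole identifier */
            size_t n = 0;
            while (is_ident(src[n]))
                n++;
            if (n == flen && memcmp(src, from, flen) == 0) {
                memcpy(dst, to, tlen);
                dst += tlen;
            } else {
                memcpy(dst, src, n);
                dst += n;
            }
            src += n;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Renaming count to a in the line int count, counter; puts("count"); gives int a, aer; puts("a"); with the naive function, which changes both an unrelated variable and the program's output, whereas the token-aware function gives int a, counter; puts("count");.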

Conflicts with reflection APIs

Reflection is a set of APIs in various languages that allow an object to be examined or created at run-time knowing only its class name. Many obfuscators allow specified classes to be exempt from renaming, and it is also possible to let a class be renamed and refer to it by its new name. However, the former option places limits on the dynamism of code, while the latter adds a great deal of complexity and inconvenience to the system.

See also