[ Pobierz całość w formacie PDF ]
J Comput Virol (2010) 6:277–287
DOI 10.1007/s11416-009-0136-2
ORIGINAL PAPER
From the design of a generic metamorphic engine to a black-box
classification of antivirus detection techniques
Jean-Marie Borello
·
Éric Filiol
·
Ludovic Mé
Received: 22 December 2008 / Accepted: 7 September 2009 / Published online: 25 September 2009
© Springer-Verlag France 2009
Abstract
In this paper, we propose an original black-box
approach concerning antivirus products evaluation. Contrary
to classical tests focusing on detection rates concerning a spe-
cific malware sample, we use a generic metamorphic engine
to observe the detection products behaviors. We believe that
this point of view presents a double interest: First, it offers an
original way of evaluating current antivirus products focus-
ing on the observed detection technique. More precisely, the
use of metamorphic malware guarantees the difficulty of
static signature based detection techniques to focus only on
heuristic and behavioral detection approaches. Second, by
pointing out current detection capabilities, we practically
evaluate the danger that complex metamorphic malware
could represent. To achieve this goal, we start with the
description of a generic metamorphic engine acting in two
steps: obfuscation and modeling. Then, we apply this engine
to a real mass-mailing worm and propose the resulting meta-
morphic malware samples to current antivirus products. The
observed results lead to a classification of detection tech-
niques in two main categories: the first one, relying on static
detection techniques, presents low detection rates obtained
by heuristic analysis. The second one, composed of behav-
ioral detection programs, mainly focuses on elementary
suspicious actions. In all cases, no product was able to
detect a global malware behavior. Consequently, we consider
that metamorphic malware detection still represents a real
challenge for antivirus products. Through this study, we hope
to help defenders understand and defend against the threat
represented by this class of malware.
1 Introduction
Malware is a generic term used to describe all kinds of soft-
ware presenting malicious behavior. In terms of security of
computer users, malicious software is considered as major
threat. Many detection programs are based on form detec-
tion relying on byte signature to identify a specific malware.
To circumvent these detection tools, attackers have devel-
oped specific counter-measures, giving birth to more and
more advanced code mutation techniques. Among all these
techniques such as encryption and polymorphism, metamor-
phism is certainly the most advanced one. Metamorphism
consists in canceling as much as possible any fixed element
that would represent a potential detection pattern according
to byte signaturematching. Here, we considermetamorphism
as a special class of self-replicating programs.
From a theoretical point of view, few results exist concern-
ing the detection complexity of code mutation techniques,
even if these notions have already been evoked in F. Cohen’s
seminal works [
10
]. Recently, D. Spinellis has proved [
23
]
that the detection of bounded-length polymorphic viruses
is an NP-complete problem. Then, E. Filiol has formalized
metamorphism by the means of formal grammars and lan-
guages [
16
] to extract three classes of detection complexity
according to the corresponding grammar class identified by
N. Chomsky [
7
,
8
]: polynomial complexity for class 3 gram-
mars, NP-complexity for class 1 and 2 grammars, and unde-
cidability for class 0 grammars.
B
J.-M. Borello (
)
CELAR, BP 7419, 35 174 Bruz Cedex, France
e-mail: jean-marie.borello@dga.defense.gouv.fr;
borello.jeanmarie@gmail.com
É. Filiol
ESIEA, Operational Virology and Cryptology Lab.,
53 000 Laval, France
L. Mé
SUPELEC, SSIR (EA 4039), Rennes, France
123
 278
J.-M. Borello et al.
From a practical point of view, very few metamorphic
malware are known to exist thanks to the difficulty of writ-
ing such complex programs. The most advanced metamor-
phic virus known is MetaPHOR [
13
], for which 14 000 lines
of assembly instructions (90% of the code) are dedicated to
the metamorphic engine. To change its form, this virus uses
simple instructions rewriting rules which allows its detec-
tion [
16
].
In this paper, we focus on metamorphic malware detec-
tion capabilities. More precisely, it was suggested in [
16
]
that a metamorphic malware presents few limitations from an
execution context point of view, whereas any antivirus tool
is bounded by severe time constraints. To take advantage
of this time constraint, a new class of obfuscator denoted
τ
producing a new program
with the same functionality
as
P
but being unintelligible [
2
]. Starting with this infor-
mal definition, they proposed a formal one based on ora-
cle access to program. Then, they proved that no obfuscator
exists according to this definition. Recently, another formal-
ization of obfuscation, based on the notion of oracle pro-
grams led to the same impossibility result [
3
]. Despite these
theoretical results, practical obfuscation has been intensively
investigated to protect intellectual property and especially
concerning high level languages such as JAVA and the .NET
framework [
11
,
12
]. Indeed, with such languages, the result-
ing binary contains all the information allowing us to easily
retrieve the original program structure such as names, struc-
tures, data types, etc.
Concerning malware, obfuscation schemes were used to
change the syntactic structure of the code to escape simple
form detection techniques such as pattern matching.
Metamorphic malware traditionally used basic obfuscation
transformations modifying either data flow (rewriting rules,
registers exchange) or control flow (branch insertions) to
avoid pattern detection [
5
]. The choice of such basic obfusca-
tion transformations was clearly evoked in [
20
] as follows:
“... a metamorphic virus must be able to disassemble and
reverse itself. Thus a metamorphic virus cannot utilize [...]
techniques that make it harder or impossible for its code to be
disassembled or reverse engineered by itself.”
In agreement
with this point of view, many static detection approaches
based on de-obfuscation techniques (such as data flow analy-
sis [
1
] and slicing [
25
]) were developed [
6
,
22
,
28
]. However,
more complex obfuscation schemes based on control flow
modifications such as [
5
], could thwart these static detection
techniques. Being aware of static detection limitations, an
increasing number of antivirus products consider behavioral
detection, which can be divided in two classes [
17
]. The first
one is represented by dynamic detectors relying on sequences
of observable events such as system call traces. The sec-
ond one is composed of static verifiers relying on instruction
meta-structures (graphs, temporal logic formula). In [
18
],
the coverage of behavioral detection engines was assessed
with the introduction of functional polymorphic engines.
Briefly, a functional polymorphic engine was defined as a
malware embedding a non deterministic compiler to dynam-
ically produce functional variants from a high level mal-
ware description. In this paper, we focus on the temporal
constraint aspect by investigating the new threat that
O(
P
)
-obfuscator was introduced in [
3
] to delay code analysis
for a predefined time
. Our work aims at evaluating cur-
rent antivirus products confronted by possible future threats
that metamorphic malware could represent by taking advan-
tage of the time-answer constraint from a detection point of
view. This approach allows an original black-box evaluation
of current malware detectors through their response to new
mutated forms of known malware.
The contributions of this paper are the following:
– We propose a design of a metamorphic engine corre-
sponding to a future possible threat that antivirus tools
must deal with.
– We present an original way of classifying antivirus detec-
tion techniques based on their observable behavior
towards mutated form of known malware.
– We give a precise example of our approach through the
application of ourmetamorphic engine on thewell-known
email-worm MyDoom [
15
].
This paper is organized as follows. In Sect.
2
we introduce
metamorphic malware detection and its link with code obfus-
cation. In Sect.
3
we present the description of our metamor-
phic engine. In Sect.
4
we classify detection tools according
to their response to metamorphic malware obtained by the
application of the previous metamorphic engine on a real
email-worm.
τ
2 Metamorphism and obfuscation
Metamorphic malware and code obfuscation techniques are
narrow linked subjects. Indeed, as mentioned in [
9
], a meta-
morphic code can be viewed as an obfuscated program.
So, detecting such a program leads to the ability to de-obfus-
cate it. Before describing our metamorphic engine approach,
we briefly introduce these two fundamental notions of obfus-
cation and metamorphism.
Barak et al. informally defined an obfuscator
-obfus-
cation-based metamorphic malware could represent on
antivirus products.
τ
3 Metamorphic engine description
as a
probabilistic compiler taking in input a program
P
, and
O
From a high level point of view, a metamorphic engine
offers a self-replicating property which has to produce a
123
Black-box approach concerning antivirus products evaluation
279
syntactically different but semantically equivalent mutated
form. A generic description of metamorphic binary-transfor-
mation is given in [
27
]. Here, we present our metamorphic
engine self-replication process, which acts in two steps:
1.
In the first step, known as the obfuscation step, the engine
changes its form to escape detection algorithms. The
main purpose of this step is to avoid static detection
approaches such as [
4
,
6
,
9
]
2.
In the second step, the already obfuscated engine reverses
its own obfuscation transformations to come back to its
original form. This step, known as the modeling step,
allows the engine to re-obfuscate itself. It is worth men-
tioning here that the reverse engine in charge of the
engine modeling is itself obfuscated otherwise it could
be easily detected by pattern matching.
This section presents the design of our metamorphic
engine. More precisely, in Sect.
3.1
, we focus on the obfus-
cation step. In Sect.
3.2
, we describe the engine information
needed to ensure its modeling. In Sect.
3.3
, we describe the
whole replication process. Finally, in Sect.
3.4
, we explain
how to produce a metamorphic binary starting from the
sources of an original program.
Fig. 1
Illustration of the obfuscation scheme. Original program
P
on
the left and the obfuscated program
P
on the right
3.1 Obfuscation step
This section presents the obfuscation step in the self-
replication process of our metamorphic engine. The obfus-
cation process has to work on both the code and the data in
a program at the same time.
As the splitting is randomly generated, no syntactic pattern
can be directly extracted, according to this approach. More-
over, it was proved in [
5
] that the static detection of meta-
morphic malware employing such a technique in a multi path
assumption, is an NP-complete problem. In static analysis,
the multi path assumption translates the difficulty of branch
target evaluation. Indeed, considering a branch instruction,
represented as follows, “
if
(
condition
){
branch1
}
else
{
branch2
}”, the
condition
evaluation can be highly compli-
cated by the use of opaque predicates as detailed in [
11
].
Informally, a predicate is said to be opaque if it has a prop-
erty which is known to the obfuscator, but which is difficult
for the deobfuscator to deduce. Thus, if a program cannot
determine the
condition
value, then it has to consider the two
branches as possibly executable paths. However, the creation
of opaque predicates which are difficult to resolve is a hard
task [
11
]. It is also the case from a metamorphic malware
point of view. Instead of focusing on opaque predicate crea-
tion, we deliberately choose to take advantage of the malware
time detection constraint evoked in [
3
,
16
].
In other words, each block
P
transition is
Code obfuscation:
The general code obfuscation scheme
detailed hereafter is inspired from [
5
]. Let
P
be a program
composed of
n
consecutive instructions (
I
1
,...,
I
n
). This
program
P
is split in
k
consecutive blocks
P
.
Each of these blocks contains a random number of instruc-
tions. Let
=
(
P
1
,...,
P
k
)
used
to randomize
P
i
blocks. For each
P
i
block, we define a new
block
P
σ(
i
)
σ
be a random permutation over the set
[
1
,
n
]
with its transition. This approach is illustrated
in Fig.
1
. On the left hand, we have an original program
P
composed of ten instructions whose control flow is repre-
sented with arrows. The boxes illustrate the random splitting
of
P
in five blocks. On the right hand, the new program
P
is obtained by permutating
P
i
blocks according to
.
It is easy to see that whatever the
Control Flow Graph
(CFG) of program
P
looks like, the execution remains the
same if after executing the last instruction of block
P
σ(
i
)
σ
,the
first instruction of
P
σ(
i
+
1
)
is reached. These transitions, rep-
resented with dashed arrows in Fig.
1
, are the key points of
the obfuscation scheme. For instance, considering the block
containing instructions
I
1
,
I
2
and
I
3
, the execution of instruc-
tion
I
3
must lead to
I
4
as illustrated in
P
and
P
.
-obfuscated
by dynamically computing the target destination. Several
approaches were detailed in [
3
]to
τ
-obfuscate programs.
In order to facilitate time measurement, we decided to use
an obfuscated loop which computes the destination address.
τ
123
 280
J.-M. Borello et al.
So, for the rest of the article, the
delaying time is mea-
sured thanks to the number of iterations in the transitions
loops. Here, we only present a sketch of our
τ
3.2 Modeling step: the necessity of extra information
-obfuscation
design for two reasons. Firstly, from an ethical point of view,
giving a complete description of the implementation could
lead an attacker to directly write such a metamorphic engine,
which is a non affordable risk. Secondly, according to the
experiment result,
τ
This section presents the modeling step in the self-replication
process of the metamorphic engine. From now on, we con-
sider that the metamorphic engine
M
is already obfuscated,
as presented in Fig.
1
. The obfuscated metamorphic engine
is denoted
M
in the rest of the section.
From its entry point,
M
must be able to extract its structure
in memory in order to re-obfuscate itself. Without any partic-
ular information on the way it was produced,
M
would have
to disassemble itself as any other external program would
have to. In this case, the engine would be confronted with
the difficulty of reversing its own obfuscation scheme. So,
to easily reverse its own code,
M
must embed extra infor-
mation allowing its de-obfuscation without simplifying the
detection.
According to the obfuscation scheme presented in
Sect.
3.1
, coming back to
M
means recovering the original
sequence of code blocks (
M
1
,...,
-obfuscation seems to have no impact on
current detection tools. So,
τ
-obfuscation does not appear to
be the key component of the metamorphic engine. To achieve
τ
τ
-obfuscation, the idea consists in choosing a random func-
tion
f
for each transition. Then the target address is deter-
mined by the number of compositions of this function
f
.Of
course, this simple loop is obfuscated using classical tech-
niques such as rewriting rules to avoid any pattern.
Data obfuscation:
A simple way to protect data is encryp-
tion as used in polymorphic malware, for example. In this
case, the malware execution begins with a (polymorphic)
decryption routine acting on the rest of the code and data.
After this decryption, all the code and data represent a
possible detection pattern. Thus, a practical detection [
24
,
pp. 451–458] consists in emulating the decryption routine to
come back to classical pattern matching detection techniques
on the decrypted program.
To avoid such a detection, a better approach consists in
decrypting data just before they are used and re-encrypting
them just after. By data we mean a block of data which cannot
be divided without a loss of semantics (for instance, a string,
a switch table, a structure, etc). This technique known as
on-the-fly encryption is commonly used in malware protec-
tion (DarkParanoid [
19
] and W32/Elkern [
14
]). More pre-
cisely, let
f
be a function taking as parameter a data block,
denoted
D
. Our original program
P
computes the function
f
M
n
) and the original data
blocks. More precisely, the extra information to be embedded
is composed of:
1.
the description of the original sequence of code blocks
(
P
1
,...,
P
n
)
;
2.
the description of data blocks with their corresponding
encryption key;
3.
the description of memory references.
With these three elements, themetamorphic engine
M
is able
to come back to its exact original (de-obfuscated) form
M
.
We shall explain the necessity of references. At binary level,
each logical element in a program (a block of data, an instruc-
tion, an import table entry, etc.) is represented by its address.
As these addresses change during each mutation according
to the previous obfuscation scheme, the metamorphic engine
must be able to find and update these references according
to the new position of the corresponding element. Unfor-
tunately, the exact determination of references in a binary
program is difficult.
To illustrate this problem, let us consider the following
assembly instruction:
cmp eax, 0402000h
. T is
instruction compares the value contained in the
eax
register with the hexadecimal value
402000.
Considering
the metamorphic engine (or any disassembler), the problem
consists in determining the semantics of this value. In other
words, is it an address or not? Now, let us consider the two
following programs described in C language: both of them
declared a constant value
MY_FLAG
in line 1 and a global
string
Global1
in line 2. The
main
function only declares
a variable in line 6, whose value is supposed to be defined later
in the
main
function. The key point is the
if
statement line
8 which compares
var1
with
MY_FLAG
in the first source
and with
Global1
in the second source.
.Let
E
be a symmetric encryption scheme. We modify
the original program
P
to get the program
P
defined as fol-
lows:
P
contains (in its binary representation) an encryption
key
K
and the encrypted data
C
(
D
)
=
E
K
(
)
. Then, during its
execution,
P
starts with the decryption of the encrypted data
C
. After that,
P
computes
f
with the previous decrypted data
D
. Finally,
P
re-encrypts this data
D
with the same key
K
.
So, without the knowledge of the key
K
, the protection of
D
is guaranteed by the robustness of the encryption scheme.
The data obfuscation process consists in randomizing the
key value and its position in such a way that only the piece of
code which previously had access to this key has access to the
new one. The new program contains the new encryption
K
and the new encrypted data
C
=
D
. In this case, the
decryption key can only be discovered by disassembling the
code. Thus, the robustness of data obfuscation directly relies
on the robustness of the previous code obfuscation guaran-
teed by
E
K
(
D
)
τ
-obfuscation.
123
Black-box approach concerning antivirus products evaluation
281
Fig. 2
Illustration of the
difficulty of precise references
evaluation
disassemble
M
1
, then
M
2
until the last block
M
7
.From
now,
M
has its own instructions sequence
Considering the particular case where the compilation
process places the
Global1
string at address
402000
in
the two resulting binaries, line 8 corresponds to the previous
assembly instruction. It is worthmentioning that this extreme
academic case is not very probable, but clearly illustrates
the problem of the references. Concerning our metamorphic
approach, code and data are randomly dispersed throughout
the programduring the replication. So, considering the previ-
ous example, the address of
Global1
will be different after
replication. And then, to be correct, in the statement
cmp
eax, 402000h
, the hexadecimal value must be updated
by the new address of
Global1
only in the second pro-
gram’s binary (Fig.
2
).
(
I
1
,...,
I
20
)
in
memory as illustrated in (3). The re-obfuscation starts here,
as described in Sect.
3.1
whose results is illustrated in (4):
newcode blocks are randomly generated
M
1
,...,
M
6
)
(
with
τ
1
,...,τ
6
).
their corresponding
τ
-obfuscated transitions (
M
1
,...,
M
6
)
The original code block sequence
is inserted
in a new data block represented in (5). The key point con-
sists in updating the reference to this rebuilding information
in instruction
I
4
, to be sure that this instruction will use the
new code blocks description. Finally, the entry point of
M
is
defined in its header to point to the position of
I
1
instruction
in
M
.
From a detection point of view, rebuilding information
presents no constant part nor constant position between the
two mutated programs. Thus, we assume that reaching
rebuilding information means to be able to disassemble any
obfuscated program until identifying the part of the program
using this information.
1
In this case, any disassembler would
be confronted with the robustness of the code obfuscation
scheme. And then detection is delayed during the amount of
time defined by
(
3.3 Metamorphic engine replication with no constant
kernel
At this stage we have illustrated :
1. how to obfuscate a program to guarantee that it cannot
be disassembled under a predefined time
in Sect.
3.1
;
2. which information is mandatory to create a program able
to reverse this previous obfuscation scheme in Sect.
3.2
;
τ
τ
-obfuscation.
3.4 Embedding the metamorphic engine in another
program
We now have to describe how the metamorphic engine can
link these two steps to achieve its self-replication. Figure
3
illustrates this replication process. For the purpose of sim-
plicity, we only present how the description of the original
code blocks sequence is used in the replication process.
Let
M
be an already-obfuscated version of the metamor-
phic engine
M
, as described in Sect.
3.1
.
M
embeds its own
rebuilding information, as presented in Sect.
3.2
.Morepre-
cisely,
M
is here composed of 20 instructions (
I
1
,...,
We have illustrated how the metamorphic engine can repro-
duce itself according to its rebuilding information. However,
the remaining question is the origin of this information. In
other words, howcanwe get the first obfuscatedmetamorphic
binary? First, our metamorphic engine works at binary level
taking advantage of the dissembling difficulty in x86 archi-
tecture. Second, the purpose of the engine is to be generic,
in order to transform high level language programs to make
them metamorphic. Here we only focus on programs written
in C language. Thus, we have to modify the compilation pro-
cess to build the first metamorphic binary in the same way the
metamorphic engine does. This is achieved by inserting an
obfuscator in the compilation process as illustrated in Fig.
4
step (2).
The compilation process starts normally by taking two
inputs programs, the metamorphic engine and the to-be-
obfuscated program. First,
I
20
)
M
1
,...,
M
7
)
distributed in 7 blocks
as represented in (1).
Each instruction index represents its execution order,
I
1
stands for the first executed onewhereas
I
20
is the last instruc-
tion. Each block
M
i
contains a random number of instruc-
tions and a random position in the program
M
. At the end of
each block, another one denoted
(
τ
i
-obfus-
cated branch whose destination is highlighted by pointed
arrows. As previouslymentioned, the destination of this block
cannot be determined before the
represents the
τ
τ
i
duration.
Without loss of generality, let us assume that the rebuild-
ing information is used by instruction
I
4
to start the mod-
eling step. Then, this instruction has a reference to the first
block description represented in (2). This description gives
the position and the size of each
M
i
the compiler produces the
1
The question of heuristic detection of the permuted code is not
mentioned here.
block. So,
M
can
123
  [ Pobierz całość w formacie PDF ]