Writing Disassemblers for VM-based Obfuscators
October 21, 2021
• Tim Blazytko
A deep dive into building disassemblers for virtual machine-based code obfuscation — covering handler identification, bytecode recovery, and automated analysis techniques used to defeat custom VM protections.
Originally published at https://synthesis.to/2021/10/21/vm_based_obfuscation.html
After I recently gave a workshop on the Analysis of Virtualization-based Obfuscation at r2con2021 (slides, code & samples and the recording are available online), I would like to use this blog post for a brief summary of how to write disassemblers for VM-based obfuscators based on symbolic execution.
While we cover only a basic implementation of virtualization-based obfuscation, we focus on a generic approach to build disassemblers that scale for most virtual machines that can be found in the wild. As a leading example for this blog post, we will use a simple virtual machine that protects a Fibonacci implementation using the Tigress C obfuscator. In the end, we will also have a look at how the original code can be reconstructed from the VM disassembly.
If you would like to play around with the code (based on Miasm) and samples, you can find them here.
Virtual Machines
Virtual machines (VM) belong to the most sophisticated obfuscation schemes, since they create an additional layer of complexity for reverse engineers: By introducing a custom instruction set architecture (ISA) in software (similar to the Java virtual machine), they hide the original code in a sequence of bytes (the so-called bytecode) that is interpreted at runtime. In other words, the bytecode is an array of bytes that encodes the to-be-protected program as machine code; without further knowledge of the virtual machine implementing the custom instruction set architecture, the bytes aren’t meaningful for a reverse engineer. To give them a meaning, we have to fully understand the VM and its execution flow, such that we can build a disassembler for the custom instruction set and decode the bytecode.
To achieve this, let us start with opening a virtual machine in the disassembler of our choice. For the simplest virtual machines, we see a control-flow graph like the following:

We see a control-flow graph that has a central basic block—the dispatcher—which directs the control flow to individual basic blocks, which jump back to the dispatcher. While this reminds us of control-flow flattening, there is a fundamental difference: In the case of control-flow flattening, the individual basic blocks have a 1:1 mapping to the original code. However, for virtual machines, these basic blocks are known as handlers and implement the individual instruction semantics of the custom instruction set architecture, such as virtual addition or push/pop instructions.
On a high level, the virtual machine operates as follows: Starting at the entry basic block, it backs up the native CPU registers and initializes the VM state. Afterward, it walks over the bytecode array, iteratively fetches some bytes, decodes the corresponding instruction and executes the handler that implements the virtual instruction. This fetch-decode-execute (FDE) process is performed in a loop until the VM reaches a specific handler that restores the native CPU registers and leaves the VM. This handler is called the VM exit. Given this context, we can annotate the control-flow graph as follows:

Internally, virtual machines often use additional data structures to preserve their internal state (like storing native registers and intermediate values). Two of the most important data structures are the virtual instruction pointer and (sometimes) the virtual stack pointer. Similar to a native instruction pointer register (such as rip on x86-64), the virtual instruction pointer points to the current instruction in the bytecode and is used to decode operands and keep track of the VM execution flow. Analogously, the virtual stack pointer keeps track of a VM-internal stack that may be used to store intermediate values.
Overall, we can understand a VM as an obfuscated function with a custom calling convention: It takes some parameters as inputs and calculates one or more outputs. To preserve the outer execution context, it makes use of a prologue and epilogue (VM entry and exit) that saves and restores the native CPU state. Internally, it operates on an undocumented state and interprets a sequence of bytes that represents the protected code. To reconstruct the underlying code, we have to locate the virtual machine’s inputs/outputs, understand its internal state and instruction set. Then, we can write a disassembler for the custom instruction set architecture which decodes the bytecode into human-readable pseudocode. In the following, we will have a look at how we can write a generic disassembler based on symbolic execution. Beforehand, we make a short excursion into symbolic execution.
Symbolic Execution
Symbolic execution is a program analysis technique which allows us to symbolically evaluate and summarize assembly code. These summaries provide concrete insights into the semantics of executed instructions.
To symbolically execute assembly code, we first lift it into an intermediate representation. Afterward, we evaluate the code assignment by assignment and track the individual register and memory assignments in a hashmap that is referred to as symbolic state. To propagate the data flow between the instructions, we always use the latest register/memory definitions from the symbolic state. For example, consider the following assembly code and the corresponding assignments:
```
mov rax, rbx  ; rax := rbx
add rax, 0x20 ; rax := rax + 0x20
add rbx, rax  ; rbx := rbx + rax
xor rcx, rbx  ; rcx := rcx ^ rbx
```

Let us assume that the initial symbolic state maps all registers to themselves:
```
rax: rax
rbx: rbx
rcx: rcx
```

After the first instruction, we update the symbolic state such that rax now maps to rbx. If we evaluate the second instruction, we propagate this information and assign state[rax] + 0x20 = rbx + 0x20 to rax. After we have symbolically executed all instructions, we have the following symbolic state:
```
rax: rbx + 0x20
rbx: rbx + (rbx + 0x20)
rcx: rcx ^ (rbx + (rbx + 0x20))
```

The symbolic state reveals the underlying semantics of a sequence of code in a comprehensible manner. If we want to enrich symbolic analysis with (partially) concrete values, we can provide them to the initial symbolic state and perform a concolic execution. For example, if we initialize rbx with 0x10, we obtain

```
rax: 0x30
rbx: 0x40
rcx: rcx ^ 0x40
```

In a purely concolic scenario, symbolic execution can be understood as an emulator for assembly code. However, if we keep at least some registers/memory locations symbolic, we get a very powerful technique which allows us to automatically explore virtual machine interpreters. In the following, we will make use of this to build a generic disassembler.
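The propagation described above can be mimicked in a few lines of plain Python. This is a simplified sketch over string expressions with a fixed register set chosen just for this example; real engines such as Miasm operate on typed expression trees:

```python
import re

# Toy symbolic executor: the state maps register names to string
# expressions over the initial register values.

def step(state, dst, rhs):
    # substitute the current definition of every register into the
    # right-hand side in one simultaneous pass, then bind the result
    def lookup(match):
        definition = state[match.group(0)]
        # parenthesize compound definitions to keep precedence visible
        return definition if re.fullmatch(r"\w+", definition) else f"({definition})"
    state[dst] = re.sub(r"\b(?:rax|rbx|rcx)\b", lookup, rhs)

# mov rax, rbx; add rax, 0x20; add rbx, rax; xor rcx, rbx
program = [("rax", "rbx"), ("rax", "rax + 0x20"),
           ("rbx", "rbx + rax"), ("rcx", "rcx ^ rbx")]

# initial symbolic state: every register maps to itself
state = {r: r for r in ("rax", "rbx", "rcx")}
for dst, rhs in program:
    step(state, dst, rhs)

print(state["rax"])  # rbx + 0x20
print(state["rcx"])  # rcx ^ (rbx + (rbx + 0x20))
```

The substitution happens in a single regex pass so that register names inside an inserted definition (which refer to initial values) are not substituted again.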
Writing a Disassembler
Our goal is to design a generic disassembler approach that works for most virtual machines in the wild, no matter their complexity. Starting from scratch with nearly no prior knowledge of the virtual machine, we interactively add more and more knowledge until we have a fully-fledged disassembler. For this, we build a system which symbolically explores the virtual machine and enriches the output with useful information about the symbolic state. On a high level, the approach consists of two steps:
- We build a symbolic executor that follows the VM execution flow from the VM entry to the VM exit.
- Afterward, we add callbacks to interesting code locations (such as VM handlers) and dump useful information from the symbolic state.
Building such a symbolic executor is an interactive process: Starting at the VM entry, we symbolically follow the execution flow until the symbolic executor stops. This is always the case when the VM performs a control-flow transfer that relies on symbolic values; then, the symbolic executor has insufficient information to derive a concrete value. In these cases, we manually inspect why this happens and add additional knowledge to the symbolic state. Often, the cause is either missing information about the bytecode or a (conditional) jump which depends on user input (inputs to the VM from the outside). For the former, we concretize the symbolic memory around the bytecode; for the latter, we can hardcode specific input values.
Once the symbolic executor follows the execution from VM entry to exit without any interruption, we can manually reverse engineer specific VM components and add more and more knowledge about the VM to the executor. Typically, we start by dumping the virtual instruction pointer and track all the executed handlers. Afterward, we manually inspect individual handlers, understand their semantics and add this information to the symbolic executor. In the end, we receive a fully-fledged disassembler which allows us to dump concrete values at runtime, print and optimize the disassembly or reconstruct the control-flow graph. In theory, it is even possible to perform a devirtualization and emit equivalent x86-64 code, although this may take quite some time to implement correctly.
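Structurally, the two steps can be sketched as a small driver with per-address callbacks. Everything here is hypothetical scaffolding (the real implementation in the linked scripts is Miasm-based); `execute_block` stands in for a real symbolic engine:

```python
# Schematic driver: follow the execution flow block by block and fire
# callbacks at addresses we consider interesting (e.g. VM handlers).

class Explorer:
    def __init__(self, execute_block):
        self.execute_block = execute_block  # returns next address or None
        self.callbacks = {}                 # address -> analysis function

    def on(self, address, callback):
        # step 2: register a callback for an interesting code location
        self.callbacks[address] = callback

    def run(self, start):
        # step 1: follow the execution flow until it can no longer be resolved
        trace, address = [], start
        while address is not None:
            trace.append(address)
            if address in self.callbacks:
                self.callbacks[address](address)
            address = self.execute_block(address)
        return trace

# demo with a fake two-block control-flow "engine"
cfg = {0x115A: 0x1207, 0x1207: None}
explorer = Explorer(lambda addr: cfg.get(addr))
explorer.on(0x1207, lambda addr: print(f"reached block {addr:#x}"))
print([hex(a) for a in explorer.run(0x115A)])
```

Adding knowledge interactively then amounts to registering more callbacks and refining what the engine can resolve.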
In Practice
After we discussed the generic approach to build a disassembler based on symbolic execution, we will now exemplify the process on a simple virtual machine. For a better understanding, we will first give a detailed overview of the virtual machine. Afterward, we explore the process from symbolically following the execution flow to building a fully-fledged disassembler.
Sample
Our sample of choice protects an implementation of Fibonacci with a virtual machine whose control-flow graph is comparable to the one above. The virtual machine is stack-based; it uses a dedicated stack to store intermediate values. For example, a calculation such as 5 + 8 is performed by the following pseudocode:
```
push 5
push 8
add
```

The VM entry is at 0x115a. The virtual machine uses rdx as the virtual instruction pointer and rcx as the virtual stack pointer; the bytecode is located at 0x4060. The handler at 0x11a9 performs a stack-based addition:

```
000011a9  4883c201  add rdx, 0x1
000011ad  8b01      mov eax, dword [rcx]
000011af  0141f8    add dword [rcx-0x8], eax
000011b2  4883e908  sub rcx, 0x8
000011b6  eb4f      jmp 0x1207
```

For this, it fetches two arguments from the stack ([rcx] and [rcx-0x8]), adds them and stores the result on the top of the stack. It also increments the virtual instruction pointer at 0x11a9 and adjusts the virtual stack pointer at 0x11b2. Overall, the virtual machine has the following 11 handlers:
- 0x129e: CMPBE (compare below equal)
- 0x1238: PUSHFROMVAR (load integer from local variable and push onto stack)
- 0x126d: CHECKZERO (if constant == 0x0 then push var_c on top of stack)
- 0x11c4: PUSHPTR (push pointer to local variable)
- 0x1262: GOTO (jump to an address)
- 0x11a9: ADD (add two dwords)
- 0x1245: VMEXIT (leave the VM)
- 0x11f1: CMPE (compare if two values are equal)
- 0x11e1: PUSH (push constant onto stack)
- 0x1281: JCC (conditional jump)
- 0x1226: POPTOVAR (assign value to local variable)
To perform the Fibonacci calculation, it expects an input parameter in rdi representing the n-th Fibonacci number to calculate; after the VM exit, rax holds the output (the n-th Fibonacci number).
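To make the fetch-decode-execute loop of such a stack machine concrete, here is a toy interpreter for three of the handlers above (PUSH, ADD, VMEXIT). The bytecode encoding is invented for illustration and is not the actual Tigress encoding:

```python
import struct

# Invented encoding: one opcode byte per instruction; PUSH is followed
# by a 4-byte little-endian immediate. NOT the real Tigress format.
PUSH, ADD, VMEXIT = 0x00, 0x01, 0x02

def run_vm(bytecode):
    vip, stack = 0, []          # virtual instruction pointer, virtual stack
    while True:
        opcode = bytecode[vip]  # fetch
        if opcode == PUSH:      # decode immediate, push it
            stack.append(struct.unpack_from("<I", bytecode, vip + 1)[0])
            vip += 5
        elif opcode == ADD:     # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFFFFFFFF)
            vip += 1
        elif opcode == VMEXIT:  # leave the VM, top of stack is the output
            return stack.pop()

# push 5; push 8; add; exit
code = (bytes([PUSH]) + struct.pack("<I", 5)
        + bytes([PUSH]) + struct.pack("<I", 8)
        + bytes([ADD, VMEXIT]))
print(run_vm(code))  # 13
```

The real handlers additionally route everything through memory (the virtual stack lives behind rcx), but the control structure is the same.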
If you want to reverse engineer the virtual machine on your own, use the provided information as guidance. However, let us now forget these analysis details and start writing the disassembler without any prior knowledge.
Following the VM Execution Flow
To build a symbolic executor that follows the VM execution flow from the VM entry to the exit, we use the script follow_execution_flow.py. It hardcodes the VM entry at 0x115a and uses the following loop to symbolically follow the execution flow:
```python
# init worklist
basic_block_worklist = [ExprInt(start_addr, 64)]

# worklist algorithm
while basic_block_worklist:
    # get current block
    current_block = basic_block_worklist.pop()

    print(f"current block: {current_block}")

    # symbolically execute block -> next_block: symbolic value/address to execute
    next_block = sb.run_block_at(ira_cfg, current_block)

    print(f"next block: {next_block}")

    # if next block is an integer or label, continue execution
    if next_block.is_int() or next_block.is_loc():
        basic_block_worklist.append(next_block)
```

In short, it implements a worklist algorithm that iteratively executes the next basic block until the calculation of the next basic block relies on symbolic values. Initially, the output looks as follows:
```
current block: 0x115A
next block: 0x1207
current block: 0x1207
next block: 0x120A
current block: 0x120A
next block: (@8[0x4060] == 0x80)?(0x129E,0x1212)
```

We successfully executed the basic blocks 0x115A, 0x1207 and 0x120A. However, the next basic block relies on a memory value that the symbolic executor cannot resolve: @8[0x4060] (which can be understood as an 8-bit memory read from the address 0x4060). Let's have a look at the corresponding disassembly:
```
00001178  488d15e12e0000  lea rdx, [rel data_4060]
[...]
00001207  0fb602          movzx eax, byte [rdx]
[...]
0000120a  3c80            cmp al, 0x80
```

We notice that the comparison with al relies on the memory dereference [rdx], which again relies on the address 0x4060. In other words, the symbolic executor propagated the memory read to the byte comparison: @8[0x4060] == 0x80. Semantically, the virtual machine reads the first byte from the bytecode and dispatches it to find the first handler. To allow the symbolic executor to perform the dispatching, we empower it with knowledge of the bytecode by adding it to the symbolic memory:
```python
# add bytecode to symbolic memory -- start address and size (highest address - lowest address)
sym_address, sym_value = constraint_memory(0x4060, 0x4140 - 0x4060)
sb.symbols[sym_address] = sym_value
```

Afterward, the symbolic executor runs successfully. The next time it stops, the output looks as follows:
```
[...]
current block: 0x120A
next block: 0x1212
current block: 0x1212
next block: 0x1218
current block: 0x1218
next block: 0x121C
current block: 0x121C
next block: 0x121E
current block: 0x121E
next block: 0x1281
current block: 0x1281
next block: RDI[0:32]?(0x1298,0x1286)
```

This time, we stop at a conditional jump that relies on rdi. After some manual investigation, we come to the conclusion that this is an input parameter to the VM. Let us initialize it with 0 as follows:
```python
# constrain VM input (rdi, first function argument). The value in `ExprInt` represents the function's input value.
rdi = ExprId("RDI", 64)
sb.symbols[rdi] = ExprInt(0, 64)
```

If we re-run the script, the symbolic executor stops here:
```
[...]
next block: 0x1245
current block: 0x1245
next block: 0x125A
current block: 0x125A
next block: @64[RSP]
```

Now, the next basic block is an address taken from the stack; a typical pattern for a ret instruction. Indeed, if we look at the corresponding handler we see that it leaves the VM; we found the VM exit. In other words, our symbolic executor now runs from the VM entry to the VM exit. If we have a closer look at the VM exit, we also see the following assembly line:

```
00001245  8b01  mov eax, dword [rcx]
```

Here, the VM uses the virtual stack pointer rcx to load a value into rax. In other words, we found the VM output. To verify our assumption, we can add the following code lines after the worklist algorithm to print the VM output:
```python
# dump VM's/function's return value -- only works if SE runs until the end
rax = ExprId("RAX", 64)
value = sb.symbols[rax]
print(f"VM return value: {value}")
```

If we re-run the script, we receive 0 as output; if we change the input value in rdi to 10, we get 55 as output. In other words, the symbolic executor correctly emulates the VM execution for the provided inputs and calculates the same output as the binary (remember that it calculates the n-th Fibonacci number).
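As a sanity check, the two observed input/output pairs match a plain iterative Fibonacci implementation (under the common convention fib(0) = 0, fib(1) = 1):

```python
def fib(n):
    # iterative Fibonacci, the function the protected binary computes
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(0), fib(10))  # 0 55
```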
In summary, we have not only built a symbolic executor following the VM execution flow from the VM entry to its exit; we were also able to emulate the whole VM and inspect its input and output. In the following, we will extend the analysis and enrich the symbolic emulator with information about the individual VM handlers.
From Symbolic Execution to VM Disassembly
In the next step, we want to turn the symbolic executor into a fully-fledged disassembler. For this, we want to add callbacks for the individual handlers which dump additional information from the symbolic state. To realize this, we first start by manually creating a list of handlers. Then, we add the callback in the worklist algorithm, as can be seen in vm_disassembler.py:
```python
# worklist algorithm
while basic_block_worklist:
    # get current block
    current_block = basic_block_worklist.pop()

    # if current block is a VM handler, dump handler-specific knowledge
    if current_block.is_int() and int(current_block) in VM_HANDLERS:
        disassemble(sb, current_block)
    [...]
```

If the current symbolically executed basic block is the beginning of a VM handler, we call the function disassemble, which takes the handler and the symbolic state as input. Within the function, we dispatch on the individual handler addresses and dump handler-specific information. For now, let us just create a template and print the virtual instruction pointer:
```python
# fetch concrete value of current virtual instruction pointer
vip = sb.symbols[ExprId("RDX", 64)]

# catch the individual handlers and print execution context
if int(address) == 0x129e:
    print(f"{vip}: handler {address}")
[...]
```

If we execute the script, we get the following handler trace:
```
0x4060: handler 0x11E1
0x4065: handler 0x11C4
0x406A: handler 0x1226
0x406B: handler 0x11E1
0x4070: handler 0x11C4
0x4075: handler 0x1226
0x4076: handler 0x11C4
0x407B: handler 0x1238
0x407C: handler 0x11C4
0x4081: handler 0x1226
0x4082: handler 0x1262
0x4087: handler 0x11E1
0x408C: handler 0x126D
0x4091: handler 0x1238
0x4092: handler 0x11F1
0x4093: handler 0x1281
0x409D: handler 0x11E1
0x40A2: handler 0x11C4
0x40A7: handler 0x1226
0x40A8: handler 0x1262
0x40AD: handler 0x1262
0x4136: handler 0x11C4
0x413B: handler 0x1238
0x413C: handler 0x1245
```

Looking at the trace, we see that some handlers are executed more than once. To give the instruction trace more meaning, we can now manually inspect the individual handlers and replace the output with handler-specific information. For example, we could start by replacing the last line, handler 0x1245, with VMEXIT. This process might be time-consuming; however, it becomes easier the more handlers we already know, since we permanently improve our understanding of how the VM works internally. In the end, we can enrich the disassembler as can be seen in vm_disassembler_final.py. Its final output is:
```
0x4060: PUSH 0x0
0x4065: PUSHPTR var_0x4 (push pointer to local variable var_0x4)
0x406A: POPTOVAR (assign value to local variable)
0x406B: PUSH 0x1
0x4070: PUSHPTR var_0x8 (push pointer to local variable var_0x8)
0x4075: POPTOVAR (assign value to local variable)
0x4076: PUSHPTR var_0x8 (push pointer to local variable var_0x8)
0x407B: PUSHFROMVAR (load integer from local variable and push onto stack)
0x407C: PUSHPTR var_0xC (push pointer to local variable var_0xC)
0x4081: POPTOVAR (assign value to local variable)
0x4082: GOTO 0x4087
0x4087: PUSH 0x0
0x408C: if 0x0 == 0x0 then push var_c (rdi) on top of stack
0x4091: PUSHFROMVAR (load integer from local variable and push onto stack)
0x4092: CMPE
0x4093: conditional jump
0x409D: PUSH 0x0
0x40A2: PUSHPTR var_0x10 (push pointer to local variable var_0x10)
0x40A7: POPTOVAR (assign value to local variable)
0x40A8: GOTO 0x40AD
0x40AD: GOTO 0x4136
0x4136: PUSHPTR var_0x10 (push pointer to local variable var_0x10)
0x413B: PUSHFROMVAR (load integer from local variable and push onto stack)
0x413C: VMEXIT
```

Given this detailed VM disassembly output, we can now have a closer look at the VM disassembly, detect patterns, simplify the output and reconstruct the high-level code.
Code Reconstruction
To reconstruct the high-level code for the observed VM trace, let’s first have a closer look at some instruction patterns:
```
0x4060: PUSH 0x0
0x4065: PUSHPTR var_0x4 (push pointer to local variable var_0x4)
0x406A: POPTOVAR (assign value to local variable)
```

This pattern occurs several times; it pushes a constant onto the stack, afterward a pointer to a local variable. Finally, it assigns the value to the local variable. In short, this pattern can be simplified to var_0x4 = 0x0.
```
0x4076: PUSHPTR var_0x8 (push pointer to local variable var_0x8)
0x407B: PUSHFROMVAR (load integer from local variable and push onto stack)
0x407C: PUSHPTR var_0xC (push pointer to local variable var_0xC)
0x4081: POPTOVAR (assign value to local variable)
```

This instruction pattern first pushes a pointer to a local variable onto the stack, afterward it fetches its value. Then, we push another pointer to a variable and assign it the value from the stack. In short, it can be written as var_0xc = var_0x8.
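Such pattern rewriting is easy to mechanize. The following sketch matches the two patterns above over a list of (mnemonic, operand) tuples; this representation is invented for illustration and does not mirror the data structures of the linked scripts:

```python
# Minimal peephole rewriter: collapse known handler sequences into
# high-level assignments, keep everything else as-is.

def simplify(trace):
    out, i = [], 0
    while i < len(trace):
        window = [op for op, _ in trace[i:i + 3]]
        if window == ["PUSH", "PUSHPTR", "POPTOVAR"]:
            # push constant, push variable pointer, assign -> var = const
            out.append(f"{trace[i + 1][1]} = {trace[i][1]}")
            i += 3
        elif window == ["PUSHPTR", "PUSHFROMVAR", "PUSHPTR"] and \
                len(trace) > i + 3 and trace[i + 3][0] == "POPTOVAR":
            # load one variable, assign it to another -> dst = src
            out.append(f"{trace[i + 2][1]} = {trace[i][1]}")
            i += 4
        else:
            out.append(trace[i][0])
            i += 1
    return out

trace = [("PUSH", "0x0"), ("PUSHPTR", "var_0x4"), ("POPTOVAR", None),
         ("PUSHPTR", "var_0x8"), ("PUSHFROMVAR", None),
         ("PUSHPTR", "var_0xc"), ("POPTOVAR", None)]
print(simplify(trace))  # ['var_0x4 = 0x0', 'var_0xc = var_0x8']
```

A real implementation would also track the virtual addresses so that jump targets survive the rewriting.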
Let’s also have a closer look at the conditional jump:
```
0x4087: PUSH 0x0
0x408C: if 0x0 == 0x0 then push var_c on top of stack
0x4091: PUSHFROMVAR (load integer from local variable and push onto stack)
0x4092: CMPE
0x4093: conditional jump
```

We first push 0x0 onto the stack. Then, we compare 0x0 == 0x0 and if this is true (which it is in this case), we put var_c onto the stack. We didn't explore it in depth, but var_c holds the value of rdi, the input to the VM. This way, we put rdi onto the stack. Finally, we compare whether the two stack values are equal: if rdi == 0x0: [...].
If we do this for all instruction patterns and remove the gotos, we receive a cleaned output:
```
; var_0x4 = 0x0
0x4060: PUSH 0x0
0x4065: PUSHPTR var_0x4 (push pointer to local variable var_0x4)
0x406A: POPTOVAR (assign value to local variable)
; var_0x8 = 0x1
0x406B: PUSH 0x1
0x4070: PUSHPTR var_0x8 (push pointer to local variable var_0x8)
0x4075: POPTOVAR (assign value to local variable)
; var_0xc = var_0x8
0x4076: PUSHPTR var_0x8 (push pointer to local variable var_0x8)
0x407B: PUSHFROMVAR (load integer from local variable and push onto stack)
0x407C: PUSHPTR var_0xC (push pointer to local variable var_0xC)
0x4081: POPTOVAR (assign value to local variable)
; push rdi (== 0) onto stack
0x4087: PUSH 0x0
0x408C: if 0x0 == 0x0 then push var_c on top of stack
0x4091: PUSHFROMVAR (load integer from local variable and push onto stack)
; if rdi == 0
0x4092: CMPE
0x4093: conditional jump
; var_0x10 = 0x0
0x409D: PUSH 0x0
0x40A2: PUSHPTR var_0x10 (push pointer to local variable var_0x10)
0x40A7: POPTOVAR (assign value to local variable)
; return var_0x10
0x4136: PUSHPTR var_0x10 (push pointer to local variable var_0x10)
0x413B: PUSHFROMVAR (load integer from local variable and push onto stack)
0x413C: VMEXIT
```

If we also remove the VM disassembler output and re-write our comments in pseudocode, we obtain:
```
var_0x4 = 0x0
var_0x8 = 0x1
var_0xc = var_0x8

if rdi == 0:
    var_0x10 = 0x0
    return var_0x10
else:
    print("not seen in VM trace")
```

We might see that this is the initialization of the Fibonacci function, where we initialize two values and compare the input to 0x0. Since we initialized rdi with 0x0 in the symbolic executor, the VM execution triggers exactly this case. To dispel the last doubts, we can also compare it with the original, unprotected code:
```c
unsigned int fib(unsigned n) {
    unsigned a = 0;
    unsigned b = 1;
    unsigned int s = b;

    if (n == 0) {
        return 0;
    }
    [...]
```

As we can see, our reconstruction nearly fits the original code snippet line by line. To reconstruct the missing parts of the original code, we can assign other values to rdi and re-run the whole process.
This has been a long journey, but in the end we analyzed a virtual machine, wrote an interactive disassembler based on symbolic execution and reconstructed the original code based on the VM disassembly.
Setting the Scene
In this post, we discussed the fundamentals of virtualization-based obfuscation and learned a generic approach to building disassemblers for virtual machine implementations using symbolic execution. Afterward, we put it into practice on a simple virtual machine for which we not only constructed a disassembler, but also recovered parts of the protected high-level code. Overall, the whole process has been an interplay between manual analysis and automation; especially the manual work required a lot of time. Large parts of this time-consuming manual work can be simplified by using heuristics, pattern matching and other analysis techniques; we could, for example, detect the handlers automatically and also extract their semantics. This may be a topic for a future post.
In the wild, there exists a plethora of virtualization-based obfuscators of varying complexity, deployed in malware, commercial applications and even in the Windows kernel. Naturally, the example code from this post will break horribly on such samples: function calls, indirect calls, jump tables, external/API calls, additional layers of obfuscation and more must be treated in special ways. Nevertheless, the core concepts remain the same.
If you would like to dive deeper into these topics, learn how to combine this approach with other deobfuscation techniques such as compiler optimizations, taint analysis, SMT solving & program synthesis or if you would like to know how you can handle the aforementioned special cases, then have a look at my training class on code deobfuscation techniques.