andrew@theinternet

NetSec Lecture Notes - Lesson 10 - Advanced Malware Analysis

Advanced Malware Analysis

Malware Prevalence

  • “Safe” websites not safe to visit
    • Reading USAToday.com may result in malware on your computer
    • How?
      • USAToday’s network may be compromised
      • Malicious scripts may be bundled with ads or content, and may redirect the user’s browser to download malicious software
  • Case study: Alexa top sites
    • Ranked list of all domains - 252 milliion
    • Top 25k of these were examined each day, systematically
    • Network traffic was monitored for drive-by download
    • 39 of these domains resulted in drive-by downloads
      • 87% of these sites involved exploits for Java
      • 46% of sites served exploits via ad networks
    • 7.8 million users were served malicious content
      • 1.2 million likely compromised
  • New features are a regular basis for attacks
    • PDF contains embedded, malicious Flash movie which exploits Acrobat Reader’s flash interpreter, thus compromising the system. This can then phone home to controller
      • Phone home traffic can be observed
    • Soon after, compromised, legitimate websites found hosting drive-by attacks that use the same flaw to exploit Flash player
    • Vulnerability traded back to bug reported to Adobe eight months prior
  • Users remain uninformed (vulnerable)
  • Example: Waledac’s email campaigns
    • Use of geo-location, temporally relevant events (e.g. bomb blast in <your city>, July 4th fireworks videos) to make attacks more compelling

Malware Evolution

  • Traditional defense-in-depth
  • Network-level protection
    • Firewall
      • evaded by C&C protocol congruency
    • IPS/IDS
      • Evaded by custom encodings
  • Host-level protection
    • User Access Control
      • analagous to “informed consent”
        • users usually don’t really get it, just click “yes” to get on with their job
      • img
  • Antivirus
    • Uses complex, heuristics-based detection along with signature matching

Malware Obfuscation

Malware Obfuscation Quiz

  • Code that reverses the pre-runtime transformation is included in the executable
    • True
  • Each instance looks different, but there is a signature or pattern across all instances
    • False
  • A signature scanner that tries to identify malware by its unique strings would not be effective
    • True

Packing

  • Definition of Packing: A technique whereby parts or all of an executable file are compressed, encrypted, or transformed in some fashion

img

  • Transformed code looks random, because it is encrypted with a randomly generated key.
    • This happens each time the packing tool is invoked, so even different copies of the same malware look different
    • Transformed machine code looks like data, so network IDS looking for executables will miss it
  • Server Side Polymorphism
    • Attacks the heart of the traditional host-based AV model by automating mutations
  • Eventually researches can deobfuscate any malware and discover its behaviors, but it can be a substantial obstacle and takes time
    • Attackers can thwart this by sending updates to compromised computers with newly obfuscated malware, with new behaviors. This effectively resets the research to square one each time
  • Example: Waledac
    • Collected on 12/30/2008
      • img
      • By 2/25/2009 90% of AV can detect it
    • New sample, collected on 2/5/2009
      • img
      • Only about 28% of AV can detect this new version
  • Impact on Antivirus Software
    • Researchers surveyed McAfee AV using 200,000 malware samples collected over 6 months
      • 53% detected on first day
      • 32% detected with delay (avg delay was 54 days)
      • 15% still not detected 6 months later

Obfuscation Quiz 2

Given the following obfuscation techniques, who is the software hiding from?

  • Rootkits
    • Users
  • Thoroughly mapping security sites and honey pots so as to avoid them
    • Security
  • Using nonce-based encryption methods
    • Researchers

Malware Analysis

  • Understanding malware behaviors
    • Network and host level detection/blocking
    • Forensics and asset remedation
    • Threat/trend analysis
  • Malware authors make analysis very challenging
    • Increasingly sophisticated techniques
  • A core challenge is that the sheer volume of malware samples is difficult to keep up with, due to automation
    • DIY kits, packing tools, server-side polymorphism vastly increase the volume of samples
    • There are hundreds of thousands of new instances each day
      • Collected from crawlers, mail filters, honeypots, user submissions, and malware exchanges
    • Given that the enemy is automated, analysis must be as well
  • The malware uncertainty principle
    • Observer affecting the observed environment
    • Malware may alter its behavior if observation is detected
    • Robust and detailed analyzers are typically invasive
      • In-memory presence
      • Hooks
      • CPU Emulation
    • Malware may refuse to run entirely
    • This is already commercialized
      • Dynamic analyzer detection is a standard malware feature
  • We need malware analysis to be transparent to the malware. Malware must not be able to detect that it is being analyzed
    • Higher privilege for analysis tool than the malware has
    • Must have no non-privileged side effects that might be detected
    • Malware should get the same instruction execution semantics, exception handling, and notion of time as it would if the analyzer were not present
  • Most existing tools fall short of the above transparency reqquirements
    • In-guest tools (runs on same machine with same privilege as malware)
      • No higher privilege
      • Non-privileged side effects
      • Exception handling issues
    • Reduced Privilege Guests (VMWare, etc)
      • Non-privileged side effects
    • Emulation (QEMU, Simics)
      • No identical instruction execution semantics
      • These are the most widely used by state of the art analyzers but they have major shortcomings
        • There have been attacks that detect full system emulator approaches by exploiting incomplete emulation
        • There is no way to guarantee the absence of such attacks
        • Determining whether the languages of two Turing machines (L and L’) are equal is known as the undecidable problem EQTM
  • Example of same instruction semantics attack
    • 0xf30xf3...0x90
      • 15 0xf3s before 0x90: execute no-op 15 times
      • But the instruction (16 bytes) is illegal because max x86 instruction length is 15 bytes
        • In bare metal, “illegal instruction” excetpion
        • Emulator will happily execute this
  • Most challenging transparency requirement is the “identical notion of time”
    • Network-based timing measurements
      • e.g. Analyzer causes delay in requests, or the website the malware tries to connect to is not the real one
      • Impossible to identify all
      • Equivalent to the problem of detecting and removing all covert channels
        • This is proved to be an undecidable problem

Analysis Difficulty Quiz

Rank the four categories from easiest (1) to hardest (4)

  • Static Properties Analysis: examine the static properties of the malware. Static properties include: metadata, strings embeddedi n the malware, header details, etc
    • 2
  • Manual Code Reversing: use a disassembler and decompiler to recreate the malware code
    • 4
  • Fully Automated Analysis: use fully automated tools to analyze the malware
    • 1
  • Interactive Behavior Analysis: running the mlaware in a protected and isolated environment
    • 3

Analysis Technique Results Quiz

Rank the four categories from the most information is obtained (1) to least information is obtained (4)

  • Static Properties Analysis: examine the static properties of the malware. Static properties include: metadata, strings embeddedi n the malware, header details, etc
    • 3
  • Manual Code Reversing: use a disassembler and decompiler to recreate the malware code *1
  • Fully Automated Analysis: use fully automated tools to analyze the malware
    • 4
  • Interactive Behavior Analysis: running the mlaware in a protected and isolated environment
    • 2

Robust and Efficient Malware Analysis

  • Robust == transparent. not easily detected or evaded
  • Efficient == automated and fast
  • Focus on host-based analysis. Run malware on a machine and observe its behaviors and properties
    • Malware will try to detect the presence of the analyzer, via differences in analysis environment
    • Our goal to achieve identical execution, such that the analyzer cannot be detected
  • Formal requirements
    • Higher privilege
      • Analyzer has to have higher privilege over malware
    • No non-privileged side-effects
      • Minimal side-effects should be introduced by the analyzer
      • Analyzer should enforce privileged access to side-effects
    • Identical basic instruction execution semantics
      • Hardware semantics and emulation
    • Transparent exception handling
      • Exception handling should be transparent or same
    • Identical measure of time
      • Timing for each instruction
      • Timing for I/O, exception handling

Ether Malware Analyzer

  • Built at Georgia Tech! And it’s open-source!
  • Higher Privilege
    • Built using Intel VT (Hardware Virtualization)
    • Hypervisor has higher privilege over kernel
    • Several hardware supported traps – VM exits
  • No non-privileged side-effects
    • Analyzer is outside environment
    • Minimal side-effects. Trap flag, system call handling
  • Identical basic instruction execution semantics
    • Uses hardware virtualization
    • Instruction execution semantics are the same
  • Transparent exception handling
    • Hypervisor pre-empts before OS exception handling
  • Identical measure of time
    • RDTSC (RealDateTimeStampCounter instruction)
    • Privileged instruction. Hypervisor can control responses to malware whenever the malware tries to get a measure of time

img

  • Ether has a Hypervisor component in Xen, and also runs in a separate VM.
  • Ether provides both a fine-grained instruction-by-instruction examination of the malware, as well as a more coarse-grained systemcall-by-systemcall examination
  • EtherUnpack: extracts hidden code from obfuscated malware
    • Georgia Tech researchers compared how well current tools extract hidden code by obfuscating a test binary and looking for a known string in the extracted code
  • EtherTrace: Records system calls executed by obfuscated malware
    • Georgia Tech researchers obfuscated a test binary which executes a set of known operations, and then observe if they were logged by the tool
    • img

Analysis vs Obfuscations

  • Analysis: Static Analysis based approaches
    • Obfuscations: Polymorphism, metamorphism, packing, opaque predicates, anti-disassembly
  • Analysis: Dynamic malware analysis (running the malware)
    • Obfuscations: Trigger-based behavior (logic bombs, time bombs, anti-debugging, anti-emulation, etc)
  • Analysis: Dynamic multipath exploration (Bitscope, EXE, Forced Execution)
    • Obfuscation: ??
      • Battle continues

Malware Emulators

  • Recent move towards emulator-based obfuscation
    • An instruction-level obfuscation approach
    • Several commercial pakcers support emulator based obfuscation, including Code Virtualizer and VMProtect
    • Emulation techniques maturing, widespread adoption is possible

img

Impacts on Existing Malware Analysis

  • Unknown language L
    • L can be randomly generated
  • Pure Static Analysis (whitebox)
    • Completely thwarted
    • Only emulator code is analyzable
    • PL is considered as data by the analyzer
  • Greybox methods (dynamic analysis)
    • Includes instruction level analyzers, information-flow, dynamic tainting, multi-path exploration, etc
    • Analysis is inaccurate
    • For example, paths may be explored in the emulator, but not on the malware
  • Manual reverse-engineering methods cannot scale
    • Manual reverse-engineering takes time
    • Each malware instance can have new bytecode language and emulator, making reverse-engineered information obsolete
  • Need automated techniques to reverse emulator
    • Should not require any knowledge about bytecode
    • Should be generic and work for a large class of emulators
  • Is automated reverse engineering possible?
    • Theoretically, it is an undecidable problem
    • However, from intuition, the emulator’s fetch-decode-execute behavior can be identified at runtime

Approaches of Emulation

img

Challenges of Reverse Engineering

  • No knowledge of bytecode program
    • The location of the bytecode program in the obfuscated program’s memory is not known
  • No knowledge of emulator’s code
    • The code that corresponds to decode, dispatch, and execute phases of the emulator is not known
  • Intentional variations
    • Context can be maintained in many different ways
    • An attacker may complicate Virtual Program Counter (VPC) identification by maintaining it in different correlated variables
    • Bytecode program may be stored in non-contiguous memory

Approach Overview

  • Abstract Variable Binding
    • Identify pointer variables within raw memory of emulator using access patterns of memory reads and writes
    • It is a combination and forward and backward data-flow analysis
  • Identify Candidate VPCs
    • Cluster memory reads according to their bound variables
    • Each cluster provides a candidate VPC
  • Identify Emulator Phases
    • Identify decode-dispatch loop and emulator phases
  • Extract bytecode Syntax and Semantics
    • Identify fetched instruction syntax – opcode and operands
    • Identify execute routine related to instruction – this is the semantics of bytecode instruction
    • This information is subsequently used to generate control flow graphs (CFGs)

Experimental Evaluation

  • Researchers created a synthetic program, used two emulation obfuscators (VMProtect and CodeVirtualizer) and applied an implementation of the above approach
  • They then compared the CFG of the original program to the CFG of the deobfuscated results

img img

  • Tool is successful in extracting main properties