Sample

Overview

  • Definition

    A sample is a copy of a malicious file such as a malware component, which can be reverse engineered to discover how it works and to extract indicators of compromise (IOCs). Samples are often represented using their SHA1, SHA256 or MD5 hash.

  • Use case

    Threat actors use different types of malicious software for various purposes - these can include malware deployed on victim devices (e.g., for backdooring or cryptojacking), botnet management software installed on C&C servers, phishing infrastructure utilized on phishing landing pages, malicious scripts injected into hijacked webpages, and more.

  • Pivot Map
    flowchart LR
        classDef primary stroke-width: 2px
        classDef secondary stroke-dasharray: 5 5
    
        %% define nodes
        IP_ADDRESS(IP Address)
        DOMAIN(Domain)
        SERVER([Server / Client])
        SAMPLE(Sample):::primary
        USER_AGENT(User Agent)
        SAMPLE_(Sample):::secondary
        CODE([Code])
    
        %% define edges
    
        SAMPLE -- references ---> IP_ADDRESS
        SERVER -. hosted by .-> IP_ADDRESS
        SERVER -- stores --> SAMPLE
        SAMPLE -- communicates ---> SERVER
        SAMPLE <-- hash ---> SAMPLE_
        SAMPLE <-- code similarity --> SAMPLE_
        SAMPLE <-- behavior --> SAMPLE_
    
        SAMPLE -- references ---> DOMAIN
        SAMPLE -- queries ---> DOMAIN
        SAMPLE -- references --> USER_AGENT
        SAMPLE -- identifies as ---> USER_AGENT
    
        CODE -. compiles to ..-> SAMPLE
    
        %% define links
        click IP_ADDRESS "#ip-addresses"
        click DOMAIN "#domains"
        click SERVER "#servers"
        click SAMPLE_ "#samples"
        click USER_AGENT "#user-agents"
        click CODE "#source-code"
    

Victim-side vs. attacker-side tooling

When pivoting on a file sample, one must consider where the threat actor is expected to use it. For instance, while malware is more likely to be found within victim networks, toolkits and botnet management software are almost certain to only be identified on attacker-controlled servers.

Pivots

Clients

Clients it can be found on

Samples of malware may be retrieved from infected clients by performing forensics, or through security product telemetry.

Conversely, samples of attacker-side toolkits can be found on threat actor machines (e.g., their laptops) and remote jump boxes they operate for connecting to servers or infected devices.


Servers

Servers storing it

Attacker-controlled servers may store malware for victim devices to download during an infection process. Gaining access to such servers may therefore afford access to samples of the aforementioned malware.

Servers it communicates with at runtime

By executing a malware sample in a sandboxed environment, by observing malware that has infected a honeypot, or by analyzing security product telemetry sourced from an infected device, one can determine if the infected machine communicates with any IP addresses of attacker-controlled servers for C&C, data exfiltration, etc.


Domains

Domains it references or queries

Threat actors often configure their malware to communicate with one or more C&C servers, and this usually involves listing a domain within the malware's code (in such instances, the domain is said to be "hardcoded" in the malware).

When executed (on an infected device, honeypot, or in a sandboxed environment), the malware will send a DNS request to resolve the domain, and then communicate with the server hosted on the resolving IP address. By running a static analysis of the sample (even through something as simple as running the strings utility on it), one can reveal any such hardcoded domains it may contain, provided they are not obfuscated.
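The static approach can be sketched in a few lines of Python. This is a minimal illustration that mimics the strings utility and then filters for domain-shaped text; the function names, the minimum string length, and the small `FILE_EXTENSIONS` exclusion list are assumptions made for this example, and the output should be treated as candidates for manual review rather than confirmed C&C domains.

```python
import re

# Domain-shaped text: dot-separated labels followed by an alphabetic TLD.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,24}\b",
    re.IGNORECASE,
)

# Illustrative (incomplete) list of suffixes that are usually file names, not TLDs.
FILE_EXTENSIONS = {"dll", "exe", "sys", "dat", "tmp", "txt", "log", "php"}


def extract_strings(data, min_len=6):
    """Return runs of printable ASCII at least min_len long, like `strings`."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def candidate_domains(data):
    """Return sorted, de-duplicated domain-like strings found in raw bytes."""
    found = set()
    for text in extract_strings(data):
        for match in DOMAIN_RE.findall(text):
            domain = match.lower()
            if domain.rsplit(".", 1)[1] not in FILE_EXTENSIONS:
                found.add(domain)
    return sorted(found)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for domain in candidate_domains(handle.read()):
            print(domain)
```

Packed or encrypted samples will typically yield nothing here, in which case dynamic analysis is the more reliable route.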

  • Pivot Minimap
    flowchart LR
        classDef primary stroke-width: 2px
        classDef secondary stroke-dasharray: 5 5
        classDef tool fill:#1433F7, stroke:#556CFF, fill-opacity:0.2
        classDef fingerprint fill:#02FF25, stroke:#02FF25, fill-opacity:0.2
    
        %% define nodes
        DOMAIN(Domain)
        SAMPLE(Sample):::primary
        sg1:::tool
    
        FILE_HASH[File Hash]:::fingerprint
    
    
        %% define edges
        SAMPLE -. hashed to .-> FILE_HASH
        FILE_HASH -- queried in --> sg2
    
    
        subgraph sg1 [Malware Zoo]
        subgraph sg2 [Database]
        SAMPLE_(Sample):::secondary
        end
        subgraph sg3 [Analysis]
        SANDBOX[Sandbox]
        STRINGS[Strings]
        end
        end
        SAMPLE -- uploaded to --> sg3
        SANDBOX -- queries --> DOMAIN
        STRINGS -- references --> DOMAIN
        SAMPLE_ -- relates to ---> DOMAIN
    
        %% define links
        click SAMPLE_ "#samples"
        click FILE_HASH "/fingerprints/#file-hash"
        click MALWARE_ZOO "/tools/#malware-zoos"
    

IP Addresses

IP addresses it references

By statically scanning a malware sample or reverse engineering it, analysts can identify server IP addresses that are hardcoded in it, provided the sample's obfuscation does not conceal them.
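As a rough sketch of such a static scan, the snippet below searches raw bytes for IPv4-shaped text and keeps only publicly routable addresses. The function name and the choice to discard private or reserved ranges are assumptions for illustration; dotted version numbers (for example "1.2.3.4") can still show up as false positives.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted number.
IPV4_RE = re.compile(rb"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def candidate_ips(data):
    """Return sorted, de-duplicated public IPv4 addresses found in raw bytes."""
    found = set()
    for raw in IPV4_RE.findall(data):
        try:
            ip = ipaddress.ip_address(raw.decode("ascii"))
        except ValueError:
            continue  # not a valid address, e.g. an octet above 255
        if ip.is_global:  # skip private, loopback, and reserved ranges
            found.add(str(ip))
    return sorted(found)
```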


User Agents

User agents identifying it or referenced by it

Malware, attacker-side toolkits, and attacker-operated crawlers that communicate over HTTP/S typically identify themselves with a User-Agent header (the protocol recommends sending one rather than strictly requiring it, but requests without one tend to stand out). Most threat actors therefore configure their tools to use a prevalent user agent (or rotate between a set of common user agents) in order to blend in with background noise. At times, however, they might make the mistake of using a unique user agent (perhaps as a result of a typo) or a nonsensical one (such as a machine identifying as an iPhone but fingerprinted as an IoT device). In such cases, the combination of user agent and other parameters might be uniquely identifiable enough to be used as an effective indicator for discovering infected clients or attacker-controlled infrastructure.

By observing a given sample in a sandboxed environment, honeypot, infected device, or via security product telemetry, analysts can identify which user agents it identifies as. Similarly, analysts can reveal such user agents through static analysis or reverse engineering of the sample, depending on its level of obfuscation.
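One way to spot an unusually rare user agent in telemetry is to count how many distinct clients send each one. The sketch below assumes the telemetry has already been reduced to `(client_ip, user_agent)` pairs; that record shape, the function name, and the threshold are hypothetical choices for this example.

```python
from collections import defaultdict


def rare_user_agents(records, max_clients=1):
    """Map each user agent seen from at most max_clients distinct clients
    to the sorted list of those clients."""
    clients_by_ua = defaultdict(set)
    for client_ip, user_agent in records:
        clients_by_ua[user_agent].add(client_ip)
    return {
        ua: sorted(clients)
        for ua, clients in clients_by_ua.items()
        if len(clients) <= max_clients
    }


if __name__ == "__main__":
    observed = [
        ("10.0.0.1", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
        ("10.0.0.2", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
        ("10.0.0.3", "Mozila/5.0 (Windows NT 10.0)"),  # typo in "Mozilla"
    ]
    print(rare_user_agents(observed))
```

Rarity alone is a weak signal, since legitimate niche software also produces uncommon user agents, so hits are best combined with other parameters as described above.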


Samples

Samples with same hash

Since a cryptographic file hash is, for practical purposes, unique to a file's contents, querying for a hash in a "malware zoo" platform such as VirusTotal can lead to other copies of the same sample. This can be useful for analysis if these other copies have different metadata than the original, such as their filename, where they were uploaded from, and relationships with other artifacts (for example, one copy of the sample might have been stored in a compressed archive along with other, different samples, or it might have been available for download at one point from a phishing website).
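Before querying a platform, an analyst needs the sample's hashes. The following is a minimal sketch using Python's standard `hashlib` module; the function name and chunk size are arbitrary choices for this example. Because the digests depend only on file contents, two copies with different filenames produce identical values.

```python
import hashlib


def file_hashes(path, chunk_size=1 << 20):
    """Compute MD5, SHA1, and SHA256 of a file in one pass, reading in chunks."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}


if __name__ == "__main__":
    import sys

    for name, value in file_hashes(sys.argv[1]).items():
        print(f"{name}: {value}")
```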

Samples with code similarity to it

Threat actors may develop their tools over long periods of time, going through multiple iterations of the same tool. Additionally, they might reuse certain self-developed software components across more than one tool, or use the same development environment when working on different tools. In all of these cases, code similarity analysis can reveal such commonalities between different samples, if they exist.

Analysts can upload a given sample to a code similarity platform, and check if any other previously uploaded samples are similar to it.
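To give a feel for what "similar" means here, the sketch below scores two byte sequences by the Jaccard similarity of their byte n-grams. This is a deliberately simple stand-in chosen for illustration, not the method used by any particular code similarity platform, which typically compare disassembled functions or use purpose-built fuzzy hashes; the function names and the n-gram size are assumptions.

```python
def ngrams(data, n=4):
    """Return the set of all n-byte windows in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a, b, n=4):
    """Jaccard similarity of the byte n-gram sets of a and b (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both inputs shorter than n: nothing to compare
    return len(grams_a & grams_b) / len(union)
```

A score near 1.0 suggests largely shared content, while packing or recompilation with different settings can push the score toward 0.0 even for closely related tools.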

Samples with overlapping behavior

Threat actors might develop several variations of the same tool, or implement overlapping functionality in different tools, depending on their operational requirements. By dynamically analyzing a given sample in a sandbox, observing it on a honeypot, or monitoring it through security product telemetry, analysts can characterize the sample's behavior and map it to certain sets of TTPs. If these are unique enough, analysts can use them to surface additional samples exhibiting the same combination of behavioral traits, for example by querying "malware zoo" platforms such as VirusTotal.
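Once behavior has been mapped to TTPs, finding samples with the same combination reduces to a set comparison. The sketch below assumes each sample's behavior is already available as a collection of MITRE ATT&CK technique IDs; the corpus structure, sample names, function name, and overlap threshold are all hypothetical and do not reflect any platform's query API.

```python
def samples_sharing_behavior(target_ttps, corpus, min_overlap=3):
    """Return {sample_id: shared_ttps} for corpus samples that share at
    least min_overlap techniques with the target sample."""
    target = set(target_ttps)
    hits = {}
    for sample_id, ttps in corpus.items():
        shared = target & set(ttps)
        if len(shared) >= min_overlap:
            hits[sample_id] = sorted(shared)
    return hits


if __name__ == "__main__":
    # Illustrative data only: IDs are real ATT&CK techniques, samples are made up.
    target = {"T1059", "T1547", "T1071", "T1496"}
    corpus = {
        "sample_a": {"T1059", "T1547", "T1496", "T1027"},
        "sample_b": {"T1071"},
    }
    print(samples_sharing_behavior(target, corpus))
```

Common techniques appear in a large share of malware, so a higher threshold or a focus on rarer techniques helps keep the result set meaningful.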

Samples with overlapping observables

Malware is often not deployed as a single file on the disk of an infected device; rather, it leaves traces in multiple locations, such as files in certain paths, registry keys, process names, etc. By observing an infected device, sandbox, or honeypot, or by checking security product telemetry, an analyst can identify such traces and leverage them to detect other instances of the same sample, or variants of it.

Given a sample, analysts can use "malware zoo" platforms such as VirusTotal to query for any such previously encountered samples, usually using YARA rules for this purpose.


Source Code

Code which compiles to it

If a threat actor is using an open-source tool that isn't unique to their own operations, its source code is likely to be available in a code repository.

Conversely, source code for proprietary tools can be found on attacker-controlled machines, and is sometimes published online as a result of hack-and-leak operations conducted against the threat actor, or following internal disputes within threat actor groups.