This document is part of a course on Defensive AI for a Master's in Artificial Intelligence at ESEI – University of Vigo. It explores AI/ML techniques applicable to malware analysis, including types of malicious software and different analysis approaches, focusing on static and dynamic methodologies.
CYB. Defensive AI (part 3)
Master in Artificial Intelligence 2024/25
ESEI – University of Vigo

AI/ML in Malware Analysis

Malware: Definition and Types
- Malware (malicious software) → software intentionally designed to harm, exploit, or compromise computer systems and data
- Typical types of malware (a sample can be a mixture of several)
  - Self-replicating
    - Viruses: programs that spread when the infected files are executed [Stuxnet]
    - Worms: programs that spread across networks without user interaction [SQL Slammer]
  - Auto-hiding malware
    - Trojans: software that masquerades as legitimate but contains malicious code (backdoor, data theft) [Qbot/Qakbot, TrickBot]
    - Rootkits: hide malicious software and make it difficult to detect or remove [Linfo, Pandora, HIDEDRV]

Malware: Definition and Types (II)
- Designed to harm
  - Ransomware: encrypts a victim’s files and demands a ransom for decryption [CryptoLocker, Phobos/Dharma]
  - Botnets: networks of compromised computers used for malicious activities (DDoS attacks, spam) [Mirai, Andromeda]
  - Logic/Time Bombs: malicious code that triggers when specific conditions are met (causing damage to system/data)
  - Keyloggers: record keystrokes to capture sensitive information
  - Cryptojacking: cryptocurrency mining malware (hijacks a computer’s processing power) [Kinsing, LoudMiner]
- "Grayware"
  - Spyware: collects information without the user’s consent (espionage, advertising) [CoolWebSearch, Gator]
  - Adware: displays unwanted advertisements and can collect user data for targeted advertising [Fireball, Appearch]

Malware Analysis
- Goal: understand the behavior and purpose of a suspicious file
  - detect and mitigate a potential threat
  - triage incidents by level of severity
  - uncover hidden indicators of compromise (IOCs)
- Static analysis
  - Examining the code and characteristics of a malware sample without executing it
  - Study the file’s structure, strings, metadata, embedded resources → identify known patterns/signatures
  - Indicators: file names, hashes, strings, IP addresses, domains, file headers
  - Tools: disassemblers, static rules (e.g., YARA rules)
  - Approaches: signature-based, heuristic analysis
  - (+) Effective at detecting known malware
  - (-) May miss sophisticated or polymorphic threats

Malware Analysis (II)
- Dynamic analysis
  - Running malware in a controlled environment to observe its actual behavior
  - Malware is executed within a sandbox: virtual machine, isolated system, etc. (to prevent harm to the host)
  - Monitor the malware’s actions: file system access and changes, network communication (TCP, DNS, ...), library loading, system calls, ...
  - Useful for detecting/analyzing unknown or evolving malware
  - (+) Accurate understanding of the malware’s behavior
  - (-) Resource-intensive

Malware Analysis (III)
- Resources and online sandboxes
  - MalwareBazaar
  - VirusShare.com
  - Malware Traffic Analysis
  - Microsoft Malware Classification Challenge (BIG 2015) (Kaggle)
  - Cuckoo Sandbox (open source, sandbox running in VirtualBox)
  - Mobile Security Framework (MobSF) (automated static and dynamic malware analysis for mobile apps)
  - Joe Sandbox, Joe Sandbox report examples
  - Hybrid Analysis
  - VirusTotal (VT APIv3)
- Critical aspect of AI/ML in malware analysis → feature generation (static vs dynamic)

Typical features in Malware analysis
- Static features
  - Opcode Sequences [sequence of operation codes within the binary code]
  - API Import and Export Functions [specific API functions used to interact with the operating system for malicious purposes]
  - File Metadata [size, creation/modification dates, certificates], String Analysis
  - Control Flow Graph (CFG) [flow of control between sections of the code] → malware shows unique CFG patterns ⇒ ML models use them for anomaly detection and classification
  - File Headers and Sections [features from the file header (Portable Executable (PE) header in Windows), section names (.text, .data, .rsrc)]
  - Image Representation [malware samples transformed into grayscale images] → malware families show distinctive visual patterns ⇒ use CNNs (Convolutional Neural Nets)
  - Permissions and Manifest Information (for mobile malware)

Typical features in Malware analysis (II)
- Dynamic features
  - API Call Sequences and Frequencies [malware makes specific system calls, in a certain order and frequency ⇒ ML models & behaviour analysis]
  - Memory Access Patterns [how and when malware accesses specific memory regions] → clues about malware behavior (privilege escalation, sensitive memory regions)
  - Network Traffic Patterns [malware communicates with remote servers for command and control (C2) purposes, DNS requests to known malicious sites]
  - System Call Behavior [malware uses specific system calls more frequently than benign programs (file, network, process operations)]
  - Persistence Mechanisms (startup entries, scheduled tasks) and Registry Operations (for Windows malware)

Microsoft Malware Classification Challenge (BIG 2015)
- Dataset: https://www.kaggle.com/c/malware-classification
- Paper: https://arxiv.org/pdf/1802.10135.pdf
- Multiclass problem: 9 types of malware
- Raw binary data (≈ 400GB) + metadata (function calls, strings, ...)
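
Example (sketch): n-gram counts over opcode or API call sequences. Both the opcode sequences (static) and the API call sequences (dynamic) listed above are usually turned into fixed-length vectors by counting contiguous n-grams. The snippet below is a minimal illustration, not a reference implementation: the function names `ngram_counts` and `to_vector`, the API call trace, and the vocabulary are all hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Iterable, Sequence


def ngram_counts(tokens: Sequence[str], n_values: Iterable[int] = (1, 2, 3)) -> Counter:
    """Count contiguous n-grams of a token sequence (opcodes or API calls)."""
    counts: Counter = Counter()
    for n in n_values:
        # range() is empty when the sequence is shorter than n
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def to_vector(counts: Counter, vocabulary: Sequence[tuple]) -> list:
    """Project counts onto a fixed vocabulary so every sample has the same length."""
    return [counts.get(gram, 0) for gram in vocabulary]


# Hypothetical API call trace, as a sandbox might log it (illustrative only).
trace = ["CreateFileW", "WriteFile", "CloseHandle",
         "CreateFileW", "WriteFile", "RegSetValueExW"]

counts = ngram_counts(trace, n_values=(1, 2))

# Hypothetical vocabulary; in practice it would be built from the training set.
vocabulary = [
    ("CreateFileW",),
    ("CreateFileW", "WriteFile"),
    ("WriteFile", "CloseHandle"),
    ("connect",),
]
features = to_vector(counts, vocabulary)
print(features)  # -> [2, 2, 1, 0]
```

The resulting vectors are sparse and grow quickly with n, which is why a feature selection step is typically applied before training a classifier.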

Microsoft Malware Classification Challenge (BIG 2015) (II)
- First place approach in the MMC Challenge (BIG 2015) (YouTube)
- Features (static analysis on disassembled code)
  - Opcode n-grams (1-, 2-, 3- and 4-grams from the disassembled machine code) [> 70K sparse features]
  - Segment name counts [≈ 400 features]
  - Feature selection using Random Forest "importance" [≈ 4.4K opcode n-gram features, 19 segment counts]
  - First 800 pixel intensities obtained by viewing the asm (assembly) file as an image
  - Other features: ≈ 2K
- Final classification: XGBoost

AI/ML in Intrusion Detection

IDS/IPS
- Intrusion Detection/Prevention Systems (IDS/IPS)
  - Monitor network/hosts/applications to detect dangerous activity
  - IDS → detect and alert (passive)
  - IPS → detect and block (proactive)
- Network-based IDS vs Host-based IDS
  - NIDS → monitor network traffic [Snort, Suricata, Zeek]
  - HIDS → monitor host activity (file system, system calls, logs) [Fail2Ban, OSSEC/Wazuh]
- Signature-based IDS vs Behavior-based IDS
  - Signature-based → known attack patterns or signatures
  - Behavior-based → deviations from a baseline of "normal" behavior

Behavior-based / Anomaly-based IDS/IPS
- Unexpected events (anomalies) → system errors/misconfigurations or malicious activity
  - Data exfiltration, malware activity (ransomware, virus), botnet activity, reconnaissance traffic, ...
- Anomaly-based detection → detecting activities that are statistically unusual or abnormal
- Baseline establishment. Build a baseline of normal behavior for the system/network being monitored
  - By analyzing historical data → define typical or acceptable behavior
- Behavioral profiling. Continuously monitor and profile the behavior of users/systems/network traffic
  - Aspects to monitor: data transfer volumes, protocol usage, system resource usage, login times and frequency, ...

Behavior-based / Anomaly-based IDS/IPS (II)
- Anomaly detection. Statistical and ML models to assess deviations from the established baseline
  - Statistical approaches: deviation from a moving average, time series analysis, Markov models, ...
  - Supervised learning: Random Forests, Recurrent Neural Nets (RNN, LSTM), k-Nearest Neighbors, ...
  - Unsupervised learning: clustering (k-means, DBSCAN), autoencoders, outlier/novelty detection (Isolation Forests, One-Class SVM), ...
- Continuous learning
  - Manage baseline evolution and behavior change (concept drift)
  - Adaptive models/thresholds
  - Address data seasonality and changing trends

Anomaly detection techniques
- Finding events that don’t conform to an expectation (deviate significantly from the norm)
- Outlier detection: identifying data points that are significantly different from the majority of the data
  - Learning from data containing both outliers (anomalies) and "regular" data
- Novelty detection: identifying instances in the data that differ significantly from what was observed during the training phase
  - Objective: detect instances that were unseen during the model’s training
  - Learning a representation of "regular" data from data that does not contain novelties (anomalies)

Anomaly detection techniques (II)
- Types of anomalies
  - Point anomalies. Individual data instances that are significantly different from the rest of the dataset.
  - Contextual anomalies. Behavior/instances that are considered abnormal in a particular context (periods of time, regions) or under specific conditions (they may not be anomalies when considered in isolation).
  - Collective anomalies. A set of data points that, when considered together, exhibit anomalous behavior.

Anomaly detection techniques (III)
- [Figure not reproduced in this transcript] Source: Chiheb Chebbi, Mastering Machine Learning for Penetration Testing, 2019

Anomaly detection techniques (IV)
- Feature engineering for anomaly detection
- Host intrusion detection
  - Metrics/signals from host and OS activity
  - OS instrumentation and auditing frameworks: OSquery (cross-platform endpoint instrumentation), Auditd daemon (GNU/Linux auditing system)
  - Typical "signals":
    - Running processes
    - Active/new user accounts
    - Permission changes
    - DNS lookups
    - Network connections
    - Kernel modules loaded
    - System scheduler changes
    - Startup operations
    - Daemon/background/persistent processes
    - Syscalls
    - CPU utilization
    - Filesystem changes
  - Optionally, correlate signals from different sources

Anomaly detection techniques (V)
- Network intrusion detection
  - Features from traffic between hosts
  - Traffic metadata (info from packet headers) vs. deep packet inspection (info from packet payload)
  - Aggregated info (transactions/connections/bytes by IP address/subnet/geolocation) vs. individual info (single packet data)
  - Protocol analyzers (Zeek, CICFlowMeter [python]) and firewall logs
- Web/Application intrusion detection
  - Features from application-level logs
  - Server log info: IP-level statistics, malformed URLs, user agent patterns
  - Application log info: login attempts, application errors, out-of-order requests/accesses, ...
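
Example (sketch): deviation from a moving baseline. The "baseline establishment" and "deviation from a moving average" ideas above can be illustrated with a rolling z-score over a single monitored signal. This is a minimal sketch under stated assumptions, not a production detector: the function name `rolling_zscore_alerts`, the window size, the threshold of 3 standard deviations, and the byte counts are all made up for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_alerts(values, window=5, threshold=3.0):
    """Return the indices whose value deviates from the moving baseline
    by more than `threshold` standard deviations."""
    history = deque(maxlen=window)  # the baseline: last `window` normal values
    alerts = []
    for i, x in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A perfectly flat baseline (sigma == 0) never raises an alert here.
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                alerts.append(i)
                continue  # keep the anomaly out of the baseline
        history.append(x)
    return alerts


# Synthetic per-minute outbound byte counts for one host (illustrative only).
bytes_out = [100, 104, 98, 101, 97, 103, 99, 5000, 102, 100]
print(rolling_zscore_alerts(bytes_out))  # -> [7]
```

Excluding flagged values from the baseline keeps a single spike from inflating the mean, but it also means a genuine, lasting change in behavior keeps being flagged; handling that trade-off is what the adaptive thresholds and concept drift items above refer to.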

NIDS Datasets
- NSL-KDD Dataset (improved version of the Knowledge Discovery in Databases (KDD) Cup 1999 dataset)
  - "Classical" benchmark in ML-based intrusion detection
  - Data collected over 9 weeks on a simulated military network environment (≈ 4.9 M connection records)
  - Raw PCAP captures → 41 processed high-level features (categorical and numerical) [see KDD names]
  - 22 types of attacks [see attack types] in 4 general categories
    - dos [denial of service]
    - r2l [unauthorized access from remote servers]
    - u2r [privilege escalation attempts]
    - probe [brute-force probing attacks]

NIDS Datasets (II)
- Criticisms and limitations of the KDD Cup 1999 dataset
  - Outdated [does not adequately represent modern cyber threats and challenges]
  - Limited representation of real-world traffic [artificially generated dataset, does not accurately reflect the diversity and dynamics of actual network traffic]
  - Limited variety of attacks
  - Lack of network context
  - Focus on signature-based detection
  - Imbalance between normal and anomalous data
- Alternative datasets
  - UNSW-NB15 Dataset
  - CIC-IDS2017 and CSE-CIC-IDS2018 on AWS
  - UGR’16 Dataset
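
Example (sketch): mapping attack labels to the 4 general categories. A common first preprocessing step with KDD-style data is to collapse the 22 attack labels into dos / r2l / u2r / probe. The grouping below is the commonly cited one for the KDD Cup 1999 training labels and is not taken from the slides, so it should be checked against the dataset documentation before use. The function name `to_category` and the sample label list are hypothetical.

```python
from collections import Counter

# Commonly cited grouping of the 22 KDD Cup 1999 training attack labels
# (an assumption of this sketch; verify against the dataset documentation).
ATTACK_CATEGORY = {
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the table: attack label -> general category
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORY.items()
    for label in labels
}


def to_category(label: str) -> str:
    """Map a raw label to normal/dos/r2l/u2r/probe, or 'unknown' if unlisted."""
    # Some file variants write labels with a trailing dot (e.g. "smurf.")
    cleaned = label.strip().rstrip(".").lower()
    if cleaned == "normal":
        return "normal"
    return LABEL_TO_CATEGORY.get(cleaned, "unknown")


# Made-up label column, only to show the class imbalance check.
sample_labels = ["normal", "neptune", "smurf.", "guess_passwd",
                 "rootkit", "satan", "neptune"]
category_counts = Counter(to_category(label) for label in sample_labels)
print(category_counts)
```

Counting the categories this way also makes the imbalance between normal and anomalous data, listed among the criticisms above, easy to inspect before training.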