2 - RANSAQ - Searching Source-Binary Hybrid Code Property Graphs for Vulnerabilities.pdf
Document Details
Uploaded by DelectableRubellite9401
University of Texas at Dallas
Tags
Full Transcript
RANSAQ: Searching Source-Binary Hybrid Code Property Graphs for Vulnerabilities Shamila Wickramasuriya1[0000 0003 2917 0348] , Erick 1[0009 0004 4804 4261] Bauman , and Kevin W. Hamlen1[0000 0003 0479 6280] The Univers...
RANSAQ: Searching Source-Binary Hybrid Code Property Graphs for Vulnerabilities Shamila Wickramasuriya1[0000 0003 2917 0348] , Erick 1[0009 0004 4804 4261] Bauman , and Kevin W. Hamlen1[0000 0003 0479 6280] The University of Texas at Dallas, Richardson, Texas, USA Abstract. RANking Source-Assembly Queries (RANSAQ) of Code Prop- erty Graph (CPG) structures is a new strategy for finding security vul- nerabilities in compiled C programs that leverages compilation targets (e.g., assembly code) coupled with program source code data to discover exploitable security vulnerabilities that are indiscernible without both forms of information. This addresses the large class of attacks wherein adversaries exploit details of the compiled binary code to abuse high-level developer intentions and errors embodied in the original source code. Evaluation of RANSAQ on 8 applications comprising 424K SLoC spanning 1362 source files shows that the proposed POI ranking method can detect 70% of reported vulnerable functions on average by exploring 30% of high risk functions, and achieve nearly 100% by exploring 45%. Keywords: static analysis, code property graph, vulnerability detection, function classification 1 Introduction Timely identification of high-risk vulnerabilities in large software products has become an increasingly critical throughout the software industry. Almost a third of all attacks against enterprise networks target software vulnerabilities with over half of these vulnerabilities classified as high-risk or critical, and requiring a mean time to remediation (MTTR) of multiple months. This is in part due to the increasing scalability gap between software size and complexity and the processes used to find and fix vulnerabilities. Such audits typically entail careful manual or semi-manual code review—a tedious and error-prone process requiring significant expertise and e↵ort. Of the 84% of companies having such a process, only 36% conclude that their processes are most e↵ective for improving code quality, and 42% of the remaining companies lack the personnel needed to adopt such a process. As a result, much deployed software has not undergone comprehensive auditing due to cost limitations and expansive attack surfaces. The slowness of manual audits exacerbates this problem by allowing critical vulnerabilities to leak into applications that have changed since the audit. For example, the recent xz supply chain attack evaded overburdened open source auditors by introducing critical vulnerabilities into xz Utils’ binary self-testing code—a large, complex, and therefore rarely scrutinized yet widely deployed 2 S. Wickramasuriya et al. library. The attack has been described as a “near-miss cyber catastrophe” due to how hard it was for humans to spot. This showcases the need for more e↵ective, scalable tools for quickly discovering likely vulnerabilities in large products. One central impediment to these e↵orts involves the many layers of abstraction currently used to develop large products. Low-level languages like C/C++, which remain the most prevalent languages for COTS software development, expose unsafe operations to programmers (e.g., unchecked type casts and pointer arithmetic), that can turn code into a dense minefield of risky actions. To help programmers avoid these pitfalls, modern dev environments introduce abstractions that add automatically generated checks. For example, many C idioms replace dynamic allocation and dereference operators with macros that expand into architecture- specific, guarded, equivalent code. But this can frustrate auditing because e↵ective source auditing must then navigate a maze of complex macros, compiler pragmas, in-lined assembly code, and potentially security-relevant optimization decisions to accurately analyze even simple programs. To address this dilemma, we propose a new strategy for semiautomated code review through cross-domain code property graphs (CDCPGs), which combine CPGs with semantic information from both source code and binary code into a single, searchable data structure. CPGs have emerged as a powerful tool for modeling large codes and their properties at the source level. By leveraging information from the binary domain, CDCPGs can more e↵ectively discover, analyze, and diagnose likely vulnerabilities in large, legacy software products expressed in unsafe languages such as C/C++, and answer security-relevant questions that span both domains. For example, discovering a return-oriented programming (ROP) vulnerability might entail searching the CDCPG for control-flow paths that dereference a bu↵er pointer variable (defined at the source level) to assign to the return address on the call stack (defined at the binary level). Combining these semantic domains into queryable graph structures for syntax, control-flow, and dataflow analysis exposes important classes of vulnerabilities that are otherwise hard to diagnose because of flow and context sensitivity. To demonstrate, we introduce RANSAQ: an automated, scalable and extendable static analysis tool to detect vulnerabilities. We efficiently extract contextual information from both source and binary, thereby abstracting the behavior of selected functions. Discovered points of interest (POIs) are ranked by vulnerability score (VS) to help auditors focus on the code areas that pose greater risk/severity. RANSAQ thereby helps auditors find POIs worthy of human inspection in a sea of security-irrelevant code, and helps them decide whether a POI is dangerous. In summary, we make the following contributions: – We introduce and implement source-binary CDCPGs and vulnerability-finding queries for them. These graphs can answer semantic questions that source analysis and binary analysis alone cannot. – Combining source and binary information is shown to eliminate false positives that are otherwise difficult to avoid because they depend on statically unde- cidable code properties (e.g., code reachability). The binary semantics reveal RANSAQ 3 Table 1: CWE Classes of interest Class Description CWE-119 Improper Restriction of Operations within Memory Bu↵er Bounds CWE-120 Bu↵er Copy without Checking Size of Input CWE-200 Exposure of Sensitive Information CWE-416 Use After Free CWE-476 NULL Pointer Dereference CWE-676 Use of Potentially Dangerous Function CWE-690 Unchecked Return Value to NULL Pointer Dereference the compiler’s decisions, allowing RANSAQ to analyze the resulting security implications irrespective of whether the compiler was correct. – To help humans focus their e↵orts, POIs are ranked by VS and code complexity. This helps distribute the workload between auditors of varying skill, addressing the scarcity of high-skill code analysts. – We evaluate RANSAQ’s e↵ectiveness on 434 CVEs in 8 real-world applications and 5 hacking challenges drawn from the DARPA CHESS program.1 The prototype identifies 83 CVEs in the selected applications, exhibiting higher recall than prior approaches and reporting precise, CVE-relevant POIs within its top 10% of POIs for all applications. 2 Overview 2.1 Objectives and Motivation Our goal is to provide human auditors with a ranked list of code-points in C programs that might o↵er malicious exploitation opportunities to adversaries, in order to focus their e↵orts. Table 1 lists vulnerability classes of interest. This distinguishes our work from general debugging in that we intentionally omit or deprioritize possible code errors that are unlikely to present a security risk. This helps reduce the search space and the false positive rate. For example, Listing 1.1 shows some C and assembly code that is not vulnerable, but that many source-only analyzers falsely identify as a bu↵er overflow. The code is problematic because the size of buf is difficult to statically predict, since it depends on the size of pthread t, which is a system-defined structure. However, the assembly code reveals that the compiler has reserved 24 bytes, which makes the loop bound of 12 safe. Moreover, the assembly code also reveals a compiler-inserted stack canary guard that blocks exploits even were a bu↵er overflow vulnerability present. This bug therefore poses almost no security risk, and is probably not worth an auditor’s time. Listing 1.2 exhibits code that presents the opposite problem. To protect against information leaks, the source code conservatively zeros secret once it is no longer needed. However, the assembly code reveals that the compiler has optimized away the redaction because its dataflow analysis classifies it as a write-without- read. Hybrid source-binary analysis can therefore detect this compiler-introduced vulnerability even though neither source nor binary analysis alone has enough information to reliably succeed. 1 https://www.darpa.mil/program/computers-and-humans-exploring-software-security 4 S. Wickramasuriya et al. 1 char buf[sizeof(pthread_t)]; sub esp, 24 2 for (int i=0; i < 12; ++i)... 3 buf[i] = 0x7F; 4... 5 mov eax, [esp+4] 6 xor eax, 7 jnz __stk_chk_fail 8 return; ret Listing 1.1: Non-vulnerable C and assembly code example 1 memset(secret, 0, sizeof(secret)); nop 2...... 3 return; ret Listing 1.2: Compiler-introduced vulnerability 2.2 Challenges These examples showcase the value of combining semantic information from source and binary code to discover likely vulnerabilities in a given software deployment. CPGs have a well-established record of capturing code semantic information to facilitate security analyses (e.g., [61, 63, 21, 11, 66]), and are therefore a natural fit for meeting this need. However, CPGs have not previously combined code semantic information from multiple di↵erent computational models that represent a common program. Doing so raises several technical challenges. Relating Semantic Domains. To assess POI exploitability, graph queries must consult both source-level and binary-level semantic information simultaneously. For example, queries must decide whether there exists a control-flow path from source line `1 to source line `2 that clobbers binary-level register esp. This precludes solutions that place source and binary information in separate graphs. Merely unioning the edges of two CPGs to form a CDCPG would be unsound. For example, the union of the source- and binary-level control-flow graphs (CFGs) is a CFG that has two di↵erent paths from `1 to `2 —one through the source and one through the binary code—instead of a single path that encodes the e↵ects of both. CDCPGs therefore introduce a new class of relational edges that indicate that two distinct nodes or edges actually represent the same code entity in di↵erent semantic domains. These relational edge sets are not one-to-one, since some binary entities have no source (e.g., compiler-introduced code), and some source entities that have no binary (e.g., compiler-eliminated “dead” code). Path Explosion. Exhaustively linking CPGs from two semantic domains risks exacerbating the perennial path explosion problem. RANSAQ therefore builds cross-domain portions of the CDCPG lazily, using VS estimates to condition subgraph generation. In particular, binary symbolic analysis only proceeds where source analysis reveals a high-risk POI that demands exploitability and impact evaluation. This lazy approach is inspired by past research that evaluates functions bottom-up in call graphs to avoid duplicate analysis. We further reduce the scope by narrowing the function subset by vulnerability class. For example, to analyze dynamic allocations, evaluation starts with functions RANSAQ 5 1 char token; 2 char buffer; 3 int authenticated; 4 recv(socket_con, buffer, 80, 0); 5 if (!strncmp(buffer, token, strlen(token))) 6 authenticated = TRUE; 7... 8 if (authenticated != FALSE) 9... Listing 1.3: Bu↵er Overflow 1 uint64_t sz = mSampleTime * 2 * sizeof(uint32_t); 2 if (sz > SIZE_MAX) return ERROR_OUT_OF_RANGE; Listing 1.4: Incorrect bounds check that call alloc family functions, and does forward tracing to determine whether their sizes propagate to callers and to sufficient bounds-checks. Path Exploration. Past works leverage two kinds of graph analysis: intra-procedural and inter-procedural. Intra-procedural approaches (e.g., [61, 25]) cannot fully diagnose interactions between functions, so are unsuitable for many vulnerability classes, such as memory leaks, taint vulnerabilities, etc. Conservative pattern-based analyses tend to find universally path-quantified properties, such as variables that remain uninitialized on all paths, without accurately detecting variables that might remain uninitialized on some paths. They also only find locations where sanitization is missing but not where wrong sanitization occurs. RAN- SAQ therefore builds inter-procedural CFGs for binary code in addition to the intra-procedural source-level CFGs, provides faster queries using the lazy approach described above, and uses symbolic interpretation to assess correctness of sanitizations. False Positives. Many CPG-based analysis tools are ine↵ective for large-scale vulnerability discovery because they inundate users with false alarms. This is in part because they find likely code errors without analyzing exploitability, which is left up to the human user. However, exploitability is notoriously difficult for humans to assess since it often requires understanding binary-level code that is not designed to be human-readable, as in the xz attack. RANSAQ reduces the false positive rate by leveraging the binary CDCPG information to assess exploitability, prioritizing POIs that have greater security implications. Listing 1.3 shows code that RANSAQ prioritizes because the compiler has located security-sensitive variable authentication adjacent to overflowable ar- ray buffer in the local stack frame. This binary information indicates high exploitability potential, elevating its priority above other questionable operations. False Negatives. To mitigate the false positive problem, many prior approaches resort to overly permissive syntax-based queries. For example, syntax-based queries for bounds-checking might classify as safe any dereference operation dominated by an inequality test containing the correct bound (e.g., ). 6 S. Wickramasuriya et al. Queries ph t ou ra SigBIN ay G Binary (-g) s Ts FG rl l al AS Va C C DwarvenKing Checker Source Joern report Fig. 1: RANSAQ’s High Level Architecture Listing 1.4 shows a bounds check that this strategy incorrectly classifies as safe (false negative). The bounds check on line 2 appears safe syntactically, but the value assigned to sz in line 1 is actually incorrect because it uses 32-bit multiplication instead of 64-bit multiplication. This vulnerability was exploited in 2015 as part of the Stagefright attacks against Android. RANSAQ supplements syntax-based queries with semantic queries evaluated by symbolic interpretation of path fragments to reduce false negatives. Parameter Analysis. Many C vulnerabilities revolve around unsafe dereferences of pointers passed as function arguments. These are difficult for many approaches to diagnose since safeguarding pointers to caller-allocated data structures requires knowledge of the binary-level layout of the full call stack, not merely the local stack frame. RANSAQ is the first CPG analysis that models the full binary-level stack layouts of call chains leading to each callee, based on call graph analysis. This allows it to detect many critical pointer-parameter vulnerabilities. 3 Design 3.1 RANSAQ Overview Figure 1 depicts RANSAQ’s high-level architecture. It contains three parsers: (1) the Joern C code parser , (2) a binary parser named SigBIN, and (3) a DWARF debug information parser called DwarvenKing. Each parser generates data structures consumed by RANSAQ’s analysis component, Checker. Joern is an unmodified, third-party component; the other components are original. 3.2 SigBIN Our binary parser SigBIN extracts assembly code semantics using the Binary Analysis Platform (BAP). BAP expresses assembly semantics in an ISA-general intermediate language BIL , which SigBIN translates into CPG nodes and edges. The CPGs are stored in a Neo4j database. Similar to BitBlaze , RANSAQ can export Graphviz DOT graphs for viewing. For best memory utilization, the CFG encodes the coarse-grained basic block relationships rather than fine-grained instructions. We store each block and analyze them only as needed. Call graph structures capture the relationships between callers and callees, and also encode conditional branches. This allows the analyzer to trace inter-procedural paths that satisfy query-relevant constraints. All our graphs are delineated using networkx to run robust path traversal queries. RANSAQ 7 Fig. 2: Cross-Domain Code Property Graph Maintaining a balance between in-memory data and the on-disk database is critical for feasibly analyzing large codes. Running path searches on CFGs with many branches can lead to path explosion. For more efficient queries, the intra-procedural CFGs and full call graph are stored in memory and are passed into our Checker as needed. This avoids unnecessary database reads. 3.3 DwarvenKing The DwarvenKing module parses the DWARF debug information in the binary in order to obtain stack frame layout information. The DWARF tables are generated using the compiler’s -g flag. This exposes the compiler’s code and data layout decisions to RANSAQ in a machine-readable form, which is especially important for analyzing code compiled at high optimization levels. DwarvenKing uses a two-pass algorithm to parse the DWARF information. All debugging information entries (DIEs) are recorded in the first pass. The second pass iteratively excavates the child and member DIEs, including their locations. We also collect child DIE sizes and base types to determine their lengths at the execution level. This also reveals the order and spacing of stack variables, which can be di↵erent in the binary than their ordering in the source code. 3.4 Cross-Domain Code Property Graphs Figure 2 shows RANSAQ’s CDCPG for the code in Listing 1.5. CDCPGs are constructed from source abstract syntax trees (ASTs), CFGs, and program dependency graphs (PDGs) combined with binary CFGs and instruction side- e↵ects from BAP. Source statements and binary instructions are connected using the DWARF information. This facilitates queries that consult both. The domain- specific CFGs facilitate discovery of discrepancies between corresponding flows in the two domains, such as those introduced by compile-time code elimination. The low-level instruction semantics expose side-e↵ects, including changes to status flags, registers, and memory. RANSAQ presents side-e↵ects at function and basic-block level, and computes trace side-e↵ects by symbolically evaluating call graph paths. 8 S. Wickramasuriya et al. 1 int foo(char* source) 2 { 3 int src_sz = strlen(source); 4 if (src_sz > DST_SIZE){ 5 return 1; 6 } 7 return 0; 8 } Listing 1.5: Example code Source-level code property subgraph. The subgraph of the CDCPG that represents the source code is a joint data structure that combines AST, CFG, and PDG data from the C source code as recovered by Joern. This source level representation is stored in a relational database. Binary-level control-flow subgraph. Binary-level control-flow information is rep- resented in the CDCPG by connecting assembly instruction nodes with edges for fallthroughs and static branches and jumps. Computed control-flow transfer instructions (e.g., for method calls and returns) can have statically unpredictable destinations, and can be misdirected by adversaries during control-flow hijacking attacks, so their control-flows require special treatment. Such nodes are labeled as indirect in the CFG, and their computed targets are exposed by the side-e↵ect semantics lifted by BAP. Queries consult the source CFG edges to estimate developer-intended destinations, and consult the binary side-e↵ect information to infer actual destinations of an exploit. Binary nodes additionally hold the address o↵set, assembly instruction syntax, register value set, basic block id, and function id of each instruction. This construction supports querying through basic block instructions and their side e↵ects. Source-binary relational edges. The mapping between the source CFG and binary CFG is constructed using debug information emitted by the compiler. Source line and assembly address o↵sets have a one-to-many relationship. The binary-to- source relationship is important for providing a source location resulting from a binary-level analysis, for easier interpretation by human users. Conversely, the source-to-binary relationship is used when assessing exploitability and potential security impact of potential vulnerabilities discovered at the source level. Queries can also freely interleave flows between both domains, such as when searching for flows that have the potential to corrupt security-relevant source-level variables by abusing binary-level instruction side-e↵ects. Side-e↵ects. Each node in the binary-level subgraph (i.e., assembly instruction) contains a code fragment expressed in BIL that explicitly captures the instruction’s side-e↵ects upon the processor state, including registers, status flags, and memory. RANSAQ uses this information along with the binary-to-source relational edges to infer source-level side-e↵ects resulting from possible exploits. Section 3.5 details how this is achieved through value set analysis. RANSAQ 9 Algorithm 1 Pseudo code: Insecure dataflow analysis 1: procedure insecure dataflows (callgraph) 2: ext sources [recv, gets, read, scanf, getenv,...] 3: ext sinks [system, write, send,...] 4: for source in ext sources : 5: for sink in ext sinks : 6: if has path(callgraph,source,sink) then return POI Algorithm 2 Pseudo code: Insecure library call analysis 1: procedure insecure calls (stack layout) 2: insec calls [recv, strcat, strcpy, memcpy,...] 3: for f in insec calls : 4: dst get dst(f, stack layout) 5: src get src(f, stack layout) 6: if ¬bounded(stack layout, dst, src) then return POI 3.5 Checker The Checker executes expert-written analysis query templates. Each template is a predefined code explorer wrapped around in-file and in-memory data structure queries. We use the Gremlin query language to interface with the Neo4j database. Call graph and stack layout information is drawn from the DwarvenKing and SigBIN components. Templates perform independent analyses on the CDCPG to detect various classes of POIs. For brevity, the query examples that follow are expressed as pseudo-code that abstracts the Gremlin and Python query code. Insecure dataflows. Algorithm 1 sketches RANSAQ’s insecure dataflow analysis, which performs taint tracking [36, 15] to detect exploitable inter-procedural paths from user-controlled program inputs to security-sensitive outputs. The analysis identifies external sources (e.g., command-line argument parameters), sinks (e.g., exploitable system calls), and paths between them. External sources include functions that receive data from the environment (e.g., recv or getenv), and external sinks interact with the kernel (e.g., send or system). For efficacy, we avoid the expensive trace of tainted data, but flag the potential data flow paths. The discovered paths are used in follow-up queries that search for sanitation failures and data corruption vulnerabilities. The implementation in Algorithm 1 takes the call graph as an argument, and line 6 returns each identified interprocedural path in the PDG as a POI. Insecure String Library Functions. Certain C standard functions that do not check sizes when copying data, such as strcpy and memcpy, are hotspots of attacker abuse. Many large C codes are cluttered with such functions, but most of them are typically not vulnerable. Reporting them all would therefore raise many false alarms. RANSAQ therefore analyzes call sites for such functions at the binary level, which can often reveal more precise bounds for the callee’s memory references than the source code. The binary semantics can also reveal extenuating mitigations, such as the compiler’s replacement of an unsafe library function with a safe one, or stack memory layouts that defeat e↵ective exploitation. 10 S. Wickramasuriya et al. 1 ssize_t recv(int socket, void * buffer, 2 size_t length, int flags) 3 char *strcpy(char *dest, const char *src) Listing 1.6: Signatures of functions with elevated security risk Algorithm 3 Pseudo code: BOIL analysis 1: procedure check boils (cfg) 2: ptr list get ptr arithmatic assignments() 3: for p in ptr list [] : 4: if in loop(cfg, p) ^ ¬bound checked(p) then return POI Algorithm 2 searches for the insecure library calls in Listing 1.6 on x86-64 systems. The x86-64 System V AMD64 ABI calling conventions specify a kernel interface that uses RDI, RSI, RDX, R10, R8, and R9 for parameters, respectively. Algorithm 2 therefore implements value set analysis (VSA) on the binary code property subgraph to derive symbolic expressions for the values of these registers at call sites identified in the source-level subgraph. The VSA queries the BIL side-e↵ect information in the CDCPG to get register values expressed as o↵sets relative to RSP, and stack layout provides the size and type information related to that location. A candidate POI is returned if the inferred stack layout contains security-sensitive data within the bounds of the resulting expressions. Specifically, line 6 returns a POI when the size of the source is not necessarily bounded by the size of the destination. Bu↵er Overflow Induction Loop. While dangerous library functions with standard names are easily identifiable, developers frequently implement the same risky functionality via custom code that comes in many forms, and is therefore more difficult to identify. RANSAQ therefore implements bu↵er overflow induction loop (BOIL) detection, which searches for code that matches these patterns. Algorithm 3 does so by searching for flows that move data from one pointer to another within a loop without a safe constraint on destination bu↵er size. Helper function in loop returns T rue if node p is within an intraprocedural CFG cycle, and function call bound checked (p) appeals to the VSA engine (§3.5) to search for a sufficient dominating bounds check. 3.6 Ranking Model The POIs identified by the Checker are next sorted by vulnerability score. To focus auditor e↵orts, we choose a metric that prioritizes POIs whose safety requires a runtime control (e.g., a sanitization or bounds check), but where no adequate control is apparent within a small radius of the dangerous operation within the CFG. This localizes both the automated search and the human auditor’s e↵ort, since unsanitized or unbounded data that is permitted to leak deeper into a large program often constitutes a security risk even if it is later guarded on many (but possibly not all) paths. We track this statistic with a height feature. In contrast to most vulnerability scoring metrics proposed by prior works (e.g., [64, 37, 54, 51]), this metric incorporates source and binary features. RANSAQ 11 Fig. 3: Feature histograms The scarcity of labeled data and true negative labels for large, real-world C applications makes building a supervised prediction model infeasible for most code auditing tasks. Previous work shows that the complexity is proportional to vulnerability likelihood. Normalizing data from di↵erent application enables building a larger cross-project data set. Since our goal is to have a weighted feature score, we used Min-Max normalization having a range [0, 1000]: 1000 ⇤ (xi min(x)) S(xi ) = (1) max(x) min(x) Figure 3 shows histograms for some features listed in Table 2. Most of the histograms for non-vulnerable/unknown data are distributed left-skewed, and vulnerable data distributions lean toward higher values. This observation also indicates that higher feature values tend to increase the vulnerability score. The feature importance of di↵erent classical machine learning models can be used as weights. Our experiment shows that use of feature importance of Extreme Gradient Boosting classification as weights showed the best ranking results. The total vulnerability score VS is calculated by the sum of weighted scores: X VS = wi S(xi ) (2) i where S(xi ) and wi are the score and weight (resp.) of the ith feature of point x. Table 2 lists the computed weights for features and their dimensionality. These results are discussed in more detail in §5. 4 Implementation Our prototype implementation of RANSAQ consists of about 5200 lines of original Python code, plus integration of three major third-party tools: Joern 0.3.1 (for C source parsing), Neo4j 2.1 (a database platform), and BAP 2.1.0 (for binary parsing and semantic lifting). The source is available at. RANSAQ concurrently parses the source, binary, and DWARF information to generate the CDCPG. Our DWARF parser is a two-pass algorithm that iterates through DIEs. The first pass collects definitions of variable and return types, and the second pass traverses the references to collect implicit data, such as size and type. Call graphs are incrementally generated as the assembly level functions are parsed. Basic blocks form the nodes, and the graph is built in-memory. This a↵ords faster queries while preserving the topological order of calls. Finally, RANSAQ executes queries (§3.5), ranks POIs (§3.6), and presents its findings via the interactive user interface shown in Appendix A. 12 S. Wickramasuriya et al. Table 2: Features Dimension Feature Description Weight Parameters number of parameters 0.1279 Structure Cyclomatic Complexity number of linearly independent paths 0.0457 based Loop Number number of loops 0.0452 Nesting Degree max control-structure nesting level in func 0.0657 SLOC number of source lines 0.0605 Variables number of local variables 0.0611 Fan-In number of distinct callers 0.0751 based Flow Fan-Out number of distinct callees 0.0631 Height distance to closest external input 0.0572 ALOC number of assembly codes 0.0811 Binary based Conditions number of binary conditions 0.1116 Cmps number of cmp instructions 0.0634 Jmps number of jmp instructions 0.0514 Pointer Pointers number of pointer variables 0.0541 based Pointer Arguments number of pointer arguments 0.0000 Pointer Assignments number of pointer assignments 0.0367 5 Evaluation 5.1 Test Data Set and Experimental Setup To evaluate RANSAQ’s e↵ectiveness, we analyzed 8 real-world applications and 5 challenge problems drawn from DARPA’s CHESS program. Our evaluation strategy for the real-world applications is patterned after that of prior work [21, 32], but with an updated data set of applications and CVEs. For each CVE, we manually identified all relevant functions, yielding a larger, more accurate data set of CVE-related functions per application. For example, for the two apps our data set shares with that of the prior works (Libti↵ and Freetype), we gathered 102% more relevant CVEs and identified 81% more related functions. Our data set contains 80% more CVEs per application and 29% more CVEs overall. Application and vulnerability data were obtained from open sources (e.g., Github and NVD). We consider all CVEs reported until 2021, and label relevant functions. Gathering this information is a difficult and time consuming task for at least two reasons: (1) Not all patches are well maintained, requiring cross- referencing of myriad additional sources (e.g. Bugzilla, ExploitDB, etc.) to collect relevant details. We analyzed all NVD URLs tagged as patch or exploit. (2) Not all patched functions are related to each vulnerability addressed by the patch. We manually filtered irrelevant changes from patch commits. This careful exploration yielded a more precise database compared to previous works. The DARPA CHESS challenge problems were drawn from hackathon challenge problems consisting of in-house developed or modified real-world applications with inserted vulnerabilities. Success against these challenges was assessed using reference patches released by the challenge creators. Although each real-world target has one or more known CVEs, no CVE information was used to craft query templates (§3.5) or to train the ranking model. Each evaluation treats the set of known CVEs and patched source lines for each tested application as a ground truth. We combined all cross-project data to evaluate popular ML techniques via resampling (§3.6). While some prior RANSAQ 13 works use similarity-based analysis to infer new vulnerabilities from existing ones (e.g., ), RANSAQ employs purely unsupervised ranking with scores calculated from feature values (i.e., no ranks or labels for training). We therefore solely use the manually labeled data from each CVE to evaluate our method. All target applications were compiled using GCC 5.5. Experiments are executed on an Ubuntu 18 VM with 4 cores and 64GB RAM. 5.2 Vulnerability Scoring As discussed in §3.6, RANSAQ ranks each of its POIs by a vulnerability score. While many state-of-the-art static analyzers do not rank their POIs (e.g., [35, 26, 56, 27]), a ranking methodology has the advantage of allowing auditors to better utilize their time by prioritizing the most potentially vulnerable POIs. Table 2 presents the weights for each feature based on their contribution to vulnerability detection. The weights were computed from three applications (Tcpdump, OpenSSH, and Libti↵) by estimating each feature’s impact on prediction accuracy (see §3.6). Extreme gradient boosting showed best recall on our unbalanced data set, so was used as the classification model. The binary-to-source relational CPG edges detailed in §3.4 help to generate binary-level features, resulting in the 16 features listed in Table 2. Of the four dimensions of features in Table 2, binary-based features earned the highest average weight (0.077) from this procedure. This shows the utility of binary-level analysis for source code auditing tasks. The highest weight of any individual feature (0.12) was earned by a source-level feature (function arity), showing that the combination of source- and binary-level features analyzed by CDCPGs is instrumental to its e↵ectiveness. For example, binary-level condition counts earned a higher weight (0.11) than analogous source-level features (e.g., cyclomatic complexity, loop number, and nesting degree). 5.3 Ranking E↵ectiveness Comparing RANSAQ to prior works requires comparing POI ranking accuracy. Since the implementation and data set of the closest ranking-based related work (Leopard ) is not publicly available, we reproduced the methodology of that work and applied it to our data set. To obtain a fair comparison, we restricted RANSAQ to the same number of features as the prior work (11 features of our 16), and chose the best features as determined by ensemble learning (§5.2). Omitted features are grayed out in Table 2. For binning, we choose 3 metrics (cyclomatic number, loop number, and nesting degree) to calculate a complexity score. Each bin is a stack sorted by VS, from which each selection round pops the top k% until all bins are empty. The final list is populated with the ranked functions. Figure 4 shows the e↵ectiveness of ranking compared with Leopard, as well as two ML methods: random forest and extreme gradient boosting. For 75% of the tested applications, RANSAQ achieves a superior percentage of vulnerable function coverage within its top-ranked functions for nearly all percentiles. For example, RANSAQ ranks over 75% of vulnerable functions in Proftpd 1.3.6 within the top 20% of its ranked function list, compared to less than 40% reported by 14 S. Wickramasuriya et al. (a) Proftpd 1.3.6 (b) Tcpdump 4.6.2 (c) Openssh 7.3 (d) Sudo 1.8.25p1 (e) Libpng 1.5.7 (f) Libti↵ 3.8.1 (g) TinTin++ 1.97.9 (h) Freetype 2.3.9 Fig. 4: Ranking e↵ectiveness of vulnerable functions identified by CVE the alternatives. On average, RANSAQ ranks 70% of all reported vulnerable functions within its top 30% of ranked functions whereas Leopard, random forest, and extreme gradient boosting cover only 48%, 33% and 40%, respectively. This higher accuracy also comes with more detailed localization information for users in the form of source line and control- and data-flow information. This improves upon prior works that only locate POIs with function- or file-level granularity, which greatly increases the auditing burden for human users. 5.4 POI-Function Evaluation To evaluate our approach against works that o↵er coarser, function-level detection granularity for POIs, we associated each POI detected by RANSAQ with the function containing them, and compared the results against three major open- source code review tools: Cppcheck , Flawfinder and Infer. These are chosen because they are widely used in evaluating other pattern-based vulnerability detection systems similar to RANSAQ, and for their active maintenance. Table 3 lists the total number of functions, total number of reported CVEs, quantity of CVEs that include the specified tested application version, number of functions covered by CVEs, and functions a↵ected by the included CVEs, for each tested application. In all, the data set includes over 10K functions, of which less than 300 are related to version-relevant CVEs, making this a challenging auditing task representative of real code reviews. Table 4 reports e↵ectiveness of each tool. True positives are POIs detected within version-relevant CVE-related functions. Recall is calculated by using all functions included in the version-related CVEs. The results show that RANSAQ achieves the highest recall in 75% of ap- plications studied, and flagged most of the vulnerable functions. Flawfinder outperforms Cppcheck and Infer in the remaining cases. Since the other approaches are compiler-based, they can fail at compile-time if the modified compiler is unable to accommodate unusual features of the source code. This resulted in compilation failures for two of the Infer experiments. RANSAQ 15 Table 3: Tested applications Table 4: Related work comparison Version- RANSAQ Flawfinder Cppcheck Infer Total Included Related relevant App Version Pos. RC Pos. RC Pos. RC Pos. RC App Version Funcs CVEs CVEs Funcs Funcs Tintin++ 1.97.9 159 1.00 125 1.00 0 0.00 26 0.00 Tintin++ 1.97.9 681 3 2 4 4 Sudo 1.8.25p1 47 0.50 62 0.58 0 0.00 3 0.00 Sudo 1.8.25p1 187 14 7 12 10 Proftpd 1.3.6 309 0.19 508 0.14 6 0.00 129 0.00 Proftpd 1.3.6 2429 22 21 21 3 Libpng 1.5.7 70 0.44 0 0.00 3 0.00 15 0.03 Libpng 1.5.7 743 43 35 34 13 Libti↵ 3.8.1 71 0.11 46 0.04 3 0.00 0 0.00 Libti↵ 3.8.1 655 176 134 79 34 Tcpdump 4.6.2 107 0.14 70 0.11 1 0.01 % % Tcpdump 4.6.2 903 168 151 128 116 OpenSSH 7.3 577 0.27 583 0.29 19 0.02 % % Openssh 7.3 2613 96 36 52 22 Freetype 2.3.9 143 0.31 44 0.02 52 0.07 17 0.00 Freetype 2.3.9 1791 88 48 89 74 Pos. = Positives, RC = Recall, %= compile-time error Total 10002 558 434 425 276 5.5 POI Case Study Table 5 showcases the e↵ectiveness of our POI detection methodology by listing some high-severity CVEs that RANSAQ ranked within its top POIs recommended for human auditing. In all, RANSAQ correctly identified 83 CVEs across the 8 applications. The rank column of the table indicates the position of the exact CVE-relevant line discovered by RANSAQ, not merely a function or file that contains it. The two highlighted CVEs in Table 5 are described in Appendix A. Our POI detection and ranking system proved important in two ways: First, as shown in Table 3, POIs were generated for vulnerable functions with a higher recall. Second, as demonstrated in Table 5, RANSAQ places significant vulnerabilities high in the rankings. Overall, RANSAQ identified precise POIs corresponding to one or more high-severity CVEs within its top 10% of POIs for every application in the data set. 5.6 CTF Challenge Testing RANSAQ was also evaluated in a series of DARPA CTF challenges. Each challenge consisted of applications into which experts had inserted hidden vulnerabilities. Independent teams were provided access to competing tools in a closed network. The goal was to find as many exploits as possible within the allotted time. Each team was comprised of novices, who have basic programming knowledge but no security training, led by one security expert. Table 6 summarizes the e↵ectiveness of RANSAQ in these CTFs. It lists challenges consisting of C/C++ code, and for which there was sufficient time and computational resources during the challenge to generate a complete ranking. (Some challenges concerned other languages, or lacked sufficient computing time.) The table’s final column identifies the rank at which RANSAQ identified each hidden vulnerability. In 80% of challenges the POI is within RANSAQ’s top 10 POIs. Non-experts described RANSAQ as particularly easy to use and understand. 6 Related Work Vulnerability detection can be partitioned into metric-based, text mining, static pattern-based, dynamic, and hybrid approaches. Metric approaches are rooted in bug/fault detection [23, 48], including supervised and unsupervised ML for source codes [29, 33, 38, 65, 44, 47, 37, 45, 14, 67, 62]. Binary features can yield better 16 S. Wickramasuriya et al. Table 5: POI Ranks Table 6: DARPA Challenges CVE # CVSS Rank VS Percentile POI Rank of Intentional Challenge Functions Functions Vulnerability CVE-2021-3156 7.2 4/153 97.44% CVE-2016-10708 7.5 9/408 97.79% I 817 10 1 CVE-2008-0671 10.0 21/272 92.28% II 365 39 8 CVE-2017-13028 9.8 8/828 99.03% III 1766 21 10 CVE-2014-9495 10.0 4/317 98.74% IV 62 30 1 CVE-2016-5323 7.5 25/348 92.81% V 74 37 21 CVE-2020-9272 7.5 220/2199 90.00% results , but sometimes pose scalability challenges, and binary POIs are less actionable since they are harder for humans to interpret. Our work avoids this problem by using binary semantics as an aid to identify source-level POIs. Text mining treats source code as keyword count vectors [41, 58, 59]. While there has been debate over whether text-mining or software metrics yield superior results , studies on imbalanced data suggest that the two perform similarly for e↵ort-aware vulnerability prediction. Dynamic monitors, such as fuzzers, trade generality for accuracy, detecting flaws exhibited by a limited subset of the input space by code compiled with only certain options. This can be problematic for detecting security vulnerabilities, since criminals excel at finding inputs omitted by limited-coverage fuzzers. Dynamic mitigations (e.g., [18, 52, 17]), introduce code at compile-time to safeguard pointers. RANSAQ benefits from these by detecting them at the binary level to reduce false positives. Static mitigations (e.g., [24, 19]) boast greater generality, but often require substantial user assistance, and can result in high false positive rates or require high computational resources. CPGs have emerged as a promising paradigm for scaling static approaches (e.g., [60, 53, 25, 30]). Prior to our work, CPGs have only used source features, which can lead to false negatives for binary exploits and false positives for complex codes. RANSAQ extends CPGs to combine source and binary features. Concolic execution [1, 42] combines symbolic and concrete execution to lower false negative and false positive rates (e.g., [9, 55]). Automatic Exploit Gener- ation uses concolic execution to produce sample exploits. In contrast, our work focuses on helping source code auditors find and close vulnerabilities in large projects even if a concrete exploit cannot be generated automatically. It can supplement AEG e↵orts by feeding its POIs to AEG tools (e.g., [13, 55]). Vyper analyzes binary code for vulnerabilities in a few classes, such as stack/heap overflow and information leaks. Leopard identifies vulnerable source functions by binning and then ranking them using source-level metrics. RANSAQ di↵ers from these by integrating source and binary features. 7 Conclusion RANSAQ combines source-level and binary-level semantic information of compiled C/C++ programs into a common CPG for vulnerability detection. Evaluation of RANSAQ on 8 real-world applications and 5 Hackathon challenges finds POIs with higher precision and more specific locations relative to competing approaches, often identifying severe vulnerabilities within the first 20% of POIs.