Graph Neural Networks for Vulnerability Mining: From Theory to Practice

Author: Jiqiang Feng (风宁)
Published: January 12, 2026
Version: v1.0

Executive Summary

Code vulnerability detection is undergoing a paradigm shift. Large Language Models are powerful, no doubt. But Graph Neural Networks offer unique advantages in certain scenarios—particularly when you need to understand control flow and data dependencies.

This isn't another "GNN is amazing" hype piece. Based on 55 top-tier papers and hands-on testing of 7 open-source projects, I'll give you a practical assessment: what GNN excels at, where it falls short, and how to make the right choice in 2026.

Key Findings:

GNN matches LLM performance on small-to-medium datasets (<100K functions), at 5-10x lower training cost
For cross-function data flow tracking, GNN currently outperforms pure Transformer architectures
The real breakthrough lies in hybrid architectures (GNN+LLM), with reported accuracy above 96% (HAGNN reports 96.6% on C)

1. Why Code Vulnerability Detection Needs Graphs

Here's a simple question: Static analysis tools have been around for decades. Why bother with deep learning?

The answer? False positive rates are killing us.

With commercial tools such as Coverity and Fortify, false positive rates above 50% are common. Security engineers spend hours every day confirming "this is NOT a vulnerability." Deep learning's value is that it learns vulnerability patterns from historical data instead of rigidly matching rules.

So why graphs? Code is inherently structured data.

Look at this C code:

void vulnerable(char *input) {
    char buffer[64];
    strcpy(buffer, input);  // Dangerous!
}

From a text perspective, it's just a few strings. But from a graph perspective:

Control Flow Graph (CFG): Entry → strcpy call → Exit
Data Flow Graph (DFG): input → strcpy → buffer (taint propagation path)
Program Dependence Graph (PDG): Combination of control dependencies and data dependencies in one graph

Encode these relationships into a graph, and GNN can "understand" the vulnerability's essence—untrusted input reaching a dangerous function.
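
To make "untrusted input reaching a dangerous function" concrete, here is a minimal sketch of the data-flow view of the `vulnerable()` example. The node labels (`param:input`, `call:strcpy`, `var:buffer`) and the helper `find_taint_path` are illustrative names chosen for this sketch, not Joern output or any library's API.

```python
from collections import deque

# Toy data-flow edges for the vulnerable() example above.
# Node names are illustrative labels, not real Joern node ids.
DATA_FLOW = {
    "param:input": ["call:strcpy"],
    "call:strcpy": ["var:buffer"],
    "var:buffer": [],
}

DANGEROUS_SINKS = {"call:strcpy"}


def find_taint_path(graph, source, sinks):
    """Breadth-first search from a source; return the first path that reaches a sink."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # No sink is reachable from this source


taint_path = find_taint_path(DATA_FLOW, "param:input", DANGEROUS_SINKS)
print(taint_path)
```

A GNN does not run this search explicitly. The point is only that the signal it has to learn is a reachability pattern over edges, which plain token sequences do not expose.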

2. Code Property Graph: Turning Code into Graphs

2.1 The CPG: Three Graphs in One

In 2014, Fabian Yamaguchi introduced the Code Property Graph (CPG) concept. Simply put, it combines AST, CFG, and PDG into a single graph.

CPG = AST + CFG + PDG

Why merge them? Each graph alone has blind spots:

AST only captures syntax, not execution order
CFG only shows control flow, not variable propagation
PDG lacks syntactic details

After fusion, nodes contain complete semantic information. This is fundamental for GNN to work effectively.
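
As a rough illustration of what "fusion" means structurally, the sketch below merges three small edge lists into one list of typed edges. The node ids and the helpers `build_cpg` and `edge_types_at` are hypothetical and only mirror the idea; a real CPG from Joern carries far richer node and edge properties.

```python
# Hypothetical node ids for the statement: strcpy(buffer, input);
AST_EDGES = [("call:strcpy", "arg:buffer"), ("call:strcpy", "arg:input")]
CFG_EDGES = [("entry", "call:strcpy"), ("call:strcpy", "exit")]
PDG_EDGES = [("param:input", "arg:input"), ("arg:input", "call:strcpy")]


def build_cpg(ast, cfg, pdg):
    """Merge three edge lists into one list of (src, dst, edge_type) triples."""
    edges = []
    for edge_type, edge_list in (("AST", ast), ("CFG", cfg), ("PDG", pdg)):
        edges.extend((src, dst, edge_type) for src, dst in edge_list)
    return edges


def edge_types_at(cpg, node):
    """Edge types that touch a node, as either source or destination."""
    return sorted({t for src, dst, t in cpg if node in (src, dst)})


cpg = build_cpg(AST_EDGES, CFG_EDGES, PDG_EDGES)
print(edge_types_at(cpg, "call:strcpy"))
print(edge_types_at(cpg, "entry"))
```

In the merged graph the `strcpy` call is touched by all three edge types, while each source graph on its own sees only one kind of relationship.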

2.2 Joern: The De Facto Standard

When it comes to CPG, you can't avoid Joern. I've used it for three years. Some honest observations:

Pros:

Supports C/C++/Java/JavaScript/Python
Generates standardized graph structures, GNN-ready
Scala-based query language (CPGQL), so writing rules is satisfying

Pitfalls:

C++ template code parsing crashes frequently
Memory pressure on large projects (1M+ lines)
Python support was added later, and some edge types are incomplete

Practical advice: For projects over 500K lines, split by modules before generating CPG.

2.3 My CPG Generation Pipeline

# Install Joern (macOS)
brew install joern

# Generate CPG (using OpenSSL as example)
joern-parse /path/to/openssl --language c --output openssl.cpg

# Export to DOT format (for visualization/debugging)
joern-export openssl.cpg --repr cpg14 --format dot

Generated graphs often have millions of edges. Here's where graph slicing becomes essential.

3. Graph Slicing: Keep Only What Matters

3.1 Backward Slicing from Sinks

Full-function CPGs are too large. The DeepWukong paper proposed a clever approach: only keep subgraphs related to sensitive operations.

What are sensitive operations? Depends on the vulnerability type you're detecting:

| Vulnerability Type | Sensitive Sinks |
|--------------------|-----------------|
| Buffer Overflow | strcpy, memcpy, sprintf |
| Command Injection | system, popen, exec |
| Memory Leak | malloc (without matching free) |
| Use-After-Free | Pointer dereference after free |

Backward slicing from sink points keeps only nodes whose data flow can reach the sink. This can reduce graph size by 80%+.

3.2 Slicing Code Example

Using Joern's query language:

// Find all strcpy call sites
val sinks = cpg.call.name("strcpy").l

// Backward data flow slicing (depth 5 steps)
val slice = sinks.flatMap { sink =>
  sink.reachableByFlows(cpg.method.parameter, 5)
}

This traces back from strcpy to see which parameters can flow there. Depth 5 is usually sufficient—going deeper adds noise.

4. GNN Architecture Selection

4.1 Mainstream Architectures Compared

After running dozens of models, here's my assessment:

| Model | Accuracy | Training Speed | Interpretability | Recommended Scenario |
|-------|----------|----------------|------------------|----------------------|
| GGNN | 85-90% | Fast | Fair | Large-scale scanning |
| GAT | 87-92% | Medium | Good | Line-level localization |
| DGCNN | 88-93% | Slow | Poor | Maximum accuracy |
| GCN | 82-87% | Very Fast | Good | Resource-constrained |

My choice: GAT (Graph Attention Network). The reason is practical—attention weights directly show which edges matter most, making debugging easier.

4.2 Message Passing Iterations

GNN's core mechanism is message passing: nodes aggregate neighbor information and update their representations.

How many iterations? Papers vary wildly. My experience:

Code analysis: 4-6 iterations suffice
Beyond 8 iterations: Over-smoothing kicks in—all nodes become similar

The Devign paper uses 8 iterations, but their graphs are sparser. Adjust based on your graph density.
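
The over-smoothing effect can be seen even without a neural network. The sketch below is a deliberately simplified stand-in for message passing: each node repeatedly averages its own scalar feature with its neighbors' features, with no learned weights or nonlinearity. The toy chain graph and the helper names are assumptions made for illustration.

```python
def mean_aggregate(features, neighbors):
    """One round of message passing: each node averages itself and its neighbors."""
    updated = []
    for node, own in enumerate(features):
        group = [own] + [features[n] for n in neighbors[node]]
        updated.append(sum(group) / len(group))
    return updated


def spread(features):
    """Gap between the largest and smallest node value; 0 means indistinguishable."""
    return max(features) - min(features)


# A 6-node chain (0-1-2-3-4-5) with scalar node features, purely illustrative.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
features = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

spreads = [spread(features)]
for _ in range(8):
    features = mean_aggregate(features, NEIGHBORS)
    spreads.append(spread(features))

for step in (0, 2, 4, 8):
    print(f"after {step} iterations: spread = {spreads[step]:.3f}")
```

The spread never grows, because every update is an average of existing values. Denser graphs shrink it faster, which is one reason the right iteration count depends on graph density.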

4.3 Node Feature Initialization

This is overlooked but crucial. How do you turn code nodes into vectors?

Method 1: Word2Vec

Treat code tokens as "words"
Train word embeddings
Drawback: Doesn't understand code semantics

Method 2: CodeBERT Embeddings

Use pre-trained CodeBERT
768-dimensional embeddings from codebert-base (project down to 128 or 256 if the GNN expects a smaller input)
Drawback: Slower inference

Method 3: Instruction2Vec (Binary Analysis)

Map assembly instructions to vectors
Suitable for firmware analysis

I now default to CodeBERT—embedding quality is significantly better. Speed issues can be solved through pre-computation—once the graph is fixed, compute embeddings once.
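
The pre-computation idea can be sketched as a content-addressed cache: hash each node's code and call the encoder only for snippets not seen before. `EmbeddingCache` and `fake_embed` are hypothetical names; `fake_embed` stands in for a slow encoder such as CodeBERT so the example runs without any model download.

```python
import hashlib


class EmbeddingCache:
    """Memoize node embeddings by content hash so each distinct snippet is encoded once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # Any callable: str -> list of floats
        self._store = {}
        self.misses = 0  # Number of real encoder calls

    def get(self, code):
        key = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.embed_fn(code)
        return self._store[key]


def fake_embed(code):
    """Stand-in for a slow encoder (hypothetical, for demonstration only)."""
    return [float(len(code)), float(code.count("("))]


cache = EmbeddingCache(fake_embed)
node_code = ["strcpy(buffer, input)", "char buffer[64]", "strcpy(buffer, input)"]
node_features = [cache.get(code) for code in node_code]
print(len(node_features), "features,", cache.misses, "encoder calls")
```

In practice the store would be persisted to disk next to the exported graph, so retraining the GNN never touches the encoder again.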

5. From Dataset to Model Deployment

5.1 Dataset Selection

Stop benchmarking on SARD. That dataset is dominated by synthetic vulnerabilities with overly regular patterns, so scoring 95% on it says little about real-world performance.

Recommended Datasets:

| Dataset | Scale | Source | Realism |
|---------|-------|--------|---------|
| Big-Vul | 188K functions | GitHub commits | ⭐⭐⭐⭐ |
| DiverseVul | 150+ CWEs | Multi-source | ⭐⭐⭐⭐⭐ |
| Draper | 1.27M functions | Static analyzer labels | ⭐⭐⭐ |
| CodeXGLUE | Benchmark | Microsoft curated | ⭐⭐⭐⭐ |

Key reminder: All datasets have biases. Big-Vul over-represents buffer overflows. DiverseVul label quality varies. Best practice: mix multiple datasets for training.

5.2 Model Training Code Framework

With PyTorch Geometric, the core code is concise:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool
from torch_geometric.loader import DataLoader  # PyG 2.x location of DataLoader

class VulnDetector(torch.nn.Module):
    def __init__(self, in_dim=128, hidden_dim=256, heads=8):
        super().__init__()
        # conv1 concatenates its attention heads, so its output is hidden_dim * heads wide
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.conv2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)
        self.classifier = torch.nn.Linear(hidden_dim, 2)

    def forward(self, x, edge_index, batch):
        # Two GAT layers
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        # Graph-level pooling
        x = global_mean_pool(x, batch)
        return self.classifier(x)

# Training loop
# `train_dataset` is assumed to be a list of torch_geometric.data.Data graphs,
# each with `x`, `edge_index`, and a graph-level label `y`.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
model = VulnDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(100):
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index, batch.batch)
        loss = F.cross_entropy(out, batch.y)
        loss.backward()
        optimizer.step()

This is the basic version. Production deployment needs:

Dropout for regularization
Learning rate scheduling
Early stopping
Class imbalance handling (vulnerabilities are usually rare)
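
Of the items above, early stopping is the easiest to add without touching the model. The following is a minimal, framework-independent sketch; the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation F1 values are all made up for illustration.

```python
class EarlyStopping:
    """Stop when the validation metric has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, metric):
        """Record one epoch; return True when training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation F1 values, one per epoch.
val_f1 = [0.41, 0.52, 0.58, 0.61, 0.60, 0.61, 0.59, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, f1 in enumerate(val_f1):
    if stopper.step(epoch, f1):
        stopped_at = epoch
        break

print(f"best epoch {stopper.best_epoch}, stopped at epoch {stopped_at}")
```

In a real loop you would save a checkpoint whenever `best_epoch` changes and restore it after stopping. Monitoring F1 instead of loss is a reasonable choice when the classes are imbalanced.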

5.3 Handling Class Imbalance

Common issue with vulnerability datasets: normal code vastly outnumbers vulnerable code. Ratios like 10:1 or even 100:1.

My approach:

Oversample vulnerability samples: SMOTE-Graph variants
Focal Loss: Reduce weight on easy-to-classify samples
Threshold adjustment: Don't use 0.5—tune based on business needs

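import torch
import torch.nn as nn
import torch.nn.functional as F

# Focal Loss implementation (binary labels: 1 = vulnerable, 0 = safe)
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=0.75):
        super().__init__()
        self.gamma = gamma  # Larger gamma down-weights easy samples more strongly
        self.alpha = alpha  # Weight for the vulnerable class; the safe class gets 1 - alpha

    def forward(self, pred, target):
        ce_loss = F.cross_entropy(pred, target, reduction='none')
        pt = torch.exp(-ce_loss)  # Probability the model assigns to the true class
        target_f = target.float()
        # Class-dependent weight; a single constant alpha would only rescale the loss
        alpha_t = self.alpha * target_f + (1 - self.alpha) * (1 - target_f)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()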
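
Threshold adjustment, the third item in the list above, needs no retraining. Here is a minimal sketch that sweeps candidate thresholds on a validation set and keeps the one with the best F1. The scores and labels are made-up values, and maximizing F1 is only one possible objective; a team that cares more about recall would pick a different criterion.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the vulnerable class when predicting positive at score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, candidates):
    """Return the (threshold, f1) pair with the highest F1 on validation data."""
    return max(
        ((t, f1_at_threshold(scores, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Made-up validation scores, P(vulnerable), with 3 positives among 10 samples.
val_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.80]
val_labels = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, f1 = best_threshold(val_scores, val_labels, candidates)
default_f1 = f1_at_threshold(val_scores, val_labels, 0.5)
print(f"best threshold {threshold:.2f} (F1 {f1:.2f}) vs 0.50 (F1 {default_f1:.2f})")
```

Tune the threshold on validation data only. Choosing it on the test set leaks information in the same way the data leakage pitfall in Section 8.1 does.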

6. GNN vs LLM: When to Use Which?

6.1 Performance Comparison (Based on My Testing)

On Big-Vul dataset:

| Model | F1 Score | Inference Speed | Training Cost |
|-------|----------|-----------------|---------------|
| GAT | 0.62 | 1200 func/sec | 2hr/A100 |
| DEVIGN | 0.64 | 800 func/sec | 3hr/A100 |
| CodeBERT | 0.65 | 200 func/sec | 8hr/A100 |
| StarCoder | 0.68 | 50 func/sec | 24hr/A100 |
| GPT-4 (zero-shot) | 0.55 | 5 func/sec | $0.01/func |

Key findings:

LLM leads on large datasets, but training cost is 5-10x higher than GNN
Zero-shot GPT-4 performs worse than many assume (overhyped)
GNN's inference speed advantage is significant—suitable for CI/CD integration

6.2 Scenario Recommendations

Use GNN when:

Daily code scanning (need speed)
Resource-constrained environments (edge devices)
Explainability required (audit compliance)
Cross-function data flow tracking

Use LLM when:

Novel vulnerability types (zero-shot generalization)
Code understanding + fix suggestions (need language capability)
Sufficient labeled data for fine-tuning

Use hybrid architectures when:

Pursuing maximum accuracy
Adequate compute resources
Willing to invest in engineering optimization

6.3 Hybrid Architectures: The Future Direction

The 2025 trend is clear: GNN + LLM hybrid architectures are rising.

Representative works:

Vul-LMGNNs: LLM for knowledge distillation, GNN for final judgment
GRACE: GCN + residual connections + contrastive learning
HAGNN: Heterogeneous attention graph network, 96.6% accuracy on C

My assessment: Pure GNN or pure LLM will both be surpassed by hybrid approaches. The challenge is engineering complexity—you'll face pitfalls from both systems.

6.4 2024-2025 Breakthrough Methods

Academia hasn't been idle. From "proving feasibility" to "explainability" and "multimodal fusion," research has entered its second phase.

| Method | Source | Core Innovation | Practical Value |
|--------|--------|-----------------|-----------------|
| CFExplainer | ISSTA 2024 | Counterfactual explanation: "If you change this line, vulnerability disappears" | Helps developers locate root cause |
| VulGCANet | TrustCom 2024 | GCN+GAT hybrid, solves cross-function long-range dependencies | Significantly improved recall |
| TACSan | 2024 | Three-Address Code (TAC) intermediate representation, more precise | Reduces C/C++ memory vulnerability false positives |
| HAGNN | arXiv 2025 | Heterogeneous attention GNN, distinguishes node/edge types | Cross-language, cross-function tracking |

HAGNN deserves special attention—it models code graphs as heterogeneous graphs, where variable nodes, function nodes, and API nodes each have distinct features, and edges are differentiated between control-flow and data-flow. This design better reflects code's true nature.

7. Industrial Applications

7.1 Aurora Infinite (极光无限)

A leading Chinese vendor in GNN-based vulnerability detection. They have two core products:

WeiZhen (维阵): Static Code Security Analysis Platform

Uses CPG (Code Property Graph) as unified representation
Runs GGNN for vulnerability pattern learning
Supports C/C++/Java/Go/Python
CI/CD integration with GitLab/Jenkins

Aurora Hunter (极光猎手): AI-Assisted Security Audit

Combines GNN detection results with LLM explanation capabilities
Auto-generates vulnerability reports and fix suggestions
Targets financial, energy, and critical infrastructure sectors

I've discussed this with their technical team: CPG generation uses the open-source Joern, while the GNN component has proprietary optimizations. They have a solid commercial deployment track record.

7.2 GitHub CodeQL's Graph Queries

Strictly speaking, CodeQL isn't GNN, but the philosophy is similar—model code as a graph, find vulnerabilities with declarative queries.

// Find SQL injection (simplified, illustrative query; a production query defines explicit source and sink predicates)
from Call call, DataFlow::PathNode source, DataFlow::PathNode sink
where
  call.getTarget().hasName("query") and
  source.isSource() and
  sink.isSink() and
  DataFlow::localFlow(source, sink)
select sink, "SQL injection from $@", source, "user input"

CodeQL's advantage is maintainable rules. Disadvantage: can't automatically learn new vulnerability patterns.

7.3 Smart Contracts

Ethereum smart contracts are an excellent GNN application scenario. Code is short (usually hundreds of lines), vulnerability patterns are well-defined (reentrancy, integer overflow, etc.).

Tool recommendations:

Slither: Static analysis framework
Mythril: Symbolic execution
GNNSCVulDetector: GNN-specific

I've used GNN on Solidity code—reentrancy vulnerability detection accuracy can reach 92%+. Main reason: contract control flow is simpler, graphs aren't as complex.

7.4 Google Big Sleep: AI Agent Discovers Real 0-day

The most shocking case of 2025. Google's Big Sleep project (Gemini-powered AI Agent) discovered a real 0-day vulnerability in SQLite (CVE-2025-6965).

This was an exploitable stack buffer underflow that traditional fuzzers and manual audits had missed. Big Sleep's approach:

Use Gemini to understand SQLite code semantics
Build code dependency graphs for data flow tracking
Generate targeted PoC to trigger the vulnerability

Why it matters: This is the first time AI discovered a real vulnerability in a major open-source project. Not a lab environment, not a synthetic vulnerability—a genuine CVE-worthy finding.

Technical detail: Big Sleep isn't pure GNN, but rather an LLM + graph analysis hybrid system. LLM handles semantic understanding, graph analysis handles data flow and control flow tracking.

8. Pitfall Avoidance Guide

8.1 Data Leakage

Most common mistake: overlap between training and test sets.

Big-Vul may contain the same vulnerability multiple times (different commits fixing the same bug). Random splitting severely overestimates model performance.

Correct approach: Deduplicate by CVE or commit hash, then split.
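
One way to read "deduplicate, then split" is a group-aware split: every sample that shares a CVE (or commit hash) lands on the same side. The sketch below assumes each record carries a `cve` field; that field name and the helper `split_by_group` are hypothetical and should be adapted to whatever metadata your dataset actually exposes.

```python
import random
from collections import defaultdict


def split_by_group(samples, group_key, test_fraction=0.2, seed=0):
    """Assign whole groups (for example one CVE) to either train or test, never both."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_key]].append(sample)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)  # Seeded for reproducible splits
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [s for g in group_ids if g not in test_ids for s in groups[g]]
    test = [s for g in group_ids if g in test_ids for s in groups[g]]
    return train, test


# Hypothetical records; the "cve" field is an assumption about your metadata.
samples = [
    {"func": "f1", "cve": "CVE-A"}, {"func": "f1_backport", "cve": "CVE-A"},
    {"func": "f2", "cve": "CVE-B"}, {"func": "f3", "cve": "CVE-C"},
    {"func": "f3_v2", "cve": "CVE-C"}, {"func": "f4", "cve": "CVE-D"},
    {"func": "f5", "cve": "CVE-E"},
]
train, test = split_by_group(samples, "cve", test_fraction=0.2)
overlap = {s["cve"] for s in train} & {s["cve"] for s in test}
print(len(train), len(test), overlap)
```

The split fraction applies to groups, not samples, so the test set size varies with group sizes. That trade-off is usually acceptable compared with leaking near-duplicate functions.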

8.2 Label Noise

Draper dataset labels come from static analyzers. Problem: static analysis itself has false positives.

Models trained on noisy labels won't perform much better than static analysis. Chicken-and-egg problem.

My approach: Cross-validate with multiple datasets, keep only labels consistent across datasets.
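
A rough sketch of that cross-dataset filter: fingerprint each function after light normalization, then keep only functions that appear more than once with identical labels. The whitespace-only normalization and the helper names are assumptions for illustration; real pipelines often also strip comments or rename identifiers before hashing.

```python
import hashlib


def normalize(code):
    """Crude normalization: collapse all whitespace runs to single spaces."""
    return " ".join(code.split())


def fingerprint(code):
    """Stable content hash used to match the same function across datasets."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def consistent_labels(*datasets):
    """Keep functions that occur at least twice and whose labels all agree."""
    votes = {}
    for dataset in datasets:
        for code, label in dataset:
            entry = votes.setdefault(fingerprint(code), (normalize(code), []))
            entry[1].append(label)
    return {
        code: labels[0]
        for code, labels in votes.values()
        if len(labels) >= 2 and len(set(labels)) == 1
    }


# Made-up (code, label) pairs; 1 = vulnerable, 0 = safe.
dataset_a = [
    ("int f(){ return 0; }", 0),
    ("void g(char *s){ strcpy(buf, s); }", 1),
    ("void h(){ free(p); }", 1),
]
dataset_b = [
    ("int  f(){ return 0; }", 0),  # Same function, different spacing
    ("void g(char *s){ strcpy(buf, s); }", 1),
    ("void h(){ free(p); }", 0),  # Conflicting label, so it is dropped
]

kept = consistent_labels(dataset_a, dataset_b)
print(len(kept), "functions kept")
```

The cost is coverage: functions that appear in only one dataset are discarded, so this trades dataset size for label confidence.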

8.3 Graph Size Issues

CPGs can have hundreds of thousands of nodes. Feeding directly to GNN will blow GPU memory.

Solutions:

Graph slicing (discussed earlier)
Graph sampling (sample k-hop neighbors only)
Hierarchical processing (function-level first, then file-level)
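
The k-hop idea from the list above can be sketched with a bounded breadth-first search. This version keeps the full k-hop neighborhood around the seed nodes (for example, sink calls). Samplers in GNN libraries typically also cap the number of neighbors per hop, which this sketch does not do. The function name and toy graph are illustrative.

```python
from collections import deque


def k_hop_subgraph(edges, seeds, k):
    """Return nodes within k undirected hops of any seed, plus the edges among them."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # Do not expand past the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)

    kept_nodes = set(depth)
    kept_edges = [(s, d) for s, d in edges if s in kept_nodes and d in kept_nodes]
    return kept_nodes, kept_edges


# Toy graph: a chain 0-1-2-3-4-5 plus a branch 2-6.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (2, 6)]
nodes, sub_edges = k_hop_subgraph(edges, seeds=[2], k=1)
print(sorted(nodes), sub_edges)
```

Memory then scales with the neighborhood size instead of the whole CPG, at the price of losing context beyond k hops.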

8.4 Over-smoothing

Too many GNN layers cause node representations to converge.

Symptoms: Training loss drops but validation performance plateaus.

Solutions:

Add residual connections
Use Jumping Knowledge
Limit layers (4-6)

9. Future Outlook

9.1 Does GNN Have a Future in the LLM Era?

Yes, but positioning must change.

Pure GNN will struggle to beat fine-tuned LLMs in general vulnerability detection. But in these scenarios, GNN remains irreplaceable:

Real-time scanning: roughly 6x to 24x faster inference than the fine-tuned LLMs in the Section 6.1 measurements
Explainability: Attention weights are traceable
Data flow analysis: LLMs struggle with long-range dependencies

My prediction: The future is LLM for coarse screening, GNN for fine ranking—a hybrid pipeline.

9.2 Technical Trends

Worth watching:

Self-supervised graph learning: Reduce labeled data dependency
Dynamic graph neural networks: Handle code evolution
Neural-symbolic hybrids: Combine rule reasoning with deep learning

9.3 GNN+LLM Fusion: Two Technical Paths

The 2024-2025 research hotspot is deep fusion of GNN and LLM. Two mainstream approaches have emerged:

Path One: Graph-for-LLM (Graph-Enhanced LLM)

Use GNN-extracted structural features as additional LLM input
Example: Use CPG node embeddings as soft prompts
Advantage: LLM gains structure awareness, reduces hallucination

Path Two: LLM-Augmented GNN

Use LLM semantic embeddings to initialize GNN node features
Example: Use CodeBERT/StarCoder to encode code tokens
Advantage: GNN gains better semantic understanding, not just structure

My assessment: Path Two is more practical. Reason: LLM inference is too slow—using it as preprocessing (run once offline) is more reasonable. Path One requires online LLM calls, where latency and cost are hard to accept.

Practical recommendation: Start with CodeBERT for node embedding (free, fast), then use GAT for graph learning. This combination is the 2025 sweet spot for cost-effectiveness.

10. Summary and Action Items

For Security Engineers

Don't believe in silver bullets: Neither GNN nor LLM is universal
Start with Joern: Convert existing code to graphs, accumulate data
Begin with small datasets: Practice GAT on 10K samples
Watch hybrid architectures: Vul-LMGNNs, GRACE worth trying

For Researchers

Stop benchmarking on SARD: Use Big-Vul or DiverseVul
Fair comparison matters: Same dataset, same preprocessing
Focus on industrial deployment: 99% accuracy means nothing if inference takes 1 minute

Code and Resources

Code, dataset links, and paper references mentioned in this article are compiled on GitHub:

https://github.com/sgInnora/gnn-vuln-detection

Stars and issues welcome.

References

Yamaguchi, F., et al. "Modeling and Discovering Vulnerabilities with Code Property Graphs." IEEE S&P 2014.
Zhou, Y., et al. "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks." NeurIPS 2019.
Cheng, X., et al. "DeepWukong: Statically Detecting Software Vulnerabilities Using Deep Graph Neural Network." ACM TOSEM 2021.
Li, Y., et al. "Vulnerability Detection with Fine-Grained Interpretations." ESEC/FSE 2021.
Wang, H., et al. "Combining Graph-Based Learning with Automated Data Collection for Code Vulnerability Detection." IEEE TIFS 2021.

Complete list of 55 papers available in the GitHub repository and in Appendix A below.

Disclaimer: This article is based on public information and the author's hands-on experience, aiming to explore GNN's technical applications in vulnerability detection. Specific product features should be verified with official sources.

Appendix A: Complete List of 55 Papers

Based on 6 months of systematic research, we compiled 55 papers and tools, mostly from top venues (NeurIPS, ICSE, FSE, S&P, CCS, USENIX Security, etc.), covering the landscape of GNN vulnerability detection from 2014 to 2025.

A.1 Foundational Theory and Methodology (18 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|-------------|---------------------|------------|-----------------|----------------|
| Devign | Zhou et al., NTU | NeurIPS 2019 | GGNN + CPG | First end-to-end GNN vuln detection |
| ReVeal | Chakraborty et al. | ICSE 2021 | GGNN + Triplet Loss | Vulnerability revision learning |
| ReGVD | Nguyen et al. | ICSE 2022 | Residual GNN | Mitigates over-smoothing |
| IVDetect | Li et al. | FSE 2021 | GGNN + Interpretation | Fine-grained explainability |
| DeepWukong | Cheng et al. | TOSEM 2021 | GGNN + Slicing | Backward slicing methodology |
| VulChecker | Mirsky et al. | USENIX 2023 | GGNN + Multi-task | Cross-project generalization |
| FUNDED | Wang et al. | IEEE TIFS 2021 | GNN + Imbalance | Focal loss for imbalance |
| mVulPreter | Li et al. | ASE 2022 | Multi-view GNN | Multi-scale fusion |
| PILOT | Wu et al. | ISSTA 2022 | GGNN + Pre-training | Graph-level pre-training |
| VulCNN | Wu et al. | ASE 2022 | CNN + Code Image | Code visualization approach |
| SySeVR | Li et al. | IEEE TDSC 2022 | Syntax + Semantic | Dual representation |
| VulDeePecker | Li et al. | NDSS 2018 | BiLSTM | Code gadget concept |
| Draper | Russell et al. | ICMLA 2018 | CNN + Random Forest | Large-scale dataset |
| µVulDeePecker | Zou et al. | IEEE TDSC 2019 | Multi-type Detection | Attention mechanism |
| BGNN4VD | Cao et al. | Inf. Sci. 2021 | Bi-directional GNN | Bidirectional message passing |
| VGDetector | Zheng et al. | ISSRE 2021 | GCN + Pooling | Hierarchical readout |
| DeepVD | Pornprasit et al. | 2022 | GNN + Class Balance | Threshold tuning |
| VDSimilar | Fang et al. | ICSE 2020 | Similarity + GNN | Clone-based detection |

A.2 Pre-trained Models and Transformer Methods (12 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|-------------|---------------------|------------|-----------------|----------------|
| CodeBERT | Feng et al., Microsoft | EMNLP 2020 | BERT + Code | Bimodal pre-training |
| GraphCodeBERT | Guo et al., Microsoft | ICLR 2021 | CodeBERT + DFG | Data flow graph integration |
| UniXcoder | Guo et al. | ACL 2022 | Unified | Multi-task unified model |
| LineVul | Fu et al. | MSR 2022 | Transformer | Line-level localization |
| Vul-LMGNNs | Anonymous | arXiv 2023 | LLM + GNN | LLM knowledge distillation |
| GRACE | Anonymous | arXiv 2024 | GCN + Contrastive | Residual + contrastive learning |
| CodeT5+ | Wang et al., Salesforce | EMNLP 2023 | T5 + Code | Encoder-decoder pre-training |
| StarCoder | BigCode Team | 2023 | 15B Model | Large-scale code model |
| DeepSeek-Coder | DeepSeek | 2024 | 33B Model | Fill-in-middle training |
| Magicoder | Wei et al. | 2024 | OSS-Instruct | Open-source instruction tuning |
| WizardCoder | Luo et al. | 2023 | Evol-Instruct | Code evolutionary instruction |
| SonarQube ML | SonarSource | 2024 | Hybrid | Industrial grade integration |

A.3 Smart Contract and Blockchain Security (8 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|-------------|---------------------|------------|-----------------|----------------|
| SmartEmbed | Gao et al. | TOSEM 2020 | Code Embedding | Solidity similarity |
| GNNSCVulDetector | Zhuang et al. | IJCNN 2020 | GCN | Multi-contract analysis |
| TMP | Zhuang et al. | USENIX 2021 | Temporal GNN | Reentrancy detection |
| Peculiar | Wu et al. | ICSE 2021 | Code Property | Ponzi scheme detection |
| SCGformer | Li et al. | 2023 | Transformer | Cross-contract semantics |
| BlockScope | Huang et al. | ISSTA 2023 | GNN + Static | Gas optimization analysis |
| GPTScan | Sun et al. | 2024 | GPT + Static | Logic bug detection |
| AuditGPT | Anonymous | 2024 | LLM Audit | Automated audit reports |

A.4 Binary and Low-Level Code Analysis (7 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|-------------|---------------------|------------|-----------------|----------------|
| PALMTREE | Li et al. | CCS 2021 | Instruction Embedding | Assembly pre-training |
| VulHunter | Xu et al. | ASE 2020 | Binary GNN | Firmware analysis |
| Gemini | Xu et al. | CCS 2017 | Siamese Network | Binary similarity |
| SAFE | Massarelli et al. | DIMVA 2019 | Self-Attentive | Assembly self-attention |
| OrderMatters | Yu et al. | NDSS 2020 | Instruction Order | Binary semantics |
| BinGo | Chandramohan et al. | NDSS 2016 | Partial Match | Selective inlining |
| αDiff | Liu et al. | ASE 2018 | Deep Learning | Cross-version diff |

A.5 Industrial Tools and Applications (5 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|-------------|---------------------|------------|-----------------|----------------|
| Joern | Yamaguchi et al. | IEEE S&P 2014 | CPG | De facto standard |
| WeiZhen (维阵) | Aurora Infinite | Industry 2022 | GGNN + CPG | Commercial platform |
| Slither | Trail of Bits | 2019 | Pattern Match | Solidity static analysis |
| Mythril | ConsenSys | 2017 | Symbolic Exec | Ethereum security |
| CodeQL | GitHub | 2019 | Declarative Query | Graph query language |

A.6 Program Slicing and Explainability (5 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|-------------|---------------------|------------|-----------------|----------------|
| CFExplainer | Liu et al. | ISSTA 2024 | Counterfactual | "Change this line to fix" |
| VulGCANet | Chen et al. | TrustCom 2024 | GCN + GAT | Long-range dependencies |
| HAGNN | Anonymous | arXiv 2025 | Heterogeneous Attention | Multi-type nodes/edges |
| Big Sleep | Google | 2025 | Gemini + Graph | First real 0-day by AI |
| XVulExplain | Wang et al. | 2024 | SHAP + GNN | Feature attribution |

Appendix B: Complete Code Implementation

This section provides complete, runnable implementations of core GNN components for vulnerability detection.

B.1 GGNN (Gated Graph Neural Network) Implementation

"""
GGNN-based Vulnerability Detector
Based on DEVIGN (NeurIPS 2019) architecture
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing, global_mean_pool

class GGNNLayer(MessagePassing):
    """Single GGNN layer with GRU-style gating"""

    def __init__(self, hidden_dim: int, num_edge_types: int = 4):
        super().__init__(aggr='add')  # Sum aggregation
        self.hidden_dim = hidden_dim
        self.num_edge_types = num_edge_types

        # Edge-specific transformation matrices
        self.edge_mlps = nn.ModuleList([
            nn.Linear(hidden_dim, hidden_dim, bias=False)
            for _ in range(num_edge_types)
        ])

        # GRU gate
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, x, edge_index, edge_type):
        # Message passing for each edge type
        aggregated = torch.zeros_like(x)
        for etype in range(self.num_edge_types):
            mask = edge_type == etype
            if mask.sum() > 0:
                edge_idx = edge_index[:, mask]
                msg = self.edge_mlps[etype](x)
                aggregated += self.propagate(edge_idx, x=msg)

        # GRU update
        return self.gru(aggregated, x)

    def message(self, x_j):
        return x_j


class DevignModel(nn.Module):
    """
    DEVIGN: Effective Vulnerability Identification
    Architecture: GGNN backbone + Conv1D readout
    """

    def __init__(
        self,
        input_dim: int = 128,
        hidden_dim: int = 256,
        num_layers: int = 6,
        num_edge_types: int = 4,  # AST, CFG, DFG, CDG
        num_classes: int = 2,
        dropout: float = 0.3
    ):
        super().__init__()

        # Input projection
        self.input_proj = nn.Linear(input_dim, hidden_dim)

        # GGNN layers
        self.ggnn_layers = nn.ModuleList([
            GGNNLayer(hidden_dim, num_edge_types)
            for _ in range(num_layers)
        ])

        # Conv1D for sequence modeling
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=1)

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, num_classes)
        )

    def forward(self, x, edge_index, edge_type, batch):
        # Project input features
        h = self.input_proj(x)

        # GGNN message passing
        for layer in self.ggnn_layers:
            h = layer(h, edge_index, edge_type)

        # Conv1D processing (requires reshaping)
        h = h.unsqueeze(0).permute(0, 2, 1)  # [1, hidden, nodes]
        h = F.relu(self.conv1(h))
        h = self.conv2(h)
        h = h.permute(0, 2, 1).squeeze(0)  # [nodes, hidden]

        # Graph-level pooling
        graph_emb = global_mean_pool(h, batch)

        return self.classifier(graph_emb)


# Usage example
if __name__ == "__main__":
    model = DevignModel(input_dim=128, hidden_dim=256, num_layers=6)

    # Simulated input (100 nodes, batch of 4 graphs)
    x = torch.randn(100, 128)
    edge_index = torch.randint(0, 100, (2, 300))
    edge_type = torch.randint(0, 4, (300,))
    batch = torch.repeat_interleave(torch.arange(4), 25)

    output = model(x, edge_index, edge_type, batch)
    print(f"Output shape: {output.shape}")  # [4, 2]

B.2 CodeBERT Node Embedding

"""
CodeBERT-based node feature initialization
Using Microsoft's pre-trained CodeBERT model
"""

import torch
from transformers import AutoTokenizer, AutoModel
from typing import List

class CodeBERTEmbedder:
    """Extract code embeddings using CodeBERT"""

    def __init__(
        self,
        model_name: str = "microsoft/codebert-base",
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
        max_length: int = 512
    ):
        self.device = device
        self.max_length = max_length

        # Load pre-trained model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.model.eval()

    @torch.no_grad()
    def encode(self, code_snippets: List[str]) -> torch.Tensor:
        """
        Encode code snippets to embedding vectors

        Args:
            code_snippets: List of code strings (one per node)

        Returns:
            Tensor of shape [num_nodes, 768]
        """
        embeddings = []

        for code in code_snippets:
            # Tokenize
            inputs = self.tokenizer(
                code,
                return_tensors="pt",
                max_length=self.max_length,
                truncation=True,
                padding="max_length"
            ).to(self.device)

            # Get [CLS] token embedding
            outputs = self.model(**inputs)
            cls_emb = outputs.last_hidden_state[:, 0, :]  # [1, 768]
            embeddings.append(cls_emb)

        return torch.cat(embeddings, dim=0)

    def encode_batch(
        self,
        code_snippets: List[str],
        batch_size: int = 32
    ) -> torch.Tensor:
        """Batch encoding for large datasets"""
        all_embeddings = []

        for i in range(0, len(code_snippets), batch_size):
            batch = code_snippets[i:i + batch_size]

            inputs = self.tokenizer(
                batch,
                return_tensors="pt",
                max_length=self.max_length,
                truncation=True,
                padding=True
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model(**inputs)
                cls_emb = outputs.last_hidden_state[:, 0, :]
                all_embeddings.append(cls_emb.cpu())

        return torch.cat(all_embeddings, dim=0)


# Usage example
if __name__ == "__main__":
    embedder = CodeBERTEmbedder()

    code_samples = [
        "void foo() { int x = 0; }",
        "char* strcpy(char* dest, const char* src);",
        "if (ptr != NULL) { free(ptr); }"
    ]

    embeddings = embedder.encode(code_samples)
    print(f"Embeddings shape: {embeddings.shape}")  # [3, 768]

B.3 Graph Slicing (Backward Slicing)

"""
CPG Graph Slicing Implementation
Based on DeepWukong (TOSEM 2021) methodology
"""

from typing import List, Dict, Set, Tuple
from dataclasses import dataclass
import networkx as nx

@dataclass
class CPGNode:
    """CPG node representation"""
    id: int
    type: str  # 'call', 'identifier', 'literal', etc.
    code: str
    line_number: int

@dataclass
class CPGEdge:
    """CPG edge representation"""
    src: int
    dst: int
    type: str  # 'AST', 'CFG', 'DFG', 'CDG'

class CPGSlicer:
    """Backward slicing from sensitive sinks"""

    # Sensitive sink functions by vulnerability type
    SINKS = {
        'buffer_overflow': ['strcpy', 'strcat', 'sprintf', 'gets', 'memcpy'],
        'command_injection': ['system', 'popen', 'exec', 'execve', 'execl'],
        'format_string': ['printf', 'fprintf', 'sprintf', 'snprintf'],
        'memory_leak': ['malloc', 'calloc', 'realloc', 'strdup'],
        'use_after_free': ['free'],  # Track pointers after free
        'sql_injection': ['mysql_query', 'sqlite3_exec', 'PQexec'],
    }

    def __init__(
        self,
        nodes: List[CPGNode],
        edges: List[CPGEdge],
        max_depth: int = 5
    ):
        self.nodes = {n.id: n for n in nodes}
        self.max_depth = max_depth

        # Build graph
        self.graph = nx.DiGraph()
        for node in nodes:
            self.graph.add_node(node.id, data=node)

        for edge in edges:
            self.graph.add_edge(edge.src, edge.dst, type=edge.type)

        # Build reverse graph for backward slicing
        self.reverse_graph = self.graph.reverse()

    def find_sinks(self, vuln_type: str = 'all') -> List[int]:
        """Find all sink nodes in the graph"""
        sinks = []

        if vuln_type == 'all':
            sink_names = set()
            for names in self.SINKS.values():
                sink_names.update(names)
        else:
            sink_names = set(self.SINKS.get(vuln_type, []))

        for node_id, node in self.nodes.items():
            if node.type == 'call':
                # Compare the exact callee name; a substring test would also
                # flag e.g. fgets() as gets() or vsnprintf() as printf()
                callee = node.code.split('(', 1)[0].strip()
                if callee in sink_names:
                    sinks.append(node_id)

        return sinks

    def backward_slice(
        self,
        sink_id: int,
        edge_types: Set[str] = {'DFG', 'CDG'}
    ) -> Set[int]:
        """
        Perform backward slicing from a sink node

        Args:
            sink_id: ID of the sink node
            edge_types: Edge types to follow (DFG for data flow, CDG for control)

        Returns:
            Set of node IDs in the slice
        """
        visited = set()
        queue = [(sink_id, 0)]  # (node_id, depth)

        while queue:
            node_id, depth = queue.pop(0)

            if node_id in visited or depth > self.max_depth:
                continue

            visited.add(node_id)

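            # Traverse backward edges: predecessors in the ORIGINAL graph are
            # the nodes that flow into node_id (predecessors of the reversed
            # graph would be successors, and self.graph[pred][node_id] would
            # then raise KeyError)
            for pred in self.graph.predecessors(node_id):
                edge_data = self.graph[pred][node_id]
                if edge_data['type'] in edge_types:
                    queue.append((pred, depth + 1))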

        return visited

    def slice_for_vulnerability(
        self,
        vuln_type: str = 'all'
    ) -> List[Tuple[int, Set[int]]]:
        """
        Generate slices for all sinks of a given vulnerability type

        Returns:
            List of (sink_id, slice_node_set) tuples
        """
        sinks = self.find_sinks(vuln_type)
        slices = []

        for sink_id in sinks:
            slice_nodes = self.backward_slice(sink_id)
            slices.append((sink_id, slice_nodes))

        return slices

    def extract_subgraph(
        self,
        node_ids: Set[int]
    ) -> Tuple[List[CPGNode], List[CPGEdge]]:
        """Extract subgraph containing only specified nodes"""
        nodes = [self.nodes[nid] for nid in node_ids if nid in self.nodes]
        edges = []

        for src, dst, data in self.graph.edges(data=True):
            if src in node_ids and dst in node_ids:
                edges.append(CPGEdge(src, dst, data['type']))

        return nodes, edges


# Joern integration helper
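def joern_query_to_slice(joern_result: str) -> List[int]:
    """
    Parse Joern query results to get slice nodes

    Joern query example (reachableBy returns the source nodes that reach the sink):
    cpg.call.name("strcpy").argument.reachableBy(cpg.method.parameter).id.l
    """
    # Extract standalone integers such as 123 or 123L from the REPL output;
    # the leading word boundary skips digits embedded in names like "res0"
    import re
    ids = re.findall(r'\b\d+', joern_result)
    return [int(i) for i in ids]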


# Usage example
if __name__ == "__main__":
    # Create sample CPG
    nodes = [
        CPGNode(0, 'param', 'char* input', 1),
        CPGNode(1, 'identifier', 'input', 2),
        CPGNode(2, 'call', 'strlen(input)', 2),
        CPGNode(3, 'identifier', 'buffer', 3),
        CPGNode(4, 'call', 'strcpy(buffer, input)', 4),  # SINK
        CPGNode(5, 'return', 'return 0', 5),
    ]

    edges = [
        CPGEdge(0, 1, 'DFG'),  # input def -> use
        CPGEdge(1, 2, 'DFG'),  # input -> strlen
        CPGEdge(1, 4, 'DFG'),  # input -> strcpy (TAINT)
        CPGEdge(3, 4, 'DFG'),  # buffer -> strcpy
        CPGEdge(2, 4, 'CDG'),  # strlen -> strcpy (control)
        CPGEdge(4, 5, 'CFG'),  # strcpy -> return
    ]

    slicer = CPGSlicer(nodes, edges, max_depth=5)

    # Find buffer overflow sinks
    sinks = slicer.find_sinks('buffer_overflow')
    print(f"Found sinks: {sinks}")  # [4]

    # Backward slice from sink
    slices = slicer.slice_for_vulnerability('buffer_overflow')
    for sink_id, slice_nodes in slices:
        print(f"Sink {sink_id}: slice contains {len(slice_nodes)} nodes")
        print(f"  Nodes: {slice_nodes}")

B.4 Complete Training Pipeline

"""
Complete GNN Vulnerability Detection Training Pipeline
Includes: data loading, training loop, evaluation, checkpointing
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # moved out of torch_geometric.data in PyG 2.0
from torch_geometric.nn import GATConv, global_mean_pool
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from typing import Dict, List, Tuple
import numpy as np
from tqdm import tqdm

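class FocalLoss(nn.Module):
    """Focal Loss for class imbalance (binary labels: 0 = safe, 1 = vulnerable)"""

    def __init__(self, gamma: float = 2.0, alpha: float = 0.75):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha  # weight of the positive (vulnerable) class

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        ce_loss = F.cross_entropy(pred, target, reduction='none')
        pt = torch.exp(-ce_loss)  # probability assigned to the true class
        # Class-dependent weight: alpha for positives, (1 - alpha) for negatives.
        # A single constant alpha would only rescale the loss, not rebalance classes.
        target_f = target.float()
        alpha_t = self.alpha * target_f + (1 - self.alpha) * (1 - target_f)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()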


class VulnerabilityDetector(nn.Module):
    """GAT-based Vulnerability Detector with residual connections"""

    def __init__(
        self,
        input_dim: int = 768,  # CodeBERT dimension
        hidden_dim: int = 256,
        num_heads: int = 8,
        num_layers: int = 4,
        dropout: float = 0.3,
        num_classes: int = 2
    ):
        super().__init__()

        # Input projection
        self.input_proj = nn.Linear(input_dim, hidden_dim)

        # GAT layers with residual connections
        self.gat_layers = nn.ModuleList()
        self.layer_norms = nn.ModuleList()

        for i in range(num_layers):
            in_channels = hidden_dim if i == 0 else hidden_dim * num_heads
            self.gat_layers.append(
                GATConv(in_channels, hidden_dim, heads=num_heads, dropout=dropout)
            )
            self.layer_norms.append(nn.LayerNorm(hidden_dim * num_heads))

        # Final projection
        self.final_gat = GATConv(hidden_dim * num_heads, hidden_dim, heads=1, dropout=dropout)

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, num_classes)
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, edge_index, batch):
        # Input projection
        h = self.input_proj(x)

        # First GAT layer (no residual - dimension change)
        h = self.gat_layers[0](h, edge_index)
        h = F.elu(h)
        h = self.layer_norms[0](h)
        h = self.dropout(h)

        # Remaining GAT layers with residual connections
        for i in range(1, len(self.gat_layers)):
            h_res = h
            h = self.gat_layers[i](h, edge_index)
            h = F.elu(h)
            h = self.layer_norms[i](h)
            h = self.dropout(h)
            h = h + h_res  # Residual connection

        # Final GAT layer
        h = self.final_gat(h, edge_index)
        h = F.elu(h)

        # Graph-level pooling
        graph_emb = global_mean_pool(h, batch)

        return self.classifier(graph_emb)


class VulnerabilityTrainer:
    """Training pipeline with evaluation and checkpointing"""

    def __init__(
        self,
        model: nn.Module,
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
        learning_rate: float = 1e-4,
        weight_decay: float = 1e-5
    ):
        self.model = model.to(device)
        self.device = device

        # Loss and optimizer
        self.criterion = FocalLoss(gamma=2.0, alpha=0.75)
        self.optimizer = AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay
        )

        # Metrics tracking
        self.best_f1 = 0.0
        self.history = {'train_loss': [], 'val_f1': []}

    def train_epoch(self, train_loader: DataLoader) -> float:
        """Train for one epoch"""
        self.model.train()
        total_loss = 0.0

        for batch in tqdm(train_loader, desc="Training"):
            batch = batch.to(self.device)

            self.optimizer.zero_grad()
            output = self.model(batch.x, batch.edge_index, batch.batch)
            loss = self.criterion(output, batch.y)

            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()

            total_loss += loss.item()

        return total_loss / len(train_loader)

    @torch.no_grad()
    def evaluate(self, val_loader: DataLoader) -> Dict[str, float]:
        """Evaluate model on validation set"""
        self.model.eval()

        all_preds = []
        all_labels = []
        all_probs = []

        for batch in val_loader:
            batch = batch.to(self.device)
            output = self.model(batch.x, batch.edge_index, batch.batch)
            probs = F.softmax(output, dim=1)[:, 1]
            preds = output.argmax(dim=1)

            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch.y.cpu().numpy())
            all_probs.extend(probs.cpu().numpy())

        # Calculate metrics
        metrics = {
            'f1': f1_score(all_labels, all_preds),
            'precision': precision_score(all_labels, all_preds),
            'recall': recall_score(all_labels, all_preds),
            'auc': roc_auc_score(all_labels, all_probs) if len(set(all_labels)) > 1 else 0.0
        }

        return metrics

    def train(
        self,
        train_loader: DataLoader,
        val_loader: DataLoader,
        num_epochs: int = 100,
        patience: int = 10,
        checkpoint_path: str = "best_model.pt"
    ) -> Dict:
        """Full training loop with early stopping"""

        scheduler = CosineAnnealingLR(self.optimizer, T_max=num_epochs)
        no_improve = 0

        for epoch in range(num_epochs):
            # Training
            train_loss = self.train_epoch(train_loader)
            self.history['train_loss'].append(train_loss)

            # Evaluation
            metrics = self.evaluate(val_loader)
            self.history['val_f1'].append(metrics['f1'])

            # Learning rate scheduling
            scheduler.step()

            # Logging
            print(f"Epoch {epoch+1}/{num_epochs}")
            print(f"  Train Loss: {train_loss:.4f}")
            print(f"  Val F1: {metrics['f1']:.4f}, Precision: {metrics['precision']:.4f}, "
                  f"Recall: {metrics['recall']:.4f}, AUC: {metrics['auc']:.4f}")

            # Checkpointing
            if metrics['f1'] > self.best_f1:
                self.best_f1 = metrics['f1']
                torch.save({
                    'epoch': epoch,
                    'model_state_dict': self.model.state_dict(),
                    'optimizer_state_dict': self.optimizer.state_dict(),
                    'best_f1': self.best_f1
                }, checkpoint_path)
                print(f"  Saved best model (F1: {self.best_f1:.4f})")
                no_improve = 0
            else:
                no_improve += 1

            # Early stopping
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

        return self.history


def create_synthetic_dataset(num_graphs: int = 1000) -> List[Data]:
    """Create synthetic dataset for testing"""
    dataset = []

    for _ in range(num_graphs):
        num_nodes = np.random.randint(50, 200)
        num_edges = np.random.randint(num_nodes, num_nodes * 3)

        # Random node features (simulating CodeBERT embeddings)
        x = torch.randn(num_nodes, 768)

        # Random edges
        edge_index = torch.randint(0, num_nodes, (2, num_edges))

        # Random label (0: safe, 1: vulnerable)
        y = torch.tensor([np.random.randint(0, 2)])

        dataset.append(Data(x=x, edge_index=edge_index, y=y))

    return dataset


# Main execution
if __name__ == "__main__":
    # Create synthetic dataset
    print("Creating synthetic dataset...")
    train_data = create_synthetic_dataset(800)
    val_data = create_synthetic_dataset(200)

    # Create data loaders
    train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_data, batch_size=32)

    # Initialize model and trainer
    model = VulnerabilityDetector(
        input_dim=768,
        hidden_dim=256,
        num_heads=8,
        num_layers=4
    )

    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

    trainer = VulnerabilityTrainer(model)

    # Train
    print("\nStarting training...")
    history = trainer.train(
        train_loader,
        val_loader,
        num_epochs=50,
        patience=10,
        checkpoint_path="vuln_detector_best.pt"
    )

    print(f"\nTraining complete. Best F1: {trainer.best_f1:.4f}")

Author: Jiqiang Feng (风宁) Contact: [email protected] | [email protected] GitHub: @sgInnora


2. Code Property Graph: Turning Code into Graphs

2.1 The CPG: Three Graphs in One

In 2014, Fabian Yamaguchi introduced the Code Property Graph (CPG) concept. Simply put, it combines AST, CFG, and PDG into a single graph.

CPG = AST + CFG + PDG

Why merge them? Each graph alone has blind spots:

AST only captures syntax, not execution order
CFG only shows control flow, not variable propagation
PDG lacks syntactic details

After fusion, nodes contain complete semantic information. This is fundamental for GNN to work effectively.
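
To make the fusion concrete, here is a minimal sketch in plain Python (no Joern involved) that merges three hand-written edge lists for the vulnerable() function from Section 1 into one graph with typed edges. The node IDs, labels, and the merge_cpg helper are illustrative assumptions, not Joern's actual schema.

```python
from collections import defaultdict

# Hand-written nodes for:
#   void vulnerable(char *input) { char buffer[64]; strcpy(buffer, input); }
# IDs and labels are illustrative, not Joern's real node schema.
NODES = {
    0: "METHOD vulnerable",
    1: "PARAM input",
    2: "LOCAL buffer",
    3: "CALL strcpy",
    4: "METHOD_RETURN",
}

AST_EDGES = [(0, 1), (0, 2), (0, 3)]   # syntax: method contains param, local, call
CFG_EDGES = [(0, 3), (3, 4)]           # execution order: entry -> strcpy -> exit
PDG_EDGES = [(1, 3), (2, 3)]           # data dependence: input and buffer feed strcpy


def merge_cpg(**edge_sets):
    """Merge per-representation edge lists into {(src, dst): set_of_edge_types}."""
    merged = defaultdict(set)
    for edge_type, edges in edge_sets.items():
        for src, dst in edges:
            merged[(src, dst)].add(edge_type)
    return dict(merged)


cpg = merge_cpg(AST=AST_EDGES, CFG=CFG_EDGES, PDG=PDG_EDGES)

for (src, dst), types in sorted(cpg.items()):
    print(f"{NODES[src]:18} -> {NODES[dst]:14} {sorted(types)}")
```

Note that the pair (0, 3) carries both an AST and a CFG edge. A graph structure that stores a single type per edge would silently drop one of them, so a merged representation needs either a multigraph or a set of types per edge.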

2.2 Joern: The De Facto Standard

When it comes to CPG, you can't avoid Joern. I've used it for three years. Some honest observations:

Pros:

Supports C/C++/Java/JavaScript/Python
Generates standardized graph structures, GNN-ready
Scala-based query language (CPGQL); writing rules is satisfying

Pitfalls:

C++ template code parsing crashes frequently
Memory pressure on large projects (1M+ lines)
Python support was added later, some edge types incomplete

Practical advice: For projects over 500K lines, split by modules before generating CPG.
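
One way to plan that split, sketched under the assumption that top-level directories correspond to modules: count source lines per directory and pack directories into chunks that stay under a line budget. The plan_cpg_chunks helper and the suffix list are hypothetical, and Joern itself is not called here.

```python
from pathlib import Path

C_SUFFIXES = {".c", ".h", ".cc", ".cpp", ".hpp"}


def count_lines(path: Path) -> int:
    """Count lines in one source file, ignoring undecodable bytes."""
    with path.open("r", errors="ignore") as fh:
        return sum(1 for _ in fh)


def plan_cpg_chunks(root: str, max_lines: int = 500_000) -> list:
    """Group top-level module directories into chunks of at most max_lines.

    A single module larger than max_lines still gets its own chunk.
    Returns a list of (module_names, total_lines) tuples.
    """
    root_path = Path(root)
    sizes = {}
    for module in sorted(p for p in root_path.iterdir() if p.is_dir()):
        total = sum(
            count_lines(f) for f in module.rglob("*")
            if f.is_file() and f.suffix in C_SUFFIXES
        )
        if total:
            sizes[module.name] = total

    chunks, current, current_lines = [], [], 0
    for name, lines in sizes.items():
        # Close the current chunk when adding this module would exceed the budget
        if current and current_lines + lines > max_lines:
            chunks.append((current, current_lines))
            current, current_lines = [], 0
        current.append(name)
        current_lines += lines
    if current:
        chunks.append((current, current_lines))
    return chunks
```

Each resulting chunk can then be parsed into its own CPG. Splitting this way loses data flow that crosses chunk boundaries, so keep tightly coupled modules together where you can.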

2.3 My CPG Generation Pipeline

# Install Joern (macOS)
brew install joern

# Generate CPG (using OpenSSL as example)
joern-parse /path/to/openssl --language c --output openssl.cpg

# Export to DOT format (for visualization/debugging)
joern-export openssl.cpg --repr cpg14 --format dot

Generated graphs often have millions of edges. Here's where graph slicing becomes essential.

3. Graph Slicing: Keep Only What Matters

3.1 Backward Slicing from Sinks

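Full-function CPGs are too large. The DeepWukong paper proposed a clever approach: only keep subgraphs related to sensitive operations.

What are sensitive operations? That depends on the vulnerability type you're detecting. For buffer overflows they are calls such as strcpy, strcat, and memcpy; for command injection, system and popen; for format string bugs, the printf family (Appendix B.3 lists the full sink table used in the code).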

Backward slicing from sink points keeps only nodes whose data flow can reach the sink. This can reduce graph size by 80%+.

3.2 Slicing Code Example

Using Joern's query language:

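// Find all strcpy call sites
val sinks = cpg.call.name("strcpy").l

// Backward data flow slicing: which method parameters reach each sink?
// Note: reachableByFlows accepts only source traversals; the search depth
// is bounded through the data-flow engine configuration, not an extra argument
val slice = sinks.flatMap { sink =>
  sink.reachableByFlows(cpg.method.parameter)
}

This traces back from strcpy to see which parameters can flow there. A depth bound of around 5 is usually sufficient; going deeper adds noise.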

4. GNN Architecture Selection

4.1 Mainstream Architectures Compared

After running dozens of models, here's my assessment:

My choice: GAT (Graph Attention Network). The reason is practical—attention weights directly show which edges matter most, making debugging easier.
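
To show what "attention weights per edge" means, here is a minimal pure-Python sketch of how GAT-style coefficients are computed for one target node: a score per incoming neighbor, passed through LeakyReLU and normalized with softmax. The feature values and attention vectors are made-up numbers, and the linear projection W is omitted for brevity, so treat this as an illustration and not a full GAT layer.

```python
import math


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_weights(h_target, neighbors, a_src, a_dst):
    """Return {neighbor_name: alpha} for one target node.

    e_ij = LeakyReLU(a_src . h_i + a_dst . h_j), alpha_ij = softmax_j(e_ij).
    Features are assumed to be already linearly projected (W is omitted).
    """
    scores = {
        name: leaky_relu(dot(a_src, h_target) + dot(a_dst, h_j))
        for name, h_j in neighbors.items()
    }
    max_score = max(scores.values())          # subtract max for numerical stability
    exp_scores = {n: math.exp(s - max_score) for n, s in scores.items()}
    total = sum(exp_scores.values())
    return {n: e / total for n, e in exp_scores.items()}


# Toy 2-d features for the strcpy call and the nodes feeding it (made-up numbers)
h_strcpy = [1.0, 0.5]
incoming = {
    "input (tainted param)": [2.0, 1.0],
    "buffer (local)": [0.5, 0.2],
    "strlen(input)": [0.1, 0.1],
}
a_src, a_dst = [0.3, 0.1], [1.0, 0.5]   # stand-ins for learned attention parameters

alphas = attention_weights(h_strcpy, incoming, a_src, a_dst)
for name, alpha in sorted(alphas.items(), key=lambda kv: -kv[1]):
    print(f"{name:24} {alpha:.3f}")
```

In PyTorch Geometric, GATConv returns the learned coefficients alongside the output when called with return_attention_weights=True, which is the hook to use for this kind of edge-level debugging.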

4.2 Message Passing Iterations

GNN's core mechanism is message passing: nodes aggregate neighbor information and update their representations.

How many iterations? Papers vary wildly. My experience:

Code analysis: 4-6 iterations suffice
Beyond 8 iterations: Over-smoothing kicks in—all nodes become similar

The Devign paper uses 8 iterations, but their graphs are sparser. Adjust based on your graph density.
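
The iteration count also sets how far information can travel: a sink that sits k data-flow hops from a source only "sees" it after at least k rounds. A minimal sketch with scalar features and mean aggregation (a deliberate simplification of a real GNN layer, with a made-up 5-hop chain) makes this visible:

```python
def message_passing(features, edges, num_rounds):
    """Run num_rounds of mean-aggregation message passing on scalar features.

    features: {node: float}, edges: list of directed (src, dst) pairs.
    Each round, a node's new value is the mean of its own value and the
    values of its in-neighbors (a self-loop plus incoming messages).
    """
    in_neighbors = {n: [] for n in features}
    for src, dst in edges:
        in_neighbors[dst].append(src)

    current = dict(features)
    for _ in range(num_rounds):
        updated = {}
        for node, value in current.items():
            messages = [current[src] for src in in_neighbors[node]] + [value]
            updated[node] = sum(messages) / len(messages)
        current = updated
    return current


# A 5-hop data-flow chain: source -> a -> b -> c -> d -> sink
chain = ["source", "a", "b", "c", "d", "sink"]
edges = list(zip(chain, chain[1:]))
taint = {n: 0.0 for n in chain}
taint["source"] = 1.0   # only the source starts out "tainted"

for rounds in (1, 3, 5):
    out = message_passing(taint, edges, rounds)
    print(f"{rounds} rounds: sink sees taint = {out['sink']:.4f}")
```

With fewer than 5 rounds the sink's value stays at zero; at 5 rounds the signal arrives, already diluted by averaging. That trade-off between reach and dilution is why the slicing depth and the iteration count are worth tuning together.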

4.3 Node Feature Initialization

This is overlooked but crucial. How do you turn code nodes into vectors?

Method 1: Word2Vec

Treat code tokens as "words"
Train word embeddings
Drawback: Doesn't understand code semantics

Method 2: CodeBERT Embeddings

Use pre-trained CodeBERT
768-dimensional embeddings (can be projected down to 128/256 before the GNN)
Drawback: Slower inference

Method 3: Instruction2Vec (Binary Analysis)

Map assembly instructions to vectors
Suitable for firmware analysis

I now default to CodeBERT—embedding quality is significantly better. Speed issues can be solved through pre-computation—once the graph is fixed, compute embeddings once.
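
The pre-computation idea can be sketched as a cache keyed by a hash of each node's code text, so identical snippets are encoded only once. The EmbeddingCache class and the fake_embed stand-in are hypothetical; in practice embed_fn would wrap a real encoder such as the CodeBERT embedder in Appendix B.

```python
import hashlib


class EmbeddingCache:
    """Cache node embeddings keyed by a hash of the node's code text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn      # any callable: str -> list of floats
        self._store = {}
        self.misses = 0               # number of real encoder calls

    def get(self, code: str):
        key = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.embed_fn(code)
        return self._store[key]


def fake_embed(code: str):
    """Stand-in for a real encoder (hypothetical, 4-d output for readability)."""
    digest = hashlib.sha256(code.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:4]]


cache = EmbeddingCache(fake_embed)
node_codes = ["strcpy(buffer, input)", "input", "buffer", "input", "input"]
features = [cache.get(code) for code in node_codes]

print(f"{len(node_codes)} nodes, {cache.misses} encoder calls")
```

Identifier nodes repeat heavily inside a CPG, so even this in-memory version cuts encoder calls noticeably; persisting the store to disk extends the saving across training runs.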

5. From Dataset to Model Deployment

5.1 Dataset Selection

Stop benchmarking on SARD. That dataset is made up almost entirely of synthetic vulnerabilities with overly regular patterns. Getting 95% on it means nothing.

Recommended Datasets:

Key reminder: All datasets have biases. Big-Vul over-represents buffer overflows. DiverseVul label quality varies. Best practice: mix multiple datasets for training.

5.2 Model Training Code Framework

With PyTorch Geometric, the core code is concise:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool
from torch_geometric.loader import DataLoader  # moved out of torch_geometric.data in PyG 2.0

class VulnDetector(torch.nn.Module):
    def __init__(self, in_dim=128, hidden_dim=256, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.conv2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)
        self.classifier = torch.nn.Linear(hidden_dim, 2)

    def forward(self, x, edge_index, batch):
        # Two GAT layers
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        # Graph-level pooling
        x = global_mean_pool(x, batch)
        return self.classifier(x)

# train_dataset: a list of torch_geometric.data.Data graphs (x, edge_index, y)
# built from your sliced CPGs; node features must have in_dim columns
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
model = VulnDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(100):
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index, batch.batch)
        loss = F.cross_entropy(out, batch.y)
        loss.backward()
        optimizer.step()

This is the basic version. Production deployment needs:

Dropout for regularization
Learning rate scheduling
Early stopping
Class imbalance handling (vulnerabilities are usually rare)

5.3 Handling Class Imbalance

Common issue with vulnerability datasets: normal code vastly outnumbers vulnerable code. Ratios like 10:1 or even 100:1.

My approach:

Oversample vulnerability samples: SMOTE-Graph variants
Focal Loss: Reduce weight on easy-to-classify samples
Threshold adjustment: Don't use 0.5—tune based on business needs

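# Focal Loss implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=0.75):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha  # weight of the positive (vulnerable) class

    def forward(self, pred, target):
        ce_loss = F.cross_entropy(pred, target, reduction='none')
        pt = torch.exp(-ce_loss)  # probability assigned to the true class
        # alpha for positives, (1 - alpha) for negatives; a single constant
        # would only rescale the loss instead of rebalancing the classes
        target_f = target.float()
        alpha_t = self.alpha * target_f + (1 - self.alpha) * (1 - target_f)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()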
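
For the threshold-adjustment item in the list above, a minimal sketch: scan candidate thresholds on validation scores and keep the one with the best recall among those that meet a precision floor. The pick_threshold helper, the precision floor, and the validation numbers are all illustrative assumptions; the right objective depends on how costly false alarms are for your team.

```python
def pick_threshold(probs, labels, min_precision=0.9):
    """Return (threshold, precision, recall) with the best recall among
    thresholds whose precision is at least min_precision, or None."""
    best = None
    for threshold in sorted(set(probs)):
        preds = [p >= threshold for p in probs]
        tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
        fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
        fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)
        if tp + fp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Made-up validation scores: P(vulnerable) from the model, plus true labels
val_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
val_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

result = pick_threshold(val_probs, val_labels, min_precision=0.8)
print(result)
```

On this toy data the chosen threshold lands below 0.5, which is the typical situation on imbalanced data: the default cut-off leaves recall on the table even when precision requirements are already met.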

6. GNN vs LLM: When to Use Which?

6.1 Performance Comparison (Based on My Testing)

On Big-Vul dataset:

Key findings:

LLM leads on large datasets, but training cost is 5-10x higher than GNN
Zero-shot GPT-4 performs worse than many assume (overhyped)
GNN's inference speed advantage is significant—suitable for CI/CD integration

6.2 Scenario Recommendations

Use GNN when:

Daily code scanning (need speed)
Resource-constrained environments (edge devices)
Explainability required (audit compliance)
Cross-function data flow tracking

Use LLM when:

Novel vulnerability types (zero-shot generalization)
Code understanding + fix suggestions (need language capability)
Sufficient labeled data for fine-tuning

Use hybrid architectures when:

Pursuing maximum accuracy
Adequate compute resources
Willing to invest in engineering optimization

6.3 Hybrid Architectures: The Future Direction

The 2025 trend is clear: GNN + LLM hybrid architectures are rising.

Representative works:

Vul-LMGNNs: LLM for knowledge distillation, GNN for final judgment
GRACE: GCN + residual connections + contrastive learning
HAGNN: Hierarchical attention graph network, 96.6% accuracy on C

My assessment: Pure GNN or pure LLM will both be surpassed by hybrid approaches. The challenge is engineering complexity—you'll face pitfalls from both systems.

6.4 2024-2025 Breakthrough Methods

Academia hasn't been idle. From "proving feasibility" to "explainability" and "multimodal fusion," research has entered its second phase.

7. Industrial Applications

7.1 Aurora Infinite (极光无限)

A leading Chinese vendor in GNN-based vulnerability detection. They have two core products:

WeiZhen (维阵): Static Code Security Analysis Platform

Uses CPG (Code Property Graph) as unified representation
Runs GGNN for vulnerability pattern learning
Supports C/C++/Java/Go/Python
CI/CD integration with GitLab/Jenkins

Aurora Hunter (极光猎手): AI-Assisted Security Audit

Combines GNN detection results with LLM explanation capabilities
Auto-generates vulnerability reports and fix suggestions
Targets financial, energy, and critical infrastructure sectors

I've talked with their technical team: CPG generation uses the open-source Joern, and the GNN part has proprietary optimizations. They have a solid commercial deployment track record.

7.2 GitHub CodeQL's Graph Queries

Strictly speaking, CodeQL isn't GNN, but the philosophy is similar—model code as a graph, find vulnerabilities with declarative queries.

// Find SQL injection (simplified sketch: a real query declares its sources and sinks in a data-flow configuration)
from Call call, DataFlow::PathNode source, DataFlow::PathNode sink
where
  call.getTarget().hasName("query") and
  source.isSource() and
  sink.isSink() and
  DataFlow::localFlow(source, sink)
select sink, "SQL injection from $@", source, "user input"

CodeQL's advantage is maintainable rules. Disadvantage: can't automatically learn new vulnerability patterns.

7.3 Smart Contracts

Ethereum smart contracts are an excellent GNN application scenario. Code is short (usually hundreds of lines), vulnerability patterns are well-defined (reentrancy, integer overflow, etc.).

Tool recommendations:

Slither: Static analysis framework
Mythril: Symbolic execution
GNNSCVulDetector: GNN-specific

I've used GNN on Solidity code—reentrancy vulnerability detection accuracy can reach 92%+. Main reason: contract control flow is simpler, graphs aren't as complex.

7.4 Google Big Sleep: AI Agent Discovers Real 0-day

The most shocking case of 2025. Google's Big Sleep project (Gemini-powered AI Agent) discovered a real 0-day vulnerability in SQLite (CVE-2025-6965).

This was an exploitable stack buffer underflow that traditional fuzzers and manual audits had missed. Big Sleep's approach:

Use Gemini to understand SQLite code semantics
Build code dependency graphs for data flow tracking
Generate targeted PoC to trigger the vulnerability

Why it matters: This is the first time AI discovered a real vulnerability in a major open-source project. Not a lab environment, not a synthetic vulnerability—a genuine CVE-worthy finding.

Technical detail: Big Sleep isn't pure GNN, but rather an LLM + graph analysis hybrid system. LLM handles semantic understanding, graph analysis handles data flow and control flow tracking.

8. Pitfall Avoidance Guide

8.1 Data Leakage

Most common mistake: overlap between training and test sets.

Big-Vul may contain the same vulnerability multiple times (different commits fixing the same bug). Random splitting severely overestimates model performance.

Correct approach: Deduplicate by CVE or commit hash, then split.
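
A minimal sketch of that procedure, assuming each sample carries a group key (a CVE ID or commit hash): drop duplicates after whitespace normalization, then assign whole groups to either train or test. The field names and the placeholder CVE labels are hypothetical.

```python
import hashlib
import random
import re
from collections import defaultdict


def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted copies hash the same."""
    return re.sub(r"\s+", " ", code).strip()


def dedupe_and_split(samples, test_ratio=0.2, seed=0):
    """samples: list of dicts with 'code', 'label', and 'cve' (or commit hash).

    1. Drop duplicates (after whitespace normalization).
    2. Assign whole CVE groups to either train or test, never both.
    """
    seen, unique = set(), []
    for s in samples:
        key = hashlib.sha256(normalize(s["code"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)

    groups = defaultdict(list)
    for s in unique:
        groups[s["cve"]].append(s)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_ratio))
    test_ids = set(group_ids[:n_test])

    train = [s for g in group_ids if g not in test_ids for s in groups[g]]
    test = [s for g in group_ids if g in test_ids for s in groups[g]]
    return train, test


samples = [
    {"code": "strcpy(a, b);", "label": 1, "cve": "CVE-A"},
    {"code": "strcpy(a,  b);", "label": 1, "cve": "CVE-A"},   # whitespace-only duplicate
    {"code": "strncpy(a, b, n);", "label": 0, "cve": "CVE-A"},
    {"code": "free(p); use(p);", "label": 1, "cve": "CVE-B"},
    {"code": "free(p); p = NULL;", "label": 0, "cve": "CVE-B"},
    {"code": "gets(buf);", "label": 1, "cve": "CVE-C"},
]

train, test = dedupe_and_split(samples)
print(len(train), "train /", len(test), "test")
```

Splitting by group keeps the vulnerable function and its patched twin on the same side of the split, which is exactly the pair a random split tends to separate.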

8.2 Label Noise

Draper dataset labels come from static analyzers. Problem: static analysis itself has false positives.

Models trained on noisy labels won't perform much better than static analysis. Chicken-and-egg problem.

My approach: Cross-validate with multiple datasets, keep only labels consistent across datasets.
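
A minimal sketch of that cross-check, assuming functions can be matched across datasets by a hash of their whitespace-normalized body. The consistent_labels helper and the toy datasets are hypothetical, and requiring a function to appear in at least two datasets is one possible reading of "consistent"; a looser variant could keep single-source samples at lower weight.

```python
import hashlib
import re


def fingerprint(code: str) -> str:
    """Hash a function body after collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def consistent_labels(*datasets):
    """Each dataset is a list of (code, label) pairs.

    Keep a function only if it appears in at least two datasets and
    every dataset that contains it assigns the same label.
    Returns {fingerprint: (code, label)}.
    """
    votes = {}
    for dataset in datasets:
        seen_here = set()
        for code, label in dataset:
            fp = fingerprint(code)
            if fp in seen_here:        # count each dataset's vote once
                continue
            seen_here.add(fp)
            entry = votes.setdefault(fp, {"code": code, "labels": []})
            entry["labels"].append(label)

    return {
        fp: (entry["code"], entry["labels"][0])
        for fp, entry in votes.items()
        if len(entry["labels"]) >= 2 and len(set(entry["labels"])) == 1
    }


dataset_a = [("strcpy(buf, s);", 1), ("memset(p, 0, n);", 0), ("gets(line);", 1)]
dataset_b = [("strcpy(buf,  s);", 1), ("memset(p, 0, n);", 1)]

clean = consistent_labels(dataset_a, dataset_b)
for code, label in clean.values():
    print(label, code)
```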

8.3 Graph Size Issues

CPGs can have hundreds of thousands of nodes. Feeding directly to GNN will blow GPU memory.

Solutions:

Graph slicing (discussed earlier)
Graph sampling (sample k-hop neighbors only)
Hierarchical processing (function-level first, then file-level)
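
The graph-sampling option can be sketched with a breadth-first search that keeps only nodes within k hops of a seed node, such as a sink. This version treats edges as undirected and uses plain Python; the k_hop_subgraph name mirrors the similar utility in torch_geometric.utils, but the code below is an independent illustration.

```python
from collections import deque


def k_hop_subgraph(edges, seed, k):
    """Return (nodes, edges) within k undirected hops of the seed node."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:           # do not expand beyond k hops
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)

    kept = set(depth)
    kept_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept, kept_edges


# Toy graph: a chain of 10 nodes, sampled around node 5
chain_edges = [(i, i + 1) for i in range(9)]
sub_nodes, sub_edges = k_hop_subgraph(chain_edges, seed=5, k=2)
print(sorted(sub_nodes), sub_edges)
```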

8.4 Over-smoothing

Too many GNN layers cause node representations to converge.

Symptoms: Training loss drops but validation performance plateaus.

Solutions:

Add residual connections
Use Jumping Knowledge
Limit layers (4-6)
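
Over-smoothing is easy to reproduce on a toy graph. The sketch below repeatedly mean-aggregates scalar features on a 6-node ring and measures the spread (max minus min) across nodes, with and without a residual term. The residual used here is an initial-residual variant that mixes each node's original feature back in after every layer; it is one simple form of residual connection, chosen for the illustration, and not the exact skip connection from the Appendix B model.

```python
def propagate(features, edges, num_layers, residual_weight=0.0):
    """Mean-aggregate over undirected neighbors for num_layers rounds.

    residual_weight mixes each node's ORIGINAL feature back in after every
    layer (an initial-residual connection); 0.0 disables it.
    """
    neighbors = {n: {n} for n in features}           # include a self-loop
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    current = dict(features)
    for _ in range(num_layers):
        aggregated = {
            n: sum(current[m] for m in neighbors[n]) / len(neighbors[n])
            for n in current
        }
        current = {
            n: (1 - residual_weight) * aggregated[n] + residual_weight * features[n]
            for n in current
        }
    return current


def spread(values):
    """Max minus min across nodes: near 0 means nodes are indistinguishable."""
    return max(values.values()) - min(values.values())


# Small ring of 6 nodes with distinct scalar features
ring_edges = [(i, (i + 1) % 6) for i in range(6)]
init = {i: float(i) for i in range(6)}

for layers in (2, 8, 32):
    plain = spread(propagate(init, ring_edges, layers))
    skip = spread(propagate(init, ring_edges, layers, residual_weight=0.2))
    print(f"{layers:2d} layers: spread plain={plain:.4f}  with residual={skip:.4f}")
```

Tracking a spread or pairwise-similarity statistic like this on real node embeddings, layer by layer, is a cheap way to confirm over-smoothing before changing the architecture.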

9. Future Outlook

9.1 Does GNN Have a Future in the LLM Era?

Yes, but positioning must change.

Pure GNN will struggle to beat fine-tuned LLMs in general vulnerability detection. But in these scenarios, GNN remains irreplaceable:

Real-time scanning: 20x faster inference
Explainability: Attention weights are traceable
Data flow analysis: LLMs struggle with long-range dependencies

My prediction: The future is LLM for coarse screening, GNN for fine ranking—a hybrid pipeline.

9.2 Technical Trends

Worth watching:

Self-supervised graph learning: Reduce labeled data dependency
Dynamic graph neural networks: Handle code evolution
Neural-symbolic hybrids: Combine rule reasoning with deep learning

9.3 GNN+LLM Fusion: Two Technical Paths

The 2024-2025 research hotspot is deep fusion of GNN and LLM. Two mainstream approaches have emerged:

Path One: Graph-for-LLM (Graph-Enhanced LLM)

Use GNN-extracted structural features as additional LLM input
Example: Use CPG node embeddings as soft prompts
Advantage: LLM gains structure awareness, reduces hallucination

Path Two: LLM-Augmented GNN

Use LLM semantic embeddings to initialize GNN node features
Example: Use CodeBERT/StarCoder to encode code tokens
Advantage: GNN gains better semantic understanding, not just structure

Practical recommendation: Start with CodeBERT for node embedding (free, fast), then use GAT for graph learning. This combination is the 2025 sweet spot for cost-effectiveness.

10. Summary and Action Items

For Security Engineers

Don't believe in silver bullets: Neither GNN nor LLM is universal
Start with Joern: Convert existing code to graphs, accumulate data
Begin with small datasets: Practice GAT on 10K samples
Watch hybrid architectures: Vul-LMGNNs, GRACE worth trying

For Researchers

Stop benchmarking on SARD: Use Big-Vul or DiverseVul
Fair comparison matters: Same dataset, same preprocessing
Focus on industrial deployment: 99% accuracy means nothing if inference takes 1 minute

Code and Resources

Code, dataset links, and paper references mentioned in this article are compiled on GitHub:

https://github.com/sgInnora/gnn-vuln-detection

Stars and issues welcome.

References

Yamaguchi, F., et al. "Modeling and Discovering Vulnerabilities with Code Property Graphs." IEEE S&P 2014.
Zhou, Y., et al. "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks." NeurIPS 2019.
Cheng, X., et al. "DeepWukong: Statically Detecting Software Vulnerabilities Using Deep Graph Neural Network." ACM TOSEM 2021.
Li, Y., et al. "Vulnerability Detection with Fine-Grained Interpretations." ESEC/FSE 2021.
Wang, H., et al. "Combining Graph-Based Learning with Automated Data Collection for Code Vulnerability Detection." IEEE TIFS 2021.

Complete list of 55 papers available in the GitHub repository and in Appendix A below.

Appendix A: Complete List of 55 Papers

A.1 Foundational Theory and Methodology (18 Papers)

A.2 Pre-trained Models and Transformer Methods (12 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| CodeBERT | Feng et al., Microsoft | EMNLP 2020 | BERT + Code | Bimodal pre-training |
| GraphCodeBERT | Guo et al., Microsoft | ICLR 2021 | CodeBERT + DFG | Data flow graph integration |
| UniXcoder | Guo et al. | ACL 2022 | Unified | Multi-task unified model |
| LineVul | Fu et al. | MSR 2022 | Transformer | Line-level localization |
| Vul-LMGNNs | Anonymous | arXiv 2023 | LLM + GNN | LLM knowledge distillation |
| GRACE | Anonymous | arXiv 2024 | GCN + Contrastive | Residual + contrastive learning |
| CodeT5+ | Wang et al., Salesforce | EMNLP 2023 | T5 + Code | Encoder-decoder pre-training |
| StarCoder | BigCode Team | 2023 | 15B Model | Large-scale code model |
| DeepSeek-Coder | DeepSeek | 2024 | 33B Model | Fill-in-middle training |
| Magicoder | Wei et al. | 2024 | OSS-Instruct | Open-source instruction tuning |
| WizardCoder | Luo et al. | 2023 | Evol-Instruct | Code evolutionary instruction |
| SonarQube ML | SonarSource | 2024 | Hybrid | Industrial grade integration |

A.3 Smart Contract and Blockchain Security (8 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| SmartEmbed | Gao et al. | TOSEM 2020 | Code Embedding | Solidity similarity |
| GNNSCVulDetector | Zhuang et al. | IJCNN 2020 | GCN | Multi-contract analysis |
| TMP | Zhuang et al. | USENIX 2021 | Temporal GNN | Reentrancy detection |
| Peculiar | Wu et al. | ICSE 2021 | Code Property | Ponzi scheme detection |
| SCGformer | Li et al. | 2023 | Transformer | Cross-contract semantics |
| BlockScope | Huang et al. | ISSTA 2023 | GNN + Static | Gas optimization analysis |
| GPTScan | Sun et al. | 2024 | GPT + Static | Logic bug detection |
| AuditGPT | Anonymous | 2024 | LLM Audit | Automated audit reports |

A.4 Binary and Low-Level Code Analysis (7 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| PALMTREE | Li et al. | CCS 2021 | Instruction Embedding | Assembly pre-training |
| VulHunter | Xu et al. | ASE 2020 | Binary GNN | Firmware analysis |
| Gemini | Xu et al. | CCS 2017 | Siamese Network | Binary similarity |
| SAFE | Massarelli et al. | DIMVA 2019 | Self-Attentive | Assembly self-attention |
| OrderMatters | Yu et al. | NDSS 2020 | Instruction Order | Binary semantics |
| BinGo | Chandramohan et al. | NDSS 2016 | Partial Match | Selective inlining |
| αDiff | Liu et al. | ASE 2018 | Deep Learning | Cross-version diff |

A.5 Industrial Tools and Applications (5 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| Joern | Yamaguchi et al. | IEEE S&P 2014 | CPG | De facto standard |
| WeiZhen (维阵) | Aurora Infinite | Industry 2022 | GGNN + CPG | Commercial platform |
| Slither | Trail of Bits | 2019 | Pattern Match | Solidity static analysis |
| Mythril | ConsenSys | 2017 | Symbolic Exec | Ethereum security |
| CodeQL | GitHub | 2019 | Declarative Query | Graph query language |

A.6 Program Slicing and Explainability (5 Papers)

| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| CFExplainer | Liu et al. | ISSTA 2024 | Counterfactual | "Change this line to fix" |
| VulGCANet | Chen et al. | TrustCom 2024 | GCN + GAT | Long-range dependencies |
| HAGNN | Anonymous | arXiv 2025 | Heterogeneous Attention | Multi-type nodes/edges |
| Big Sleep | Google | 2025 | Gemini + Graph | First real 0-day by AI |
| XVulExplain | Wang et al. | 2024 | SHAP + GNN | Feature attribution |

Appendix B: Complete Code Implementation

This section provides complete, runnable implementations of core GNN components for vulnerability detection.

B.1 GGNN (Gated Graph Neural Network) Implementation

"""
GGNN-based Vulnerability Detector
Based on DEVIGN (NeurIPS 2019) architecture
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing, global_mean_pool

class GGNNLayer(MessagePassing):
    """Single GGNN layer with GRU-style gating"""

    def __init__(self, hidden_dim: int, num_edge_types: int = 4):
        super().__init__(aggr='add')  # Sum aggregation
        self.hidden_dim = hidden_dim
        self.num_edge_types = num_edge_types

        # Edge-specific transformation matrices
        self.edge_mlps = nn.ModuleList([
            nn.Linear(hidden_dim, hidden_dim, bias=False)
            for _ in range(num_edge_types)
        ])

        # GRU gate
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, x, edge_index, edge_type):
        # Message passing for each edge type
        aggregated = torch.zeros_like(x)
        for etype in range(self.num_edge_types):
            mask = edge_type == etype
            if mask.sum() > 0:
                edge_idx = edge_index[:, mask]
                msg = self.edge_mlps[etype](x)
                aggregated += self.propagate(edge_idx, x=msg)

        # GRU update: aggregated messages are the input, current state is the hidden state
        return self.gru(aggregated, x)

    def message(self, x_j):
        return x_j


class DevignModel(nn.Module):
    """
    DEVIGN: Effective Vulnerability Identification
    Architecture: GGNN backbone + Conv1D readout
    """

    def __init__(
        self,
        input_dim: int = 128,
        hidden_dim: int = 256,
        num_layers: int = 6,
        num_edge_types: int = 4,  # AST, CFG, DFG, CDG
        num_classes: int = 2,
        dropout: float = 0.3
    ):
        super().__init__()

        # Input projection
        self.input_proj = nn.Linear(input_dim, hidden_dim)

        # GGNN layers
        self.ggnn_layers = nn.ModuleList([
            GGNNLayer(hidden_dim, num_edge_types)
            for _ in range(num_layers)
        ])

        # Conv1D for sequence modeling
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=1)

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, num_classes)
        )

    def forward(self, x, edge_index, edge_type, batch):
        # Project input features
        h = self.input_proj(x)

        # GGNN message passing
        for layer in self.ggnn_layers:
            h = layer(h, edge_index, edge_type)

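        # Conv1D over the node axis (requires reshaping). Note: nodes from all
        # graphs in the batch are concatenated here, so the kernel_size=3
        # convolution also mixes nodes across graph boundaries.
        h = h.unsqueeze(0).permute(0, 2, 1)  # [1, hidden, nodes]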
        h = F.relu(self.conv1(h))
        h = self.conv2(h)
        h = h.permute(0, 2, 1).squeeze(0)  # [nodes, hidden]

        # Graph-level pooling
        graph_emb = global_mean_pool(h, batch)

        return self.classifier(graph_emb)


# Usage example
if __name__ == "__main__":
    model = DevignModel(input_dim=128, hidden_dim=256, num_layers=6)

    # Simulated input (100 nodes, batch of 4 graphs)
    x = torch.randn(100, 128)
    edge_index = torch.randint(0, 100, (2, 300))
    edge_type = torch.randint(0, 4, (300,))
    batch = torch.repeat_interleave(torch.arange(4), 25)

    output = model(x, edge_index, edge_type, batch)
    print(f"Output shape: {output.shape}")  # [4, 2]
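
For reference, the update that `GGNNLayer.forward` computes can be written out as below. This is a reading of the listing above, not a restatement of the DEVIGN paper: the symbols $\mathcal{N}_k(v)$ (source nodes of type-$k$ edges pointing into $v$), $W_k^{(t)}$ (the weight of `edge_mlps[k]` in layer $t$) and $K$ (`num_edge_types`) are notation introduced here for illustration.

```latex
\begin{aligned}
m_v^{(t)} &= \sum_{k=0}^{K-1} \; \sum_{u \in \mathcal{N}_k(v)} W_k^{(t)} \, h_u^{(t)} \\
h_v^{(t+1)} &= \mathrm{GRU}\!\left(m_v^{(t)},\; h_v^{(t)}\right)
\end{aligned}
```

Because `DevignModel` stacks `num_layers` separate `GGNNLayer` instances, each step $t$ has its own weights in this listing, whereas a classic GGNN would typically share one set of weights across all propagation steps.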

B.2 CodeBERT Node Embedding

"""
CodeBERT-based node feature initialization
Using Microsoft's pre-trained CodeBERT model
"""

import torch
from transformers import AutoTokenizer, AutoModel
from typing import List

class CodeBERTEmbedder:
    """Extract code embeddings using CodeBERT"""

    def __init__(
        self,
        model_name: str = "microsoft/codebert-base",
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
        max_length: int = 512
    ):
        self.device = device
        self.max_length = max_length

        # Load pre-trained model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.model.eval()

    @torch.no_grad()
    def encode(self, code_snippets: List[str]) -> torch.Tensor:
        """
        Encode code snippets to embedding vectors

        Args:
            code_snippets: List of code strings (one per node)

        Returns:
            Tensor of shape [num_nodes, 768]
        """
        embeddings = []

        for code in code_snippets:
            # Tokenize
            inputs = self.tokenizer(
                code,
                return_tensors="pt",
                max_length=self.max_length,
                truncation=True,
                padding="max_length"
            ).to(self.device)

            # Embedding of the first token (<s>, RoBERTa's equivalent of [CLS])
            outputs = self.model(**inputs)
            cls_emb = outputs.last_hidden_state[:, 0, :]  # [1, 768]
            embeddings.append(cls_emb)

        return torch.cat(embeddings, dim=0)

    def encode_batch(
        self,
        code_snippets: List[str],
        batch_size: int = 32
    ) -> torch.Tensor:
        """Batch encoding for large datasets"""
        all_embeddings = []

        for i in range(0, len(code_snippets), batch_size):
            batch = code_snippets[i:i + batch_size]

            inputs = self.tokenizer(
                batch,
                return_tensors="pt",
                max_length=self.max_length,
                truncation=True,
                padding=True
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model(**inputs)
                cls_emb = outputs.last_hidden_state[:, 0, :]
                all_embeddings.append(cls_emb.cpu())

        return torch.cat(all_embeddings, dim=0)


# Usage example
if __name__ == "__main__":
    embedder = CodeBERTEmbedder()

    code_samples = [
        "void foo() { int x = 0; }",
        "char* strcpy(char* dest, const char* src);",
        "if (ptr != NULL) { free(ptr); }"
    ]

    embeddings = embedder.encode(code_samples)
    print(f"Embeddings shape: {embeddings.shape}")  # [3, 768]
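
One optional optimization, not part of the listing above: CPG nodes frequently share identical code strings (the same identifier can appear at many nodes), so it may be worth embedding each distinct string once and mapping the rows back to nodes. A minimal standard-library sketch, where `dedupe_snippets` is a hypothetical helper name:

```python
from typing import Dict, List, Tuple


def dedupe_snippets(snippets: List[str]) -> Tuple[List[str], List[int]]:
    """Return (unique, inverse) such that snippets[i] == unique[inverse[i]]."""
    first_seen: Dict[str, int] = {}
    unique: List[str] = []
    inverse: List[int] = []
    for code in snippets:
        if code not in first_seen:
            # First occurrence: assign the next free row index.
            first_seen[code] = len(unique)
            unique.append(code)
        inverse.append(first_seen[code])
    return unique, inverse


if __name__ == "__main__":
    node_code = ["input", "strlen(input)", "input", "buffer", "input"]
    unique, inverse = dedupe_snippets(node_code)
    print(unique)   # ['input', 'strlen(input)', 'buffer']
    print(inverse)  # [0, 1, 0, 2, 0]
```

With the embedder above, something like `embedder.encode_batch(unique)[inverse]` should then restore one row per node, since indexing a tensor with a list of integers gathers rows in that order.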

B.3 Graph Slicing (Backward Slicing)

"""
CPG Graph Slicing Implementation
Based on DeepWukong (TOSEM 2021) methodology
"""

from typing import List, Dict, Set, Tuple
from dataclasses import dataclass
import networkx as nx

@dataclass
class CPGNode:
    """CPG node representation"""
    id: int
    type: str  # 'call', 'identifier', 'literal', etc.
    code: str
    line_number: int

@dataclass
class CPGEdge:
    """CPG edge representation"""
    src: int
    dst: int
    type: str  # 'AST', 'CFG', 'DFG', 'CDG'

class CPGSlicer:
    """Backward slicing from sensitive sinks"""

    # Sensitive sink functions by vulnerability type
    SINKS = {
        'buffer_overflow': ['strcpy', 'strcat', 'sprintf', 'gets', 'memcpy'],
        'command_injection': ['system', 'popen', 'exec', 'execve', 'execl'],
        'format_string': ['printf', 'fprintf', 'sprintf', 'snprintf'],
        'memory_leak': ['malloc', 'calloc', 'realloc', 'strdup'],
        'use_after_free': ['free'],  # Track pointers after free
        'sql_injection': ['mysql_query', 'sqlite3_exec', 'PQexec'],
    }

    def __init__(
        self,
        nodes: List[CPGNode],
        edges: List[CPGEdge],
        max_depth: int = 5
    ):
        self.nodes = {n.id: n for n in nodes}
        self.max_depth = max_depth

        # Build graph
        self.graph = nx.DiGraph()
        for node in nodes:
            self.graph.add_node(node.id, data=node)

        for edge in edges:
            self.graph.add_edge(edge.src, edge.dst, type=edge.type)

        # Build reverse graph for backward slicing
        self.reverse_graph = self.graph.reverse()

    def find_sinks(self, vuln_type: str = 'all') -> List[int]:
        """Find all sink nodes in the graph"""
        sinks = []

        if vuln_type == 'all':
            sink_names = set()
            for names in self.SINKS.values():
                sink_names.update(names)
        else:
            sink_names = set(self.SINKS.get(vuln_type, []))

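        for node_id, node in self.nodes.items():
            if node.type == 'call':
                # Compare the callee name exactly; a substring test would also
                # flag e.g. fgets() as gets() or freeaddrinfo() as free().
                # Assumes node.code starts with the callee name.
                callee = node.code.split('(', 1)[0].strip()
                if callee in sink_names:
                    sinks.append(node_id)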

        return sinks

    def backward_slice(
        self,
        sink_id: int,
        edge_types: Set[str] = {'DFG', 'CDG'}
    ) -> Set[int]:
        """
        Perform backward slicing from a sink node

        Args:
            sink_id: ID of the sink node
            edge_types: Edge types to follow (DFG for data flow, CDG for control)

        Returns:
            Set of node IDs in the slice
        """
        visited = set()
        queue = [(sink_id, 0)]  # (node_id, depth)

        while queue:
            node_id, depth = queue.pop(0)

            if node_id in visited or depth > self.max_depth:
                continue

            visited.add(node_id)

            # Traverse backward edges: successors in the reversed graph are
            # the predecessors of node_id in the original CPG
            for pred in self.reverse_graph.successors(node_id):
                edge_data = self.graph[pred][node_id]
                if edge_data['type'] in edge_types:
                    queue.append((pred, depth + 1))

        return visited

    def slice_for_vulnerability(
        self,
        vuln_type: str = 'all'
    ) -> List[Tuple[int, Set[int]]]:
        """
        Generate slices for all sinks of a given vulnerability type

        Returns:
            List of (sink_id, slice_node_set) tuples
        """
        sinks = self.find_sinks(vuln_type)
        slices = []

        for sink_id in sinks:
            slice_nodes = self.backward_slice(sink_id)
            slices.append((sink_id, slice_nodes))

        return slices

    def extract_subgraph(
        self,
        node_ids: Set[int]
    ) -> Tuple[List[CPGNode], List[CPGEdge]]:
        """Extract subgraph containing only specified nodes"""
        nodes = [self.nodes[nid] for nid in node_ids if nid in self.nodes]
        edges = []

        for src, dst, data in self.graph.edges(data=True):
            if src in node_ids and dst in node_ids:
                edges.append(CPGEdge(src, dst, data['type']))

        return nodes, edges


# Joern integration helper
def joern_query_to_slice(joern_result: str) -> List[int]:
    """
    Parse Joern query results to get slice nodes

    Joern query example:
    cpg.call.name("strcpy").reachableByFlows(cpg.method.parameter, 5).id.l
    """
    # Parse Joern output format
    import re
    ids = re.findall(r'\d+', joern_result)
    return [int(i) for i in ids]


# Usage example
if __name__ == "__main__":
    # Create sample CPG
    nodes = [
        CPGNode(0, 'param', 'char* input', 1),
        CPGNode(1, 'identifier', 'input', 2),
        CPGNode(2, 'call', 'strlen(input)', 2),
        CPGNode(3, 'identifier', 'buffer', 3),
        CPGNode(4, 'call', 'strcpy(buffer, input)', 4),  # SINK
        CPGNode(5, 'return', 'return 0', 5),
    ]

    edges = [
        CPGEdge(0, 1, 'DFG'),  # input def -> use
        CPGEdge(1, 2, 'DFG'),  # input -> strlen
        CPGEdge(1, 4, 'DFG'),  # input -> strcpy (TAINT)
        CPGEdge(3, 4, 'DFG'),  # buffer -> strcpy
        CPGEdge(2, 4, 'CDG'),  # strlen -> strcpy (control)
        CPGEdge(4, 5, 'CFG'),  # strcpy -> return
    ]

    slicer = CPGSlicer(nodes, edges, max_depth=5)

    # Find buffer overflow sinks
    sinks = slicer.find_sinks('buffer_overflow')
    print(f"Found sinks: {sinks}")  # [4]

    # Backward slice from sink
    slices = slicer.slice_for_vulnerability('buffer_overflow')
    for sink_id, slice_nodes in slices:
        print(f"Sink {sink_id}: slice contains {len(slice_nodes)} nodes")
        print(f"  Nodes: {slice_nodes}")
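
To feed a slice into a GNN, its node IDs have to be re-indexed to `0..n-1` and its edges turned into index lists. The sketch below shows one way to do that with plain Python lists. The helper name `slice_to_edge_lists`, the tuple-based edge format, and the `EDGE_TYPE_IDS` mapping are assumptions made for illustration (the mapping follows the AST, CFG, DFG, CDG order from the comment in B.1).

```python
from typing import Dict, List, Set, Tuple

# Assumed integer ids for edge types, in the order AST, CFG, DFG, CDG.
EDGE_TYPE_IDS: Dict[str, int] = {"AST": 0, "CFG": 1, "DFG": 2, "CDG": 3}


def slice_to_edge_lists(
    slice_nodes: Set[int],
    edges: List[Tuple[int, int, str]],
) -> Tuple[List[int], List[List[int]], List[int]]:
    """Re-index a slice to 0..n-1 and emit COO-style edge lists.

    Returns (node_order, edge_index, edge_type): node_order[i] is the
    original CPG id of local node i, edge_index is [sources, targets],
    and edge_type[j] is the integer type of edge j.
    """
    node_order = sorted(slice_nodes)
    local_id = {orig: i for i, orig in enumerate(node_order)}

    sources: List[int] = []
    targets: List[int] = []
    edge_type: List[int] = []
    for src, dst, etype in edges:
        # Keep only edges whose endpoints both survive the slice.
        if src in local_id and dst in local_id:
            sources.append(local_id[src])
            targets.append(local_id[dst])
            edge_type.append(EDGE_TYPE_IDS[etype])

    return node_order, [sources, targets], edge_type


if __name__ == "__main__":
    # The toy CPG from the usage example, written as (src, dst, type) tuples.
    toy_edges = [
        (0, 1, "DFG"), (1, 2, "DFG"), (1, 4, "DFG"),
        (3, 4, "DFG"), (2, 4, "CDG"), (4, 5, "CFG"),
    ]
    # An illustrative data-flow-only node set around the strcpy sink.
    order, edge_index, edge_type = slice_to_edge_lists({0, 1, 3, 4}, toy_edges)
    print(order)       # [0, 1, 3, 4]
    print(edge_index)  # [[0, 1, 2], [1, 3, 3]]
    print(edge_type)   # [2, 2, 2]
```

If PyTorch is available, wrapping the results as `torch.tensor(edge_index, dtype=torch.long)` and `torch.tensor(edge_type, dtype=torch.long)` should give the `[2, num_edges]` and `[num_edges]` layouts that `DevignModel.forward` takes.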

B.4 Complete Training Pipeline

"""
Complete GNN Vulnerability Detection Training Pipeline
Includes: data loading, training loop, evaluation, checkpointing
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # DataLoader lives here since PyG 2.0
from torch_geometric.nn import GATConv, global_mean_pool
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from typing import Dict, List, Tuple
import numpy as np
from tqdm import tqdm

class FocalLoss(nn.Module):
    """Focal Loss for class imbalance"""

    def __init__(self, gamma: float = 2.0, alpha: float = 0.75):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

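    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        ce_loss = F.cross_entropy(pred, target, reduction='none')
        pt = torch.exp(-ce_loss)  # probability assigned to the true class
        # Binary case: weight the vulnerable class (1) by alpha and the safe
        # class (0) by 1 - alpha; a uniform alpha would only rescale the loss
        t = target.float()
        alpha_t = self.alpha * t + (1 - self.alpha) * (1 - t)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()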


class VulnerabilityDetector(nn.Module):
    """GAT-based Vulnerability Detector with residual connections"""

    def __init__(
        self,
        input_dim: int = 768,  # CodeBERT dimension
        hidden_dim: int = 256,
        num_heads: int = 8,
        num_layers: int = 4,
        dropout: float = 0.3,
        num_classes: int = 2
    ):
        super().__init__()

        # Input projection
        self.input_proj = nn.Linear(input_dim, hidden_dim)

        # GAT layers with residual connections
        self.gat_layers = nn.ModuleList()
        self.layer_norms = nn.ModuleList()

        for i in range(num_layers):
            in_channels = hidden_dim if i == 0 else hidden_dim * num_heads
            self.gat_layers.append(
                GATConv(in_channels, hidden_dim, heads=num_heads, dropout=dropout)
            )
            self.layer_norms.append(nn.LayerNorm(hidden_dim * num_heads))

        # Final projection
        self.final_gat = GATConv(hidden_dim * num_heads, hidden_dim, heads=1, dropout=dropout)

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, num_classes)
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, edge_index, batch):
        # Input projection
        h = self.input_proj(x)

        # First GAT layer (no residual - dimension change)
        h = self.gat_layers[0](h, edge_index)
        h = F.elu(h)
        h = self.layer_norms[0](h)
        h = self.dropout(h)

        # Remaining GAT layers with residual connections
        for i in range(1, len(self.gat_layers)):
            h_res = h
            h = self.gat_layers[i](h, edge_index)
            h = F.elu(h)
            h = self.layer_norms[i](h)
            h = self.dropout(h)
            h = h + h_res  # Residual connection

        # Final GAT layer
        h = self.final_gat(h, edge_index)
        h = F.elu(h)

        # Graph-level pooling
        graph_emb = global_mean_pool(h, batch)

        return self.classifier(graph_emb)


class VulnerabilityTrainer:
    """Training pipeline with evaluation and checkpointing"""

    def __init__(
        self,
        model: nn.Module,
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
        learning_rate: float = 1e-4,
        weight_decay: float = 1e-5
    ):
        self.model = model.to(device)
        self.device = device

        # Loss and optimizer
        self.criterion = FocalLoss(gamma=2.0, alpha=0.75)
        self.optimizer = AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay
        )

        # Metrics tracking
        self.best_f1 = 0.0
        self.history = {'train_loss': [], 'val_f1': []}

    def train_epoch(self, train_loader: DataLoader) -> float:
        """Train for one epoch"""
        self.model.train()
        total_loss = 0.0

        for batch in tqdm(train_loader, desc="Training"):
            batch = batch.to(self.device)

            self.optimizer.zero_grad()
            output = self.model(batch.x, batch.edge_index, batch.batch)
            loss = self.criterion(output, batch.y)

            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()

            total_loss += loss.item()

        return total_loss / len(train_loader)

    @torch.no_grad()
    def evaluate(self, val_loader: DataLoader) -> Dict[str, float]:
        """Evaluate model on validation set"""
        self.model.eval()

        all_preds = []
        all_labels = []
        all_probs = []

        for batch in val_loader:
            batch = batch.to(self.device)
            output = self.model(batch.x, batch.edge_index, batch.batch)
            probs = F.softmax(output, dim=1)[:, 1]
            preds = output.argmax(dim=1)

            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch.y.cpu().numpy())
            all_probs.extend(probs.cpu().numpy())

        # Calculate metrics
        metrics = {
            'f1': f1_score(all_labels, all_preds),
            'precision': precision_score(all_labels, all_preds),
            'recall': recall_score(all_labels, all_preds),
            'auc': roc_auc_score(all_labels, all_probs) if len(set(all_labels)) > 1 else 0.0
        }

        return metrics

    def train(
        self,
        train_loader: DataLoader,
        val_loader: DataLoader,
        num_epochs: int = 100,
        patience: int = 10,
        checkpoint_path: str = "best_model.pt"
    ) -> Dict:
        """Full training loop with early stopping"""

        scheduler = CosineAnnealingLR(self.optimizer, T_max=num_epochs)
        no_improve = 0

        for epoch in range(num_epochs):
            # Training
            train_loss = self.train_epoch(train_loader)
            self.history['train_loss'].append(train_loss)

            # Evaluation
            metrics = self.evaluate(val_loader)
            self.history['val_f1'].append(metrics['f1'])

            # Learning rate scheduling
            scheduler.step()

            # Logging
            print(f"Epoch {epoch+1}/{num_epochs}")
            print(f"  Train Loss: {train_loss:.4f}")
            print(f"  Val F1: {metrics['f1']:.4f}, Precision: {metrics['precision']:.4f}, "
                  f"Recall: {metrics['recall']:.4f}, AUC: {metrics['auc']:.4f}")

            # Checkpointing
            if metrics['f1'] > self.best_f1:
                self.best_f1 = metrics['f1']
                torch.save({
                    'epoch': epoch,
                    'model_state_dict': self.model.state_dict(),
                    'optimizer_state_dict': self.optimizer.state_dict(),
                    'best_f1': self.best_f1
                }, checkpoint_path)
                print(f"  Saved best model (F1: {self.best_f1:.4f})")
                no_improve = 0
            else:
                no_improve += 1

            # Early stopping
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

        return self.history


def create_synthetic_dataset(num_graphs: int = 1000) -> List[Data]:
    """Create synthetic dataset for testing"""
    dataset = []

    for _ in range(num_graphs):
        num_nodes = np.random.randint(50, 200)
        num_edges = np.random.randint(num_nodes, num_nodes * 3)

        # Random node features (simulating CodeBERT embeddings)
        x = torch.randn(num_nodes, 768)

        # Random edges
        edge_index = torch.randint(0, num_nodes, (2, num_edges))

        # Random label (0: safe, 1: vulnerable)
        y = torch.tensor([np.random.randint(0, 2)])

        dataset.append(Data(x=x, edge_index=edge_index, y=y))

    return dataset


# Main execution
if __name__ == "__main__":
    # Create synthetic dataset
    print("Creating synthetic dataset...")
    train_data = create_synthetic_dataset(800)
    val_data = create_synthetic_dataset(200)

    # Create data loaders
    train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_data, batch_size=32)

    # Initialize model and trainer
    model = VulnerabilityDetector(
        input_dim=768,
        hidden_dim=256,
        num_heads=8,
        num_layers=4
    )

    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

    trainer = VulnerabilityTrainer(model)

    # Train
    print("\nStarting training...")
    history = trainer.train(
        train_loader,
        val_loader,
        num_epochs=50,
        patience=10,
        checkpoint_path="vuln_detector_best.pt"
    )

    print(f"\nTraining complete. Best F1: {trainer.best_f1:.4f}")
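
To see what the `(1 - pt) ** gamma` factor in `FocalLoss` does, it can help to compute it by hand for a few samples. The sketch below uses only the standard library and leaves the class weight `alpha` out. `focal_term` is a hypothetical helper, and `p_true` stands for the probability the model assigns to the correct class (`pt` in the listing).

```python
import math


def focal_term(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one sample, without the alpha class weight."""
    ce = -math.log(p_true)                  # plain cross-entropy
    return (1.0 - p_true) ** gamma * ce     # down-weighted by (1 - p)^gamma


if __name__ == "__main__":
    # Easy, medium, and hard samples: the scaling factor is (1 - p)^2.
    for p in (0.95, 0.6, 0.1):
        ce = -math.log(p)
        fl = focal_term(p)
        print(f"p_true={p:.2f}  CE={ce:.4f}  focal={fl:.4f}  scale={fl / ce:.4f}")
```

With `gamma=2.0`, a confidently correct sample (`p_true=0.95`) keeps about 0.25% of its cross-entropy, while a badly misclassified one (`p_true=0.1`) keeps 81%, so gradient updates are dominated by the hard samples.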

Author: Jiqiang Feng (风宁) Contact: [email protected] | [email protected] GitHub: @sgInnora
