Graph Neural Networks for Vulnerability Mining: From Theory to Practice
Author: Jiqiang Feng (风宁) Email: [email protected] Published: January 12, 2026 Version: v1.0
Executive Summary
Code vulnerability detection is undergoing a paradigm shift. Large Language Models are powerful, no doubt. But Graph Neural Networks offer unique advantages in certain scenarios—particularly when you need to understand control flow and data dependencies.
This isn't another "GNN is amazing" hype piece. Based on 55 top-tier papers and hands-on testing of 7 open-source projects, I'll give you a practical assessment: what GNN excels at, where it falls short, and how to make the right choice in 2026.
Key Findings:
- GNN matches LLM performance on small-to-medium datasets (<100K functions), at 5-10x lower training cost
- For cross-function data flow tracking, GNN currently outperforms pure Transformer architectures
- The real breakthrough lies in hybrid architectures (GNN+LLM), with reported accuracy above 96% (HAGNN reports 96.6% on C)
1. Why Code Vulnerability Detection Needs Graphs
Here's a simple question: Static analysis tools have been around for decades. Why bother with deep learning?
The answer? False positive rates are killing us.
With commercial tools such as Coverity and Fortify, false positive rates above 50% are common. Security engineers spend hours every day confirming "this is NOT a vulnerability." Deep learning's value is that it learns vulnerability patterns from historical data instead of rigidly matching rules.
So why graphs? Code is inherently structured data.
Look at this C code:
void vulnerable(char *input) {
    char buffer[64];
    strcpy(buffer, input);  // Dangerous!
}
From a text perspective, it's just a few strings. But from a graph perspective:
- Control Flow Graph (CFG): Entry → strcpy call → Exit
- Data Flow Graph (DFG): input → strcpy → buffer (taint propagation path)
- Program Dependency Graph (PDG): Combination of control and data dependencies
Encode these relationships into a graph, and GNN can "understand" the vulnerability's essence—untrusted input reaching a dangerous function.
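As a minimal sketch of that idea, the snippet below hand-encodes the data-flow edges of the example function and checks whether untrusted input can reach a dangerous call. The node names, the `dfg` dictionary, and the `reaches_sink` helper are illustrative assumptions, not output from any real tool.

```python
from collections import deque

# Hand-written data-flow edges for vulnerable(); a real pipeline would
# extract these from a code graph instead of typing them in.
dfg = {
    "param:input": ["call:strcpy"],
    "call:strcpy": ["var:buffer"],
    "var:buffer": [],
}
DANGEROUS_CALLS = {"call:strcpy"}


def reaches_sink(graph, source, sinks):
    """Breadth-first search: first dangerous call reachable from source, or None."""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node in sinks:
            return node
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


hit = reaches_sink(dfg, "param:input", DANGEROUS_CALLS)
print(hit)  # call:strcpy
```

A GNN learns a soft version of this reachability check from labeled examples, rather than relying on a fixed list of dangerous calls.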
2. Code Property Graph: Turning Code into Graphs
2.1 The CPG: Three Graphs in One
In 2014, Fabian Yamaguchi and colleagues introduced the Code Property Graph (CPG) concept. Simply put, it combines the AST, CFG, and PDG into a single graph.
CPG = AST + CFG + PDG
Why merge them? Each graph alone has blind spots:
- AST only captures syntax, not execution order
- CFG only shows control flow, not variable propagation
- PDG lacks syntactic details
After fusion, nodes contain complete semantic information. This is fundamental for GNN to work effectively.
2.2 Joern: The De Facto Standard
When it comes to CPG, you can't avoid Joern. I've used it for three years. Some honest observations:
Pros:
- Supports C/C++/Java/JavaScript/Python
- Generates standardized graph structures, GNN-ready
- Scala-based query DSL; writing rules is satisfying
Pitfalls:
- C++ template code parsing crashes frequently
- Memory pressure on large projects (1M+ lines)
- Python support was added later, some edge types incomplete
Practical advice: For projects over 500K lines, split by modules before generating CPG.
2.3 My CPG Generation Pipeline
# Install Joern (macOS)
brew install joern
# Generate CPG (using OpenSSL as example)
joern-parse /path/to/openssl --language c --output openssl.cpg
# Export to DOT format (for visualization/debugging)
joern-export openssl.cpg --repr cpg14 --format dot
Generated graphs often have millions of edges. Here's where graph slicing becomes essential.
3. Graph Slicing: Keep Only What Matters
3.1 Backward Slicing from Sinks
Full-function CPGs are too large. The DeepWukong paper proposed a clever approach: keep only the subgraphs related to sensitive operations.
What are sensitive operations? Depends on the vulnerability type you're detecting:
| Vulnerability Type | Sensitive Sinks |
|-------------------|-----------------|
| Buffer Overflow | strcpy, memcpy, sprintf |
| Command Injection | system, popen, exec |
| Memory Leak | malloc (without matching free) |
| Use-After-Free | Pointer dereference after free |
Backward slicing from sink points keeps only nodes whose data flow can reach the sink. This can reduce graph size by 80%+.
3.2 Slicing Code Example
Using Joern's query language:
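// Find all strcpy call sites
val sinks = cpg.call.name("strcpy").l
// Backward data flow slicing (depth 5 steps)
// Note: depending on the Joern version, the depth bound may need to be set
// through the data-flow engine configuration rather than passed as an
// argument; check the API of your release.
val slice = sinks.flatMap { sink =>
  sink.reachableByFlows(cpg.method.parameter, 5)
}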
This traces back from strcpy to see which parameters can flow there. Depth 5 is usually sufficient—going deeper adds noise.
4. GNN Architecture Selection
4.1 Mainstream Architectures Compared
After running dozens of models, here's my assessment:
| Model | Accuracy | Training Speed | Interpretability | Recommended Scenario |
|-------|----------|----------------|------------------|---------------------|
| GGNN | 85-90% | Fast | Fair | Large-scale scanning |
| GAT | 87-92% | Medium | Good | Line-level localization |
| DGCNN | 88-93% | Slow | Poor | Maximum accuracy |
| GCN | 82-87% | Very Fast | Good | Resource-constrained |
My choice: GAT (Graph Attention Network). The reason is practical—attention weights directly show which edges matter most, making debugging easier.
4.2 Message Passing Iterations
GNN's core mechanism is message passing: nodes aggregate neighbor information and update their representations.
How many iterations? Papers vary wildly. My experience:
- Code analysis: 4-6 iterations suffice
- Beyond 8 iterations: Over-smoothing kicks in—all nodes become similar
The DEVIGN paper uses 8 iterations, but their graphs are sparser. Adjust based on your graph density.
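To make over-smoothing concrete, here is a toy sketch with no learned weights: each round, every node simply averages its own scalar feature with its neighbors'. The four-node path graph and the `mean_aggregate` and `spread` helpers are assumptions for illustration. Real GNN layers add learned transforms and gating, but the same contraction effect is what drives over-smoothing.

```python
def mean_aggregate(features, neighbors):
    """One round of message passing: each node averages itself with its neighbors."""
    updated = {}
    for node, value in features.items():
        group = [value] + [features[n] for n in neighbors[node]]
        updated[node] = sum(group) / len(group)
    return updated


def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features.values()) - min(features.values())


# Toy path graph a - b - c - d with scalar node features.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
features = {"a": 0.0, "b": 0.0, "c": 1.0, "d": 1.0}

spreads = [spread(features)]
for _ in range(8):
    features = mean_aggregate(features, neighbors)
    spreads.append(spread(features))

print([round(s, 3) for s in spreads])
```

The printed spread never grows, and after eight rounds it is a small fraction of the starting gap: the nodes have become hard to tell apart, which is the failure mode described above.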
4.3 Node Feature Initialization
This is overlooked but crucial. How do you turn code nodes into vectors?
Method 1: Word2Vec
- Treat code tokens as "words"
- Train word embeddings
- Drawback: Doesn't understand code semantics
Method 2: CodeBERT Embeddings
- Use pre-trained CodeBERT
- 768-dimensional embeddings from the base model, typically projected down to 128 or 256 dimensions for the GNN input
- Drawback: Slower inference
Method 3: Instruction2Vec (Binary Analysis)
- Map assembly instructions to vectors
- Suitable for firmware analysis
I now default to CodeBERT—embedding quality is significantly better. Speed issues can be solved through pre-computation—once the graph is fixed, compute embeddings once.
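The pre-computation idea can be sketched as a small cache keyed by a hash of each node's code text. `EmbeddingCache` and `fake_embed` are hypothetical names; `fake_embed` stands in for a real encoder call and returns a toy two-dimensional vector so the example runs without any model.

```python
import hashlib


class EmbeddingCache:
    """Memoize node embeddings by a hash of the node's code text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # hypothetical signature: str -> list of floats
        self.store = {}
        self.misses = 0

    def get(self, code):
        key = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(code)
        return self.store[key]


def fake_embed(code):
    # Placeholder for a real encoder such as CodeBERT; returns a toy 2-d vector.
    return [float(len(code)), float(code.count("("))]


cache = EmbeddingCache(fake_embed)
nodes = ["strcpy(buffer, input)", "char buffer[64]", "strcpy(buffer, input)"]
vectors = [cache.get(code) for code in nodes]
print(cache.misses)  # 2: the repeated strcpy node is served from the cache
```

In practice the store would likely be persisted to disk, so embeddings survive across training runs.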
5. From Dataset to Model Deployment
5.1 Dataset Selection
Stop benchmarking on SARD. That dataset consists largely of synthetic vulnerabilities with overly regular patterns, so scoring 95% on it means little.
Recommended Datasets:
| Dataset | Scale | Source | Realism |
|---------|-------|--------|---------|
| Big-Vul | 188K functions | GitHub commits | ⭐⭐⭐⭐ |
| DiverseVul | 150+ CWEs | Multi-source | ⭐⭐⭐⭐⭐ |
| Draper | 1.27M functions | Static analyzer labels | ⭐⭐⭐ |
| CodeXGLUE | Benchmark | Microsoft curated | ⭐⭐⭐⭐ |
Key reminder: All datasets have biases. Big-Vul over-represents buffer overflows. DiverseVul label quality varies. Best practice: mix multiple datasets for training.
5.2 Model Training Code Framework
With PyTorch Geometric, the core code is concise:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool
from torch_geometric.loader import DataLoader  # lives here in PyG 2.x

class VulnDetector(torch.nn.Module):
    def __init__(self, in_dim=128, hidden_dim=256, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads)
        # conv1 concatenates its heads, so conv2 sees hidden_dim * heads features
        self.conv2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)
        self.classifier = torch.nn.Linear(hidden_dim, 2)

    def forward(self, x, edge_index, batch):
        # Two GAT layers
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        # Graph-level pooling
        x = global_mean_pool(x, batch)
        return self.classifier(x)

# Training loop
# train_loader is assumed to be built elsewhere, e.g.
# DataLoader(train_graphs, batch_size=32, shuffle=True)
model = VulnDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
model.train()
for epoch in range(100):
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index, batch.batch)
        loss = F.cross_entropy(out, batch.y)
        loss.backward()
        optimizer.step()
This is the basic version. Production deployment needs:
- Dropout for regularization
- Learning rate scheduling
- Early stopping
- Class imbalance handling (vulnerabilities are usually rare)
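Of the items above, early stopping is the easiest to show in isolation. The following is a framework-agnostic sketch, not part of PyTorch Geometric. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses standing in for a real evaluation loop.
val_losses = [0.70, 0.61, 0.58, 0.59, 0.60, 0.58, 0.62, 0.55]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 5 0.58
```

In a real loop you would also checkpoint the model whenever `best` improves, then restore that checkpoint after stopping.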
5.3 Handling Class Imbalance
Common issue with vulnerability datasets: normal code vastly outnumbers vulnerable code. Ratios like 10:1 or even 100:1.
My approach:
- Oversample vulnerability samples: SMOTE-Graph variants
- Focal Loss: Reduce weight on easy-to-classify samples
- Threshold adjustment: Don't use 0.5—tune based on business needs
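import torch
import torch.nn as nn
import torch.nn.functional as F

# Focal Loss implementation (binary labels, 1 = vulnerable)
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=0.75):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha  # weight on the vulnerable class; 1 - alpha on the rest

    def forward(self, pred, target):
        ce_loss = F.cross_entropy(pred, target, reduction='none')
        pt = torch.exp(-ce_loss)  # model's probability for the true class
        # Class-dependent weight, so alpha actually rebalances the two classes
        is_pos = target.float()
        alpha_t = self.alpha * is_pos + (1 - self.alpha) * (1 - is_pos)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()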
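Threshold adjustment, the third item in the list above, can be sketched as a sweep over candidate cut-offs on a validation set. The scores, labels, and helper names here are invented for illustration, and F1 is only one possible objective. A team that cares more about recall would optimize a different metric.

```python
def f1_at(threshold, probs, labels):
    """F1 of the vulnerable class when predicting 1 for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, labels, candidates):
    """Return the candidate with the highest F1 (the first one wins ties)."""
    return max(candidates, key=lambda t: f1_at(t, probs, labels))


# Made-up validation scores: 3 vulnerable functions among 10.
probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

t = best_threshold(probs, labels, candidates)
print(t, round(f1_at(t, probs, labels), 3), round(f1_at(0.5, probs, labels), 3))
# 0.3 0.75 0.5
```

On this toy data the default 0.5 cut-off finds only one of the three vulnerable functions. Lowering it to 0.3 finds all three at the cost of two false alarms.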
6. GNN vs LLM: When to Use Which?
6.1 Performance Comparison (Based on My Testing)
On Big-Vul dataset:
| Model | F1 Score | Inference Speed | Training Cost |
|-------|----------|-----------------|---------------|
| GAT | 0.62 | 1200 func/sec | 2hr/A100 |
| DEVIGN | 0.64 | 800 func/sec | 3hr/A100 |
| CodeBERT | 0.65 | 200 func/sec | 8hr/A100 |
| StarCoder | 0.68 | 50 func/sec | 24hr/A100 |
| GPT-4 (zero-shot) | 0.55 | 5 func/sec | $0.01/func |
Key findings:
- LLM leads on large datasets, but training cost is 5-10x higher than GNN
- Zero-shot GPT-4 performs worse than many assume (overhyped)
- GNN's inference speed advantage is significant—suitable for CI/CD integration
6.2 Scenario Recommendations
Use GNN when:
- Daily code scanning (need speed)
- Resource-constrained environments (edge devices)
- Explainability required (audit compliance)
- Cross-function data flow tracking
Use LLM when:
- Novel vulnerability types (zero-shot generalization)
- Code understanding + fix suggestions (need language capability)
- Sufficient labeled data for fine-tuning
Use hybrid architectures when:
- Pursuing maximum accuracy
- Adequate compute resources
- Willing to invest in engineering optimization
6.3 Hybrid Architectures: The Future Direction
The 2025 trend is clear: GNN + LLM hybrid architectures are rising.
Representative works:
- Vul-LMGNNs: LLM for knowledge distillation, GNN for final judgment
- GRACE: GCN + residual connections + contrastive learning
- HAGNN: Heterogeneous attention graph network, 96.6% accuracy on C
My assessment: Pure GNN or pure LLM will both be surpassed by hybrid approaches. The challenge is engineering complexity—you'll face pitfalls from both systems.
6.4 2024-2025 Breakthrough Methods
Academia hasn't been idle. From "proving feasibility" to "explainability" and "multimodal fusion," research has entered its second phase.
| Method | Source | Core Innovation | Practical Value |
|--------|--------|-----------------|-----------------|
| CFExplainer | ISSTA 2024 | Counterfactual explanation: "If you change this line, vulnerability disappears" | Helps developers locate root cause |
| VulGCANet | TrustCom 2024 | GCN+GAT hybrid, solves cross-function long-range dependencies | Significantly improved recall |
| TACSan | 2024 | Three-Address Code (TAC) intermediate representation, more precise | Reduces C/C++ memory vulnerability false positives |
| HAGNN | arXiv 2025 | Heterogeneous attention GNN, distinguishes node/edge types | Cross-language, cross-function tracking |
HAGNN deserves special attention—it models code graphs as heterogeneous graphs, where variable nodes, function nodes, and API nodes each have distinct features, and edges are differentiated between control-flow and data-flow. This design better reflects code's true nature.
7. Industrial Applications
7.1 Aurora Infinite (极光无限)
A leading Chinese vendor in GNN-based vulnerability detection. They have two core products:
WeiZhen (维阵): Static Code Security Analysis Platform
- Uses CPG (Code Property Graph) as unified representation
- Runs GGNN for vulnerability pattern learning
- Supports C/C++/Java/Go/Python
- CI/CD integration with GitLab/Jenkins
Aurora Hunter (极光猎手): AI-Assisted Security Audit
- Combines GNN detection results with LLM explanation capabilities
- Auto-generates vulnerability reports and fix suggestions
- Targets financial, energy, and critical infrastructure sectors
I've spoken with their technical team: CPG generation uses Joern (open source), while the GNN component has proprietary optimizations. They have a solid commercial deployment track record.
7.2 GitHub CodeQL's Graph Queries
Strictly speaking, CodeQL isn't GNN, but the philosophy is similar—model code as a graph, find vulnerabilities with declarative queries.
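// Find SQL injection
// (simplified sketch of the query shape; a working query needs the
// language-specific CodeQL libraries and a taint-tracking configuration)
from Call call, DataFlow::PathNode source, DataFlow::PathNode sink
where
  call.getTarget().hasName("query") and
  source.isSource() and
  sink.isSink() and
  DataFlow::localFlow(source, sink)
select sink, "SQL injection from $@", source, "user input"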
CodeQL's advantage is maintainable rules. Disadvantage: can't automatically learn new vulnerability patterns.
7.3 Smart Contracts
Ethereum smart contracts are an excellent GNN application scenario. Code is short (usually hundreds of lines), vulnerability patterns are well-defined (reentrancy, integer overflow, etc.).
Tool recommendations:
- Slither: Static analysis framework
- Mythril: Symbolic execution
- GNNSCVulDetector: GNN-specific
I've used GNN on Solidity code—reentrancy vulnerability detection accuracy can reach 92%+. Main reason: contract control flow is simpler, graphs aren't as complex.
7.4 Google Big Sleep: AI Agent Discovers Real 0-day
The most shocking case of 2025. Google's Big Sleep project (a Gemini-powered AI agent) discovered a real, previously unknown vulnerability in SQLite (CVE-2025-6965), a memory corruption issue.
It follows the project's first public SQLite finding in late 2024, an exploitable stack buffer underflow that existing fuzzing had missed. Big Sleep's approach:
- Use Gemini to understand SQLite code semantics
- Build code dependency graphs for data flow tracking
- Generate targeted PoC to trigger the vulnerability
Why it matters: Google described the late-2024 finding as the first public example of an AI agent discovering a previously unknown exploitable memory-safety issue in widely used real-world software. Not a lab environment, not a synthetic vulnerability: genuine findings in production code.
Technical detail: Big Sleep isn't pure GNN, but rather an LLM + graph analysis hybrid system. LLM handles semantic understanding, graph analysis handles data flow and control flow tracking.
8. Pitfall Avoidance Guide
8.1 Data Leakage
Most common mistake: overlap between training and test sets.
Big-Vul may contain the same vulnerability multiple times (different commits fixing the same bug). Random splitting severely overestimates model performance.
Correct approach: Deduplicate by CVE or commit hash, then split.
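A group-aware split along those lines might look like the sketch below. The record layout, the `cve` field, and `split_by_group` are assumptions for illustration. The point is that every function tied to one CVE lands on the same side of the split.

```python
import random
from collections import defaultdict


def split_by_group(samples, group_key, test_fraction=0.2, seed=0):
    """Assign whole groups (e.g. all functions sharing a CVE ID) to train or test."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_key]].append(sample)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, int(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [s for k in keys if k not in test_keys for s in groups[k]]
    test = [s for k in keys if k in test_keys for s in groups[k]]
    return train, test


# Hypothetical records: two functions were patched under the same CVE.
samples = [
    {"func": "parse_a", "cve": "CVE-A"},
    {"func": "parse_a_v2", "cve": "CVE-A"},
    {"func": "copy_b", "cve": "CVE-B"},
    {"func": "read_c", "cve": "CVE-C"},
    {"func": "free_d", "cve": "CVE-D"},
    {"func": "exec_e", "cve": "CVE-E"},
]
train, test = split_by_group(samples, "cve")
overlap = {s["cve"] for s in train} & {s["cve"] for s in test}
print(len(train), len(test), overlap)
```

The overlap set is empty by construction. Grouping by commit hash works the same way with a different `group_key`.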
8.2 Label Noise
Draper dataset labels come from static analyzers. Problem: static analysis itself has false positives.
Models trained on noisy labels won't perform much better than static analysis. Chicken-and-egg problem.
My approach: Cross-validate with multiple datasets, keep only labels consistent across datasets.
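One way to sketch that filtering step is shown below. It assumes the same function can be matched across datasets by some shared key, such as a hash of the normalized function body. The dictionaries and the `consensus_labels` helper are hypothetical.

```python
def consensus_labels(*sources):
    """Keep a sample only if at least two sources label it and all of them agree."""
    votes = {}
    for source in sources:
        for sample_id, label in source.items():
            votes.setdefault(sample_id, []).append(label)
    return {
        sample_id: labels[0]
        for sample_id, labels in votes.items()
        if len(labels) >= 2 and len(set(labels)) == 1
    }


# Hypothetical labels keyed by a hash of the normalized function body.
draper_labels = {"f1": 1, "f2": 0, "f3": 1, "f4": 0}
bigvul_labels = {"f1": 1, "f2": 1, "f3": 1}

clean = consensus_labels(draper_labels, bigvul_labels)
print(clean)  # {'f1': 1, 'f3': 1}
```

Here `f2` is dropped because the sources disagree, and `f4` because only one source covers it. The trade-off is a smaller but cleaner training set.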
8.3 Graph Size Issues
CPGs can have hundreds of thousands of nodes. Feeding directly to GNN will blow GPU memory.
Solutions:
- Graph slicing (discussed earlier)
- Graph sampling (sample k-hop neighbors only)
- Hierarchical processing (function-level first, then file-level)
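The k-hop sampling option can be sketched in plain Python as a bounded breadth-first search around a node of interest. The edge list and `k_hop_subgraph` are illustrative assumptions. Graph libraries ship their own subgraph utilities that would normally be used instead.

```python
from collections import deque


def k_hop_subgraph(edges, seed, k):
    """Return nodes within k undirected hops of seed, plus the edges among them."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand past the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    kept = set(depth)
    return kept, [(s, d) for s, d in edges if s in kept and d in kept]


# Toy chain 0 -> 1 -> 2 -> 3 -> 4 -> 5 with the node of interest at 3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
nodes, sub_edges = k_hop_subgraph(edges, seed=3, k=2)
print(sorted(nodes))  # [1, 2, 3, 4, 5]
print(sub_edges)      # [(1, 2), (2, 3), (3, 4), (4, 5)]
```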
8.4 Over-smoothing
Too many GNN layers cause node representations to converge.
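Symptoms: Validation performance plateaus or degrades as you add layers, even while training loss keeps dropping, and node embeddings become nearly indistinguishable from one another.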
Solutions:
- Add residual connections
- Use Jumping Knowledge
- Limit layers (4-6)
9. Future Outlook
9.1 Does GNN Have a Future in the LLM Era?
Yes, but positioning must change.
Pure GNN will struggle to beat fine-tuned LLMs in general vulnerability detection. But in these scenarios, GNN remains irreplaceable:
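- Real-time scanning: roughly 6-24x faster inference in the Big-Vul tests above (1200 func/sec for GAT versus 200 for CodeBERT and 50 for StarCoder)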
- Explainability: Attention weights are traceable
- Data flow analysis: LLMs struggle with long-range dependencies
My prediction: The future is LLM for coarse screening, GNN for fine ranking—a hybrid pipeline.
9.2 Technical Trends
Worth watching:
- Self-supervised graph learning: Reduce labeled data dependency
- Dynamic graph neural networks: Handle code evolution
- Neural-symbolic hybrids: Combine rule reasoning with deep learning
9.3 GNN+LLM Fusion: Two Technical Paths
The 2024-2025 research hotspot is deep fusion of GNN and LLM. Two mainstream approaches have emerged:
Path One: Graph-for-LLM (Graph-Enhanced LLM)
- Use GNN-extracted structural features as additional LLM input
- Example: Use CPG node embeddings as soft prompts
- Advantage: LLM gains structure awareness, reduces hallucination
Path Two: LLM-Augmented GNN
- Use LLM semantic embeddings to initialize GNN node features
- Example: Use CodeBERT/StarCoder to encode code tokens
- Advantage: GNN gains better semantic understanding, not just structure
My assessment: Path Two is more practical. Reason: LLM inference is too slow—using it as preprocessing (run once offline) is more reasonable. Path One requires online LLM calls, where latency and cost are hard to accept.
Practical recommendation: Start with CodeBERT for node embedding (free, and fast enough once embeddings are pre-computed offline), then use GAT for graph learning. This combination is the 2025 sweet spot for cost-effectiveness.
10. Summary and Action Items
For Security Engineers
- Don't believe in silver bullets: Neither GNN nor LLM is universal
- Start with Joern: Convert existing code to graphs, accumulate data
- Begin with small datasets: Practice GAT on 10K samples
- Watch hybrid architectures: Vul-LMGNNs, GRACE worth trying
For Researchers
- Stop benchmarking on SARD: Use Big-Vul or DiverseVul
- Fair comparison matters: Same dataset, same preprocessing
- Focus on industrial deployment: 99% accuracy means nothing if inference takes 1 minute
Code and Resources
Code, dataset links, and paper references mentioned in this article are compiled on GitHub:
- https://github.com/sgInnora/gnn-vuln-detection
Stars and issues welcome.
References
- Yamaguchi, F., et al. "Modeling and Discovering Vulnerabilities with Code Property Graphs." IEEE S&P 2014.
- Zhou, Y., et al. "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks." NeurIPS 2019.
- Cheng, X., et al. "DeepWukong: Statically Detecting Software Vulnerabilities Using Deep Graph Neural Network." ACM TOSEM 2021.
- Li, Y., et al. "Vulnerability Detection with Fine-Grained Interpretations." ESEC/FSE 2021.
- Wang, H., et al. "Combining Graph-Based Learning with Automated Data Collection for Code Vulnerability Detection." IEEE TIFS 2021.
Complete list of 55 papers available in the GitHub repository and in Appendix A below.
Disclaimer: This article is based on public information and the author's hands-on experience, aiming to explore GNN's technical applications in vulnerability detection. Specific product features should be verified with official sources.
Appendix A: Complete List of 55 Papers
Based on 6 months of systematic research, I compiled 55 papers from top venues (NeurIPS, ICSE, FSE, S&P, CCS, USENIX Security, etc.) covering the full landscape of GNN vulnerability detection from 2016 to 2025.
A.1 Foundational Theory and Methodology (18 Papers)
| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| Devign | Zhou et al., NTU | NeurIPS 2019 | GGNN + CPG | First end-to-end GNN vuln detection |
| ReVeal | Chakraborty et al. | ICSE 2021 | GGNN + Triplet Loss | Vulnerability revision learning |
| ReGVD | Nguyen et al. | ICSE 2022 | Residual GNN | Mitigates over-smoothing |
| IVDetect | Li et al. | FSE 2021 | GGNN + Interpretation | Fine-grained explainability |
| DeepWukong | Cheng et al. | TOSEM 2021 | GGNN + Slicing | Backward slicing methodology |
| VulChecker | Mirsky et al. | USENIX 2023 | GGNN + Multi-task | Cross-project generalization |
| FUNDED | Wang et al. | TOSEM 2022 | GNN + Imbalance | Focal loss for imbalance |
| mVulPreter | Li et al. | ASE 2022 | Multi-view GNN | Multi-scale fusion |
| PILOT | Wu et al. | ISSTA 2022 | GGNN + Pre-training | Graph-level pre-training |
| VulCNN | Wu et al. | ASE 2022 | CNN + Code Image | Code visualization approach |
| SySeVR | Li et al. | IEEE TDSC 2022 | Syntax + Semantic | Dual representation |
| VulDeePecker | Li et al. | NDSS 2018 | BiLSTM | Code gadget concept |
| Draper | Russell et al. | ICMLA 2018 | CNN + Random Forest | Large-scale dataset |
| µVulDeePecker | Zou et al. | IEEE TDSC 2019 | Multi-type Detection | Attention mechanism |
| BGNN4VD | Cao et al. | Inf. Sci. 2021 | Bi-directional GNN | Bidirectional message passing |
| VGDetector | Zheng et al. | ISSRE 2021 | GCN + Pooling | Hierarchical readout |
| DeepVD | Pornprasit et al. | 2022 | GNN + Class Balance | Threshold tuning |
| VDSimilar | Fang et al. | ICSE 2020 | Similarity + GNN | Clone-based detection |
A.2 Pre-trained Models and Transformer Methods (12 Papers)
| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| CodeBERT | Feng et al., Microsoft | EMNLP 2020 | BERT + Code | Bimodal pre-training |
| GraphCodeBERT | Guo et al., Microsoft | ICLR 2021 | CodeBERT + DFG | Data flow graph integration |
| UniXcoder | Guo et al. | ACL 2022 | Unified | Multi-task unified model |
| LineVul | Fu et al. | MSR 2022 | Transformer | Line-level localization |
| Vul-LMGNNs | Anonymous | arXiv 2023 | LLM + GNN | LLM knowledge distillation |
| GRACE | Anonymous | arXiv 2024 | GCN + Contrastive | Residual + contrastive learning |
| CodeT5+ | Wang et al., Salesforce | EMNLP 2023 | T5 + Code | Encoder-decoder pre-training |
| StarCoder | BigCode Team | 2023 | 15B Model | Large-scale code model |
| DeepSeek-Coder | DeepSeek | 2024 | 33B Model | Fill-in-middle training |
| Magicoder | Wei et al. | 2024 | OSS-Instruct | Open-source instruction tuning |
| WizardCoder | Luo et al. | 2023 | Evol-Instruct | Code evolutionary instruction |
| SonarQube ML | SonarSource | 2024 | Hybrid | Industrial grade integration |
A.3 Smart Contract and Blockchain Security (8 Papers)
| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| SmartEmbed | Gao et al. | TOSEM 2020 | Code Embedding | Solidity similarity |
| GNNSCVulDetector | Zhuang et al. | IJCNN 2020 | GCN | Multi-contract analysis |
| TMP | Zhuang et al. | USENIX 2021 | Temporal GNN | Reentrancy detection |
| Peculiar | Wu et al. | ICSE 2021 | Code Property | Ponzi scheme detection |
| SCGformer | Li et al. | 2023 | Transformer | Cross-contract semantics |
| BlockScope | Huang et al. | ISSTA 2023 | GNN + Static | Gas optimization analysis |
| GPTScan | Sun et al. | 2024 | GPT + Static | Logic bug detection |
| AuditGPT | Anonymous | 2024 | LLM Audit | Automated audit reports |
A.4 Binary and Low-Level Code Analysis (7 Papers)
| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| PALMTREE | Li et al. | CCS 2021 | Instruction Embedding | Assembly pre-training |
| VulHunter | Xu et al. | ASE 2020 | Binary GNN | Firmware analysis |
| Gemini | Xu et al. | CCS 2017 | Siamese Network | Binary similarity |
| SAFE | Massarelli et al. | DIMVA 2019 | Self-Attentive | Assembly self-attention |
| OrderMatters | Yu et al. | NDSS 2020 | Instruction Order | Binary semantics |
| BinGo | Chandramohan et al. | NDSS 2016 | Partial Match | Selective inlining |
| αDiff | Liu et al. | ASE 2018 | Deep Learning | Cross-version diff |
A.5 Industrial Tools and Applications (5 Papers)
| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| Joern | Yamaguchi et al. | IEEE S&P 2014 | CPG | De facto standard |
| WeiZhen (维阵) | Aurora Infinite | Industry 2022 | GGNN + CPG | Commercial platform |
| Slither | Trail of Bits | 2019 | Pattern Match | Solidity static analysis |
| Mythril | ConsenSys | 2017 | Symbolic Exec | Ethereum security |
| CodeQL | GitHub | 2019 | Declarative Query | Graph query language |
A.6 Program Slicing and Explainability (5 Papers)
| Paper Title | Authors/Institution | Year/Venue | Core Technology | Key Innovation |
|------------|---------------------|------------|-----------------|----------------|
| CFExplainer | Liu et al. | ISSTA 2024 | Counterfactual | "Change this line to fix" |
| VulGCANet | Chen et al. | TrustCom 2024 | GCN + GAT | Long-range dependencies |
| HAGNN | Anonymous | arXiv 2025 | Heterogeneous Attention | Multi-type nodes/edges |
| Big Sleep | Google | 2025 | Gemini + Graph | First real 0-day by AI |
| XVulExplain | Wang et al. | 2024 | SHAP + GNN | Feature attribution |
Appendix B: Complete Code Implementation
This section provides complete, runnable implementations of core GNN components for vulnerability detection.
B.1 GGNN (Gated Graph Neural Network) Implementation
"""
GGNN-based Vulnerability Detector
Based on DEVIGN (NeurIPS 2019) architecture
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing, global_mean_pool


class GGNNLayer(MessagePassing):
    """Single GGNN layer with GRU-style gating"""

    def __init__(self, hidden_dim: int, num_edge_types: int = 4):
        super().__init__(aggr='add')  # Sum aggregation
        self.hidden_dim = hidden_dim
        self.num_edge_types = num_edge_types
        # Edge-specific transformation matrices
        self.edge_mlps = nn.ModuleList([
            nn.Linear(hidden_dim, hidden_dim, bias=False)
            for _ in range(num_edge_types)
        ])
        # GRU gate
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, x, edge_index, edge_type):
        # Message passing for each edge type
        aggregated = torch.zeros_like(x)
        for etype in range(self.num_edge_types):
            mask = edge_type == etype
            if mask.any():
                edge_idx = edge_index[:, mask]
                msg = self.edge_mlps[etype](x)
                aggregated += self.propagate(edge_idx, x=msg)
        # GRU update: aggregated messages are the input, x is the previous state
        return self.gru(aggregated, x)

    def message(self, x_j):
        return x_j


class DevignModel(nn.Module):
    """
    DEVIGN: Effective Vulnerability Identification
    Architecture: GGNN backbone + Conv1D readout
    """

    def __init__(
        self,
        input_dim: int = 128,
        hidden_dim: int = 256,
        num_layers: int = 6,
        num_edge_types: int = 4,  # AST, CFG, DFG, CDG
        num_classes: int = 2,
        dropout: float = 0.3
    ):
        super().__init__()
        # Input projection
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        # GGNN layers
        self.ggnn_layers = nn.ModuleList([
            GGNNLayer(hidden_dim, num_edge_types)
            for _ in range(num_layers)
        ])
        # Conv1D for sequence modeling
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=1)
        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, num_classes)
        )

    def forward(self, x, edge_index, edge_type, batch):
        # Project input features
        h = self.input_proj(x)
        # GGNN message passing
        for layer in self.ggnn_layers:
            h = layer(h, edge_index, edge_type)
        # Conv1D processing (requires reshaping)
        # Caveat: all nodes in the batch are treated as one sequence, so the
        # kernel_size=3 convolution mixes features across graph boundaries.
        h = h.unsqueeze(0).permute(0, 2, 1)  # [1, hidden, nodes]
        h = F.relu(self.conv1(h))
        h = self.conv2(h)
        h = h.permute(0, 2, 1).squeeze(0)  # [nodes, hidden]
        # Graph-level pooling
        graph_emb = global_mean_pool(h, batch)
        return self.classifier(graph_emb)


# Usage example
if __name__ == "__main__":
    model = DevignModel(input_dim=128, hidden_dim=256, num_layers=6)
    # Simulated input (100 nodes, batch of 4 graphs)
    x = torch.randn(100, 128)
    edge_index = torch.randint(0, 100, (2, 300))
    edge_type = torch.randint(0, 4, (300,))
    batch = torch.repeat_interleave(torch.arange(4), 25)
    output = model(x, edge_index, edge_type, batch)
    print(f"Output shape: {output.shape}")  # [4, 2]
B.2 CodeBERT Node Embedding
"""
CodeBERT-based node feature initialization
Using Microsoft's pre-trained CodeBERT model
"""
import torch
from transformers import AutoTokenizer, AutoModel
from typing import List


class CodeBERTEmbedder:
    """Extract code embeddings using CodeBERT"""

    def __init__(
        self,
        model_name: str = "microsoft/codebert-base",
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
        max_length: int = 512
    ):
        self.device = device
        self.max_length = max_length
        # Load pre-trained model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.model.eval()

    @torch.no_grad()
    def encode(self, code_snippets: List[str]) -> torch.Tensor:
        """
        Encode code snippets to embedding vectors
        Args:
            code_snippets: List of code strings (one per node)
        Returns:
            Tensor of shape [num_nodes, 768]
        """
        embeddings = []
        for code in code_snippets:
            # Tokenize
            inputs = self.tokenizer(
                code,
                return_tensors="pt",
                max_length=self.max_length,
                truncation=True,
                padding="max_length"
            ).to(self.device)
            # Get [CLS] token embedding
            outputs = self.model(**inputs)
            cls_emb = outputs.last_hidden_state[:, 0, :]  # [1, 768]
            embeddings.append(cls_emb)
        return torch.cat(embeddings, dim=0)

    def encode_batch(
        self,
        code_snippets: List[str],
        batch_size: int = 32
    ) -> torch.Tensor:
        """Batch encoding for large datasets"""
        all_embeddings = []
        for i in range(0, len(code_snippets), batch_size):
            batch = code_snippets[i:i + batch_size]
            inputs = self.tokenizer(
                batch,
                return_tensors="pt",
                max_length=self.max_length,
                truncation=True,
                padding=True
            ).to(self.device)
            with torch.no_grad():
                outputs = self.model(**inputs)
                cls_emb = outputs.last_hidden_state[:, 0, :]
            all_embeddings.append(cls_emb.cpu())
        return torch.cat(all_embeddings, dim=0)


# Usage example
if __name__ == "__main__":
    embedder = CodeBERTEmbedder()
    code_samples = [
        "void foo() { int x = 0; }",
        "char* strcpy(char* dest, const char* src);",
        "if (ptr != NULL) { free(ptr); }"
    ]
    embeddings = embedder.encode(code_samples)
    print(f"Embeddings shape: {embeddings.shape}")  # [3, 768]
B.3 Graph Slicing (Backward Slicing)
"""
CPG Graph Slicing Implementation
Based on DeepWukong (TOSEM 2021) methodology
"""
import re
from collections import deque
from typing import List, Dict, Set, Tuple
from dataclasses import dataclass
import networkx as nx


@dataclass
class CPGNode:
    """CPG node representation"""
    id: int
    type: str  # 'call', 'identifier', 'literal', etc.
    code: str
    line_number: int


@dataclass
class CPGEdge:
    """CPG edge representation"""
    src: int
    dst: int
    type: str  # 'AST', 'CFG', 'DFG', 'CDG'


class CPGSlicer:
    """Backward slicing from sensitive sinks"""

    # Sensitive sink functions by vulnerability type
    SINKS = {
        'buffer_overflow': ['strcpy', 'strcat', 'sprintf', 'gets', 'memcpy'],
        'command_injection': ['system', 'popen', 'exec', 'execve', 'execl'],
        'format_string': ['printf', 'fprintf', 'sprintf', 'snprintf'],
        'memory_leak': ['malloc', 'calloc', 'realloc', 'strdup'],
        'use_after_free': ['free'],  # Track pointers after free
        'sql_injection': ['mysql_query', 'sqlite3_exec', 'PQexec'],
    }

    def __init__(
        self,
        nodes: List[CPGNode],
        edges: List[CPGEdge],
        max_depth: int = 5
    ):
        self.nodes = {n.id: n for n in nodes}
        self.max_depth = max_depth
        # Build graph
        self.graph = nx.DiGraph()
        for node in nodes:
            self.graph.add_node(node.id, data=node)
        for edge in edges:
            self.graph.add_edge(edge.src, edge.dst, type=edge.type)
        # Build reverse graph for backward slicing
        self.reverse_graph = self.graph.reverse()

    def find_sinks(self, vuln_type: str = 'all') -> List[int]:
        """Find all sink nodes in the graph"""
        sinks = []
        if vuln_type == 'all':
            sink_names = set()
            for names in self.SINKS.values():
                sink_names.update(names)
        else:
            sink_names = set(self.SINKS.get(vuln_type, []))
        for node_id, node in self.nodes.items():
            if node.type == 'call':
                # Compare the callee name exactly; substring matching would
                # flag e.g. sqlite3_exec as exec or sprintf as printf
                callee = node.code.split('(', 1)[0].strip()
                if callee in sink_names:
                    sinks.append(node_id)
        return sinks

    def backward_slice(
        self,
        sink_id: int,
        edge_types: Set[str] = {'DFG', 'CDG'}
    ) -> Set[int]:
        """
        Perform backward slicing from a sink node
        Args:
            sink_id: ID of the sink node
            edge_types: Edge types to follow (DFG for data flow, CDG for control)
        Returns:
            Set of node IDs in the slice
        """
        visited = set()
        queue = deque([(sink_id, 0)])  # (node_id, depth)
        while queue:
            node_id, depth = queue.popleft()
            if node_id in visited or depth > self.max_depth:
                continue
            visited.add(node_id)
            # Traverse backward edges: successors in the reversed graph are
            # the predecessors of node_id in the original graph
            for pred in self.reverse_graph.successors(node_id):
                edge_data = self.graph[pred][node_id]
                if edge_data['type'] in edge_types:
                    queue.append((pred, depth + 1))
        return visited

    def slice_for_vulnerability(
        self,
        vuln_type: str = 'all'
    ) -> List[Tuple[int, Set[int]]]:
        """
        Generate slices for all sinks of a given vulnerability type
        Returns:
            List of (sink_id, slice_node_set) tuples
        """
        sinks = self.find_sinks(vuln_type)
        slices = []
        for sink_id in sinks:
            slice_nodes = self.backward_slice(sink_id)
            slices.append((sink_id, slice_nodes))
        return slices

    def extract_subgraph(
        self,
        node_ids: Set[int]
    ) -> Tuple[List[CPGNode], List[CPGEdge]]:
        """Extract subgraph containing only specified nodes"""
        nodes = [self.nodes[nid] for nid in node_ids if nid in self.nodes]
        edges = []
        for src, dst, data in self.graph.edges(data=True):
            if src in node_ids and dst in node_ids:
                edges.append(CPGEdge(src, dst, data['type']))
        return nodes, edges


# Joern integration helper
def joern_query_to_slice(joern_result: str) -> List[int]:
    """
    Parse Joern query results to get slice nodes
    Joern query example:
        cpg.call.name("strcpy").reachableByFlows(cpg.method.parameter, 5).id.l
    """
    # Parse Joern output format: collect every integer in the output as a node ID
    ids = re.findall(r'\d+', joern_result)
    return [int(i) for i in ids]
# Usage example
if __name__ == "__main__":
    # Create sample CPG
    nodes = [
        CPGNode(0, 'param', 'char* input', 1),
        CPGNode(1, 'identifier', 'input', 2),
        CPGNode(2, 'call', 'strlen(input)', 2),
        CPGNode(3, 'identifier', 'buffer', 3),
        CPGNode(4, 'call', 'strcpy(buffer, input)', 4),  # SINK
        CPGNode(5, 'return', 'return 0', 5),
    ]
    edges = [
        CPGEdge(0, 1, 'DFG'),  # input def -> use
        CPGEdge(1, 2, 'DFG'),  # input -> strlen
        CPGEdge(1, 4, 'DFG'),  # input -> strcpy (TAINT)
        CPGEdge(3, 4, 'DFG'),  # buffer -> strcpy
        CPGEdge(2, 4, 'CDG'),  # strlen -> strcpy (control)
        CPGEdge(4, 5, 'CFG'),  # strcpy -> return
    ]
    slicer = CPGSlicer(nodes, edges, max_depth=5)

    # Find buffer overflow sinks
    sinks = slicer.find_sinks('buffer_overflow')
    print(f"Found sinks: {sinks}")  # [4]

    # Backward slice from sink
    slices = slicer.slice_for_vulnerability('buffer_overflow')
    for sink_id, slice_nodes in slices:
        print(f"Sink {sink_id}: slice contains {len(slice_nodes)} nodes")
        print(f"  Nodes: {slice_nodes}")
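Before a slice can be handed to a GNN, its node IDs usually need to be remapped to a contiguous 0..n-1 range, because graph libraries such as PyTorch Geometric index node features by position. The following is a minimal, dependency-free sketch of that step under stated assumptions: `reindex_slice` is a hypothetical helper (not part of DeepWukong or Joern), and edges are passed as plain `(src, dst, type)` tuples rather than `CPGEdge` objects.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int, str]  # (src, dst, edge_type); hypothetical plain-tuple form


def reindex_slice(
    slice_nodes: Set[int],
    edges: Iterable[Edge],
) -> Tuple[Dict[int, int], List[List[int]], List[str]]:
    """Remap slice node IDs to 0..n-1 and keep only edges inside the slice."""
    # Sort for a deterministic mapping: the smallest original ID becomes index 0
    id_map = {old: new for new, old in enumerate(sorted(slice_nodes))}
    sources: List[int] = []
    targets: List[int] = []
    edge_types: List[str] = []
    for src, dst, etype in edges:
        # Drop edges that leave the slice (either endpoint missing)
        if src in id_map and dst in id_map:
            sources.append(id_map[src])
            targets.append(id_map[dst])
            edge_types.append(etype)
    # Two parallel rows, the same layout as a [2, num_edges] edge_index tensor
    return id_map, [sources, targets], edge_types


if __name__ == "__main__":
    # A small slice around sink 4 from the sample CPG: the sink plus its two data-flow inputs
    sample_edges = [(1, 4, "DFG"), (3, 4, "DFG"), (4, 5, "CFG")]
    id_map, edge_index, edge_types = reindex_slice({1, 3, 4}, sample_edges)
    print(id_map)      # {1: 0, 3: 1, 4: 2}
    print(edge_index)  # [[0, 1], [2, 2]]
    print(edge_types)  # ['DFG', 'DFG']
```

The `CFG` edge to node 5 is dropped because node 5 lies outside the slice. Keeping `edge_types` as a parallel list leaves room for a relation-aware model later, while a plain GAT can ignore it.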
B.4 Complete Training Pipeline
"""
Complete GNN Vulnerability Detection Training Pipeline
Includes: data loading, training loop, evaluation, checkpointing
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # DataLoader lives in torch_geometric.loader since PyG 2.0
from torch_geometric.nn import GATConv, global_mean_pool
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from typing import Dict, List, Tuple
import numpy as np
from tqdm import tqdm
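class FocalLoss(nn.Module):
    """Focal Loss for class imbalance (binary labels: 0 = safe, 1 = vulnerable)"""

    def __init__(self, gamma: float = 2.0, alpha: float = 0.75):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha  # Weight for the positive class; negatives get 1 - alpha

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        ce_loss = F.cross_entropy(pred, target, reduction='none')
        pt = torch.exp(-ce_loss)  # Probability assigned to the true class
        # Class-dependent weight: a single scalar applied to every sample
        # would only rescale the loss and would not counter class imbalance
        t = target.float()
        alpha_t = self.alpha * t + (1.0 - self.alpha) * (1.0 - t)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()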
class VulnerabilityDetector(nn.Module):
    """GAT-based Vulnerability Detector with residual connections"""

    def __init__(
        self,
        input_dim: int = 768,  # CodeBERT dimension
        hidden_dim: int = 256,
        num_heads: int = 8,
        num_layers: int = 4,
        dropout: float = 0.3,
        num_classes: int = 2
    ):
        super().__init__()
        # Input projection
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        # GAT layers with residual connections
        self.gat_layers = nn.ModuleList()
        self.layer_norms = nn.ModuleList()
        for i in range(num_layers):
            # Heads are concatenated, so every layer outputs hidden_dim * num_heads
            in_channels = hidden_dim if i == 0 else hidden_dim * num_heads
            self.gat_layers.append(
                GATConv(in_channels, hidden_dim, heads=num_heads, dropout=dropout)
            )
            self.layer_norms.append(nn.LayerNorm(hidden_dim * num_heads))
        # Final projection
        self.final_gat = GATConv(hidden_dim * num_heads, hidden_dim, heads=1, dropout=dropout)
        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, num_classes)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, edge_index, batch):
        # Input projection
        h = self.input_proj(x)
        # First GAT layer (no residual - dimension change)
        h = self.gat_layers[0](h, edge_index)
        h = F.elu(h)
        h = self.layer_norms[0](h)
        h = self.dropout(h)
        # Remaining GAT layers with residual connections
        for i in range(1, len(self.gat_layers)):
            h_res = h
            h = self.gat_layers[i](h, edge_index)
            h = F.elu(h)
            h = self.layer_norms[i](h)
            h = self.dropout(h)
            h = h + h_res  # Residual connection
        # Final GAT layer
        h = self.final_gat(h, edge_index)
        h = F.elu(h)
        # Graph-level pooling
        graph_emb = global_mean_pool(h, batch)
        return self.classifier(graph_emb)
class VulnerabilityTrainer:
    """Training pipeline with evaluation and checkpointing"""

    def __init__(
        self,
        model: nn.Module,
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
        learning_rate: float = 1e-4,
        weight_decay: float = 1e-5
    ):
        self.model = model.to(device)
        self.device = device
        # Loss and optimizer
        self.criterion = FocalLoss(gamma=2.0, alpha=0.75)
        self.optimizer = AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay
        )
        # Metrics tracking
        self.best_f1 = 0.0
        self.history = {'train_loss': [], 'val_f1': []}

    def train_epoch(self, train_loader: DataLoader) -> float:
        """Train for one epoch"""
        self.model.train()
        total_loss = 0.0
        for batch in tqdm(train_loader, desc="Training"):
            batch = batch.to(self.device)
            self.optimizer.zero_grad()
            output = self.model(batch.x, batch.edge_index, batch.batch)
            loss = self.criterion(output, batch.y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()
            total_loss += loss.item()
        return total_loss / len(train_loader)

    @torch.no_grad()
    def evaluate(self, val_loader: DataLoader) -> Dict[str, float]:
        """Evaluate model on validation set"""
        self.model.eval()
        all_preds = []
        all_labels = []
        all_probs = []
        for batch in val_loader:
            batch = batch.to(self.device)
            output = self.model(batch.x, batch.edge_index, batch.batch)
            probs = F.softmax(output, dim=1)[:, 1]
            preds = output.argmax(dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch.y.cpu().numpy())
            all_probs.extend(probs.cpu().numpy())
        # Calculate metrics (AUC is undefined when only one class is present)
        metrics = {
            'f1': f1_score(all_labels, all_preds),
            'precision': precision_score(all_labels, all_preds),
            'recall': recall_score(all_labels, all_preds),
            'auc': roc_auc_score(all_labels, all_probs) if len(set(all_labels)) > 1 else 0.0
        }
        return metrics

    def train(
        self,
        train_loader: DataLoader,
        val_loader: DataLoader,
        num_epochs: int = 100,
        patience: int = 10,
        checkpoint_path: str = "best_model.pt"
    ) -> Dict:
        """Full training loop with early stopping"""
        scheduler = CosineAnnealingLR(self.optimizer, T_max=num_epochs)
        no_improve = 0
        for epoch in range(num_epochs):
            # Training
            train_loss = self.train_epoch(train_loader)
            self.history['train_loss'].append(train_loss)
            # Evaluation
            metrics = self.evaluate(val_loader)
            self.history['val_f1'].append(metrics['f1'])
            # Learning rate scheduling
            scheduler.step()
            # Logging
            print(f"Epoch {epoch+1}/{num_epochs}")
            print(f"  Train Loss: {train_loss:.4f}")
            print(f"  Val F1: {metrics['f1']:.4f}, Precision: {metrics['precision']:.4f}, "
                  f"Recall: {metrics['recall']:.4f}, AUC: {metrics['auc']:.4f}")
            # Checkpointing
            if metrics['f1'] > self.best_f1:
                self.best_f1 = metrics['f1']
                torch.save({
                    'epoch': epoch,
                    'model_state_dict': self.model.state_dict(),
                    'optimizer_state_dict': self.optimizer.state_dict(),
                    'best_f1': self.best_f1
                }, checkpoint_path)
                print(f"  Saved best model (F1: {self.best_f1:.4f})")
                no_improve = 0
            else:
                no_improve += 1
            # Early stopping
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break
        return self.history
def create_synthetic_dataset(num_graphs: int = 1000) -> List[Data]:
    """Create synthetic dataset for testing"""
    dataset = []
    for _ in range(num_graphs):
        num_nodes = np.random.randint(50, 200)
        num_edges = np.random.randint(num_nodes, num_nodes * 3)
        # Random node features (simulating CodeBERT embeddings)
        x = torch.randn(num_nodes, 768)
        # Random edges
        edge_index = torch.randint(0, num_nodes, (2, num_edges))
        # Random label (0: safe, 1: vulnerable)
        y = torch.tensor([np.random.randint(0, 2)])
        dataset.append(Data(x=x, edge_index=edge_index, y=y))
    return dataset
# Main execution
if __name__ == "__main__":
    # Create synthetic dataset
    print("Creating synthetic dataset...")
    train_data = create_synthetic_dataset(800)
    val_data = create_synthetic_dataset(200)

    # Create data loaders
    train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_data, batch_size=32)

    # Initialize model and trainer
    model = VulnerabilityDetector(
        input_dim=768,
        hidden_dim=256,
        num_heads=8,
        num_layers=4
    )
    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
    trainer = VulnerabilityTrainer(model)

    # Train
    print("\nStarting training...")
    history = trainer.train(
        train_loader,
        val_loader,
        num_epochs=50,
        patience=10,
        checkpoint_path="vuln_detector_best.pt"
    )
    print(f"\nTraining complete. Best F1: {trainer.best_f1:.4f}")
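To see what the focal term in the pipeline above does to individual samples, it helps to plug in numbers. The sketch below evaluates `alpha * (1 - p_t)^gamma * CE` in plain Python for one confidently classified sample and one hard sample, using the same `gamma=2.0` and `alpha=0.75` as the trainer. `focal_term` is a hypothetical helper written for illustration only; it is not a PyTorch function, and `alpha` here is simply the weight applied to the sample in question.

```python
import math


def focal_term(p_true: float, gamma: float = 2.0, alpha: float = 0.75) -> float:
    """Focal loss for one sample, given the probability assigned to its true class."""
    ce = -math.log(p_true)  # Plain cross-entropy for this sample
    # (1 - p_true) ** gamma shrinks the loss of samples the model already gets right
    return alpha * (1.0 - p_true) ** gamma * ce


if __name__ == "__main__":
    for p in (0.9, 0.1):  # An easy sample and a hard sample
        ce = -math.log(p)
        print(f"p_true={p}: CE={ce:.4f}, focal={focal_term(p):.4f}")
```

With these settings the easy sample's cross-entropy is multiplied by 0.75 × 0.1² = 0.0075, while the hard sample's is multiplied by 0.75 × 0.9² = 0.6075. The gradient signal therefore concentrates on the hard, often minority-class, examples, which is the reason for preferring focal loss on imbalanced vulnerability datasets.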
Author: Jiqiang Feng (风宁) Contact: [email protected] | [email protected] GitHub: @sgInnora