Experimental multi-model ensemble system for detecting vulnerabilities in C code using fine-tuned transformer models.
📖 Read the full story: Building a Multi-Model Ensemble for Code Vulnerability Detection
This project explores using state-of-the-art code understanding models (CodeBERT, GraphCodeBERT, CodeT5) to detect security vulnerabilities in source code. The models are fine-tuned on Microsoft's Devign dataset of 21K+ labeled C functions.
Key learnings from this project:
- Multi-model ensemble challenges and calibration issues
- The critical importance of data quality and proper evaluation
- Why 66% accuracy on a real problem teaches more than 99% on toy datasets
- Production ML is 80% engineering, 20% modeling
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| CodeBERT | 66.35% | 68.42% | 47.97% | 56.40% |
| GraphCodeBERT | 61.66% | 55.57% | 77.23% | 64.63% |
| CodeT5 | 56.68% | 51.71% | 68.14% | 58.80% |
Baseline comparison: ~12 percentage points over a regex-pattern baseline (54% accuracy)
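As a sanity check, the F1 column is just the harmonic mean of precision and recall, which the table's numbers reproduce:

```python
# Verify the reported F1 scores: F1 = 2PR / (P + R),
# the harmonic mean of precision (P) and recall (R).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f"CodeBERT:      {f1(68.42, 47.97):.2f}")   # 56.40
print(f"GraphCodeBERT: {f1(55.57, 77.23):.2f}")   # 64.63
print(f"CodeT5:        {f1(51.71, 68.14):.2f}")   # 58.80
```

Note how the F1 ranking differs from the accuracy ranking: CodeBERT wins on accuracy, GraphCodeBERT on F1.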
```bash
# Setup
git clone https://github.com/yourusername/vulnai.git
cd vulnai
pip install -r requirements.txt

# Download and preprocess the data
python data/download_data.py
python data/preprocess.py

# Train individual models
python src/train.py --model codebert
python src/train.py --model graphcodebert
python src/train.py --model codet5

# Or train all at once
python scripts/train_all_models.py
```

```python
from src.inference import VulnerabilityDetector

detector = VulnerabilityDetector(model_name='graphcodebert')

code = '''
def get_user(user_id):
    query = "SELECT * FROM users WHERE id=" + user_id
    return execute(query)
'''

result = detector.predict(code)
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2%}")
# Output: Prediction: VULNERABLE, Confidence: 87%
```

CLI usage:
```bash
# Analyze a code string
python -m src.inference --code "query = 'SELECT * WHERE id=' + uid"

# Analyze a file
python -m src.inference --file vulnerable_code.c --model graphcodebert
```

**CodeBERT**
- Architecture: BERT pre-trained on code (6 programming languages)
- Parameters: 125M
- Strength: Keyword-based patterns (e.g., `strcpy`, `eval`)
- Weakness: Misses structural vulnerabilities
**GraphCodeBERT**
- Architecture: Graph-aware BERT (understands data flow)
- Parameters: 125M
- Strength: Structural vulnerabilities (loops without bounds, control flow issues)
- Weakness: Slower inference due to graph construction
**CodeT5**
- Architecture: T5 encoder-decoder adapted for code
- Parameters: 220M
- Strength: Larger capacity for complex patterns
- Weakness: Encoder-decoder not ideal for classification tasks
Devign (Microsoft Research, 2019)
- Size: 21,854 C functions from real projects
- Source: QEMU, FFmpeg, Linux kernel, Pidgin
- Labels: Binary (vulnerable / safe)
- Split: 80% train, 10% val, 10% test
- Quality assurance: Zero data leakage verified (aggressive deduplication)
❌ Ensemble failed (50% accuracy = random guessing)
- Root cause: Model disagreement without proper calibration
- Learning: Ensemble methods require careful probability calibration
- Fix needed: Platt scaling or meta-model stacking
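The calibration fix can be sketched in a few lines: fit a two-parameter sigmoid p = 1 / (1 + exp(-(a·s + b))) on each model's held-out validation scores, then average calibrated probabilities instead of raw scores. This is an illustrative stand-alone version on synthetic scores, not the project's code:

```python
import math
import random

# Platt scaling sketch (synthetic data, illustrative only): fit
# p = 1 / (1 + exp(-(a*score + b))) on held-out scores so that each
# model's raw scores become comparable probabilities before ensembling.

def platt_fit(scores, labels, lr=0.1, epochs=1000):
    """Fit sigmoid parameters (a, b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s   # d(logloss)/da
            grad_b += (p - y)       # d(logloss)/db
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_apply(score, a, b):
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Synthetic validation scores: positives around +2, negatives around -2
random.seed(0)
scores = [random.gauss(2, 1) for _ in range(200)] + [random.gauss(-2, 1) for _ in range(200)]
labels = [1] * 200 + [0] * 200
a, b = platt_fit(scores, labels)
print(platt_apply(3.0, a, b))   # confident positive score -> probability near 1
print(platt_apply(-3.0, a, b))  # confident negative score -> probability near 0
```

Once every model is calibrated this way, averaged probabilities actually mean the same thing across models; scikit-learn's `CalibratedClassifierCV(method='sigmoid')` implements the same idea.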
❌ CodeT5 underperformed despite more parameters
- Root cause: Encoder-decoder architecture not suited for classification
- Learning: Model design matters more than parameter count
❌ Gap to commercial tools (80-88% accuracy)
- Root cause: Limited training data, no domain-specific tuning
- Learning: Commercial tools have years of optimization and larger datasets
✅ Proper data preprocessing prevented inflated results
- Initial 100% accuracy revealed data leakage (296 overlapping samples)
- Aggressive deduplication and different random seeds fixed it
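A hash-based cross-split check along these lines is one way to catch such overlaps (helper names here are hypothetical; the project's actual preprocessing may differ):

```python
import hashlib

# Leakage-check sketch (hypothetical helpers): fingerprint each function
# with all whitespace stripped, then drop any training sample whose
# fingerprint also appears in the validation or test split.

def code_hash(code: str) -> str:
    normalized = "".join(code.split())  # whitespace-insensitive fingerprint
    return hashlib.sha256(normalized.encode()).hexdigest()

def deduplicate(train, val, test):
    held_out = {code_hash(c) for c in val} | {code_hash(c) for c in test}
    return [c for c in train if code_hash(c) not in held_out]

train_split = ["int f(int a) { return a; }", "void g() { leak(); }"]
val_split   = ["void g(){leak();}"]   # same function, different whitespace
test_split  = []
clean_train = deduplicate(train_split, val_split, test_split)
print(clean_train)   # only the non-overlapping function survives
```

The whitespace-stripped hash catches reformatted duplicates, which plain string equality would miss.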
✅ GraphCodeBERT's structural understanding proved valuable
- Best F1 score (64.63%) despite same parameter count as CodeBERT
- Confirms: code structure matters for vulnerability detection
✅ Complete pipeline demonstrates production thinking
- Data → Training → Evaluation → Inference
- 80% of work was engineering, not modeling
"Production ML is 20% modeling, 80% infrastructure" — This project proved it
- Data quality > model architecture: clean, deduplicated data matters more than parameter count
- The right metric matters: accuracy misled on imbalanced data; F1 told the truth
- Ensembles aren't magic: they require calibration and can fail spectacularly
- Ship it: perfect is the enemy of done; 66% accuracy teaches more than endless optimization
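The metric point is easy to demonstrate on a toy 90/10 split: a degenerate classifier that always predicts "safe" looks strong on accuracy while finding zero vulnerabilities:

```python
# Why accuracy misleads on imbalanced labels: always predicting the
# majority class ("safe") scores 90% accuracy but 0% recall on the
# vulnerable class -- and therefore an F1 of 0.
labels = [0] * 90 + [1] * 10   # 0 = safe, 1 = vulnerable
preds  = [0] * 100             # degenerate "always safe" classifier

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = tp / sum(labels)
print(accuracy, recall)   # 0.9 0.0
```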
If I were to continue this project:
- Fix ensemble with Platt scaling or stacking
- Implement Graph Neural Networks for AST/CFG analysis
- Add explainability (attention visualization, LIME)
- Combine with static analysis tools (hybrid approach)
- Expand to multi-language support (Python, JavaScript, Java)
- Active learning on uncertain samples
- Longer training (10 epochs instead of 3)
Full writeup on Medium:
👉 Building a Multi-Model Ensemble for Code Vulnerability Detection: Lessons from Fine-Tuning CodeBERT, GraphCodeBERT, and CodeT5
The article covers:
- Why the ensemble failed and what I learned
- Data leakage horror story (and how I fixed it)
- Comparison with commercial tools
- Honest discussion of when to ship vs. optimize
- Microsoft Research for the Devign dataset (paper)
- HuggingFace for transformer implementations
- Kaggle for free T4 GPU compute (5 hours of training)
- Author: Amr Gaber
- Medium: @amrgabeerr20
- Devign: Graph Neural Networks for Vulnerability Detection
- CodeBERT: CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- GraphCodeBERT: GraphCodeBERT: Pre-training Code Representations with Data Flow
- CodeT5: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models
If this project helped you learn about ML engineering, star it! ⭐
Questions? Open an issue or read the Medium article.