Experimental multi-model ensemble system for detecting vulnerabilities in C code using fine-tuned transformer models.
📖 Read the full story: Building a Multi-Model Ensemble for Code Vulnerability Detection
This project explores using state-of-the-art code understanding models (CodeBERT, GraphCodeBERT, CodeT5) to detect security vulnerabilities in source code. The models are fine-tuned on Microsoft's Devign dataset of 21K+ labeled C functions.
Key learnings from this project:
- Multi-model ensemble challenges and calibration issues
- The critical importance of data quality and proper evaluation
- Why 66% accuracy on a real problem teaches more than 99% on toy datasets
- Production ML is 80% engineering, 20% modeling
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| CodeBERT | 66.35% | 68.42% | 47.97% | 56.40% |
| GraphCodeBERT | 61.66% | 55.57% | 77.23% | 64.63% |
| CodeT5 | 56.68% | 51.71% | 68.14% | 58.80% |
Baseline comparison: ~12 percentage points over a regex-pattern baseline (54% accuracy)
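As a sanity check, the F1 column is just the harmonic mean of precision and recall, which the table's numbers reproduce:

```python
# Verify the reported F1 scores: F1 = 2PR / (P + R),
# the harmonic mean of precision (P) and recall (R).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f"CodeBERT:      {f1(68.42, 47.97):.2f}")   # 56.40
print(f"GraphCodeBERT: {f1(55.57, 77.23):.2f}")   # 64.63
print(f"CodeT5:        {f1(51.71, 68.14):.2f}")   # 58.80
```

Note how the F1 ranking differs from the accuracy ranking: CodeBERT wins on accuracy, GraphCodeBERT on F1.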
```bash
# Setup
git clone https://github.com/yourusername/vulnai.git
cd vulnai
pip install -r requirements.txt

# Download and preprocess the data
python data/download_data.py
python data/preprocess.py

# Train individual models
python src/train.py --model codebert
python src/train.py --model graphcodebert
python src/train.py --model codet5

# Or train all at once
python scripts/train_all_models.py
```

```python
from src.inference import VulnerabilityDetector

detector = VulnerabilityDetector(model_name='graphcodebert')

code = '''
def get_user(user_id):
    query = "SELECT * FROM users WHERE id=" + user_id
    return execute(query)
'''

result = detector.predict(code)
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2%}")
# Output: Prediction: VULNERABLE, Confidence: 87%
```

CLI usage:
```bash
# Analyze a code string
python -m src.inference --code "query = 'SELECT * WHERE id=' + uid"

# Analyze a file
python -m src.inference --file vulnerable_code.c --model graphcodebert
```

**CodeBERT**
- Architecture: BERT pre-trained on code (6 programming languages)
- Parameters: 125M
- Strength: Keyword-based patterns (e.g., `strcpy`, `eval`)
- Weakness: Misses structural vulnerabilities
**GraphCodeBERT**
- Architecture: Graph-aware BERT (understands data flow)
- Parameters: 125M
- Strength: Structural vulnerabilities (loops without bounds, control flow issues)
- Weakness: Slower inference due to graph construction
**CodeT5**
- Architecture: T5 encoder-decoder adapted for code
- Parameters: 220M
- Strength: Larger capacity for complex patterns
- Weakness: Encoder-decoder not ideal for classification tasks
Devign (Microsoft Research, 2019)
- Size: 21,854 C functions from real projects
- Source: QEMU, FFmpeg, Linux kernel, Pidgin
- Labels: Binary (vulnerable / safe)
- Split: 80% train, 10% val, 10% test
- Quality assurance: Zero data leakage verified (aggressive deduplication)
❌ Ensemble failed (50% accuracy = random guessing)
- Root cause: Model disagreement without proper calibration
- Learning: Ensemble methods require careful probability calibration
- Fix needed: Platt scaling or meta-model stacking
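The calibration fix can be sketched in a few lines: fit a two-parameter sigmoid p = 1 / (1 + exp(-(a·s + b))) on each model's held-out validation scores, then average calibrated probabilities instead of raw scores. This is an illustrative stand-alone version on synthetic scores, not the project's code:

```python
import math
import random

# Platt scaling sketch (synthetic data, illustrative only): fit
# p = 1 / (1 + exp(-(a*score + b))) on held-out scores so that each
# model's raw scores become comparable probabilities before ensembling.

def platt_fit(scores, labels, lr=0.1, epochs=1000):
    """Fit sigmoid parameters (a, b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s   # d(logloss)/da
            grad_b += (p - y)       # d(logloss)/db
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_apply(score, a, b):
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Synthetic validation scores: positives around +2, negatives around -2
random.seed(0)
scores = [random.gauss(2, 1) for _ in range(200)] + [random.gauss(-2, 1) for _ in range(200)]
labels = [1] * 200 + [0] * 200
a, b = platt_fit(scores, labels)
print(platt_apply(3.0, a, b))   # confident positive score -> probability near 1
print(platt_apply(-3.0, a, b))  # confident negative score -> probability near 0
```

Once every model is calibrated this way, averaged probabilities actually mean the same thing across models; scikit-learn's `CalibratedClassifierCV(method='sigmoid')` implements the same idea.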
❌ CodeT5 underperformed despite more parameters
- Root cause: Encoder-decoder architecture not suited for classification
- Learning: Model design matters more than parameter count
❌ Gap to commercial tools (80-88% accuracy)
- Root cause: Limited training data, no domain-specific tuning
- Learning: Commercial tools have years of optimization and larger datasets
✅ Proper data preprocessing prevented inflated results
- Initial 100% accuracy revealed data leakage (296 overlapping samples)
- Aggressive deduplication and different random seeds fixed it
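A hash-based cross-split check along these lines is one way to catch such overlaps (helper names here are hypothetical; the project's actual preprocessing may differ):

```python
import hashlib

# Leakage-check sketch (hypothetical helpers): fingerprint each function
# with all whitespace stripped, then drop any training sample whose
# fingerprint also appears in the validation or test split.

def code_hash(code: str) -> str:
    normalized = "".join(code.split())  # whitespace-insensitive fingerprint
    return hashlib.sha256(normalized.encode()).hexdigest()

def deduplicate(train, val, test):
    held_out = {code_hash(c) for c in val} | {code_hash(c) for c in test}
    return [c for c in train if code_hash(c) not in held_out]

train_split = ["int f(int a) { return a; }", "void g() { leak(); }"]
val_split   = ["void g(){leak();}"]   # same function, different whitespace
test_split  = []
clean_train = deduplicate(train_split, val_split, test_split)
print(clean_train)   # only the non-overlapping function survives
```

The whitespace-stripped hash catches reformatted duplicates, which plain string equality would miss.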
✅ GraphCodeBERT's structural understanding proved valuable
- Best F1 score (64.63%) despite same parameter count as CodeBERT
- Confirms: code structure matters for vulnerability detection
✅ Complete pipeline demonstrates production thinking
- Data → Training → Evaluation → Inference
- 80% of work was engineering, not modeling
"Production ML is 20% modeling, 80% infrastructure" — This project proved it
- Data quality > model architecture: clean, deduplicated data matters more than parameter count
- The right metric matters: accuracy misled on imbalanced data; F1 told the truth
- Ensembles aren't magic: they require calibration and can fail spectacularly
- Ship it: perfect is the enemy of done; 66% accuracy teaches more than endless optimization
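The metric point is easy to demonstrate on a toy 90/10 split: a degenerate classifier that always predicts "safe" looks strong on accuracy while finding zero vulnerabilities:

```python
# Why accuracy misleads on imbalanced labels: always predicting the
# majority class ("safe") scores 90% accuracy but 0% recall on the
# vulnerable class -- and therefore an F1 of 0.
labels = [0] * 90 + [1] * 10   # 0 = safe, 1 = vulnerable
preds  = [0] * 100             # degenerate "always safe" classifier

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = tp / sum(labels)
print(accuracy, recall)   # 0.9 0.0
```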
If I were to continue this project:
- Fix ensemble with Platt scaling or stacking
- Implement Graph Neural Networks for AST/CFG analysis
- Add explainability (attention visualization, LIME)
- Combine with static analysis tools (hybrid approach)
- Expand to multi-language support (Python, JavaScript, Java)
- Active learning on uncertain samples
- Longer training (10 epochs instead of 3)
Full writeup on Medium:
👉 Building a Multi-Model Ensemble for Code Vulnerability Detection: Lessons from Fine-Tuning CodeBERT, GraphCodeBERT, and CodeT5
The article covers:
- Why the ensemble failed and what I learned
- Data leakage horror story (and how I fixed it)
- Comparison with commercial tools
- Honest discussion of when to ship vs. optimize
- Microsoft Research for the Devign dataset (paper)
- HuggingFace for transformer implementations
- Kaggle for free T4 GPU compute (5 hours of training)
- Author: Amr Gaber
- Medium: @amrgabeerr20
- Devign: Graph Neural Networks for Vulnerability Detection
- CodeBERT: CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- GraphCodeBERT: GraphCodeBERT: Pre-training Code Representations with Data Flow
- CodeT5: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models
If this project helped you learn about ML engineering, star it! ⭐
Questions? Open an issue or read the Medium article.