JavaMLBugDetective is a machine learning-aided bug prediction framework for Java projects. It combines static code analysis, process metrics, and evolutionary context modeling to predict defect-prone code.
Developed as part of Ph.D. research at Dokuz Eylül University, this framework is actively maintained and continues to evolve.
# Clone the repository
git clone https://github.com/ttaymaz/JavaMLBugDetective.git
cd JavaMLBugDetective
# Configure your target repository
cp sample.config.properties config.properties
# Edit config.properties with your settings
# Run the analysis pipeline
chmod +x clean_and_run.sh
./clean_and_run.sh

- SZZ Algorithm: Identifies bug-introducing commits via enhanced pattern matching
- Version-Based Validation: Uses Git tags for realistic, chronological evaluation
- Hybrid Metrics: Combines process, static, and diff/churn metrics
  - Process: NR, NDEV, AGE, EXP
  - Static (CK suite): WMC, CBO, RFC, LCOM, CYCLO
  - Diff/Churn: LINES_ADDED, LINES_DELETED, HUNK_COUNT
- ML Pipeline: RandomForest, J48, NaiveBayes, SMO (via Weka)
- Class Balancing: SMOTE and ClassBalancer
- Cost-Sensitive Learning: Configurable FN/FP cost matrix
- Automated Reporting: Scientific validation and prediction reports
- Green AI: 32,000x more energy-efficient than LLM-based approaches
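
To make the version-based validation idea from the list above concrete, here is a minimal sketch rather than the framework's actual implementation. It assumes every instance carries the index of the release tag it belongs to (oldest release = 0); the `VersionSplitSketch` class, the `Instance` and `Split` records, and the `split` method are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: chronological train/test split by release index.
class VersionSplitSketch {

    // One labeled file snapshot; versionIndex orders releases from oldest (0) to newest.
    record Instance(String file, int versionIndex, boolean buggy) {}

    // Holds the two partitions produced by the split.
    record Split(List<Instance> train, List<Instance> test) {}

    // Train on every release strictly older than testVersion, test on testVersion itself.
    static Split split(List<Instance> all, int testVersion) {
        List<Instance> train = new ArrayList<>();
        List<Instance> test = new ArrayList<>();
        for (Instance inst : all) {
            if (inst.versionIndex() < testVersion) {
                train.add(inst);
            } else if (inst.versionIndex() == testVersion) {
                test.add(inst);
            }
            // Instances from newer releases are left out entirely.
        }
        return new Split(train, test);
    }

    public static void main(String[] args) {
        List<Instance> data = List.of(
                new Instance("Parser.java", 0, false),
                new Instance("Parser.java", 1, true),
                new Instance("Lexer.java", 2, true));
        Split s = split(data, 2);
        System.out.println("train=" + s.train().size() + ", test=" + s.test().size());
    }
}
```

Unlike a random split, this never lets the model see data from the release it is evaluated on or from any later one, which is what makes the evaluation chronological.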
JavaMLBugDetective/
├── src/main/java/org/tymz/
│   ├── config/       # Configuration management
│   ├── db/           # SQLite database operations
│   ├── feature/      # Data preprocessing
│   ├── git/          # JGit repository operations
│   ├── main/         # Application entry point
│   ├── metric/       # Metric calculators
│   ├── ml/           # Weka ML training
│   ├── report/       # Report generation
│   ├── szz/          # SZZ algorithm
│   └── version/      # Version management
├── src/test/         # Unit tests
├── pom.xml           # Maven configuration
├── config.properties # Analysis settings
└── clean_and_run.sh  # Pipeline script
Edit config.properties to configure your analysis:
# Target repository
repository.url=https://github.com/your-org/your-project.git
repository.local.path=./repositories/your-project
project.name=your-project
# SZZ settings
szz.bug_fix_keywords=fix,bug,issue,defect,error,fault,problem,crash,exception
# ML settings
ml.algorithm=all
ml.balance.classes=true
ml.validation.strategy=version-based
ml.smote.enabled=true
# Cost-sensitive learning
# False Negative cost
ml.cost.fn=10.0
# False Positive cost
ml.cost.fp=1.0

github.username=your-username
github.token=ghp_your_token_here

Note: config.properties is excluded from Git via .gitignore.
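
As a rough illustration of how settings like these could be consumed, the sketch below parses a few of the sample keys with the standard `java.util.Properties` class and uses the keyword list to flag commit messages that look like bug fixes. It is not the framework's own loader or SZZ matcher: `ConfigSketch` and its methods are hypothetical, and plain substring matching is an assumption. Note that `java.util.Properties` only treats `#` as a comment marker at the start of a line, so the sketch keeps comments on their own lines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch: read sample settings and flag bug-fix commit messages.
class ConfigSketch {

    // Parses properties text; a real run would read config.properties from disk instead.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits the comma-separated szz.bug_fix_keywords value into lowercase keywords.
    static List<String> keywords(Properties props) {
        String raw = props.getProperty("szz.bug_fix_keywords", "");
        return Arrays.stream(raw.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .toList();
    }

    // True when the message contains any keyword (case-insensitive substring match).
    static boolean looksLikeBugFix(String message, List<String> keywords) {
        String lower = message.toLowerCase(Locale.ROOT);
        return keywords.stream().anyMatch(lower::contains);
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "szz.bug_fix_keywords=fix,bug,issue,defect,error,fault,problem,crash,exception",
                "# False Negative cost",
                "ml.cost.fn=10.0");
        Properties props = parse(sample);
        List<String> kw = keywords(props);
        System.out.println(looksLikeBugFix("Fix NPE in parser", kw));
        System.out.println(looksLikeBugFix("Update README", kw));
        System.out.println(Double.parseDouble(props.getProperty("ml.cost.fn")));
    }
}
```

Substring matching is deliberately naive: a message containing "prefix" would also match the keyword `fix`, so a stricter matcher would compare whole words.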
| Output | Description |
|---|---|
| `[project]-dataset.arff` | ML dataset with all metrics |
| `reports/[project]-report-*.md` | Scientific validation report |
| `reports/[project]-prediction-*.md` | Bug prediction report |
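
For readers unfamiliar with Weka's ARFF format, the sketch below builds a tiny file of the general shape such a dataset takes. The metric names are borrowed from the feature list in this README, but their order, the `buggy` class attribute and its values, and the `ArffSketch` class itself are assumptions for illustration; the framework's real schema may differ.

```java
import java.util.List;

// Hypothetical sketch: emit a minimal ARFF document with numeric metrics and a nominal class.
class ArffSketch {

    // Builds the header: relation name, one numeric attribute per metric, then the class attribute.
    static String header(String relation, List<String> metrics) {
        StringBuilder sb = new StringBuilder("@relation " + relation + "\n\n");
        for (String metric : metrics) {
            sb.append("@attribute ").append(metric).append(" numeric\n");
        }
        sb.append("@attribute buggy {true,false}\n\n@data\n");
        return sb.toString();
    }

    // Formats one instance as a comma-separated data row ending with the class label.
    static String row(double[] values, boolean buggy) {
        StringBuilder sb = new StringBuilder();
        for (double v : values) {
            sb.append(v).append(',');
        }
        return sb.append(buggy).append('\n').toString();
    }

    public static void main(String[] args) {
        List<String> metrics = List.of("NR", "NDEV", "WMC", "CBO", "LINES_ADDED");
        System.out.print(header("your-project", metrics));
        System.out.print(row(new double[] {12, 3, 27, 8, 140}, true));
    }
}
```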
Evaluation metrics for five algorithms on the 'buggy' target class. Models were evaluated using 10-fold cross-validation, cost-sensitive classification (10:1 FN:FP cost ratio), and SMOTE class balancing.
| Algorithm | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|
| RandomForest | 0.5111 | 0.9916 | 0.6745 | 0.3175 |
| J48 | 0.5341 | 0.8895 | 0.6675 | 0.2883 |
| NaiveBayes | 0.4633 | 0.9179 | 0.6158 | 0.0672 |
| SMO | 0.4518 | 0.9996 | 0.6223 | -0.0017 |
| AdaBoostM1 | 0.4518 | 1.0000 | 0.6224 | NaN |
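Note: AdaBoostM1 collapses to labeling every instance as buggy (recall 1.0, undefined MCC), and SMO comes close (MCC near zero). Sequential boosting appears to break down under the extreme SZZ label noise combined with SMOTE, while the parallel ensemble (RandomForest) achieves the highest MCC and best isolates the defect signal.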
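
The relationship between the columns in the table above can be checked with a short sketch. It computes precision, recall, F1, and MCC from confusion-matrix counts using the standard formulas, and shows why a classifier that labels everything as buggy ends up with an MCC of NaN. The `MetricsSketch` class is hypothetical, and the counts in `main` are invented; they were chosen only so that the share of buggy instances equals the 0.4518 precision reported for AdaBoostM1.

```java
// Sketch: metrics for the 'buggy' class computed from confusion-matrix counts.
class MetricsSketch {

    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    // Matthews correlation coefficient; evaluates to NaN (0.0 / 0.0)
    // whenever one of the four marginal totals is zero.
    static double mcc(long tp, long fp, long tn, long fn) {
        double numerator = (double) tp * tn - (double) fp * fn;
        double denominator = Math.sqrt(
                (double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Invented counts: 4,518 buggy and 5,482 clean instances, all predicted buggy.
        long tp = 4518, fp = 5482, tn = 0, fn = 0;
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.printf("precision=%.4f recall=%.4f f1=%.4f mcc=%s%n",
                p, r, f1(p, r), mcc(tp, fp, tn, fn));
    }
}
```

With no predicted negatives, the term `(tn + fn)` is zero, so both the numerator and the denominator of the MCC vanish. That matches the pattern in the AdaBoostM1 row: perfect recall, precision equal to the base rate, and an undefined MCC.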
Cross-project validation results (Hybrid Model with Cost-Sensitive Learning):
| Project | F1-Score | Precision | Recall | Instances |
|---|---|---|---|---|
| Apache Kafka | 0.742 | 0.61 | 0.94 | 72,705 |
| Google Gson | 0.685 | 0.52 | 0.99 | 6,034 |
| Apache Commons-IO | 0.570 | 0.40 | 0.99 | 12,920 |
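
The high-recall, lower-precision pattern in these results is what a 10:1 FN:FP cost ratio would be expected to produce. How the framework applies its cost matrix internally is not described here, so the sketch below shows only one common interpretation, the minimum-expected-cost decision rule, with a hypothetical `CostThresholdSketch` class: predict "buggy" whenever doing so has the lower expected cost.

```java
// Hypothetical sketch: minimum-expected-cost decision for a binary buggy/clean prediction.
class CostThresholdSketch {

    // Probability of 'buggy' above which predicting 'buggy' is the cheaper choice.
    static double threshold(double costFn, double costFp) {
        return costFp / (costFp + costFn);
    }

    // Compares the expected cost of the two possible predictions.
    static boolean predictBuggy(double pBuggy, double costFn, double costFp) {
        double costIfPredictClean = pBuggy * costFn;         // risk of missing a real bug
        double costIfPredictBuggy = (1.0 - pBuggy) * costFp; // risk of a false alarm
        return costIfPredictBuggy < costIfPredictClean;
    }

    public static void main(String[] args) {
        System.out.printf("threshold=%.4f%n", threshold(10.0, 1.0));
        System.out.println(predictBuggy(0.20, 10.0, 1.0));
        System.out.println(predictBuggy(0.05, 10.0, 1.0));
    }
}
```

With costs of 10 and 1, the break-even probability is 1/11 (about 0.09), so under this rule even a class with a 20% estimated bug probability is flagged. This trades false alarms for fewer missed defects.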
Ablation Study Highlights:
- Hybrid model outperforms static-only by up to 128% (Commons-IO)
- Process metrics consistently outperform static metrics
- Model maintains robust performance despite 70.8% label noise
- Java: JDK 21+
- Maven: 3.9+
- Git: For repository operations
- RAM: 4GB+ (recommended for large repos)
- Eclipse JGit: Git operations
- PMD: Static code analysis
- Weka: Machine learning
- SQLite JDBC: Data persistence
The JML-BugDB dataset and complete replication package are permanently archived on Zenodo.
The package includes:
- JML-BugDB dataset (91,633 instances across 3 Java projects)
- Manual validation data and methodology
- Framework source code snapshot
- Replication instructions
If you use this work in your research, please cite:
@software{taymaz2026jmlbugdetective,
author = {Taymaz, Turgay and Birant, Kökten Ulaş},
title = {JavaMLBugDetective: ML-Aided Bug Prediction Framework},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19218373},
url = {https://doi.org/10.5281/zenodo.19218373}
}

Turgay Taymaz - Developer & Researcher
Assoc. Prof. Dr. Kökten Ulaş Birant - Advisor
Dokuz Eylül University, The Graduate School of Natural and Applied Sciences
Contributions are welcome! Please:
- Open an issue for bugs or feature requests
- Submit pull requests for improvements
Contact: turgay[at]taymaz.org
This project is released under the MIT License.
Last Updated: March 2026