JavaMLBugDetective

JavaMLBugDetective is a machine learning-aided bug prediction framework for Java projects. It combines static code analysis, process metrics, and evolutionary context modeling to predict defect-prone code.

Developed as part of Ph.D. research at Dokuz Eylül University, this framework is actively maintained and continues to evolve.


🚀 Quick Start

# Clone the repository
git clone https://github.com/ttaymaz/JavaMLBugDetective.git
cd JavaMLBugDetective

# Configure your target repository
cp sample.config.properties config.properties
# Edit config.properties with your settings

# Run the analysis pipeline
chmod +x clean_and_run.sh
./clean_and_run.sh

✨ Key Features

  • SZZ Algorithm: Identifies bug-introducing commits via enhanced pattern matching
  • Version-Based Validation: Uses Git tags for realistic, chronological evaluation
  • Hybrid Metrics: Combines process, static, and diff/churn metrics
    • Process: NR, NDEV, AGE, EXP
    • Static (CK suite): WMC, CBO, RFC, LCOM, CYCLO
    • Diff/Churn: LINES_ADDED, LINES_DELETED, HUNK_COUNT
  • ML Pipeline: RandomForest, J48, NaiveBayes, SMO (via Weka)
  • Class Balancing: SMOTE and ClassBalancer
  • Cost-Sensitive Learning: Configurable FN/FP cost matrix
  • Automated Reporting: Scientific validation and prediction reports
  • Green AI: 32,000x more energy-efficient than LLM-based approaches
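To make the version-based validation idea concrete, here is a minimal sketch of a chronological split. It assumes each instance carries the rank of its release tag in time order; the class, field, and method names are hypothetical and are not taken from the framework's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a chronological, release-based train/test split. */
class VersionSplit {
    /** One labeled file snapshot; releaseIndex is the rank of its Git tag in chronological order. */
    static final class Instance {
        final String file;
        final int releaseIndex;
        final boolean buggy;

        Instance(String file, int releaseIndex, boolean buggy) {
            this.file = file;
            this.releaseIndex = releaseIndex;
            this.buggy = buggy;
        }
    }

    final List<Instance> train = new ArrayList<>();
    final List<Instance> test = new ArrayList<>();

    /** Trains on every release strictly before testRelease and tests on testRelease itself. */
    static VersionSplit at(List<Instance> all, int testRelease) {
        VersionSplit split = new VersionSplit();
        for (Instance instance : all) {
            if (instance.releaseIndex < testRelease) {
                split.train.add(instance);
            } else if (instance.releaseIndex == testRelease) {
                split.test.add(instance);
            }
            // Instances from later releases are dropped so no future data leaks into training.
        }
        return split;
    }
}
```

Unlike a random split, this keeps every training instance older than every test instance, which is closer to how a predictor would be used before a new release.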

πŸ“ Project Structure

JavaMLBugDetective/
├── src/main/java/org/tymz/
│   ├── config/        # Configuration management
│   ├── db/            # SQLite database operations
│   ├── feature/       # Data preprocessing
│   ├── git/           # JGit repository operations
│   ├── main/          # Application entry point
│   ├── metric/        # Metric calculators
│   ├── ml/            # Weka ML training
│   ├── report/        # Report generation
│   ├── szz/           # SZZ algorithm
│   └── version/       # Version management
├── src/test/          # Unit tests
├── pom.xml            # Maven configuration
├── config.properties  # Analysis settings
└── clean_and_run.sh   # Pipeline script

βš™οΈ Configuration

Edit config.properties to configure your analysis:

# Target repository
repository.url=https://github.com/your-org/your-project.git
repository.local.path=./repositories/your-project
project.name=your-project

# SZZ settings
szz.bug_fix_keywords=fix,bug,issue,defect,error,fault,problem,crash,exception

# ML settings
ml.algorithm=all
ml.balance.classes=true
ml.validation.strategy=version-based
ml.smote.enabled=true

# Cost-sensitive learning
# False Negative cost
ml.cost.fn=10.0
# False Positive cost
ml.cost.fp=1.0
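One standard way to read an FN/FP cost pair is as a minimum-expected-cost decision rule: predict "buggy" whenever a miss would cost more in expectation than a false alarm. The sketch below works through that arithmetic. It is an illustration of the general technique only; whether the framework applies costs this way or by reweighting training instances is not stated here, and the class and method names are hypothetical.

```java
/** Hypothetical sketch of minimum-expected-cost classification with an FN/FP cost pair. */
class CostThreshold {
    /** Probability above which predicting "buggy" is cheaper in expectation than predicting "clean". */
    static double threshold(double costFn, double costFp) {
        // Solve p * costFn = (1 - p) * costFp for p.
        return costFp / (costFp + costFn);
    }

    /** Predicts buggy when the expected cost of a miss exceeds the expected cost of a false alarm. */
    static boolean predictBuggy(double pBuggy, double costFn, double costFp) {
        double costIfPredictClean = pBuggy * costFn;          // risk of a false negative
        double costIfPredictBuggy = (1.0 - pBuggy) * costFp;  // risk of a false positive
        return costIfPredictBuggy < costIfPredictClean;
    }
}
```

With the sample values (FN cost 10.0, FP cost 1.0), the threshold is 1/11, roughly 0.091, so under this rule a file with an estimated defect probability of 0.15 would be flagged and one with 0.05 would not. A threshold this low trades precision for recall.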

Private Repository Support

github.username=your-username
github.token=ghp_your_token_here

Note: config.properties is excluded from Git via .gitignore, so credentials stored in it are not committed.
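The szz.bug_fix_keywords setting in the sample configuration suggests that bug-fix commits are recognized from their commit messages. The sketch below shows one simple way such a comma-separated list could be turned into a matcher. It is a simplified stand-in, not the framework's "enhanced pattern matching", and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch: flag bug-fix commits by matching keywords in commit messages. */
class BugFixKeywordMatcher {
    private final Pattern pattern;

    /** @param commaSeparated a value in the style of szz.bug_fix_keywords, e.g. "fix,bug,issue" */
    BugFixKeywordMatcher(String commaSeparated) {
        String alternation = Arrays.stream(commaSeparated.split(","))
                .map(String::trim)
                .filter(keyword -> !keyword.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // \b requires the keyword to start a word, so "prefix" and "debug" do not match,
        // while \w* still accepts inflections such as "fixed" or "bugs".
        this.pattern = Pattern.compile("\\b(?:" + alternation + ")\\w*", Pattern.CASE_INSENSITIVE);
    }

    /** Returns true when the commit message contains a word starting with any keyword. */
    boolean isBugFix(String commitMessage) {
        return pattern.matcher(commitMessage).find();
    }
}
```

Keyword matching of this kind is cheap but noisy: a message such as "Fix typo in README" would still be flagged, which is one reason SZZ-derived labels need validation.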


📊 Outputs

| Output | Description |
|---|---|
| [project]-dataset.arff | ML dataset with all metrics |
| reports/[project]-report-*.md | Scientific validation report |
| reports/[project]-prediction-*.md | Bug prediction report |

📈 Verified Results

Algorithm Robustness Benchmark (Gson Project)

The table reports precision, recall, F1-score, and MCC for the 'buggy' target class across five algorithms. All models were evaluated with 10-fold cross-validation, cost-sensitive classification (10:1 FN:FP cost ratio), and SMOTE class balancing.

| Algorithm | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|
| RandomForest | 0.5111 | 0.9916 | 0.6745 | 0.3175 |
| J48 | 0.5341 | 0.8895 | 0.6675 | 0.2883 |
| NaiveBayes | 0.4633 | 0.9179 | 0.6158 | 0.0672 |
| SMO | 0.4518 | 0.9996 | 0.6223 | -0.0017 |
| AdaBoostM1 | 0.4518 | 1.0000 | 0.6224 | NaN |

Note: Sequential boosting algorithms like AdaBoost collapse under the extreme SZZ label noise combined with SMOTE: AdaBoostM1 labels every instance as buggy (recall 1.0, undefined MCC). Parallel ensembles such as RandomForest retain the strongest defect signal (MCC 0.3175).
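The NaN in the MCC column follows from how the metric is defined. The sketch below computes the four reported metrics from confusion-matrix counts using the standard formulas; the class name is hypothetical, and no counts from the actual experiments are assumed.

```java
/** Sketch: precision, recall, F1, and MCC for the positive ("buggy") class. */
class BinaryMetrics {
    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    /** Matthews correlation coefficient; NaN when any row or column of the matrix is empty. */
    static double mcc(long tp, long tn, long fp, long fn) {
        double numerator = (double) tp * tn - (double) fp * fn;
        double denominator = Math.sqrt(
                (double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return numerator / denominator; // 0.0 / 0.0 evaluates to NaN in Java
    }
}
```

When a model predicts every instance as buggy, TN and FN are both zero, so the numerator and denominator of MCC are both zero and the result is NaN, while recall is 1.0 and precision equals the share of buggy instances. As a consistency check, f1(0.5111, 0.9916) evaluates to about 0.6745, matching the RandomForest row.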

Cross-Project Validation Results

Results for the hybrid model with cost-sensitive learning:

| Project | F1-Score | Precision | Recall | Instances |
|---|---|---|---|---|
| Apache Kafka | 0.742 | 0.61 | 0.94 | 72,705 |
| Google Gson | 0.685 | 0.52 | 0.99 | 6,034 |
| Apache Commons-IO | 0.570 | 0.40 | 0.99 | 12,920 |

Ablation Study Highlights:

  • Hybrid model outperforms static-only by up to 128% (Commons-IO)
  • Process metrics consistently outperform static metrics
  • Model maintains robust performance despite 70.8% label noise

🔧 Requirements

  • Java: JDK 21+
  • Maven: 3.9+
  • Git: For repository operations
  • RAM: 4GB+ (recommended for large repos)

📦 Dependencies

  • Eclipse JGit: Git operations
  • PMD: Static code analysis
  • Weka: Machine learning
  • SQLite JDBC: Data persistence

📚 Dataset & Replication Package

The JML-BugDB dataset and complete replication package are permanently archived at Zenodo.

The package includes:

  • JML-BugDB dataset (91,633 instances across 3 Java projects)
  • Manual validation data and methodology
  • Framework source code snapshot
  • Replication instructions

📖 Citation

If you use this work in your research, please cite:

@software{taymaz2026jmlbugdetective,
  author    = {Taymaz, Turgay and Birant, Kökten Ulaş},
  title     = {JavaMLBugDetective: ML-Aided Bug Prediction Framework},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19218373},
  url       = {https://doi.org/10.5281/zenodo.19218373}
}

👥 Authors

Turgay Taymaz (Developer & Researcher)
Assoc. Prof. Dr. Kökten Ulaş Birant (Advisor)

Dokuz Eylül University, The Graduate School of Natural and Applied Sciences


🤝 Contributing

Contributions are welcome! Please:

  1. Open an issue for bugs or feature requests
  2. Submit pull requests for improvements

Contact: turgay[at]taymaz.org


📄 License

This project is released under the MIT License.


Last Updated: March 2026
