
QML Fraud Detection Benchmark

Academic Context

Course: Industry Project VI, 6th Semester
Institution: IU International University of Applied Sciences
Supervisor: Wolfgang Zollner
Student: Gregor Kobilarov
Dataset: Kaggle Credit Card Fraud Detection (n = 284,807)

Research Question

"To what extent can Quantum Machine Learning algorithms, given current NISQ constraints, represent a competitive alternative to classical methods for fraud detection?"


Key Hypotheses

Three central hypothesis pairs are derived from theoretical analysis of NISQ hardware constraints, the mathematical structure of quantum feature spaces, and the specifics of fraud detection.

H1: Classification Performance Under Extreme Class Imbalance

H0₁: There is no statistically significant difference in MCC between QSVM and the classical baseline XGBoost.

H1₁: QSVM achieves a significantly different MCC from XGBoost.

Rationale: The primary metric is Matthews Correlation Coefficient (MCC), which is more informative than F1 for heavily imbalanced datasets. This hypothesis directly addresses whether quantum kernels can compete with classical ensemble methods on a real fraud detection task.
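The repository does not pin a specific significance test to H1; one common choice for comparing MCC on a shared test set is a paired bootstrap, sketched below as an illustration (the function name and defaults are assumptions, not the repo's code):

import numpy as np
from sklearn.metrics import matthews_corrcoef

def paired_bootstrap_mcc_diff(y_true, pred_a, pred_b, n_boot=10_000, seed=0):
    """95% bootstrap CI for the MCC difference of two models on one test set.

    All inputs are 1-D numpy arrays of equal length.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample test indices with replacement
        diffs[i] = (matthews_corrcoef(y_true[idx], pred_a[idx])
                    - matthews_corrcoef(y_true[idx], pred_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), (lo, hi)  # H0₁ is rejected if the CI excludes 0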

H2: Noise Robustness Under Realistic Hardware Conditions

H0₂: Quantum models exhibit identical error tolerance across depolarizing noise levels p ∈ [0.0, 0.05].

H1₂: QSVM and VQC exhibit significantly different degradation patterns with increasing depolarizing noise.

Rationale: Mathematical analysis suggests QSVM should degrade gracefully due to kernel structure stability, while VQC is expected to hit a performance floor early due to barren plateau effects. This tests whether theoretical noise resilience translates to empirical robustness.

H3: Operational Efficiency

H0₃: Hybrid quantum-classical methods require equal or less computational time for training and inference than classical ensemble methods.

H1₃: Hybrid quantum-classical methods require significantly more computational time than classical baselines, despite theoretical quantum acceleration potential.

Rationale: Classical simulators execute quantum circuits on CPUs, introducing overhead that negates theoretical speedup. This hypothesis evaluates the practical cost of quantum simulation on available hardware.


Key Findings

Quantum advantage was not demonstrated in this setting.

Model            MCC    PR-AUC   F1-Fraud   ROC-AUC
XGBoost          0.900  0.968    0.924      0.974
Random Forest    0.894  0.974    0.917      0.983
QSVM (p=0.0)     0.871  0.952    0.905      0.963
VQC (p=0.0)      0.626  0.785    0.699      0.875

MCC and PR-AUC are the primary metrics — they account for extreme class imbalance (0.17% fraud). F1-Fraud and ROC-AUC are provided for reference.

  • QSVM is competitive — within 0.029 MCC of XGBoost (0.871 vs 0.900) using only 300 training samples vs ~155,000 — but does not exceed classical baselines.
  • VQC underperforms significantly — 0.274 MCC below XGBoost (0.626 vs 0.900). The bottleneck is model capacity (8 qubits / 30 epochs), not noise.
  • Both quantum models remain below classical at all depolarizing noise levels tested (p = 0.0 → 0.05).

See the noise sweep results notebook for the full analysis.


Overview

This repository implements a rigorous benchmarking framework comparing quantum and classical models on credit card fraud detection:

Model                                   Type       Library
Random Forest                           Classical  scikit-learn
XGBoost                                 Classical  xgboost
Variational Quantum Classifier (VQC)    Quantum    PennyLane
Quantum Support Vector Machine (QSVM)   Quantum    PennyLane + scikit-learn

The benchmark is designed around the specific challenges of financial fraud detection:

  • Extreme class imbalance (~0.17% fraud in the reference dataset)
  • High dimensionality vs. the limited qubit count of NISQ simulators
  • Rigorous metrics — MCC and PR-AUC as primary (robust to class imbalance); F1-Fraud and ROC-AUC as secondary reference
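As a minimal illustration of how these four metrics can be computed with scikit-learn (a sketch, not the repo's evaluation.py):

from sklearn.metrics import (matthews_corrcoef, average_precision_score,
                             f1_score, roc_auc_score)

def fraud_metrics(y_true, y_pred, y_score):
    """y_pred: hard 0/1 labels; y_score: fraud-class probabilities or scores."""
    return {
        "MCC": matthews_corrcoef(y_true, y_pred),            # primary
        "PR-AUC": average_precision_score(y_true, y_score),  # primary
        "F1-Fraud": f1_score(y_true, y_pred, pos_label=1),   # secondary
        "ROC-AUC": roc_auc_score(y_true, y_score),           # secondary
    }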

Dataset Overview

Source: Kaggle — Credit Card Fraud Detection
Size: 284,807 transactions
Features: 30 (28 PCA-anonymised V1–V28 + Amount + Time)
Target: Binary — Fraud (492 cases, 0.17%) / Legitimate (284,315 cases)
Split: Train 68% / Val 12% / Test 20% (stratified)
# Option A – Kaggle CLI
kaggle datasets download -d mlg-ulb/creditcardfraud --path data/raw --unzip

# Option B – Manual
# Download creditcard.csv from Kaggle and place at data/raw/creditcard.csv
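
A minimal sketch of loading the CSV and reproducing a stratified 68/12/20 split (illustrative; the repo's src/data_loader.py and split seed may differ):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/raw/creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]

# 20% test first, then 15% of the remainder as validation: 0.80 × 0.85 = 0.68 train
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15, stratify=y_tmp, random_state=42)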

Project Structure

qml-fraud-detection-benchmark/
├── data/
│   ├── raw/                        # Place creditcard.csv here
│   └── processed/                  # Auto-generated preprocessed arrays
├── notebooks/
│   └── 02_noise_sweep_results.ipynb  # Noise sweep analysis and conclusions
├── results/
│   ├── figures/                    # Benchmark and ablation plots
│   └── noise/
│       └── noise_vs_metric.png     # Noise sweep: MCC and PR-AUC vs p
├── src/
│   ├── data_loader.py              # Dataset verification
│   ├── preprocessing.py            # Scaling · SMOTE · PCA pipeline
│   ├── classical_models.py         # Random Forest & XGBoost
│   ├── quantum_models.py           # VQC & QSVM (PennyLane)
│   └── evaluation.py               # Metrics, plots, comparison tables
├── tests/
├── run_benchmark.py                # Main benchmark entrypoint
├── run_ablation.py                 # Qubit count sweep (4–12 qubits)
├── run_noise.py                    # Depolarizing noise sweep
├── run_noise_parallel.sh           # Parallel noise sweep launcher
├── run_noise_watcher.sh            # Queue-based watcher for long runs
└── run_plots.py                    # Plot generation from saved results

Setup

python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e . --no-deps

Running the Benchmark

# Classical baselines only (fast)
python run_benchmark.py --n-qubits 8 --classical-only --cv-folds 0 --no-plots

# Full benchmark (classical + VQC + QSVM)
python run_benchmark.py --n-qubits 8

# Qubit ablation study
python run_ablation.py

# Depolarizing noise sweep (long — ~3 days on M4 Mac Mini)
bash run_noise_parallel.sh

Preprocessing Pipeline

Challenge                          Solution
Class imbalance (~0.17% fraud)     SMOTE oversampling (training set only)
Outliers in transaction amounts    RobustScaler (median/IQR-based)
High dimensionality (30 features)  PCA to n_qubits components
Data leakage prevention            PCA & scaler fitted on train split only
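
A hedged sketch of this pipeline, following the Scaling · SMOTE · PCA order noted in the project structure (where PCA is fitted relative to oversampling, and all names here, are illustrative; preprocessing.py is authoritative):

from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

def preprocess(X_train, y_train, X_val, X_test, n_qubits=8, seed=42):
    # Scaler is fitted on the training split only to avoid leakage
    scaler = RobustScaler().fit(X_train)
    X_tr = scaler.transform(X_train)

    # SMOTE is applied to the training data only
    X_tr, y_tr = SMOTE(random_state=seed).fit_resample(X_tr, y_train)

    # PCA down to one component per qubit, fitted on training data only
    pca = PCA(n_components=n_qubits).fit(X_tr)
    return (pca.transform(X_tr), y_tr,
            pca.transform(scaler.transform(X_val)),
            pca.transform(scaler.transform(X_test)))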

Quantum Models

Variational Quantum Classifier (VQC)

AngleEmbedding(X) → StronglyEntanglingLayers(weights) → ⟨Z₀⟩
  • 8 qubits, 2 layers, 30 training epochs, Adam optimiser
  • Backend: lightning.qubit (ideal), default.mixed (noisy simulation)
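
A minimal PennyLane sketch of this circuit and loss (the training loop, batching, and exact hyperparameters in quantum_models.py may differ; the square loss below is an assumption):

import pennylane as qml
from pennylane import numpy as pnp

n_qubits, n_layers = 8, 2
dev = qml.device("lightning.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(x, weights):
    qml.AngleEmbedding(x, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))  # ⟨Z₀⟩ as the class score in [-1, 1]

shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
weights = pnp.random.random(shape, requires_grad=True)
opt = qml.AdamOptimizer(stepsize=0.01)

def cost(w, X, y):  # y encoded as -1 (legitimate) / +1 (fraud)
    preds = pnp.stack([circuit(x, w) for x in X])
    return pnp.mean((preds - y) ** 2)

# One optimisation step on a batch (X_batch, y_batch):
# weights = opt.step(lambda w: cost(w, X_batch, y_batch), weights)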

Quantum SVM (QSVM)

  • Quantum kernel: K(x, x') = |⟨φ(x)|φ(x')⟩|²
  • Feature map: double AngleEmbedding (captures 2nd-order feature interactions)
  • Kernel matrix fed to scikit-learn SVC(kernel="precomputed")
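
A minimal sketch of the kernel evaluation and SVC hookup via the adjoint trick: apply φ(x), then φ(x')†, and read off the probability of the all-zeros state, which equals |⟨φ(x')|φ(x)⟩|². The doubled embedding below is an illustrative reading of "double AngleEmbedding"; quantum_models.py is authoritative.

import numpy as np
import pennylane as qml
from sklearn.svm import SVC

n_qubits = 8
dev = qml.device("lightning.qubit", wires=n_qubits)

def feature_map(x):
    # "Double" AngleEmbedding: the features are uploaded twice (assumption)
    qml.AngleEmbedding(x, wires=range(n_qubits))
    qml.AngleEmbedding(x, wires=range(n_qubits))

@qml.qnode(dev)
def overlap(x1, x2):
    feature_map(x1)
    qml.adjoint(feature_map)(x2)
    return qml.probs(wires=range(n_qubits))

def kernel(x1, x2):
    return overlap(x1, x2)[0]  # P(|0...0⟩) = |⟨φ(x2)|φ(x1)⟩|²

# Toy data just to make the sketch runnable end to end
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(30, n_qubits)), rng.integers(0, 2, 30)
X_test = rng.normal(size=(10, n_qubits))

K_train = qml.kernels.square_kernel_matrix(X_train, kernel)
K_test = qml.kernels.kernel_matrix(X_test, X_train, kernel)
y_pred = SVC(kernel="precomputed").fit(K_train, y_train).predict(K_test)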

Noise Sweep

Depolarizing noise was swept across p = [0.0, 0.001, 0.005, 0.01, 0.02, 0.05].
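
In simulation, depolarizing noise can be injected by switching to the density-matrix backend and placing a channel on each wire after each block; a minimal sketch (run_noise.py may place the channels differently):

import pennylane as qml

n_qubits, p = 8, 0.01
dev = qml.device("default.mixed", wires=n_qubits)  # density-matrix simulator

@qml.qnode(dev)
def noisy_circuit(x, weights):
    qml.AngleEmbedding(x, wires=range(n_qubits))
    for w in range(n_qubits):
        qml.DepolarizingChannel(p, wires=w)  # noise after the embedding
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    for w in range(n_qubits):
        qml.DepolarizingChannel(p, wires=w)  # noise after the variational block
    return qml.expval(qml.PauliZ(0))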

Noise sweep results (plot: results/noise/noise_vs_metric.png)

Model  Metric  p=0.0  p=0.001  p=0.01  p=0.05
QSVM   MCC     0.871  0.871    0.871   0.835
QSVM   PR-AUC  0.952  0.952    0.951   0.932
VQC    MCC     0.626  0.626    0.626   –
VQC    PR-AUC  0.785  0.786    0.786   –

QSVM degrades gracefully — MCC drops 0.036 (0.871 → 0.835) and PR-AUC drops 0.020 (0.952 → 0.932) from p=0.0 to p=0.05. Measurable degradation only sets in around p ≈ 0.02, well above the ~p = 0.001 gate error rates of current superconducting hardware. VQC shows no meaningful degradation because it is already at its performance floor (MCC 0.626).

At current best NISQ hardware noise (p ≈ 0.001), QSVM achieves MCC = 0.871, PR-AUC = 0.952 — within 0.029 MCC of XGBoost.


Qubit Ablation Study

An ablation study systematically varies one component to measure its effect — here, qubit count — while keeping everything else fixed. We swept n_qubits ∈ [4, 6, 8, 10, 12] for both VQC and QSVM.

Qubit ablation results (plots in results/figures/)

QSVM is essentially flat across the entire sweep (MCC ~0.83–0.87). More qubits yield no meaningful gain, and performance dips slightly at 10–12 qubits. Two reasons:

  • Concentration of measure. As the quantum circuit grows, kernel values K(x, x') converge toward the same number — data points become indistinguishable in Hilbert space, degrading the kernel's ability to separate fraud from non-fraud. This is a known fundamental limitation of large quantum kernels; a quick empirical check is sketched after this list.
  • Diminishing PCA signal. Each extra qubit adds one more PCA component, but later components capture less and less variance. At 12 qubits the model is partially trained on noise.
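
One way to check the concentration effect empirically (purely illustrative, not part of the repo): compute the spread of off-diagonal kernel entries for random inputs at several qubit counts — a shrinking spread means the kernel values are collapsing toward a single number.

import numpy as np
import pennylane as qml

def offdiag_std(n_qubits, n_samples=20, seed=0):
    """Std. dev. of off-diagonal quantum-kernel entries for random inputs."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, np.pi, size=(n_samples, n_qubits))
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev)
    def overlap(x1, x2):
        qml.AngleEmbedding(x1, wires=range(n_qubits))
        qml.adjoint(qml.AngleEmbedding)(x2, wires=range(n_qubits))
        return qml.probs(wires=range(n_qubits))

    K = np.array([[overlap(a, b)[0] for b in X] for a in X])
    return K[~np.eye(n_samples, dtype=bool)].std()

for n in (4, 8, 12):
    print(f"{n} qubits: off-diagonal std = {offdiag_std(n):.4f}")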

VQC peaks at 8 qubits then drops sharply at 10–12. The same PCA effect applies, compounded by barren plateaus — with deeper circuits the gradient landscape flattens, making training increasingly ineffective.

Takeaway: 8 qubits is the sweet spot for this dataset. More qubits add compute cost with no benefit, and can actively hurt performance.


Running Tests

pytest tests/ -v

Glossary

Machine Learning Terms

Classification: Predicting a category label for each input from labeled training examples
SVM: Support Vector Machine — finds the optimal decision boundary maximising the margin between classes
Kernel: A function measuring similarity between data points, enabling non-linear decision boundaries
PCA: Principal Component Analysis — reduces dimensionality while retaining maximum variance
SMOTE: Synthetic Minority Oversampling Technique — generates synthetic fraud examples to counteract class imbalance
MCC: Matthews Correlation Coefficient — a single balanced metric for binary classification on imbalanced data (range: −1 to +1)
PR-AUC: Area under the Precision-Recall curve — more informative than ROC-AUC for heavily imbalanced datasets
F1-Fraud: F1-score computed only for the fraud class — harmonic mean of fraud precision and fraud recall
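
For reference, MCC is computed from the confusion-matrix counts as

MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))

It is 0 for a model no better than chance and ±1 for perfect agreement or disagreement.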

Quantum Computing Terms

Qubit: Quantum bit — exists in superposition of 0 and 1 simultaneously until measured
Quantum Circuit: A sequence of quantum gates applied to qubits to encode and process information
AngleEmbedding: Encodes classical feature values as rotation angles on qubits
Quantum Kernel: Measures similarity between two data points as the overlap of their quantum states: K(x, x') = |⟨φ(x)|φ(x')⟩|²
Depolarizing Noise: A noise model where each gate randomly applies X, Y, or Z errors with probability p — representative of real NISQ hardware
NISQ: Noisy Intermediate-Scale Quantum — current era of quantum hardware: 50–1000 qubits with non-negligible error rates
Barren Plateau: Phenomenon where gradients vanish exponentially with circuit depth, making VQC training ineffective at scale
Density Matrix Simulator: Simulates quantum noise exactly by tracking the full mixed state — required for noise modelling, but scales as O(4ⁿ)
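
To make the O(4ⁿ) scaling concrete: at 8 qubits a density matrix holds 4⁸ = 65,536 complex entries (~1 MiB at 16 bytes each), while 16 qubits would already need 4¹⁶ ≈ 4.3 × 10⁹ entries (~69 GB), which is one reason noisy simulation is so much more expensive than ideal simulation.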

Limitations

Constrained comparison. Classical models (RF, XGBoost) were trained on the same PCA-reduced feature space as the quantum models — 8 components out of 30 original features. This is necessary for a like-for-like input comparison, but it handicaps classical models that are designed to exploit the full feature set. Unconstrained XGBoost on all 30 features typically achieves MCC above 0.90 and F1-Fraud of 0.93–0.96 on this dataset, making the actual gap wider than measured here. The benchmark answers the question "how close can quantum get given NISQ hardware constraints?" — not "is quantum better than classical in absolute terms?"


Learnings

  1. QSVM is surprisingly noise-tolerant. Its kernel structure buffers it from depolarizing noise far better than the circuit-based VQC. If a quantum model were to be deployed on NISQ hardware, QSVM is the stronger candidate.

  2. VQC's problem is not noise — it's capacity. With 8 qubits and 30 epochs, VQC is underpowered for a 30-feature, heavily imbalanced fraud dataset. Noise makes virtually no difference because the model is already at its floor.

  3. Classical models are not easily displaced. XGBoost and Random Forest are well-matched to tabular fraud data: they handle imbalance, feature interactions, and high dimensionality natively. Quantum models need a structural advantage in the data to compete — which does not exist here.

  4. SMOTE miscalibrates quantum model thresholds. Oversampling shifts the decision boundary such that the default threshold (0.5) produces poor precision/recall trade-offs. Post-training threshold tuning on the validation set is essential; a sketch follows this list.

  5. Parallel execution is necessary for noise sweeps. Each noise level takes 26–37h on an M4 Mac Mini. Running all 6 levels in parallel reduces wall time from ~9 days to ~3.5 days.
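
A minimal sketch of the threshold tuning from point 4, picking the MCC-maximising cutoff on validation scores (illustrative; "model" and the grid are assumptions, not the repo's exact procedure):

import numpy as np
from sklearn.metrics import matthews_corrcoef

def tune_threshold(y_val, val_scores, grid=np.linspace(0.01, 0.99, 99)):
    """Return the decision threshold that maximises MCC on the validation set."""
    mccs = [matthews_corrcoef(y_val, (val_scores >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(mccs))]

# t = tune_threshold(y_val, model.predict_proba(X_val)[:, 1])
# y_pred_test = (model.predict_proba(X_test)[:, 1] >= t).astype(int)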


Author

Gregor Kobilarov

License

MIT
