
QML Fraud Detection Benchmark

Academic Context

Course: Industry Project VI, 6th Semester
Institution: IU International University of Applied Sciences
Supervisor: Wolfgang Zollner
Student: Gregor Kobilarov
Dataset: Kaggle Credit Card Fraud Detection (n = 284,807)

Research Question

"To what extent can Quantum Machine Learning algorithms, given current NISQ constraints, represent a competitive alternative to classical methods for fraud detection?"


Key Hypotheses

Three central hypothesis pairs are derived from theoretical analysis of NISQ hardware constraints, the mathematical structure of quantum feature spaces, and the specifics of fraud detection.

H1: Classification Performance Under Extreme Class Imbalance

H0₁: There is no statistically significant difference in MCC between QSVM and the classical baseline XGBoost.

H1₁: QSVM achieves a significantly different MCC from XGBoost.

Rationale: The primary metric is Matthews Correlation Coefficient (MCC), which is more informative than F1 for heavily imbalanced datasets. This hypothesis directly addresses whether quantum kernels can compete with classical ensemble methods on a real fraud detection task.
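The repository does not pin a specific significance test to H1; one common choice for comparing MCC on a shared test set is a paired bootstrap, sketched below as an illustration (the function name and defaults are assumptions, not the repo's code):

import numpy as np
from sklearn.metrics import matthews_corrcoef

def paired_bootstrap_mcc_diff(y_true, pred_a, pred_b, n_boot=10_000, seed=0):
    """95% bootstrap CI for the MCC difference of two models on one test set.

    All inputs are 1-D numpy arrays of equal length.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample test indices with replacement
        diffs[i] = (matthews_corrcoef(y_true[idx], pred_a[idx])
                    - matthews_corrcoef(y_true[idx], pred_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), (lo, hi)  # H0₁ is rejected if the CI excludes 0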

H2: Noise Robustness Under Realistic Hardware Conditions

H0₂: Quantum models exhibit identical error tolerance across depolarizing noise levels p ∈ [0.0, 0.05].

H1₂: QSVM and VQC exhibit significantly different degradation patterns with increasing depolarizing noise.

Rationale: Mathematical analysis suggests QSVM should degrade gracefully due to kernel structure stability, while VQC is expected to hit a performance floor early due to barren plateau effects. This tests whether theoretical noise resilience translates to empirical robustness.

H3: Operational Efficiency

H0₃: Hybrid quantum-classical methods require equal or less computational time for training and inference than classical ensemble methods.

H1₃: Hybrid quantum-classical methods require significantly more computational time than classical baselines, despite theoretical quantum acceleration potential.

Rationale: Classical simulators execute quantum circuits on CPUs, introducing overhead that negates theoretical speedup. This hypothesis evaluates the practical cost of quantum simulation on available hardware.


Key Findings

Quantum advantage was not demonstrated in this setting.

Model            MCC    PR-AUC   F1-Fraud   ROC-AUC
XGBoost          0.900  0.968    0.924      0.974
Random Forest    0.894  0.974    0.917      0.983
QSVM (p=0.0)     0.871  0.952    0.905      0.963
VQC (p=0.0)      0.626  0.785    0.699      0.875

MCC and PR-AUC are the primary metrics — they account for extreme class imbalance (0.17% fraud). F1-Fraud and ROC-AUC are provided for reference.

  • QSVM is competitive — within 0.029 MCC of XGBoost (0.871 vs 0.900) using only 300 training samples vs ~155,000 — but does not exceed classical baselines.
  • VQC underperforms significantly — 0.274 MCC below XGBoost (0.626 vs 0.900). The bottleneck is model capacity (8 qubits / 30 epochs), not noise.
  • Both quantum models remain below classical at all depolarizing noise levels tested (p = 0.0 → 0.05).

See the noise sweep results notebook for the full analysis.


Overview

This repository implements a rigorous benchmarking framework comparing quantum and classical models on credit card fraud detection:

Model                                   Type       Library
Random Forest                           Classical  scikit-learn
XGBoost                                 Classical  xgboost
Variational Quantum Classifier (VQC)    Quantum    PennyLane
Quantum Support Vector Machine (QSVM)   Quantum    PennyLane + scikit-learn

The benchmark is designed around the specific challenges of financial fraud detection:

  • Extreme class imbalance (~0.17% fraud in the reference dataset)
  • High dimensionality vs. the limited qubit count of NISQ simulators
  • Rigorous metrics — MCC and PR-AUC as primary (robust to class imbalance); F1-Fraud and ROC-AUC as secondary reference
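As a minimal illustration of how these four metrics can be computed with scikit-learn (a sketch, not the repo's evaluation.py):

from sklearn.metrics import (matthews_corrcoef, average_precision_score,
                             f1_score, roc_auc_score)

def fraud_metrics(y_true, y_pred, y_score):
    """y_pred: hard 0/1 labels; y_score: fraud-class probabilities or scores."""
    return {
        "MCC": matthews_corrcoef(y_true, y_pred),            # primary
        "PR-AUC": average_precision_score(y_true, y_score),  # primary
        "F1-Fraud": f1_score(y_true, y_pred, pos_label=1),   # secondary
        "ROC-AUC": roc_auc_score(y_true, y_score),           # secondary
    }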

Dataset Overview

Source: Kaggle — Credit Card Fraud Detection
Size: 284,807 transactions
Features: 30 (28 PCA-anonymised V1–V28 + Amount + Time)
Target: Binary — Fraud (492 cases, 0.17%) / Legitimate (284,315 cases)
Split: Train 68% / Val 12% / Test 20% (stratified)
# Option A – Kaggle CLI
kaggle datasets download -d mlg-ulb/creditcardfraud --path data/raw --unzip

# Option B – Manual
# Download creditcard.csv from Kaggle and place at data/raw/creditcard.csv
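
A minimal sketch of loading the CSV and reproducing a stratified 68/12/20 split (illustrative; the repo's src/data_loader.py and split seed may differ):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/raw/creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]

# 20% test first, then 15% of the remainder as validation: 0.80 × 0.85 = 0.68 train
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15, stratify=y_tmp, random_state=42)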

Project Structure

qml-fraud-detection-benchmark/
├── data/
│   ├── raw/                        # Place creditcard.csv here
│   └── processed/                  # Auto-generated preprocessed arrays
├── notebooks/
│   └── 02_noise_sweep_results.ipynb  # Noise sweep analysis and conclusions
├── results/
│   ├── figures/                    # Benchmark and ablation plots
│   └── noise/
│       └── noise_vs_metric.png     # Noise sweep: MCC and PR-AUC vs p
├── src/
│   ├── data_loader.py              # Dataset verification
│   ├── preprocessing.py            # Scaling · SMOTE · PCA pipeline
│   ├── classical_models.py         # Random Forest & XGBoost
│   ├── quantum_models.py           # VQC & QSVM (PennyLane)
│   └── evaluation.py               # Metrics, plots, comparison tables
├── tests/
├── run_benchmark.py                # Main benchmark entrypoint
├── run_ablation.py                 # Qubit count sweep (4–12 qubits)
├── run_noise.py                    # Depolarizing noise sweep
├── run_noise_parallel.sh           # Parallel noise sweep launcher
├── run_noise_watcher.sh            # Queue-based watcher for long runs
└── run_plots.py                    # Plot generation from saved results

Setup

python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e . --no-deps

Running the Benchmark

# Classical baselines only (fast)
python run_benchmark.py --n-qubits 8 --classical-only --cv-folds 0 --no-plots

# Full benchmark (classical + VQC + QSVM)
python run_benchmark.py --n-qubits 8

# Qubit ablation study
python run_ablation.py

# Depolarizing noise sweep (long — ~3 days on M4 Mac Mini)
bash run_noise_parallel.sh

Preprocessing Pipeline

Challenge                          Solution
Class imbalance (~0.17% fraud)     SMOTE oversampling (training set only)
Outliers in transaction amounts    RobustScaler (median/IQR-based)
High dimensionality (30 features)  PCA to n_qubits components
Data leakage prevention            PCA & scaler fitted on train split only
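
A hedged sketch of this pipeline, following the Scaling · SMOTE · PCA order noted in the project structure (where PCA is fitted relative to oversampling, and all names here, are illustrative; preprocessing.py is authoritative):

from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

def preprocess(X_train, y_train, X_val, X_test, n_qubits=8, seed=42):
    # Scaler is fitted on the training split only to avoid leakage
    scaler = RobustScaler().fit(X_train)
    X_tr = scaler.transform(X_train)

    # SMOTE is applied to the training data only
    X_tr, y_tr = SMOTE(random_state=seed).fit_resample(X_tr, y_train)

    # PCA down to one component per qubit, fitted on training data only
    pca = PCA(n_components=n_qubits).fit(X_tr)
    return (pca.transform(X_tr), y_tr,
            pca.transform(scaler.transform(X_val)),
            pca.transform(scaler.transform(X_test)))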

Quantum Models

Variational Quantum Classifier (VQC)

AngleEmbedding(X) → StronglyEntanglingLayers(weights) → ⟨Z₀⟩
  • 8 qubits, 2 layers, 30 training epochs, Adam optimiser
  • Backend: lightning.qubit (ideal), default.mixed (noisy simulation)
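
A minimal PennyLane sketch of this circuit and loss (the training loop, batching, and exact hyperparameters in quantum_models.py may differ; the square loss below is an assumption):

import pennylane as qml
from pennylane import numpy as pnp

n_qubits, n_layers = 8, 2
dev = qml.device("lightning.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(x, weights):
    qml.AngleEmbedding(x, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))  # ⟨Z₀⟩ as the class score in [-1, 1]

shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
weights = pnp.random.random(shape, requires_grad=True)
opt = qml.AdamOptimizer(stepsize=0.01)

def cost(w, X, y):  # y encoded as -1 (legitimate) / +1 (fraud)
    preds = pnp.stack([circuit(x, w) for x in X])
    return pnp.mean((preds - y) ** 2)

# One optimisation step on a batch (X_batch, y_batch):
# weights = opt.step(lambda w: cost(w, X_batch, y_batch), weights)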

Quantum SVM (QSVM)

  • Quantum kernel: K(x, x') = |⟨φ(x)|φ(x')⟩|²
  • Feature map: double AngleEmbedding (captures 2nd-order feature interactions)
  • Kernel matrix fed to scikit-learn SVC(kernel="precomputed")
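
A minimal sketch of the kernel evaluation and SVC hookup via the adjoint trick: apply φ(x), then φ(x')†, and read off the probability of the all-zeros state, which equals |⟨φ(x')|φ(x)⟩|². The doubled embedding below is an illustrative reading of "double AngleEmbedding"; quantum_models.py is authoritative.

import numpy as np
import pennylane as qml
from sklearn.svm import SVC

n_qubits = 8
dev = qml.device("lightning.qubit", wires=n_qubits)

def feature_map(x):
    # "Double" AngleEmbedding: the features are uploaded twice (assumption)
    qml.AngleEmbedding(x, wires=range(n_qubits))
    qml.AngleEmbedding(x, wires=range(n_qubits))

@qml.qnode(dev)
def overlap(x1, x2):
    feature_map(x1)
    qml.adjoint(feature_map)(x2)
    return qml.probs(wires=range(n_qubits))

def kernel(x1, x2):
    return overlap(x1, x2)[0]  # P(|0...0⟩) = |⟨φ(x2)|φ(x1)⟩|²

# Toy data just to make the sketch runnable end to end
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(30, n_qubits)), rng.integers(0, 2, 30)
X_test = rng.normal(size=(10, n_qubits))

K_train = qml.kernels.square_kernel_matrix(X_train, kernel)
K_test = qml.kernels.kernel_matrix(X_test, X_train, kernel)
y_pred = SVC(kernel="precomputed").fit(K_train, y_train).predict(K_test)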

Noise Sweep

Depolarizing noise was swept across p = [0.0, 0.001, 0.005, 0.01, 0.02, 0.05].
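
In simulation, depolarizing noise can be injected by switching to the density-matrix backend and placing a channel on each wire after each block; a minimal sketch (run_noise.py may place the channels differently):

import pennylane as qml

n_qubits, p = 8, 0.01
dev = qml.device("default.mixed", wires=n_qubits)  # density-matrix simulator

@qml.qnode(dev)
def noisy_circuit(x, weights):
    qml.AngleEmbedding(x, wires=range(n_qubits))
    for w in range(n_qubits):
        qml.DepolarizingChannel(p, wires=w)  # noise after the embedding
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    for w in range(n_qubits):
        qml.DepolarizingChannel(p, wires=w)  # noise after the variational block
    return qml.expval(qml.PauliZ(0))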

Noise sweep results (plot: results/noise/noise_vs_metric.png)

Model  Metric  p=0.0  p=0.001  p=0.01  p=0.05
QSVM   MCC     0.871  0.871    0.871   0.835
QSVM   PR-AUC  0.952  0.952    0.951   0.932
VQC    MCC     0.626  0.626    0.626   –
VQC    PR-AUC  0.785  0.786    0.786   –

QSVM degrades gracefully — MCC drops 0.036 (0.871 → 0.835) and PR-AUC drops 0.020 (0.952 → 0.932) from p=0.0 to p=0.05. Measurable degradation only sets in around p ≈ 0.02, well above the ~p = 0.001 gate error rates of current superconducting hardware. VQC shows no meaningful degradation because it is already at its performance floor (MCC 0.626).

At current best NISQ hardware noise (p ≈ 0.001), QSVM achieves MCC = 0.871, PR-AUC = 0.952 — within 0.029 MCC of XGBoost.


Qubit Ablation Study

An ablation study systematically varies one component to measure its effect — here, qubit count — while keeping everything else fixed. We swept n_qubits ∈ [4, 6, 8, 10, 12] for both VQC and QSVM.

Qubit ablation results (plots in results/figures/)

QSVM is essentially flat across the entire sweep (MCC ~0.83–0.87). More qubits yield no meaningful gain, and performance dips slightly at 10–12 qubits. Two reasons:

  • Concentration of measure. As the quantum circuit grows, kernel values K(x, x') converge toward the same number — data points become indistinguishable in Hilbert space, degrading the kernel's ability to separate fraud from non-fraud. This is a known fundamental limitation of large quantum kernels; a quick empirical check is sketched after this list.
  • Diminishing PCA signal. Each extra qubit adds one more PCA component, but later components capture less and less variance. At 12 qubits the model is partially trained on noise.
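
One way to check the concentration effect empirically (purely illustrative, not part of the repo): compute the spread of off-diagonal kernel entries for random inputs at several qubit counts — a shrinking spread means the kernel values are collapsing toward a single number.

import numpy as np
import pennylane as qml

def offdiag_std(n_qubits, n_samples=20, seed=0):
    """Std. dev. of off-diagonal quantum-kernel entries for random inputs."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, np.pi, size=(n_samples, n_qubits))
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev)
    def overlap(x1, x2):
        qml.AngleEmbedding(x1, wires=range(n_qubits))
        qml.adjoint(qml.AngleEmbedding)(x2, wires=range(n_qubits))
        return qml.probs(wires=range(n_qubits))

    K = np.array([[overlap(a, b)[0] for b in X] for a in X])
    return K[~np.eye(n_samples, dtype=bool)].std()

for n in (4, 8, 12):
    print(f"{n} qubits: off-diagonal std = {offdiag_std(n):.4f}")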

VQC peaks at 8 qubits then drops sharply at 10–12. The same PCA effect applies, compounded by barren plateaus — with deeper circuits the gradient landscape flattens, making training increasingly ineffective.

Takeaway: 8 qubits is the sweet spot for this dataset. More qubits add compute cost with no benefit, and can actively hurt performance.


Running Tests

pytest tests/ -v

Glossary

Machine Learning Terms

Classification: Predicting a category label for each input from labeled training examples
SVM: Support Vector Machine — finds the optimal decision boundary maximising the margin between classes
Kernel: A function measuring similarity between data points, enabling non-linear decision boundaries
PCA: Principal Component Analysis — reduces dimensionality while retaining maximum variance
SMOTE: Synthetic Minority Oversampling Technique — generates synthetic fraud examples to counteract class imbalance
MCC: Matthews Correlation Coefficient — a single balanced metric for binary classification on imbalanced data (range: −1 to +1)
PR-AUC: Area under the Precision-Recall curve — more informative than ROC-AUC for heavily imbalanced datasets
F1-Fraud: F1-score computed only for the fraud class — harmonic mean of fraud precision and fraud recall
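
For reference, MCC is computed from the confusion-matrix counts as

MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))

It is 0 for a model no better than chance and ±1 for perfect agreement or disagreement.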

Quantum Computing Terms

Qubit: Quantum bit — exists in superposition of 0 and 1 simultaneously until measured
Quantum Circuit: A sequence of quantum gates applied to qubits to encode and process information
AngleEmbedding: Encodes classical feature values as rotation angles on qubits
Quantum Kernel: Measures similarity between two data points as the overlap of their quantum states: K(x, x') = |⟨φ(x)|φ(x')⟩|²
Depolarizing Noise: A noise model where each gate randomly applies X, Y, or Z errors with probability p — representative of real NISQ hardware
NISQ: Noisy Intermediate-Scale Quantum — current era of quantum hardware: 50–1000 qubits with non-negligible error rates
Barren Plateau: Phenomenon where gradients vanish exponentially with circuit depth, making VQC training ineffective at scale
Density Matrix Simulator: Simulates quantum noise exactly by tracking the full mixed state — required for noise modelling, but scales as O(4ⁿ)
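
To make the O(4ⁿ) scaling concrete: at 8 qubits a density matrix holds 4⁸ = 65,536 complex entries (~1 MiB at 16 bytes each), while 16 qubits would already need 4¹⁶ ≈ 4.3 × 10⁹ entries (~69 GB), which is one reason noisy simulation is so much more expensive than ideal simulation.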

Limitations

Constrained comparison. Classical models (RF, XGBoost) were trained on the same PCA-reduced feature space as the quantum models — 8 components out of 30 original features. This is necessary for a like-for-like input comparison, but it handicaps classical models that are designed to exploit the full feature set. Unconstrained XGBoost on all 30 features typically achieves MCC above 0.90 and F1-Fraud of 0.93–0.96 on this dataset, making the actual gap wider than measured here. The benchmark answers the question "how close can quantum get given NISQ hardware constraints?" — not "is quantum better than classical in absolute terms?"


Learnings

  1. QSVM is surprisingly noise-tolerant. Its kernel structure buffers it from depolarizing noise far better than the circuit-based VQC. If a quantum model were to be deployed on NISQ hardware, QSVM is the stronger candidate.

  2. VQC's problem is not noise — it's capacity. With 8 qubits and 30 epochs, VQC is underpowered for a 30-feature, heavily imbalanced fraud dataset. Noise makes virtually no difference because the model is already at its floor.

  3. Classical models are not easily displaced. XGBoost and Random Forest are well-matched to tabular fraud data: they handle imbalance, feature interactions, and high dimensionality natively. Quantum models need a structural advantage in the data to compete — which does not exist here.

  4. SMOTE miscalibrates quantum model thresholds. Oversampling shifts the decision boundary such that the default threshold (0.5) produces poor precision/recall trade-offs. Post-training threshold tuning on the validation set is essential; a sketch follows this list.

  5. Parallel execution is necessary for noise sweeps. Each noise level takes 26–37h on an M4 Mac Mini. Running all 6 levels in parallel reduces wall time from ~9 days to ~3.5 days.
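
A minimal sketch of the threshold tuning from point 4, picking the MCC-maximising cutoff on validation scores (illustrative; "model" and the grid are assumptions, not the repo's exact procedure):

import numpy as np
from sklearn.metrics import matthews_corrcoef

def tune_threshold(y_val, val_scores, grid=np.linspace(0.01, 0.99, 99)):
    """Return the decision threshold that maximises MCC on the validation set."""
    mccs = [matthews_corrcoef(y_val, (val_scores >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(mccs))]

# t = tune_threshold(y_val, model.predict_proba(X_val)[:, 1])
# y_pred_test = (model.predict_proba(X_test)[:, 1] >= t).astype(int)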


Author

Gregor Kobilarov

License

MIT
