| Course | Industry Project VI, 6th Semester |
|---|---|
| Institution | IU International University of Applied Sciences |
| Supervisor | Wolfgang Zollner |
| Student | Gregor Kobilarov |
| Dataset | Kaggle Credit Card Fraud Detection (n = 284,807) |
"To what extent can Quantum Machine Learning algorithms, given current NISQ constraints, represent a competitive alternative to classical methods for fraud detection?"
Three central hypothesis pairs are derived from theoretical analysis of NISQ hardware constraints, the mathematical structure of quantum feature spaces, and the specifics of fraud detection.
H0₁: QSVM achieves no statistically significant difference in MCC compared to the classical baseline XGBoost.
H1₁: QSVM achieves a significantly different MCC compared to XGBoost.
Rationale: The primary metric is Matthews Correlation Coefficient (MCC), which is more informative than F1 for heavily imbalanced datasets. This hypothesis directly addresses whether quantum kernels can compete with classical ensemble methods on a real fraud detection task.
H0₂: QSVM and VQC exhibit no significant difference in degradation across depolarizing noise levels p ∈ [0.0, 0.05].
H1₂: QSVM and VQC exhibit significantly different degradation patterns with increasing depolarizing noise.
Rationale: Mathematical analysis suggests QSVM should degrade gracefully due to kernel structure stability, while VQC is expected to hit a performance floor early due to barren plateau effects. This tests whether theoretical noise resilience translates to empirical robustness.
H0₃: Hybrid quantum-classical methods require no more computational time for training and inference than classical ensemble methods.
H1₃: Hybrid quantum-classical methods require significantly more computational time than classical baselines, despite theoretical quantum acceleration potential.
Rationale: Classical simulators execute quantum circuits on CPUs, introducing overhead that negates theoretical speedup. This hypothesis evaluates the practical cost of quantum simulation on available hardware.
Quantum advantage was not demonstrated in this setting.
| Model | MCC | PR-AUC | F1-Fraud | ROC-AUC |
|---|---|---|---|---|
| XGBoost | 0.900 | 0.968 | 0.924 | 0.974 |
| Random Forest | 0.894 | 0.974 | 0.917 | 0.983 |
| QSVM (p=0.0) | 0.871 | 0.952 | 0.905 | 0.963 |
| VQC (p=0.0) | 0.626 | 0.785 | 0.699 | 0.875 |
MCC and PR-AUC are the primary metrics — they account for extreme class imbalance (0.17% fraud). F1-Fraud and ROC-AUC are provided for reference.
- QSVM is competitive — within 0.029 MCC of XGBoost (0.871 vs 0.900) using only 300 training samples vs ~155,000 — but does not exceed classical baselines.
- VQC underperforms significantly — 0.274 MCC below XGBoost (0.626 vs 0.900). The bottleneck is model capacity (8 qubits / 30 epochs), not noise.
- Both quantum models remain below classical at all depolarizing noise levels tested (p = 0.0 → 0.05).
See the noise sweep results notebook for the full analysis.
This repository implements a rigorous benchmarking framework comparing quantum and classical models on credit card fraud detection:
| Model | Type | Library |
|---|---|---|
| Random Forest | Classical | scikit-learn |
| XGBoost | Classical | xgboost |
| Variational Quantum Classifier (VQC) | Quantum | PennyLane |
| Quantum Support Vector Machine (QSVM) | Quantum | PennyLane + scikit-learn |
The benchmark is designed around the specific challenges of financial fraud detection:
- Extreme class imbalance (~0.17% fraud in the reference dataset)
- High dimensionality vs. the limited qubit count of NISQ simulators
- Rigorous metrics — MCC and PR-AUC as primary (robust to class imbalance); F1-Fraud and ROC-AUC as secondary reference
| Source | Kaggle — Credit Card Fraud Detection |
| Size | 284,807 transactions |
| Features | 30 (28 PCA-anonymised V1–V28 + Amount + Time) |
| Target | Binary: Fraud (492 cases, 0.17%) / Legitimate (284,315 cases) |
| Split | Train 68% / Val 12% / Test 20% (stratified) |
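The 68/12/20 stratified split above can be reproduced with two chained `train_test_split` calls. This sketch uses random placeholder data and an assumed seed of 42, not the repo's actual loader:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))            # stand-in for the real feature matrix
y = (rng.random(1000) < 0.05).astype(int)  # stand-in labels (5% positive here)

# Step 1: hold out the 20% test set, stratified on the fraud label
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Step 2: split the remaining 80% into 68% train / 12% val
# (12% of the total is 15% of the remaining 80%)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 680 120 200
```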
# Option A – Kaggle CLI
kaggle datasets download -d mlg-ulb/creditcardfraud --path data/raw --unzip
# Option B – Manual
# Download creditcard.csv from Kaggle and place at data/raw/creditcard.csv

qml-fraud-detection-benchmark/
├── data/
│ ├── raw/ # Place creditcard.csv here
│ └── processed/ # Auto-generated preprocessed arrays
├── notebooks/
│ └── 02_noise_sweep_results.ipynb # Noise sweep analysis and conclusions
├── results/
│ ├── figures/ # Benchmark and ablation plots
│ └── noise/
│ └── noise_vs_metric.png # Noise sweep: MCC and PR-AUC vs p
├── src/
│ ├── data_loader.py # Dataset verification
│ ├── preprocessing.py # Scaling · SMOTE · PCA pipeline
│ ├── classical_models.py # Random Forest & XGBoost
│ ├── quantum_models.py # VQC & QSVM (PennyLane)
│ └── evaluation.py # Metrics, plots, comparison tables
├── tests/
├── run_benchmark.py # Main benchmark entrypoint
├── run_ablation.py # Qubit count sweep (4–12 qubits)
├── run_noise.py # Depolarizing noise sweep
├── run_noise_parallel.sh # Parallel noise sweep launcher
├── run_noise_watcher.sh # Queue-based watcher for long runs
└── run_plots.py # Plot generation from saved results
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e . --no-deps

# Classical baselines only (fast)
python run_benchmark.py --n-qubits 8 --classical-only --cv-folds 0 --no-plots
# Full benchmark (classical + VQC + QSVM)
python run_benchmark.py --n-qubits 8
# Qubit ablation study
python run_ablation.py
# Depolarizing noise sweep (long: ~3.5 days wall time on an M4 Mac Mini with parallel execution)
bash run_noise_parallel.sh

| Challenge | Solution |
|---|---|
| Class imbalance (~0.17% fraud) | SMOTE oversampling (training set only) |
| Outliers in transaction amounts | RobustScaler (median/IQR-based) |
| High dimensionality (30 features) | PCA to n_qubits components |
| Data leakage prevention | PCA & scaler fitted on train split only |
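The leakage-safe part of the pipeline can be sketched as follows, with placeholder arrays standing in for the real splits (the actual implementation lives in `preprocessing.py`):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 30))  # placeholder for the real train split
X_test = rng.normal(size=(100, 30))

# Fit scaler and PCA on the TRAIN split only; applying them to the test
# split without refitting is what prevents data leakage.
scaler = RobustScaler().fit(X_train)
pca = PCA(n_components=8).fit(scaler.transform(X_train))  # 8 = n_qubits

X_train_q = pca.transform(scaler.transform(X_train))
X_test_q = pca.transform(scaler.transform(X_test))
# SMOTE (from imbalanced-learn) would then oversample X_train_q only;
# the validation and test splits keep their natural 0.17% fraud rate.
print(X_train_q.shape, X_test_q.shape)  # (500, 8) (100, 8)
```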
AngleEmbedding(X) → StronglyEntanglingLayers(weights) → ⟨Z₀⟩
- 8 qubits, 2 layers, 30 training epochs, Adam optimiser
- Backend: `lightning.qubit` (ideal), `default.mixed` (noisy simulation)
- Quantum kernel: K(x, x') = |⟨φ(x)|φ(x')⟩|²
- Feature map: double `AngleEmbedding` (captures 2nd-order feature interactions)
- Kernel matrix fed to scikit-learn `SVC(kernel="precomputed")`
Depolarizing noise was swept across p = [0.0, 0.001, 0.005, 0.01, 0.02, 0.05].
| Model | Metric | p=0.0 | p=0.001 | p=0.01 | p=0.05 |
|---|---|---|---|---|---|
| QSVM | MCC | 0.871 | 0.871 | 0.871 | 0.835 |
| QSVM | PR-AUC | 0.952 | 0.952 | 0.951 | 0.932 |
| VQC | MCC | 0.626 | 0.626 | 0.626 | — |
| VQC | PR-AUC | 0.785 | 0.786 | 0.786 | — |
QSVM degrades gracefully: MCC drops 0.036 (0.871 → 0.835) and PR-AUC drops 0.020 (0.952 → 0.932) from p=0.0 to p=0.05. Noticeable degradation only sets in around p ≈ 0.02, well above the ~p = 0.001 gate error rates of current superconducting hardware. VQC shows no meaningful degradation because it is already at its performance floor (MCC 0.626).
At current best NISQ hardware noise (p ≈ 0.001), QSVM achieves MCC = 0.871, PR-AUC = 0.952 — within 0.029 MCC of XGBoost.
An ablation study systematically varies one component to measure its effect — here, qubit count — while keeping everything else fixed. We swept n_qubits ∈ [4, 6, 8, 10, 12] for both VQC and QSVM.
QSVM is essentially flat across the entire sweep (MCC ~0.83–0.87). More qubits yield no meaningful gain, and performance dips slightly at 10–12 qubits. Two reasons:
- Concentration of measure. As the quantum circuit grows, kernel values K(x, x') converge toward the same number — data points become indistinguishable in Hilbert space, degrading the kernel's ability to separate fraud from non-fraud. This is a known fundamental limitation of large quantum kernels.
- Diminishing PCA signal. Each extra qubit adds one more PCA component, but later components capture less and less variance. At 12 qubits the model is partially trained on noise.
VQC peaks at 8 qubits then drops sharply at 10–12. The same PCA effect applies, compounded by barren plateaus — with deeper circuits the gradient landscape flattens, making training increasingly ineffective.
Takeaway: 8 qubits is the sweet spot for this dataset. More qubits add compute cost with no benefit, and can actively hurt performance.
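The diminishing-PCA-signal effect is easy to visualise on synthetic data with decaying per-direction variance (a rough stand-in for the dataset, not the actual V1–V28 features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 30-feature data with geometrically decaying variance per direction
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30)) * (0.8 ** np.arange(30))

ratios = PCA(n_components=12).fit(X).explained_variance_ratio_
print(ratios.round(3))
# Components 9-12 each explain only a small fraction of the variance,
# so the extra qubits in a 10- or 12-qubit model mostly encode noise.
```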
pytest tests/ -v

| Term | Definition |
|---|---|
| Classification | Predicting a category label for each input from labeled training examples |
| SVM | Support Vector Machine — finds the optimal decision boundary maximising the margin between classes |
| Kernel | A function measuring similarity between data points, enabling non-linear decision boundaries |
| PCA | Principal Component Analysis — reduces dimensionality while retaining maximum variance |
| SMOTE | Synthetic Minority Oversampling Technique — generates synthetic fraud examples to counteract class imbalance |
| MCC | Matthews Correlation Coefficient — a single balanced metric for binary classification on imbalanced data (range: −1 to +1) |
| PR-AUC | Area under the Precision-Recall curve — more informative than ROC-AUC for heavily imbalanced datasets |
| F1-Fraud | F1-score computed only for the fraud class — harmonic mean of fraud precision and fraud recall |
| Term | Definition |
|---|---|
| Qubit | Quantum bit — exists in superposition of 0 and 1 simultaneously until measured |
| Quantum Circuit | A sequence of quantum gates applied to qubits to encode and process information |
| AngleEmbedding | Encodes classical feature values as rotation angles on qubits |
| Quantum Kernel | Measures similarity between two data points as the overlap of their quantum states: K(x, x') = |⟨φ(x)|φ(x')⟩|² |
| Depolarizing Noise | A noise model where each gate randomly applies X, Y, or Z errors with probability p — representative of real NISQ hardware |
| NISQ | Noisy Intermediate-Scale Quantum — current era of quantum hardware: 50–1000 qubits with non-negligible error rates |
| Barren Plateau | Phenomenon where gradients vanish exponentially with circuit depth, making VQC training ineffective at scale |
| Density Matrix Simulator | Simulates quantum noise exactly by tracking the full mixed state — required for noise modelling, but scales as O(4ⁿ) |
Constrained comparison. Classical models (RF, XGBoost) were trained on the same PCA-reduced feature space as the quantum models — 8 components out of 30 original features. This is necessary for a like-for-like input comparison, but it handicaps classical models that are designed to exploit the full feature set. Unconstrained XGBoost on all 30 features typically achieves MCC above 0.90 and F1-fraud of 0.93–0.96 on this dataset, making the actual gap wider than measured here. The benchmark answers the question "how close can quantum get given NISQ hardware constraints?" — not "is quantum better than classical in absolute terms?"
- **QSVM is surprisingly noise-tolerant.** Its kernel structure buffers it from depolarizing noise far better than the circuit-based VQC. If a quantum model were to be deployed on NISQ hardware, QSVM is the stronger candidate.
- **VQC's problem is capacity, not noise.** With 8 qubits and 30 epochs, VQC is underpowered for a 30-feature, heavily imbalanced fraud dataset. Noise makes virtually no difference because the model is already at its floor.
- **Classical models are not easily displaced.** XGBoost and Random Forest are well matched to tabular fraud data: they handle imbalance, feature interactions, and high dimensionality natively. Quantum models need a structural advantage in the data to compete, and no such advantage exists here.
- **SMOTE miscalibrates quantum model thresholds.** Oversampling shifts the decision boundary such that the default threshold (0.5) produces poor precision/recall trade-offs. Post-training threshold tuning on the validation set is essential.
- **Parallel execution is necessary for noise sweeps.** Each noise level takes 26–37 h on an M4 Mac Mini. Running all six levels in parallel reduces wall time from ~9 days to ~3.5 days.
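The threshold-tuning step mentioned above can be sketched as a simple grid search that maximises validation-set MCC. The `tune_threshold` helper and the toy data are illustrative, not the repo's actual implementation:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def tune_threshold(y_val, scores, grid=None):
    """Pick the decision threshold that maximises validation-set MCC."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    mccs = [matthews_corrcoef(y_val, (scores >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(mccs))])

# Toy validation data: a SMOTE-trained model's scores are often shifted,
# so the best threshold lands away from the 0.5 default.
rng = np.random.default_rng(0)
y_val = (rng.random(500) < 0.05).astype(int)
scores = np.clip(0.6 * y_val + 0.5 * rng.random(500), 0, 1)
best_t = tune_threshold(y_val, scores)
print(best_t)
```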
Gregor Kobilarov
MIT

