# Classification Experiments on Gene Expression & Image Data
SVM · KNN · NCC · PCA · Kernel PCA · LDA
Benchmarking classical ML classifiers combined with dimensionality reduction on two fundamentally different datasets: MLL (3-class leukemia gene expression) and CIFAR-10 (10-class image recognition).
```
src/
│
├── Utils.py                  # Shared data loading, preprocessing & helpers
├── config.yaml               # Centralised hyperparameters for both datasets
│
├── SVM_MLL.py                # SVM with grid search on MLL
├── SVM_CIFAR.py              # SVM with grid search on CIFAR-10
├── Baseline_models.py        # KNN & NCC baselines on both datasets
├── kpca_lda_standalone.py    # KPCA → LDA pipeline (grid search, used as classifier)
├── kpca_lda_classifier.py    # KPCA → LDA → KNN or SVM pipeline
│
├── MLL.tab                   # MLL gene expression dataset
└── requirements.txt          # Python dependencies
Experiments Results/          # Pre-run output files (CSVs, plots, saved models)
```
| | MLL | CIFAR-10 |
|---|---|---|
| Type | Gene expression (tabular) | RGB images (32 × 32) |
| Classes | 3 — ALL · AML · MLL | 10 — airplane, automobile, … |
| Features | 12 533 | 3 072 (32 × 32 × 3) |
| Samples | 43 train / 29 test | 50 000 train / 10 000 test |
| Source | `MLL.tab` (bundled) | Auto-downloaded via Keras |
| Technique | Description |
|---|---|
| PCA | Retains a configurable percentage of variance (default 90–95%) |
| KPCA | Kernel PCA with RBF or linear kernel; components tuned via grid search |
| LDA | Linear Discriminant Analysis applied after KPCA to maximise class separability |
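As a rough sketch (not the repo's exact code), the three techniques map onto scikit-learn as follows; the variance threshold and component counts below are placeholders, not the tuned values:

```python
# Illustrative mapping of the three reduction techniques onto scikit-learn.
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pca = PCA(n_components=0.95)                     # keep 95% of the variance
kpca = KernelPCA(n_components=50, kernel="rbf")  # components/kernel tuned via grid search
lda = LinearDiscriminantAnalysis()               # at most n_classes - 1 output dimensions

# Typical usage on an (X_train, y_train) split:
# X_pca  = pca.fit_transform(X_train)
# X_kpca = kpca.fit_transform(X_train)
# X_lda  = lda.fit_transform(X_kpca, y_train)    # LDA applied after KPCA
```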
| Script | Classifier | Reduction |
|---|---|---|
| `SVM_MLL.py` | SVM (RBF kernel) | PCA |
| `SVM_CIFAR.py` | SVM (RBF kernel) | PCA |
| `Baseline_models.py` | KNN · NCC | PCA |
| `kpca_lda_standalone.py` | LDA (max-likelihood decision) | KPCA + LDA |
| `kpca_lda_classifier.py` | KNN or SVM | KPCA + LDA |
**Non-linear separability & kernel selection.** CIFAR-10 data is not linearly separable. Linear-kernel SVMs struggled severely: training times reached up to 30 hours for large C values without converging. Non-linear kernels (RBF, polynomial) performed vastly better, with the RBF kernel emerging as the optimal choice.
**Regularisation matters (soft margin).** For linear SVMs, very small C values produced better accuracy and drastically faster training. A small C creates a soft margin that tolerates misclassifications, preventing overfitting on noisy image data and improving generalisation.
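A minimal sketch of a grid reflecting both findings, with small C values and an RBF kernel (placeholder values; the real search bounds live in `config.yaml`):

```python
# Sketch of a small-C grid search over an RBF-kernel SVM.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.001, 0.01, 0.1, 1.0],    # small C = softer margin, less overfitting
    "gamma": ["scale", 0.001, 0.01],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
# search.fit(X_train, y_train)  # X_train should be scaled and PCA-reduced first
```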
**Baseline models failed.** KNN and NCC performed the worst by a significant margin. NCC assumes each class clusters around a single centroid, but the "mean image" of a CIFAR-10 class is essentially noise. KNN's reliance on Euclidean pixel distance is too crude to capture the complex features needed to distinguish visually similar classes (e.g. cat vs dog).
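The failure mode is easy to see from the decision rules, sketched here with scikit-learn (not the repo's exact baseline code):

```python
# After fitting, ncc.centroids_ holds one mean vector per class; for CIFAR-10
# that is a blurry "mean image" that barely resembles any real sample.
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

ncc = NearestCentroid()                    # classifies by nearest class mean
knn = KNeighborsClassifier(n_neighbors=5)  # raw Euclidean distance in pixel space
# ncc.fit(X_train, y_train); knn.fit(X_train, y_train)
```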
**Scaling had negligible impact.** MinMax [0, 1], MinMax [−1, 1], and Z-score standardisation all produced very similar accuracy. StandardScaler required slightly more training time, likely because its unbounded output slows convergence.
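For reference, the three scalings are all drop-in scikit-learn transformers (a sketch; the repo wires them into its own pipeline):

```python
# The three scalings compared in the experiments.
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scalers = {
    "MinMax [0, 1]":  MinMaxScaler(feature_range=(0, 1)),
    "MinMax [-1, 1]": MinMaxScaler(feature_range=(-1, 1)),
    "Z-score":        StandardScaler(),  # unbounded output, slightly slower SVM training
}
# for name, scaler in scalers.items():
#     X_scaled = scaler.fit_transform(X_train)
```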
**KPCA + LDA underperformed.** LDA assumes normally distributed classes with similar variances, conditions CIFAR-10 violates due to high intra-class variance and overlapping class means. Hardware constraints also prevented an exhaustive hyperparameter search for the KPCA models.
**Inherent linear separability.** With thousands of dimensions but very few samples, the MLL data is almost perfectly linearly separable. Linear SVMs and KPCA with a linear kernel achieved the best results.
**Strong regularisation is crucial.** The best performances consistently used exceptionally small C values; with so few samples, large C values cause severe overfitting and poor generalisation.
**PCA trade-offs.** Retaining 95% of the variance reduced dimensionality from 12 533 → ~35 features. This caused a slight accuracy drop but cut training time by an order of magnitude.
**Baseline models excelled.** Unlike on CIFAR-10, KNN and NCC (with PCA) achieved > 86% accuracy, confirming that the three leukemia types form highly distinct, well-separated clusters in gene-expression space.
**KPCA + LDA was exceptional.** Linear KPCA + LDA projected the high-dimensional data into a 2D space where the classes were almost flawlessly separated; a simple linear classifier could easily distinguish them.
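A sketch of that projection, assuming a linear-kernel KPCA feeding a 2-component LDA (with 3 classes, LDA yields at most two discriminant axes; the KPCA component count is a placeholder):

```python
# Project MLL samples into the 2D LDA space after a linear-kernel KPCA.
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

kpca = KernelPCA(n_components=30, kernel="linear")  # placeholder component count
lda = LinearDiscriminantAnalysis(n_components=2)    # 3 classes -> at most 2 components
# X_2d = lda.fit_transform(kpca.fit_transform(X_train), y_train)
# X_2d has shape (n_samples, 2); the three leukemia types separate almost perfectly.
```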
| Aspect | CIFAR-10 | MLL |
|---|---|---|
| Best pipeline | PCA + SVM (RBF) | KPCA + LDA |
| Separability | Non-linear | Nearly linear |
| Regularisation | Small C preferred | Small C crucial |
| Baseline models | Failed | Excelled (> 86%) |
| KPCA + LDA | Underperformed | Best overall |
| Scaling impact | Negligible | — |
```bash
pip install -r requirements.txt
```

> **Note:** TensorFlow / Keras is required only to download CIFAR-10. If you are only using the MLL dataset you can skip it.
All scripts must be run from inside `src/` — `MLL.tab` is loaded via a relative path:

```bash
cd src/
```

### SVM on MLL
```bash
python SVM_MLL.py
```

Runs a grid search over `C` and `gamma` for an RBF SVM across three scalers (`MinMaxScaler`, `MinMaxScaler` [−1, 1], `StandardScaler`) with PCA preprocessing.
### SVM on CIFAR-10
```bash
python SVM_CIFAR.py
```

Same pipeline as above, but on a configurable subsample of CIFAR-10 (default 20 000 training images).
### KNN & NCC baselines
```bash
python Baseline_models.py
```

Runs a KNN grid search (over `k`) and a Nearest Centroid classifier. Switch between datasets by changing `dataset_name` at the top of `main()`.
### KPCA + LDA standalone
```bash
python kpca_lda_standalone.py
```

Grid-searches the KPCA hyperparameters (components, kernel, gamma) and uses the resulting LDA projection directly for classification.
### KPCA + LDA + downstream classifier
```bash
python kpca_lda_classifier.py
```

Fits KPCA → LDA → KNN or SVM in a single pipeline. Select the classifier by setting `model_name` to `"KPCA_LDA_KNN"` or `"KPCA_LDA_SVM"` at the top of `main()`.
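Conceptually the stack is equivalent to a scikit-learn `Pipeline` like the sketch below (placeholder settings; the script's own wiring may differ):

```python
# Sketch of the KPCA -> LDA -> classifier stack as a scikit-learn Pipeline.
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("kpca", KernelPCA(n_components=30, kernel="linear")),  # placeholder settings
    ("lda", LinearDiscriminantAnalysis()),
    ("clf", KNeighborsClassifier(n_neighbors=5)),           # or an SVC for KPCA_LDA_SVM
])
# pipe.fit(X_train, y_train); pipe.score(X_test, y_test)
```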
All hyperparameters live in `config.yaml` under two top-level keys, `MLL` and `CIFAR10`. Read at runtime by `Utils.load_config(dataset_name)`, it is the single source of truth for grid search bounds, PCA thresholds, CV folds, sample sizes, and KPCA settings.
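A hypothetical sketch of how `Utils.load_config` might resolve a dataset's settings (the actual implementation in `Utils.py` may differ):

```python
# Hypothetical sketch of load_config; the repo's real implementation may differ.
import yaml  # PyYAML

def load_config(dataset_name: str) -> dict:
    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)  # two top-level keys: MLL and CIFAR10
    key = "CIFAR10" if dataset_name == "CIFAR-10" else dataset_name
    return cfg[key]  # grid search bounds, PCA threshold, CV folds, ...
```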
Each script's `main()` also exposes a few runtime flags:

| Flag | Description |
|---|---|
| `dataset_name` | `'MLL'` or `'CIFAR-10'` |
| `PCA_enable` | Toggle PCA on / off |
| `Multiple_cores_enable` | Use all CPU cores for grid search |
| `GrayScale` | Convert CIFAR-10 to grayscale (CIFAR scripts only) |
| `model_name` | `"KPCA_LDA_KNN"` or `"KPCA_LDA_SVM"` (KPCA classifier script only) |
Each run creates a subdirectory next to the script (e.g. `MLL_rbf_MinMaxScaler_PCA_output_files/`):
| File | Contents |
|---|---|
| `results_*.csv` | Full grid search CV results, ranked by score |
| `best_est_results_*.csv` | Best model's parameters & evaluation metrics |
| `conf_matrix_*.png` | Confusion matrix heatmap on the test set |
| `contour_grid_*.png` | C vs γ accuracy contour (SVM with RBF / poly kernel) |
| `*_best_model.pkl` | Saved best model (joblib) |
| `classification_example_*.png` | Example misclassified image (CIFAR-10 only) |