SVMs, PCA, KPCA + LDA

Classification Experiments on Gene Expression & Image Data

SVM · KNN · NCC · PCA · Kernel PCA · LDA


Benchmarking classical ML classifiers combined with dimensionality reduction on two fundamentally different datasets: MLL (3-class leukemia gene expression) and CIFAR-10 (10-class image recognition).


📂 Project Structure

```
src/
│
├── Utils.py                   # Shared data loading, preprocessing & helpers
├── config.yaml                # Centralised hyperparameters for both datasets
│
├── SVM_MLL.py                 # SVM with grid search on MLL
├── SVM_CIFAR.py               # SVM with grid search on CIFAR-10
├── Baseline_models.py         # KNN & NCC baselines on both datasets
├── kpca_lda_standalone.py     # KPCA → LDA pipeline (grid search, used as classifier)
├── kpca_lda_classifier.py     # KPCA → LDA → KNN or SVM pipeline
│
├── MLL.tab                    # MLL gene expression dataset
└── requirements.txt           # Python dependencies

Experiments Results/            # Pre-run output files (CSVs, plots, saved models)
```

🗂️ Datasets

| | MLL | CIFAR-10 |
|---|---|---|
| Type | Gene expression (tabular) | RGB images (32 × 32) |
| Classes | 3 — ALL · AML · MLL | 10 — airplane, automobile, … |
| Features | 12 533 | 3 072 (32 × 32 × 3) |
| Samples | 43 train / 29 test | 50 000 train / 10 000 test |
| Source | `MLL.tab` (bundled) | Auto-downloaded via Keras |

⚙️ Methods

Dimensionality Reduction

| Technique | Description |
|---|---|
| PCA | Retains a configurable percentage of variance (default 90–95%) |
| KPCA | Kernel PCA with RBF or linear kernel; components tuned via grid search |
| LDA | Linear Discriminant Analysis applied after KPCA to maximise class separability |
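Variance-based component selection as described above maps directly onto scikit-learn, where a float `n_components` keeps the fewest components reaching that variance fraction. A minimal sketch on synthetic low-rank data (illustrative only, not the repo's loaders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical low-rank data standing in for a wide gene-expression matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 50)) \
    + 0.05 * rng.normal(size=(100, 50))

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

Because the signal lives on 5 latent dimensions, PCA collapses the 50 observed features to a handful of components while keeping ≥ 95% of the variance.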

Classifiers

| Script | Classifier | Reduction |
|---|---|---|
| `SVM_MLL.py` | SVM (RBF kernel) | PCA |
| `SVM_CIFAR.py` | SVM (RBF kernel) | PCA |
| `Baseline_models.py` | KNN · NCC | PCA |
| `kpca_lda_standalone.py` | LDA (max-likelihood decision) | KPCA + LDA |
| `kpca_lda_classifier.py` | KNN or SVM | KPCA + LDA |
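The KPCA → LDA → classifier chain can be expressed as a single scikit-learn `Pipeline`. The sketch below is an assumption about the shape of `kpca_lda_classifier.py`, using synthetic data and hypothetical hyperparameter values rather than the repo's grid-searched ones:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Illustrative 3-class data (the real scripts load MLL or CIFAR-10).
X, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("kpca", KernelPCA(n_components=20, kernel="rbf", gamma=0.01)),
    ("lda", LinearDiscriminantAnalysis()),   # projects to n_classes - 1 = 2 dims
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X, y)
acc = pipe.score(X, y)
```

LDA implements `transform`, so it can sit as an intermediate pipeline step; swapping the final step for an `SVC` gives the "KPCA_LDA_SVM" variant.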

📊 Experimental Findings

CIFAR-10 — Image Classification

🔑 Key Takeaway

The RBF-kernel SVM was the best overall model. Simpler PCA + SVM (RBF) outperformed the more complex KPCA + LDA pipelines on this dataset.

Non-linear separability & kernel selection. CIFAR-10 data is not linearly separable. Linear-kernel SVMs struggled severely — training times reached up to 30 hours for large C values without converging. Non-linear kernels (RBF, Polynomial) performed vastly better, with the RBF kernel emerging as the optimal choice.

Regularisation matters (soft margin). For linear SVMs, very small C values produced better accuracy and drastically faster training. A small C creates a soft margin that tolerates misclassifications, preventing overfitting on noisy image data and improving generalisation.
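The effect of C is easy to demonstrate on deliberately noisy synthetic data; this is a sketch of the idea, not the repo's grid search:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Label noise (flip_y) stands in for the messiness of raw pixel data.
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

accs = {}
for C in (0.01, 100.0):
    # Small C widens the soft margin and tolerates misclassified training
    # points; large C forces the margin to fit the noise.
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    accs[C] = clf.score(X_te, y_te)
```

On noisy data the small-C model typically trains faster and generalises at least as well, matching the CIFAR-10 observation above.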

Baseline models failed. KNN and NCC performed the worst by a significant margin. NCC assumes each class clusters around a single centroid — but the "mean image" of a CIFAR-10 class is essentially noise. KNN's reliance on Euclidean pixel distance is too crude to capture the complex features needed to distinguish visually similar classes (e.g. cat vs dog).
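The centroid assumption is visible in a toy example (illustrative data only): `NearestCentroid` reduces each class to the mean of its training samples and classifies by distance to those means, which only works when classes form tight clusters:

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

# Two tight, well-separated clusters — the opposite of CIFAR-10 classes.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y = np.array([0, 0, 1, 1])

ncc = NearestCentroid()
ncc.fit(X, y)
preds = ncc.predict([[0.1, 0.0], [1.1, 1.0]])   # one centroid per class
```

When intra-class variance is high, as with natural images, the class centroid ("mean image") carries almost no discriminative information and this scheme breaks down.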

Scaling had negligible impact. MinMax [0, 1], MinMax [−1, 1], and Z-score standardisation all produced very similar accuracy. StandardScaler required slightly more training time, likely because unbounded values slow convergence.

KPCA + LDA underperformed. LDA assumes normally distributed classes with similar variances — conditions CIFAR-10 violates due to high intra-class variance and overlapping class means. Hardware constraints also prevented an exhaustive hyperparameter search for the KPCA models.


MLL — Gene Expression

🔑 Key Takeaway

KPCA + LDA achieved the highest overall accuracy, projecting 12 533 dimensions into just 2D with near-perfect class separation. Linear models dominated.

Inherent linear separability. With thousands of dimensions but very few samples, MLL data is almost perfectly linearly separable. Linear SVMs and KPCA with a linear kernel achieved the best results.

Strong regularisation is crucial. The best performances consistently used exceptionally small C values. The tiny dataset makes large C values cause severe overfitting and poor generalisation.

PCA trade-offs. Retaining 95% of variance reduced dimensionality from 12 533 → ~35 features. This caused a slight accuracy drop but cut training time by an order of magnitude.

Baseline models excelled. Unlike on CIFAR-10, KNN and NCC (with PCA) achieved > 86% accuracy, confirming that the three leukemia types form highly distinct, well-separated clusters in gene-expression space.

KPCA + LDA was exceptional. Linear KPCA + LDA projected the high-dimensional data into a 2D space where the classes were almost flawlessly separated — a simple linear classifier could easily distinguish them.
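The 2D projection falls out of the maths: for 3 classes, LDA yields at most `n_classes − 1 = 2` discriminant directions. A sketch on synthetic well-separated clusters standing in for ALL / AML / MLL:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Three well-separated clusters in a (modestly) high-dimensional space.
X, y = make_blobs(n_samples=60, n_features=50, centers=3, random_state=0)

# Linear-kernel KPCA first, then LDA projects to n_classes - 1 = 2 dims.
X_kpca = KernelPCA(n_components=10, kernel="linear").fit_transform(X)
lda = LinearDiscriminantAnalysis().fit(X_kpca, y)
X_2d = lda.transform(X_kpca)
```

When the classes are this cleanly clustered, even a trivial linear decision rule on `X_2d` separates them, which mirrors the near-flawless MLL result.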


Side-by-Side Comparison

| Aspect | CIFAR-10 | MLL |
|---|---|---|
| Best pipeline | PCA + SVM (RBF) | KPCA + LDA |
| Separability | Non-linear | Nearly linear |
| Regularisation | Small C preferred | Small C crucial |
| Baseline models | Failed | Excelled (> 86%) |
| KPCA + LDA | Underperformed | Best overall |
| Scaling impact | Negligible | |

🚀 Getting Started

Installation

```
pip install -r requirements.txt
```

Note

TensorFlow / Keras is required only to download CIFAR-10. If you are only using the MLL dataset you can skip it.

Running the Experiments

All scripts must be run from inside src/, because MLL.tab is loaded via a relative path:

```
cd src/
```

SVM on MLL

```
python SVM_MLL.py
```

Runs a grid search over C and gamma for an RBF SVM across three scalers (MinMaxScaler, MinMaxScaler [−1, 1], StandardScaler) with PCA preprocessing.

SVM on CIFAR-10

```
python SVM_CIFAR.py
```

Same pipeline as above, but on a configurable subsample of CIFAR-10 (default 20 000 training images).

KNN & NCC baselines

```
python Baseline_models.py
```

Runs a KNN grid search (over k) and a Nearest Centroid classifier. Switch between datasets by changing dataset_name at the top of main().

KPCA + LDA standalone

```
python kpca_lda_standalone.py
```

Grid searches KPCA hyperparameters (components, kernel, gamma) and uses the resulting LDA projection directly for classification.

KPCA + LDA + downstream classifier

```
python kpca_lda_classifier.py
```

Fits KPCA → LDA → KNN or SVM in a single pipeline. Select the classifier by setting model_name to "KPCA_LDA_KNN" or "KPCA_LDA_SVM" at the top of main().


🛠️ Configuration

All hyperparameters live in config.yaml under two top-level keys — MLL and CIFAR10. Read at runtime by Utils.load_config(dataset_name), it is the single source of truth for grid search bounds, PCA thresholds, CV folds, sample sizes, and KPCA settings.
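To illustrate the two-section layout, a hypothetical fragment of `config.yaml` might look like the following — the key names and values here are invented for illustration, not the repo's actual schema:

```yaml
MLL:
  pca_variance: 0.95      # fraction of variance PCA retains
  cv_folds: 5
  svm_grid:
    C: [0.001, 0.01, 0.1]

CIFAR10:
  sample_size: 20000      # training images used per run
  pca_variance: 0.90
```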

Each script's main() also exposes a few runtime flags:

| Flag | Description |
|---|---|
| `dataset_name` | `'MLL'` or `'CIFAR-10'` |
| `PCA_enable` | Toggle PCA on / off |
| `Multiple_cores_enable` | Use all CPU cores for grid search |
| `GrayScale` | Convert CIFAR-10 to grayscale (CIFAR scripts only) |
| `model_name` | `"KPCA_LDA_KNN"` or `"KPCA_LDA_SVM"` (KPCA classifier script only) |

📁 Output Files

Each run creates a subdirectory next to the script (e.g. MLL_rbf_MinMaxScaler_PCA_output_files/):

| File | Contents |
|---|---|
| `results_*.csv` | Full grid search CV results, ranked by score |
| `best_est_results_*.csv` | Best model's parameters & evaluation metrics |
| `conf_matrix_*.png` | Confusion matrix heatmap on the test set |
| `contour_grid_*.png` | C vs γ accuracy contour (SVM with RBF / poly kernel) |
| `*_best_model.pkl` | Saved best model (joblib) |
| `classification_example_*.png` | Example misclassified image (CIFAR-10 only) |

About

Benchmarking SVM, KNN & NCC classifiers with PCA, Kernel PCA and LDA dimensionality reduction on MLL gene-expression (3-class leukemia) and CIFAR-10 image data. Includes grid search pipelines, experiment results and comparative analysis.
