
CIFAR-10 Image Classification: CNN vs Vision Transformer

What We Did

We trained and compared four deep learning models on the CIFAR-10 dataset to study the differences between CNN-based and Transformer-based architectures for image classification.


Dataset

CIFAR-10 contains 60,000 color images (32×32 pixels) across 10 classes.

Split        Images   Purpose
Train        40,000   Model learns from this
Validation   10,000   Monitor training progress
Test         10,000   Final accuracy evaluation

Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
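
For reference, a minimal torchvision sketch of this split (the notebooks define the actual pipeline; the normalization constants below are the commonly used CIFAR-10 channel statistics, and augmentation is omitted):

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Convert to tensors and normalize with standard CIFAR-10 channel statistics.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

full_train = datasets.CIFAR10("./data", train=True, download=True,
                              transform=transform)
test_set = datasets.CIFAR10("./data", train=False, download=True,
                            transform=transform)

# 40,000 / 10,000 train/validation split, seeded for reproducibility.
train_set, val_set = random_split(
    full_train, [40_000, 10_000],
    generator=torch.Generator().manual_seed(0))

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128)
test_loader = DataLoader(test_set, batch_size=128)
```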


Models & Results

Model          Type          Test Accuracy   Parameters
Baseline CNN   CNN           87.91%          617K
ResNeXt        CNN           94.57%          2.5M
MLP-Mixer      Transformer   81.34%          2.2M
CCT            Transformer   88.05%          3.8M

CNN Models

Baseline CNN

  • Simple 4-block convolutional network
  • No skip connections, no attention
  • Used as a performance starting point (a minimal sketch follows below)
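
A sketch of what a plain 4-block network like this can look like; the channel widths and classifier head here are illustrative assumptions, not the exact 617K-parameter model in CNN.ipynb:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # One plain block: Conv -> BatchNorm -> ReLU -> 2x downsample.
    # No skip connections, no attention.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class BaselineCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32),     # 32x32 -> 16x16
            conv_block(32, 64),    # 16x16 -> 8x8
            conv_block(64, 128),   # 8x8  -> 4x4
            conv_block(128, 256),  # 4x4  -> 2x2
        )
        self.classifier = nn.Linear(256 * 2 * 2, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```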

ResNeXt

  • Grouped convolutions with cardinality 32
  • Skip connections for stable gradient flow
  • Squeeze-and-Excitation blocks for channel attention
  • Best-performing model overall at 94.57% (see the block sketch below)
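
A sketch of one residual block combining the three ideas above; cardinality 32 matches the description, while the bottleneck width and SE reduction ratio are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels using global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze: global average pool
        return x * w[:, :, None, None]     # excite: per-channel gates

class ResNeXtBlock(nn.Module):
    """Bottleneck with a grouped 3x3 conv (cardinality 32), SE, and a skip."""
    def __init__(self, channels, cardinality=32, width=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1,
                      groups=cardinality, bias=False),  # grouped convolution
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            SEBlock(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # identity skip connection

y = ResNeXtBlock(256)(torch.randn(2, 256, 8, 8))  # shape is preserved
```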

Transformer Models

MLP-Mixer

  • Replaces self-attention with MLP-based mixing
  • No local spatial understanding (no convolution)
  • Lowest accuracy, confirming that transformer-style models struggle on small datasets without built-in inductive bias (see the sketch below)
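
A sketch of one Mixer layer showing the two MLPs that replace self-attention; the hidden sizes and token count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One Mixer layer: a token-mixing MLP, then a channel-mixing MLP."""
    def __init__(self, num_tokens, dim, token_hidden=128, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token mixing: an MLP applied across the patch/token axis,
        # standing in for self-attention.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_tokens),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel mixing: a per-token MLP across feature channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x):                     # x: (batch, tokens, dim)
        y = self.norm1(x).transpose(1, 2)     # (batch, dim, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

# 32x32 image with 4x4 patches -> 64 tokens.
out = MixerBlock(num_tokens=64, dim=128)(torch.randn(2, 64, 128))
```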

CCT (Compact Convolutional Transformer)

  • Convolutional tokenizer + self-attention
  • Combines CNN's local understanding with transformer's global attention
  • Best transformer-family model at 88.05% (see the tokenizer sketch below)
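
A sketch of the idea: a small convolutional stem produces the tokens, which a standard transformer encoder then processes with global self-attention. The embedding width, depth, and head count are illustrative assumptions (the real CCT also uses sequence pooling instead of a class token):

```python
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """CCT-style tokenizer: conv features instead of raw patch slicing."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                        # (B, 3, 32, 32)
        x = self.conv(x)                         # (B, embed_dim, 16, 16)
        return x.flatten(2).transpose(1, 2)      # (B, 256 tokens, embed_dim)

# The conv tokens feed a standard transformer encoder (global attention).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4,
)
tokens = ConvTokenizer()(torch.randn(2, 3, 32, 32))
out = encoder(tokens)                            # (2, 256, 256)
```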

Key Finding

CNNs outperform Transformers on CIFAR-10 because convolutions carry a built-in spatial inductive bias: nearby pixels are processed together, and the same filters are reused at every position. Transformers must learn these relationships from data, and 50,000 training images are not enough without pretraining.

MLP-Mixer (81.34%) < Baseline CNN (87.91%) < CCT (88.05%) < ResNeXt (94.57%)

Training Setup

Setting       CNN Models   Transformer Models
Optimizer     SGD          AdamW
Epochs        30-50        10-50
Batch Size    128          128
LR Schedule   Cosine       Warmup + Cosine
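
A sketch of the two recipes using PyTorch schedulers; only the optimizer choices and schedule shapes come from the table above, while the learning rates, momentum, weight decay, and warmup length are illustrative assumptions:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 8)  # placeholder; stands in for any model above

# CNN recipe: SGD with cosine annealing over the whole run.
cnn_opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
cnn_sched = CosineAnnealingLR(cnn_opt, T_max=50)

# Transformer recipe: AdamW with a linear warmup, then cosine decay.
vit_opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
vit_sched = SequentialLR(
    vit_opt,
    schedulers=[LinearLR(vit_opt, start_factor=0.01, total_iters=5),
                CosineAnnealingLR(vit_opt, T_max=45)],
    milestones=[5],
)

# Each epoch: train, validate, then call scheduler.step().
```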

Files

  • CNN.ipynb — Baseline CNN and ResNeXt training
  • ViT.ipynb — MLP-Mixer and CCT training
  • Report.pdf — Full written report with analysis

Requirements

```
pip install torch torchvision timm torchsummary
pip install scikit-learn matplotlib seaborn
```

How to Run

  1. Open CNN.ipynb or ViT.ipynb in Google Colab
  2. Set Runtime → T4 GPU
  3. Run all cells top to bottom
