We trained and compared four deep learning models on the CIFAR-10 dataset to study the difference between CNN and Transformer-based architectures for image classification.
CIFAR-10 contains 60,000 color images (32×32 pixels) across 10 classes.
| Split | Images | Purpose |
|---|---|---|
| Train | 40,000 | Model learns from this |
| Validation | 10,000 | Monitor training progress |
| Test | 10,000 | Final accuracy evaluation |
Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
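The 40k/10k/10k split above can be reproduced with PyTorch's `random_split`. This is a sketch, not the notebooks' exact code; `split_train_val` is an illustrative name, and in practice the input would be `torchvision.datasets.CIFAR10(root="data", train=True, ...)` (the held-out 10k test set ships separately with `train=False`).

```python
import torch
from torch.utils.data import random_split

def split_train_val(full_train, val_size=10_000, seed=0):
    """Split CIFAR-10's 50,000 training images into 40k train / 10k validation."""
    train_size = len(full_train) - val_size
    gen = torch.Generator().manual_seed(seed)  # fixed seed for a reproducible split
    return random_split(full_train, [train_size, val_size], generator=gen)
```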
| Model | Type | Test Accuracy | Parameters |
|---|---|---|---|
| Baseline CNN | CNN | 87.91% | 617K |
| ResNeXt | CNN | 94.57% | 2.5M |
| MLP-Mixer | Transformer-style (MLP) | 81.34% | 2.2M |
| CCT | Transformer | 88.05% | 3.8M |
**Baseline CNN**
- Simple 4-block convolutional network
- No skip connections, no attention
- Serves as the performance baseline

**ResNeXt**
- Grouped convolutions with cardinality 32
- Skip connections for stable gradient flow
- Squeeze-and-Excitation blocks for channel attention
- Best-performing model at 94.57%

**MLP-Mixer**
- Replaces self-attention with MLP-based mixing
- No local spatial understanding (no convolutions)
- Lowest accuracy, confirming that transformer-style models struggle on small datasets without a convolutional inductive bias

**CCT**
- Convolutional tokenizer followed by self-attention
- Combines a CNN's local understanding with a transformer's global attention
- Best transformer model at 88.05%
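The three ResNeXt ingredients listed above (grouped convolutions, Squeeze-and-Excitation gating, and a skip connection) can be sketched in PyTorch as follows. This is not the notebooks' exact architecture; the class names and channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pool, bottleneck MLP, channel-wise gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1), # excite: bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)  # rescale each channel

class ResNeXtBlock(nn.Module):
    """Grouped 3x3 conv (cardinality 32) + SE gate + identity skip connection."""
    def __init__(self, channels, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            # groups=cardinality splits the conv into 32 parallel paths
            nn.Conv2d(channels, channels, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1, bias=False),  # merge paths
            nn.BatchNorm2d(channels),
            SEBlock(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))  # identity skip for stable gradients
```

`channels` must be divisible by `cardinality` (e.g. 64 channels in 32 groups gives 2 channels per path).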
CNNs outperform Transformers on CIFAR-10 because convolutions build in spatial locality (nearby pixels are related). Transformers must learn this structure from data, and 40,000 training images are not enough without pretraining.
MLP-Mixer (81.34%) < Baseline CNN (87.91%) < CCT (88.05%) < ResNeXt (94.57%)
| Setting | CNN Models | Transformer Models |
|---|---|---|
| Optimizer | SGD | AdamW |
| Epochs | 30-50 | 10-50 |
| Batch Size | 128 | 128 |
| LR Schedule | Cosine | Warmup + Cosine |
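The two optimizer recipes in the table can be set up as below. This is a sketch assuming PyTorch's built-in schedulers; the learning rates, momentum, weight decay, and warmup length are illustrative, not the notebooks' exact values.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(10, 10)  # stand-in for a real model
epochs = 50

# CNN recipe: SGD with cosine decay over the full run
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
cnn_sched = CosineAnnealingLR(sgd, T_max=epochs)

# Transformer recipe: AdamW with linear warmup, then cosine decay
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
warmup_epochs = 5
vit_sched = SequentialLR(
    adamw,
    schedulers=[
        LinearLR(adamw, start_factor=0.01, total_iters=warmup_epochs),
        CosineAnnealingLR(adamw, T_max=epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],  # switch from warmup to cosine here
)
```

Both schedulers are stepped once per epoch; by the final epoch the learning rate has decayed close to zero.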
- `CNN.ipynb`: Baseline CNN and ResNeXt training
- `ViT.ipynb`: MLP-Mixer and CCT training
- `Report.pdf`: Full written report with analysis
```bash
pip install torch torchvision timm torchsummary
pip install scikit-learn matplotlib seaborn
```
1. Open `CNN.ipynb` or `ViT.ipynb` in Google Colab
2. Set Runtime → T4 GPU
3. Run all cells top to bottom