We ran 40 experiments (2 models x 4 operations x 5 seeds) on modular arithmetic tasks with p=113. All experiments successfully grokked (100% success rate).
| Operation | MLP (epochs) | KAN (epochs) | KAN Speedup |
|---|---|---|---|
| Addition | 1,360 ± 1,800 | 693 ± 54 | 2.0x faster |
| Subtraction | 550 ± 192 | 708 ± 31 | 0.8x |
| Multiplication | 8,558 ± 1,919 | 708 ± 48 | 12.1x faster |
| Division | 615 ± 186 | 692 ± 47 | 0.9x |
- KAN groks multiplication 12x faster than MLP - the most striking result
- KAN shows remarkably consistent grokking times - low variance across seeds
- MLP has high variance - one addition seed took 4,554 epochs vs 321 for another
- For simple operations, MLP and KAN are comparable - subtraction and division show similar speeds
| Question | Answer |
|---|---|
| Does KAN grok? | Yes, 100% success rate (20/20 KAN experiments) |
| Does KAN grok faster? | Yes, significantly for multiplication (12x) and addition (2x) |
| Is KAN more consistent? | Yes, lower variance in grokking times across seeds |
Grokking is a phenomenon in which neural networks suddenly generalize long after memorizing the training data. It was first reported by Power et al. (2022) and mechanistically analyzed by Nanda et al. (2023).
Kolmogorov-Arnold Networks (KAN) replace fixed activation functions with learnable B-spline functions on edges, as introduced by Liu et al. (2024).
This project empirically tests whether KAN's architectural differences affect grokking behavior.
```bash
# Clone the repository
git clone https://github.com/stchakwdev/can-kan-grok.git
cd can-kan-grok

# Create environment
conda create -n grok python=3.10
conda activate grok

# Install PyTorch
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# Install efficient-kan (required for KAN models)
pip install git+https://github.com/Blealtan/efficient-kan.git

# Install package
pip install -e .
```

```bash
# Run a single experiment
python scripts/run_experiment.py --model kan --operation multiplication --seed 42

# Run full comparison (40 experiments)
python scripts/run_experiment.py --model all --operation all --seeds 5

# Analyze results and generate figures
python scripts/analyze_modular_results.py
```

Following Nanda et al. (2023):
- Input: (a, b) pairs where a, b ∈ {0, 1, ..., p-1}
- Output: operation(a, b) mod p
- Prime: p = 113 (12,769 total pairs)
- Split: 50% train, 50% validation
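The task construction above can be sketched in a few lines of plain Python. This is an illustrative helper, not the repo's actual data module; modular division uses the modular inverse via Fermat's little theorem, which is valid because p is prime:

```python
p = 113

def mod_div(a, b, p=p):
    # a / b mod p = a * b^(p-2) mod p (Fermat's little theorem; requires b != 0)
    return (a * pow(b, p - 2, p)) % p

def make_dataset(op, p=p):
    """Enumerate all (a, b) pairs with labels op(a, b) reduced mod p."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    labels = [op(a, b) % p for a, b in pairs]
    return pairs, labels

pairs, labels = make_dataset(lambda a, b: a * b)  # multiplication task
assert len(pairs) == 12_769  # 113^2 total pairs; split 50/50 train/validation
```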
| Model | Architecture | Parameters |
|---|---|---|
| MLP | Embed → Concat → Linear(128) → ReLU → Linear → Unembed | ~78K |
| KAN | Embed → Concat → KANLinear(64) → KANLinear → Unembed | ~75K |
Critical hyperparameters for grokking:
- Weight Decay: 1.0 (essential for grokking)
- Learning Rate: 1e-3
- Optimizer: AdamW
- Batch Size: Full batch
- Max Epochs: 100,000
The project implements smart stopping based on phase detection:
- Phase 1: Random → both accuracies low
- Phase 2: Memorization → train > 99%, val < 50%
- Phase 3: Circuit Formation → val climbing (50–80%)
- Phase 4: Cleanup → val approaching threshold (80–95%)
- Phase 5: Grokked → val > 95% ✓
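A minimal sketch of how a phase detector can map (train, val) accuracy to the phases above. The thresholds come from the list; the function name and exact logic are illustrative, not the repo's implementation:

```python
def grokking_phase(train_acc, val_acc):
    """Classify a training step into phases 1-5 (illustrative thresholds)."""
    if val_acc > 0.95:
        return 5  # Grokked -> stop early
    if val_acc > 0.80:
        return 4  # Cleanup
    if val_acc > 0.50:
        return 3  # Circuit formation
    if train_acc > 0.99:
        return 2  # Memorization
    return 1  # Random

assert grokking_phase(1.00, 0.30) == 2  # memorized, not yet generalizing
assert grokking_phase(1.00, 0.97) == 5  # grokked
```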
```
can-kan-grok/
├── can_kan_grok/
│   ├── configs/          # Configuration dataclasses
│   ├── models/           # MLP and KAN implementations
│   ├── data/             # Modular arithmetic datasets
│   ├── training/         # PyTorch Lightning infrastructure
│   ├── detection/        # Grokking detection algorithms
│   └── visualization/    # Plotting utilities
├── scripts/
│   ├── run_experiment.py             # Main experiment runner
│   └── analyze_modular_results.py    # Results analysis
├── results/
│   └── modular/
│       └── figures/      # Generated visualizations
└── tests/                # Unit tests
```
All figures are generated in results/modular/figures/:
| Figure | Description |
|---|---|
| publication_figure.png | 4-panel publication-ready figure |
| kan_speedup_factor.png | KAN speedup over MLP |
| grokking_boxplots.png | Grokking-time distribution by operation |
| multiplication_curves_comparison.png | Training curves (multiplication) |
| grokking_heatmap.png | Summary heatmap |
```bibtex
@misc{can_kan_grok_2025,
  title={Can KAN Grok? Empirical Investigation of Grokking in Kolmogorov-Arnold Networks},
  author={Samuel T. Chakwera},
  year={2025},
  url={https://github.com/stchakwdev/can-kan-grok},
}
```

- Nanda et al. (2023) - Progress measures for grokking via mechanistic interpretability
- Power et al. (2022) - Grokking: Generalization beyond overfitting
- Liu et al. (2024) - KAN: Kolmogorov-Arnold Networks
- Park et al. (2024) - Acceleration of Grokking via Kolmogorov-Arnold Representation
MIT License - see LICENSE for details.
