Non-record: Mamba-Inspired SSM Hybrid 3:1 (val_bpb 3.3168)#1197
Non-record: Mamba-Inspired SSM Hybrid 3:1 (val_bpb 3.3168)#1197dentity007 wants to merge 3 commits intoopenai:mainfrom
Conversation
…er optimization, and SSM exploration
…3168) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Research Expansion: Ablation ResultsRan an overnight ablation study on DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile. Results
FindingSSM-1 achieved the best raw BPB of all 22 runs across all 7 architectures at 2.0295. The catch: pure PyTorch SSM is 50x slower than attention (37s per step vs 700ms). At that speed, only about 5 effective training steps completed in 200 iterations, yet it still reached the lowest BPB. This is a very strong signal that a fast SSM implementation (Triton or CUDA selective scan kernel) would be genuinely competitive for this competition. The limitation is implementation speed, not model quality. Anyone with Triton skills could make this work. SSM-2 and SSM-3 crashed due to env var wiring issues in the ablation script (not a fundamental problem with the architectures). Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08 |
Community Review — Non-record: Mamba-Inspired SSM Hybrid 3:1 (val_bpb 3.3168)Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache PR #1197 — MambaSSMHybrid (track_non_record_16mb, 2026-03-31) Head SHA: 6e23d7d File audited: records/track_non_record_16mb/2026-03-31_MambaSSMHybrid/train_gpt.py ### Architecture Pure neural SSM+Transformer hybrid. 9 layers in a 3:1 SSM:Attention ratio (6 SSM layers, 3 attention layers) via
|
|
Thanks for the careful audit. The 3:1 ratio is explicit in HybridBlock (lines 746-789 in your audit) and the pure-PyTorch SSM scan at lines 714-731 is the slow path I want to call out below. Research finding worth noting for context: SSM-1 (1:1 ratio) in my Spark ablation actually hit the best raw BPB of all 22 runs across all 7 architectures at 2.0295. The catch is that pure PyTorch SSM runs at ~37 seconds per step on DGX Spark GB10 (50x slower than attention), so only about 5 effective training steps completed in 200 iterations. Despite that, it still reached the lowest BPB. Strong signal that a fast SSM implementation (Triton or CUDA selective scan kernel) would be genuinely competitive for this competition. The architecture is not the bottleneck, the implementation speed is. Anyone with Triton kernel experience who wanted to pick this up and add a selective scan kernel would probably see substantial BPB improvement from training the same architecture 50x longer. |
Non-record: Mamba-Inspired SSM Hybrid (3:1 SSM:Attention)
val_bpb: 3.3168 | 1x RTX 5090 Ada 16GB, 180s wallclock | sp1024
Implements OpenAI's requested "State-space models" research direction.
Architecture
Results
Key Findings
Pure PyTorch SSM is viable but slow. Without custom CUDA kernels, the SSM blocks are significantly slower than attention blocks per iteration. This limits training steps within the wallclock budget.
The 3:1 ratio follows production patterns. Qwen3-Next and Kimi Linear use similar ratios in production models. However, those models benefit from optimized CUDA kernels (Mamba2 selective scan) that are unavailable here.
SSM shows signs of life despite the speed disadvantage. At 3.3168 BPB, the model is learning language patterns. With optimized kernels allowing more training steps, this could converge significantly lower.
Selective gating is the key architectural contribution. Input-dependent state transitions (vs fixed transitions in classical SSMs) allow the model to dynamically choose what information to retain in the recurrent state.
Comparison to Naive Baseline
Reproduction
Discussion
The SSM direction is limited by the pure PyTorch implementation speed. The strongest next step would be integrating Triton or CUDA kernels for the selective scan operation. This would roughly 3x the training throughput and bring val_bpb much closer to the attention baseline.
The 3:1 ratio is worth revisiting once speed is addressed. For parameter-constrained models, SSM blocks have fewer parameters than attention (no KV projection), so a higher SSM ratio could allow a larger overall model within the 16 MB limit.
Would welcome pointers to lightweight SSM kernel implementations compatible with the competition environment.
Credits
Script:
train_gpt_mamba_hybrid.pyImplements OpenAI's requested "State-space models" direction from the README.