
Non-record: Mamba-Inspired SSM Hybrid 3:1 (val_bpb 3.3168)#1197

Open
dentity007 wants to merge 3 commits into openai:main from NathanMaine:research/mamba-hybrid

Conversation


@dentity007 dentity007 commented Mar 31, 2026

Non-record: Mamba-Inspired SSM Hybrid (3:1 SSM:Attention)

val_bpb: 3.3168 | 1x RTX 5090 Ada 16GB, 180s wallclock | sp1024

Implements OpenAI's requested "State-space models" research direction.

Architecture

  • Pure PyTorch SSM implementation (no custom CUDA kernels like Mamba's selective scan)
  • 3:1 ratio of SSM to attention blocks, following the Qwen3-Next and Kimi Linear architecture pattern
  • SSM blocks use selective gating with input-dependent state transitions
  • Causal Conv1d for local context, SiLU output activation
  • Attention blocks use standard GQA with flash attention
  • Base config: 9 layers (7 SSM + 2 attention), d=512, 8 heads, sp1024 vocab
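
The pieces above fit in a few dozen lines of pure PyTorch. Below is a minimal illustration of the described block (causal Conv1d for local context, input-dependent gating, sequential state scan, SiLU gate), not the submission's actual train_gpt_mamba_hybrid.py code; all names and shapes here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSSMBlock(nn.Module):
    """Hypothetical sketch: causal depthwise conv for local context,
    input-dependent (selective) diagonal state transition, SiLU gate."""
    def __init__(self, d_model=512, conv_width=4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, conv_width,
                              padding=conv_width - 1, groups=d_model)
        self.to_decay = nn.Linear(d_model, d_model)  # input-dependent transition
        self.to_input = nn.Linear(d_model, d_model)
        self.to_gate = nn.Linear(d_model, d_model)
        self.proj_out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (B, T, D)
        B, T, D = x.shape
        # Causal conv: left padding, then trim the right overhang back to T.
        u = self.conv(x.transpose(1, 2))[..., :T].transpose(1, 2)
        a = torch.sigmoid(self.to_decay(u))          # per-step decay in (0, 1)
        b = self.to_input(u)
        h = x.new_zeros(B, D)
        ys = []
        for t in range(T):                           # the slow sequential scan
            h = a[:, t] * h + (1 - a[:, t]) * b[:, t]
            ys.append(h)
        y = torch.stack(ys, dim=1)
        return self.proj_out(y * F.silu(self.to_gate(x)))
```

The Python-level loop over T is exactly where the wallclock cost comes from; Mamba's selective scan fuses it into a single CUDA kernel.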

Results

| Metric | Value |
| --- | --- |
| val_bpb (final) | 3.3168 |
| SSM blocks | 7 |
| Attention blocks | 2 |
| Training time | 180s (1x RTX 5090) |

Key Findings

  1. Pure PyTorch SSM is viable but slow. Without custom CUDA kernels, the SSM blocks are significantly slower than attention blocks per iteration. This limits training steps within the wallclock budget.

  2. The 3:1 ratio follows production patterns. Qwen3-Next and Kimi Linear use similar ratios in production models. However, those models benefit from optimized CUDA kernels (Mamba2 selective scan) that are unavailable here.

  3. SSM shows signs of life despite the speed disadvantage. At 3.3168 BPB, the model is learning language patterns. With optimized kernels allowing more training steps, this could converge significantly lower.

  4. Selective gating is the key architectural contribution. Input-dependent state transitions (vs fixed transitions in classical SSMs) allow the model to dynamically choose what information to retain in the recurrent state.
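
The distinction in point 4 fits in a few lines. A toy contrast (illustrative only, not the submission's code; `W_a` is a hypothetical projection): a classical SSM applies one fixed decay at every step, while a selective SSM computes the decay from the current input.

```python
import torch

T, D = 8, 4
x = torch.randn(1, T, D)
W_a = torch.randn(D, D) * 0.1            # hypothetical projection weights

h_fixed = torch.zeros(1, D)
h_selective = torch.zeros(1, D)
a_fixed = torch.full((1, D), 0.9)        # classical SSM: constant transition
for t in range(T):
    a_t = torch.sigmoid(x[:, t] @ W_a)   # selective: transition depends on x_t
    h_fixed = a_fixed * h_fixed + x[:, t]
    h_selective = a_t * h_selective + x[:, t]
```

With an input-dependent a_t, a near-zero decay lets the model flush its state on an uninformative token, which a fixed decay cannot do.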

Comparison to Naive Baseline

| | Naive Baseline | SSM Hybrid |
| --- | --- | --- |
| Architecture | Pure attention | 3:1 SSM:Attention |
| Custom kernels | Flash attention | None (pure PyTorch) |
| val_bpb | 1.2244 | 3.3168 |
| Steps in 180s | ~500 | ~200 (SSM overhead) |

Reproduction

```
pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
MAX_WALLCLOCK_SECONDS=180 python3 train_gpt_mamba_hybrid.py
```

Discussion

The SSM direction is limited by the speed of the pure PyTorch implementation. The strongest next step would be integrating Triton or CUDA kernels for the selective scan operation, which could roughly triple training throughput and bring val_bpb much closer to the attention baseline.
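
One pure-PyTorch mitigation short of a full Triton kernel: a diagonal recurrence h_t = a_t * h_{t-1} + b_t is associative, so the Python loop over T can be replaced by a log-depth Hillis-Steele scan over whole-tensor ops. A sketch under that assumption (diagonal transition, `a` and `b` of shape (..., T); not the submission's code):

```python
import torch
import torch.nn.functional as F

def parallel_linear_scan(a, b):
    """Inclusive scan computing h_t = a_t * h_{t-1} + b_t with h_{-1} = 0,
    over the last dim, in O(log T) tensor steps instead of T Python steps."""
    T = a.shape[-1]
    n = 1
    while n < T:
        # Element t combines with element t - n; out-of-range positions
        # combine with the identity element (a = 1, b = 0).
        a_prev = F.pad(a[..., :-n], (n, 0), value=1.0)
        b_prev = F.pad(b[..., :-n], (n, 0), value=0.0)
        b = a * b_prev + b   # must use the pre-update a
        a = a * a_prev
        n *= 2
    return b
```

This stays entirely in PyTorch while cutting the scan from T sequential steps to ceil(log2 T) batched ones; a fused Triton selective-scan kernel would still be faster.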

The 3:1 ratio is worth revisiting once speed is addressed. For parameter-constrained models, SSM blocks have fewer parameters than attention (no KV projection), so a higher SSM ratio could allow a larger overall model within the 16 MB limit.
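
The parameter argument can be made concrete with rough arithmetic. The shapes below are illustrative only (hypothetical, not the submission's exact layer config): a full-width attention block carries four d x d projections, while a gated diagonal-SSM block of the kind discussed here needs about three d x d projections plus a small depthwise conv.

```python
d = 512
conv_width = 4
attn_params = 4 * d * d                  # Wq + Wk + Wv + Wo at full width
ssm_params = 3 * d * d + d * conv_width  # decay/input/gate projections + conv
print(attn_params, ssm_params)           # 1048576 788480
```

At these shapes each SSM block saves roughly a quarter of an attention block's parameters, which is what makes a higher SSM ratio attractive under a 16 MB artifact cap.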

Would welcome pointers to lightweight SSM kernel implementations compatible with the competition environment.

Credits

Script: train_gpt_mamba_hybrid.py
Implements OpenAI's requested "State-space models" direction from the README.

@dentity007 dentity007 closed this Apr 1, 2026
@dentity007 dentity007 reopened this Apr 1, 2026
@dentity007
Author

Research Expansion: Ablation Results

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile.

Results

| Run | Config | Params | val_bpb | ms/step |
| --- | --- | --- | --- | --- |
| SSM-1 | 1:1 ratio | 26.5M | 2.0295 | 37,492 |
| SSM-4 | Larger state (64) | 31.3M | 2.1816 | 55,824 |

Finding

SSM-1 achieved the best raw BPB of all 22 runs across all 7 architectures at 2.0295. The catch: pure PyTorch SSM is 50x slower than attention (37s per step vs 700ms). At that speed, only about 5 effective training steps completed in 200 iterations, yet it still reached the lowest BPB.

This is a very strong signal that a fast SSM implementation (Triton or CUDA selective scan kernel) would be genuinely competitive for this competition. The limitation is implementation speed, not model quality. Anyone with Triton skills could make this work.

SSM-2 and SSM-3 crashed due to env var wiring issues in the ablation script (not a fundamental problem with the architectures).

Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

@MatoTeziTanka

Community Review — Non-record: Mamba-Inspired SSM Hybrid 3:1 (val_bpb 3.3168)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1197 — MambaSSMHybrid (track_non_record_16mb, 2026-03-31)
Head SHA: 6e23d7d
File audited: records/track_non_record_16mb/2026-03-31_MambaSSMHybrid/train_gpt.py

### Architecture

Pure neural SSM+Transformer hybrid. 9 layers in a 3:1 SSM:Attention ratio (6 SSM layers, 3 attention layers) via HybridBlock (lines 746–789). The SSM is a pure-PyTorch simplified Mamba with input-dependent gating, conv1d local context, and sequential state scan (lines 649–739). No bigram tables, no hash keys, no n-gram structures anywhere in the file.

### Check: ILLEGAL n-gram / BigramHash family bug

NOT PRESENT. Searched entire file. No n-gram hash, no XOR key, no BigramHash, no frequency table derived from targets. No token identity used outside of standard embedding lookup and cross-entropy loss.

### Check: ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

NOT PRESENT. val_tokens is loaded once at line 1019 and passed read-only into eval_val() (lines 232–291). eval_val() is called only under should_validate (lines 1198–1219) and in the post-training quantization roundtrip check (lines 1324–1335). No gradient computation touches val_tokens at any point — eval_val runs under torch.inference_mode() (line 263). No TTT of any kind on val_tokens.

### Check: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

NOT PRESENT — and that is fine. There is no TTT at all, score-first or otherwise. The submission is pure neural.

### Check: SCORED-REGION SLOT HOLD

NOT APPLICABLE. No scored-region manipulation detected.

### Check: CLEAN pure neural

CONFIRMED. Training reads from train_files shards only (lines 1138, 1236). Validation reads from val_files in inference mode only (lines 1019, 1202–1213). The Mamba SSM scan (lines 714–731) is a standard causal...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

@dentity007
Author

Thanks for the careful audit. The 3:1 ratio is explicit in HybridBlock (lines 746-789 in your audit) and the pure-PyTorch SSM scan at lines 714-731 is the slow path I want to call out below.

Research finding worth noting for context: SSM-1 (1:1 ratio) in my Spark ablation actually hit the best raw BPB of all 22 runs across all 7 architectures at 2.0295. The catch is that pure PyTorch SSM runs at ~37 seconds per step on DGX Spark GB10 (50x slower than attention), so only about 5 effective training steps completed in 200 iterations. Despite that, it still reached the lowest BPB. Strong signal that a fast SSM implementation (Triton or CUDA selective scan kernel) would be genuinely competitive for this competition. The architecture is not the bottleneck, the implementation speed is.

Anyone with Triton kernel experience who wanted to pick this up and add a selective scan kernel would probably see substantial BPB improvement from training the same architecture 50x longer.

