Non-record: Text Diffusion (MDLM) — Masked Discrete Diffusion (val_bpb 3.3801) (#1194)
dentity007 wants to merge 3 commits into openai:main
**Research Expansion: Ablation Results**

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission: 200 training steps, sp1024, no torch.compile.
**Finding**

All three configurations produce identical BPB to four decimal places: the diffusion loss contributes nothing to the causal eval. The 70/30 split is actually slower (1388 ms vs 997 ms) because the diffusion forward pass adds overhead without benefit. Diffusion for text compression appears to be a fundamental mismatch with causal eval, not a tuning problem: knowledge from bidirectional masked prediction does not transfer to left-to-right scoring. Extracting value from diffusion training would require eval-time protocol changes. Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08
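For readers unfamiliar with the training side of the ablation, the MDLM-style corruption step can be sketched as follows: tokens are independently replaced by a [MASK] id at a sampled rate, and the diffusion loss is taken only on the masked positions. The `MASK_ID` value and the function names here are illustrative, not taken from the PR's code.

```python
import random

MASK_ID = 50257  # hypothetical [MASK] token id, not from the PR

def mdlm_corrupt(tokens, t, rng):
    """MDLM-style corruption: each token is independently replaced by
    [MASK] with probability t; the diffusion loss is then computed only
    on the masked positions."""
    mask = [rng.random() < t for _ in tokens]
    corrupted = [MASK_ID if m else tok for m, tok in zip(mask, tokens)]
    return corrupted, mask

rng = random.Random(0)
tokens = list(range(8))
corrupted, mask = mdlm_corrupt(tokens, t=0.5, rng=rng)
# Unmasked positions pass through unchanged; masked ones carry MASK_ID.
assert all(c == tok for c, tok, m in zip(corrupted, tokens, mask) if not m)
assert all(c == MASK_ID for c, m in zip(corrupted, mask) if m)
```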
**Community Review — Non-record: Text Diffusion (MDLM) — Masked Discrete Diffusion (val_bpb 3.3801)**

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache. PR #1194 — TextDiffusion MDLM, track_non_record_16mb.
**Checklist**

- ILLEGAL n-gram family bug (target XOR'd into hash key, BigramHash): NOT PRESENT.
- ILLEGAL Pre-Quant TTT (multi-epoch gradient updates on val_tokens before scoring) / LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard): NOT PRESENT.
- HOLD scored-region SLOT: NOT PRESENT. No scored-region adaptation or SLOT logic.
- Architecture: MDLM-style masked discrete diffusion with bidirectional attention.

Conclusion: This is a clean pure-neural submission; the MDLM diffusion training is novel.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
Thanks for the review. The specific observation that "both losses use only train_loader batches drawn from training shards (train_files), never from val shards" is exactly the right check for this architecture, since diffusion models can look suspicious at a glance due to the bidirectional attention path. For context on the research finding: my Spark ablation tested three AR/diffusion ratios (70/30, 50/50, 100/0), and all three produced identical BPB to four decimal places. The diffusion loss appears to contribute nothing to causal eval, which is a fundamental mismatch rather than a tuning problem. Leaving the submission open as a documented negative result for anyone researching diffusion for text compression in the future.
Non-record: Text Diffusion (MDLM) - Masked Discrete Diffusion
val_bpb: 3.3801 | 1x RTX 5090 Ada 16GB, 180s wallclock | sp1024
Implements OpenAI's requested "Text diffusion" research direction. First diffusion submission to Parameter Golf.
Architecture
Results
Key Findings
Diffusion shows signs of life but is far from competitive. At 3.3801 BPB vs the naive baseline's 1.2244, the model is learning language structure but the diffusion objective is much less efficient than pure AR for this task.
The 30/70 AR/diffusion split was chosen empirically. Pure diffusion (100%) diverged, and pure AR (100%) is the standard baseline; the hybrid lets the diffusion head learn alongside the AR signal.
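A minimal sketch of how such a loss split could be wired, assuming a simple convex combination of the two objectives (the function name and signature are hypothetical, not taken from train_gpt_diffusion.py):

```python
def hybrid_loss(ar_loss: float, diff_loss: float, ar_frac: float = 0.3) -> float:
    """Convex combination of the AR and diffusion losses.
    ar_frac=0.3 matches the 30/70 AR/diffusion split described above;
    the split is a hyperparameter, not part of the eval protocol."""
    return ar_frac * ar_loss + (1.0 - ar_frac) * diff_loss

# Pure AR (ar_frac=1.0) ignores the diffusion term entirely.
assert hybrid_loss(2.0, 10.0, ar_frac=1.0) == 2.0
# The 30/70 split: 0.3 * 2.0 + 0.7 * 10.0 = 7.6
assert abs(hybrid_loss(2.0, 10.0) - 7.6) < 1e-9
```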
Short training time (180s) limits conclusions. With 600s and more iterations, the diffusion component may converge further. The 180s run was a proof-of-concept.
Bidirectional attention during training is the key difference. The model sees all tokens when predicting masked positions, which is fundamentally different from causal AR. This could be powerful for compression if the eval protocol allowed non-causal scoring.
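The training/eval mismatch described above comes down to the attention mask shape. A small sketch of the two mask patterns, under the usual convention that True means "may attend" (pure-Python illustration, not the PR's implementation):

```python
def causal_mask(n):
    """Lower-triangular mask: position i may attend only to j <= i,
    as required by left-to-right AR scoring."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Full mask: every position attends to every position, as in
    MDLM-style masked-prediction training."""
    return [[True] * n for _ in range(n)]

n = 4
visible_causal = sum(sum(row) for row in causal_mask(n))
visible_bidir = sum(sum(row) for row in bidirectional_mask(n))
# Causal attention exposes n*(n+1)/2 pairs; bidirectional exposes all n*n.
assert visible_causal == n * (n + 1) // 2
assert visible_bidir == n * n
```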
Comparison to Naive Baseline
Reproduction
Discussion
Text diffusion for language modeling remains challenging. The fundamental issue is that evaluation requires causal (left-to-right) scoring, but diffusion training benefits from bidirectional context. This mismatch means the diffusion component's strength (global context) cannot be leveraged at eval time.
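To make the causal-scoring constraint concrete, here is the standard conversion from a summed negative log-likelihood (in nats, accumulated left-to-right) into the bits-per-byte metric reported above; the formula is the usual bpb definition, and the example numbers are illustrative only.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed causal negative log-likelihood (nats) into
    bits per byte: nats -> bits via log(2), then normalize by bytes."""
    return total_nll_nats / math.log(2) / total_bytes

# Illustrative: 1000 bytes scored at a total NLL of 2343.0 nats lands
# near the 3.38 bpb regime this submission reports.
bpb = bits_per_byte(2343.0, 1000)
assert 3.3 < bpb < 3.5
```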
Potential directions: using diffusion as a pre-training objective before AR fine-tuning, or exploring discrete diffusion with causal masking patterns that are compatible with AR eval.
Would welcome feedback on whether longer training runs or alternative masking schedules would be worth exploring.
Credits
Script: `train_gpt_diffusion.py` — implements OpenAI's requested "Text diffusion" direction from the README.