
Non-record: Text Diffusion (MDLM) — Masked Discrete Diffusion (val_bpb 3.3801)#1194

Open

dentity007 wants to merge 3 commits into openai:main from NathanMaine:research/text-diffusion

Conversation


@dentity007 dentity007 commented Mar 31, 2026

Non-record: Text Diffusion (MDLM) - Masked Discrete Diffusion

val_bpb: 3.3801 | 1x RTX 5090 Ada 16GB, 180s wallclock | sp1024

Implements OpenAI's requested "Text diffusion" research direction. First diffusion submission to Parameter Golf.

Architecture

  • MDLM (Masked Diffusion Language Model) approach: tokens are randomly masked during training; the model predicts the masked tokens using bidirectional attention
  • Hybrid training loss: 30% autoregressive (causal), 70% diffusion (bidirectional masked prediction)
  • Cosine masking schedule for curriculum during training
  • Evaluation uses standard causal AR mode (bidirectional heads stripped)
  • Base config: 9 layers, d=512, 8 heads, sp1024 vocab, MLP 2x
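The hybrid objective and cosine schedule above can be sketched as follows. This is a minimal illustration, not the submission's `train_gpt_diffusion.py`: `MASK_ID`, `cosine_mask_rate`, and `hybrid_loss` are names I made up, and the model is assumed to accept a `causal` flag selecting the attention mode.

```python
import math
import torch
import torch.nn.functional as F

MASK_ID = 1023  # hypothetical: reserve the last sp1024 vocab id as [MASK]

def cosine_mask_rate(t):
    """Fraction of tokens masked at schedule time t in [0, 1]."""
    return 1.0 - math.cos(0.5 * math.pi * t)

def hybrid_loss(model, tokens, t, ar_weight=0.3):
    """30% AR / 70% diffusion hybrid objective (weights per the submission)."""
    # AR branch: causal next-token prediction on the clean sequence.
    ar_logits = model(tokens[:, :-1], causal=True)
    ar_loss = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
    # Diffusion branch: corrupt tokens per the cosine schedule, then predict
    # the originals with bidirectional attention, scoring only masked positions.
    mask = torch.rand(tokens.shape) < cosine_mask_rate(t)
    noised = tokens.masked_fill(mask, MASK_ID)
    diff_logits = model(noised, causal=False)
    diff_loss = F.cross_entropy(diff_logits[mask], tokens[mask])
    return ar_weight * ar_loss + (1.0 - ar_weight) * diff_loss
```

As `t` sweeps from 0 to 1 the mask rate rises from 0% to 100%, which gives the curriculum effect: early batches are mostly-visible (easy), late batches are mostly-masked (hard).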

Results

| Metric | Value |
| --- | --- |
| val_bpb (final) | 3.3801 |
| Training time | 180s (1x RTX 5090) |
| Hardware | NVIDIA RTX 5090 Ada 16GB |

Key Findings

  1. Diffusion shows signs of life but is far from competitive. At 3.3801 BPB vs the naive baseline's 1.2244, the model is learning language structure but the diffusion objective is much less efficient than pure AR for this task.

  2. The 30/70 AR/diffusion split was chosen empirically. Pure diffusion (100%) diverged. Pure AR (100%) is the standard baseline. The hybrid allows the diffusion head to learn from the AR signal.

  3. Short training time (180s) limits conclusions. With 600s and more iterations, the diffusion component may converge further. The 180s run was a proof-of-concept.

  4. Bidirectional attention during training is the key difference. The model sees all tokens when predicting masked positions, which is fundamentally different from causal AR. This could be powerful for compression if the eval protocol allowed non-causal scoring.
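Finding 4 comes down to a single change in the attention mask between training and eval. A minimal sketch of the two modes (illustrative helper, not the submission's code):

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """True where attention is allowed.

    causal=True  -> lower-triangular: position i sees only j <= i
                    (the AR mode used at eval time).
    causal=False -> all ones: every position sees every other position
                    (the bidirectional mode used for masked prediction).
    """
    if causal:
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return torch.ones(seq_len, seq_len, dtype=torch.bool)
```

The eval protocol pins `causal=True`, so whatever the model learned from the right-hand context during bidirectional training is unreachable at scoring time.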

Comparison to Naive Baseline

| | Naive Baseline | Text Diffusion |
| --- | --- | --- |
| Approach | Autoregressive | Hybrid AR + Diffusion |
| val_bpb | 1.2244 | 3.3801 |
| Attention | Causal only | Bidirectional (train) + Causal (eval) |

Reproduction

```
pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
MAX_WALLCLOCK_SECONDS=180 python3 train_gpt_diffusion.py
```

Discussion

Text diffusion for language modeling remains challenging. The fundamental issue is that evaluation requires causal (left-to-right) scoring, but diffusion training benefits from bidirectional context. This mismatch means the diffusion component's strength (global context) cannot be leveraged at eval time.
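For concreteness, the metric being scored is causal bits-per-byte: summed next-token negative log-likelihood (in nats, as PyTorch's cross-entropy reports it) converted to bits and normalized by the byte length of the validation text. A hypothetical helper sketching the conversion (names are mine, not the eval harness's):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert summed token-level NLL in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)
```

Because each token's NLL is conditioned only on the tokens to its left, any skill that depends on right-hand context simply never enters this sum.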

Potential directions: using diffusion as a pre-training objective before AR fine-tuning, or exploring discrete diffusion with causal masking patterns that are compatible with AR eval.

Would welcome feedback on whether longer training runs or alternative masking schedules would be worth exploring.

Credits

Script: train_gpt_diffusion.py
Implements OpenAI's requested "Text diffusion" direction from the README.

@dentity007 dentity007 closed this Apr 1, 2026
@dentity007 dentity007 reopened this Apr 1, 2026
@dentity007
Author

Research Expansion: Ablation Results

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile.

Results

| Run | AR/Diff Split | val_bpb | ms/step |
| --- | --- | --- | --- |
| DIFF-1 | 70% AR / 30% diff | 2.4195 | 1,388 |
| DIFF-2 | 50% AR / 50% diff | 2.4194 | 997 |
| DIFF-3 | 100% AR (reference) | 2.4194 | 997 |

Finding

All three configurations produce essentially identical BPB (within 0.0001). The diffusion loss contributes effectively nothing to causal eval. The 70/30 split is also slower (1,388 ms/step vs 997) because the diffusion forward pass adds overhead without benefit.

Diffusion for text compression appears to be a fundamental mismatch with causal eval, not a tuning problem. The knowledge from bidirectional masked prediction does not transfer to left-to-right scoring. Would need eval-time protocol changes to extract value from diffusion training.

Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

@MatoTeziTanka

Community Review — Non-record: Text Diffusion (MDLM) — Masked Discrete Diffusion (val_bpb 3.3801)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1194 — TextDiffusion MDLM, track_non_record_16mb

Head SHA

a6f3a34

Files Changed

  • APPROACH.md
  • records/track_non_record_16mb/2026-03-31_TextDiffusion_MDLM/README.md
  • records/track_non_record_16mb/2026-03-31_TextDiffusion_MDLM/submission.json
  • records/track_non_record_16mb/2026-03-31_TextDiffusion_MDLM/train_gpt.py (1321 lines)

Checklist

ILLEGAL n-gram family bug (target XOR'd into hash key, BigramHash): NOT PRESENT.
No n-gram, bigram, hash, or XOR logic anywhere in the file. Clean.

ILLEGAL Pre-Quant TTT (multi-epoch gradient updates on val_tokens before scoring):
NOT PRESENT. val_tokens appears only in eval_val() (lines 219, 237, 250). That function runs
exclusively under model.eval() + torch.inference_mode() (lines 244–245), with no .backward()
call, no optimizer step, and no weight mutation of any kind. val_tokens data never enters a gradient
path. Clean.

LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard): NOT PRESENT.
No TTT of any kind — no test-time fine-tuning, no is_last_chunk guard, no scored-region SLOT logic.

HOLD scored-region SLOT: NOT PRESENT. No scored-region adaptation or SLOT logic.

Architecture: MDLM-style masked discrete diffusion with bidirectional attention
(DiffusionGPT, lines 656–873). Training uses a hybrid loss: AR causal loss + diffusion
masked-prediction loss (lines 1178–1207). Both losses use only train_loader batches drawn
from training shards (train_files), never from val shards. Evaluation is standard AR BPB
over val_tokens in inference_mode only (lines 212–273). Optimizer covers standard neural
parameters plus a mask_token embedding (line 1018). No external lookup tables, n-gram
caches, or non-neural score boosting.

Conclusion: This is a clean pure-neural submission. The MDLM diffusion training is novel
but fully legal — it never touches val_tokens for gradient updates, contains no n-gram hash
tricks, and no TTT of any form. Recommend APPROVE.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

@dentity007
Author

Thanks for the review. The specific observation that "both losses use only train_loader batches drawn from training shards (train_files), never from val shards" is exactly the right check for this architecture, since diffusion models can look suspicious at a glance due to the bidirectional attention path.

For context on the research finding: my Spark ablation tested three AR/diff ratios (70/30, 50/50, 100/0) and they produced identical BPB to 4 decimal places. The diffusion loss appears to contribute nothing to causal eval, which is a fundamental mismatch rather than a tuning problem. Leaving the submission open as a documented negative result for anyone researching diffusion-for-text-compression in the future.
