
Non-record: Text Diffusion (MDLM) — Masked Discrete Diffusion (val_bpb 3.3801)#1194

Open

dentity007 wants to merge 3 commits into openai:main from NathanMaine:research/text-diffusion

Conversation


@dentity007 dentity007 commented Mar 31, 2026

Non-record: Text Diffusion (MDLM) - Masked Discrete Diffusion

val_bpb: 3.3801 | 1x RTX 5090 Ada 16GB, 180s wallclock | sp1024

Implements OpenAI's requested "Text diffusion" research direction. First diffusion submission to Parameter Golf.

Architecture

  • MDLM (Masked Diffusion Language Model) approach: tokens are randomly masked during training; the model predicts the masked tokens using bidirectional attention
  • Hybrid training loss: 30% autoregressive (causal), 70% diffusion (bidirectional masked prediction)
  • Cosine masking schedule for curriculum during training
  • Evaluation uses standard causal AR mode (bidirectional heads stripped)
  • Base config: 9 layers, d=512, 8 heads, sp1024 vocab, MLP 2x
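The hybrid objective and cosine schedule above can be sketched as follows. This is a minimal illustration, not the submission's `train_gpt_diffusion.py`: `MASK_ID`, `cosine_mask_rate`, and `hybrid_loss` are names I made up, and the model is assumed to accept a `causal` flag selecting the attention mode.

```python
import math
import torch
import torch.nn.functional as F

MASK_ID = 1023  # hypothetical: reserve the last sp1024 vocab id as [MASK]

def cosine_mask_rate(t):
    """Fraction of tokens masked at schedule time t in [0, 1]."""
    return 1.0 - math.cos(0.5 * math.pi * t)

def hybrid_loss(model, tokens, t, ar_weight=0.3):
    """30% AR / 70% diffusion hybrid objective (weights per the submission)."""
    # AR branch: causal next-token prediction on the clean sequence.
    ar_logits = model(tokens[:, :-1], causal=True)
    ar_loss = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
    # Diffusion branch: corrupt tokens per the cosine schedule, then predict
    # the originals with bidirectional attention, scoring only masked positions.
    mask = torch.rand(tokens.shape) < cosine_mask_rate(t)
    noised = tokens.masked_fill(mask, MASK_ID)
    diff_logits = model(noised, causal=False)
    diff_loss = F.cross_entropy(diff_logits[mask], tokens[mask])
    return ar_weight * ar_loss + (1.0 - ar_weight) * diff_loss
```

As `t` sweeps from 0 to 1 the mask rate rises from 0% to 100%, which gives the curriculum effect: early batches are mostly-visible (easy), late batches are mostly-masked (hard).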

Results

| Metric | Value |
| --- | --- |
| val_bpb (final) | 3.3801 |
| Training time | 180s (1x RTX 5090) |
| Hardware | NVIDIA RTX 5090 Ada 16GB |

Key Findings

  1. Diffusion shows signs of life but is far from competitive. At 3.3801 BPB vs the naive baseline's 1.2244, the model is learning language structure but the diffusion objective is much less efficient than pure AR for this task.

  2. The 30/70 AR/diffusion split was chosen empirically. Pure diffusion (100%) diverged. Pure AR (100%) is the standard baseline. The hybrid allows the diffusion head to learn from the AR signal.

  3. Short training time (180s) limits conclusions. With 600s and more iterations, the diffusion component may converge further. The 180s run was a proof-of-concept.

  4. Bidirectional attention during training is the key difference. The model sees all tokens when predicting masked positions, which is fundamentally different from causal AR. This could be powerful for compression if the eval protocol allowed non-causal scoring.
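Finding 4 comes down to a single change in the attention mask between training and eval. A minimal sketch of the two modes (illustrative helper, not the submission's code):

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """True where attention is allowed.

    causal=True  -> lower-triangular: position i sees only j <= i
                    (the AR mode used at eval time).
    causal=False -> all ones: every position sees every other position
                    (the bidirectional mode used for masked prediction).
    """
    if causal:
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return torch.ones(seq_len, seq_len, dtype=torch.bool)
```

The eval protocol pins `causal=True`, so whatever the model learned from the right-hand context during bidirectional training is unreachable at scoring time.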

Comparison to Naive Baseline

| | Naive Baseline | Text Diffusion |
| --- | --- | --- |
| Approach | Autoregressive | Hybrid AR + Diffusion |
| val_bpb | 1.2244 | 3.3801 |
| Attention | Causal only | Bidirectional (train) + Causal (eval) |

Reproduction

```
pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
MAX_WALLCLOCK_SECONDS=180 python3 train_gpt_diffusion.py
```

Discussion

Text diffusion for language modeling remains challenging. The fundamental issue is that evaluation requires causal (left-to-right) scoring, but diffusion training benefits from bidirectional context. This mismatch means the diffusion component's strength (global context) cannot be leveraged at eval time.
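For concreteness, the metric being scored is causal bits-per-byte: summed next-token negative log-likelihood (in nats, as PyTorch's cross-entropy reports it) converted to bits and normalized by the byte length of the validation text. A hypothetical helper sketching the conversion (names are mine, not the eval harness's):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert summed token-level NLL in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)
```

Because each token's NLL is conditioned only on the tokens to its left, any skill that depends on right-hand context simply never enters this sum.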

Potential directions: using diffusion as a pre-training objective before AR fine-tuning, or exploring discrete diffusion with causal masking patterns that are compatible with AR eval.

Would welcome feedback on whether longer training runs or alternative masking schedules would be worth exploring.

Credits

Script: train_gpt_diffusion.py
Implements OpenAI's requested "Text diffusion" direction from the README.

@dentity007 dentity007 closed this Apr 1, 2026
@dentity007 dentity007 reopened this Apr 1, 2026
@dentity007
Author

Research Expansion: Ablation Results

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile.

Results

| Run | AR/Diff Split | val_bpb | ms/step |
| --- | --- | --- | --- |
| DIFF-1 | 70% AR / 30% diff | 2.4195 | 1,388 |
| DIFF-2 | 50% AR / 50% diff | 2.4194 | 997 |
| DIFF-3 | 100% AR (reference) | 2.4194 | 997 |

Finding

All three configurations produce essentially identical BPB (within 0.0001). The diffusion loss contributes effectively nothing to causal eval. The 70/30 split is also slower (1,388 ms/step vs 997) because the diffusion forward pass adds overhead without benefit.

Diffusion for text compression appears to be a fundamental mismatch with causal eval, not a tuning problem. The knowledge from bidirectional masked prediction does not transfer to left-to-right scoring. Would need eval-time protocol changes to extract value from diffusion training.

Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

@MatoTeziTanka

Community Review — Non-record: Text Diffusion (MDLM) — Masked Discrete Diffusion (val_bpb 3.3801)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1194 — TextDiffusion MDLM, track_non_record_16mb

Head SHA

a6f3a34

Files Changed

  • APPROACH.md
  • records/track_non_record_16mb/2026-03-31_TextDiffusion_MDLM/README.md
  • records/track_non_record_16mb/2026-03-31_TextDiffusion_MDLM/submission.json
  • records/track_non_record_16mb/2026-03-31_TextDiffusion_MDLM/train_gpt.py (1321 lines)

Checklist

ILLEGAL n-gram family bug (target XOR'd into hash key, BigramHash): NOT PRESENT.
No n-gram, bigram, hash, or XOR logic anywhere in the file. Clean.

ILLEGAL Pre-Quant TTT (multi-epoch gradient updates on val_tokens before scoring):
NOT PRESENT. val_tokens appears only in eval_val() (lines 219, 237, 250). That function runs
exclusively under model.eval() + torch.inference_mode() (lines 244–245), with no .backward()
call, no optimizer step, and no weight mutation of any kind. val_tokens data never enters a gradient
path. Clean.

LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard): NOT PRESENT.
No TTT of any kind — no test-time fine-tuning, no is_last_chunk guard, no scored-region SLOT logic.

HOLD scored-region SLOT: NOT PRESENT. No scored-region adaptation or SLOT logic.

Architecture: MDLM-style masked discrete diffusion with bidirectional attention
(DiffusionGPT, lines 656–873). Training uses a hybrid loss: AR causal loss + diffusion
masked-prediction loss (lines 1178–1207). Both losses use only train_loader batches drawn
from training shards (train_files), never from val shards. Evaluation is standard AR BPB
over val_tokens in inference_mode only (lines 212–273). Optimizer covers standard neural
parameters plus a mask_token embedding (line 1018). No external lookup tables, n-gram
caches, or non-neural score boosting.

Conclusion: This is a clean pure-neural submission. The MDLM diffusion training is novel
but fully legal — it never touches val_tokens for gradient updates, contains no n-gram hash
tricks, and no TTT of any form. Recommend APPROVE.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

@dentity007
Author

Thanks for the review. The specific observation that "both losses use only train_loader batches drawn from training shards (train_files), never from val shards" is exactly the right check for this architecture, since diffusion models can look suspicious at a glance due to the bidirectional attention path.

For context on the research finding: my Spark ablation tested three AR/diff ratios (70/30, 50/50, 100/0) and they produced identical BPB to 4 decimal places. The diffusion loss appears to contribute nothing to causal eval, which is a fundamental mismatch rather than a tuning problem. Leaving the submission open as a documented negative result for anyone researching diffusion-for-text-compression in the future.
