
Non-record: 11L XSA4 + EMA + SDTTT (3-seed mean val_bpb=1.1287)#406

Open
dentity007 wants to merge 2 commits into openai:main from NathanMaine:submission/11L-SDTTT-XSA4-EMA-NathanMaine

Conversation

@dentity007

Summary

Mean val_bpb = 1.1287 (3-seed verified, sliding window stride=64)

Third progressive submission. Uses PR #379 architecture with Self-Distillation TTT.

| Seed | val_bpb (sliding) | Artifact |
|------|-------------------|----------|
| 1337 | 1.1280 | 15.7MB |
| 42 | 1.1287 | 15.7MB |
| 7 | 1.1294 | 15.7MB |
| Mean | 1.1287 | |

Std: 0.0007 | All under 16MB

Progression (4 days, $150 total compute)

| PR | BPB | What changed |
|----|-----|--------------|
| #273 | 1.1575 | Baseline, 10L |
| #385 | 1.1488 | WD+SWA tuning, 11L |
| This | 1.1287 | XSA4 + EMA + SDTTT |

Running on stock PyTorch SDPA (no FA3, no custom kernels). 99ms/step vs SOTA's 55ms.
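For reference, the "stock SDPA" path mentioned above can be sketched with PyTorch's built-in fused attention; the shapes and the causal flag here are illustrative assumptions, not the exact configuration used in this PR:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the stock-SDPA path: no FA3, no custom kernels,
# just the built-in fused attention kernel selection.
q = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim) -- illustrative
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

PyTorch dispatches this call to the fastest available backend for the hardware, which is where the 99ms/step vs 55ms/step gap against hand-tuned kernels comes from.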

Submission checklist

  • 3-seed verification (mean=1.1287, std=0.0007)
  • All artifacts < 16MB
  • Wallclock < 600s on 8×H100
  • Train logs included (3 seeds)
  • Reproducible train_gpt.py included

@MatoTeziTanka

Sibling Draft Review for PR #406

Date: 2026-04-12
Reviewer: Claude Agent (for Mato/@MatoTeziTanka)
NEEDS_LLM_REVIEW: Yes
Status: Ready for detailed code review

PR Summary

Title: Non-record: 11L XSA4 + EMA + SDTTT (3-seed mean val_bpb=1.1287)
State: OPEN
Mato blocking comment: NO

Marker Analysis

| Marker | Found | Notes |
|--------|-------|-------|
| target_in_key | False | Custom loss key injection pattern |
| TTT | True | Test-Time Training integration |
| SLOT | False | Slot-based attention variant |
| custom_tokenizer | True | 3 patterns detected |

Architecture Changes

train_gpt.py modifications: ~1840 lines changed
Quantization: True
EMA: True
LoRA: False
GPTQ: True

Assessment

⚠ REQUIRES_REVIEW - Custom modifications detected:

  • TTT (Test-Time Training) integration detected
  • Custom tokenizer patterns found (3 instances)

Recommendation

REVIEW — Detailed code inspection needed before merge

Review Checklist

  • Code style and formatting (PEP 8, consistency)
  • Numerical stability of modifications
  • Integration with existing pipeline
  • Reproducibility (seed handling, RNG control)
  • Performance metrics reported correctly
  • No undocumented dependencies

Next Steps

  1. Full train_gpt.py code review
  2. Validate loss computation and gradients
  3. Check for integration issues with existing pipeline
  4. Verify metrics reproducibility

Generated for: Mato (@MatoTeziTanka)
Scope: NEEDS_LLM_REVIEW sweep
Action: Draft only—no posts to GitHub

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE, pending standard record-track checks.


Reviewed by @MatoTeziTanka via The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.


@MatoTeziTanka

Community Review — SDTTT + BigramHash + Depth Recurrence + Int6 QAT

Compliance flag: Pre-Quant TTT violation (SDTTT active)

The submitted score uses SDTTT (Self-Distillation TTT), confirmed active in the train log. SDTTT runs 2 epochs of SGD over all val_tokens before the scored sliding-window eval. Sequence from log: post_swa val_bpb:1.1448 → SDTTT 2-epoch adapt → post_sdttt val_bpb:1.1452 → final_int6_sliding_window val_bpb:1.1280 (submitted score). No BPB score is taken before adaptation completes — all val data is seen twice before the score is recorded. This violates score-first discipline.

Note on BigramHash: The BigramHash implementation xor(36313 * t[i], 27191 * t[i-1]) uses context tokens only (current + previous input), which is the legal pattern. No n-gram family bug.
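The causal pattern described above can be sketched as follows. The multipliers and the xor come from the reviewed implementation; the wrapper function, table size, and position-0 padding are illustrative assumptions:

```python
# Sketch of the legal BigramHash pattern: only the current and previous
# *input* tokens feed the hash, so no target/label information can leak
# into the key.
def bigram_hash(tokens, table_size=2**20):
    """Map each position i to a hash bucket from (t[i-1], t[i]) only."""
    buckets = []
    prev = 0  # assumed padding value for the position-0 predecessor
    for t in tokens:
        h = (36313 * t) ^ (27191 * prev)
        buckets.append(h % table_size)
        prev = t
    return buckets
```

Because bucket i depends only on tokens at positions i and i-1, changing any later token leaves earlier buckets untouched, which is what makes the pattern causal.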

Verdict: CLOSE — Pre-Quant TTT violation (SDTTT active, 2-epoch SGD on val_tokens before scoring).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: Recommend CLOSE unless the author resubmits with SDTTT disabled (the pre-SDTTT score of 1.1448 would be clean).


Reviewed by @MatoTeziTanka via The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, with manual correction on BigramHash classification. If this review misread your code, please call it out so I can re-audit manually.

@dentity007
Author

@MatoTeziTanka Thank you for the review. You are right, and the trace sequence you pulled from the train log is exactly what happened: SDTTT ran 2 epochs of SGD on val_tokens before the scored sliding window eval. Not disputing the flag.

Also appreciate you clearing the BigramHash implementation separately, that was a thoughtful distinction to make manually on top of the LLM audit.

I am working on the fix now. Two options I am weighing:

  1. Push a commit to this branch that disables SDTTT by default and re-reports 1.1448 (the clean pre-SDTTT number from the same run log you quoted)
  2. Close this PR and submit a new one with SDTTT rewritten to adapt on a held-out training slice instead of val_tokens, similar to the fix I pushed for PR Non-record: Universal Transformer + Adaptive Density (val_bpb 1.4390) #1193 / new submission Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant) #1554
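Option 2's control flow can be sketched as below. All names (`run_eval_with_legal_ttt`, `adapt_fn`, `score_fn`, the slice length) are hypothetical; the point is only the ordering: adaptation touches a held-out training slice, and val_tokens are seen exactly once, inside the scored eval.

```python
# Hypothetical control flow for legal pre-quant TTT: the model adapts
# only on a held-out slice of *training* tokens, so the scored eval is
# the first and only pass over val_tokens (score-first discipline).
def run_eval_with_legal_ttt(model, train_tokens, val_tokens,
                            adapt_fn, score_fn, slice_len=65536):
    held_out = train_tokens[-slice_len:]  # never part of the scored set
    adapt_fn(model, held_out)             # all TTT updates happen here
    return score_fn(model, val_tokens)    # val data seen once, at scoring
```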

I will update this thread once the fix lands. Also doing a proactive audit of my other six research PRs (#1191, #1192, #1194, #1195, #1196, #1197) for the same pattern so I can self-flag anything else I find before it reaches your queue.

Thanks again for the careful review via The Agora. The feedback loop is working.

@MatoTeziTanka flagged this submission on 2026-04-12 for running SDTTT
(Self-Distillation TTT), 2 epochs of SGD over val_tokens, before the
scored sliding-window eval. This violates score-first discipline per
Issue openai#402 and Issue openai#677.

Fix (in-place, no code change):
- Report the DIAGNOSTIC post_swa numbers from the same 3-seed runs as
  the submission val_bpb. These are clean: they come from eval_val on
  EMA-averaged weights before any SDTTT adaptation.
- New 3-seed mean: 1.1455 (up from illegal 1.1287)
- Per-seed post_swa values: 1337=1.1448, 42=1.1457, 7=1.1461
- Train logs for all 3 seeds still contain the clean DIAGNOSTIC lines

train_gpt.py already defaults sdttt_enabled to "0" (line 140), so
running the committed script as-is produces legal results. The illegal
numbers came from runs where SDTTT_ENABLED=1 was set as an env var
override.

Thanks to @MatoTeziTanka for the careful review.
@dentity007
Author

Fix pushed in commit 14bda5f.

Went with the in-place fix (Option 1 from my previous comment): updated README and submission.json to report the pre-SDTTT DIAGNOSTIC post_swa numbers from the same 3-seed runs. These are the clean predecessors of the tainted sliding_window numbers, computed on EMA-averaged weights before any SDTTT adaptation touched val_tokens.

Updated results

| Seed | val_bpb (post_swa, legal) | val_bpb (sliding + SDTTT, illegal, previously submitted) |
|------|---------------------------|----------------------------------------------------------|
| 7 | 1.1461 | 1.1294 |
| 42 | 1.1457 | 1.1287 |
| 1337 | 1.1448 | 1.1280 |
| Mean | 1.1455 | 1.1287 |

The train_gpt.py in the records folder already had sdttt_enabled defaulting to "0" (line 140), so running the committed script as-is produces legal results. The illegal numbers came from runs where SDTTT_ENABLED=1 was set as an env var override. No code change was needed for legality; only the reported numbers needed fixing.
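The gating described here can be sketched as follows; the helper name and its `env` parameter are illustrative, but the default-off behavior matches what the comment describes:

```python
import os

def sdttt_is_enabled(env=os.environ):
    # Defaults to "0": running the committed script as-is keeps SDTTT off.
    # Only an explicit SDTTT_ENABLED=1 environment override turns it on,
    # which is how the tainted runs were produced.
    return env.get("SDTTT_ENABLED", "0") == "1"
```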

Audit of other research PRs

I also grepped my other six non-record research PRs (#1191, #1192, #1194, #1195, #1196, #1197) for the same pattern. They all use val_tokens only inside an eval_val() function under torch.inference_mode() with no .backward() calls. No TTT-on-val, no SDTTT, no hidden eval-time training. Those should be clean by the same standard.

Thanks again for the careful review.

dentity007 added a commit to dentity007/parameter-golf that referenced this pull request Apr 13, 2026
…ka review pattern

Proactive self-flag before the Agora compliance review reaches this PR.
Same illegal pattern as PR openai#1193 and PR openai#406: ttt_adapt() runs on val_tokens
for 1 epoch with no score-first discipline before the final eval.

Changes:
- train_gpt.py: TTT_ENABLED default changed from "1" to "0". Added comment
  explaining the fix and cross-referencing the flagged sibling PRs.
- submission.json: val_bpb set to null, val_bpb_retracted preserved for
  record. Status set to "retracted".
- README.md: Update notice at top explaining the retraction, original
  summary struck through.

Unlike PR openai#406 which had clean DIAGNOSTIC post_swa numbers in the train
logs, this submission has no pre-TTT diagnostic numbers preserved, so no
clean substitute BPB is available.
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Proactive compliance documentation while awaiting maintainer ruling on
hash-based eval-time n-gram caches per Issue openai#402, Issue openai#677, and PR openai#886.

No code changes. Just README documenting:
- The open dispute (valerio-oai leaning legal, abaybektursun openai#886 disputing
  via hash collision density, Robert-Sneiderman openai#900 defending Dirichlet
  formula validity)
- What this submission does (backward-looking causal n-gram cache with
  Dirichlet-Multinomial smoothing)
- What it does NOT do (no training on val_tokens, no backward passes,
  model frozen during eval)
- Explicit statement that I asked on Issue openai#402 on April 2 and will
  retract if ruled invalid

Distinct from the TTT-on-val class of violations I retracted in PR openai#1193,
PR openai#406, and PR openai#1127.
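The disputed cache class described in this commit message can be sketched as follows. The class name and bigram-order simplification are illustrative (the actual submissions use higher-order backoff), but the mechanism matches the description: counts accumulate from past tokens only, the model stays frozen, and queries use Dirichlet-Multinomial smoothing P(t | ctx) = (count(ctx, t) + α) / (count(ctx) + αV):

```python
from collections import defaultdict

# Sketch of a backward-looking causal n-gram cache (bigram case).
# No training, no backward passes: scoring reads counts, then counting
# advances one position.
class CausalBigramCache:
    def __init__(self, vocab_size, alpha=0.1):
        self.V, self.alpha = vocab_size, alpha
        self.pair = defaultdict(int)  # (prev, t) -> count
        self.ctx = defaultdict(int)   # prev -> count

    def prob(self, prev, t):
        # Dirichlet-Multinomial smoothed estimate from *past* counts only.
        return (self.pair[(prev, t)] + self.alpha) / \
               (self.ctx[prev] + self.alpha * self.V)

    def update(self, prev, t):
        # Called after scoring position t: pure counting, model untouched.
        self.pair[(prev, t)] += 1
        self.ctx[prev] += 1
```

With no counts the estimate falls back to the uniform prior 1/V, and for any context the smoothed probabilities sum to 1, which is the validity property defended in the openai#900 thread.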
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Same approach as PR openai#948 compliance note. This submission extends openai#948
with order-20 backoff but uses the same eval-time hash n-gram cache
architecture under the same community dispute (Issue openai#402, Issue openai#677,
PR openai#886, PR openai#900).

No code changes. README documents:
- The open dispute and relevant threads
- What this submission does (causal backward-looking cache, Dirichlet
  smoothing, model frozen)
- What it does NOT do (no training on val_tokens, no backward passes)
- Distinct from the TTT-on-val class I retracted in openai#1193, openai#406, openai#1127
- Will retract if maintainers rule the class invalid