Non-record: 11L XSA4 + EMA + SDTTT (3-seed mean val_bpb=1.1287) #406
dentity007 wants to merge 2 commits into openai:main
Conversation
Sibling Draft Review for PR #406
Date: 2026-04-12

PR Summary
Title: Non-record: 11L XSA4 + EMA + SDTTT (3-seed mean val_bpb=1.1287)

Marker Analysis

Architecture Changes
train_gpt.py modifications: ~1840 lines changed

Assessment
⚠ REQUIRES_REVIEW - Custom modifications detected:

Recommendation
REVIEW — Detailed code inspection needed before merge

Review Checklist

Next Steps

Generated for: Mato (@MatoTeziTanka)
Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE-pending standard record-track checks.
Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
Community Review — SDTTT + BigramHash + Depth Recurrence + Int6 QAT

Compliance flag: Pre-Quant TTT violation (SDTTT active)

The submitted score uses SDTTT (Self-Distillation TTT), confirmed active in the train log. SDTTT runs 2 epochs of SGD over all val_tokens before the scored sliding-window eval.

Note on BigramHash: the BigramHash implementation was audited separately and is not part of this flag (see the manual correction noted below).

Verdict: CLOSE — Pre-Quant TTT violation (SDTTT active, 2-epoch SGD on val_tokens before scoring).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: Recommend CLOSE unless the author resubmits with SDTTT disabled (the pre-SDTTT score of 1.1448 would be clean).

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, with a manual correction on the BigramHash classification. If this review misread your code, please call it out so I can re-audit manually.
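To make the flagged failure mode concrete, here is a toy, stdlib-only sketch (my own illustration, not the PR's code): fitting even a trivial unigram model on val_tokens before scoring them biases the score downward, which is exactly the leak that score-first discipline guards against.

```python
import math
from collections import Counter

# Toy unigram "model": bits per token under add-alpha smoothing.
def bits_per_token(counts, total, tokens, vocab=256, alpha=1.0):
    return sum(-math.log2((counts[t] + alpha) / (total + alpha * vocab))
               for t in tokens) / len(tokens)

train_tokens = [1, 2, 3, 4] * 50
val_tokens = [5, 6, 7, 8] * 50

counts = Counter(train_tokens)
clean = bits_per_token(counts, len(train_tokens), val_tokens)

# "Adapting" on val_tokens before scoring, analogous to SDTTT's pre-eval SGD:
counts.update(val_tokens)
total = len(train_tokens) + len(val_tokens)
tainted = bits_per_token(counts, total, val_tokens)

# The adapted score is optimistically biased relative to the clean one.
assert tainted < clean
```

The magnitude is exaggerated here, but the direction of the bias is the point: any update computed from val_tokens before the scored eval makes the reported number incomparable to submissions that never touched the validation stream.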
@MatoTeziTanka Thank you for the review. You are right, and the trace sequence you pulled from the train log is exactly what happened: SDTTT ran 2 epochs of SGD on val_tokens before the scored sliding-window eval. I am not disputing the flag. I also appreciate you clearing the BigramHash implementation separately; that was a thoughtful distinction to make manually on top of the LLM audit. I am working on the fix now. Two options I am weighing:

1. In-place fix: report the clean pre-SDTTT DIAGNOSTIC post_swa numbers from the same 3-seed runs.
2. Resubmit: rerun the 3-seed scoring with SDTTT disabled and report fresh numbers.

I will update this thread once the fix lands. I am also doing a proactive audit of my other six research PRs (#1191, #1192, #1194, #1195, #1196, #1197) for the same pattern so I can self-flag anything else I find before it reaches your queue. Thanks again for the careful review via The Agora. The feedback loop is working.
@MatoTeziTanka flagged this submission on 2026-04-12 for running SDTTT (Self-Distillation TTT): 2 epochs of SGD over val_tokens before the scored sliding-window eval. This violates score-first discipline per Issue openai#402 and Issue openai#677.

Fix (in-place, no code change):
- Report the DIAGNOSTIC post_swa numbers from the same 3-seed runs as the submission val_bpb. These are clean: they come from eval_val on EMA-averaged weights before any SDTTT adaptation.
- New 3-seed mean: 1.1455 (up from the illegal 1.1287)
- Per-seed post_swa values: 1337=1.1448, 42=1.1457, 7=1.1461
- Train logs for all 3 seeds still contain the clean DIAGNOSTIC lines

train_gpt.py already defaults sdttt_enabled to "0" (line 140), so running the committed script as-is produces legal results. The illegal numbers came from runs where SDTTT_ENABLED=1 was set as an env var override.

Thanks to @MatoTeziTanka for the careful review.
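A minimal sketch of the gating described above (the flag and env var names follow the commit message; the surrounding structure is an assumption about train_gpt.py, not its actual code):

```python
import os

# Default is "0", matching the committed script: only an explicit
# SDTTT_ENABLED=1 environment override turns the adaptation on.
sdttt_enabled = os.environ.get("SDTTT_ENABLED", "0") == "1"

if sdttt_enabled:
    # Tainted path: SDTTT would adapt on val_tokens before the scored eval.
    pass
else:
    # Clean path: eval_val runs on EMA-averaged weights with no
    # val-time adaptation, producing the DIAGNOSTIC post_swa numbers.
    pass
```

This is why the in-place fix needs no code change: running the script without the override already takes the clean path.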
Fix pushed in commit 14bda5f. Went with the in-place fix (Option 1 from my previous comment): updated README and submission.json to report the pre-SDTTT DIAGNOSTIC post_swa numbers from the same 3-seed runs. These are the clean predecessors of the tainted sliding_window numbers, computed on EMA-averaged weights before any SDTTT adaptation touched val_tokens.

Updated results
The train_gpt.py in the records folder actually already had sdttt_enabled defaulting to "0", so the committed script is clean as-is; the tainted numbers came only from runs with the SDTTT_ENABLED=1 env var override.

Audit of other research PRs
I also grepped my other six non-record research PRs (#1191, #1192, #1194, #1195, #1196, #1197) for the same pattern. They all use ttt_adapt() on val_tokens before the final eval, so I am self-flagging each of them individually.

Thanks again for the careful review.
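For readers unfamiliar with the EMA-averaged weights that the clean DIAGNOSTIC numbers come from, here is a generic shadow-weight sketch (a stdlib illustration of the standard technique, not the submission's code):

```python
# Exponential moving average of weights: a "shadow" copy is blended toward
# the current weights each step; evaluation then uses the shadow copy.
def ema_update(shadow, current, decay=0.999):
    # shadow <- decay * shadow + (1 - decay) * current, per parameter
    return {k: decay * shadow[k] + (1.0 - decay) * current[k] for k in shadow}

# Toy run with decay=0.5 so the blending is easy to follow by hand.
shadow = {"w": 1.0}
for step_weights in ({"w": 2.0}, {"w": 3.0}):
    shadow = ema_update(shadow, step_weights, decay=0.5)
# After two updates: 0.5 * (0.5 * 1.0 + 0.5 * 2.0) + 0.5 * 3.0 = 2.25
```

Because the shadow weights are computed purely from training steps, evaluating them involves no val_tokens at all, which is what makes the post_swa numbers a legal substitute.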
…ka review pattern

Proactive self-flag before the Agora compliance review reaches this PR. Same illegal pattern as PR openai#1193 and PR openai#406: ttt_adapt() runs on val_tokens for 1 epoch with no score-first discipline before the final eval.

Changes:
- train_gpt.py: TTT_ENABLED default changed from "1" to "0". Added comment explaining the fix and cross-referencing the flagged sibling PRs.
- submission.json: val_bpb set to null, val_bpb_retracted preserved for the record. Status set to "retracted".
- README.md: update notice at top explaining the retraction; original summary struck through.

Unlike PR openai#406, which had clean DIAGNOSTIC post_swa numbers in the train logs, this submission has no pre-TTT diagnostic numbers preserved, so no clean substitute BPB is available.
Proactive compliance documentation while awaiting a maintainer ruling on hash-based eval-time n-gram caches per Issue openai#402, Issue openai#677, and PR openai#886.

No code changes. Just a README documenting:
- The open dispute (valerio-oai leaning legal; abaybektursun openai#886 disputing via hash collision density; Robert-Sneiderman openai#900 defending the Dirichlet formula's validity)
- What this submission does (backward-looking causal n-gram cache with Dirichlet-Multinomial smoothing)
- What it does NOT do (no training on val_tokens, no backward passes; model frozen during eval)
- An explicit statement that I asked on Issue openai#402 on April 2 and will retract if ruled invalid

Distinct from the TTT-on-val class of violations I retracted in PR openai#1193, PR openai#406, and PR openai#1127.
Same approach as the PR openai#948 compliance note. This submission extends openai#948 with order-20 backoff but uses the same eval-time hash n-gram cache architecture under the same community dispute (Issue openai#402, Issue openai#677, PR openai#886, PR openai#900).

No code changes. The README documents:
- The open dispute and relevant threads
- What this submission does (causal backward-looking cache, Dirichlet smoothing, model frozen)
- What it does NOT do (no training on val_tokens, no backward passes)
- That it is distinct from the TTT-on-val class I retracted in openai#1193, openai#406, openai#1127
- That I will retract if maintainers rule the class invalid
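As a rough illustration of the disputed class (class name, order, and alpha below are my assumptions, not the submission's actual code), a backward-looking causal n-gram cache with Dirichlet-Multinomial smoothing scores each token from counts of the stream seen so far, with no gradients and no lookahead:

```python
from collections import defaultdict

class CausalNgramCache:
    def __init__(self, order=2, alpha=0.5, vocab_size=256):
        self.order, self.alpha, self.vocab = order, alpha, vocab_size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.context = []

    def prob(self, token):
        # Dirichlet-Multinomial posterior predictive for the current context.
        ctx = tuple(self.context[-self.order:])
        c = self.counts[ctx]
        total = sum(c.values())
        return (c[token] + self.alpha) / (total + self.alpha * self.vocab)

    def update(self, token):
        # Record the observed continuation, THEN advance the context, so the
        # cache never uses information ahead of the position being scored.
        ctx = tuple(self.context[-self.order:])
        self.counts[ctx][token] += 1
        self.context.append(token)

cache = CausalNgramCache(order=2)
for t in [1, 2, 3, 1, 2, 3, 1, 2]:
    p = cache.prob(t)   # score first, using only past tokens
    cache.update(t)     # then ingest the token into the cache
```

The score-then-ingest ordering is the crux of the dispute: the model stays frozen and nothing is trained on val_tokens, yet the cache still accumulates statistics of the eval stream as it goes.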
Summary
Mean val_bpb = 1.1287 (3-seed verified, sliding window stride=64)
Third progressive submission. Uses PR #379 architecture with Self-Distillation TTT.
Std: 0.0007 | All under 16MB
Progression (4 days, $150 total compute)
Running on stock PyTorch SDPA (no FA3, no custom kernels). 99ms/step vs SOTA's 55ms.
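For context on the scoring protocol, a minimal sliding-window eval sketch (window size and the toy scorer are assumptions for illustration; only stride=64 is from the submission):

```python
# Slide a fixed context window over the token stream in steps of `stride`,
# scoring only the last `stride` tokens of each window so no token is
# counted twice. nll_bits_fn stands in for the model's summed per-token loss.
def sliding_window_bpb(nll_bits_fn, tokens, window=1024, stride=64):
    total_bits, scored = 0.0, 0
    for start in range(0, len(tokens) - window + 1, stride):
        chunk = tokens[start:start + window]
        total_bits += nll_bits_fn(chunk, n_scored=stride)
        scored += stride
    return total_bits / scored

def toy(chunk, n_scored):
    # uniform scorer over 256 symbols: 8 bits for each scored token
    return 8.0 * n_scored

bpb = sliding_window_bpb(toy, list(range(4096)))  # 8.0 for the toy scorer
```

A smaller stride gives each scored token more context at the cost of more forward passes, which is why the stride is reported alongside the val_bpb figure.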
Submission checklist