
Non-record: 11L XSA4 + EMA + SDTTT (3-seed mean val_bpb=1.1287)#406

Open
dentity007 wants to merge 2 commits into openai:main from NathanMaine:submission/11L-SDTTT-XSA4-EMA-NathanMaine

Conversation

@dentity007

Summary

Mean val_bpb = 1.1287 (3-seed verified, sliding window stride=64)

Third progressive submission. Uses PR #379 architecture with Self-Distillation TTT.

| Seed | val_bpb (sliding) | Artifact |
|------|-------------------|----------|
| 1337 | 1.1280 | 15.7MB |
| 42 | 1.1287 | 15.7MB |
| 7 | 1.1294 | 15.7MB |
| Mean | 1.1287 | |

Std: 0.0007 | All under 16MB

Progression (4 days, $150 total compute)

| PR | BPB | What changed |
|----|-----|--------------|
| #273 | 1.1575 | Baseline, 10L |
| #385 | 1.1488 | WD+SWA tuning, 11L |
| This | 1.1287 | XSA4 + EMA + SDTTT |

Running on stock PyTorch SDPA (no FA3, no custom kernels). 99ms/step vs SOTA's 55ms.
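For reference, the "stock SDPA" path mentioned above can be sketched with PyTorch's built-in fused attention; the shapes and the causal flag here are illustrative assumptions, not the exact configuration used in this PR:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the stock-SDPA path: no FA3, no custom kernels,
# just the built-in fused attention kernel selection.
q = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim) -- illustrative
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

PyTorch dispatches this call to the fastest available backend for the hardware, which is where the 99ms/step vs 55ms/step gap against hand-tuned kernels comes from.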

Submission checklist

  • 3-seed verification (mean=1.1287, std=0.0007)
  • All artifacts < 16MB
  • Wallclock < 600s on 8×H100
  • Train logs included (3 seeds)
  • Reproducible train_gpt.py included

@MatoTeziTanka

Sibling Draft Review for PR #406

Date: 2026-04-12
Reviewer: Claude Agent (for Mato/@MatoTeziTanka)
NEEDS_LLM_REVIEW: Yes
Status: Ready for detailed code review

PR Summary

Title: Non-record: 11L XSA4 + EMA + SDTTT (3-seed mean val_bpb=1.1287)
State: OPEN
Mato blocking comment: NO

Marker Analysis

| Marker | Found | Notes |
|--------|-------|-------|
| target_in_key | False | Custom loss key injection pattern |
| TTT | True | Test-Time Training integration |
| SLOT | False | Slot-based attention variant |
| custom_tokenizer | True | 3 patterns detected |

Architecture Changes

train_gpt.py modifications: ~1840 lines changed
Quantization: True
EMA: True
LoRA: False
GPTQ: True

Assessment

⚠ REQUIRES_REVIEW - Custom modifications detected:

  • TTT (Test-Time Training) integration detected
  • Custom tokenizer patterns found (3 instances)

Recommendation

REVIEW — Detailed code inspection needed before merge

Review Checklist

  • Code style and formatting (PEP 8, consistency)
  • Numerical stability of modifications
  • Integration with existing pipeline
  • Reproducibility (seed handling, RNG control)
  • Performance metrics reported correctly
  • No undocumented dependencies

Next Steps

  1. Full train_gpt.py code review
  2. Validate loss computation and gradients
  3. Check for integration issues with existing pipeline
  4. Verify metrics reproducibility

Generated for: Mato (@MatoTeziTanka)
Scope: NEEDS_LLM_REVIEW sweep
Action: Draft only—no posts to GitHub

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE, pending standard record-track checks.


Reviewed by @MatoTeziTanka via The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.


@MatoTeziTanka

Community Review — SDTTT + BigramHash + Depth Recurrence + Int6 QAT

Compliance flag: Pre-Quant TTT violation (SDTTT active)

The submitted score uses SDTTT (Self-Distillation TTT), confirmed active in the train log. SDTTT runs 2 epochs of SGD over all val_tokens before the scored sliding-window eval. Sequence from log: post_swa val_bpb:1.1448 → SDTTT 2-epoch adapt → post_sdttt val_bpb:1.1452 → final_int6_sliding_window val_bpb:1.1280 (submitted score). No BPB score is taken before adaptation completes — all val data is seen twice before the score is recorded. This violates score-first discipline.

Note on BigramHash: The BigramHash implementation xor(36313 * t[i], 27191 * t[i-1]) uses context tokens only (current + previous input), which is the legal pattern. No n-gram family bug.
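The causal pattern described above can be sketched as follows. The multipliers and the xor come from the reviewed implementation; the wrapper function, table size, and position-0 padding are illustrative assumptions:

```python
# Sketch of the legal BigramHash pattern: only the current and previous
# *input* tokens feed the hash, so no target/label information can leak
# into the key.
def bigram_hash(tokens, table_size=2**20):
    """Map each position i to a hash bucket from (t[i-1], t[i]) only."""
    buckets = []
    prev = 0  # assumed padding value for the position-0 predecessor
    for t in tokens:
        h = (36313 * t) ^ (27191 * prev)
        buckets.append(h % table_size)
        prev = t
    return buckets
```

Because bucket i depends only on tokens at positions i and i-1, changing any later token leaves earlier buckets untouched, which is what makes the pattern causal.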

Verdict: CLOSE — Pre-Quant TTT violation (SDTTT active, 2-epoch SGD on val_tokens before scoring).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: Recommend CLOSE unless the author resubmits with SDTTT disabled (the pre-SDTTT score of 1.1448 would be clean).


Reviewed by @MatoTeziTanka via The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, with manual correction on BigramHash classification. If this review misread your code, please call it out so I can re-audit manually.

@dentity007
Author

@MatoTeziTanka Thank you for the review. You are right, and the trace sequence you pulled from the train log is exactly what happened: SDTTT ran 2 epochs of SGD on val_tokens before the scored sliding window eval. Not disputing the flag.

Also appreciate you clearing the BigramHash implementation separately, that was a thoughtful distinction to make manually on top of the LLM audit.

I am working on the fix now. Two options I am weighing:

  1. Push a commit to this branch that disables SDTTT by default and re-reports 1.1448 (the clean pre-SDTTT number from the same run log you quoted)
  2. Close this PR and submit a new one with SDTTT rewritten to adapt on a held-out training slice instead of val_tokens, similar to the fix I pushed for PR Non-record: Universal Transformer + Adaptive Density (val_bpb 1.4390) #1193 / new submission Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant) #1554
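Option 2's control flow can be sketched as below. All names (`run_eval_with_legal_ttt`, `adapt_fn`, `score_fn`, the slice length) are hypothetical; the point is only the ordering: adaptation touches a held-out training slice, and val_tokens are seen exactly once, inside the scored eval.

```python
# Hypothetical control flow for legal pre-quant TTT: the model adapts
# only on a held-out slice of *training* tokens, so the scored eval is
# the first and only pass over val_tokens (score-first discipline).
def run_eval_with_legal_ttt(model, train_tokens, val_tokens,
                            adapt_fn, score_fn, slice_len=65536):
    held_out = train_tokens[-slice_len:]  # never part of the scored set
    adapt_fn(model, held_out)             # all TTT updates happen here
    return score_fn(model, val_tokens)    # val data seen once, at scoring
```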

I will update this thread once the fix lands. Also doing a proactive audit of my other six research PRs (#1191, #1192, #1194, #1195, #1196, #1197) for the same pattern so I can self-flag anything else I find before it reaches your queue.

Thanks again for the careful review via The Agora. The feedback loop is working.

@MatoTeziTanka flagged this submission on 2026-04-12 for running SDTTT
(Self-Distillation TTT), 2 epochs of SGD over val_tokens, before the
scored sliding-window eval. This violates score-first discipline per
Issue openai#402 and Issue openai#677.

Fix (in-place, no code change):
- Report the DIAGNOSTIC post_swa numbers from the same 3-seed runs as
  the submission val_bpb. These are clean: they come from eval_val on
  EMA-averaged weights before any SDTTT adaptation.
- New 3-seed mean: 1.1455 (up from illegal 1.1287)
- Per-seed post_swa values: 1337=1.1448, 42=1.1457, 7=1.1461
- Train logs for all 3 seeds still contain the clean DIAGNOSTIC lines

train_gpt.py already defaults sdttt_enabled to "0" (line 140), so
running the committed script as-is produces legal results. The illegal
numbers came from runs where SDTTT_ENABLED=1 was set as an env var
override.

Thanks to @MatoTeziTanka for the careful review.
@dentity007
Author

Fix pushed in commit 14bda5f.

Went with the in-place fix (Option 1 from my previous comment): updated README and submission.json to report the pre-SDTTT DIAGNOSTIC post_swa numbers from the same 3-seed runs. These are the clean predecessors of the tainted sliding_window numbers, computed on EMA-averaged weights before any SDTTT adaptation touched val_tokens.

Updated results

| Seed | val_bpb (post_swa, legal) | val_bpb (sliding + SDTTT, illegal, previously submitted) |
|------|---------------------------|----------------------------------------------------------|
| 7 | 1.1461 | 1.1294 |
| 42 | 1.1457 | 1.1287 |
| 1337 | 1.1448 | 1.1280 |
| Mean | 1.1455 | 1.1287 |

The train_gpt.py in the records folder already had sdttt_enabled defaulting to "0" (line 140), so running the committed script as-is produces legal results. The illegal numbers came from runs where SDTTT_ENABLED=1 was set as an env var override. No code change was needed for legality; only the reported numbers needed fixing.
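The gating described here can be sketched as follows; the helper name and its `env` parameter are illustrative, but the default-off behavior matches what the comment describes:

```python
import os

def sdttt_is_enabled(env=os.environ):
    # Defaults to "0": running the committed script as-is keeps SDTTT off.
    # Only an explicit SDTTT_ENABLED=1 environment override turns it on,
    # which is how the tainted runs were produced.
    return env.get("SDTTT_ENABLED", "0") == "1"
```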

Audit of other research PRs

I also grepped my other six non-record research PRs (#1191, #1192, #1194, #1195, #1196, #1197) for the same pattern. They all use val_tokens only inside an eval_val() function under torch.inference_mode() with no .backward() calls. No TTT-on-val, no SDTTT, no hidden eval-time training. Those should be clean by the same standard.

Thanks again for the careful review.

dentity007 added a commit to dentity007/parameter-golf that referenced this pull request Apr 13, 2026
…ka review pattern

Proactive self-flag before the Agora compliance review reaches this PR.
Same illegal pattern as PR openai#1193 and PR openai#406: ttt_adapt() runs on val_tokens
for 1 epoch with no score-first discipline before the final eval.

Changes:
- train_gpt.py: TTT_ENABLED default changed from "1" to "0". Added comment
  explaining the fix and cross-referencing the flagged sibling PRs.
- submission.json: val_bpb set to null, val_bpb_retracted preserved for
  record. Status set to "retracted".
- README.md: Update notice at top explaining the retraction, original
  summary struck through.

Unlike PR openai#406 which had clean DIAGNOSTIC post_swa numbers in the train
logs, this submission has no pre-TTT diagnostic numbers preserved, so no
clean substitute BPB is available.
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Proactive compliance documentation while awaiting maintainer ruling on
hash-based eval-time n-gram caches per Issue openai#402, Issue openai#677, and PR openai#886.

No code changes. Just README documenting:
- The open dispute (valerio-oai leaning legal, abaybektursun openai#886 disputing
  via hash collision density, Robert-Sneiderman openai#900 defending Dirichlet
  formula validity)
- What this submission does (backward-looking causal n-gram cache with
  Dirichlet-Multinomial smoothing)
- What it does NOT do (no training on val_tokens, no backward passes,
  model frozen during eval)
- Explicit statement that I asked on Issue openai#402 on April 2 and will
  retract if ruled invalid

Distinct from the TTT-on-val class of violations I retracted in PR openai#1193,
PR openai#406, and PR openai#1127.
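The disputed cache class described in this commit message can be sketched as follows. The class name and bigram-order simplification are illustrative (the actual submissions use higher-order backoff), but the mechanism matches the description: counts accumulate from past tokens only, the model stays frozen, and queries use Dirichlet-Multinomial smoothing P(t | ctx) = (count(ctx, t) + α) / (count(ctx) + αV):

```python
from collections import defaultdict

# Sketch of a backward-looking causal n-gram cache (bigram case).
# No training, no backward passes: scoring reads counts, then counting
# advances one position.
class CausalBigramCache:
    def __init__(self, vocab_size, alpha=0.1):
        self.V, self.alpha = vocab_size, alpha
        self.pair = defaultdict(int)  # (prev, t) -> count
        self.ctx = defaultdict(int)   # prev -> count

    def prob(self, prev, t):
        # Dirichlet-Multinomial smoothed estimate from *past* counts only.
        return (self.pair[(prev, t)] + self.alpha) / \
               (self.ctx[prev] + self.alpha * self.V)

    def update(self, prev, t):
        # Called after scoring position t: pure counting, model untouched.
        self.pair[(prev, t)] += 1
        self.ctx[prev] += 1
```

With no counts the estimate falls back to the uniform prior 1/V, and for any context the smoothed probabilities sum to 1, which is the validity property defended in the openai#900 thread.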
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Same approach as PR openai#948 compliance note. This submission extends openai#948
with order-20 backoff but uses the same eval-time hash n-gram cache
architecture under the same community dispute (Issue openai#402, Issue openai#677,
PR openai#886, PR openai#900).

No code changes. README documents:
- The open dispute and relevant threads
- What this submission does (causal backward-looking cache, Dirichlet
  smoothing, model frozen)
- What it does NOT do (no training on val_tokens, no backward passes)
- Distinct from the TTT-on-val class I retracted in openai#1193, openai#406, openai#1127
- Will retract if maintainers rule the class invalid