Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587) by dentity007 · Pull Request #1191 · openai/parameter-golf

dentity007 · 2026-03-31T20:39:26Z

Non-record: H-Net Dynamic Chunking - Learned Tokenization Layer

val_bpb: 1.3587 | 1x RTX 5090 Ada 16GB, 600s wallclock | sp1024

Implements OpenAI's requested "H-net tokenization" research direction.

Architecture

Adds a learned dynamic chunking layer after token embeddings
The chunker is a lightweight MLP that predicts boundary probabilities between adjacent token pairs
Where boundaries are low, neighboring embeddings are blended (soft chunking)
This is a differentiable approximation of H-Net's hard chunking
Only ~263K additional parameters beyond the base model
Base config: 9 layers, d=512, 8 heads, sp1024 vocab, MLP 2x
TTT enabled during evaluation

Results

Metric	Value
val_bpb (post-TTT)	1.3587
Baseline (same config, no chunking)	1.3577
Delta	+0.0010 (within noise)
Extra params	~263K
Training time	600s (1x RTX 5090)

Key Findings

Learned tokenization nearly matches the baseline. At 1.3587 vs 1.3577, the chunking layer neither helps nor hurts. This means the soft boundary prediction learns to approximately reproduce the identity function (no chunking), which is itself a useful finding.
The ~263K parameter overhead is minimal. The chunker MLP is tiny relative to the full model. In a larger model, this overhead would be negligible.
Soft chunking is a viable differentiable approximation. The model trains stably with gradient flow through the blending operation. Hard chunking (as in original H-Net) would require straight-through estimators or other non-differentiable workarounds.
The approach is complementary to vocabulary size. Learned chunking operates at the embedding level, after BPE tokenization. It could combine with sp4096 or larger vocabularies to provide a second level of adaptive tokenization.

Comparison to Naive Baseline

	Naive Baseline	H-Net Chunking
Tokenization	Fixed BPE	BPE + learned chunking
Extra params	0	263K
val_bpb	1.2244	1.3587
vs same-config baseline	-	+0.0010 (neutral)

Reproduction

pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
MAX_WALLCLOCK_SECONDS=600 python3 train_gpt_hnet.py

Discussion

The neutral result suggests that fixed BPE tokenization is already a strong local optimum for this task. Learned chunking would likely show more benefit for: (a) character-level or byte-level models where fixed tokenization is suboptimal, (b) multilingual settings where BPE vocabulary is stretched, or (c) domain-specific text where standard BPE segmentation is poor.

The direction is viable and the implementation is clean. Would welcome ideas on initialization strategies that might push the chunker away from the identity solution.

Credits

Script: train_gpt_hnet.py
Implements OpenAI's requested "H-net tokenization" direction from the README.

…er optimization, and SSM exploration

…bpb 1.3587) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

dentity007 · 2026-04-11T20:38:34Z

Research Expansion: Ablation Results

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile.

Results

Run	Config	val_bpb	ms/step
HNET-1	Default chunker	2.0558	513
HNET-2	Large chunker (d=256)	2.0559	513
HNET-3	Boundary regularizer (0.1)	2.0558	514

Finding

H-Net is the fastest architecture tested at 513ms per step while still reaching 2.06 BPB (second best BPB overall after SSM). But the chunker configuration makes zero difference. All three variants produce identical BPB to 4 decimal places.

The chunker is learning the identity function regardless of what you do to it. Even with a boundary regularizer forcing chunk transitions, BPB does not change. This suggests the model routes around the chunking layer rather than using it.

Possible next steps: initialize the chunker with non-trivial boundaries (hard init from BPE token edges), or use hard chunking with straight-through estimators. Both untested here.

Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

MatoTeziTanka · 2026-04-12T06:03:13Z

Community Review — Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary PR #1191 ("HNet_DynamicChunking_LearnedTokenization") adds a learned dynamic chunking layer (H-Net style) to the standard GPT baseline. The submission is architecturally clean with no illegal techniques detected. ## Checklist ### ILLEGAL N-gram family bug — NOT PRESENT No n-gram, bigram, hash, or XOR operations anywhere in the file. No target IDs XOR'd into any hash key. Search confirmed zero hits for `ngram`, `bigram`, `hash`, `xor`. ### ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first) — NOT PRESENT `val_tokens` is loaded at line 891 and consumed exclusively by `eval_val()` (called at lines 1075 and 1197). `eval_val()` runs under `torch.inference_mode()` (line 261) with `model.eval()` (line 260) and `model.train()` restored at line 288. No `.backward()` call touches `val_tokens`. No multi-epoch loop over validation data exists. ### LEGAL Score-first TTT (PR #1413 pattern, is_last_chunk guard) — NOT PRESENT No TTT machinery of any kind. No `is_last_chunk`, no score-first guard, no adaptation loop. ### SCORED-REGION SLOT — NOT PRESENT No evidence of a held-out scored-region slot exploitation pattern. ### Architecture The novel component is `DynamicChunker` (lines 520–545) and `DynamicChunkerStack` (lines 548–557), inserted after `tok_emb` + `rms_norm` at lines 778–779 in `GPT.forward()`. It uses a learned `boundary_proj` (Linear dim*2 → 1) and `chunk_mixer` (Linear dim → dim) to soft-blend adjacent token embeddings. This is a purely architectural change — learned parameters trained on train data only — with no data leakage. ### Training loop Standard: train_loader feeds train shards (line 1011, 1109), val evaluation is read-only (lines 1072–1092), quantization roundtrip eval at lines 1197–1213. No...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

dentity007 · 2026-04-15T03:23:13Z

Thanks @MatoTeziTanka for the thorough audit. Especially appreciate you flagging the specific lines (891, 1075, 1197, 261, 288) for val_tokens flow and confirming the inference_mode boundary. That is exactly the kind of line-by-line verification that makes these reviews trustworthy.

One note on the architecture finding in case it's useful context for the merge decision: my own Spark ablation showed that the H-Net chunker learns the identity function regardless of configuration. All three variants (default, large chunker, boundary regularizer) produced identical BPB to 4 decimal places, which is documented in the PR body. The architecture is legal and the forward pass works, but the research takeaway is that the chunker as currently wired needs either hard chunking or stronger initial boundaries to produce non-trivial behavior. Worth noting for anyone who picks this up as a starting point.

dentity007 and others added 3 commits March 30, 2026 19:12

Add approach notes for parameter golf challenge

ad23b7f

Update approach with depth recurrence, factorized embeddings, tokeniz…

300eb5c

…er optimization, and SSM exploration

Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_…

c4cf717

…bpb 1.3587) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

dentity007 closed this Apr 1, 2026

dentity007 reopened this Apr 1, 2026

This was referenced Apr 13, 2026

Non-record: 11L XSA4 + EMA + SDTTT (3-seed mean val_bpb=1.1287) #406

Open

Record: 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480 — val_bpb 1.13112 (3-seed) #1127

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587)#1191

Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587)#1191
dentity007 wants to merge 3 commits intoopenai:mainfrom
NathanMaine:research/hnet-chunking

dentity007 commented Mar 31, 2026 •

edited

Loading

Uh oh!

dentity007 commented Apr 11, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

dentity007 commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dentity007 commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!