Skip to content

Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587)#1191

Open
dentity007 wants to merge 3 commits intoopenai:mainfrom
NathanMaine:research/hnet-chunking
Open

Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587)#1191
dentity007 wants to merge 3 commits intoopenai:mainfrom
NathanMaine:research/hnet-chunking

Conversation

@dentity007
Copy link
Copy Markdown

@dentity007 dentity007 commented Mar 31, 2026

Non-record: H-Net Dynamic Chunking - Learned Tokenization Layer

val_bpb: 1.3587 | 1x RTX 5090 Ada 16GB, 600s wallclock | sp1024

Implements OpenAI's requested "H-net tokenization" research direction.

Architecture

  • Adds a learned dynamic chunking layer after token embeddings
  • The chunker is a lightweight MLP that predicts boundary probabilities between adjacent token pairs
  • Where boundaries are low, neighboring embeddings are blended (soft chunking)
  • This is a differentiable approximation of H-Net's hard chunking
  • Only ~263K additional parameters beyond the base model
  • Base config: 9 layers, d=512, 8 heads, sp1024 vocab, MLP 2x
  • TTT enabled during evaluation

Results

Metric Value
val_bpb (post-TTT) 1.3587
Baseline (same config, no chunking) 1.3577
Delta +0.0010 (within noise)
Extra params ~263K
Training time 600s (1x RTX 5090)

Key Findings

  1. Learned tokenization nearly matches the baseline. At 1.3587 vs 1.3577, the chunking layer neither helps nor hurts. This means the soft boundary prediction learns to approximately reproduce the identity function (no chunking), which is itself a useful finding.

  2. The ~263K parameter overhead is minimal. The chunker MLP is tiny relative to the full model. In a larger model, this overhead would be negligible.

  3. Soft chunking is a viable differentiable approximation. The model trains stably with gradient flow through the blending operation. Hard chunking (as in original H-Net) would require straight-through estimators or other non-differentiable workarounds.

  4. The approach is complementary to vocabulary size. Learned chunking operates at the embedding level, after BPE tokenization. It could combine with sp4096 or larger vocabularies to provide a second level of adaptive tokenization.

Comparison to Naive Baseline

Naive Baseline H-Net Chunking
Tokenization Fixed BPE BPE + learned chunking
Extra params 0 263K
val_bpb 1.2244 1.3587
vs same-config baseline - +0.0010 (neutral)

Reproduction

pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
MAX_WALLCLOCK_SECONDS=600 python3 train_gpt_hnet.py

Discussion

The neutral result suggests that fixed BPE tokenization is already a strong local optimum for this task. Learned chunking would likely show more benefit for: (a) character-level or byte-level models where fixed tokenization is suboptimal, (b) multilingual settings where BPE vocabulary is stretched, or (c) domain-specific text where standard BPE segmentation is poor.

The direction is viable and the implementation is clean. Would welcome ideas on initialization strategies that might push the chunker away from the identity solution.

Credits

Script: train_gpt_hnet.py
Implements OpenAI's requested "H-net tokenization" direction from the README.

@dentity007 dentity007 closed this Apr 1, 2026
@dentity007 dentity007 reopened this Apr 1, 2026
@dentity007
Copy link
Copy Markdown
Author

Research Expansion: Ablation Results

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile.

Results

Run Config val_bpb ms/step
HNET-1 Default chunker 2.0558 513
HNET-2 Large chunker (d=256) 2.0559 513
HNET-3 Boundary regularizer (0.1) 2.0558 514

Finding

H-Net is the fastest architecture tested at 513ms per step while still reaching 2.06 BPB (second best BPB overall after SSM). But the chunker configuration makes zero difference. All three variants produce identical BPB to 4 decimal places.

The chunker is learning the identity function regardless of what you do to it. Even with a boundary regularizer forcing chunk transitions, BPB does not change. This suggests the model routes around the chunking layer rather than using it.

Possible next steps: initialize the chunker with non-trivial boundaries (hard init from BPE token edges), or use hard chunking with straight-through estimators. Both untested here.

Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

@MatoTeziTanka
Copy link
Copy Markdown

Community Review — Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary PR #1191 ("HNet_DynamicChunking_LearnedTokenization") adds a learned dynamic chunking layer (H-Net style) to the standard GPT baseline. The submission is architecturally clean with no illegal techniques detected. ## Checklist ### ILLEGAL N-gram family bug — NOT PRESENT No n-gram, bigram, hash, or XOR operations anywhere in the file. No target IDs XOR'd into any hash key. Search confirmed zero hits for ngram, bigram, hash, xor. ### ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first) — NOT PRESENT val_tokens is loaded at line 891 and consumed exclusively by eval_val() (called at lines 1075 and 1197). eval_val() runs under torch.inference_mode() (line 261) with model.eval() (line 260) and model.train() restored at line 288. No .backward() call touches val_tokens. No multi-epoch loop over validation data exists. ### LEGAL Score-first TTT (PR #1413 pattern, is_last_chunk guard) — NOT PRESENT No TTT machinery of any kind. No is_last_chunk, no score-first guard, no adaptation loop. ### SCORED-REGION SLOT — NOT PRESENT No evidence of a held-out scored-region slot exploitation pattern. ### Architecture The novel component is DynamicChunker (lines 520–545) and DynamicChunkerStack (lines 548–557), inserted after tok_emb + rms_norm at lines 778–779 in GPT.forward(). It uses a learned boundary_proj (Linear dim*2 → 1) and chunk_mixer (Linear dim → dim) to soft-blend adjacent token embeddings. This is a purely architectural change — learned parameters trained on train data only — with no data leakage. ### Training loop Standard: train_loader feeds train shards (line 1011, 1109), val evaluation is read-only (lines 1072–1092), quantization roundtrip eval at lines 1197–1213. No...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTankaThe Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

@dentity007
Copy link
Copy Markdown
Author

Thanks @MatoTeziTanka for the thorough audit. Especially appreciate you flagging the specific lines (891, 1075, 1197, 261, 288) for val_tokens flow and confirming the inference_mode boundary. That is exactly the kind of line-by-line verification that makes these reviews trustworthy.

One note on the architecture finding in case it's useful context for the merge decision: my own Spark ablation showed that the H-Net chunker learns the identity function regardless of configuration. All three variants (default, large chunker, boundary regularizer) produced identical BPB to 4 decimal places, which is documented in the PR body. The architecture is legal and the forward pass works, but the research takeaway is that the chunker as currently wired needs either hard chunking or stronger initial boundaries to produce non-trivial behavior. Worth noting for anyone who picks this up as a starting point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants