Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587)#1191
dentity007 wants to merge 3 commits into openai:main from
Conversation
…er optimization, and SSM exploration
…bpb 1.3587) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Research Expansion: Ablation Results

Ran an overnight ablation study on a DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile.

Results
Finding

H-Net is the fastest architecture tested at 513 ms per step while still reaching 2.06 BPB (second-best BPB overall, after SSM). But the chunker configuration makes zero difference: all three variants produce identical BPB to 4 decimal places. The chunker is learning the identity function regardless of what you do to it. Even with a boundary regularizer forcing chunk transitions, BPB does not change, which suggests the model routes around the chunking layer rather than using it. Possible next steps: initialize the chunker with non-trivial boundaries (hard init from BPE token edges), or use hard chunking with straight-through estimators. Both are untested here. Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08
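One of the untested next steps above, hard chunking with a straight-through estimator, could be sketched roughly as follows. This is a hypothetical illustration, not code from `train_gpt_hnet.py`; the function name and threshold are assumptions:

```python
import torch

def hard_boundaries_ste(logits: torch.Tensor) -> torch.Tensor:
    """Straight-through hard boundary decision (sketch).

    Forward pass emits a hard 0/1 boundary mask; backward pass routes
    gradients through the sigmoid as if the decision were soft, so the
    boundary predictor still receives a training signal.
    """
    p = torch.sigmoid(logits)           # soft boundary probability
    hard = (p > 0.5).float()            # hard 0/1 decision (non-differentiable)
    # straight-through trick: hard values forward, soft gradients backward
    return hard + (p - p.detach())
```

This would replace the soft blend in the forward pass with discrete chunk boundaries while keeping the chunker trainable, which is one way to prevent the identity solution from being a trivially reachable optimum.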
Community Review — Non-record: H-Net Dynamic Chunking — Learned Tokenization Layer (val_bpb 1.3587)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary

PR #1191 ("HNet_DynamicChunking_LearnedTokenization") adds a learned dynamic chunking layer (H-Net style) to the standard GPT baseline. The submission is architecturally clean with no illegal techniques detected.

## Checklist

### ILLEGAL N-gram family bug — NOT PRESENT

No n-gram, bigram, hash, or XOR operations anywhere in the file. No target IDs XOR'd into any hash key. Search confirmed zero hits for
Thanks @MatoTeziTanka for the thorough audit. Especially appreciate you flagging the specific lines (891, 1075, 1197, 261, 288) for val_tokens flow and confirming the inference_mode boundary. That is exactly the kind of line-by-line verification that makes these reviews trustworthy. One note on the architecture finding, in case it's useful context for the merge decision: my own Spark ablation showed that the H-Net chunker learns the identity function regardless of configuration. All three variants (default, large chunker, boundary regularizer) produced identical BPB to 4 decimal places, which is documented in the PR body. The architecture is legal and the forward pass works, but the research takeaway is that the chunker as currently wired needs either hard chunking or stronger initial boundaries to produce non-trivial behavior. Worth noting for anyone who picks this up as a starting point.
Non-record: H-Net Dynamic Chunking - Learned Tokenization Layer
val_bpb: 1.3587 | 1x RTX 5090 Ada 16GB, 600s wallclock | sp1024
Implements OpenAI's requested "H-net tokenization" research direction.
Architecture
Results
Key Findings
Learned tokenization nearly matches the baseline. At 1.3587 vs 1.3577, the chunking layer neither helps nor hurts. This means the soft boundary prediction learns to approximately reproduce the identity function (no chunking), which is itself a useful finding.
The ~263K parameter overhead is minimal. The chunker MLP is tiny relative to the full model. In a larger model, this overhead would be negligible.
Soft chunking is a viable differentiable approximation. The model trains stably with gradient flow through the blending operation. Hard chunking (as in original H-Net) would require straight-through estimators or other non-differentiable workarounds.
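As a rough illustration of the soft-blending idea, here is a minimal hypothetical sketch of a differentiable chunker (class name, layer sizes, and the exact blending rule are assumptions, not the PR's actual implementation):

```python
import torch
import torch.nn as nn

class SoftChunker(nn.Module):
    """Hypothetical sketch of a differentiable chunking layer.

    A small MLP predicts a per-token boundary probability p_t in (0, 1).
    Each embedding is blended with its left neighbour, weighted by
    (1 - p_t): p_t near 1 keeps the token unchanged, p_t near 0 absorbs
    it into the previous chunk. If the MLP learns to output p_t ~ 1
    everywhere, the layer collapses to the identity solution described
    in the findings above.
    """

    def __init__(self, d_model: int, d_hidden: int = 64):
        super().__init__()
        self.boundary_mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        p = torch.sigmoid(self.boundary_mlp(x))                 # (B, T, 1)
        # left neighbour; the first position has no neighbour, so reuse itself
        left = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        return p * x + (1.0 - p) * left                          # fully differentiable blend
```

Because the blend is a convex combination of embeddings, gradients flow to both the boundary MLP and the embeddings without any straight-through workaround, matching the stable training behavior reported above.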
The approach is complementary to vocabulary size. Learned chunking operates at the embedding level, after BPE tokenization. It could combine with sp4096 or larger vocabularies to provide a second level of adaptive tokenization.
Comparison to Naive Baseline
Reproduction
Discussion
The neutral result suggests that fixed BPE tokenization is already a strong local optimum for this task. Learned chunking would likely show more benefit for: (a) character-level or byte-level models where fixed tokenization is suboptimal, (b) multilingual settings where BPE vocabulary is stretched, or (c) domain-specific text where standard BPE segmentation is poor.
The direction is viable and the implementation is clean. Would welcome ideas on initialization strategies that might push the chunker away from the identity solution.
Credits
Script:
train_gpt_hnet.py

Implements OpenAI's requested "H-net tokenization" direction from the README.