[Tokenizer] Add tiktoken format loader for OpenAI BPE vocabularies.#23663
Merged
Force-pushed from b86cbd8 to e30e295.
OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem (alongside HuggingFace's tokenizer.json). This adds a complete tiktoken loader so IREE can ingest tokenizer definitions from either ecosystem without external conversion tools.

The tiktoken format stores BPE vocabularies as base64-encoded byte tokens with integer ranks: no explicit merge list, no regex patterns, no special tokens. The loader reconstructs the full BPE merge table from ranks alone via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R, and when two parts remain, those form the merge pair. This produces a tokenizer behaviorally indistinguishable from the HuggingFace equivalent, verified 100% token-for-token across cl100k_base and o200k_base on English, CJK, code, emoji, and mixed-script text.

Predefined configs for cl100k_base (GPT-4), o200k_base (GPT-4o), r50k_base (GPT-3), and p50k_base (Codex) provide the regex patterns and special tokens that tiktoken files don't carry. Custom configs are supported via the public iree_tokenizer_tiktoken_config_t struct.

The iree-tokenize tool now auto-detects .tiktoken files by extension and infers the encoding name from the filename (e.g., cl100k_base.tiktoken -> cl100k_base), with an explicit --encoding flag for non-standard names. The comprehensive benchmark tool gains the same support.

Co-Authored-By: Claude <[email protected]>
… math. Gardening pass on the base64 library to make it production-ready for serving workloads. Three categories of changes:

**API improvements**: The encode/decode functions now use proper IREE span types (iree_const_byte_span_t, iree_byte_span_t, iree_mutable_string_view_t) instead of separate pointer+length parameters. This keeps related fields together and matches IREE's allocator and buffer conventions. Added iree_base64_encode(), which was previously missing and is needed for any future base64 output paths.

**Overflow protection**: All size computations now use iree_host_size_checked_add/mul. The critical fix is in iree_base64_encoded_size(), where ((data_length + 2) / 3) * 4 wraps on pathological inputs (exploitable on 32-bit targets like WASM, where IREE runs). The function returns IREE_HOST_SIZE_MAX as an overflow sentinel, which is never a valid encoded size since the result is always a multiple of 4. The decode path's (L/4)*3 is provably safe (a contraction), but uses checked math as defense-in-depth policy.

**Testing and benchmarking**: Comprehensive encode tests (RFC 4648 vectors, size calculation, buffer-too-small, single byte values), a full 256-value encode→decode roundtrip, a length-sweep roundtrip (0-33 bytes covering all remainder cases), and explicit overflow rejection tests. The fuzzer now does bidirectional testing: random bytes as base64 input AND random binary through an encode→decode roundtrip with trap-on-mismatch. The benchmark covers encode/decode/roundtrip at 6 sizes (4B-64KB); sustained throughput is ~3 GB/s encode, ~2 GB/s decode.

Co-Authored-By: Claude <[email protected]>
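The wrap in the unchecked size formula is easy to demonstrate. The sketch below models a 32-bit iree_host_size_t with explicitly wrapping Python arithmetic (illustrative only; the actual fix lives in the C implementation and uses IREE's checked-math helpers):

```python
# Assume a 32-bit host size, as on WASM targets.
HOST_SIZE_MAX = 2**32 - 1

def base64_encoded_size_unchecked(n):
    # What ((n + 2) / 3) * 4 effectively computes with wrapping
    # C arithmetic: near-max inputs wrap to a tiny bogus size.
    return ((((n + 2) & HOST_SIZE_MAX) // 3) * 4) & HOST_SIZE_MAX

def base64_encoded_size_checked(n):
    # Checked version: return HOST_SIZE_MAX as an overflow sentinel.
    # The sentinel never collides with a valid result, because valid
    # encoded sizes are always multiples of 4 and HOST_SIZE_MAX is not.
    total = n + 2
    if total > HOST_SIZE_MAX:
        return HOST_SIZE_MAX
    out = (total // 3) * 4
    if out > HOST_SIZE_MAX:
        return HOST_SIZE_MAX
    return out
```

For example, at `n = 2**32 - 1` the unchecked formula wraps to 0, while the checked one returns the sentinel; small inputs like `n = 3` still yield 4 as expected.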
…support. Adds end-to-end validation of the tiktoken loader against real OpenAI encoding files (cl100k_base, o200k_base, r50k_base, p50k_base) and cross-validation against equivalent HuggingFace tokenizer.json files.

Fixes a bug where the tiktoken parser rejected files with rank gaps. Real p50k_base has a gap at rank 50256 (reserved for the <|endoftext|> special token). The parser now fills gaps with placeholder entries, and downstream code (vocab construction, merge reconstruction) skips them.

New files:
- tiktoken_smoketest.py: Downloads .tiktoken files from the OpenAI CDN, compares IREE output against the tiktoken Python library (dev mode) or stored goldens (ci/verify mode), with cross-validation support.
- run_tiktoken_smoketest.sh: uvx wrapper with Python dependencies.
- generate_tiktoken_golden_ids.py: Generates golden token IDs using the tiktoken Python library for the test corpus.

Changes:
- tiktoken.c: Handle non-contiguous ranks by inserting zero-length placeholder entries at gap positions. Use add_token_with_id for explicit ID assignment. Skip placeholders in merge reconstruction.
- tiktoken_test.cc: Add ConstructWithRankGap test with synthetic gapped data. Rename ParseNonContiguousRanks to ParseBackwardRanks.
- run_benchmarks.sh: Add tiktoken encoding downloads and benchmark entries alongside existing HuggingFace tokenizers.
- tokenizer_corpus.json: Add golden IDs for four tiktoken encodings (72 test cases). Compact integer arrays to single lines.
- huggingface_smoketest.py: Compact integer arrays when writing corpus.

Co-Authored-By: Claude <[email protected]>
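The gap-filling parse step can be sketched as follows (a Python illustration of the approach, not the C parser in tiktoken.c; zero-length entries stand in for the placeholder entries that downstream passes skip):

```python
import base64

def parse_tiktoken(text):
    """Parse '<base64-token> <rank>' lines into a list indexed by rank.

    Ranks may arrive in any order and may have gaps (e.g. p50k_base
    skips rank 50256); gaps become b"" placeholders so that the
    entry_index == rank invariant holds for merge reconstruction.
    """
    by_rank = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        b64_token, rank_str = line.split()
        by_rank[int(rank_str)] = base64.b64decode(b64_token)
    max_rank = max(by_rank)
    # Fill any missing rank with a zero-length placeholder entry.
    return [by_rank.get(r, b"") for r in range(max_rank + 1)]
```

With a synthetic gapped input such as `"YQ== 0\nYg== 1\nYWI= 3"`, rank 2 comes back as an empty placeholder while ranks 0, 1, and 3 carry their decoded bytes.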
Force-pushed from e30e295 to e521f2f.
Replace \s* with literal space matches in the integer array compaction regex. The \s* could match newlines, overlapping with the explicit \n in the pattern, causing theoretical exponential backtracking. Since the input is always from json.dumps(indent=2) which uses only spaces for indentation, [ ]* is both correct and unambiguous. Co-Authored-By: Claude <[email protected]>
…cial token ID. Add predefined configs for the 3 missing OpenAI encodings: gpt2 (alias for r50k_base), p50k_edit (p50k_base + 3 FIM special tokens), and o200k_harmony (o200k_base + 10 named special tokens for the ChatGPT message format). Add an iree_tokenizer_tiktoken_config_by_name() lookup function that resolves any of the 7 standard encoding names to a config, replacing manual if/else chains in tool code.

Fix the p50k_base <|endoftext|> special token ID: it was 50281 (wrong) and should be 50256 (the gap rank in p50k_base's BPE sequence). The existing test was asserting the incorrect value. Add an <|endoftext|> test case to the golden corpus to catch this class of bug via integration tests.

Co-Authored-By: Claude <[email protected]>
…base speedup. The pair validation cache previously only stored TRUE results. When encodings with many overlapping tokens (like p50k_base's 24 whitespace tokens for 2-25 consecutive spaces) encounter code with long alignment runs, many candidate token pairs fail validation. These FALSE results were re-validated on every occurrence: 23,719 redundant calls with 8.35M hash lookup iterations for a 256KB corpus.

The fix stores both TRUE and FALSE results, using bit 31 of the cached token2 field as a validity flag (token IDs never exceed 2^31). Cached FALSE results are only used when deferred_merge_rank == 0; when deferral is active, cached FALSE entries trigger re-validation since deferral may change the result. This is safe because cached TRUE is monotonically valid (a pair valid with no deferral is valid with any deferral).

Results (O3/march=native/thin_lto, 256KB code corpus):
- p50k_base Code: 1.02 MB/s -> 6.48 MB/s (6.3x speedup)
- All other encodings: within noise (+-3%), no regressions

Benchmarked across cl100k_base, o200k_base, r50k_base, gpt2, llama3, gemma-2b, qwen2.5, mistral-nemo, deepseek-v3.

Co-Authored-By: Claude <[email protected]>
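The bit-31 caching scheme can be sketched as follows (a simplified Python illustration with invented helper names; the real code lives in the C tokenizer and keys the cache differently):

```python
# Token IDs never exceed 2^31, so bit 31 of the cached token2 field
# is free to carry the validation verdict (1 = valid, 0 = invalid).
VALID_BIT = 1 << 31

def cache_store(cache, token1, token2, is_valid):
    cache[token1] = token2 | (VALID_BIT if is_valid else 0)

def cache_lookup(cache, token1, token2, deferred_merge_rank=0):
    """True/False when the cache can answer; None forces re-validation."""
    entry = cache.get(token1)
    if entry is None or (entry & ~VALID_BIT) != token2:
        return None  # cache miss
    if entry & VALID_BIT:
        return True  # cached TRUE is monotonically valid under deferral
    if deferred_merge_rank != 0:
        return None  # deferral may flip a cached FALSE: re-validate
    return False
```

The asymmetry is the whole point: a cached TRUE is trusted unconditionally, while a cached FALSE is only trusted when no deferral is active.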
stellaraccident added a commit to iree-org/iree-tokenizer-py that referenced this pull request on Mar 5, 2026:
Wire the IREE tiktoken format loader (landed in iree-org/iree#23663) through the Python bindings, enabling direct loading of OpenAI .tiktoken vocabulary files with all 7 standard encoding configs.

- Update IREE version pin to 205b17f (includes tiktoken loader)
- Add from_tiktoken(), from_tiktoken_str(), from_tiktoken_buffer() APIs
- Link iree_tokenizer_format_tiktoken_tiktoken in CMakeLists.txt
- Add --encoding flag to CLI for .tiktoken file support
- Add tiktoken test suite with hardened assertions (9 tests)
- Update README and PYPI_README to document tiktoken support
- Update CI clone steps to handle commit SHA pins

Co-Authored-By: Claude Opus 4.6 <[email protected]>
stellaraccident added a commit to iree-org/iree-tokenizer-py that referenced this pull request on Mar 5, 2026:
## Summary
- Wire the IREE tiktoken format loader ([iree-org/iree#23663](iree-org/iree#23663)) through Python bindings with `from_tiktoken()`, `from_tiktoken_str()`, and `from_tiktoken_buffer()` APIs supporting all 7 standard OpenAI encodings (cl100k_base, o200k_base, o200k_harmony, r50k_base, gpt2, p50k_base, p50k_edit)
- Update IREE version pin to `205b17f` and CI clone steps to handle commit SHA pins
- Add `--encoding` flag to CLI for `.tiktoken` file support
- Add 9 tiktoken tests with hardened assertions (82 total tests pass under ASAN)

## Test plan
- [x] Full test suite passes locally under ASAN (82/82)
- [ ] CI test-asan job passes with new IREE pin
- [ ] CI benchmark job validates correctness + runs benchmarks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Stella Laurenzo <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem (alongside HuggingFace's tokenizer.json). This adds a complete tiktoken loader so IREE can ingest tokenizer definitions from either ecosystem without external conversion tools.
Loader
The tiktoken format stores BPE vocabularies as base64-encoded byte tokens with integer ranks — no explicit merge list, no regex patterns, no special tokens. The loader reconstructs the full BPE merge table from ranks alone via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R, and when two parts remain, those form the merge pair. This produces a tokenizer behaviorally indistinguishable from the HuggingFace equivalent.
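The rank-only reconstruction described above can be sketched in a few lines of Python (a hedged illustration of the technique, not the IREE C implementation; helper names are invented):

```python
def reconstruct_merges(ranked_tokens):
    """ranked_tokens: list of byte-strings where list index == BPE rank.

    For each multi-byte token at rank r, simulate BPE over its raw
    bytes using only merges with rank < r; the two parts that remain
    are that rank's merge pair.
    """
    rank_of = {tok: r for r, tok in enumerate(ranked_tokens)}
    merges = {}  # rank -> (left_bytes, right_bytes)
    for r, tok in enumerate(ranked_tokens):
        if len(tok) < 2:
            continue  # single bytes are base vocabulary, not merges
        parts = [tok[i:i + 1] for i in range(len(tok))]
        while len(parts) > 2:
            # Greedily apply the lowest-rank adjacent merge below r.
            best = min(
                range(len(parts) - 1),
                key=lambda i: rank_of.get(parts[i] + parts[i + 1], float("inf")),
            )
            pair = parts[best] + parts[best + 1]
            if rank_of.get(pair, float("inf")) >= r:
                break  # no applicable merge below rank r
            parts[best:best + 2] = [pair]
        if len(parts) == 2:
            merges[r] = (parts[0], parts[1])
    return merges
```

On a toy vocabulary `[b"a", b"b", b"c", b"ab", b"abc"]`, rank 3 (`b"ab"`) yields the pair `(b"a", b"b")`, and rank 4 (`b"abc"`) first collapses to `[b"ab", b"c"]` using the rank-3 merge, yielding `(b"ab", b"c")`.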
Rank gaps are handled (p50k_base skips rank 50256, reserved for <|endoftext|>): zero-length placeholder entries fill gaps to preserve the entry_index==rank invariant, and explicit token IDs are assigned to ensure correct vocab construction.

Encoding Configs
All 7 standard OpenAI encoding names are supported via predefined configs:
- cl100k_base
- o200k_base
- o200k_harmony
- r50k_base
- gpt2
- p50k_base
- p50k_edit

iree_tokenizer_tiktoken_config_by_name() resolves any of these names to a config. Custom encodings are supported via the public iree_tokenizer_tiktoken_config_t struct: populate it with your own regex pattern, special tokens, and IDs.

Integration Testing
72 test cases across 4 BPE files (18 per encoding × 4 encodings), validated token-for-token against OpenAI's Python tiktoken library. The test corpus covers: ASCII, code, numbers, punctuation, mixed case, CJK, accented text, emoji, whitespace variations, empty strings, repeated characters, special characters, leading spaces, mixed scripts, long words, carriage returns, CRLF sequences, and special token matching (<|endoftext|>).

Infrastructure:
- generate_tiktoken_golden_ids.py: generates golden token IDs from the Python tiktoken library into tokenizer_corpus.json
- tiktoken_smoketest.py: downloads .tiktoken files from OpenAI's CDN, runs IREE's tokenizer against the corpus, and compares output against goldens
- run_tiktoken_smoketest.sh: uvx wrapper that installs dependencies and invokes the smoketest

Performance
Benchmark: comprehensive_benchmark with O3/march=native/thin_lto, 256KB per-corpus text (ASCII/CJK/Code), single-threaded, cache-hot.

IREE tiktoken encode throughput (one-shot, MB/s):
vs Python tiktoken (code one-shot, 256KB):
Decode throughput: ~2 GB/s across all encodings (decode is a simple vocab lookup).
¹ p50k_base Code is slower due to 24 extra whitespace tokens (2-25 consecutive spaces, Codex indentation vocabulary) that cause combinatorial work in BPE pair validation when encoding code with long alignment indentation. This is a legacy encoding (Codex, code-davinci-002) — the modern encodings (cl100k_base, o200k_base) are unaffected. For comparison, Python tiktoken (Rust core) achieves 13.4 MB/s on the same input — about 2x faster on this specific pathological case, because tiktoken's priority-queue BPE algorithm doesn't need pair validation. On all non-pathological inputs, IREE is 5-8x faster than tiktoken.
Tool Support
The comprehensive benchmark tool auto-detects .tiktoken files by extension and infers the encoding name from the filename (e.g., cl100k_base.tiktoken → cl100k_base), with an explicit --encoding flag for non-standard names. run_benchmarks.sh downloads tiktoken files alongside HuggingFace tokenizers for automated benchmark runs.
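The filename-based inference can be sketched as follows (a Python illustration of the rule; the actual tool logic is C and may differ in detail):

```python
from pathlib import Path

# The 7 standard encoding names with predefined configs.
KNOWN_ENCODINGS = {
    "cl100k_base", "o200k_base", "o200k_harmony",
    "r50k_base", "gpt2", "p50k_base", "p50k_edit",
}

def infer_encoding(path, explicit=None):
    """Resolve the encoding name for a .tiktoken file.

    An explicit value (the --encoding flag) always wins; otherwise the
    filename stem is used when it names a known standard encoding.
    """
    if explicit:
        return explicit
    p = Path(path)
    if p.suffix == ".tiktoken" and p.stem in KNOWN_ENCODINGS:
        return p.stem
    raise ValueError(f"cannot infer encoding from {path!r}; pass --encoding")
```

So `cl100k_base.tiktoken` resolves to `cl100k_base` automatically, while a non-standard filename requires the explicit flag.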