[Tokenizer] Add tiktoken format loader for OpenAI BPE vocabularies.#23663

Merged
benvanik merged 6 commits into main from users/benvanik/tiktoken
Mar 5, 2026

Conversation


@benvanik benvanik commented Mar 5, 2026

OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem (alongside HuggingFace's tokenizer.json). This adds a complete tiktoken loader so IREE can ingest tokenizer definitions from either ecosystem without external conversion tools.

Loader

The tiktoken format stores BPE vocabularies as base64-encoded byte tokens with integer ranks — no explicit merge list, no regex patterns, no special tokens. The loader reconstructs the full BPE merge table from ranks alone via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R, and when two parts remain, those form the merge pair. This produces a tokenizer behaviorally indistinguishable from the HuggingFace equivalent.

Rank gaps are handled (p50k_base skips rank 50256, reserved for <|endoftext|>): zero-length placeholder entries fill gaps to preserve the entry_index==rank invariant, and explicit token IDs are assigned to ensure correct vocab construction.
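As a concrete illustration of the rank-simulation idea, here is a minimal Python sketch (illustrative only — function names are mine, not IREE's C API; placeholder handling for rank gaps is omitted):

```python
import base64

def load_ranks(text):
    # Each .tiktoken line is "<base64 token> <rank>".
    ranks = {}
    for line in text.splitlines():
        if line:
            token, rank = line.split()
            ranks[base64.b64decode(token)] = int(rank)
    return ranks

def recover_merge(token, rank, ranks):
    # Simulate BPE over the token's raw bytes using only merges with
    # rank < `rank`; the two surviving parts form the merge pair.
    parts = [bytes([b]) for b in token]
    while len(parts) > 2:
        best_i, best_rank = None, rank
        for i in range(len(parts) - 1):
            r = ranks.get(parts[i] + parts[i + 1])
            if r is not None and r < best_rank:
                best_i, best_rank = i, r
        if best_i is None:
            return None  # token is not reachable via lower-ranked merges
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return (parts[0], parts[1])
```

For example, with single-byte tokens `a`/`b`/`c` at ranks 0-2, `ab` at rank 3, and `abc` at rank 4, the simulation recovers `(ab, c)` as the merge producing `abc`.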

Encoding Configs

All 7 standard OpenAI encoding names are supported via predefined configs:

| Encoding | BPE File | BPE Tokens | Special Tokens | Models |
| --- | --- | --- | --- | --- |
| cl100k_base | cl100k_base.tiktoken | 100,256 | 5 | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 |
| o200k_base | o200k_base.tiktoken | 199,998 | 2 | GPT-4o, GPT-4o-mini |
| o200k_harmony | o200k_base.tiktoken | 199,998 | 10 named | GPT-4o (ChatGPT message format) |
| r50k_base | r50k_base.tiktoken | 50,256 | 1 | GPT-3, text-davinci-002/003 |
| gpt2 | r50k_base.tiktoken | 50,256 | 1 | GPT-2 (identical to r50k_base) |
| p50k_base | p50k_base.tiktoken | 50,280 | 1 | Codex, code-davinci-002 |
| p50k_edit | p50k_base.tiktoken | 50,280 | 4 | Codex edit models (adds FIM tokens) |

iree_tokenizer_tiktoken_config_by_name() resolves any of these names to a config. Custom encodings are supported via the public iree_tokenizer_tiktoken_config_t struct — populate it with your own regex pattern, special tokens, and IDs.
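The name→file aliasing from the table can be sketched in Python (sketch only; the real `iree_tokenizer_tiktoken_config_by_name()` returns a full config with regex pattern and special tokens, not just a filename):

```python
# BPE file per encoding name, taken from the table above.
TIKTOKEN_FILES = {
    "cl100k_base": "cl100k_base.tiktoken",
    "o200k_base": "o200k_base.tiktoken",
    "o200k_harmony": "o200k_base.tiktoken",
    "r50k_base": "r50k_base.tiktoken",
    "gpt2": "r50k_base.tiktoken",
    "p50k_base": "p50k_base.tiktoken",
    "p50k_edit": "p50k_base.tiktoken",
}

def bpe_file_by_name(name):
    # Unknown names fail loudly rather than falling back.
    if name not in TIKTOKEN_FILES:
        raise KeyError(f"unknown tiktoken encoding: {name}")
    return TIKTOKEN_FILES[name]
```

Note that three of the seven names (o200k_harmony, gpt2, p50k_edit) share a BPE file with another encoding and differ only in special tokens.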

Integration Testing

72 test cases across 4 BPE files (18 per encoding × 4 encodings), validated token-for-token against OpenAI's Python tiktoken library. The corpus covers: ASCII, code, numbers, punctuation, mixed case, CJK, accented text, emoji, whitespace variations, empty strings, repeated characters, special characters, leading spaces, mixed scripts, long words, carriage returns, CRLF sequences, and special token matching (<|endoftext|>).

Infrastructure:

  • generate_tiktoken_golden_ids.py — generates golden token IDs from the Python tiktoken library into tokenizer_corpus.json
  • tiktoken_smoketest.py — downloads .tiktoken files from OpenAI's CDN, runs IREE's tokenizer against the corpus, and compares output against goldens
  • run_tiktoken_smoketest.sh — uvx wrapper that installs dependencies and invokes the smoketest

Performance

Benchmark: comprehensive_benchmark with O3/march=native/thin_lto, 256KB per-corpus text (ASCII/CJK/Code), single-threaded, cache-hot.

IREE tiktoken encode throughput (one-shot, MB/s):

| Encoding | ASCII | CJK | Code |
| --- | --- | --- | --- |
| cl100k_base | 70.2 | 63.3 | 69.4 |
| o200k_base | 70.1 | 59.4 | 68.7 |
| r50k_base | 67.6 | 62.2 | 67.6 |
| p50k_base | 68.3 | 62.4 | 6.5 ¹ |

vs Python tiktoken (code one-shot, 256KB):

| Encoding | IREE | Python tiktoken | Speedup |
| --- | --- | --- | --- |
| cl100k_base | 69.4 MB/s | 13.6 MB/s | 5.1× |
| o200k_base | 68.7 MB/s | 8.1 MB/s | 8.5× |
| r50k_base | 67.6 MB/s | 13.7 MB/s | 4.9× |
| p50k_base | 6.5 MB/s | 13.4 MB/s | 0.5× ¹ |

Decode throughput: ~2 GB/s across all encodings (decode is a simple vocab lookup).

¹ p50k_base Code is slower due to 24 extra whitespace tokens (2-25 consecutive spaces, Codex indentation vocabulary) that cause combinatorial work in BPE pair validation when encoding code with long alignment indentation. This is a legacy encoding (Codex, code-davinci-002) — the modern encodings (cl100k_base, o200k_base) are unaffected. For comparison, Python tiktoken (Rust core) achieves 13.4 MB/s on the same input — about 2x faster on this specific pathological case, because tiktoken's priority-queue BPE algorithm doesn't need pair validation. On all non-pathological inputs, IREE is 5-8x faster than tiktoken.
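Since decode is described above as a plain vocab lookup, it reduces to the following (a minimal Python sketch, not IREE's C implementation):

```python
def decode(ids, vocab):
    # vocab maps token ID -> raw bytes; decode is lookup + concatenation,
    # which is why throughput is essentially memory-bound (~2 GB/s).
    return b"".join(vocab[i] for i in ids)
```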

Tool Support

The comprehensive benchmark tool auto-detects .tiktoken files by extension and infers the encoding name from the filename (e.g., cl100k_base.tiktoken → cl100k_base), with an explicit --encoding flag for non-standard names. run_benchmarks.sh downloads tiktoken files alongside HuggingFace tokenizers for automated benchmark runs.

@benvanik benvanik requested a review from stellaraccident March 5, 2026 07:06
@benvanik benvanik added the runtime Relating to the IREE runtime library label Mar 5, 2026
@benvanik benvanik force-pushed the users/benvanik/tiktoken branch from b86cbd8 to e30e295 Compare March 5, 2026 07:22
benvanik and others added 3 commits March 5, 2026 09:00
OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem
(alongside HuggingFace's tokenizer.json). This adds a complete tiktoken
loader so IREE can ingest tokenizer definitions from either ecosystem
without external conversion tools.

The tiktoken format stores BPE vocabularies as base64-encoded byte tokens
with integer ranks — no explicit merge list, no regex patterns, no special
tokens. The loader reconstructs the full BPE merge table from ranks alone
via simulation: for each multi-byte token at rank R, it simulates BPE
encoding of that token's raw bytes using only merges with rank < R, and
when two parts remain, those form the merge pair. This produces a tokenizer
behaviorally indistinguishable from the HuggingFace equivalent — verified
100% token-for-token across cl100k_base and o200k_base on English, CJK,
code, emoji, and mixed-script text.

Predefined configs for cl100k_base (GPT-4), o200k_base (GPT-4o),
r50k_base (GPT-3), and p50k_base (Codex) provide the regex patterns and
special tokens that tiktoken files don't carry. Custom configs are
supported via the public iree_tokenizer_tiktoken_config_t struct.

The iree-tokenize tool now auto-detects .tiktoken files by extension
and infers the encoding name from the filename (e.g., cl100k_base.tiktoken
-> cl100k_base), with an explicit --encoding flag for non-standard names.
The comprehensive benchmark tool gains the same support.

Co-Authored-By: Claude <[email protected]>
… math.

Gardening pass on the base64 library to make it production-ready for
serving workloads. Three categories of changes:

**API improvements**: The encode/decode functions now use proper IREE
span types (iree_const_byte_span_t, iree_byte_span_t,
iree_mutable_string_view_t) instead of separate pointer+length
parameters. This keeps related fields together and matches IREE's
allocator and buffer conventions. Added iree_base64_encode() which was
previously missing — needed for any future base64 output paths.

**Overflow protection**: All size computations now use
iree_host_size_checked_add/mul. The critical fix is in
iree_base64_encoded_size() where ((data_length + 2) / 3) * 4 wraps on
pathological inputs (exploitable on 32-bit targets like WASM where IREE
runs). The function returns IREE_HOST_SIZE_MAX as an overflow sentinel —
never a valid encoded size since the result is always a multiple of 4.
The decode path's (L/4)*3 is provably safe (contraction), but uses
checked math as defense-in-depth policy.
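The overflow hazard and the sentinel behavior can be modeled in Python with a 32-bit iree_host_size_t (sketch; names are illustrative, not the C API):

```python
HOST_SIZE_MAX = 2**32 - 1  # model a 32-bit target such as WASM

def base64_encoded_size(data_length):
    # Checked ((data_length + 2) / 3) * 4. With unchecked 32-bit math,
    # data_length near 2^32 wraps (data_length + 2) to a tiny value,
    # yielding an undersized buffer. HOST_SIZE_MAX is a safe overflow
    # sentinel because valid encoded sizes are always multiples of 4.
    total = data_length + 2
    if total > HOST_SIZE_MAX:
        return HOST_SIZE_MAX
    result = (total // 3) * 4
    if result > HOST_SIZE_MAX:
        return HOST_SIZE_MAX
    return result
```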

**Testing and benchmarking**: Comprehensive encode tests (RFC 4648
vectors, size calculation, buffer-too-small, single byte values),
full 256-value encode→decode roundtrip, length-sweep roundtrip
(0-33 bytes covering all remainder cases), explicit overflow rejection
tests. Fuzzer now does bidirectional testing: random bytes as base64
input AND random binary through encode→decode roundtrip with
trap-on-mismatch. Benchmark covers encode/decode/roundtrip at 6 sizes
(4B-64KB) — sustained throughput is ~3 GB/s encode, ~2 GB/s decode.
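The length-sweep roundtrip is easy to mirror with Python's stdlib base64 (analogous to the 0-33 byte tests described above, not the C test code itself):

```python
import base64
import os

# 0-33 bytes covers every input length mod 3 (i.e. every padding case)
# many times over; each length must roundtrip exactly.
for n in range(34):
    data = os.urandom(n)
    encoded = base64.b64encode(data)
    assert len(encoded) == ((n + 2) // 3) * 4
    assert base64.b64decode(encoded) == data
```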

Co-Authored-By: Claude <[email protected]>
…support.

Adds end-to-end validation of the tiktoken loader against real OpenAI
encoding files (cl100k_base, o200k_base, r50k_base, p50k_base) and
cross-validation against equivalent HuggingFace tokenizer.json files.

Fixes a bug where the tiktoken parser rejected files with rank gaps.
Real p50k_base has a gap at rank 50256 (reserved for <|endoftext|>
special token). The parser now fills gaps with placeholder entries and
downstream code (vocab construction, merge reconstruction) skips them.

New files:
- tiktoken_smoketest.py: Downloads .tiktoken files from OpenAI CDN,
  compares IREE output against tiktoken Python library (dev mode) or
  stored goldens (ci/verify mode), with cross-validation support.
- run_tiktoken_smoketest.sh: uvx wrapper with Python dependencies.
- generate_tiktoken_golden_ids.py: Generates golden token IDs using
  the tiktoken Python library for the test corpus.

Changes:
- tiktoken.c: Handle non-contiguous ranks by inserting zero-length
  placeholder entries at gap positions. Use add_token_with_id for
  explicit ID assignment. Skip placeholders in merge reconstruction.
- tiktoken_test.cc: Add ConstructWithRankGap test with synthetic
  gapped data. Rename ParseNonContiguousRanks to ParseBackwardRanks.
- run_benchmarks.sh: Add tiktoken encoding downloads and benchmark
  entries alongside existing HuggingFace tokenizers.
- tokenizer_corpus.json: Add golden IDs for four tiktoken encodings
  (72 test cases). Compact integer arrays to single lines.
- huggingface_smoketest.py: Compact integer arrays when writing corpus.

Co-Authored-By: Claude <[email protected]>
@benvanik benvanik force-pushed the users/benvanik/tiktoken branch from e30e295 to e521f2f Compare March 5, 2026 17:57
Replace \s* with literal space matches in the integer array compaction
regex. The \s* could match newlines, overlapping with the explicit \n
in the pattern, causing theoretical exponential backtracking. Since the
input is always from json.dumps(indent=2) which uses only spaces for
indentation, [ ]* is both correct and unambiguous.
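The compaction idea can be sketched with Python's re (illustrative; the regex here is mine, not the exact script code):

```python
import json
import re

# json.dumps(indent=2) indents with spaces only, so [ ]* can never
# overlap the explicit \n and the pattern backtracks linearly.
ARRAY = re.compile(r"\[\n(?:[ ]*-?\d+,?\n)+[ ]*\]")

def compact_int_arrays(text):
    # Collapse a multi-line integer array to a single line.
    return ARRAY.sub(
        lambda m: "[" + ", ".join(re.findall(r"-?\d+", m.group(0))) + "]",
        text)

text = json.dumps({"ids": [1, 2, 3]}, indent=2)
```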

Co-Authored-By: Claude <[email protected]>
@benvanik benvanik marked this pull request as ready for review March 5, 2026 19:08
benvanik and others added 2 commits March 5, 2026 11:46
…cial token ID.

Add predefined configs for the 3 missing OpenAI encodings: gpt2 (alias for
r50k_base), p50k_edit (p50k_base + 3 FIM special tokens), and o200k_harmony
(o200k_base + 10 named special tokens for ChatGPT message format).

Add iree_tokenizer_tiktoken_config_by_name() lookup function that resolves
any of the 7 standard encoding names to a config, replacing manual if/else
chains in tool code.

Fix p50k_base <|endoftext|> special token ID: was 50281 (wrong), should be
50256 (the gap rank in p50k_base's BPE sequence). The existing test was
asserting the incorrect value. Add <|endoftext|> test case to the golden
corpus to catch this class of bug via integration tests.

Co-Authored-By: Claude <[email protected]>
…base speedup.

The pair validation cache previously only stored TRUE results. When
encodings with many overlapping tokens (like p50k_base's 24 whitespace
tokens for 2-25 consecutive spaces) encounter code with long alignment
runs, many candidate token pairs fail validation. These FALSE results
were re-validated on every occurrence — 23,719 redundant calls with
8.35M hash lookup iterations for a 256KB corpus.

The fix stores both TRUE and FALSE results using bit 31 of the cached
token2 field as a validity flag (token IDs are always below 2^31, so bit 31 is free). Cached
FALSE results are only used when deferred_merge_rank == 0; when deferral
is active, cached FALSE entries trigger re-validation since deferral may
change the result. This is safe because cached TRUE is monotonically
valid (a pair valid with no deferral is valid with any deferral).
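A Python sketch of the caching scheme (illustrative; the flag polarity and field layout are my reading of the description, not the exact C code):

```python
VALID_FLAG = 1 << 31  # token IDs stay below 2^31, so bit 31 is free

def cache_put(cache, pair, token2, is_valid):
    # Store both outcomes, packing validity into the cached token2 field.
    cache[pair] = (token2 | VALID_FLAG) if is_valid else token2

def cache_lookup(cache, pair, deferred_merge_rank):
    # True/False for a usable cached result, None means re-validate.
    entry = cache.get(pair)
    if entry is None:
        return None
    if entry & VALID_FLAG:
        return True  # TRUE is monotonically valid under any deferral.
    # FALSE is only trusted when no deferral is active.
    return False if deferred_merge_rank == 0 else None
```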

Results (O3/march=native/thin_lto, 256KB code corpus):
  p50k_base Code: 1.02 MB/s -> 6.48 MB/s (6.3x speedup)
  All other encodings: within noise (+-3%), no regressions.
  Benchmarked across cl100k_base, o200k_base, r50k_base, gpt2, llama3,
  gemma-2b, qwen2.5, mistral-nemo, deepseek-v3.

Co-Authored-By: Claude <[email protected]>
@benvanik benvanik added the post-merge-review Ben's special place. People can pick these up and review them for forward fixes if interested. label Mar 5, 2026
@benvanik benvanik merged commit 205b17f into main Mar 5, 2026
61 checks passed
@benvanik benvanik deleted the users/benvanik/tiktoken branch March 5, 2026 22:14
stellaraccident added a commit to iree-org/iree-tokenizer-py that referenced this pull request Mar 5, 2026
Wire the IREE tiktoken format loader (landed in iree-org/iree#23663) through
the Python bindings, enabling direct loading of OpenAI .tiktoken vocabulary
files with all 7 standard encoding configs.

- Update IREE version pin to 205b17f (includes tiktoken loader)
- Add from_tiktoken(), from_tiktoken_str(), from_tiktoken_buffer() APIs
- Link iree_tokenizer_format_tiktoken_tiktoken in CMakeLists.txt
- Add --encoding flag to CLI for .tiktoken file support
- Add tiktoken test suite with hardened assertions (9 tests)
- Update README and PYPI_README to document tiktoken support
- Update CI clone steps to handle commit SHA pins

Co-Authored-By: Claude Opus 4.6 <[email protected]>
stellaraccident added a commit to iree-org/iree-tokenizer-py that referenced this pull request Mar 5, 2026
## Summary

- Wire the IREE tiktoken format loader
([iree-org/iree#23663](iree-org/iree#23663))
through Python bindings with `from_tiktoken()`, `from_tiktoken_str()`,
and `from_tiktoken_buffer()` APIs supporting all 7 standard OpenAI
encodings (cl100k_base, o200k_base, o200k_harmony, r50k_base, gpt2,
p50k_base, p50k_edit)
- Update IREE version pin to `205b17f` and CI clone steps to handle
commit SHA pins
- Add `--encoding` flag to CLI for `.tiktoken` file support
- Add 9 tiktoken tests with hardened assertions (82 total tests pass
under ASAN)

## Test plan

- [x] Full test suite passes locally under ASAN (82/82)
- [ ] CI test-asan job passes with new IREE pin
- [ ] CI benchmark job validates correctness + runs benchmarks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Stella Laurenzo <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>