[Tokenizer] Add tiktoken format loader for OpenAI BPE vocabularies.#23663

Merged
benvanik merged 6 commits into main from users/benvanik/tiktoken
Mar 5, 2026

Conversation


@benvanik benvanik commented Mar 5, 2026

OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem (alongside HuggingFace's tokenizer.json). This adds a complete tiktoken loader so IREE can ingest tokenizer definitions from either ecosystem without external conversion tools.

Loader

The tiktoken format stores BPE vocabularies as base64-encoded byte tokens with integer ranks — no explicit merge list, no regex patterns, no special tokens. The loader reconstructs the full BPE merge table from ranks alone via simulation: for each multi-byte token at rank R, it simulates BPE encoding of that token's raw bytes using only merges with rank < R, and when two parts remain, those form the merge pair. This produces a tokenizer behaviorally indistinguishable from the HuggingFace equivalent.

Rank gaps are handled (p50k_base skips rank 50256, reserved for <|endoftext|>): zero-length placeholder entries fill gaps to preserve the entry_index==rank invariant, and explicit token IDs are assigned to ensure correct vocab construction.
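As a concrete illustration of the rank-simulation idea, here is a minimal Python sketch (illustrative only — function names are mine, not IREE's C API; placeholder handling for rank gaps is omitted):

```python
import base64

def load_ranks(text):
    # Each .tiktoken line is "<base64 token> <rank>".
    ranks = {}
    for line in text.splitlines():
        if line:
            token, rank = line.split()
            ranks[base64.b64decode(token)] = int(rank)
    return ranks

def recover_merge(token, rank, ranks):
    # Simulate BPE over the token's raw bytes using only merges with
    # rank < `rank`; the two surviving parts form the merge pair.
    parts = [bytes([b]) for b in token]
    while len(parts) > 2:
        best_i, best_rank = None, rank
        for i in range(len(parts) - 1):
            r = ranks.get(parts[i] + parts[i + 1])
            if r is not None and r < best_rank:
                best_i, best_rank = i, r
        if best_i is None:
            return None  # token is not reachable via lower-ranked merges
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return (parts[0], parts[1])
```

For example, with single-byte tokens `a`/`b`/`c` at ranks 0-2, `ab` at rank 3, and `abc` at rank 4, the simulation recovers `(ab, c)` as the merge producing `abc`.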

Encoding Configs

All 7 standard OpenAI encoding names are supported via predefined configs:

| Encoding | BPE File | BPE Tokens | Special Tokens | Models |
| --- | --- | --- | --- | --- |
| cl100k_base | cl100k_base.tiktoken | 100,256 | 5 | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 |
| o200k_base | o200k_base.tiktoken | 199,998 | 2 | GPT-4o, GPT-4o-mini |
| o200k_harmony | o200k_base.tiktoken | 199,998 | 10 named | GPT-4o (ChatGPT message format) |
| r50k_base | r50k_base.tiktoken | 50,256 | 1 | GPT-3, text-davinci-002/003 |
| gpt2 | r50k_base.tiktoken | 50,256 | 1 | GPT-2 (identical to r50k_base) |
| p50k_base | p50k_base.tiktoken | 50,280 | 1 | Codex, code-davinci-002 |
| p50k_edit | p50k_base.tiktoken | 50,280 | 4 | Codex edit models (adds FIM tokens) |

iree_tokenizer_tiktoken_config_by_name() resolves any of these names to a config. Custom encodings are supported via the public iree_tokenizer_tiktoken_config_t struct — populate it with your own regex pattern, special tokens, and IDs.
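The name→file aliasing from the table can be sketched in Python (sketch only; the real `iree_tokenizer_tiktoken_config_by_name()` returns a full config with regex pattern and special tokens, not just a filename):

```python
# BPE file per encoding name, taken from the table above.
TIKTOKEN_FILES = {
    "cl100k_base": "cl100k_base.tiktoken",
    "o200k_base": "o200k_base.tiktoken",
    "o200k_harmony": "o200k_base.tiktoken",
    "r50k_base": "r50k_base.tiktoken",
    "gpt2": "r50k_base.tiktoken",
    "p50k_base": "p50k_base.tiktoken",
    "p50k_edit": "p50k_base.tiktoken",
}

def bpe_file_by_name(name):
    # Unknown names fail loudly rather than falling back.
    if name not in TIKTOKEN_FILES:
        raise KeyError(f"unknown tiktoken encoding: {name}")
    return TIKTOKEN_FILES[name]
```

Note that three of the seven names (o200k_harmony, gpt2, p50k_edit) share a BPE file with another encoding and differ only in special tokens.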

Integration Testing

72 test cases across 4 BPE files (18 per encoding × 4 encodings), validated token-for-token against OpenAI's Python tiktoken library. The corpus covers: ASCII, code, numbers, punctuation, mixed case, CJK, accented text, emoji, whitespace variations, empty strings, repeated characters, special characters, leading spaces, mixed scripts, long words, carriage returns, CRLF sequences, and special token matching (<|endoftext|>).

Infrastructure:

  • generate_tiktoken_golden_ids.py — generates golden token IDs from the Python tiktoken library into tokenizer_corpus.json
  • tiktoken_smoketest.py — downloads .tiktoken files from OpenAI's CDN, runs IREE's tokenizer against the corpus, and compares output against goldens
  • run_tiktoken_smoketest.sh — uvx wrapper that installs dependencies and invokes the smoketest

Performance

Benchmark: comprehensive_benchmark with O3/march=native/thin_lto, 256KB per-corpus text (ASCII/CJK/Code), single-threaded, cache-hot.

IREE tiktoken encode throughput (one-shot, MB/s):

| Encoding | ASCII | CJK | Code |
| --- | --- | --- | --- |
| cl100k_base | 70.2 | 63.3 | 69.4 |
| o200k_base | 70.1 | 59.4 | 68.7 |
| r50k_base | 67.6 | 62.2 | 67.6 |
| p50k_base | 68.3 | 62.4 | 6.5 ¹ |

vs Python tiktoken (code one-shot, 256KB):

| Encoding | IREE | Python tiktoken | Speedup |
| --- | --- | --- | --- |
| cl100k_base | 69.4 MB/s | 13.6 MB/s | 5.1× |
| o200k_base | 68.7 MB/s | 8.1 MB/s | 8.5× |
| r50k_base | 67.6 MB/s | 13.7 MB/s | 4.9× |
| p50k_base | 6.5 MB/s | 13.4 MB/s | 0.5× ¹ |

Decode throughput: ~2 GB/s across all encodings (decode is a simple vocab lookup).

¹ p50k_base Code is slower due to 24 extra whitespace tokens (2-25 consecutive spaces, Codex indentation vocabulary) that cause combinatorial work in BPE pair validation when encoding code with long alignment indentation. This is a legacy encoding (Codex, code-davinci-002) — the modern encodings (cl100k_base, o200k_base) are unaffected. For comparison, Python tiktoken (Rust core) achieves 13.4 MB/s on the same input — about 2x faster on this specific pathological case, because tiktoken's priority-queue BPE algorithm doesn't need pair validation. On all non-pathological inputs, IREE is 5-8x faster than tiktoken.
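Since decode is described above as a plain vocab lookup, it reduces to the following (a minimal Python sketch, not IREE's C implementation):

```python
def decode(ids, vocab):
    # vocab maps token ID -> raw bytes; decode is lookup + concatenation,
    # which is why throughput is essentially memory-bound (~2 GB/s).
    return b"".join(vocab[i] for i in ids)
```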

Tool Support

The comprehensive benchmark tool auto-detects .tiktoken files by extension and infers the encoding name from the filename (e.g., cl100k_base.tiktoken → cl100k_base), with an explicit --encoding flag for non-standard names. run_benchmarks.sh downloads tiktoken files alongside HuggingFace tokenizers for automated benchmark runs.

@benvanik benvanik requested a review from stellaraccident March 5, 2026 07:06
@benvanik benvanik added the runtime Relating to the IREE runtime library label Mar 5, 2026
@benvanik benvanik force-pushed the users/benvanik/tiktoken branch from b86cbd8 to e30e295 Compare March 5, 2026 07:22
benvanik and others added 3 commits March 5, 2026 09:00
OpenAI's tiktoken is the second major tokenizer format in the ML ecosystem
(alongside HuggingFace's tokenizer.json). This adds a complete tiktoken
loader so IREE can ingest tokenizer definitions from either ecosystem
without external conversion tools.

The tiktoken format stores BPE vocabularies as base64-encoded byte tokens
with integer ranks — no explicit merge list, no regex patterns, no special
tokens. The loader reconstructs the full BPE merge table from ranks alone
via simulation: for each multi-byte token at rank R, it simulates BPE
encoding of that token's raw bytes using only merges with rank < R, and
when two parts remain, those form the merge pair. This produces a tokenizer
behaviorally indistinguishable from the HuggingFace equivalent — verified
100% token-for-token across cl100k_base and o200k_base on English, CJK,
code, emoji, and mixed-script text.

Predefined configs for cl100k_base (GPT-4), o200k_base (GPT-4o),
r50k_base (GPT-3), and p50k_base (Codex) provide the regex patterns and
special tokens that tiktoken files don't carry. Custom configs are
supported via the public iree_tokenizer_tiktoken_config_t struct.

The iree-tokenize tool now auto-detects .tiktoken files by extension
and infers the encoding name from the filename (e.g., cl100k_base.tiktoken
-> cl100k_base), with an explicit --encoding flag for non-standard names.
The comprehensive benchmark tool gains the same support.

Co-Authored-By: Claude <[email protected]>
… math.

Gardening pass on the base64 library to make it production-ready for
serving workloads. Three categories of changes:

**API improvements**: The encode/decode functions now use proper IREE
span types (iree_const_byte_span_t, iree_byte_span_t,
iree_mutable_string_view_t) instead of separate pointer+length
parameters. This keeps related fields together and matches IREE's
allocator and buffer conventions. Added iree_base64_encode() which was
previously missing — needed for any future base64 output paths.

**Overflow protection**: All size computations now use
iree_host_size_checked_add/mul. The critical fix is in
iree_base64_encoded_size() where ((data_length + 2) / 3) * 4 wraps on
pathological inputs (exploitable on 32-bit targets like WASM where IREE
runs). The function returns IREE_HOST_SIZE_MAX as an overflow sentinel —
never a valid encoded size since the result is always a multiple of 4.
The decode path's (L/4)*3 is provably safe (contraction), but uses
checked math as defense-in-depth policy.
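The overflow hazard and the sentinel behavior can be modeled in Python with a 32-bit iree_host_size_t (sketch; names are illustrative, not the C API):

```python
HOST_SIZE_MAX = 2**32 - 1  # model a 32-bit target such as WASM

def base64_encoded_size(data_length):
    # Checked ((data_length + 2) / 3) * 4. With unchecked 32-bit math,
    # data_length near 2^32 wraps (data_length + 2) to a tiny value,
    # yielding an undersized buffer. HOST_SIZE_MAX is a safe overflow
    # sentinel because valid encoded sizes are always multiples of 4.
    total = data_length + 2
    if total > HOST_SIZE_MAX:
        return HOST_SIZE_MAX
    result = (total // 3) * 4
    if result > HOST_SIZE_MAX:
        return HOST_SIZE_MAX
    return result
```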

**Testing and benchmarking**: Comprehensive encode tests (RFC 4648
vectors, size calculation, buffer-too-small, single byte values),
full 256-value encode→decode roundtrip, length-sweep roundtrip
(0-33 bytes covering all remainder cases), explicit overflow rejection
tests. Fuzzer now does bidirectional testing: random bytes as base64
input AND random binary through encode→decode roundtrip with
trap-on-mismatch. Benchmark covers encode/decode/roundtrip at 6 sizes
(4B-64KB) — sustained throughput is ~3 GB/s encode, ~2 GB/s decode.
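The length-sweep roundtrip is easy to mirror with Python's stdlib base64 (analogous to the 0-33 byte tests described above, not the C test code itself):

```python
import base64
import os

# 0-33 bytes covers every input length mod 3 (i.e. every padding case)
# many times over; each length must roundtrip exactly.
for n in range(34):
    data = os.urandom(n)
    encoded = base64.b64encode(data)
    assert len(encoded) == ((n + 2) // 3) * 4
    assert base64.b64decode(encoded) == data
```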

Co-Authored-By: Claude <[email protected]>
…support.

Adds end-to-end validation of the tiktoken loader against real OpenAI
encoding files (cl100k_base, o200k_base, r50k_base, p50k_base) and
cross-validation against equivalent HuggingFace tokenizer.json files.

Fixes a bug where the tiktoken parser rejected files with rank gaps.
Real p50k_base has a gap at rank 50256 (reserved for <|endoftext|>
special token). The parser now fills gaps with placeholder entries and
downstream code (vocab construction, merge reconstruction) skips them.

New files:
- tiktoken_smoketest.py: Downloads .tiktoken files from OpenAI CDN,
  compares IREE output against tiktoken Python library (dev mode) or
  stored goldens (ci/verify mode), with cross-validation support.
- run_tiktoken_smoketest.sh: uvx wrapper with Python dependencies.
- generate_tiktoken_golden_ids.py: Generates golden token IDs using
  the tiktoken Python library for the test corpus.

Changes:
- tiktoken.c: Handle non-contiguous ranks by inserting zero-length
  placeholder entries at gap positions. Use add_token_with_id for
  explicit ID assignment. Skip placeholders in merge reconstruction.
- tiktoken_test.cc: Add ConstructWithRankGap test with synthetic
  gapped data. Rename ParseNonContiguousRanks to ParseBackwardRanks.
- run_benchmarks.sh: Add tiktoken encoding downloads and benchmark
  entries alongside existing HuggingFace tokenizers.
- tokenizer_corpus.json: Add golden IDs for four tiktoken encodings
  (72 test cases). Compact integer arrays to single lines.
- huggingface_smoketest.py: Compact integer arrays when writing corpus.

Co-Authored-By: Claude <[email protected]>
@benvanik benvanik force-pushed the users/benvanik/tiktoken branch from e30e295 to e521f2f Compare March 5, 2026 17:57
Replace \s* with literal space matches in the integer array compaction
regex. The \s* could match newlines, overlapping with the explicit \n
in the pattern, causing theoretical exponential backtracking. Since the
input is always from json.dumps(indent=2) which uses only spaces for
indentation, [ ]* is both correct and unambiguous.
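The compaction idea can be sketched with Python's re (illustrative; the regex here is mine, not the exact script code):

```python
import json
import re

# json.dumps(indent=2) indents with spaces only, so [ ]* can never
# overlap the explicit \n and the pattern backtracks linearly.
ARRAY = re.compile(r"\[\n(?:[ ]*-?\d+,?\n)+[ ]*\]")

def compact_int_arrays(text):
    # Collapse a multi-line integer array to a single line.
    return ARRAY.sub(
        lambda m: "[" + ", ".join(re.findall(r"-?\d+", m.group(0))) + "]",
        text)

text = json.dumps({"ids": [1, 2, 3]}, indent=2)
```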

Co-Authored-By: Claude <[email protected]>
@benvanik benvanik marked this pull request as ready for review March 5, 2026 19:08
benvanik and others added 2 commits March 5, 2026 11:46
…cial token ID.

Add predefined configs for the 3 missing OpenAI encodings: gpt2 (alias for
r50k_base), p50k_edit (p50k_base + 3 FIM special tokens), and o200k_harmony
(o200k_base + 10 named special tokens for ChatGPT message format).

Add iree_tokenizer_tiktoken_config_by_name() lookup function that resolves
any of the 7 standard encoding names to a config, replacing manual if/else
chains in tool code.

Fix p50k_base <|endoftext|> special token ID: was 50281 (wrong), should be
50256 (the gap rank in p50k_base's BPE sequence). The existing test was
asserting the incorrect value. Add <|endoftext|> test case to the golden
corpus to catch this class of bug via integration tests.

Co-Authored-By: Claude <[email protected]>
…base speedup.

The pair validation cache previously only stored TRUE results. When
encodings with many overlapping tokens (like p50k_base's 24 whitespace
tokens for 2-25 consecutive spaces) encounter code with long alignment
runs, many candidate token pairs fail validation. These FALSE results
were re-validated on every occurrence — 23,719 redundant calls with
8.35M hash lookup iterations for a 256KB corpus.

The fix stores both TRUE and FALSE results using bit 31 of the cached
token2 field as a validity flag (token IDs are always below 2^31, so bit 31 is free). Cached
FALSE results are only used when deferred_merge_rank == 0; when deferral
is active, cached FALSE entries trigger re-validation since deferral may
change the result. This is safe because cached TRUE is monotonically
valid (a pair valid with no deferral is valid with any deferral).
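A Python sketch of the caching scheme (illustrative; the flag polarity and field layout are my reading of the description, not the exact C code):

```python
VALID_FLAG = 1 << 31  # token IDs stay below 2^31, so bit 31 is free

def cache_put(cache, pair, token2, is_valid):
    # Store both outcomes, packing validity into the cached token2 field.
    cache[pair] = (token2 | VALID_FLAG) if is_valid else token2

def cache_lookup(cache, pair, deferred_merge_rank):
    # True/False for a usable cached result, None means re-validate.
    entry = cache.get(pair)
    if entry is None:
        return None
    if entry & VALID_FLAG:
        return True  # TRUE is monotonically valid under any deferral.
    # FALSE is only trusted when no deferral is active.
    return False if deferred_merge_rank == 0 else None
```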

Results (O3/march=native/thin_lto, 256KB code corpus):
  p50k_base Code: 1.02 MB/s -> 6.48 MB/s (6.3x speedup)
  All other encodings: within noise (+-3%), no regressions.
  Benchmarked across cl100k_base, o200k_base, r50k_base, gpt2, llama3,
  gemma-2b, qwen2.5, mistral-nemo, deepseek-v3.

Co-Authored-By: Claude <[email protected]>
@benvanik benvanik added the post-merge-review Ben's special place. People can pick these up and review them for forward fixes if interested. label Mar 5, 2026
@benvanik benvanik merged commit 205b17f into main Mar 5, 2026
61 checks passed
@benvanik benvanik deleted the users/benvanik/tiktoken branch March 5, 2026 22:14
stellaraccident added a commit to iree-org/iree-tokenizer-py that referenced this pull request Mar 5, 2026
Wire the IREE tiktoken format loader (landed in iree-org/iree#23663) through
the Python bindings, enabling direct loading of OpenAI .tiktoken vocabulary
files with all 7 standard encoding configs.

- Update IREE version pin to 205b17f (includes tiktoken loader)
- Add from_tiktoken(), from_tiktoken_str(), from_tiktoken_buffer() APIs
- Link iree_tokenizer_format_tiktoken_tiktoken in CMakeLists.txt
- Add --encoding flag to CLI for .tiktoken file support
- Add tiktoken test suite with hardened assertions (9 tests)
- Update README and PYPI_README to document tiktoken support
- Update CI clone steps to handle commit SHA pins

Co-Authored-By: Claude Opus 4.6 <[email protected]>
stellaraccident added a commit to iree-org/iree-tokenizer-py that referenced this pull request Mar 5, 2026
## Summary

- Wire the IREE tiktoken format loader
([iree-org/iree#23663](iree-org/iree#23663))
through Python bindings with `from_tiktoken()`, `from_tiktoken_str()`,
and `from_tiktoken_buffer()` APIs supporting all 7 standard OpenAI
encodings (cl100k_base, o200k_base, o200k_harmony, r50k_base, gpt2,
p50k_base, p50k_edit)
- Update IREE version pin to `205b17f` and CI clone steps to handle
commit SHA pins
- Add `--encoding` flag to CLI for `.tiktoken` file support
- Add 9 tiktoken tests with hardened assertions (82 total tests pass
under ASAN)

## Test plan

- [x] Full test suite passes locally under ASAN (82/82)
- [ ] CI test-asan job passes with new IREE pin
- [ ] CI benchmark job validates correctness + runs benchmarks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Stella Laurenzo <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>