A progressive, hands-on Jupyter Notebook that guides you from raw Python strings to a fully functional Byte Pair Encoding (BPE) tokenizer.
Heavily inspired by Andrej Karpathy's excellent video: Let's build the GPT Tokenizer
Explore how characters map to Unicode code points and UTF-8 byte sequences. Feed in emojis (🐍), flags (🇯🇵), and non-English text to see how a single on-screen character can expand into anywhere from 1 to 8+ bytes: each code point takes 1–4 bytes in UTF-8, and some characters, like flags, are built from multiple code points.
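For example, plain Python with no dependencies is enough to watch the expansion happen (a minimal sketch; the notebook's own cells may differ):

```python
# Map each character to its Unicode code point(s) and UTF-8 bytes.
for text in ["a", "é", "🐍", "🇯🇵"]:
    points = [f"U+{ord(c):04X}" for c in text]
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(points)} code point(s) {points} -> {len(raw)} bytes ({raw.hex(' ')})")
```

The flag is the instructive case: it renders as one glyph but is two regional-indicator code points, so it costs 8 bytes in UTF-8.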
Implement the core BPE functions (get_stats, merge) from scratch. Step through the iterative merge process and visualize compression ratios at each step.
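For a preview of the algorithm's core, here is a minimal sketch of both functions in the shape of Karpathy's reference implementation (the notebook's exact code may differ):

```python
def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every (non-overlapping) occurrence of `pair` with new token `idx`."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

# One training step: merge the most frequent pair into a fresh token id.
ids = list("aaabdaaabac".encode("utf-8"))
stats = get_stats(ids)
top_pair = max(stats, key=stats.get)
ids = merge(ids, top_pair, 256)  # 256 = first id beyond the raw byte range
```

Training is just this step in a loop; the compression ratio after each merge is simply the original byte count divided by the current length of `ids`.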
Test the GPT-2 and GPT-4 pre-tokenization regex patterns on code snippets and natural text. See how pre-splitting prevents cross-category merges (e.g., "dog" can never fuse with the trailing "." in "dog.").
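To experiment outside the notebook: the pattern below is the GPT-2 one published in OpenAI's gpt-2 repo (encoder.py). It needs the third-party regex package, since the stdlib re module lacks \p{...} classes:

```python
import regex as re  # pip install regex

GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

print(re.findall(GPT2_PATTERN, "I love my dog. dog!"))
# ['I', ' love', ' my', ' dog', '.', ' dog', '!']
```

BPE merges then happen only within each chunk, so letters and punctuation never end up in the same token.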
Investigate the "SolidGoldMagikarp" phenomenon: tokens that earned a vocabulary slot during tokenizer training but rarely or never appeared in the model's training data, leaving their embeddings essentially untrained and prone to triggering bizarre behavior.
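You can probe for candidate glitch tokens with tiktoken; a sketch, assuming tiktoken is installed (the classic glitch tokens date from the GPT-2-era vocabulary, so results vary by encoding):

```python
import tiktoken  # pip install tiktoken

# Does this string occupy a single, dedicated slot in the vocabulary?
for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(" SolidGoldMagikarp")
    print(f"{name}: {len(ids)} token(s) -> {ids}")
```

A one-token result means the string has its own embedding row, and that row is only as good as the training data that touched it.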
Compare your scratch-built BPE tokenizer against GPT-4's cl100k_base encoding. Measure compression density across English prose, Python code, JSON, YAML, and multilingual text.
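One simple density metric is UTF-8 bytes consumed per token emitted; a minimal sketch using tiktoken (swap your scratch tokenizer in wherever enc appears):

```python
import tiktoken

def bytes_per_token(enc, text):
    """Higher is denser: UTF-8 bytes consumed per token emitted."""
    return len(text.encode("utf-8")) / len(enc.encode(text))

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding
samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "python":  "def add(a, b):\n    return a + b\n",
    "json":    '{"name": "fox", "legs": 4}',
}
for label, text in samples.items():
    print(f"{label}: {bytes_per_token(enc, text):.2f} bytes/token")
```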
```bash
pip install -r requirements.txt
jupyter notebook tokenizer_fun.ipynb
```

This project is heavily inspired by Andrej Karpathy's video Let's build the GPT Tokenizer, which provides an outstanding deep dive into how tokenizers work and why they matter for LLMs.