ah-ah-ah

Offline token counting for Claude and OpenAI models. No API calls, no network, no latency.

VUN token! TWO tokens! Count all the beautiful tokens ... offline! Ah-ah-ah!

Quick start

use ah_ah_ah::{Backend, MarkdownDecomposer, count_tokens};

// Raw counting.
let report = count_tokens("Hello, world!", None, Backend::Claude, None);
assert_eq!(report.count, 4);

// With a token budget.
let report = count_tokens("Hello, world!", Some(100), Backend::Claude, None);
assert!(!report.over_budget);

// Markdown-aware counting (respects table cell boundaries).
let md = MarkdownDecomposer;
let table = "| A | B |\n|---|---|\n| 1 | 2 |";
let report = count_tokens(table, None, Backend::Claude, Some(&md));

Backends

Claude (default)

Greedy longest-match tokenizer built from 38,360 API-verified Claude 3+ token strings, using an Aho-Corasick automaton. The vocabulary was reverse-engineered by probing the Anthropic count_tokens API ~485,000 times (see ctoc and the vocabulary recovery work in the fork that produced this dataset).

No merge table — just greedy leftmost-longest matching. This is surprisingly effective: BPE's merge rules tend to produce tokens that are also the longest matches at each position.
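To make the matching rule concrete, here is a minimal sketch of greedy leftmost-longest counting. It is not the crate's implementation: it scans a small hypothetical vocabulary linearly instead of using an Aho-Corasick automaton, and the name `greedy_count` is invented for illustration. It charges one token per unmatched byte, as described under Accuracy.

```rust
/// Count tokens by greedy leftmost-longest matching over a toy vocabulary.
/// Illustrative only: the real backend uses an Aho-Corasick automaton and
/// a much larger token list.
fn greedy_count(text: &str, vocab: &[&str]) -> usize {
    let bytes = text.as_bytes();
    let mut pos = 0;
    let mut count = 0;
    while pos < bytes.len() {
        let rest = &bytes[pos..];
        // Longest vocabulary entry that is a prefix of the remaining input.
        let longest = vocab
            .iter()
            .map(|t| t.as_bytes())
            .filter(|t| !t.is_empty() && rest.starts_with(t))
            .map(|t| t.len())
            .max();
        // No entry matches here: consume a single byte as one token.
        pos += longest.unwrap_or(1);
        count += 1;
    }
    count
}
```

With the hypothetical vocabulary `["Hello", ",", " world", "!", " wor", "He"]`, the input `Hello, world!` is split as `Hello`, `,`, ` world`, `!`, because the longer entry wins wherever two entries start at the same position.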

OpenAI

Exact o200k_base BPE encoding via bpe-openai. This is the tokenizer used by GPT-4o, GPT-4.5, and o-series models.

Accuracy

Measured against the Anthropic count_tokens API on 18 diverse test strings (English prose, source code, URLs, JSON, CJK, emoji, markdown):

Category	Behavior	Typical delta
ASCII text & code	Near-exact	0 to -2 tokens
Latin prose	Mild overcount	+5-10%
CJK characters	Significant overcount	+50-80%
Emoji	Significant overcount	+30-40%

Why overcounting happens

The vocabulary contains 33,339 ASCII tokens but only 3,156 Unicode tokens. When the greedy tokenizer encounters a byte sequence not in the vocabulary, each unmatched byte is counted as one token. A single CJK character is 3 UTF-8 bytes, so an unknown CJK character costs 3 tokens in our count vs 1 in the real tokenizer.
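The byte arithmetic gives a simple upper bound on the overcount. The sketch below assumes, as the paragraph above does, that the real tokenizer spends one token per non-ASCII character and that the fallback spends one token per UTF-8 byte. The function name `max_fallback_overcount` is hypothetical and not part of the crate.

```rust
/// Worst-case overcount for `text` if every non-ASCII character misses the
/// vocabulary. Assumes the real tokenizer uses exactly one token per such
/// character, which is a simplification.
fn max_fallback_overcount(text: &str) -> usize {
    text.chars()
        .filter(|c| !c.is_ascii())
        // A character of n UTF-8 bytes costs n tokens instead of 1.
        .map(|c| c.len_utf8() - 1)
        .sum()
}
```

For `こんにちは世界` (7 characters, 21 bytes) the bound is 14. The measured delta for that string in the content-type table is +6, which is consistent with part of the string matching vocabulary entries.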

This is the safe direction for budget enforcement: an overcount can waste some headroom, but it cannot push you past a context window. Budget estimates for non-ASCII text will be conservative.

Why undercounting happens (rare)

On some ASCII inputs, greedy longest-match produces 1-2 fewer tokens than the real BPE tokenizer. This happens when BPE's learned merge order splits text differently than left-to-right greedy — the greedy approach occasionally finds a more compact segmentation that BPE's bottom-up merge rules don't.
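The mechanism can be shown with a deliberately contrived example. The sketch below pairs a toy BPE (merge rules applied in priority order) with a toy greedy matcher. The merge list and vocabulary are invented for illustration and do not come from any real tokenizer.

```rust
/// Toy BPE: start from single characters and repeatedly apply the
/// highest-priority merge rule (lowest index in `merges`) that matches an
/// adjacent pair, taking the leftmost occurrence first.
fn bpe_count(text: &str, merges: &[(&str, &str)]) -> usize {
    let mut parts: Vec<String> = text.chars().map(String::from).collect();
    loop {
        // Best applicable (rule rank, position) seen in this pass.
        let mut best: Option<(usize, usize)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let rank = merges
                .iter()
                .position(|&(l, r)| parts[i] == l && parts[i + 1] == r);
            if let Some(rank) = rank {
                if best.map_or(true, |(b, _)| rank < b) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                let right = parts.remove(i + 1);
                parts[i].push_str(&right);
            }
            // No rule applies any more: the segmentation is final.
            None => return parts.len(),
        }
    }
}

/// Toy greedy leftmost-longest matcher over the same vocabulary.
fn greedy_count(text: &str, vocab: &[&str]) -> usize {
    let (mut pos, mut count) = (0, 0);
    while pos < text.len() {
        let rest = &text.as_bytes()[pos..];
        let longest = vocab
            .iter()
            .map(|t| t.as_bytes())
            .filter(|t| !t.is_empty() && rest.starts_with(t))
            .map(|t| t.len())
            .max();
        pos += longest.unwrap_or(1);
        count += 1;
    }
    count
}
```

With merges `[("b","c"), ("a","b"), ("ab","c")]` and vocabulary `["a","b","c","bc","ab","abc"]`, the toy BPE turns `abc` into `a` + `bc` (the `b`+`c` rule fires first, so `ab` never forms), while greedy matches `abc` in one step. The greedy count is 1 against a BPE count of 2, a delta of -1.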

Undercounting is rare (typically -1 to -2 on strings of 20+ tokens) and concentrated in:

Punctuation-heavy text (JSON, markdown tables, error messages)
Strings mixing digits with special characters ($42.99)

For budget enforcement, add a small safety margin (2-3%) if your content is punctuation-heavy.
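One way to apply such a margin is to inflate the local count before comparing it with the budget. This is a sketch with a hypothetical helper name (`fits_with_margin`), not an API of this crate. It uses integer arithmetic and rounds the margin up so that short strings still get at least one token of slack.

```rust
/// Returns true if `count`, inflated by `margin_pct` percent (rounded up),
/// still fits in `budget`. Hypothetical helper, not part of the crate.
fn fits_with_margin(count: usize, budget: usize, margin_pct: usize) -> bool {
    // Ceiling division: any non-zero percentage adds at least one token.
    let margin = (count * margin_pct + 99) / 100;
    count + margin <= budget
}
```

For example, with a 3% margin a count of 970 fits a budget of 1000 (970 + 30), while a count of 980 does not (980 + 30).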

Accuracy by content type

Content	Example	Delta
English greeting	`Hello, world!`	0 (exact)
English sentence	`The quick brown fox...`	0 (exact)
Rust code	`fn main() { println!(...) }`	-1
SQL query	`SELECT u.name, COUNT(*)...`	0 (exact)
URL	`https://example.com/path/...`	0 (exact)
ISO timestamp	`2026-03-16T14:30:00.000Z`	0 (exact)
JWT fragment	`eyJhbGciOiJIUzI1NiIs...`	0 (exact)
Rust (complex)	`fn count(&self, text: &str...)`	+2
JSON	`{"name": "test", ...}`	-2
Latin prose	`Lorem ipsum dolor sit amet...`	+7
CJK	`こんにちは世界`	+6
Emoji	`🌍🚀✨`	+3

Decomposer

Structured content like markdown tables can cause greedy tokenizers to match tokens spanning cell boundaries (e.g., matching | A as a single token when | and A should be separate). The Decomposer trait lets you plug in boundary-aware counting.

MarkdownDecomposer is included — it uses pulldown-cmark to find tables, splits rows on |, and counts each cell independently. A fast-path heuristic skips the parser entirely for content without table separator rows (|---|), so source code with pipes (match arms, closures, shell pipes) doesn't pay the parsing cost.
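A fast-path check of this kind could look roughly like the sketch below. It is an assumption about the general shape of such a heuristic, not the crate's actual code, and `has_table_separator` is a made-up name. It looks for any line made up only of pipes, dashes, colons, and spaces.

```rust
/// Returns true if any line looks like a markdown table separator row,
/// e.g. `|---|---|` or `| :--- | ---: |`. Hypothetical heuristic.
fn has_table_separator(text: &str) -> bool {
    text.lines().any(|line| {
        let line = line.trim();
        line.contains('|')
            && line.contains('-')
            && line.chars().all(|c| matches!(c, '|' | '-' | ':' | ' '))
    })
}
```

Lines such as `ls -l | wc -l` or `match x { A | B => 1 }` contain pipes but also other characters, so they fail the check and the table parser would be skipped.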

use ah_ah_ah::{Backend, Decomposer, count_tokens};

// Custom decomposer example.
struct CsvDecomposer;

impl Decomposer for CsvDecomposer {
    fn count(&self, text: &str, raw_count: &dyn Fn(&str) -> usize) -> usize {
        text.lines()
            .map(|line| {
                let commas = line.bytes().filter(|&b| b == b',').count();
                let cells: usize = line.split(',').map(|c| raw_count(c)).sum();
                commas + cells
            })
            .sum()
    }
}

Smoke testing against the API

scripts/gen-token-fixtures.sh compares ah-ah-ah counts against the live Anthropic count_tokens API (via the Claude Code CLI). It measures a baseline overhead, subtracts it from each API result, and prints a comparison table:

TEXT                                                          AH-AH     API  DELTA
Hello, world!                                                     4       4     0
The quick brown fox jumps over the lazy dog.                     11      11     0
こんにちは世界                                                    13       7    +6

Requires: the claude CLI with valid auth, jq, and the count example built with cargo build --example count.
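The DELTA column amounts to a baseline subtraction followed by a signed difference. The script itself is a shell script. The Rust sketch below only restates the arithmetic, and the function name and the baseline value used in the example are assumptions.

```rust
/// Signed difference between a local count and the API count after
/// removing the fixed overhead measured by a baseline request.
/// Hypothetical restatement of the script's arithmetic.
fn delta(local: usize, api_total: usize, baseline: usize) -> i64 {
    let api_net = api_total.saturating_sub(baseline);
    local as i64 - api_net as i64
}
```

If the baseline overhead were 8 tokens, an API total of 15 for `こんにちは世界` would leave 7 net tokens, and a local count of 13 would be reported as +6.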

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

at your option.
