ToCount: Lightweight Token Estimator

Overview

ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.

Installation

PyPI

Check Python Packaging User Guide
Run pip install tocount==0.5

Source code

Download Version 0.5 or Latest Source
Run pip install .
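
After either installation route, a quick check that the package is visible to the interpreter can save some confusion. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of ToCount.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "tocount" is the distribution name used in the pip command above.
print(installed_version("tocount"))
```

If this prints `None`, the package is not installed in the environment that is running the script.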

Models

Rule-Based

Model Name	R²	MAE	RMSE	MedAE	D²
`RULE_BASED.UNIVERSAL`	0.8175	106.70	617.78	18	0.6377
`RULE_BASED.GPT_3_5`	0.7266	152.34	756.17	35	0.4828
`RULE_BASED.GPT_4`	0.6878	161.93	808.04	40	0.4502

Tiktoken R50K

Model Name	R²	MAE	RMSE	MedAE	D²
`TIKTOKEN_R50K.LINEAR_ALL`	0.7334	152.39	733.40	28.55	0.4826
`TIKTOKEN_R50K.LINEAR_ENGLISH`	0.8703	62.76	508.20	8.87	0.7287

Tiktoken CL100K

Model Name	R²	MAE	RMSE	MedAE	D²
`TIKTOKEN_CL100K.LINEAR_ALL`	0.9127	64.09	298.02	15.73	0.6804
`TIKTOKEN_CL100K.LINEAR_ENGLISH`	0.9711	27.43	185.07	6.34	0.8527

Tiktoken O200K

Model Name	R²	MAE	RMSE	MedAE	D²
`TIKTOKEN_O200K.LINEAR_ALL`	0.9563	38.23	197.16	9.70	0.7818
`TIKTOKEN_O200K.LINEAR_ENGLISH`	0.9730	26.00	177.54	5.96	0.8581

Deepseek R1

Model Name	R²	MAE	RMSE	MedAE	D²
`DEEPSEEK_R1.LINEAR_ALL`	0.9531	40.66	212.11	10.71	0.7741
`DEEPSEEK_R1.LINEAR_ENGLISH`	0.9696	28.44	192.36	6.36	0.8477

Qwen QwQ

Model Name	R²	MAE	RMSE	MedAE	D²
`QWEN_QWQ.LINEAR_ALL`	0.9342	45.50	257.97	12.17	0.7542
`QWEN_QWQ.LINEAR_ENGLISH`	0.9570	29.06	236.10	6.68	0.8457

Llama 3.1

Model Name	R²	MAE	RMSE	MedAE	D²
`LLAMA_3_1.LINEAR_ALL`	0.9538	44.37	207.58	11.70	0.7578
`LLAMA_3_1.LINEAR_ENGLISH`	0.9731	26.59	177.94	6.24	0.8564

ℹ️ The training and testing datasets are taken from Lmsys-chat-1m [1] and Wildchat [2].
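
For readers unfamiliar with the column headers, the sketch below shows how R², MAE, RMSE and MedAE are conventionally computed from paired actual and estimated token counts. It is an illustration of the standard definitions, not the project's evaluation script; the function name and sample numbers are made up. D² is left out because the tables do not say which deviance variant they report.

```python
import math
from statistics import mean, median


def regression_metrics(actual, predicted):
    """Compute R², MAE, RMSE and MedAE for paired token counts."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    mean_actual = mean(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are equal")
    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": mean(abs_errors),
        "RMSE": math.sqrt(ss_res / len(actual)),
        "MedAE": median(abs_errors),
    }


# Made-up numbers for illustration only; not taken from the datasets above.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 39]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE squares each error before averaging, so a few very long texts can dominate it, while MedAE reflects the typical case. That may help explain why RMSE is far larger than MedAE in every row above.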

Usage

>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
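
One of the use cases named in the overview is token budgeting. The sketch below wraps the call shown above in a small budget check. `fits_budget` and `rough_estimate` are hypothetical helpers, not part of ToCount, and the characters-divided-by-four fallback is only a stand-in so the example runs when ToCount is not installed.

```python
def fits_budget(text, budget, estimate):
    """Return (ok, estimated_tokens) for `text` against a token `budget`.

    `estimate` is any callable mapping a string to an integer token estimate.
    """
    tokens = estimate(text)
    return tokens <= budget, tokens


def rough_estimate(text):
    # Hypothetical stand-in: roughly one token per four characters.
    # This is NOT how ToCount estimates tokens.
    return max(1, len(text) // 4)


try:
    # Same import and call as in the usage example above.
    from tocount import estimate_text_tokens, TextEstimator

    def estimate(text):
        return estimate_text_tokens(text, estimator=TextEstimator.RULE_BASED.UNIVERSAL)
except ImportError:
    estimate = rough_estimate

ok, tokens = fits_budget("How are you?", budget=16, estimate=estimate)
print(ok, tokens)
```

Passing the estimator as a callable keeps the budget logic independent of which model from the tables above is used.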

Issues & bug reports

Just file an issue and describe the problem. We'll check it ASAP! You can also send an email to [email protected].

Please complete the issue template

You can also join our discord server

References

1- Zheng, Lianmin, et al. "Lmsys-chat-1m: A large-scale real-world llm conversation dataset." International Conference on Learning Representations (ICLR) 2024 Spotlights.

2- Zhao, Wenting, et al. "Wildchat: 1m chatgpt interaction logs in the wild." International Conference on Learning Representations (ICLR) 2024 Spotlights.

Show your support

Star this repo

Give a ⭐️ if this project helped you!

Donate to our project

If you like our project, and we hope that you do, please consider supporting us. Our project is not, and never will be, run for profit. We need the money just so we can continue doing what we do ;-)
