chardet

Universal character encoding detector.

chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x, just much faster and more accurate. Python 3.10+, zero runtime dependencies, works on PyPy.

Why chardet 7.0?

98.2% accuracy on 2,510 test files. 44x faster than chardet 6.0.0 and 4.1x faster than charset-normalizer. Language detection for every result. MIT licensed.

|                         | chardet 7.1.0 (mypyc) | chardet 7.1.0 (pure) | chardet 6.0.0 | charset-normalizer |
|-------------------------|-----------------------|----------------------|---------------|--------------------|
| Accuracy (2,510 files)  | 98.2%                 | 98.2%                | 88.3%         | 84.2%              |
| Speed                   | 533 files/s           | 372 files/s          | 12 files/s    | 129 files/s        |
| Language detection      | 95.2%                 | 95.2%                | 40.0%         | 59.0%              |
| Peak memory             | 25.9 MiB              | 25.9 MiB             | 29.5 MiB      | 101.3 MiB          |
| Streaming detection     | yes                   | yes                  | yes           | no                 |
| Encoding era filtering  | yes                   | yes                  | no            | no                 |
| Supported encodings     | 99                    | 99                   | 84            | 99                 |
| License                 | MIT                   | MIT                  | LGPL          | MIT                |

Installation

pip install chardet

Quick Start

import chardet

chardet.detect(b"Hello, world!")
# {'encoding': 'ascii', 'confidence': 1.0, 'language': 'en'}

# UTF-8 with typographic punctuation
chardet.detect("It\u2019s a lovely day \u2014 let\u2019s grab coffee.".encode("utf-8"))
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': 'en'}

# Japanese EUC-JP
chardet.detect("これは日本語のテストです。文字コードの検出を行います。".encode("euc-jp"))
# {'encoding': 'EUC-JP', 'confidence': 1.0, 'language': 'ja'}

# Get all candidate encodings ranked by confidence
text = "Le café est une boisson très populaire en France et dans le monde entier."
results = chardet.detect_all(text.encode("windows-1252"))
for r in results[:4]:
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1252 0.44
# iso8859-15 0.44
# ISO-8859-1 0.44
# MacRoman 0.42
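
A common next step is decoding with the detected encoding. A short sketch: detect() returns an encoding of None when it cannot decide, and the UTF-8 fallback below is our choice for illustration, not library behavior:

import chardet

raw = "Le café est une boisson très populaire.".encode("windows-1252")

result = chardet.detect(raw)
# encoding is None when detection fails, so fall back to UTF-8
# with replacement characters (an arbitrary fallback choice).
encoding = result["encoding"] or "utf-8"
print(raw.decode(encoding, errors="replace"))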

Streaming Detection

For large files or network streams, use UniversalDetector to feed data incrementally:

from chardet import UniversalDetector

detector = UniversalDetector()
with open("unknown.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
result = detector.close()
print(result)
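
For byte streams where line iteration is awkward (sockets, archives), feeding fixed-size chunks works the same way, and reset() lets one detector be reused across inputs. A minimal sketch; the file names and the 64 KiB chunk size are arbitrary choices, not part of the API:

from chardet import UniversalDetector

detector = UniversalDetector()
for path in ["first.bin", "second.bin"]:  # hypothetical file names
    detector.reset()  # clear any state left from the previous file
    with open(path, "rb") as f:
        while chunk := f.read(65536):  # 64 KiB per read, arbitrary size
            detector.feed(chunk)
            if detector.done:  # detector is confident; stop reading early
                break
    print(path, detector.close())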

Encoding Era Filtering

Restrict detection to specific encoding eras to reduce false positives:

from chardet import detect_all
from chardet.enums import EncodingEra

data = "Москва является столицей Российской Федерации и крупнейшим городом страны.".encode("windows-1251")

# All encoding eras are considered by default — 4 candidates across eras
for r in detect_all(data):
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1251 0.5
# MacCyrillic 0.47
# KZ1048 0.22
# ptcp154 0.22

# Restrict to modern web encodings — 1 confident result
for r in detect_all(data, encoding_era=EncodingEra.MODERN_WEB):
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1251 0.5

Encoding Filters

Restrict detection to specific encodings, or exclude encodings you don't want:

import chardet

data = "Le café est une boisson très populaire.".encode("windows-1252")

# Only consider UTF-8 and Windows-1252
chardet.detect(data, include_encodings=["utf-8", "windows-1252"])

# Consider everything except EBCDIC
chardet.detect(data, exclude_encodings=["cp037", "cp500"])
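
Restricting candidates is most useful when the source is known. For example, if input comes from either a UTF-8 web form or a legacy Windows-1252 export, an include list rules out exotic misdetections. A sketch under that assumption; decode_form_field is a hypothetical helper, not part of chardet:

import chardet

def decode_form_field(raw: bytes) -> str:
    # Hypothetical helper: input is known to be UTF-8 or Windows-1252.
    result = chardet.detect(raw, include_encodings=["utf-8", "windows-1252"])
    return raw.decode(result["encoding"] or "utf-8", errors="replace")

print(decode_form_field("Crème brûlée".encode("windows-1252")))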

CLI

chardetect somefile.txt
# somefile.txt: utf-8 with confidence 0.99

chardetect --minimal somefile.txt
# utf-8

# Include detected language
chardetect -l somefile.txt
# somefile.txt: utf-8 en (English) with confidence 0.99

# Only consider specific encodings
chardetect -i utf-8,windows-1252 somefile.txt
# somefile.txt: utf-8 with confidence 0.99

# Pipe from stdin
cat somefile.txt | chardetect
# stdin: utf-8 with confidence 0.99

What's New in 7.0

  • MIT license (previous versions were LGPL)
  • Ground-up rewrite — 12-stage detection pipeline using BOM detection, structural probing, byte validity filtering, and bigram statistical models
  • 44x faster than chardet 6.0.0 with mypyc (31x pure Python), 4.1x faster than charset-normalizer
  • 98.2% accuracy: +9.9pp vs chardet 6.0.0, +14.0pp vs charset-normalizer
  • Language detection — 95.2% accuracy across 49 languages, returned with every result
  • 99 encodings — full coverage including EBCDIC, Mac, DOS, and Baltic/Central European families
  • EncodingEra filtering — scope detection to modern web encodings, legacy ISO/Mac/DOS, mainframe, or all
  • Optional mypyc compilation — 1.42x additional speedup on CPython
  • Thread-safe: detect() and detect_all() are safe to call concurrently and scale on free-threaded Python (see the sketch after this list)
  • Same API: detect(), detect_all(), UniversalDetector, and the chardetect CLI all work as before
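
Since detect() and detect_all() are thread-safe, batches can be processed with a plain thread pool. A minimal sketch of concurrent use, reusing sample texts from earlier sections; the worker count is arbitrary:

import chardet
from concurrent.futures import ThreadPoolExecutor

samples = [
    "Hello, world!".encode("ascii"),
    "これは日本語のテストです。".encode("euc-jp"),
    "Москва является столицей России.".encode("windows-1251"),
]

# Each call is independent, so a shared pool needs no locking.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(chardet.detect, samples):
        print(result["encoding"], result["language"])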

Documentation

Full documentation is available at chardet.readthedocs.io.

License

MIT