🔍 Text Analyzer
Paste any text to see each character's Unicode name, code point, category, block, and script. Detects invisible characters and combining marks.
Paste any text to see a character-by-character breakdown
Enter any text into the analyzer — it accepts plain text in any language, including mixed scripts, emoji, and special characters. Paste content from documents, web pages, databases, or source code to inspect its Unicode composition.
The tool displays a character-by-character table listing, for each grapheme cluster, its code point or code points, Unicode name, General Category (Ll, Lu, Nd, Po, etc.), block assignment, and script. Toggle between the grapheme cluster view and the code point view to see how composite characters are constructed.
Consult the summary statistics to see character count, code point count, byte counts in UTF-8/UTF-16/UTF-32, script distribution, category distribution, and a list of any zero-width or invisible characters present in the text.
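The breakdown and summary described above can be approximated in a few lines of Python. This is a minimal sketch, not the analyzer's actual implementation: describe and summarize are hypothetical helper names, the standard unicodedata module does not expose block or script, and grapheme cluster segmentation is left out because it needs a third-party library. Treating General Category Cf (format) as "invisible" is an assumption made for illustration; it misses characters such as the no-break space, which is category Zs.

```python
import unicodedata
from collections import Counter


def describe(text):
    """Return one row per code point: (U+XXXX, name, General Category)."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        # Some code points have no name; supply a default instead of raising.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((code_point, name, unicodedata.category(ch)))
    return rows


def summarize(text):
    """Return code point count, encoded sizes, category counts, and Cf characters."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # The -le variants are used so no byte order mark is added to the count.
        "utf16_bytes": len(text.encode("utf-16-le")),
        "utf32_bytes": len(text.encode("utf-32-le")),
        "categories": Counter(unicodedata.category(ch) for ch in text),
        # Assumption: format characters (Cf) stand in for "invisible" here.
        "invisible": [f"U+{ord(ch):04X}" for ch in text
                      if unicodedata.category(ch) == "Cf"],
    }


# e + combining acute accent, a zero-width space, and an emoji
sample = "e\u0301\u200b\U0001F600"
for row in describe(sample):
    print(*row, sep="\t")
print(summarize(sample))
```

The sample deliberately mixes a combining mark, a zero-width space, and a character outside the Basic Multilingual Plane, so the three byte counts and the invisible-character list all show something non-trivial.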
Text analysis at the Unicode level reveals the hidden structure beneath what appears as simple characters on screen. Every piece of text is a sequence of code points, each carrying rich metadata defined in the Unicode Character Database (UCD): a name, a general category, a script, a block, a bidi class, a combining class, and dozens of additional properties. Inspecting this metadata transforms opaque strings into comprehensible sequences with well-defined behaviors.
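As a rough illustration of the per-code-point metadata mentioned above, the sketch below looks up a handful of UCD properties through Python's standard unicodedata module. The properties function is a hypothetical helper, and it covers only the properties that module exposes; script and block are not among them.

```python
import unicodedata


def properties(ch):
    """Look up a few Unicode Character Database properties for one code point."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        # Canonical combining class; 0 means the character is not reordered.
        "combining_class": unicodedata.combining(ch),
    }


print(properties("\u0301"))  # combining acute accent
print(properties("\u05d0"))  # Hebrew letter alef
```

The combining accent reports a non-zero combining class and the bidi class NSM (non-spacing mark), while the Hebrew letter reports the right-to-left bidi class R.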
Practical text analysis answers questions that simple character counters cannot. The word "naïve" has 5 code points in NFC, where ï is a single precomposed character, and 6 in NFD, where it is an i followed by a combining diaeresis. A 140-character tweet measured in Unicode scalar values can occupy 420 bytes in UTF-8 when every character takes 3 bytes, as most CJK characters do, and up to 560 bytes when every character takes 4. A username that looks identical to another may contain different code points, which is the basis of a homograph attack. Text from different sources may contain invisible format characters, directional overrides, or non-breaking spaces that break parsing, search, or display. Analyzing the Unicode composition of text catches these issues before they reach production.
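The three cases above can be checked directly. The following is a small sketch using only Python's standard library; the sample strings (a run of one CJK character and a "paypal" lookalike) are made up for illustration.

```python
import unicodedata

# Normalization changes the code point count of the same visible word.
word = "na\u00efve"  # written with the precomposed i-with-diaeresis
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)
print(len(nfc), len(nfd))  # 5 6

# 140 scalar values, each a 3-byte CJK character in UTF-8.
cjk = "\u6f22" * 140
print(len(cjk), len(cjk.encode("utf-8")))  # 140 420

# Two strings that can render identically but differ in one code point.
latin = "paypal"
mixed = "p\u0430ypal"  # second letter is Cyrillic, not Latin
print(latin == mixed)  # False
print(unicodedata.name(latin[1]))  # LATIN SMALL LETTER A
print(unicodedata.name(mixed[1]))  # CYRILLIC SMALL LETTER A
```

Comparing names code point by code point, as in the last two lines, is the simplest way to see why two visually identical strings fail an equality check.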
The Unicode Standard's category system enables sophisticated text processing. Natural language processing pipelines use General Category to identify letters, digits, punctuation, and separators without language-specific rules. Regular expression engines with Unicode support expose categories such as \p{L} (any letter) and \p{Nd} (decimal digit), so a single pattern can match text in any script. Developers building internationalized applications, security systems handling user-supplied text, or data pipelines processing multilingual content all benefit from understanding the Unicode properties of every character in their strings.
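Category-based matching can be sketched without a regular expression engine at all. Python's built-in re module does not support \p{...} escapes (the third-party regex package does), so this example filters on unicodedata.category instead: a category starting with "L" corresponds to \p{L}, and "Nd" to \p{Nd}. The function name letters_and_digits is hypothetical.

```python
import unicodedata


def letters_and_digits(text):
    """Split out letters (category L*) and decimal digits (category Nd)."""
    letters = [ch for ch in text
               if unicodedata.category(ch).startswith("L")]
    digits = [ch for ch in text
              if unicodedata.category(ch) == "Nd"]
    return letters, digits


# Latin letters, Arabic-Indic digits, a Greek capital, and an ASCII digit
letters, digits = letters_and_digits("ab \u0661\u0662 \u03a9 7!")
print(letters)
print(digits)
```

The same two checks pick up Latin and Greek letters and both ASCII and Arabic-Indic digits, with no script-specific rules in the code.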