Unicode Text Analyzer — Character Identifier

How to Use

Paste or type your text

Enter any text into the analyzer — it accepts plain text in any language, including mixed scripts, emoji, and special characters. Paste content from documents, web pages, databases, or source code to inspect its Unicode composition.

Review the character breakdown

The tool displays a character-by-character table showing, for each grapheme cluster, its code point(s) along with the Unicode name, General Category (Ll, Lu, Nd, Po, etc.), block assignment, and script of each one. Toggle between grapheme cluster and code point views to see how composite characters are constructed.

Use the statistics panel for insights

Consult the summary statistics to see character count, code point count, byte counts in UTF-8/UTF-16/UTF-32, script distribution, category distribution, and a list of any zero-width or invisible characters present in the text.

About

Text analysis at the Unicode level reveals the hidden structure beneath what appears as simple characters on screen. Every piece of text is a sequence of code points, each carrying rich metadata defined in the Unicode Character Database (UCD): a name, a general category, a script, a block, a bidi class, a combining class, and dozens of additional properties. Inspecting this metadata transforms opaque strings into comprehensible sequences with well-defined behaviors.
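As a rough illustration of the metadata described above, the sketch below uses Python's standard unicodedata module to list a few UCD properties for each code point in a string. The function name describe and the row layout are assumptions made for this example, not how the analyzer itself is implemented. The standard module does not expose block or script, so those two properties are left out.

```python
import unicodedata

def describe(text):
    """Return one row of UCD metadata per code point in text."""
    rows = []
    for ch in text:
        rows.append({
            "code_point": f"U+{ord(ch):04X}",
            # name() raises ValueError for unnamed code points unless a default is given
            "name": unicodedata.name(ch, "<unnamed>"),
            "category": unicodedata.category(ch),
            "bidi_class": unicodedata.bidirectional(ch),
            "combining_class": unicodedata.combining(ch),
        })
    return rows

# "e" followed by a combining acute accent: two code points, one visible character
rows = describe("e\u0301")
for row in rows:
    print(row)
```

Here the second row reports the combining mark (category Mn, combining class 230), which is exactly the kind of detail that is invisible on screen.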

Practical text analysis answers questions that simple character counters cannot. A string containing "naïve" has 5 code points in NFC and 6 in NFD. A 140-character tweet, counted in Unicode scalar values, occupies 420 bytes in UTF-8 when every character is a three-byte CJK ideograph, and up to 560 bytes when every character needs four. A username that looks identical to another may contain different code points, which is the basis of a classic homograph attack. Text from different sources may contain invisible format characters, directional overrides, or non-breaking spaces that break parsing, search, or display. Analyzing the Unicode composition of text catches these issues before they reach production.
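The counts mentioned above can be checked with a few lines of standard-library Python. This is only a sketch: the variable names are invented for the example, and the sample tweet is assumed to consist entirely of the CJK ideograph U+6F22, which takes three bytes in UTF-8.

```python
import unicodedata

# "naïve" typed with a precomposed ï (U+00EF)
word = "na\u00efve"
nfc_len = len(unicodedata.normalize("NFC", word))  # ï stays one code point
nfd_len = len(unicodedata.normalize("NFD", word))  # ï becomes i + U+0308

# 140 scalar values, each a three-byte CJK ideograph in UTF-8
tweet = "\u6f22" * 140
tweet_bytes = len(tweet.encode("utf-8"))

# Two letters that typically render alike but are different code points
latin_a = "a"
cyrillic_a = "\u0430"

print(nfc_len, nfd_len, tweet_bytes)
print(latin_a == cyrillic_a, unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))
```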

The Unicode Standard's category system enables sophisticated text processing. Natural language processing pipelines use General Category to identify letters, digits, punctuation, and whitespace without language-specific rules. Regular expression engines with Unicode support use categories like \p{L} (any letter) or \p{Nd} (decimal digit) to write language-agnostic patterns. Developers building internationalized applications, security systems handling user-supplied text, or data pipelines processing multilingual content all benefit from understanding the Unicode properties of every character in their strings.
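A minimal sketch of category-driven classification follows, using only unicodedata.category. The function name classify and the label strings are hypothetical. Note that Python's built-in re module does not accept \p{...} escapes (the third-party regex package does), so this example tests the category string directly instead of using a pattern.

```python
import unicodedata

def classify(ch):
    """Map a single character to a coarse class using its General Category."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "letter"
    if cat == "Nd":
        return "digit"
    if cat.startswith("P"):
        return "punctuation"
    if cat.startswith("Z"):
        return "separator"
    return "other"

# Cyrillic capital ZHE, Arabic-Indic digit three, exclamation mark, space
sample = "\u0416\u0663! "
labels = [classify(ch) for ch in sample]
print(labels)
```

No script-specific rule appears anywhere in the function, yet the Cyrillic letter and the Arabic-Indic digit are classified the same way their ASCII counterparts would be.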

FAQ

Why can the same visible character have different code point counts depending on how it is represented?

Visible characters can be represented by multiple code points when composed using combining characters or Unicode sequences. For example, the letter "é" can be a single precomposed code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or a base letter U+0065 followed by a combining acute accent U+0301; both render identically. Similarly, many emoji are sequences of multiple code points joined by U+200D ZERO WIDTH JOINER, although these have no precomposed equivalent. Unicode normalization forms (NFC, NFD, NFKC, NFKD) map equivalent representations such as the two forms of "é" to a single one, so text should be normalized to the same form before it is compared.
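The two spellings of "é" can be compared as follows. This is a small sketch; the helper name same_text is an assumption for the example, and NFC is chosen only as a default (any single form works as long as both sides use it).

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

print(precomposed == decomposed)           # raw comparison of code points
print(same_text(precomposed, decomposed))  # comparison after normalization
```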

What is a grapheme cluster, and why does it matter for text processing?

A grapheme cluster, defined in Unicode Standard Annex #29, is the smallest unit of text that a user perceives as a single character. It may consist of one or more code points: a base character followed by combining marks, emoji modifiers, or zero-width joiner sequences. The family emoji 👨‍👩‍👧 is a single grapheme cluster composed of 5 code points (three emoji joined by two U+200D ZERO WIDTH JOINER characters), which occupy 8 UTF-16 code units. For correct text operations such as cursor movement, string truncation, or character counting that matches what a user expects, software must segment text into grapheme clusters rather than code points or bytes. Swift's String type and Python's third-party grapheme library handle this correctly; naïve byte or code point slicing does not.
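To make the difference between counting levels concrete, the sketch below measures the man-woman-girl family emoji in code points, UTF-16 code units, and UTF-8 bytes, then shows what naive slicing does to it. It deliberately does not perform grapheme segmentation, which needs a UAX #29 implementation such as the grapheme library mentioned above; the variable names are assumptions for the example.

```python
# MAN + ZWJ + WOMAN + ZWJ + GIRL, written with escapes so the sequence is explicit
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points = len(family)                           # Python str length counts code points
utf16_units = len(family.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit
utf8_bytes = len(family.encode("utf-8"))

# Slicing by code point cuts the cluster apart: this keeps MAN plus a dangling ZWJ
truncated = family[:2]

print(code_points, utf16_units, utf8_bytes)
print([f"U+{ord(ch):04X}" for ch in truncated])
```

A user sees one character, but none of the three counts is 1, which is why truncation and cursor logic should work on grapheme clusters.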

What Unicode General Categories exist and what do they mean?

Unicode assigns every code point exactly one General Category, written as a two-letter code whose first letter gives the major class. Major categories include L (Letter, subdivided into Lu uppercase, Ll lowercase, Lt titlecase, Lm modifier, Lo other), N (Number: Nd decimal digit, Nl letter number, No other number), P (Punctuation: Pc connector, Pd dash, Ps open, Pe close, Pi initial quote, Pf final quote, Po other), S (Symbol: Sm math, Sc currency, Sk modifier, So other), Z (Separator: Zs space, Zl line, Zp paragraph), C (Other: Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned), and M (Mark: Mn nonspacing, Mc spacing combining, Me enclosing). These categories drive how text processing algorithms handle tokenization, word boundaries, and casing.
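A category distribution like the one in the statistics panel could be computed along these lines. This is a sketch under the assumption that one count per code point is wanted; the function names are made up for the example.

```python
from collections import Counter
import unicodedata

def category_distribution(text):
    """Count code points per two-letter General Category (Lu, Ll, Nd, ...)."""
    return Counter(unicodedata.category(ch) for ch in text)

def major_distribution(text):
    """Count code points per major class, i.e. the first letter of the category."""
    return Counter(unicodedata.category(ch)[0] for ch in text)

dist = category_distribution("Hi, 42!")
major = major_distribution("Hi, 42!")
print(dict(dist))
print(dict(major))
```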

How can I detect invisible or potentially dangerous Unicode characters in text?

Some Unicode characters are invisible to users yet can cause security issues or unexpected behavior. Format characters (category Cf) include U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER, U+FEFF ZERO WIDTH NO-BREAK SPACE (better known as the byte order mark), and U+202E RIGHT-TO-LEFT OVERRIDE (used in spoofing attacks to disguise file extensions). Bidirectional control characters (U+202A–U+202E, U+2066–U+2069), which also belong to Cf, can reverse text direction mid-string. Analyzing text for these characters is important in security-sensitive contexts like usernames, URLs, file paths, and code review. The Unicode Consortium's confusables data (Unicode Technical Standard #39) also helps detect lookalike characters used in phishing.
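One possible way to scan for these characters is sketched below. It flags every code point in category Cf and marks the ones that fall in the bidirectional control ranges listed above. The names BIDI_CONTROLS and find_invisible are hypothetical, and the check is intentionally narrow: it does not cover confusable lookalikes or non-format characters such as U+00A0 NO-BREAK SPACE (category Zs).

```python
import unicodedata

# U+202A..U+202E and U+2066..U+2069, the ranges named in the text
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))

def find_invisible(text):
    """Return (index, code point, name, is_bidi_control) for each format character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        is_bidi = cp in BIDI_CONTROLS
        if unicodedata.category(ch) == "Cf" or is_bidi:
            hits.append((index, f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>"), is_bidi))
    return hits

# A file name that hides its real extension behind a right-to-left override
hits = find_invisible("invoice\u202Efdp.exe")
print(hits)
```

In a review tool, any hit with is_bidi_control set to True would usually deserve a closer look, since legitimate identifiers and file names rarely need directional overrides.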

What does the Unicode block and script assignment tell me about a character?

Unicode blocks are contiguous ranges of code points grouped by character type or origin, such as "Basic Latin" (U+0000–U+007F), "Arabic" (U+0600–U+06FF), or "Emoticons" (U+1F600–U+1F64F). Blocks provide a rough organizational grouping but do not carry semantic meaning. Scripts, defined in Unicode Standard Annex #24, are more semantically meaningful — they identify the writing system a character belongs to, such as Latin, Cyrillic, Han, Devanagari, or Common (for punctuation shared across scripts). Script assignment is used by font selection algorithms, spell checkers, bidirectional text layout engines, and security systems that detect mixed-script domain names (IDN homograph attacks).
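Because a block is just a code point range, a block lookup reduces to a range check. The sketch below hard-codes only the three blocks named above as an assumption for illustration; a real tool would load the complete list from the UCD file Blocks.txt. SAMPLE_BLOCKS and block_of are hypothetical names. Script lookup is not shown, since Python's standard unicodedata module does not expose the Script property.

```python
# Only the three example blocks from the text; NOT a complete block table
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0600, 0x06FF, "Arabic"),
    (0x1F600, 0x1F64F, "Emoticons"),
]

def block_of(ch):
    """Return the block name for ch, or None if it is outside the sample table."""
    cp = ord(ch)
    for start, end, name in SAMPLE_BLOCKS:
        if start <= cp <= end:
            return name
    return None

for ch in ["A", "!", "\u0628", "\U0001F600", "\u00e9"]:
    print(f"U+{ord(ch):04X}", block_of(ch))
```

The exclamation mark illustrates why blocks are only a rough grouping: it sits in the Basic Latin block, yet its script is Common rather than Latin.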

🔍 Text Analyzer

Paste any text to see each character's Unicode name, code point, category, block, and script. Detects invisible characters and combining marks.
