How to Use

  1. Enter your character or text

    Type or paste any character, word, or phrase into the input field. The tool accepts raw text, Unicode code point notation (U+0041), HTML entities (&amp;), or hexadecimal code points directly.

  2. Select your target encoding

    Choose from 17 encoding formats including UTF-8 hex bytes, UTF-16 code units, UTF-32, decimal NCR, hexadecimal NCR, HTML named entity, CSS escape, JavaScript escape, Python literal, URL percent-encoding, Base64, binary, octal, and more.

  3. Copy the converted output

    Click any output field to copy the converted representation. Use the "Copy All" button to export every encoding at once in a structured format suitable for documentation or code.
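The steps above can be approximated programmatically. Below is a rough Python sketch of the kind of conversion such a tool performs, using only the standard library; the field names are illustrative, not the tool's actual output labels.

```python
import base64
import urllib.parse

def convert(ch: str) -> dict:
    """Derive several common encoding representations for one character."""
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    # Split the UTF-16BE byte string into 16-bit code units
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return {
        "code_point": f"U+{cp:04X}",
        "utf8_hex": " ".join(f"0x{b:02X}" for b in utf8),
        "utf16_units": [f"0x{u:04X}" for u in units],
        "decimal_ncr": f"&#{cp};",
        "hex_ncr": f"&#x{cp:X};",
        "url": urllib.parse.quote(ch),
        "base64": base64.b64encode(utf8).decode("ascii"),
    }

print(convert("😀"))
```

Running it on 😀 yields the code point U+1F600, UTF-8 bytes 0xF0 0x9F 0x98 0x80, and the surrogate pair 0xD83D 0xDE00 as UTF-16 code units.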

About

Unicode conversion bridges the gap between the abstract identity of a character and its concrete binary representation in various computing contexts. The Unicode Standard assigns a unique code point to every character across all of the world's writing systems, symbols, and emoji — over 154,000 characters in Unicode 16.0 spanning 168 scripts. Understanding how these code points translate into bytes, escape sequences, and markup forms is fundamental to building software that handles multilingual text correctly.

Different encoding formats serve different purposes. UTF-8 is the dominant encoding for the web and file storage, using 1–4 bytes per code point while remaining backward-compatible with ASCII. UTF-16 is used internally by JavaScript, Java, and Windows, representing most characters in 2 bytes but requiring surrogate pairs for supplementary characters. HTML numeric character references (&#decimal; or &#xhex;) and named entities (&amp;, &copy;) embed characters safely in markup. CSS escapes (\000041) and JavaScript \uXXXX and \u{XXXXX} sequences are necessary for embedding characters in stylesheets and scripts respectively.
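These escape syntaxes can all be derived from the code point. The following Python sketch spells a single character in HTML, CSS, and JavaScript escape forms; the dictionary keys are my own naming, not a standard API.

```python
def escapes(ch: str) -> dict:
    """Spell one character in HTML, CSS, JS, and Python escape syntax."""
    cp = ord(ch)
    # UTF-16 code units, for the legacy JS \uXXXX (surrogate-pair) form
    utf16 = ch.encode("utf-16-be")
    js_units = "".join(
        f"\\u{int.from_bytes(utf16[i:i + 2], 'big'):04X}"
        for i in range(0, len(utf16), 2)
    )
    return {
        "html_decimal": f"&#{cp};",
        "html_hex": f"&#x{cp:X};",
        "css": f"\\{cp:X} ",               # trailing space terminates the escape
        "js_codepoint": f"\\u{{{cp:X}}}",  # ES2015 \u{...} form
        "js_units": js_units,
        "python": ascii(ch).strip("'"),    # e.g. \U0001f600
    }

print(escapes("😀"))
```

Note that the ES2015 \u{1F600} form is one escape, while the older \uXXXX syntax needs two escapes (the surrogate pair) for the same emoji.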

Practical knowledge of Unicode conversion prevents common bugs: mojibake occurs when a byte sequence encoded in one format is decoded as another; incorrect URL encoding causes broken links; truncating UTF-8 strings at byte boundaries rather than code point boundaries corrupts characters. Tools that expose all 17 encoding representations simultaneously allow developers to audit every layer of a character's representation at once, making it easier to trace encoding errors from HTML source through JavaScript runtime to database storage.
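Both failure modes are easy to reproduce in a few lines of Python, which makes them concrete:

```python
# Mojibake: UTF-8 bytes decoded with the wrong codec
text = "café"
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Truncation: cutting UTF-8 at a byte boundary splits a multi-byte character
data = "naïve".encode("utf-8")  # ï occupies two bytes: 0xC3 0xAF
broken = data[:3]               # slice lands inside the ï
print(broken.decode("utf-8", errors="replace"))  # na� (U+FFFD replacement)
```

The replacement character U+FFFD in the second output is the telltale sign of a string truncated mid-sequence rather than at a code point boundary.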

FAQ

What is the difference between a Unicode code point and a UTF-8 byte sequence?
A Unicode code point is the abstract numerical identity assigned to each character in the Unicode Standard, written as U+XXXX. It exists independent of any physical storage format. UTF-8 is a variable-width encoding that represents code points as sequences of 1 to 4 bytes. For example, U+0041 (LATIN CAPITAL LETTER A) is a single byte 0x41 in UTF-8, while U+1F600 (GRINNING FACE) requires four bytes: 0xF0 0x9F 0x98 0x80. The code point is the identity; the byte sequence is the serialization.
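The identity/serialization split is visible directly in Python, where `ord()` gives the code point and `str.encode()` gives the byte sequence:

```python
# Code point (abstract identity) vs UTF-8 bytes (serialization)
for ch in ("A", "😀"):
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    print(f"U+{cp:04X} -> {utf8.hex(' ').upper()}")
# U+0041 -> 41
# U+1F600 -> F0 9F 98 80
```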
Why do some Unicode characters require surrogate pairs in UTF-16?
UTF-16 uses 16-bit code units, which can directly represent code points in the Basic Multilingual Plane (U+0000 to U+FFFF). However, Unicode defines code points up to U+10FFFF, totaling over 1.1 million possible characters. To encode supplementary characters (U+10000 and above), UTF-16 uses two consecutive 16-bit code units called a surrogate pair: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). This is why JavaScript's string.length can return 2 for a single emoji — it counts UTF-16 code units, not code points.
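The surrogate-pair arithmetic itself is short enough to write out. This sketch implements the split described above:

```python
def surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into UTF-16 high/low surrogates."""
    assert cp > 0xFFFF, "BMP code points need no surrogates"
    v = cp - 0x10000                # leaves a 20-bit value
    high = 0xD800 + (v >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

high, low = surrogate_pair(0x1F600)
print(f"U+{high:04X} U+{low:04X}")  # U+D83D U+DE00
```

This is exactly why a lone 😀 occupies two UTF-16 code units: its UTF-16BE encoding is four bytes, D8 3D DE 00.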
What does "U+" notation mean, and why is it the standard way to reference Unicode characters?
The "U+" prefix is the conventional notation established by the Unicode Consortium to reference code points unambiguously. The hexadecimal number following U+ identifies the character's position in Unicode's 17 planes (each plane contains 65,536 code points). Basic Multilingual Plane characters use 4 hex digits (U+0041), while supplementary plane characters use 5 or 6 digits (U+1F600). This notation is language- and encoding-agnostic, making it the universal reference format in standards documents, bug reports, and source code comments.
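Because each plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A small illustrative helper (the formatting choices are mine, not part of the standard):

```python
def describe(cp: int) -> str:
    """Format a code point in U+ notation and name its plane (0-16)."""
    plane = cp >> 16  # each plane spans 0x10000 code points
    return f"U+{cp:04X} (plane {plane})"

print(describe(0x0041))   # U+0041 (plane 0)
print(describe(0x1F600))  # U+1F600 (plane 1)
```

The `:04X` format pads BMP code points to four hex digits while letting supplementary ones grow to five or six, matching the convention described above.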
How does URL percent-encoding handle non-ASCII Unicode characters?
URL percent-encoding, defined in RFC 3986, encodes characters as UTF-8 bytes where each byte is represented as "%" followed by two uppercase hexadecimal digits. For a multi-byte character like ñ (U+00F1), the UTF-8 encoding is 0xC3 0xB1, so the percent-encoded form is "%C3%B1". Emoji like 😀 (U+1F600) produce four UTF-8 bytes encoded as "%F0%9F%98%80". Modern browsers and servers expect UTF-8 as the basis for percent-encoding per RFC 3986 and the WHATWG URL Standard.
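Python's standard library follows this UTF-8-based scheme, so the examples above can be checked directly with `urllib.parse`:

```python
from urllib.parse import quote, unquote

# quote() encodes to UTF-8 bytes, then percent-encodes each byte
print(quote("ñ"))         # %C3%B1
print(quote("😀"))        # %F0%9F%98%80

# unquote() reverses the process, decoding the bytes as UTF-8
print(unquote("%C3%B1"))  # ñ
```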
What are numeric character references (NCR) and when should I use them in HTML?
Numeric character references are HTML syntax for representing Unicode characters using their code point number, either as decimal (&#9829;) or hexadecimal (&#x2665;) notation. They are useful when your HTML file's character encoding does not support the character directly, or when you need to include characters that have special meaning in HTML (like < and >) without using named entities. In HTML5 with UTF-8 encoding, you can use characters directly without NCRs, but NCRs remain valuable in legacy systems, email templates, XML documents, and contexts where encoding support is uncertain.
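Generating NCRs is a one-line format string over the code point, and the standard library's `html.unescape` parses all three reference styles back to characters:

```python
import html

# Generate decimal and hexadecimal NCRs from a character
cp = ord("♥")              # U+2665 BLACK HEART SUIT
print(f"&#{cp};")          # &#9829;
print(f"&#x{cp:X};")       # &#x2665;

# Parse NCRs and named entities back with the standard library
print(html.unescape("&#9829; &#x2665; &hearts;"))  # ♥ ♥ ♥
```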

🔄 Unicode Converter

Convert between characters and Unicode code points, HTML entities, CSS/JS escapes, Python/Java literals, and UTF-8/16/32 byte sequences.