What is UTF-8

UTF-8 is a character encoding that represents every Unicode code point using 1 to 4 bytes. It is backward compatible with ASCII — any valid ASCII text is also valid UTF-8. Over 98% of web pages use UTF-8, making it the dominant encoding on the internet.

How it works

ASCII used 7 bits to represent 128 characters — English letters, digits, and basic punctuation. That worked for English but left out the rest of the world. Unicode assigned a unique number (called a code point) to every character in every writing system. UTF-8 encodes those code points into bytes:

Code point range	Bytes	Bit pattern	Example
U+0000 to U+007F	1	`0xxxxxxx`	`A` = `0x41`
U+0080 to U+07FF	2	`110xxxxx 10xxxxxx`	`é` = `0xC3 0xA9`
U+0800 to U+FFFF	3	`1110xxxx 10xxxxxx 10xxxxxx`	`中` = `0xE4 0xB8 0xAD`
U+10000 to U+10FFFF	4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`	`🔥` = `0xF0 0x9F 0x94 0xA5`
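
The table above can be turned into a small encoder. The following is a minimal sketch, not how any particular library implements it; the function name `encode_code_point` is made up for illustration, and the result is checked against Python's built-in `str.encode`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table's bit patterns."""
    # Surrogates (U+D800 to U+DFFF) are reserved for UTF-16 and are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


for ch in ["A", "é", "中", "🔥"]:
    encoded = encode_code_point(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in encoder
    print(f"U+{ord(ch):04X}", encoded.hex(" "))
# U+0041 41
# U+00E9 c3 a9
# U+4E2D e4 b8 ad
# U+1F525 f0 9f 94 a5
```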

The leading bits of each byte tell the decoder what to expect. A byte starting with 0 is a single-byte character. Bytes starting with 110, 1110, and 11110 begin two-, three-, and four-byte sequences respectively. Continuation bytes start with 10. Because no start byte shares the 10 prefix, a decoder can always resynchronize after encountering corrupted data: it scans forward until it finds a byte that is not a continuation byte.
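
As a rough illustration of that prefix check, the sketch below classifies bytes by their leading bits and skips forward to the next possible start byte. The helper names (`is_continuation`, `sequence_length`, `resync`) are hypothetical, and the classification is simplified: it does not reject bytes such as 0xC0 or 0xF5, which a strict decoder would also refuse.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def sequence_length(lead: int) -> int:
    """Length announced by a lead byte, or 0 if the byte cannot start a sequence."""
    if (lead & 0b10000000) == 0:
        return 1          # 0xxxxxxx
    if (lead & 0b11100000) == 0b11000000:
        return 2          # 110xxxxx
    if (lead & 0b11110000) == 0b11100000:
        return 3          # 1110xxxx
    if (lead & 0b11111000) == 0b11110000:
        return 4          # 11110xxx
    return 0              # continuation byte or invalid prefix


def resync(data: bytes, pos: int) -> int:
    """Index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "é中".encode("utf-8")   # c3 a9 e4 b8 ad
damaged = data[1:]             # first byte lost: a9 e4 b8 ad
start = resync(damaged, 0)     # skips the orphaned continuation byte a9
print(start)                                   # 1
print(sequence_length(damaged[start]))         # 3
print(damaged[start:].decode("utf-8"))         # 中
```

Only the damaged character is lost; everything after the next start byte decodes normally.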

The variable-width design is the key trade-off. English text stays compact (1 byte per character, identical to ASCII). Arabic, Hebrew, Greek, and Cyrillic letters take 2 bytes each. Chinese, Japanese, and Hindi characters take 3. Most emoji take 4. This means string length in bytes does not equal string length in characters, a critical distinction for any programmer working with text.
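
A quick way to see the difference is to compare the two counts directly. This is a small sketch using Python, where `len` on a `str` counts code points and `len` on the encoded `bytes` counts bytes; the helper name `utf8_sizes` is made up for illustration.

```python
def utf8_sizes(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


samples = [
    "hello",   # 5 code points, 5 bytes  (1 byte each)
    "مرحبا",   # 5 code points, 10 bytes (2 bytes each)
    "你好",     # 2 code points, 6 bytes  (3 bytes each)
    "🔥",      # 1 code point, 4 bytes
]

for text in samples:
    code_points, byte_count = utf8_sizes(text)
    print(f"{text!r}: {code_points} code points, {byte_count} bytes")
```

Anything that allocates buffers, truncates strings, or enforces length limits needs to be clear about which of the two numbers it means.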

Why it matters

UTF-8 solved the encoding fragmentation problem. Before it, different regions used incompatible encodings (Shift-JIS, Windows-1252, ISO-8859-1), and text routinely broke when crossing system boundaries. UTF-8 can represent any character in any language in a single, self-synchronizing format.

Understanding UTF-8 explains why len("café") returns 5 in languages that measure strings in bytes (the é takes two), why hexadecimal byte dumps show multi-byte sequences for non-ASCII text, and why "number of bytes" and "number of characters" are different questions.
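
The "café" case can be reproduced in a few lines. This sketch assumes the precomposed é (U+00E9), written as an escape so the code point is unambiguous; `hex_dump` is a hypothetical helper name.

```python
def hex_dump(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return text.encode("utf-8").hex(" ")


word = "caf\u00e9"                 # "café" with a precomposed é

print(len(word))                   # 4  (code points)
print(len(word.encode("utf-8")))   # 5  (bytes)
print(hex_dump(word))              # 63 61 66 c3 a9
```

The first three bytes are plain ASCII; the final pair `c3 a9` is the two-byte sequence for é.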

See How Character Encoding Works for the full story from ASCII to Unicode to UTF-8.
