What is UTF-8
UTF-8 is a character encoding that represents every Unicode code point using 1 to 4 bytes. It is backward compatible with ASCII — any valid ASCII text is also valid UTF-8. Over 98% of web pages use UTF-8, making it the dominant encoding on the internet.
How it works
ASCII uses 7 bits to represent 128 characters: English letters, digits, basic punctuation, and control codes. That worked for English but left out the rest of the world. Unicode assigns a unique number (called a code point) to each character across the writing systems it covers. UTF-8 encodes those code points into bytes:
| Code point range | Bytes | Bit pattern | Example |
|---|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx | A = 0x41 |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx | é = 0xC3 0xA9 |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 中 = 0xE4 0xB8 0xAD |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 🔥 = 0xF0 0x9F 0x94 0xA5 |
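
The bit patterns in the table can be applied by hand. The following is a minimal sketch, not a production encoder: the function name `encode_code_point` is made up for illustration, and the rejection of surrogate code points (U+D800 to U+DFFF) is an extra rule that the table does not show.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by following the table's bit patterns.

    Hypothetical helper for illustration; real code should use str.encode("utf-8").
    """
    # Assumption beyond the table: surrogates and out-of-range values are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


# One example per row of the table: A, é, 中, 🔥
for cp in (0x41, 0xE9, 0x4E2D, 0x1F525):
    print(f"U+{cp:04X} -> {encode_code_point(cp).hex(' ')}")
```

Each result should match what Python's built-in `chr(cp).encode("utf-8")` produces, which is the call to use in practice.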
The leading bits of each byte tell the decoder what to expect. A byte starting with 0 is a single-byte character. A byte starting with 110 begins a two-byte sequence, 1110 begins a three-byte sequence, and 11110 begins a four-byte sequence. Continuation bytes start with 10. Because start bytes and continuation bytes can never be confused, a decoder can always resynchronize after encountering corrupted data: it scans forward until it finds a byte that is not a continuation byte.
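Resynchronization can be sketched in a few lines. This is an illustrative example only: `is_continuation` and `next_boundary` are hypothetical helper names, and the sketch only skips continuation bytes without checking that the sequence it lands on is otherwise valid.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation byte."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "a\u4e2db".encode("utf-8")  # 'a', then the 3 bytes of 中, then 'b'

# Pretend a reader was dropped at index 2, in the middle of 中.
start = next_boundary(data, 2)
recovered = data[start:].decode("utf-8")
print(start, recovered)
```

At most three bytes are skipped before a start byte is reached, so a damaged sequence costs one character rather than the rest of the stream.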
The variable-width design is the key trade-off. English text stays compact (1 byte per character, identical to ASCII). Arabic letters take 2 bytes each. Hindi (Devanagari) letters and most Chinese characters take 3. Most emoji take 4. This means string length in bytes does not equal string length in characters, a critical distinction for any programmer working with text.
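The difference between byte count and character count is easy to check. The sketch below uses a made-up helper, `utf8_width`, and a handful of sample characters chosen for illustration; escapes are used so the code points are unambiguous.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "Latin A": "A",               # U+0041
    "Arabic ain": "\u0639",       # U+0639
    "Devanagari ha": "\u0939",    # U+0939
    "Chinese zhong": "\u4e2d",    # U+4E2D
    "Fire emoji": "\U0001F525",   # U+1F525
}
for label, ch in samples.items():
    print(f"{label}: {utf8_width(ch)} byte(s)")

word = "caf\u00e9"                      # "café" with a precomposed é
char_count = len(word)                  # Python counts code points
byte_count = len(word.encode("utf-8"))  # bytes after encoding
print(char_count, byte_count)
```

Python's `len` on a `str` counts code points, so the two numbers differ as soon as the text contains anything outside ASCII.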
Why it matters
UTF-8 solved the encoding fragmentation problem. Before it, different regions used incompatible encodings (Shift-JIS, Windows-1252, ISO-8859-1), and text routinely broke when crossing system boundaries. UTF-8 can represent every Unicode character, whatever the language, in a single, self-synchronizing format.
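The kind of breakage described here can be reproduced by decoding bytes with the wrong encoding. This is a small illustrative sketch; the sample word is an assumption, and `cp1252` is Python's codec name for Windows-1252.

```python
original = "caf\u00e9"            # "café"
raw = original.encode("utf-8")    # bytes as a UTF-8 system would send them

# A receiver that assumes Windows-1252 reads each of the two bytes
# of é as a separate character.
misread = raw.decode("cp1252")

# A receiver that assumes UTF-8 recovers the text unchanged.
roundtrip = raw.decode("utf-8")

print(repr(misread), repr(roundtrip))
```

The bytes never changed in transit; only the receiver's assumption about the encoding did.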
Understanding UTF-8 explains why len("café") might return 5 in programming languages that count bytes rather than characters, why hexadecimal byte dumps show multi-byte sequences for non-ASCII text, and why "number of bytes" and "number of characters" are different questions.
See How Character Encoding Works for the full story from ASCII to Unicode to UTF-8.