What is UTF-8

UTF-8 is a character encoding that represents every Unicode code point using 1 to 4 bytes. It is backward compatible with ASCII — any valid ASCII text is also valid UTF-8. Over 98% of web pages use UTF-8, making it the dominant encoding on the internet.

How it works

ASCII used 7 bits to represent 128 characters — English letters, digits, and basic punctuation. That worked for English but left out the rest of the world. Unicode assigned a unique number (called a code point) to every character in every writing system. UTF-8 encodes those code points into bytes:

Code point range      Bytes  Bit pattern                          Example
U+0000 to U+007F      1      0xxxxxxx                             A = 0x41
U+0080 to U+07FF      2      110xxxxx 10xxxxxx                    é = 0xC3 0xA9
U+0800 to U+FFFF      3      1110xxxx 10xxxxxx 10xxxxxx           中 = 0xE4 0xB8 0xAD
U+10000 to U+10FFFF   4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  🔥 = 0xF0 0x9F 0x94 0xA5
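
The bit patterns in the table can be applied by hand. The following is a minimal sketch, not a production encoder: the function name utf8_encode is made up for this illustration, and the loop at the end checks the result against Python's built-in encoder for one character from each row.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, one branch per table row."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample character per row of the table.
for ch in "A\u00e9\u4e2d\U0001F525":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(" "))
```

Each branch shifts the code point right to pick out 6 bits at a time, masks them with 0x3F, and then sets the marker bits for a start byte or continuation byte.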

The leading bits of each byte tell the decoder what to expect. A byte starting with 0 is a single-byte character. Bytes starting with 110, 1110, and 11110 begin two-, three-, and four-byte sequences. Continuation bytes start with 10. Because a start byte can never be mistaken for a continuation byte, a decoder can always resynchronize after encountering corrupted data: it skips past continuation bytes until it finds the next start byte.
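
A rough sketch of how a decoder might read those leading bits and resynchronize. The helper names sequence_length and next_start are hypothetical, and the sketch does not reject every byte value that is illegal in UTF-8; it only shows the prefix checks.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the sequence starting with first_byte should have."""
    if (first_byte & 0x80) == 0x00:   # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:   # 11110xxx
        return 4
    raise ValueError("not a start byte")


def is_continuation(byte: int) -> bool:
    """Continuation bytes have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_start(data: bytes, pos: int) -> int:
    """Return the index of the first non-continuation byte at or after pos."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "\u00e9\u4e2dA".encode("utf-8")   # c3 a9 e4 b8 ad 41
damaged = data[1:]                       # first byte lost: the stream starts mid-character
start = next_start(damaged, 0)
print(start)                             # 1
print(damaged[start:].decode("utf-8"))   # the two characters after the damaged one
```

Only the character whose start byte was lost is unrecoverable; everything after it decodes normally.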

The variable-width design is the key trade-off. English text stays compact (1 byte per character, identical to ASCII). Arabic letters take 2 bytes each. Chinese and Hindi characters typically take 3. Most emoji take 4. This means string length in bytes does not equal string length in characters, a critical distinction for any programmer working with text.
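
The per-script byte counts are easy to check. This is a small sketch with arbitrarily chosen sample words; the helper utf8_widths is a name invented for the example.

```python
def utf8_widths(text: str) -> list[int]:
    """Number of UTF-8 bytes used by each code point in text."""
    return [len(ch.encode("utf-8")) for ch in text]


# Sample words picked for illustration only.
samples = {
    "English": "hello",
    "Arabic": "سلام",
    "Hindi": "नमस्ते",
    "Chinese": "你好",
    "Emoji": "🔥",
}

for label, text in samples.items():
    widths = utf8_widths(text)
    print(f"{label:8} {len(text)} code points, {sum(widths)} bytes, "
          f"widths per code point: {sorted(set(widths))}")
```

Note that Python's len() on a str counts code points, so the byte count has to come from the encoded form.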

Why it matters

UTF-8 solved the encoding fragmentation problem. Before it, different regions used incompatible encodings (Shift-JIS, Windows-1252, ISO-8859-1), and text routinely broke when crossing system boundaries. UTF-8 can represent any character in any language in a single, self-synchronizing format.
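
The kind of breakage described above can be reproduced in a few lines. This sketch assumes a sender and receiver that disagree about the encoding; Windows-1252 is available in Python under the codec name cp1252.

```python
text = "caf\u00e9"                        # "café", with é written as an escape

# UTF-8 bytes read by a system that assumes Windows-1252:
utf8_bytes = text.encode("utf-8")         # 63 61 66 c3 a9
misread = utf8_bytes.decode("cp1252")     # each byte of é becomes its own character
print(misread)                            # cafÃ©

# Windows-1252 bytes read by a system that assumes UTF-8:
legacy_bytes = text.encode("cp1252")      # 63 61 66 e9
try:
    legacy_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False                    # 0xE9 announces a 3-byte sequence that never arrives
print(decoded_ok)                         # False
```

The first mistake silently produces wrong text, while the second fails loudly because the bytes are not valid UTF-8.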

Understanding UTF-8 explains why len("café") might return 5 in languages that count bytes rather than characters, why hexadecimal byte dumps show multi-byte sequences for non-ASCII text, and why "number of bytes" and "number of characters" are different questions.
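
A short illustration of both points, taking the string to be café with a single precomposed é (U+00E9). Python's len() on a str counts code points, so the byte-oriented answer comes from the encoded form.

```python
word = "caf\u00e9"                 # "café", with é as the single code point U+00E9
encoded = word.encode("utf-8")

print(len(word))                   # 4  (code points)
print(len(encoded))                # 5  (bytes; é takes two)
print(encoded.hex(" "))            # 63 61 66 c3 a9
```

The pair c3 a9 in the dump is the two-byte sequence for é; the three bytes before it are plain ASCII.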

See How Character Encoding Works for the full story from ASCII to Unicode to UTF-8.