
How Character Encoding Works — From ASCII to UTF-8
A computer stores numbers, not letters. The letter "A" is the number 65. The emoji "🔥" is the number 128,293. A character encoding is the agreed-upon mapping between characters and numbers, and the rules for representing those numbers as bytes.
Getting encoding wrong produces mojibake — the garbled text you see when a Japanese webpage renders 文字化け ("mojibake" itself) as æ–‡å­—åŒ–ã‘, or an email shows Ã¼ instead of ü. Understanding encoding prevents this.
ASCII — The Starting Point
ASCII (American Standard Code for Information Interchange, 1963) maps 128 characters to the numbers 0-127, using 7 bits per character.
| Range | Characters |
|---|---|
| 0–31 | Control characters (newline, tab, bell, null) |
| 32–47 | Punctuation and symbols (space, !, ", #) |
| 48–57 | Digits 0–9 |
| 58–64 | More punctuation (:, ;, <, =, >, ?, @) |
| 65–90 | Uppercase A–Z |
| 91–96 | More symbols ([, \, ], ^, _, `) |
| 97–122 | Lowercase a–z |
| 123–126 | More symbols ({, \|, }, ~) |
| 127 | DEL (delete) |
ASCII is elegant: uppercase and lowercase letters differ by exactly one bit (bit 5). Digits '0'–'9' have values 48–57, so char - '0' converts a digit character to its numeric value. These properties are still exploited in code today.
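Both properties are easy to verify in Python, where `ord` and `chr` convert between characters and their numeric values:

```python
# Upper- and lowercase ASCII letters differ only in bit 5 (0x20):
# toggling that bit swaps case.
assert ord("A") ^ 0x20 == ord("a")
assert chr(ord("z") & ~0x20) == "Z"

# Digits are contiguous starting at 48, so subtracting '0'
# converts a digit character to its numeric value.
assert ord("7") - ord("0") == 7
```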
But ASCII only covers English. No accents (é, ü, ñ), no non-Latin scripts (中文, العربية, हिन्दी), no emoji. 128 characters is not enough for a global internet.
Unicode — Every Character, One Standard
Unicode assigns a unique code point (a number) to every character in every writing system. As of Unicode 15.1 (2023): 149,813 characters from 161 scripts.
Code points are written as U+XXXX:
- U+0041 = A
- U+00FC = ü
- U+4E16 = 世
- U+1F525 = 🔥
The first 128 Unicode code points are identical to ASCII — backward compatible.
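In Python, `ord` returns a character's code point and `chr` inverts it, which makes both the code point notation and the ASCII compatibility easy to check:

```python
# ord() gives the Unicode code point; chr() goes the other way.
assert ord("A") == 0x41           # U+0041
assert ord("ü") == 0xFC           # U+00FC
assert chr(0x4E16) == "世"        # U+4E16
assert f"U+{ord('🔥'):04X}" == "U+1F525"

# The first 128 code points round-trip through ASCII unchanged.
assert all(chr(i).encode("ascii")[0] == i for i in range(128))
```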
Unicode doesn't specify how code points are stored as bytes. That's the encoding's job.
UTF-8 — The Encoding That Won
UTF-8 (Unicode Transformation Format, 8-bit) encodes Unicode code points as sequences of 1-4 bytes. Invented in 1992 by Ken Thompson and Rob Pike — Thompson co-created Unix, and the two later went on to co-create Go.
The encoding rules:
| Code point range | Bytes | Binary pattern |
|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
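The bit patterns in the table translate directly into code. The sketch below (simplified: no surrogate or range validation) fills each pattern's `x` bits from the code point and checks the result against Python's built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point into UTF-8 bytes, following the table above."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

for ch in "Aü世🔥":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```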
The leading bits of the first byte tell you how many bytes the character uses. Continuation bytes always start with 10. This means you can jump into the middle of a UTF-8 stream and find the start of the next character — a critical property for error recovery and random access.
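This self-synchronizing property is simple to exploit: continuation bytes match the pattern `10xxxxxx`, i.e. `(b & 0xC0) == 0x80`, so from any byte offset you can scan backward until that test fails. A sketch:

```python
def char_start(data: bytes, i: int) -> int:
    """Back up from byte offset i to the start of the character containing it."""
    while i > 0 and (data[i] & 0xC0) == 0x80:  # 10xxxxxx = continuation byte
        i -= 1
    return i

data = "a🔥b".encode("utf-8")     # bytes: 61 F0 9F 94 A5 62
assert char_start(data, 3) == 1   # byte 3 is mid-emoji; the emoji starts at byte 1
assert char_start(data, 5) == 5   # byte 5 is "b", already a character start
```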
ASCII is valid UTF-8. Any ASCII text is already UTF-8 with no conversion needed. This is why UTF-8 became the dominant encoding — it's backward compatible with the existing English-centric internet.
UTF-8 in Practice
"Hello" = 48 65 6C 6C 6F — 5 bytes, one per character. Pure ASCII.
"Héllo" = 48 C3 A9 6C 6C 6F — 6 bytes. The é (U+00E9) takes 2 bytes (C3 A9).
"世界" = E4 B8 96 E7 95 8C — 6 bytes. Each Chinese character takes 3 bytes.
"🔥" = F0 9F 94 A5 — 4 bytes. Emoji are in the supplementary plane (U+1F525).
This means: len("Hello") = 5 in most languages (5 characters), but the byte length varies by encoding. In UTF-8, a string's byte length is always ≥ its character count. Code that assumes one byte per character breaks on non-ASCII text.
Why Does This Matter for Developers?
String length is ambiguous. Python's len("café") = 4 (characters). Rust's "café".len() = 5 (bytes, because é is 2 bytes in UTF-8). JavaScript's "🔥".length = 2 (UTF-16 code units, because 🔥 is a surrogate pair). Know which "length" your language measures.
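All three "lengths" can be computed from Python alone, since `encode` exposes the underlying byte representations:

```python
s = "café"
assert len(s) == 4                        # code points (Python's len)
assert len(s.encode("utf-8")) == 5        # UTF-8 bytes (what Rust's str::len counts)

fire = "🔥"
# UTF-16 code units (what JavaScript's .length counts): 2 bytes per unit.
assert len(fire.encode("utf-16-le")) // 2 == 2
```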
Indexing is expensive. In UTF-8, finding the 1000th character requires scanning from the beginning — you can't jump to byte 1000 because characters have variable width. This is O(n), not O(1). Languages that want O(1) indexing use fixed-width encodings internally (Python stores strings as Latin-1, UCS-2, or UCS-4 depending on content).
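The linear scan looks like this: walk the bytes, counting only lead bytes (anything that is not a `10xxxxxx` continuation byte) until the n-th character is reached. A sketch:

```python
def nth_char_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based) in UTF-8 data. O(n) scan."""
    count = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:   # not a continuation byte: a character starts here
            if count == n:
                return i
            count += 1
    raise IndexError(n)

data = "世界hi".encode("utf-8")
assert nth_char_offset(data, 2) == 6  # "h" starts after two 3-byte characters
```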
Search tokenization depends on encoding. A tokenizer splitting on word boundaries must understand that é is one character (2 bytes), not two. CJK text has no spaces between words — tokenization requires language-specific rules.
Network protocols specify encoding. HTTP headers use Content-Type: text/html; charset=utf-8. JSON must be UTF-8 (RFC 8259). YAML defaults to UTF-8, though the spec also permits UTF-16 and UTF-32. HTML defaults to UTF-8 since HTML5. If you don't specify, receivers guess — and guess wrong.
Databases need encoding alignment. PostgreSQL's TEXT type stores UTF-8. MySQL's utf8 was actually UTF-8 limited to 3 bytes (no emoji!) until utf8mb4 added full 4-byte support. Emoji in usernames broke many systems that used the wrong encoding.
The Other Encodings
UTF-16 — 2 or 4 bytes per character. Used internally by Java, JavaScript, Windows, and .NET. Wastes space on ASCII text (2 bytes per character instead of 1). Compact for CJK text (2 bytes instead of UTF-8's 3).
UTF-32 — exactly 4 bytes per character. O(1) indexing but wastes space. Used internally by Python for strings containing characters above U+FFFF.
Latin-1 (ISO 8859-1) — 1 byte per character, covers Western European languages. Still found in legacy systems and email headers.
UTF-8 dominates the web (about 98% of websites). For storage, transmission, and interchange, UTF-8 is the default choice. Use other encodings only when a specific system requires them.
Next Steps
Computation is the foundation. The next layer builds the structures that organize data:
- Data Structures — arrays, hash maps, trees, graphs — how information is organized.
- How Memory Works — where encoded characters actually live.
- How Query Processing Works — how encoding affects tokenization in search.