What is Unicode
Unicode is a universal character encoding standard that provides a unique number, called a code point, for every character used in the written languages of the world. Before Unicode, there were hundreds of different encoding systems, each covering only a subset of characters. Trying to display text from one system in another frequently resulted in garbled characters, known as mojibake. Unicode solves this by providing a single, comprehensive standard that covers all characters from all writing systems.
The Unicode standard currently defines over 149,000 characters covering 161 scripts, from widely used scripts like Latin, Cyrillic, and Chinese to historical scripts like Egyptian hieroglyphics and cuneiform. It also includes thousands of symbols, mathematical operators, technical characters, emoji, dingbats, and control characters. The standard is maintained by the Unicode Consortium, a non-profit organization whose members include Apple, Google, Microsoft, Meta, and other major technology companies.
Each Unicode character is identified by a code point written in the format U+XXXX, where XXXX is a hexadecimal number. The most commonly used characters fall in the Basic Multilingual Plane (BMP), which covers code points U+0000 through U+FFFF. Additional characters, including many emoji and historical scripts, occupy the supplementary planes from U+10000 through U+10FFFF.
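The U+XXXX notation can be produced directly from a character. The sketch below uses an illustrative helper (formatCodePoint is our name, not a standard API) built on JavaScript's codePointAt():

```javascript
// Format a character's code point in the standard U+XXXX notation.
// (formatCodePoint is an illustrative helper, not part of any standard API.)
function formatCodePoint(ch) {
  const cp = ch.codePointAt(0);
  // Pad to at least four hex digits, as the standard writes BMP code points.
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

console.log(formatCodePoint("A")); // U+0041 (Basic Multilingual Plane)
console.log(formatCodePoint("€")); // U+20AC
console.log(formatCodePoint("💩")); // U+1F4A9 (supplementary plane)
```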
Unicode Encoding: UTF-8, UTF-16, UTF-32
While Unicode defines the characters and their code points, the encoding determines how those code points are stored as bytes in computer memory and files. The three main Unicode encodings are UTF-8, UTF-16, and UTF-32, each with different trade-offs between storage efficiency and processing simplicity.
UTF-8 is the dominant encoding on the web, used by over 98% of all websites as of 2026. It uses a variable-length encoding: 1 byte for ASCII characters (U+0000 to U+007F), 2 bytes for code points up to U+07FF, 3 bytes for the rest of the BMP (up to U+FFFF), and 4 bytes for supplementary characters. UTF-8 is backward compatible with ASCII, meaning any valid ASCII text is also valid UTF-8. This compatibility made it easy to adopt incrementally across the internet.
UTF-16 uses 2 bytes for BMP characters and 4 bytes (a surrogate pair of two 16-bit code units) for supplementary characters. It is used internally by JavaScript, Java, and Windows. UTF-32 uses a fixed 4 bytes per character, which simplifies random access but wastes space for text that primarily uses ASCII or BMP characters.
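The size trade-offs can be checked directly. A minimal sketch: TextEncoder (available in browsers and Node.js) always emits UTF-8, while the UTF-16 and UTF-32 sizes follow from the code-unit and code-point counts:

```javascript
// Compare how large the same text is in the three encodings.
const text = "A€💩"; // 1-byte, 3-byte, and 4-byte characters in UTF-8

const utf8Bytes = new TextEncoder().encode(text).length;
const utf16Bytes = text.length * 2;      // .length counts UTF-16 code units
const utf32Bytes = [...text].length * 4; // spreading iterates code points

console.log(utf8Bytes, utf16Bytes, utf32Bytes); // 8 8 12
```

For mostly-ASCII text UTF-8 wins decisively; here the 4-byte emoji pushes UTF-8 and UTF-16 to the same size, while UTF-32 pays 4 bytes even for "A".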
Character Categories and Blocks
Unicode organizes characters into blocks, which are contiguous ranges of code points allocated to a specific script or purpose. The Basic Latin block (U+0000 to U+007F) contains the standard ASCII characters. Latin Extended-A and Extended-B add characters for languages that use the Latin script with diacritical marks, such as French, German, Polish, and Vietnamese.
Mathematical operators occupy several blocks, including Mathematical Operators (U+2200 to U+22FF) and Supplemental Mathematical Operators (U+2A00 to U+2AFF). These include symbols for set theory, logic, calculus, and abstract algebra. Arrows fill the Arrows block (U+2190 to U+21FF) and the Supplemental Arrows blocks, providing directional indicators in many styles.
Box Drawing characters (U+2500 to U+257F) provide line segments for creating tables and diagrams in text-mode displays. Block Elements (U+2580 to U+259F) add partial blocks for creating bar charts and graphical elements in terminal applications. These characters remain useful in command-line tools and README files displayed in code repositories.
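As a small illustration of those two blocks, the sketch below renders a bordered label and a tiny bar, the way a command-line tool might (the bar helper is our own illustrative code):

```javascript
// Box Drawing characters (U+2500–U+257F) forming a bordered label.
const box = [
  "┌─────────┐",
  "│ Unicode │",
  "└─────────┘",
].join("\n");
console.log(box);

// Block Elements: █ (U+2588, full block) and ▌ (U+258C, left half block)
// combined into a simple text-mode bar. (bar is an illustrative helper.)
const bar = (n) => "█".repeat(Math.floor(n)) + (n % 1 >= 0.5 ? "▌" : "");
console.log(bar(3.5)); // ███▌
```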
Using Special Characters in Web Development
In HTML, special characters can be represented using named entities, decimal numeric entities, or hexadecimal numeric entities. Named entities are readable mnemonics like &amp;amp; for the ampersand, &amp;copy; for the copyright symbol, and &amp;rarr; for a right arrow. Not all Unicode characters have named entities, but every character can be represented with a numeric entity like &amp;#8594; (decimal) or &amp;#x2192; (hexadecimal), both of which produce the right arrow.
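Numeric entities are mechanical to generate from a code point. A minimal sketch (the two helper names are illustrative, not a standard API):

```javascript
// Build decimal and hexadecimal numeric HTML entities for any character.
// (toDecimalEntity / toHexEntity are illustrative names, not a standard API.)
function toDecimalEntity(ch) {
  return `&#${ch.codePointAt(0)};`;
}
function toHexEntity(ch) {
  return `&#x${ch.codePointAt(0).toString(16).toUpperCase()};`;
}

console.log(toDecimalEntity("→")); // &#8594;
console.log(toHexEntity("→"));     // &#x2192;
```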
In CSS, the content property used with ::before and ::after pseudo-elements accepts Unicode escape sequences in the format \XXXX, where XXXX is the hexadecimal code point. For example, content: '\2764' inserts a heart symbol and content: '\2713' inserts a check mark. This approach is commonly used for decorative icons and indicators without adding HTML elements.
In JavaScript, Unicode characters can be included directly in strings if the source file uses UTF-8 encoding, or represented with escape sequences. The \uXXXX syntax handles BMP characters, while the \u{XXXXX} syntax (ES6+) handles any code point including supplementary characters. The String.fromCodePoint() method converts a code point number to its character, and codePointAt() performs the reverse operation.
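The two escape syntaxes and the two code point APIs can be put side by side:

```javascript
// \uXXXX handles BMP characters; \u{...} (ES6+) handles any code point.
const heart = "\u2764";   // U+2764, heavy black heart (BMP)
const poo = "\u{1F4A9}";  // U+1F4A9, pile of poo (supplementary plane)

// String.fromCodePoint() and codePointAt() are inverse operations.
console.log(String.fromCodePoint(0x1F4A9) === poo); // true
console.log(heart.codePointAt(0).toString(16));     // "2764"
console.log(poo.codePointAt(0).toString(16));       // "1f4a9"
```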
Unicode in Programming Languages
JavaScript strings are sequences of UTF-16 code units, which creates some counterintuitive behavior with supplementary characters. A single emoji like the pile of poo (U+1F4A9) occupies two code units (a surrogate pair), so its .length is 2 even though it appears as one character. The Array.from() method and the for...of loop iterate over code points correctly, while the classic index-based for loop iterates over code units.
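The code-unit versus code-point distinction shows up directly:

```javascript
const emoji = "💩"; // U+1F4A9, stored as a surrogate pair in UTF-16

console.log(emoji.length);             // 2 (UTF-16 code units)
console.log(Array.from(emoji).length); // 1 (code points)

let units = 0;
for (let i = 0; i < emoji.length; i++) units++; // index walks code units
let points = 0;
for (const ch of emoji) points++;               // for...of walks code points

console.log(units, points); // 2 1
```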
Python 3 uses Unicode strings by default, with each character represented by its code point regardless of the underlying encoding. The len() function returns the number of code points. The ord() function returns the code point of a character, and chr() returns the character for a given code point. Python source files default to UTF-8 encoding since Python 3.0.
Regular expressions require special handling for Unicode. In JavaScript, the /u flag enables Unicode mode, where . matches any code point (not just code units) and character classes like \p{Letter} match Unicode categories. Without the /u flag, supplementary characters may be split across two matches, causing incorrect results.
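The difference the /u flag makes can be demonstrated in a few lines:

```javascript
const emoji = "💩"; // U+1F4A9, a surrogate pair in UTF-16

// Without /u, "." matches a single code unit and splits the pair in two.
console.log(emoji.match(/./g).length);  // 2
// With /u, "." matches the whole code point.
console.log(emoji.match(/./gu).length); // 1

// Unicode property escapes like \p{Letter} also require /u.
console.log(/^\p{Letter}+$/u.test("héllo")); // true
console.log(/^\p{Letter}+$/u.test("h3llo")); // false
```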
History and Evolution of Unicode
The Unicode project began in the late 1980s when engineers at Xerox and Apple recognized the need for a universal character encoding. The first version, Unicode 1.0, was published in 1991 and covered 7,129 characters, primarily from scripts used in modern commerce. Early versions fit within 16 bits (65,536 code points), which seemed sufficient at the time.
As the standard expanded to include historical scripts, rare symbols, and the growing collection of CJK (Chinese, Japanese, Korean) ideographs, the 16-bit limit proved inadequate. Unicode 2.0 (1996) introduced the supplementary planes, extending the code space to over 1.1 million possible code points. This expansion was accompanied by the development of UTF-16 surrogate pairs and the now-dominant UTF-8 encoding.
The most visible recent additions have been emoji, which were first standardized in Unicode 6.0 (2010) with 722 characters imported from Japanese mobile phone character sets. The emoji collection has grown substantially with each subsequent release, driven by public proposals and a formal submission process managed by the Unicode Consortium.
Emoji and Modern Unicode
Emoji are the most publicly recognized aspect of modern Unicode development. Each emoji has a code point (or sequence of code points for modified versions), a name, and a reference design. The actual appearance varies between platforms because Apple, Google, Microsoft, Samsung, and other vendors each create their own emoji artwork that conforms to the Unicode description.
Skin tone modifiers, introduced in Unicode 8.0, are modifier characters (U+1F3FB through U+1F3FF) based on the Fitzpatrick scale; appended to a human emoji, they produce five additional skin tone variants. Zero Width Joiner (ZWJ) sequences combine multiple emoji code points into a single composite glyph, enabling representations like family groups and professions. These sequences allow new emoji representations without requiring new code points.
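Both mechanisms are just concatenations of code points. A sketch (rendering as a single glyph depends on platform support):

```javascript
// Skin tone: base emoji followed by a Fitzpatrick modifier, no joiner needed.
const wave = "\u{1F44B}";             // 👋 waving hand
const darkWave = wave + "\u{1F3FF}";  // modifier U+1F3FF appended

// ZWJ sequence: code points joined by U+200D render as one composite glyph
// on supporting platforms (here, the "woman technologist" sequence).
const zwj = "\u200D";
const womanTechnologist = "\u{1F469}" + zwj + "\u{1F4BB}"; // 👩 + ZWJ + 💻

console.log([...darkWave].length);          // 2 code points, one visible glyph
console.log([...womanTechnologist].length); // 3 code points, one visible glyph
```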
Flag emoji use a special mechanism: each flag is represented by two Regional Indicator Symbol letters that correspond to the ISO 3166-1 alpha-2 country code. For example, the US flag is the sequence U+1F1FA U+1F1F8 (Regional Indicator Symbol Letter U + Regional Indicator Symbol Letter S). This approach allows any country recognized by ISO 3166-1 to have a flag emoji without explicit standardization.
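Because Regional Indicator Symbols run alphabetically from U+1F1E6 ("A"), any flag can be built by offsetting the country code's letters. A minimal sketch (countryFlag is an illustrative helper name):

```javascript
// Build a flag emoji from an ISO 3166-1 alpha-2 country code.
// Regional Indicator Symbol Letter A is U+1F1E6; the rest follow in order.
// (countryFlag is an illustrative helper, not a standard API.)
function countryFlag(iso) {
  const base = 0x1F1E6 - "A".charCodeAt(0); // offset from 'A' to U+1F1E6
  return [...iso.toUpperCase()]
    .map((c) => String.fromCodePoint(base + c.charCodeAt(0)))
    .join("");
}

console.log(countryFlag("US")); // 🇺🇸  (U+1F1FA U+1F1F8)
console.log(countryFlag("JP")); // 🇯🇵  (U+1F1EF U+1F1F5)
```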