If you’ve ever heard someone complaining that “this system doesn’t support double-byte characters”, or asking whether “this data’s in Unicode”, and felt as though you really ought to understand what those things mean, then this post is for you.
This isn’t a technical discussion, but I will use the terms bits and bytes. So before we go on, let’s recap and ask:
What’s a bit?
A bit is a binary digit. It can store one of two possible values: 0 or 1. Computers like bits because there are plenty of physical ways to distinguish between two possible values – for example, by switching a current on or off.
What’s a byte?
A byte is a set of 8 bits. Computers typically move data around a byte at a time. A single byte can hold any one of 2-to-the-power-8, or 256, possible values.
In the beginning there was ASCII
In order to exchange text with people using computers, we need a way of representing words as sequences of zeroes and ones.
ASCII is the American Standard Code for Information Interchange. It uses 7-bit numbers to represent the letters, numerals and common punctuation used in American English.
The fact that ASCII uses 7-bit numbers means there are 2-to-the-power-7 or 128 possible values it can represent, from 0 to 127 inclusive. Each of those 128 values is assigned to a character. For example, in ASCII the number 65 represents an upper-case letter A, 61 represents an equals sign, and so on. So if a word processor that displays ASCII text gets a byte with a value of 65, it displays an upper-case letter A on the screen.
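To see that mapping for yourself, here's a minimal sketch in Python (my choice of language for illustration; the variable names are made up). The built-in `ord` and `chr` functions convert between a character and its number.

```python
# How many values fit into 7 bits?
ascii_slots = 2 ** 7            # 128

# Character -> number, and number -> character.
code_for_a = ord("A")           # 65
char_for_61 = chr(61)           # "="

# A single byte with the value 65, interpreted as ASCII, is an upper-case A.
decoded = bytes([65]).decode("ascii")

print(ascii_slots, code_for_a, char_for_61, decoded)
```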
These mappings of numbers to characters are just a convention that someone decided on when ASCII was developed in the 1960s. There’s nothing fundamental that dictates that a capital A has to be character number 65; that’s just the number they chose back in the day.
What’s wrong with ASCII?
Nothing whatsoever – as long as you speak English. Here’s the English word “Hello” with the ASCII numbers for each character shown beneath.
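The same numbers can be reproduced with a short Python sketch (the variable names are only illustrative): encoding the word as ASCII gives one byte per letter.

```python
word = "Hello"

# Encoding as ASCII yields one byte per character; list() exposes the numbers.
ascii_numbers = list(word.encode("ascii"))

for letter, number in zip(word, ascii_numbers):
    print(letter, number)
# H 72
# e 101
# l 108
# l 108
# o 111
```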
Nothing wrong there! Let’s try something with some punctuation.
It certainly does! This is great. How about foreign languages?
Accented characters such as a-umlaut don’t exist in ASCII. You simply can’t represent them in text encoded as ASCII. The best you could do is use the unaccented equivalent and hope it doesn’t change the meaning to something rude.
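Here's a rough Python sketch of that dead end. The sample word "Bär" (German for "bear") is my own example, and stripping accents through Unicode decomposition is just one possible fallback, not a recommendation.

```python
import unicodedata

word = "B\u00e4r"  # "Bär": \u00e4 is a-umlaut (hypothetical sample word)

# Strict ASCII encoding refuses anything outside the 128 ASCII characters.
try:
    word.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False


def strip_accents(text):
    """Split accented letters into base letter + accent, then drop the accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


fallback = strip_accents(word)   # "Bar", which is a different word entirely
print(fits_in_ascii, fallback, fallback.encode("ascii"))
```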
ASCII really should have been named ASCIIWOA: the American Standard Code for Information Interchange With Other Americans.
What’s the solution?
“Well,” thought our forebears, “we could just add an extra bit. If we made ASCII an 8-bit code, it could store another 128 values. That should be enough to store all those weird, accented characters, right?”
And this is where it started to go really wrong.
Different bodies came up with different extended ASCII character sets: they all had the same, standard ASCII characters in the first 128 character slots, but the additional 128 characters varied from one character set to the next.
That had two consequences. Firstly, you couldn’t simply take text encoded in an extended ASCII dialect and display it; you had to know which dialect of extended ASCII you were dealing with. For example, in the ISO 8859-1 (Western Europe) extended ASCII character set, number 224 represents a lower-case letter A with a grave accent. However, in ISO 8859-2 (Eastern Europe) the same number represents a lower-case letter R with an acute accent. Interpreting data using the wrong character set was a recipe for disaster.
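A quick way to see the ambiguity, sketched in Python using its built-in `iso-8859-1` and `iso-8859-2` codecs (variable names are illustrative): the very same byte decodes to two different letters.

```python
raw = bytes([224])                    # one byte, value 224

western = raw.decode("iso-8859-1")    # a with grave accent
eastern = raw.decode("iso-8859-2")    # r with acute accent

# Same byte in, different characters out.
print(repr(western), repr(eastern), western == eastern)
```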
The second consequence was that if, for example, your software only understood ISO 8859-1 and you received data in ISO 8859-2, and that data included an r-acute character, you couldn’t even display it, since r-acute doesn’t exist in the ISO 8859-1 character set. The best the software could do was to replace these unprintable characters with a question mark, an empty square, or something else to indicate that an encoding problem had occurred. More often than not, programs simply displayed whichever character had the same number in their own extended ASCII character set, resulting in gibberish.
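Both failure modes can be sketched in a few lines of Python. The sample word "vŕba" is my own illustration, and `errors="replace"` is Python's way of substituting a question mark; other software made other choices.

```python
r_acute = "\u0155"   # lower-case r with acute accent

# ISO 8859-1 has no slot for r-acute, so strict encoding fails...
try:
    r_acute.encode("iso-8859-1")
    fits_in_8859_1 = True
except UnicodeEncodeError:
    fits_in_8859_1 = False

# ...and the polite fallback is a question mark.
substituted = r_acute.encode("iso-8859-1", errors="replace")

# The impolite outcome: bytes written as ISO 8859-2 but read as ISO 8859-1.
sample = "v\u0155ba"                     # hypothetical sample word
sent = sample.encode("iso-8859-2")
garbled = sent.decode("iso-8859-1")      # the r-acute turns into a-grave

print(fits_in_8859_1, substituted, garbled)
```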
You could argue that extended ASCII, flawed though it was, could at least provide a usable, basic computing experience for people in most parts of Europe, Africa and the Americas. Put another way, for Slovaks swapping data with other Slovaks, 256 characters gave them enough scope to express their language. It was only when they tried communicating with Greeks, Finns or Estonians that problems arose.
Meanwhile, in Asia…
However, a 256-character set is no use for languages like Chinese and Japanese, where there are thousands of characters in common use. Users of those languages had a whole new level of complexity to deal with.
The solution was multi-byte character sets (sometimes misleadingly referred to as double-byte character sets). These used sequences of bytes to represent individual characters, and specified rules that software developers could use to determine whether a particular byte in a stream of bytes was a continuation of the previous character or the start of a new one. Multi-byte character sets removed the 256-character limit imposed by extended ASCII.
However, like extended ASCII, they were focused on a particular language. And just like extended ASCII, multiple, incompatible standards emerged. The same code meant different things in different character sets, and as a result cross-encoding issues were as much a problem in Japan as they were in Johannesburg.
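As an illustration, here's a small Python sketch using two Japanese multi-byte encodings that ship with Python, Shift JIS and EUC-JP. Picking these two is my own choice of example; the variable names are illustrative too.

```python
text = "\u65e5\u672c"   # the two characters of "Nihon" (Japan)

sjis = text.encode("shift_jis")
eucjp = text.encode("euc_jp")

# Each character needs more than one byte, while plain ASCII still needs one.
print(len(sjis), len(eucjp), len("A".encode("shift_jis")))

# Same text, different bytes: the two standards are incompatible.
print(sjis.hex(), eucjp.hex())

# Reading Shift JIS bytes as if they were EUC-JP does not give the text back.
misread = sjis.decode("euc_jp", errors="replace")
print(misread == text)
```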
Unicode to the rescue
This is where Unicode comes in.
Designed as a single, global replacement for localised character sets, the Unicode standard is beautiful in its simplicity. In essence: collect all the characters in all the scripts known to humanity and number them in one single, canonical list. If new characters are invented or discovered, no problem, just add them to the list. The list isn’t an 8-bit list, or a 16-bit list, it’s just a list, and a roomy one: the standard leaves space for a little over 1.1 million entries, far more than have been assigned so far.
So, in Unicode, character 341 is a lower-case r-acute. Wherever you are in the world, be it in Massachusetts, Mombasa or Myanmar, character 341 is lower-case r-acute. It is the one, true Unicode number for lower-case r-acute. Likewise, character number 2979 is the rather beautiful Tamil letter NNA, character number 5084 is the equally delightful Cherokee letter DLA, and so on.
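You can look these numbers up yourself. The sketch below uses Python's standard `unicodedata` module, which carries a copy of the Unicode character names; the `names` dictionary is just an illustrative variable.

```python
import unicodedata

# Map each Unicode number mentioned above to its official character name.
names = {number: unicodedata.name(chr(number)) for number in (341, 2979, 5084)}

for number, name in names.items():
    # U+XXXX is the conventional hexadecimal way of writing a Unicode number.
    print(number, f"U+{number:04X}", name)
```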
Unicode contains entries not just for script characters but also mathematical symbols, box drawing elements, braille patterns, domino tiles and all manner of other stuff. At the time of writing, the latest version of Unicode contained over 110,000 characters.
Encoding the code
OK, this sounds great. But computers still talk to each other by swapping information a byte at a time. How do I transfer a Cherokee DLA (Unicode character 5084) using bytes?
This is where the Unicode character encodings come in. The most common, UTF-8, is a multi-byte encoding standard. For characters in the original ASCII character set, UTF-8 only needs one byte per character. (In fact, it’s completely backwards compatible with the original 7-bit ASCII standard.) For other Unicode characters, UTF-8 uses between two and four bytes per character, and just like the East Asian multi-byte character sets, it adopts a convention that dictates how to determine whether a byte is an ASCII character, the continuation of a previous character, or the start of a new one.
However, unlike those East Asian multi-byte character sets, which were locale-specific, UTF-8 encodes Unicode numbers. A particular UTF-8 byte sequence encodes a particular Unicode number, which in turn represents a particular character, regardless of where in the world you are, or which language you speak. No overlap, no ambiguity.
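To make that convention concrete, here's a minimal Python sketch. The `classify_byte` helper is hypothetical (it isn't part of any library) and it assumes the bytes are already valid UTF-8: a byte below 128 is plain ASCII, a byte whose top two bits are `10` continues the previous character, and anything else starts a new multi-byte character.

```python
def classify_byte(value):
    """Label one byte of (assumed valid) UTF-8 by the role it plays."""
    if value < 0x80:
        return "ascii"          # 0xxxxxxx: a single-byte ASCII character
    if (value & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx: continues the previous character
    return "start"              # 11xxxxxx: first byte of a multi-byte character


# The Cherokee DLA (Unicode character 5084) becomes three bytes in UTF-8.
dla_bytes = chr(5084).encode("utf-8")
print(dla_bytes.hex(" "))       # e1 8f 9c

# An upper-case A, an r-acute and a Cherokee DLA, one after another.
stream = ("A" + chr(341) + chr(5084)).encode("utf-8")
roles = [classify_byte(b) for b in stream]
print(roles)
```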
Alternative schemes for encoding Unicode numbers include UTF-16, also a variable-byte character encoding but one in which every character is represented by either one or two pairs of bytes. That ends up being wasteful if you’re dealing predominantly with English text, which is one of the reasons why UTF-8 has become the dominant choice.
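The size difference is easy to measure. In this Python sketch I use the `utf-16-le` codec so that the count isn't inflated by the byte-order mark that plain `utf-16` prepends; the sample strings and variable names are my own.

```python
english = "Hello"

utf8_size = len(english.encode("utf-8"))        # one byte per letter
utf16_size = len(english.encode("utf-16-le"))   # one pair of bytes per letter

# A domino tile (U+1F030) sits high enough in the list to need two pairs.
domino = "\U0001F030"
domino_utf16_size = len(domino.encode("utf-16-le"))
domino_utf8_size = len(domino.encode("utf-8"))

print(utf8_size, utf16_size, domino_utf16_size, domino_utf8_size)
```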
So there you have it. The history of character encoding in a nutshell. I hope you found it useful. I’m @s1mn on Twitter, drop me a line if you have any comments.