ASCII is by far the most commonly used character encoding because it suffices for normal English text and English has long been the dominant (natural) language used on computers. As other languages came into use on computers, other sets of characters, with different encodings, came into existence. Indeed, there is usually more than one encoding for a particular writing system. All in all, there are hundreds of different character encodings.
This proliferation of character encodings causes a lot of problems. If you receive a document from someone else, your software may not be able to display it, print it, or edit it. You may not even be able to tell what language or writing system it is in. And if you need to use multiple writing systems in the same document, matters become much worse. Life would be much simpler if there were a single, universal encoding that covered all of the characters in all of the writing systems in use.
Unicode is a character encoding standard developed by the Unicode Consortium to fulfill this need. It attempts to include in a single encoding, using a single sequence of numbers, all of the characters in all of the writing systems that anyone is likely to want to use. Some aspects of Unicode have come in for criticism, and there are some alternative proposals, but at least for now it is by far the most widely adopted universal encoding.
The current version of the Unicode standard contains almost all of the writing systems currently in use, plus a few extinct systems, such as Linear B. More writing systems will be added in the future. The following chart lists the character ranges that have thus far been defined.
Each block of 65,536 codepoints is referred to as a plane. Planes are numbered beginning at 0. Plane 0, codepoints 0x0000 through 0xFFFF, is known as the Basic Multilingual Plane or BMP because it contains the great majority of the characters in current use for the world's languages.
Most of the Unicode ranges represent a single writing system. However, this is not always the case. In some cases Unicode lumps together several writing systems. For example, what it calls the Canadian Aboriginal Syllabics is not a single writing system. It is actually the union of the Cree writing system, the Inuktitut writing system, several variants used for languages such as Slave, Dogrib, and Dene Souline (Chipewyan), and the historically related but quite different Carrier writing system. Bengali and Assamese are combined since they differ only in the use of an additional character in Assamese and in the shapes of one letter. The Chinese characters used for Chinese, Japanese, Korean and Vietnamese are combined into a single set referred to as "CJK characters".
Languages written in a combination of writing systems, such as Japanese, which is typically written in a mixture of Chinese characters, hiragana, and katakana, will make use of multiple ranges. However, a language need not make use of multiple writing systems for it to draw characters from multiple Unicode ranges. A text in a language written in a non-Roman writing system will almost always contain characters from at least two ranges, because whitespace characters such as space and line feed, the Arabic numerals, and European punctuation are widely used. These characters, which are included in the Basic Latin range, are not repeated in the other ranges. For example, here is a bit of what we would think of as pure Tamil text: இல்லையே, இது வரைக்கும் பேசவேயில்லையே. However, it actually contains several characters from the Basic Latin range because the spaces and punctuation are Basic Latin.
Here is a listing of the individual characters:
Offset | UTF-32 | Range and Name |
---|---|---|
0 | 0x00B87 | TAMIL LETTER I |
1 | 0x00BB2 | TAMIL LETTER LA |
2 | 0x00BCD | TAMIL SIGN VIRAMA |
3 | 0x00BB2 | TAMIL LETTER LA |
4 | 0x00BC8 | TAMIL VOWEL SIGN AI |
5 | 0x00BAF | TAMIL LETTER YA |
6 | 0x00BC7 | TAMIL VOWEL SIGN EE |
7 | 0x0002C | BASIC LATIN COMMA |
8 | 0x00020 | BASIC LATIN SPACE |
9 | 0x00B87 | TAMIL LETTER I |
10 | 0x00BA4 | TAMIL LETTER TA |
11 | 0x00BC1 | TAMIL VOWEL SIGN U |
12 | 0x00020 | BASIC LATIN SPACE |
13 | 0x00BB5 | TAMIL LETTER VA |
14 | 0x00BB0 | TAMIL LETTER RA |
15 | 0x00BC8 | TAMIL VOWEL SIGN AI |
16 | 0x00B95 | TAMIL LETTER KA |
17 | 0x00BCD | TAMIL SIGN VIRAMA |
18 | 0x00B95 | TAMIL LETTER KA |
19 | 0x00BC1 | TAMIL VOWEL SIGN U |
20 | 0x00BAE | TAMIL LETTER MA |
21 | 0x00BCD | TAMIL SIGN VIRAMA |
22 | 0x00020 | BASIC LATIN SPACE |
23 | 0x00BAA | TAMIL LETTER PA |
24 | 0x00BC7 | TAMIL VOWEL SIGN EE |
25 | 0x00B9A | TAMIL LETTER CA |
26 | 0x00BB5 | TAMIL LETTER VA |
27 | 0x00BC7 | TAMIL VOWEL SIGN EE |
28 | 0x00BAF | TAMIL LETTER YA |
29 | 0x00BBF | TAMIL VOWEL SIGN I |
30 | 0x00BB2 | TAMIL LETTER LA |
31 | 0x00BCD | TAMIL SIGN VIRAMA |
32 | 0x00BB2 | TAMIL LETTER LA |
33 | 0x00BC8 | TAMIL VOWEL SIGN AI |
34 | 0x00BAF | TAMIL LETTER YA |
35 | 0x00BC7 | TAMIL VOWEL SIGN EE |
36 | 0x0002E | BASIC LATIN FULL STOP |
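Listings like this are easy to produce. Here is a minimal sketch using Python's standard unicodedata module; the names it prints come straight from the Unicode character database, so the space and punctuation appear under their official names (SPACE, COMMA, FULL STOP) rather than under a "Basic Latin" label.

```python
# List the codepoint and name of every character in the Tamil example above.
import unicodedata

text = "இல்லையே, இது வரைக்கும் பேசவேயில்லையே."
for offset, ch in enumerate(text):
    print(f"{offset:2d}  0x{ord(ch):05X}  {unicodedata.name(ch)}")
```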
Many languages written in extended versions of the Roman alphabet will also draw characters from several ranges. The Basic Latin range includes the basic twenty-six letters with no diacritics. If a language uses accents or other diacritics, or if it includes additional characters, it will draw characters from the Latin-1 Supplement or one of the two Latin Extended ranges. For example, Polish makes use of most of the ordinary Roman letters as well as characters such as Ł, which belongs to the Latin Extended-A range.
The Private Use Areas allow for the inclusion of non-standard characters in Unicode text. Any group of people can agree to use a certain encoding for a certain set of characters and exchange documents using them without fear of conflict between standard Unicode characters and their private character set. If a document contains characters in one of these ranges, one will not be able to display them or manipulate them intelligently unless one knows what they are. However, software processing such a document can simply be told to ignore characters in Private Use Areas.
One current use of the Private Use Areas is for writing systems that may well eventually be included in the Unicode standard but have not yet been included. For example, yudit supports the Hungarian runes and the Klingon alphabet, both encoded in the Private Use Area. Both of these writing systems may eventually be included in the standard. Another use for the Private Use Areas is for writing systems so obscure that they may never be included in the standard.
Unicode was originally intended to use two bytes, that is, 16 bits, to represent each character. That would be sufficient for 65,536 characters. Although this may seem like a lot, it isn't really quite enough, so full Unicode makes use of 32 bits, that is, four eight-bit bytes. That's enough for 4,294,967,296 characters. In fact, although a 32-bit representation is used, the current standard actually calls for the use of only 21 bits; the high 11 bits are always 0. This provides for 2,097,152 characters, which should still be plenty. Text encoded in this form is said to be in UTF-32.
When it was first realized that more than 65,536 characters might be needed, an attempt was made to expand the character space while keeping what was basically a two-byte encoding. The result was UTF-16. UTF-16 adds a complication: surrogate pairs. The ranges 0xD800-0xDBFF, the High Surrogate Area, and 0xDC00-0xDFFF, the Low Surrogate Area, do not directly represent characters. Instead, pairs of values, one a high surrogate, the other a low surrogate, together encode a character. The low ten bits of the high surrogate are concatenated with the low ten bits of the low surrogate, yielding a 20-bit number; adding 0x10000 to that number gives the character's codepoint. Such surrogate pairs can therefore encode 1,048,576 additional characters. UTF-16 can encode a total of 65,536 - 2,048 + 1,048,576, or 1,112,064, characters. The characters in the BMP are represented by two bytes; characters outside the BMP are represented by four bytes.
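As a concrete illustration, here is a short sketch of the surrogate-pair arithmetic in Python, using Linear B qe (U+10024), which appears again below, as the example:

```python
# Compute the UTF-16 surrogate pair for a character outside the BMP.
cp = 0x10024                       # Linear B qe
offset = cp - 0x10000              # 20-bit value to be split across the pair
high = 0xD800 + (offset >> 10)     # high surrogate carries the top 10 bits
low = 0xDC00 + (offset & 0x3FF)    # low surrogate carries the bottom 10 bits
print(hex(high), hex(low))         # 0xd800 0xdc24

# Python's own UTF-16 codec produces the same pair (big-endian byte order here):
print(chr(cp).encode("utf-16-be").hex(" "))   # d8 00 dc 24
```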
One problem with UTF-32 is that every character requires four bytes, four times as much space as ASCII and other single-byte encodings require. In order to save space, a more compact form known as UTF-8 is usually used to store and exchange text. UTF-8 uses from one to four bytes to represent a character under the present limitation to 21 bits. If it were extended to use the full 32 bits available, the longest UTF-8 encodings would require six bytes. It is cleverly arranged so that ASCII characters take up only one byte. Since the first 128 Unicode characters are the ASCII characters, in the same order, a UTF-8 file containing nothing but ASCII characters is identical to an ASCII file. Other characters take up more space, depending on how large the UTF-32 code is. Here are the encodings of some characters from different ranges. The 0x indicates that these are hexadecimal (base 16) values.
UTF-32 | UTF-8 | Name |
---|---|---|
0x00041 | 0x41 | Latin capital letter A |
0x00570 | 0xD5 0xB0 | Armenian small letter ho (հ) |
0x00BA4 | 0xE0 0xAE 0xA4 | Tamil letter ta (த) |
0x04E09 | 0xE4 0xB8 0x89 | Chinese digit 3 (三) |
0x10024 | 0xF0 0x90 0x80 0xA4 | Linear B qe (𐀤) |
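These byte sequences can be checked with any UTF-8 implementation. For instance, a quick sketch using Python's built-in codec reproduces the table:

```python
# Print the UTF-8 bytes of the characters in the table above.
for ch in ("A", "\u0570", "\u0BA4", "\u4E09", "\U00010024"):
    print(f"U+{ord(ch):05X}  {ch.encode('utf-8').hex(' ').upper()}")
# U+00041  41
# U+00570  D5 B0
# U+00BA4  E0 AE A4
# U+04E09  E4 B8 89
# U+10024  F0 90 80 A4
```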
The first byte of the UTF-8 encoding of a character tells how many additional bytes are used to encode it. If the high bit of the first byte is 0, the character is an ASCII character and no additional bytes are used to encode it. If the high bit is 1, at least one additional byte is part of the encoding. The number of adjacent 1 bits, starting from the high bit, is the total number of bytes used to encode the character. For example, if the top three bits are 110, the character is encoded using two bytes. The first byte therefore consists of from zero to six 1s followed by a 0. The remaining bits can be either 1 or 0 and contribute to the encoding of the character.
The following chart shows how characters in different ranges are encoded in UTF-8. (Recall that at present only the low 21 bits of UTF-32 are to be used, so the last two ranges listed would only be used if this limit were removed.) The letter n represents a bit that contributes directly to the encoding; it can be either 0 or 1.
UTF-32 Code Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Bits | Slots |
---|---|---|---|---|---|---|---|---|
00000000 - 0000007F | 0nnnnnnn | | | | | | 7 | 128 |
00000080 - 000007FF | 110nnnnn | 10nnnnnn | | | | | 11 | 2,048 |
00000800 - 0000FFFF | 1110nnnn | 10nnnnnn | 10nnnnnn | | | | 16 | 65,536 |
00010000 - 001FFFFF | 11110nnn | 10nnnnnn | 10nnnnnn | 10nnnnnn | | | 21 | 2,097,152 |
00200000 - 03FFFFFF | 111110nn | 10nnnnnn | 10nnnnnn | 10nnnnnn | 10nnnnnn | | 26 | 67,108,864 |
04000000 - 7FFFFFFF | 1111110n | 10nnnnnn | 10nnnnnn | 10nnnnnn | 10nnnnnn | 10nnnnnn | 31 | 2,147,483,648 |
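Here is a minimal sketch of an encoder that follows the table directly, restricted to the four-byte (21-bit) forms that the present standard actually uses; it is meant to illustrate the bit layout, not to replace a real UTF-8 library.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode a codepoint (up to 21 bits) into UTF-8 following the table above."""
    if cp < 0x80:                         # 7 bits: 0nnnnnnn
        return bytes([cp])
    elif cp < 0x800:                      # 11 bits: 110nnnnn 10nnnnnn
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp < 0x10000:                    # 16 bits: 1110nnnn 10nnnnnn 10nnnnnn
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                                 # 21 bits: 11110nnn 10nnnnnn 10nnnnnn 10nnnnnn
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

print(encode_utf8(0x0BA4).hex(" "))    # e0 ae a4, Tamil letter ta
print(encode_utf8(0x10024).hex(" "))   # f0 90 80 a4, Linear B qe
```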
How this works is most easily seen by examining specific examples. Here are the bit patterns of the same characters as above. In the UTF-8 encoding a hyphen separates the bits that directly contribute to the encoding from the preceding bits.
Name | UTF-32 | UTF-8 |
---|---|---|
Latin capital letter A | 00000000000000000000000001000001 | 0-1000001 |
Armenian small letter ho | 00000000000000000000010101110000 | 110-10101 10-110000 |
Tamil letter ta | 00000000000000000000101110100100 | 1110-0000 10-101110 10-100100 |
Chinese digit 3 | 00000000000000000100111000001001 | 1110-0100 10-111000 10-001001 |
Linear B qe | 00000000000000010000000000100100 | 11110-000 10-010000 10-000000 10-100100 |
For example, take Chinese digit 3, encoded as 1110-0100 10-111000 10-001001. Stripping off the bits at the beginning that are not directly part of the encoding, we obtain: 0100 111000 001001. Concatenating these and padding out to 32 places by adding 0s on the left, we obtain: 00000000000000000100111000001001, which is the UTF-32 encoding.
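The decoding just described can also be sketched in a few lines. This is a bare-bones decoder for a single character, again just to make the bit manipulation concrete:

```python
def decode_utf8_char(b: bytes) -> int:
    """Decode the first UTF-8-encoded character in b and return its codepoint."""
    first = b[0]
    if first < 0x80:                      # single-byte (ASCII) character
        return first
    length, mask = 0, 0x80
    while first & mask:                   # count the leading 1 bits
        length += 1
        mask >>= 1
    value = first & (mask - 1)            # payload bits of the first byte
    for byte in b[1:length]:
        assert byte & 0xC0 == 0x80        # continuation bytes must start with 10
        value = (value << 6) | (byte & 0x3F)
    return value

print(hex(decode_utf8_char(bytes([0xE4, 0xB8, 0x89]))))   # 0x4e09, Chinese digit 3
```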
Although in principle we could encode 4,294,967,296 different characters in 32 bits, UTF-8 can only encode 2,216,757,376 characters in six bytes. This is unlikely to be a problem in practice. But if we really did need more than 2,216,757,376 characters, we could use a seventh byte, with the first byte set to 11111110. This would give us 36 useful bits, for an additional 68,719,476,736 slots, allowing us to encode a total of 70,936,234,112 characters. This is considerably more than can be represented in UTF-32.
Notice that it is not only the first byte in the UTF-8 encoding of a character whose upper bits play a special role. The top two bits of all non-initial bytes must be 10. This seems to be a waste, since it means that each additional byte only contributes six bits rather than eight. The reason for doing this is that it allows us to locate ourselves in a stream of UTF-8 encoded characters:
Let's consider what it would be like if we used an encoding scheme like UTF-8, in that we use the first byte of a sequence to tell us how many more bytes contribute to that character, but in which we don't mark continuation bytes by setting their high bits to 10. Since we don't need to distinguish between the leading sequences 10 and 11, we can also modify our rule for encoding the number of bytes in a character. The number of adjacent 1s at the high end of the first byte of a character will now give us the total number of additional bytes needed to complete the character. So if a byte has high bit 0, as before, that byte is a complete character in itself. If the high bits are 10, this will now mean that a total of two bytes are used. If the high bits are 110, this will now mean that a total of three bytes are used, and so forth.
Suppose that someone transmits the Japanese word くるま /kuruma/ "wheel". In UTF-32 this is encoded as 0x304F 0x308B 0x307E. In UTF-8, this becomes:
hexadecimal | 0xE3 | 0x81 | 0x8F | 0xE3 | 0x82 | 0x8B | 0xE3 | 0x81 | 0xBE |
bit pattern | 11100011 | 10000001 | 10001111 | 11100011 | 10000010 | 10001011 | 11100011 | 10000001 | 10111110 |
In our UTF-8-like system with no continuation marker, it will be encoded like this:
hexadecimal | 0xC0 | 0x30 | 0x4F | 0xC0 | 0x30 | 0x8B | 0xC0 | 0x30 | 0x7E |
bit pattern | 11000000 | 00110000 | 01001111 | 11000000 | 00110000 | 10001011 | 11000000 | 00110000 | 01111110 |
Now, suppose that in the course of transmission the third byte is lost. A program reading the UTF-8 input will detect an error because the first byte tells it the second and third bytes must have 10 as the two highest bits. As soon as it reads the third byte (that is, the original fourth byte), whose high two bits are 11, it knows that a byte is missing. Furthermore, it knows which character is damaged and can go on to read the next character. It knows that the byte it has just read (0xE3) begins a new character since its high bits are 11. The result will therefore be ?るま, where ? stands for the unknown, damaged character. In fact, no matter which byte is lost, the damaged character will be detected and the program will be able to resynchronize and locate the next character.
Suppose that the same error, loss of the third byte, happens when we are using our pseudo-UTF-8 system. We will have started off by reading the first byte, 0xC0, which will have told us that we need two more bytes to complete the character. Since the original third byte has been lost, these will be the original second and fourth, 0x30 and 0xC0. The fact that the high bits of the fourth byte are 11 does not signal an error in this system since a continuation byte is permitted to have this pattern. The first character will therefore be taken to be 0011000011000000 = 0x30C0, which is ダ (katakana /da/). The next byte is 0x30. Since its high bit is 0, no additional bytes are needed, and it will be taken to be the digit 0. The next byte is 0x8B. The leading 1 tells us that this is the first of a two byte sequence. We strip the two high bits of the first byte and concatenate the two, yielding: 00101111000000 = 0x0BC0. This is the Tamil diacritic for the vowel /i:/. The last two bytes each have a leading 0 and so stand alone. They are the digit 0 and ~. No error is detected by the program, which instead of the intended くるま produces ダ0ீ0~. A human being reading this will of course recognize it as garbled. However, he or she will have no idea what may have been intended, whereas someone who understands Japanese may well be able to guess the missing character in ?るま. Furthermore, if the error can be detected by a computer program, it may be possible to correct it immediately, whereas a human being may not look at the text until much later.
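The resynchronization behavior of real UTF-8 is easy to demonstrate. The following sketch drops the third byte of the UTF-8 stream shown above and lets Python's decoder report what it can recover; the damaged character comes back as the replacement character U+FFFD while the rest of the word survives.

```python
# Simulate losing the third byte of the UTF-8 encoding of くるま.
original = "くるま".encode("utf-8")          # e3 81 8f e3 82 8b e3 81 be
damaged = original[:2] + original[3:]        # drop the third byte (0x8F)

# In strict mode the decoder refuses the malformed input outright.
try:
    damaged.decode("utf-8")
except UnicodeDecodeError as err:
    print("error detected:", err.reason)

# With errors="replace" it substitutes U+FFFD for the damaged character and
# resynchronizes on the next lead byte, so the remaining characters survive.
print(damaged.decode("utf-8", errors="replace"))   # ?るま, with ? shown as U+FFFD
```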
One source of resistance to using UTF-8 in some countries is that it seems to privilege English and other languages that can be written using only the ASCII characters. English only takes one byte per character in UTF-8, while most of the languages of India, for instance, require three bytes per character. By the standards of today's computer processors, storage devices and transmission systems, text files are so small that it really doesn't matter, so I don't think that this is a practical concern. It's more a matter of pride and politics.
If we don't need the extinct writing systems and other fancy stuff outside of the Basic Multilingual Plane, we could all be equal and use UTF-16. English and some other languages would take twice as much space to represent, but other languages would take the same space that they do in UTF-8 or even take up less space. At least from the point of view of those of us who aren't English imperialists, this might not be a bad idea, if not for the fact that UTF-8 has another big advantage over UTF-16: UTF-8 is independent of endianness.
Whenever a number is represented by more than one byte, the question arises as to the order in which the bytes are arranged. If the most significant bits come first, that is, are stored at the lowest memory address or at the first location in the file, the representation is said to be big-endian. If the least significant bits come first, the representation is said to be little-endian.
There is a third arrangement that is historically important because it was used on the Digital Equipment PDP-11 series. In PDP-endian order, the most significant byte is the second byte, the next most significant byte is the first byte, the third most significant byte is the fourth byte, and the least significant byte is the third byte. In other words, it is "big-endian" in the sense that the first two bytes are more significant than the second two bytes, but "little-endian" in the internal ordering of the two halves.
Consider the following sequence of four bytes. The first row shows the bit pattern. The second row shows the interpretation of each byte separately as an unsigned integer.
bit pattern | 00001101 | 00000110 | 10000000 | 00000011 |
decimal value | 13 | 6 | 128 | 3 |
Here is how this four byte sequence is interpreted as an unsigned integer
under the three ordering conventions.
Little-Endian | (3 * 256 * 256 * 256) + (128 * 256 * 256) + (6 * 256) + 13 | 58,721,805 |
Big-Endian | (13 * 256 * 256 * 256) + (6 * 256 * 256) + (128 * 256) + 3 | 218,529,795 |
PDP-Endian | (6 * 256 * 256 * 256) + (13 * 256 * 256) + (3 * 256) + 128 | 101,516,160 |
These orderings are often described in terms of the sequence of
bytes, from least significant to most significant, like this:
Little-Endian | 1 | 2 | 3 | 4 |
Big-Endian | 4 | 3 | 2 | 1 |
PDP-Endian | 3 | 4 | 1 | 2 |
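These interpretations are easy to verify. Here is a small sketch using Python's struct module to read the same four bytes under the little-endian and big-endian conventions (struct has no PDP-endian mode, so that value has to be computed by hand):

```python
import struct

data = bytes([13, 6, 128, 3])            # the four-byte sequence above

little = struct.unpack("<I", data)[0]    # first byte is least significant
big = struct.unpack(">I", data)[0]       # first byte is most significant
print(little)                            # 58721805
print(big)                               # 218529795
```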
Most computers these days are little-endian since the Intel and AMD processors that most PCs use are little-endian. Digital Equipment machines from the VAX through the current Alpha series are also little-endian. On the other hand, most RISC-based processors, such as the SUN SPARC and the PowerPC, as well as the IBM 370 and Motorola 68000 series, are big-endian. A program that determines the byte order of the machine on which it is run can be downloaded here.
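The check such a program performs can be sketched in a couple of lines; this is just an illustration, not the program referred to above.

```python
import struct
import sys

# Pack the integer 1 in native byte order; on a little-endian machine the
# low-order byte comes first.
print("little-endian" if struct.pack("=I", 1)[0] == 1 else "big-endian")

# The Python runtime also reports the machine's byte order directly.
print(sys.byteorder)
```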
UTF-16 and UTF-32 are subject to endianness variation. If I write something in UTF-16 on a little-endian machine and you try to read it on a big-endian machine, it won't work. For example, suppose that I encode the Armenian character հ ho on a little-endian machine. The first byte will have the bit pattern 01110000, conventionally interpreted as 112. The second byte will have the bit pattern 00000101, conventionally interpreted as 5. That's because the UTF-32 code, 0x570 = 1392, is equal to (5 * 256) + 112. Remember, on a little-endian machine, the first byte is the least significant one. On a big-endian machine, this sequence of two bytes will be interpreted as (112 * 256) + 5 = 28,677 = 0x7005, since the first byte, 112, is the most significant on a big-endian machine. Well, 0x7005 isn't the same character as 0x570; it's a Chinese character, not հ. So, if you use UTF-16 you have to worry about byte order. UTF-8, on the other hand, is invariant under changes in endianness. That is a big enough advantage that most people will probably continue to prefer UTF-8.
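The mix-up is easy to reproduce. In the sketch below, the text is written as little-endian UTF-16 and then misread as big-endian UTF-16; a byte order mark (BOM), which the plain "utf-16" codec adds automatically, is the usual way of avoiding the problem.

```python
ho = "\u0570"                            # ARMENIAN SMALL LETTER HO (հ)

written = ho.encode("utf-16-le")         # what a little-endian machine writes
print(written.hex(" "))                  # 70 05

misread = written.decode("utf-16-be")    # what a big-endian reader makes of it
print(f"U+{ord(misread):04X}")           # U+7005, a completely different character

# Writing with the plain "utf-16" codec prepends a byte order mark so that
# any reader can tell which order was used.
print(ho.encode("utf-16").hex(" "))
```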
The various Unicode transformation formats have different storage requirements. The following charts show the minimum, maximum, and average number of bytes per character for text in different ranges in the different formats.
Minimum | ASCII | 8-bit | BMP | All Planes |
---|---|---|---|---|
UTF-32 | 4.00 | 4.00 | 4.00 | 4.00 |
UTF-16 | 2.00 | 2.00 | 2.00 | 2.00 |
UCS-2 | 2.00 | 2.00 | 2.00 | N/A |
UTF-8 | 1.00 | 1.00 | 1.00 | 1.00 |
Maximum | ASCII | 8-bit | BMP | All Planes |
---|---|---|---|---|
UTF-32 | 4.00 | 4.00 | 4.00 | 4.00 |
UTF-16 | 2.00 | 2.00 | 2.00 | 4.00 |
UCS-2 | 2.00 | 2.00 | 2.00 | N/A |
UTF-8 | 1.00 | 2.00 | 3.00 | 4.00 |
Mean | ASCII | 8-bit | BMP | All Planes |
---|---|---|---|---|
UTF-32 | 4.00 | 4.00 | 4.00 | 4.00 |
UTF-16 | 2.00 | 2.00 | 2.00 | 3.94 |
UCS-2 | 2.00 | 2.00 | 2.00 | N/A |
UTF-8 | 1.00 | 1.50 | 2.97 | 3.97 |
The fullest information is found in the Unicode standard. This is available on the Unicode Consortium web site [http://www.unicode.org], in print form, and on CD. Two files that can be obtained from the web site or the CD are often useful. The file UnicodeData.txt contains details of most characters. It is a plain text file in which, for the most part, each line contains information about one character. Each such line contains a series of fields separated by semi-colons. The first field is the code, in hexadecimal; the second field is the name of the character. The other fields contain additional information of various sorts.
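Because the format is so simple, the file is easy to process. Here is a minimal sketch that looks up a character's name, assuming a copy of UnicodeData.txt has been downloaded into the current directory (the file name and location are, of course, up to you):

```python
def unicode_name(codepoint, path="UnicodeData.txt"):
    """Return the name recorded for a codepoint in UnicodeData.txt, or None."""
    target = f"{codepoint:04X}"              # codes in the file are uppercase hex
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(";")         # fields are separated by semicolons
            if fields[0] == target:
                return fields[1]             # the second field is the name
    return None

print(unicode_name(0x0BA4))   # TAMIL LETTER TA
```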
UnicodeData.txt is intended primarily to be read by machines. Another file, NamesList.txt, contains a subset of the information in UnicodeData.txt, omitting details primarily of use to computer programs, reformatted to be more readable by human beings. This is the best place to look for a character by name.
Both of these files omit character-by-character descriptions for the Chinese characters. This information is kept in a separate file, Unihan.txt (also distributed in zip-compressed form), since it is voluminous (25Mb uncompressed, 5Mb compressed) and not needed by many users. This file does not give simple descriptions of the Chinese characters comparable to those for other characters; for the most part, the information given consists of cross references to various reference works.
There are also a variety of books about Unicode. One that I recommend is Richard Gillam's Unicode Demystified. See Recommended Reading for this and some related books.
A useful tool for dealing with Unicode is yudit, a Unicode text editor. If supplied with the appropriate fonts (sources for which are listed on the yudit website), yudit can display UTF-8 text. You can edit the displayed text, and you can enter text in several ways. By using a keymap you can type in a romanization and have the text appear in whatever writing system you choose. Numerous keymaps are supplied with yudit, but it is not difficult to write your own if necessary. If you know the code for the character you want to enter, you can enter it by its numerical code. yudit also recognizes Chinese characters drawn with the mouse. If you move the mouse over a character and left-click, yudit will display the corresponding character code.
Here is a screenshot of yudit displaying a sampling of writing systems.
Sometimes it is useful to find out about the content of a document for which you do not have the necessary fonts, which is in a writing system that you do not understand, which contains characters that are not directly visible, or whose exact encoding you want to examine. Two programs useful for such purposes are unidesc and uniname, both of which can be downloaded here. unidesc reports the character ranges to which different portions of the text belong. It can also identify Unicode encodings flagged by magic numbers. uniname prints the byte offset of each character, its hex code value, its UTF-8 encoding, and its name.
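For readers without those programs, the core of what uniname reports can be approximated with the standard library. This is only a rough sketch, not a substitute for the real tool.

```python
# Print, for each character in a UTF-8 file, its byte offset, codepoint,
# UTF-8 bytes, and name, roughly in the spirit of uniname's output.
import sys
import unicodedata

def describe(path):
    offset = 0
    with open(path, "rb") as f:
        for ch in f.read().decode("utf-8"):
            utf8 = ch.encode("utf-8")
            name = unicodedata.name(ch, "<no name>")
            print(f"{offset:8d}  U+{ord(ch):05X}  {utf8.hex(' ').upper():<11}  {name}")
            offset += len(utf8)

describe(sys.argv[1])
```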
A convenient tool for converting from one Unicode encoding to another is uniconv, which comes with the yudit editor. uniconv can convert from one Unicode encoding to another, or between Unicode and various other encodings. In addition to a number of built-in encodings, uniconv can use keymaps created for use with yudit. For example, if you have a keymap that allows you to enter text into yudit in romanization and have it appear in a non-Roman writing system, uniconv will use the same keymap to convert text from that romanization to Unicode or another encoding. The GNU program iconv can also convert between Unicode encodings and between Unicode and numerous other character encodings.
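The same sort of conversion can also be sketched with Python's built-in codecs; the file names here are just placeholders.

```python
# Re-encode a UTF-8 file as UTF-16 (any encoding known to Python can be used).
with open("input.txt", encoding="utf-8") as src, \
     open("output.txt", "w", encoding="utf-16") as dst:
    dst.write(src.read())
```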