Unicode

Reuven Lerner — Mon, 16 Sep 2013 17:19:45 +0000

Let's give credit where credit's due: Unicode is a brilliant invention that makes life easier for millions—even billions—of people on our planet. At the same time, dealing with Unicode, as well as the various encoding systems that preceded it, can be an incredibly painful and frustrating experience. I've been dealing with some Unicode-related frustrations of my own in recent days, so I thought this might be a good time to revisit a topic that every modern software developer, and especially every Web developer, should understand.

In case you don't know what Unicode is, or how it affects you, consider this: in C and in older versions of languages like Python and Ruby, a string is nothing more than a sequence of bytes. The language attaches no meaning to those bytes; you can read whatever data you want into a string, and the language will be fine with it. For example, if I fire up IPython (running Python 2.7), I can read a JPEG image into a string:


s = open('Downloads/test.jpg', 'rb').read()  # 'rb' reads the raw bytes unchanged on every platform

Most of the time, you use strings not to hold JPEG images, but rather to hold text. If your text is all in English, you're in luck, because all the characters used by the English language are defined in ASCII, a standard that defines 128 different characters, each with a unique number. Thus, character 65 is uppercase A, and the space character is number 32. ASCII is great, and it works just fine—until you want to start using languages other than English.
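The mapping between characters and numbers is easy to check interactively. Here is a minimal sketch, assuming Python 3 rather than the Python 2.7 session shown earlier; the variable names are only for illustration:

```python
# ASCII assigns each character a number from 0 to 127.
upper_a = ord('A')   # number assigned to uppercase A
space = ord(' ')     # number assigned to the space character
letter = chr(65)     # character assigned to number 65

# Encoding English-only text as ASCII yields one byte per character,
# and every byte fits in 7 bits (that is, it is below 128).
encoded = 'Hello world'.encode('ascii')
all_seven_bit = all(b < 128 for b in encoded)

print(upper_a, space, letter, len(encoded), all_seven_bit)
```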

The problem is that most languages require characters that are not used in English and that aren't defined in ASCII. This means that if you want to write words in French, let alone in Arabic or Chinese, ASCII gives you no way to represent many of the characters you need.
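To see the limitation concretely, here is a small sketch, again assuming Python 3. The helper name try_ascii is hypothetical, and the French word is written with a \u escape so the example does not depend on the source file's own encoding:

```python
def try_ascii(text):
    """Return the ASCII bytes for text, or None if it cannot be encoded."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError:
        # At least one character has no number assigned in ASCII.
        return None

english = try_ascii('cafe')
french = try_ascii('caf\u00e9')  # "cafe" with an acute accent on the e

print(english, french)
```

The English spelling encodes without trouble, while the accented spelling cannot be represented at all.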

A solution for alphabetic languages was a set of ISO standards (ISO 8859-*), which took advantage of the fact that ASCII uses only 7 bits, while data is stored and transmitted in 8-bit bytes. If you use all 8 bits, you double the number of available characters, from 128 to 256. This is more than enough for languages with a defined alphabet. Thus, Western European languages were covered by ISO-8859-1, Hebrew by ISO-8859-8 and so forth. Moreover, these ISO standards were meant to make it possible to mix the "foreign" language with English. Thus, you could have a document with English and French, or English and Arabic. ASCII characters retained their original values, and the non-ASCII characters were assigned to the upper 128 values.
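One consequence of this design is that the same byte means different things depending on which ISO 8859 table you read it with. The following sketch, assuming Python 3 and its standard 'iso-8859-1' and 'iso-8859-8' codecs, decodes the same two bytes both ways; the byte values were chosen only for illustration:

```python
# One ASCII byte (0x41) followed by one byte from the upper 128 (0xE9).
raw = b'A\xe9'

western = raw.decode('iso-8859-1')  # Western European table
hebrew = raw.decode('iso-8859-8')   # Hebrew table

# The ASCII byte decodes identically in both tables; the upper-half
# byte becomes a different character in each.
print(repr(western), repr(hebrew))
```

The first character is "A" in both results, but the second is an accented e under ISO-8859-1 and the Hebrew letter yod under ISO-8859-8. Nothing in the bytes themselves says which table was intended.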

