Encoding

If you’re reading this on a computer running Windows XP, I have a magic trick for you. Fire up Notepad and make yourself a file that contains the phrase Bush hid the facts. Just that phrase, nothing else: no Enter key, no trailing spaces, and don’t try to copy the italics. Type what is there, not what it looks like. Once you have the file, save it, close it, and then open it again: ta da!

If you aren’t running Windows XP, I’d like to start by offering my sincere congratulations. Unfortunately, since the magic trick was based on a bug, you won’t be able to replicate it. Go search the interwebs for it, though: it’s not what you might expect.

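The bug is caused by improper handling of the character encoding when the file is opened. A plain text file doesn’t record which encoding it was saved in, so Notepad has to guess, and for this phrase it guesses wrong: it decides that the plain ASCII bytes are 16-bit Unicode (UTF-16) and reads them two at a time. Each pair of English letters becomes a single character, and what you see is mojibake that happens to look Chinese.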
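If you don’t have Windows XP to hand, the misreading can be imitated in a few lines of Python. This is only a sketch of the effect, not of Notepad’s actual detection code: it assumes the wrong guess is UTF-16 little-endian (the encoding Windows labels "Unicode"), and the variable names are my own.

```python
# The phrase as it sits on disk: one byte per character, plain ASCII.
text = "Bush hid the facts"
raw = text.encode("ascii")

# Assumption: the wrong guess is UTF-16 little-endian, so the same
# bytes are now read two at a time, low byte first.
misread = raw.decode("utf-16-le")

print(len(raw), "bytes on disk")
print(len(misread), "characters after the wrong guess")
for ch in misread:
    # Show each resulting character with its code point.
    print(f"U+{ord(ch):04X}", ch)
```

Eighteen bytes come out as nine characters, and every one of them lands in the block Unicode reserves for Chinese, Japanese and Korean ideographs, which is why the result looks like Chinese rather than random symbols.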

One of the things that had to be decided when computers were first being programmed was how to represent the alphabet (which computers can’t understand) using numbers (which they understand very well). The original solution was, unsurprisingly, skewed very much towards the experience of those doing the programming. ASCII emerged quickly as a standard encoding for the Latin alphabet, and it works reasonably well, though far from perfectly, in this role. The main problem is that very little scope was built in for extending the encoding beyond the basic Latin alphabet. That in turn means that ASCII cannot encode all of the characters needed for most European languages, not to mention almost all Asian and African languages.
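To see that limit in practice, here is a small Python sketch that asks whether a handful of words can be written in ASCII at all. The helper name ascii_encodable and the sample words are my own choices for illustration.

```python
def ascii_encodable(word: str) -> bool:
    """Return True if every character in word exists in ASCII."""
    try:
        word.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character falls outside ASCII's 128 values.
        return False


# Hypothetical sample words: English, French, German, Japanese.
samples = ["hello", "café", "naïve", "Straße", "日本語"]
results = {word: ascii_encodable(word) for word in samples}

for word, ok in results.items():
    print(word, "fits in ASCII" if ok else "does not fit in ASCII")
```

Only the plain English word survives; a single accent is enough to fall off the edge.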

In an attempt to make up for this oversight, and to help mend the computing gap caused by each language group having its own specific encoding for the characters of its alphabet, a standard called Unicode was released in 1991. The goal of Unicode is to be all-encompassing, and to that end, the number of characters that may be encoded is 1,114,112, somewhat outnumbering the 128 afforded by ASCII. As of Unicode 6.2, there are 109,976 graphic characters and 141 format characters, covering 100 scripts.
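That 1,114,112 is not an arbitrary number: the Unicode code space is 17 "planes" of 65,536 code points each, running from U+0000 to U+10FFFF. A quick check in Python 3, where the constant names are mine:

```python
import sys

PLANES = 17                    # plane 0 (the BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 2**16  # 65,536 code points in each plane
ASCII_SIZE = 2**7              # ASCII is a 7-bit code: 128 values

total = PLANES * CODE_POINTS_PER_PLANE
print(f"{total:,} code points in Unicode")
print(f"{ASCII_SIZE} code points in ASCII")

# Python 3 reports the highest code point it supports as sys.maxunicode.
print(f"highest code point: U+{sys.maxunicode:X}")
```

Not every one of those code points can hold a character (some are reserved for other purposes), which is why the count of assigned characters is so much smaller than the size of the space.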

I guess all of this has been leading up to a suggestion. If your computer ever asks you how you want something saved, or how it should be opened, or what encoding to use to store text, your answer should always be Unicode.
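If you are the one writing the program, the same advice applies: say which encoding you mean instead of leaving the computer to guess. A minimal Python sketch, assuming UTF-8 as the particular Unicode encoding (my choice, since "Unicode" on its own doesn’t pick one) and a made-up file name:

```python
import os
import tempfile

text = "Bush hid the facts, café, 日本語"

# Work in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")  # hypothetical file name

    # Name the encoding when writing...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and name the same one when reading, so nothing has to guess.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # The raw bytes on disk, for comparison.
    with open(path, "rb") as f:
        raw_bytes = f.read()

print(round_trip == text)
print(len(text), "characters stored as", len(raw_bytes), "bytes")
```

The text comes back exactly as it went in, accents and all, because the reader and the writer agreed on the encoding in advance.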

This is from the 19th March 2013