Why Word is Bad for the Web.
Every so often, you might see text on the web that appears to be corrupted in some way. It's full of odd foreign letters to the point where it's almost unreadable, and it takes ages to load. Believe it or not, the culprit is very often a program that many people use every day: Microsoft Word, the world's most popular word processor. That's because, while Word might be perfectly good for producing documents to print and email to people, Word is bad for the web.
The Quote Problem.
All those foreign letters you see in that text started out as nothing more than an attempt to make documents a tiny bit nicer to look at. You see, the design of the keyboard comes from the age of typewriters, and the symbols on it are the ones a typewriter could print. We're stuck with our keyboard designs, but they were never meant to account for all the extra letters and characters included in modern fonts. This led to the quote problem.
What's the quote problem? Well, to answer that, take a look at your keyboard. Notice how there's only one kind of double-quote mark: the straight one. Worse, when you want a single quote, you have to use the same key as for apostrophes! Now, if you were writing on paper, you'd put differently shaped quotes at the start and end of a quote, instead of just making straight lines. Altogether, five different marks on paper (opening and closing double quotes, opening and closing single quotes, and the apostrophe) get only two symbols on the keyboard.
Long ago, Microsoft decided to solve this problem. First, they set up Word to look for quote marks and replace them with nicer, curly quotes, known as 'smart quotes'. Then, they took some unused character codes – hey, what could anyone ever want those for? – and decided that they would represent these new, pretty quotes.
Everything was fine until, years later, people started copying text they'd written in Word and pasting it onto the web. Because Microsoft didn't stick to any international standard when they chose how to represent their smart quotes, the quotes ended up displaying as all sorts of unintended strange letters in web browsers. Word's users never meant to do this, but Word had gone ahead and done it for them, because the smart quotes feature is turned on by default!
Not so smart after all, was it?
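To see the mix-up in action, here is a minimal Python sketch. It assumes the character codes in question are the ones in the Windows-1252 code page, which this article doesn't name, and the helper plain_quotes is made up purely for illustration. It shows the same curly quotes read three ways, then swapped back for the straight quotes on the keyboard.

```python
# Assumption: Word's smart quotes use the Windows-1252 code page ("cp1252").
SMART_TO_PLAIN = {
    "\u2018": "'",  # opening single quote
    "\u2019": "'",  # closing single quote, also used as the apostrophe
    "\u201c": '"',  # opening double quote
    "\u201d": '"',  # closing double quote
}


def plain_quotes(text: str) -> str:
    """Swap curly quotes for the straight ones found on the keyboard."""
    return text.translate(str.maketrans(SMART_TO_PLAIN))


curly = "\u201cIt\u2019s"  # an opening double quote, then "It's" with a curly apostrophe

# In cp1252 each curly quote is a single byte in the 0x80-0x9F range.
word_bytes = curly.encode("cp1252")

# Read as ISO-8859-1 (Latin-1), those same bytes are invisible control codes.
as_latin1 = word_bytes.decode("latin-1")

# UTF-8 bytes misread as cp1252 give the familiar run of accented letters.
garbled = curly.encode("utf-8").decode("cp1252")

if __name__ == "__main__":
    print(word_bytes)
    print(repr(as_latin1))
    print(garbled)
    print(plain_quotes(curly))
```

Running plain_quotes over text before it goes anywhere near a web page is the blunt fix. The kinder fix is to keep the curly quotes and make sure the page declares the encoding it was actually saved in.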
Of course, there's more to all this. When Microsoft finally caught on that the web was going to be big, they quickly added web features to Word, not least of which is the ability to save documents as HTML. Unfortunately for the rest of the world, though, Microsoft again failed to stick to any standards at all. They made up their own HTML tags to represent the layout of Word documents, purely so that the documents would look the same if people reopened them in Word and saved them in another format. These proprietary tags now pollute HTML documents all over the web, simply because the people who created the pages by saving as HTML in Word don't know enough to remove them. The extra markup also makes pages load much more slowly.
Worse, even if you do remove all the Word-specific tags from the documents, the leftover HTML is still a nightmare. Presumably Microsoft decided to re-use the HTML generation engine from FrontPage, with the same kinds of results – a complete and utter mess.
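If you're stuck with a page that was saved from Word, a rough clean-up pass might look like the Python sketch below. The article doesn't list the offending markup, so the patterns here (conditional comments, o:p tags, Mso class names and mso- style properties) are assumptions about what Word typically emits, and strip_word_markup is a made-up helper name. It relies on regular expressions rather than a real HTML parser, so treat it as a starting point, not a complete cleaner.

```python
import re


def strip_word_markup(html: str) -> str:
    """Remove some markup assumed to be typical of Word's 'Save as HTML'."""
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.DOTALL)
    # Empty Office paragraph markers: <o:p> and </o:p>
    html = re.sub(r"</?o:p[^>]*>", "", html)
    # Class attributes whose value starts with Mso, quoted or not
    html = re.sub(r'\s+class="?Mso\w*"?', "", html)
    # mso-* declarations inside style attributes
    html = re.sub(r"mso-[a-z-]+\s*:[^;\"']*;?\s*", "", html)
    # Style attributes left empty by the previous step
    html = re.sub(r'\s+style=""', "", html)
    return html


if __name__ == "__main__":
    sample = (
        '<p class=MsoNormal style="mso-margin-top-alt:auto;color:red">'
        "Hi<o:p></o:p></p>"
    )
    print(strip_word_markup(sample))
```

Even after a pass like this, the structure that remains is whatever Word generated, so the result is smaller but not necessarily any tidier.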
Do you think it ends there? Amazingly, it doesn't. For their latest versions of Word, Microsoft decided it'd be great to add something they called 'smart tags', a kind of 'link' that adds contextual information to things you type. For example, if you type an address in your document, Word turns it into a link through to a map. Useful? Very rarely.
The problem comes when documents containing smart tags are saved as HTML – the tags are saved too! This means that documents all over the web have odd text linked to completely frivolous places, simply because Word thought it looked like an address. Not only do these links take ages to load correctly, but they're ugly too.
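Getting rid of saved smart tags is mostly a matter of unwrapping them. The short Python sketch below assumes the tags show up in the HTML as elements with an st1: prefix, which is an assumption on my part and not something this article specifies. The helper unwrap_smart_tags is hypothetical too. It removes the wrapper and keeps the text you actually typed.

```python
import re

# Assumption: smart tags are saved as elements like <st1:City ...>text</st1:City>.
SMART_TAG = re.compile(r"</?st1:[^>]*>")


def unwrap_smart_tags(html: str) -> str:
    """Remove smart-tag wrappers but keep the text they enclosed."""
    return SMART_TAG.sub("", html)


if __name__ == "__main__":
    sample = '<p>Meet me in <st1:City w:st="on">Paris</st1:City>.</p>'
    print(unwrap_smart_tags(sample))
```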
What might Microsoft Word unleash upon the web next? We can only wait in fear.