Nov/Dec 2001

The Lingua Franca of Document Markup


Long ago, copy editors would write formatting instructions (boldface this, italicize that, change font here, etc.) directly on the text of copy destined for publication in books, magazines and newspapers. It was the combination of these instructions, or markups, and the text that made a complete document. All very meticulous, exacting stuff.

As computers proliferated, copy editors embraced the early text processors and formatting tools. The problem was that each of these early systems approached document markup differently. Raw text was not the issue. From the beginning, any text-processing software could deal with raw text. But each of these tools had its own increasingly arcane way of encoding that raw text with formatting commands. Documents created in one system were rendered unreadable by any other system (similar to the Word to WordPerfect conversion problem, but way worse).

If you were a company like Boeing Aircraft Corporation, where the documentation for a 747 could run for millions of pages, you needed technical manuals that would survive decades of software and hardware updates. Publishing software that mangled your documents with esoteric formatting codes was not going to do much for the longevity of your precious instruction books. So, if a digital document were to have any permanence, there would have to be a way to maintain the integrity and portability of the data while still applying the necessary formatting commands. By the early 70s, it became clear that about the only solution was to separate content (text, pictures, tables—the data) from its structure and appearance. And so began the search for a standardized text description language.


Rather than merely concoct something quick and dirty, IBM and others put 15 years into creating the mother of all document description languages: the Standard Generalized Markup Language (SGML). First published in 1986, this thing is big and complicated.

SGML puts all the structural-formatting instructions into "tags" embedded in the document. These tags are themselves made up of text, which is simply surrounded by angle brackets. An example: <b> and </b> to mark the beginning and end of a bold section of text. Everything inside the angle brackets is an instruction to the computer; everything outside the tags is data. The fact that the tags themselves are plain text is important. Remember, anything can read plain text. The tags are really descriptions of what the computer is supposed to do with the text (or picture, or movie, or table). Software then interprets these descriptions and translates them into instructions to the computer to do the heavy lifting necessary to deliver the author’s or designer’s intended formatting result.
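The split between instruction and data can be seen in a few lines of Python. This is only a rough sketch, not a real SGML parser: the function name split_markup, the sample sentence and the b tags in it are made up for illustration, and the one rule it applies is the one described above, that whatever sits between angle brackets is a tag and everything else is data.

```python
import re

def split_markup(text):
    """Split marked-up text into ("tag", name) and ("data", text) pieces.

    Illustrative only: real SGML has many more rules than this.
    """
    pieces = []
    # The capturing group makes re.split keep the tags in the result.
    for token in re.split(r"(<[^<>]*>)", text):
        if not token:
            continue  # skip empty strings produced between adjacent tags
        if token.startswith("<") and token.endswith(">"):
            pieces.append(("tag", token[1:-1]))   # inside the brackets: instruction
        else:
            pieces.append(("data", token))        # outside the brackets: content
    return pieces

# Hypothetical sample sentence with a bold section.
sample = "Fasten the <b>red</b> lever."
pieces = split_markup(sample)
for kind, value in pieces:
    print(kind, repr(value))
```

Because both the tags and the data are plain text, a program this small is enough to pull them apart, which is much of the reason such documents stay readable as software comes and goes.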

Governments and the defense, computer and aerospace industries quickly latched onto SGML for their industrial-strength document needs.


At the same time, Big Science was happening at the European Center for Nuclear Research (CERN) in Switzerland. Its research machines were big, its computers were big, and its documentation was big. So naturally, by the late 1980s, CERN was experimenting with SGML, which, as it happens, was also big.

Tim Berners-Lee, a CERN researcher at the time, saw that the then-primitive Internet provided a way for information on one computer to make its way to any other computer on the Net. He was also aware of some wild ideas that had been resonating quietly in the background for decades. Vannevar Bush had suggested in an Atlantic article back in 1945 that the world’s information would soon be available to everyone through a device he called a "Memex." Twenty years later, this image had inspired Ted Nelson to suggest a system of "non-sequential writing" that he called "hypertext." Inspired by these ideas, Berners-Lee made it his modest goal to connect everything to everything else using simple tools that could be run on any kind of computer.

Berners-Lee realized that he would increase his chances of success if he employed tools that were familiar to prospective users. With SGML stirring up curiosity around CERN, Berners-Lee naturally drifted to it as a way to help his project along. SGML’s portability, promise of longevity and separation of format from data made it a good candidate as a lingua franca for data linking and exchange. But SGML was too big of a gun, so he simplified it enormously into what he called the Hypertext Markup Language. HTML used a similar but less-rigorous system of tags that allowed documents to be formatted reliably on any kind of computer.

More importantly, though, Berners-Lee added a capability to HTML that had profound implications for the future of computing. He created a class of tags that acted as pointers to other documents regardless of where those documents were—as long as they were on the Internet. It could be a document on the computer next door or on a computer in Tokyo. It didn’t matter. To put it all together, he invented what he called the Hypertext Transfer Protocol (the http:// part at the beginning of most URLs) that made sure that the pointers actually worked the way they were supposed to.
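To make the pointer idea concrete, here is a small Python sketch that reads a scrap of HTML and lists where its links point. In HTML the pointer is the a (anchor) tag, and its href attribute holds the address of the target document. The class name LinkCollector, the sample page and both example addresses are invented for illustration; the parsing itself is done by Python's standard html.parser and urllib.parse modules.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkCollector(HTMLParser):
    """Collect the target address of every <a href="..."> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page fragment; the addresses are placeholders, not real documents.
page = (
    '<p>See the <a href="http://example.org/manual.html">manual</a> '
    'or the <a href="http://example.jp/index.html">Tokyo copy</a>.</p>'
)

collector = LinkCollector()
collector.feed(page)
links = collector.links

for link in links:
    parts = urlsplit(link)
    # scheme is the "http" at the front; netloc names the computer holding the document.
    print(parts.scheme, parts.netloc, parts.path)
```

The tag says only where the other document lives. Whether that computer is next door or in Tokyo makes no difference to the markup.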

In late 1990, Berners-Lee packaged up HTML and HTTP along with some other software magic and gave the world a new medium. He called this new environment the World Wide Web.

Mark Tamminga practices law and fiddles with software at Gowling Lafleur Henderson LLP in Toronto.