Vol. 32, No. 3

How clean is your document? What you need to know about metadata

by Marilyn Cavicchia

Scrub. Scrub again. And then, just for good measure, give it another scrub.

No, John Kinas wasn’t talking about removing stubborn household stains. Instead, the director of information technology for the District of Columbia Bar was advising bar communicators on how to remove potentially embarrassing or harmful metadata from documents before they leave a bar department or the bar itself.

Metadata has a bit of a bad rap lately, Kinas told attendees at the National Association of Bar Executives Communications Section Workshop in Philadelphia in October. Many people have heard it’s something to watch out for and think metadata itself is a bad thing. But metadata—information embedded in a document and not always visible—can be quite useful, he said. Kinas likened it to a price tag on a wedding gift—it’s a very useful piece of information when you’re buying the gift, but it becomes problematic if you accidentally leave it on. Just like with the price tag, he said, the important thing is to keep the metadata while it’s still useful, but then remove it before it can do any harm.

What sort of harm can metadata do? Imagine if you sent out an employee performance review or a policy paper in which all your revisions were visible, Kinas said. Or what if a firm sent legal documents that still contained comments regarding strategy in a case?

Where is metadata? Everywhere, Kinas said. Both the Microsoft Office suite and Corel WordPerfect give information that describes how, when, and by whom a data set or document was collected, created, accessed, or modified, and its size and formatting. Even this innocuous-seeming information can be dangerous, as former British Prime Minister Tony Blair found. In 2003, he sent a report regarding progress in Iraq, Kinas recounted, but someone who received it was able to easily learn that the report was composed not on a computer in Blair’s office, but on one belonging to a U.S. contractor. This sparked a scandal over such matters as who actually wrote the report and the nature of Blair’s relationship with the contractor.

Add to that type of information the popular collaborative features such as “track changes” and “versions,” and it’s easy to see how a document might leave your computer carrying more information than you had intended.

Most people are familiar with track changes, which allows a user to see corrections that have been made along the way. “Versions,” found under “file,” keeps previous versions of a document all in one place, Kinas explained. A lot of people use track changes but then forget to turn it off before sending, Kinas said. “Versions” can be another useful tool, as long as you scrub the document before it leaves you. You can set it to automatically store each saved version of a document, which can be very useful if you need to restore a document to how it was at a particular point in time, something track changes can’t do.

Aside from the fact that there may be outdated or otherwise harmful information in a prior version, another reason it’s important to scrub a document to remove the past versions or the tracked changes once you’re done, Kinas said, is that you want your final document to “stand on its own two feet.” That is, once you decide what should or shouldn’t be part of your document, you don’t want people to read it and be able to quibble about things that were in prior drafts and that you didn’t intend for them to see.
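For readers who want a quick way to check whether a file still carries tracked changes or comments, here is a minimal sketch in Python. It rests on assumptions the article does not state: that the file is a .docx (the zipped XML format introduced with Office 2007), that the main text lives in the conventional word/document.xml part, and that comments live in word/comments.xml. The function name find_review_residue is made up for illustration. It counts markup elements, not distinct edits, and it does not look for every kind of tracked change (moves and formatting changes are ignored), so treat it as a rough check and not a replacement for a scrubbing tool.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary used inside .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_review_residue(docx):
    """Count tracked insertions, tracked deletions, and comments in a .docx.

    `docx` may be a path or a binary file object. Hypothetical helper:
    it assumes the conventional part names word/document.xml and
    word/comments.xml.
    """
    counts = {"insertions": 0, "deletions": 0, "comments": 0}
    with zipfile.ZipFile(docx) as package:
        names = set(package.namelist())
        if "word/document.xml" in names:
            body = ET.fromstring(package.read("word/document.xml"))
            # <w:ins> and <w:del> mark tracked insertions and deletions.
            counts["insertions"] = sum(1 for _ in body.iter(f"{{{W}}}ins"))
            counts["deletions"] = sum(1 for _ in body.iter(f"{{{W}}}del"))
        if "word/comments.xml" in names:
            notes = ET.fromstring(package.read("word/comments.xml"))
            counts["comments"] = sum(1 for _ in notes.iter(f"{{{W}}}comment"))
    return counts
```

Any nonzero count suggests the document has not yet been scrubbed and should not leave the office as is.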

If you print your document and don’t see any metadata there, don’t assume you’re in the clear. Much metadata doesn’t show up in print, Kinas said, but is easily accessed. “This ‘hidden data’ is not hidden very well,” he noted, adding, “If you’ve got Word, you’ve got tools to see it.” For example, he said, if you hit “file properties” in Word, you get a lot of information, such as the title, the author, the number of pages, word count, character count, the revision number, and the total editing time. Some of these pieces of information might not be harmful in and of themselves, Kinas said, but it’s important to know that the figures are “not gospel truth,” so the next person viewing the document might be misled by the data regarding word count or how long it took to complete the document.
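To show how little effort it takes to read these properties without even opening Word, here is a short Python sketch. It assumes the file is a .docx (the zipped XML format that arrived with Office 2007) and that the properties sit in the conventional docProps/core.xml and docProps/app.xml parts; strictly speaking, their location is declared inside the package, so this shortcut may miss them in unusual files. The function name read_properties and the dictionary labels are invented for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the two property parts of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}

# Illustrative labels mapped to the element that holds each value.
CORE_FIELDS = {
    "title": "dc:title",
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "revision": "cp:revision",
}
APP_FIELDS = {
    "pages": "ep:Pages",
    "words": "ep:Words",
    "characters": "ep:Characters",
    "total_editing_minutes": "ep:TotalTime",
    "company": "ep:Company",
}


def read_properties(docx):
    """Return the document properties found in a .docx (path or file object).

    Hypothetical helper: assumes the conventional part names
    docProps/core.xml and docProps/app.xml.
    """
    found = {}
    parts = (("docProps/core.xml", CORE_FIELDS), ("docProps/app.xml", APP_FIELDS))
    with zipfile.ZipFile(docx) as package:
        names = set(package.namelist())
        for part, fields in parts:
            if part not in names:
                continue
            root = ET.fromstring(package.read(part))
            for label, tag in fields.items():
                node = root.find(tag, NS)
                if node is not None and node.text:
                    found[label] = node.text
    return found
```

Calling read_properties("report.docx"), where report.docx is a placeholder file name, returns whatever subset of these values the file contains, as text.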

So with all of that information, some of which you might not even realize is part of your document, how can you make sure there’s nothing potentially detrimental still in there when you hit send?

Scrubbing instructions

You can easily prevent some metadata mishaps before you even create your document, Kinas said. For example, if you have a yearly report with a lot of information that stays the same, don’t be tempted to drop the new information into the old report, because then it will carry all the old metadata with it. Instead, cut and paste into an entirely new document—that way, some metadata will follow, but not all of it. It’s a little safer that way, Kinas noted, but you’ll still need to scrub when you’re done. And never reuse a Word document that you didn’t create yourself, he added—again, that old metadata will follow along.

As for how to scrub, the most extreme option is to save the document as “plain text” (.txt). Because this removes much of the formatting, it will create difficulty for the next person who needs to work with your document.

Fortunately, there’s a less extreme route, one that preserves the formatting. Users of Microsoft Office 2003 or XP can download and install a free utility called “remove hidden data.” To learn more or to download it, visit http://support.microsoft.com/kb/834427. The 2007 Office suite automatically comes with a tool called “Document Inspector.”

How important is this type of feature? “I consider it mandatory,” Kinas said, noting that all computers at the DC Bar have the “remove hidden data” tool and that it works not just with Word, but also with Excel and PowerPoint. It’s even worth upgrading your system just so you can use either this tool or Document Inspector, he said, noting that 1997 versions of Word and older should be replaced anyway, because they pose too many security risks that have since been corrected.

If you have “remove hidden data,” the way to use it is to open your document, click on “file” and then “remove hidden data.” At this point, Kinas stressed, you should give your file a different name—once the metadata is scrubbed out, there’s no retrieving it, so you’ll probably want to retain the unscrubbed version in case you need to refer back to it. Type the new name in the box, and hit “next” and then “finish.” A notepad file will open, listing the results of the review process. For instructions on using Document Inspector, visit http://office.microsoft.com/en-us/help/HA100375931033.aspx. As with “remove hidden data,” Microsoft notes that it’s important to rename your file so you retain your unscrubbed version.
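The same habit of scrubbing into a renamed copy and keeping a log of what was removed can be sketched in a few lines of Python. This is an illustration only, not how either Microsoft tool works. It assumes a .docx file with its properties in the conventional docProps/core.xml part, and it blanks just two fields, the author and the “last modified by” name, leaving comments, tracked changes, and all other metadata untouched. The function name scrub_copy is made up for this example, and the output has not been checked against every version of Word.

```python
import zipfile
import xml.etree.ElementTree as ET

CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"

# Keep the conventional prefixes when the XML is written back out.
for prefix, uri in {
    "cp": CP,
    "dc": DC,
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}.items():
    ET.register_namespace(prefix, uri)

# The only two fields this sketch clears (an illustrative choice).
PERSONAL_TAGS = (f"{{{DC}}}creator", f"{{{CP}}}lastModifiedBy")


def scrub_copy(source, destination):
    """Write a copy of `source` to `destination` with personal names blanked.

    Returns a list of report lines, loosely echoing the results log the
    article describes. The original file is never modified.
    """
    if str(source) == str(destination):
        raise ValueError("choose a new name so the unscrubbed original is kept")
    report = []
    with zipfile.ZipFile(source) as src, zipfile.ZipFile(destination, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                root = ET.fromstring(data)
                for tag in PERSONAL_TAGS:
                    name = tag.rsplit("}", 1)[1]
                    for node in root.iter(tag):
                        if node.text:
                            node.text = ""
                            report.append("cleared " + name)
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            # Every other part of the package is copied through unchanged.
            dst.writestr(item, data)
    return report
```

For example, scrub_copy("review.docx", "review-scrubbed.docx"), with both file names as placeholders, leaves review.docx alone and returns lines such as “cleared creator” for each field it emptied. Refusing to write over the source mirrors the advice above: once the metadata is gone, there is no retrieving it.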

There’s also commercial metadata removal software out there, Kinas noted, adding that it usually costs $40 to $80 per computer. Because most work on all popular word processing and office-related programs, he explained, this can be useful in “mixed offices” where there’s a variety of operating systems and word processing programs being used.

Policy and education

It’s all well and good to install scrubbing software on the bar’s computers. But what if the staff doesn’t know how and why to use it? And what about documents composed outside of bar offices?

“It’s important for organizations to set policies to protect themselves from the risks of unintentional disclosure of metadata,” Kinas stressed. “It’s too important to be left to individual authors’ discretion.”

The policy, he said, needs to distinguish between internal documents—those still in review, which means the collaboration features are still helpful—and public ones, for which only the final, scrubbed version should be sent from the bar. And the definition of “public” should be broad enough to cover anything that will be put on a disk, on a Web site, or sent by e-mail, including any materials provided by presenters or contractors to be sent by the bar, he added.

The policy should be backed up with education, Kinas added. Someone who knows metadata—the bar’s IT person, if you have one—should train either all staff or at least any staff members who do work for wider release, such as press releases or other external communication.

What about PDFs?

Can’t you just send everything as a PDF and not worry about metadata? This is a safer way to go than sending Word documents, Kinas said, both in terms of obscuring the metadata and in terms of preventing unauthorized changes. For those reasons, and because the software to read PDFs is free and the format means it doesn’t matter what version of Word someone uses—or if they use Word at all—Kinas thinks bars should consider setting a policy that all documents leaving the bar must be sent as PDFs.

But don’t think PDFs are “bulletproof,” Kinas added, noting that there’s an “arms race” between the PDF software manufacturers and the hackers who would like to be able to make changes to PDF documents. The most recent versions of Adobe Acrobat—version 8 or higher—require passwords for users to make changes, but others are working on software to help hackers figure out the passwords.

And as for metadata, it’s true that a PDF will carry less of it, but it may still be possible for a user to see some metadata. That’s why, even when using this supposedly safe format, Kinas would still recommend scrubbing the document before making it into a PDF.

“Scrub everything,” he said, summing up his philosophy. “Scrub early, and scrub often.”