Our experience motoring around the Internet has created a happy sense that searching is so easy, it’s almost hands-free. Of course, driving an effective search requires experience behind the wheel. Google’s simple interface—and the raw horsepower of its search engine—can lull even a novice driver into thinking the cruise control is always on. When it comes to e-discovery, however, such pleasant conceits can lead to an unforeseen sharp turn and, at times, a disastrous crash in litigation.
What gives us this comfortable sense that we always have a tight grip on the wheel? The answer is indexing. Most Google users think that when they enter a search term or expression into the search window, Google then goes out and scours the entire Internet from top to bottom. This pleasant and quaint assumption is far from the truth. What Google actually searches are petabytes of data located on the estimated one million servers owned by Google that store the data captured when Google’s bots crawl the Internet. When you search with Google (or any Internet search engine), you are not searching the Internet directly. You are searching indexes on Google’s servers. That is why the search is faster than the winning lap at the Daytona 500.
Similarly, processing and indexing are at the heart of e-discovery. Without indexing, your e-discovery tool would have to scan every document for the sought-after terms. This is possible but slow. Indexing creates blazing speed.
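The difference between scanning and indexing can be sketched in a few lines of Python. This is a toy illustration, not any vendor’s implementation: the document texts and numbers are invented, and a real engine would store far more than a word-to-document map. The point is that the scan must reread every document on every search, while the index answers each lookup with a single dictionary access.

```python
# Toy comparison: scanning every document per search vs. one-time indexing.
def scan_search(docs, term):
    """Slow path: read the full text of every document on every search."""
    return {doc_id for doc_id, text in docs.items()
            if term in text.lower().split()}

def build_index(docs):
    """One-time cost: map each word to the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

# Hypothetical document collection and document numbers.
docs = {1: "the quarterly report", 2: "the audit report", 3: "meeting notes"}
index = build_index(docs)

print(scan_search(docs, "report"))  # {1, 2}, found by scanning everything
print(index["report"])              # {1, 2}, same answer, no scanning at all
```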
Indexing is not new. We have been indexing since the beginning of writing. Almost all textbooks, for example, have an index. A resource book without a good index is incredibly frustrating. Processing creates an electronic index that makes effective e-discovery review and analysis possible.
The computer doesn’t speak English and doesn’t know what it is indexing. The processing engine indexes what information retrieval professionals call “tokens.” Some processing engines eliminate (or do not tokenize) what are called “noise” or “stop” words. A huge percentage of our writing contains little words such as “a,” “an,” “and,” “to,” and “of.” Although these words function as syntactical glue, they have little or no search value. Many processing engines have hundreds of stop words that are eliminated. Once the “noise” is removed, the meaty words that remain form an index.
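A minimal sketch of that pipeline, assuming a hypothetical engine with a small illustrative stop-word list (real engines may carry hundreds of stop words), might look like this: the text is split into lowercase tokens, the “noise” words are dropped, and only what remains is eligible for the index.

```python
import re

# Small illustrative stop-word list; real processing engines use far larger ones.
STOP_WORDS = {"a", "an", "and", "to", "of", "the", "or", "not", "be", "is"}

def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_tokens(text):
    """Tokenize, then drop stop words; only the remaining tokens are indexed."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

print(index_tokens("A memorandum of understanding and an exhibit"))
# ['memorandum', 'understanding', 'exhibit']
```

The little syntactical words vanish at processing time; the “meaty” words survive to be indexed.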
While enhancing speed, the elimination of stop words can also create e-discovery search problems. The trade-off is simple: words that were never indexed cannot be searched. If your processing engine eliminates stop words, searches for those words will come up empty. The classic example of a search that fails when stop words are eliminated is “To be or not to be.”
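A toy index makes the failure concrete. In the sketch below (hypothetical document number and stop-word list), every word in Hamlet’s famous line is a stop word, so every term of the query was discarded at processing time and no document can ever match it, even though the line is plainly in the collection.

```python
# Illustrative stop-word list; real engines often have hundreds.
STOP_WORDS = {"a", "an", "and", "to", "of", "the", "or", "not", "be"}

def build_index(docs):
    """Build a token-to-document index, dropping stop words."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,?!")
            if token not in STOP_WORDS:
                index.setdefault(token, set()).add(doc_id)
    return index

docs = {1000123: "To be or not to be, that is the question."}
index = build_index(docs)

# Every query term was eliminated as a stop word, so nothing can match.
print([t for t in "to be or not to be".split() if t in index])  # []
print("question" in index)  # True: the non-stop words were indexed
```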
Processing also requires decisions about the definition of a “document unit.” The index relates each word or “token” to an identification number assigned to the document. For example, the index relates the token “Bill” to document number 1000123. Because it maps tokens to documents, rather than documents to their contents, it is called an “inverted index.” But what is our definition of a “document”? Is the “document” the entire memorandum with the exhibits, or is each a different document? Suppose “Bill” is in the memorandum and “Hamilton” is in the exhibit. If the memorandum and exhibit are one document, then a search for “Bill AND Hamilton” will retrieve it; if they are two separate documents, the same search retrieves neither.
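The document-unit decision can be demonstrated with a small inverted index. The document numbers and texts below are invented for illustration: “Bill” appears only in the memorandum and “Hamilton” only in the exhibit, so an AND search succeeds when the two are processed as one unit and fails when they are processed separately.

```python
def build_inverted_index(docs):
    """Map each token to the set of document numbers that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index

def search_and(index, *terms):
    """Return document numbers containing ALL of the given terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

memo = "Bill sent the memorandum"
exhibit = "Hamilton signed the exhibit"

# Unit choice 1: memorandum and exhibit processed as one document.
one_unit = build_inverted_index({1000123: memo + " " + exhibit})
print(search_and(one_unit, "Bill", "Hamilton"))   # {1000123}

# Unit choice 2: each processed as a separate document.
two_units = build_inverted_index({1000123: memo, 1000124: exhibit})
print(search_and(two_units, "Bill", "Hamilton"))  # set()
```

Same collection, same search, opposite results: the document-unit decision made at processing time determines what the search can find.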
Think of the index as the chassis. The search engine rests within that chassis, and without processing and the creation of the index, the engine has nowhere to go.
But not all indexes are equally well designed. “Document unit,” “inverted index,” and “tokenization” are critical to the effectiveness and defensibility of your search. Keep in mind that your search is only as good as your index. Effective and defensible searching requires counsel to look under the hood to find out how the entire search vehicle was built.