The hunt for the ideal e-discovery search method has been a long, winding road. Search software now indexes, digests, organizes, and selects critical documents from the unimaginably large mountain range of electronically stored information (ESI). But has the steady conversion from tomes to gigabytes really taken us to the point where the search will lead to truth as quickly and cheaply as possible? Perhaps not yet. But we are getting closer.
Early in the evolution of our digital society, a first principle of e-discovery emerged: the keyword search. Soon, simple keyword searches were refined with Boolean connectors, stemming, “within” ranges, stop words, and wild cards. But what we all thought was the Rosetta stone of the digital world turned out to be a mere illusion.
Keyword searches retrieved too much irrelevant data (poor precision) and too little of the relevant data (poor recall). Polysemy is part of the problem: The same word can mean numerous things (for instance, a bank where money is deposited and the bank of a river). Synonymy is the next problem: One thing can have many names (lawyer, counselor, barrister, attorney) and even new and made-up names (to avoid detection). The combination of polysemy and synonymy wreaks havoc on keyword searches. When slang, code words, technical expressions, acronyms, abbreviations, and texting are added, the chances of snatching the important documents out of the ESI stew drop dramatically.
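To make the precision and recall problems concrete, here is a minimal sketch over a tiny hypothetical document set: a literal search for “ticket” retrieves the baseball document (hurting precision) and misses the relevant document that says “complaint” instead (hurting recall).

```python
# Hypothetical mini-corpus: three documents relate to customer trouble
# tickets; one relevant document never uses the word "ticket" at all.
documents = {
    1: "Customer opened a trouble ticket about the billing error",   # relevant
    2: "Please escalate the support ticket to tier two",             # relevant
    3: "Two tickets to the baseball game on Saturday",               # not relevant
    4: "The client raised a service complaint by phone",             # relevant
}
relevant = {1, 2, 4}

# Literal keyword search: keep any document containing "ticket".
retrieved = {doc_id for doc_id, text in documents.items() if "ticket" in text.lower()}

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)  # share of hits that matter: 2/3
recall = len(true_positives) / len(relevant)      # share of relevant docs found: 2/3

print(f"retrieved={sorted(retrieved)} precision={precision:.2f} recall={recall:.2f}")
# The baseball "ticket" drags down precision; the synonym "complaint" costs recall.
```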
Keyword searches also lack a sense of context. For example, assume the search target was customer trouble tickets. The keyword is “ticket.” But the search engine does not know that baseball tickets should be excluded from the search results.
So e-discovery turned to a technology called latent semantic indexing (LSI) to solve this problem. LSI acknowledges that human language and expression are so rich that it is impossible to identify every word or expression that names a thing, an idea, or an action. LSI looks beyond explicit keywords for latent (unspecified) words associated with them. When the documents retrieved by the specified keywords also contain other words that are closely associated with those keywords, LSI treats those latent words as additional search terms and retrieves documents that contain them. In this way, LSI finds documents that might not contain the original search terms at all, and recall improves. Today, most major search tools offer customized versions of LSI or other refined “conceptual” technologies.
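As a rough illustration of the idea (not any particular vendor’s implementation), the sketch below builds a small latent space with TF-IDF followed by truncated SVD, a common way to implement LSI, over a hypothetical four-document corpus. A query for “lawyer” scores the attorney and counsel documents well above the unrelated baseball document, even though neither contains the literal keyword.

```python
# A toy latent semantic indexing (LSI) sketch: TF-IDF followed by truncated
# SVD, queried in the reduced "concept" space. Corpus and query are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the lawyer reviewed the contract before the deposition",
    "our attorney filed the motion and reviewed the contract",
    "counsel reviewed the settlement terms with the client",
    "the pitcher threw a fastball in the ninth inning",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                       # term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)  # project to 2 latent concepts
X_latent = svd.fit_transform(X)

# The query contains only the literal keyword "lawyer".
query_latent = svd.transform(tfidf.transform(["lawyer"]))
scores = cosine_similarity(query_latent, X_latent)[0]

for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:+.2f}  {doc}")
# The attorney and counsel documents score close to the lawyer document,
# while the baseball document scores near zero.
```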
Meanwhile, software that groups similar documents has emerged to address the problem of precision. If the search term is “ticket,” the software would identify common groups or kinds of documents using the term ticket and place them in separate buckets. For example, the software would display (perhaps visually) separate ESI buckets for speeding tickets, entertainment tickets, and customer trouble tickets. Assuming the case involved only customer trouble tickets, documents in the “speeding” and “entertainment” buckets could be quickly sampled and categorized as not relevant.
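For illustration, the sketch below buckets hypothetical “ticket” hits with a standard clustering routine (TF-IDF plus k-means). The exact grouping depends on shared vocabulary, but with documents like these each kind of ticket tends to land in its own bucket that a reviewer can sample and set aside wholesale.

```python
# A toy clustering sketch: bucket the keyword hits for "ticket" so whole
# groups can be triaged at once. Documents are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

hits = [
    "customer trouble ticket opened for the billing outage",
    "support team closed the customer trouble ticket after the outage",
    "speeding ticket issued by the traffic officer on the highway",
    "driver contested the speeding ticket from the traffic stop",
    "concert ticket prices for the stadium show this weekend",
    "weekend ticket sales for the stadium concert",
]

X = TfidfVectorizer(stop_words="english").fit_transform(hits)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Print each bucket; a reviewer samples a bucket and can set aside the
# speeding and entertainment groups as not relevant in one pass.
for bucket in range(3):
    print(f"bucket {bucket}:")
    for doc, label in zip(hits, labels):
        if label == bucket:
            print("  ", doc)
```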
But there is more to come. We are now on the brink of a new search wave, generically known as predictive coding, although different software vendors offer variants and versions under their own brands and names. Predictive coding is both a methodology and a technology.
It starts with knowledgeable, skilled human reviewers deciding which documents in a sample drawn from the target data set are relevant.
Next, the predictive coding software catalogues these documents and learns their attributes.
The software then analyzes a new sample drawn from the original data set and, based on what it has learned from the documents coded by the expert reviewers, predicts which documents in this new sample are relevant.
Human reviewers check the results and provide feedback and corrections to the software about its predictions.
Finally, the software goes to work on the next sample batch from the data set, and human reviewers again check the predictions and feed the corrections to the software.
Eventually, this iterative process allows the software to achieve a statistically sufficient result. At that point, the entire data set is run through the software, which identifies and ranks the relevant documents.
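The sketch below is a generic, simplified illustration of that loop using an ordinary text classifier (TF-IDF plus logistic regression); it is not any vendor’s product, and the corpus and the human_review() oracle are hypothetical stand-ins for the expert reviewers.

```python
# A generic sketch of the iterative predictive-coding workflow described
# above, using an ordinary text classifier (TF-IDF + logistic regression).
# The corpus and the human_review() oracle are hypothetical stand-ins.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

random.seed(0)

# Hypothetical data set: customer trouble tickets are the relevant documents.
relevant_docs = [f"customer trouble ticket about the outage {i}" for i in range(40)]
other_docs = ([f"baseball ticket promotion for game {i}" for i in range(40)]
              + [f"speeding ticket notice for vehicle {i}" for i in range(40)])
corpus = relevant_docs + other_docs
truth = {doc: ("relevant" if doc in relevant_docs else "not relevant") for doc in corpus}

def human_review(doc):
    """Stand-in for the expert reviewer's judgment call."""
    return truth[doc]

vectorizer = TfidfVectorizer().fit(corpus)

# 1. Experts code an initial seed sample drawn from the data set.
seed = random.sample(relevant_docs, 3) + random.sample(other_docs, 7)
labeled = {doc: human_review(doc) for doc in seed}

for round_number in range(1, 4):
    # 2-3. Train on everything coded so far, then predict a fresh sample.
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.transform(list(labeled)), list(labeled.values()))
    unreviewed = [doc for doc in corpus if doc not in labeled]
    batch = random.sample(unreviewed, 10)
    predictions = model.predict(vectorizer.transform(batch))

    # 4-5. Reviewers check the predictions; corrections feed the next round.
    corrections = 0
    for doc, predicted in zip(batch, predictions):
        actual = human_review(doc)
        corrections += int(predicted != actual)
        labeled[doc] = actual
    print(f"round {round_number}: {corrections} corrections out of {len(batch)}")

# Finally, retrain on all coded documents, then score and rank the entire set.
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.transform(list(labeled)), list(labeled.values()))
relevant_column = list(model.classes_).index("relevant")
scores = model.predict_proba(vectorizer.transform(corpus))[:, relevant_column]
ranked = sorted(zip(scores, corpus), reverse=True)
print("highest-ranked document:", ranked[0][1])
```

In an actual matter, the stopping point would be set by statistical sampling and validation of the results rather than a fixed number of rounds, and the ranked list would drive review priorities rather than final decisions.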
Searching, therefore, has now come full circle. Litigators entered the previous decade looking for ways to eliminate human beings from the search process. We have entered this new decade recognizing that only advanced technology working iteratively with human interaction can solve the digital volume crisis.
The ideal search still eludes us. But we are getting closer.