December 03, 2015

Away with Words: The Myths and Misnomers of Conventional Search Strategies

There are other ways to attack the problem of keyword searches and predictive coding thanks to some recent advances in technology and data science.

By Thomas I. Barnett

You are in a highly contentious litigation, taking the deposition of someone you believe to have the most important information you need in order to win the case. It’s the moment of truth, and, handing the witness a list several pages long, you ask, “Please tell me every time in the last five years you have used the words on the list I just handed you.”

Does that scenario sound absurd? It should. But it is a fairly good approximation of the legal profession’s overuse of, and over-reliance on, keyword searches to identify relevant information in large data sets. And, for better or worse, predictive coding, which is the latest, and some would say greatest, approach to reviewing documents, is nothing more than keyword searching on steroids: it matches and ranks the entire set of words in a document against those of other documents in the set, based on the frequency and proximity of the words each document contains. Fortunately, there are other ways to attack the problem thanks to some recent advances in technology and data science.

How could a skilled lawyer get the required information from the witness? You might try to establish some background, foundation, and context for the questions you want to ask. You could try to establish things like the time frame of the events at issue, who said what, when, and where and, if possible, why certain things happened the way they did. That way of thinking is basic to how we communicate, question, and learn about the world as human beings. It is so instinctive you probably don’t even stop to think about it.

In fact, there is a tremendous amount of information that forms our perception of the world and how it works. Consider this example: You are a tourist in London, and you stop and ask a policeman, “Do you know how to get to London Bridge?” He replies, “Yes, of course,” and promptly turns and walks away. We all know that you were asking for directions. Yet the approaches most commonly used to find relevant documents in litigation and regulatory matters would never detect such a nuance.

But when it comes to finding information in very large sets of documents and data, we as a profession have come to rely on methods that can be highly unreliable for gaining knowledge from the potential evidence in our cases.

The approach of using keywords, or any of their variants currently offered in the e-discovery industry (e.g., “concept” search, clustering, and predictive coding), is based on what works well for computers—not on how people actually think, learn, and communicate. Computers are far superior in speed and accuracy in identifying patterns (in 1s and 0s) and matching them, and they get faster every year.

However, the fastest, most powerful computers in the world can’t even come close to our ability to consider the context of events or communications, determine and rank the importance of statements or actions, or interpret nuance in language. These are the very things that allow us to understand and make judgments every day and to arrive at what we believe to be the actual meaning of events, communications, and documents.

The foundational technology used in most currently available approaches to searching, identifying, and classifying discovery data has been around and in use in the world of computer science for decades. The only novelty, to the extent any exists, is the relatively recent adoption of such commonplace technology in the legal field. For example, most technology currently available in the legal industry to classify information (potential evidence) relies on the text of the documents—either individual words or small groups of words, as in keyword searching, or grouping all of the words in an entire document, as in predictive coding.
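The difference between the two text-only approaches described above can be sketched in a few lines of Python. This is an illustration only, not how any particular e-discovery product works: the sample documents, the keyword list, and the function names are all hypothetical, and the whole-document comparison uses simple word counts (cosine similarity), whereas real predictive coding tools typically train a classifier on reviewer-coded examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_hits(text, keywords):
    """Keyword search: return the listed keywords that appear in the text."""
    tokens = set(tokenize(text))
    return {k for k in keywords if k.lower() in tokens}


def cosine_similarity(text_a, text_b):
    """Whole-document comparison: score two documents by the counts of
    every word they contain (1.0 = same word profile, 0.0 = no shared words)."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample documents, invented for this sketch.
doc1 = "Please shred the contract before the audit"
doc2 = "The audit team asked to see the contract"
doc3 = "Lunch on Friday?"

print(keyword_hits(doc1, {"shred", "audit", "merger"}))
print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

Both functions see nothing but the words themselves. Neither knows who wrote a document, when, or to whom, which is the gap the rest of this article addresses.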

What these approaches fail to do is incorporate and correlate various types of data that can be easily extracted and brought into the process. Doing so brings us closer to the who, what, when, where, why, and how questions that we actually use to understand events.

In recent years, however, unprecedented advances in a number of areas of computer science have opened the door to better ways of finding information and making sense of it. Two such advances are especially important: (1) the advent of much cheaper and more flexible technology infrastructure, leading to the refinement and wider use of so-called “parallel processing,” in which groups of relatively inexpensive computers work together to achieve processing speeds and power exceeding anything we have seen before; and (2) advances in machine learning and statistical engineering that allow us to tackle problems in a more intuitive, human-like way. These advances are opening new avenues in fact investigation and document discovery as well.

In addition to the text of a document, other information can be accessed and analyzed together to better understand the data:

metadata, which is information that accompanies our documents (e.g., emails, word-processing documents) but is not part of what we think of as the content of the document. Such information includes the date and time the document was created, accessed, or modified, or the time an email was sent, received, or opened, and who sent it and received it. While such data are commonly used to search and sort documents, they are not typically combined with other types of data to determine meaning. Further, the more complex approaches like predictive coding and concept searching generally rely on the text of the documents alone.

entity data (also referred to as extracted metadata), which is information contained in the text of documents that can be identified and classified into predefined categories such as names of persons, organizations, locations, quantities, monetary values, and so on. In other words, instead of looking at the text of a document solely as a string of characters that may or may not match another string of characters, this approach incorporates knowledge about elements of the document into the analysis. What do the words in the document actually refer to or mean?
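To make the two categories above concrete, here is a minimal Python sketch that pulls metadata from an email's headers and entity data from its body. Everything in it is an assumption made for illustration: the sample email, the names, and the lookup lists are invented, and the entity step uses fixed lists and a simple pattern, whereas real entity extraction generally relies on trained language models.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample email, invented for this sketch.
RAW_EMAIL = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>
Date: Tue, 03 Mar 2015 09:30:00 -0800
Subject: Q1 payment

Bob, Acme Corp wired $250,000 to the Zurich account on Monday.
"""

# Hypothetical lookup lists standing in for a real entity recognizer.
KNOWN_ORGANIZATIONS = {"Acme Corp"}
KNOWN_LOCATIONS = {"Zurich"}
MONEY_PATTERN = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def extract_metadata(raw):
    """Metadata: who sent the email, who received it, and when."""
    msg = message_from_string(raw)
    return {
        "sender": [addr for _, addr in getaddresses(msg.get_all("From", []))],
        "recipients": [addr for _, addr in getaddresses(msg.get_all("To", []))],
        "sent": parsedate_to_datetime(msg["Date"]),
    }


def extract_entities(text):
    """Entity data: classify pieces of the text into predefined categories."""
    return {
        "organizations": sorted(o for o in KNOWN_ORGANIZATIONS if o in text),
        "locations": sorted(loc for loc in KNOWN_LOCATIONS if loc in text),
        "monetary_values": MONEY_PATTERN.findall(text),
    }


message_body = message_from_string(RAW_EMAIL).get_payload()
print(extract_metadata(RAW_EMAIL))
print(extract_entities(message_body))
```

The output of both functions is structured data (people, dates, organizations, amounts) rather than a bare string of characters, which is what makes it possible to analyze alongside the text.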

For example, suppose you have an email from one person to another with text that matches a set of search terms. That, by itself, may not provide enough useful information. But what if you were able to search for documents based on whether they discussed specific events, at a specific time, involving specific people? That is possible when using metadata and entity data along with text. This is just one example of what is possible by incorporating and blending technology and processes used in areas outside the legal profession.
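A search of that kind might look something like the following Python sketch, which filters documents on text, metadata, and entity data together. The `Document` structure, the `search` function, its parameters, and the sample records are all hypothetical and exist only to show the idea; a production system would use an indexed data store rather than a loop over a list.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Document:
    doc_id: str
    text: str
    sender: str
    recipients: list
    sent: datetime
    entities: set = field(default_factory=set)


def search(docs, terms=(), people=(), start=None, end=None, entities=()):
    """Return IDs of documents that satisfy every criterion supplied:
    at least one search term, all named people, the date window,
    and all named entities."""
    results = []
    for doc in docs:
        words = set(re.findall(r"[a-z0-9']+", doc.text.lower()))
        participants = {doc.sender, *doc.recipients}
        if terms and not any(t.lower() in words for t in terms):
            continue  # text: none of the search terms appear
        if people and not set(people) <= participants:
            continue  # metadata: a required person is not on the email
        if start and doc.sent < start:
            continue  # metadata: sent before the window opens
        if end and doc.sent > end:
            continue  # metadata: sent after the window closes
        if entities and not set(entities) <= doc.entities:
            continue  # entity data: a required entity is not mentioned
        results.append(doc.doc_id)
    return results


# Hypothetical sample records, invented for this sketch.
docs = [
    Document("D1", "The payment cleared today.", "alice@example.com",
             ["bob@example.com"], datetime(2015, 3, 3), {"Acme Corp"}),
    Document("D2", "Payment for lunch?", "carol@example.com",
             ["dave@example.com"], datetime(2015, 3, 4)),
    Document("D3", "Acme Corp payment is late.", "alice@example.com",
             ["bob@example.com"], datetime(2014, 1, 10), {"Acme Corp"}),
]

text_only = search(docs, terms=["payment"])
combined = search(
    docs,
    terms=["payment"],
    people=["alice@example.com", "bob@example.com"],
    start=datetime(2015, 1, 1),
    end=datetime(2015, 12, 31),
    entities=["Acme Corp"],
)
print(text_only)
print(combined)
```

In this sample, the term "payment" alone matches all three documents, while adding the people, the time frame, and the organization narrows the result to the single document that fits all of them.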

We need to move beyond the confines of conventional approaches. We rely on these approaches either out of habit and familiarity or because these are the only offerings being plied by providers who have invested in their development. To survive, we need to take advantage of the best available techniques and technology. Failing to do so will leave us drowning in the ever-increasing ocean of data.

Keywords: litigation, pretrial practice, e-discovery, metadata, entity data, keyword search, predictive coding

Thomas I. Barnett is special counsel, e-discovery and data science, at Paul Hastings LLP in Los Angeles, California.


Copyright © 2015, American Bar Association. All rights reserved. This information or any portion thereof may not be copied or disseminated in any form or by any means or downloaded or stored in an electronic database or retrieval system without the express written consent of the American Bar Association. The views expressed in this article are those of the author(s) and do not necessarily reflect the positions or policies of the American Bar Association, the Section of Litigation, this committee, or the employer(s) of the author(s).