August 2012 | Survival Guide for Young Lawyers: Taking Charge of Your Career
Predictive Coding: What's New and What You Need to Know
If you have an interest in e-discovery, you likely have already heard the phrase “predictive coding” or “technology-assisted review.” If e-discovery is not your favorite topic, this article is a great opportunity to get familiar with an important trend in legal technology.
Technology-assisted review, or TAR as it has come to be known, is a broad term that (mostly) means what it says: using technology to assist review. Here, the technology typically refers to some form of language-based analytics. The two primary flavors are “concept clustering” (grouping documents based on content) and “concept search” (or “find more like this”). These technologies aren’t new. In fact, the “find more like this” feature has been available in legal research software for some time now.
What’s new is using these analytics in a formal workflow to improve review efficiency. In the past, users were given access to these analytics and, maybe, given ad hoc advice from their vendor on using them, but vendors hadn’t offered a predefined plan for using the analytics. Now, many vendors (and more every day, it seems) are offering a defined workflow for using analytics to improve the efficiency and effectiveness of document review.
Ultimately, then, “technology-assisted review” is a workflow or process designed to make good use of pre-existing technologies. So what is the secret sauce that transforms an ordinary workflow into technology-assisted review? There is no single recipe for a TAR workflow. As far as I can tell, each vendor has developed its own workflow for improving review with analytics.
What is predictive coding, and how is it different from technology-assisted review? Predictive coding is a form of technology-assisted review that offers computer-generated document relevance rankings. Those relevance rankings ARE the predictive coding: the computer is predicting the likelihood that you will code a document one way or another. If the relevance ranking is high, the computer is predicting that you are likely to agree the document is relevant. Other TAR methods don’t rank relevance, so they aren’t “predictive coding.”
But relevance rankings aren’t new either. They, too, have been available in legal research software, among other tools, for some time now. Again, what is new here is the formal workflow incorporating analytics like relevance ranking.
So if the technology isn’t new, what’s the big deal? Well, the big deal is mainly about what is done with the results of the TAR process. Let’s look at an example of a predictive coding workflow to understand why handling the results is controversial.
In a typical TAR workflow, the first thing to do is gather a “seed set” of documents from among all the documents needing review. The seed set will ideally be reviewed by the lawyer who knows the most about the case. It wouldn’t be uncommon for the most senior partner on a matter to delegate this task to a junior partner or senior associate. Either way, the set needs to be reviewed by someone with strong knowledge of the case.
How this seed set is gathered—randomly, using keywords, or some hybrid approach—is also a topic of debate, but, as we will see, the ultimate controversy is with how to properly handle the results. For our example we’ll assume we’re using the random method of gathering a seed set.
Collecting a random sample of sufficient size allows us to draw inferences about the entire universe of documents. For instance, if after reviewing the seed set we find that 25% of the documents were responsive, we can estimate with a high level of confidence (typically 95%) that roughly 25% of the remaining documents are also responsive, give or take a margin of error of a few percentage points.
That is useful for getting a general idea of how many documents we should expect to produce. If we have 1,000,000 total documents, we would expect to need to find about 250,000 responsive documents. If, after completing our planned process, we have only found 50,000 responsive documents, we would have some cause for concern and want to retrace our steps.
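The back-of-the-envelope math above can be sketched in a few lines of code. This is a minimal illustration, not legal-grade statistics: the function name is my own, the sample numbers are illustrative, and the margin of error uses the standard normal approximation at a 95% confidence level (z = 1.96).

```python
import math

def sample_estimate(sample_size, responsive_in_sample, population, z=1.96):
    """Project responsive documents in a population from a random sample.

    z = 1.96 corresponds to a 95% confidence level; the margin of error
    uses the normal approximation to the binomial.
    """
    p = responsive_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    projected = p * population          # point estimate of responsive docs
    return projected, p - margin, p + margin

# Illustrative numbers: about 25% of a random sample is responsive,
# drawn from a universe of 1,000,000 documents.
projected, low, high = sample_estimate(1537, 384, 1_000_000)
```

With these (assumed) inputs, the projection comes out near the article’s 250,000 figure, with a margin of roughly two percentage points either way.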
Back to the seed set. After gathering and reviewing the seed set, the machine begins to learn what we are looking for by analyzing the relationships between the words in documents we have marked responsive. It then looks for similar documents.
What happens next seems to depend on the particular form of technology-assisted review workflow. In one TAR workflow I’m familiar with that doesn’t include relevance ranking, the machine will keep bringing back documents it thinks are relevant until it feeds you a batch with no relevant documents. At that point, you’re supposedly done. Whether this process is accurate enough by itself, or whether more needs to be done, is the controversy over handling the results that I mentioned earlier, but we’ll revisit this in a minute.
In a predictive coding workflow I’m familiar with, after learning from the first seed set, the machine starts predicting your responses but keeps those predictions to itself. It will gather another set of documents, some of which it believes you will consider responsive and some it believes you will consider non-responsive. You review the documents, and it learns from your calls and compares those calls against its predictions. This process continues until the machine’s internal predictions are statistically accurate. That is, it is highly confident it knows what you’re looking for. Then it ranks the remaining documents based on that knowledge. Then what? Again, this is where things get dicey.
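To make that iterative loop concrete, here is a toy simulation of the “predict silently, learn from the reviewer’s calls, stop when agreement is high, then rank the rest” cycle. Everything here is invented for illustration: the six-word vocabulary, the stand-in for the reviewer, and the crude word-weight scoring are not any vendor’s actual algorithm.

```python
import random

random.seed(0)

# Toy corpus: each "document" is a set of three words; in this made-up
# matter, a document is responsive if it mentions "merger".
VOCAB = ["merger", "lunch", "contract", "golf", "invoice", "weather"]
docs = [set(random.sample(VOCAB, 3)) for _ in range(2000)]

def reviewer_call(doc):
    # Stand-in for the attorney's responsiveness call.
    return "merger" in doc

# The "machine": a running tally of how often each word appears in
# documents marked responsive (+1) versus non-responsive (-1).
weights = {w: 0 for w in VOCAB}

def predict(doc):
    return sum(weights[w] for w in doc) > 0

def learn(doc, responsive):
    for w in doc:
        weights[w] += 1 if responsive else -1

pool = list(range(len(docs)))
batch_size, agreement = 100, 0.0
while pool and agreement < 0.95:
    batch, pool = pool[:batch_size], pool[batch_size:]
    hits = 0
    for i in batch:
        guess = predict(docs[i])       # the machine's silent prediction
        call = reviewer_call(docs[i])  # the human review decision
        hits += (guess == call)
        learn(docs[i], call)           # the machine learns from the call
    agreement = hits / len(batch)      # how often prediction matched call

# Once agreement is high, rank the unreviewed documents by score.
ranked = sorted(pool, key=lambda i: sum(weights[w] for w in docs[i]),
                reverse=True)
```

In this simulation the loop stops after a couple of batches, leaving most of the pool unreviewed but ranked, which is the whole point: the controversial question is what you then do with that ranking.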
It is often suggested that quality control measures need to be applied to the process. For our non-predictive-coding TAR example above, after we’ve seen all the responsive documents the machine could find, we should probably pull a random sample of the documents it thinks are non-responsive and take a look at them. If there’s nothing responsive there, we’re probably good. If we find 1% of them are responsive, we may be done or may want to do more checking, depending on our preferences. If, however, we find 20% of these “non-responsive” documents are actually responsive, we need to go back to the drawing board.
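That quality-control sample of the discard pile can be reduced to a simple calculation. A rough sketch, with an invented function name and an assumed 5% acceptability threshold; the upper bound again uses the normal approximation:

```python
import math

def discard_pile_check(sample_size, responsive_found, threshold=0.05, z=1.96):
    """Estimate responsive documents lurking in the 'non-responsive' pile.

    Returns the observed rate, a 95% upper confidence bound on that rate,
    and whether the bound falls under an (assumed) acceptable threshold.
    Note: a sample with zero hits calls for a different estimator.
    """
    p = responsive_found / sample_size
    upper = p + z * math.sqrt(p * (1 - p) / sample_size)
    return p, upper, upper <= threshold

# The article's two outcomes, using an illustrative 400-document sample:
low_rate, low_upper, low_ok = discard_pile_check(400, 4)     # 1% responsive
high_rate, high_upper, high_ok = discard_pile_check(400, 80)  # 20% responsive
```

On these assumed numbers, the 1% result passes the check and the 20% result fails it, matching the article’s “probably done” versus “back to the drawing board” outcomes.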
Despite the controversy, the standard in e-discovery is reasonableness, and it seems that these methods can be used reasonably with enough thoughtfulness and planning. If used properly, they certainly offer the potential for great savings of time and costs. So, take a deep breath, smile, and embrace your new computer overlords.
Christopher J. Spizzirri, Esq. is an Associate with Morris James LLP in Wilmington, Delaware. Chris' practice is dedicated exclusively to electronic discovery and information governance issues, focusing on making eDiscovery manageable, defensible, and cost-effective.