September 27, 2012 Articles

Can Predictive Coding Save the World?

The technology is powerful, but the human element is still required.

By Mike Flanagan

One of the most talked-about methods aimed at improving the discovery of electronically stored information (ESI) during litigation is “predictive coding,” which uses sophisticated algorithms to reduce review time compared with traditional attorney review. “Talked about,” however, does not always translate to utilized, accepted, or even well understood.

What Is Predictive Coding?
Predictive coding is a series of processes and workflows built around initial input from experienced attorneys who are knowledgeable about the matter at hand. Those attorneys code a subset of documents drawn from the overall data population (the “seed set”). This seed set is then used to identify similar documents in the overall corpus that are likely to be relevant to key issues in the matter, and therefore responsive to discovery requests. The focus of predictive coding is to gather a representative sample set from a large data population and carefully “train” technology systems on what makes documents “responsive” or “non-responsive” to the issues in a particular case. Predictive coding can also identify additional items such as potentially privileged documents. Once the system is trained, complex algorithms take the decisions from the sample set and extrapolate them to the entire data set over the course of hours rather than weeks or months.

Here are some key terms, derived from the world of statistics, that are relevant to predictive coding:

  • Recall. The percentage of the total number of relevant documents that are tagged as such by the system (completeness).
  • Precision. The percentage of documents tagged as relevant by the system that are in fact relevant (exactness).
  • Confidence level. A measure of the reliability of a result, usually expressed along with an associated margin of error.
  • Prevalence. The proportion of documents in the entire document population that are responsive.
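
To make these terms concrete, the following is a minimal Python sketch that computes them from the counts a validation sample might yield. The function names and the example counts are hypothetical and chosen purely for illustration. Prevalence is expressed here as a share of the population, and the margin of error uses the standard normal approximation for a sample proportion, which is one common convention rather than anything prescribed by a court or a particular vendor.

```python
import math


def review_metrics(true_pos, false_pos, false_neg, true_neg):
    """Return recall, precision, and prevalence from review counts.

    true_pos:  relevant documents the system tagged as relevant
    false_pos: non-relevant documents the system tagged as relevant
    false_neg: relevant documents the system missed
    true_neg:  non-relevant documents the system correctly left untagged
    """
    total = true_pos + false_pos + false_neg + true_neg
    relevant = true_pos + false_neg
    tagged = true_pos + false_pos
    return {
        "recall": true_pos / relevant,      # completeness
        "precision": true_pos / tagged,     # exactness
        "prevalence": relevant / total,     # share of population that is relevant
    }


def margin_of_error(proportion, sample_size, z=1.96):
    """Normal-approximation margin of error; z=1.96 corresponds to 95% confidence."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Hypothetical counts from a 1,000-document validation sample.
metrics = review_metrics(true_pos=80, false_pos=20, false_neg=20, true_neg=880)
moe = margin_of_error(metrics["prevalence"], sample_size=1000)
print(metrics)
print(f"prevalence margin of error: +/- {moe:.3f}")
```

With these assumed counts, recall and precision both come to 80 percent and prevalence to 10 percent, with a margin of error of roughly two percentage points at the 95 percent confidence level.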

The Proponents: Benefits of Predictive Coding
Predictive coding supporters argue that using technology trained by a handful of experienced attorneys is more effective and efficient at capturing relevant documents from the entire data population (higher recall), with fewer false positives (greater precision), and a lower margin of error (higher confidence level) than a manual review by attorneys alone or attorney review aided by keyword searching. In short, predictive-coding technology offers the potential for faster, more accurate results, at a lower cost.

The Opponents: Risks of Predictive Coding
Detractors point to the risks of a “black box” technology solution based on complex algorithms that are poorly understood. Opponents complain that an admitted subset of documents is being used to define legal relevance for the entire data corpus. There is also the potentially high cost of “getting it wrong”: small errors around relevance or privilege in the seed set are significantly magnified through the extrapolation process. In addition, legal issues such as privilege and reliability are not yet uniformly resolved. Finally, because this technology is still early in its application to the legal process, litigants who use predictive coding may not be able to rely on the developing case law as a road map for how to apply it and defend its use.

Workflows (or How It Actually Works)
Workflows can be described simply as the protocols that are used to get from point A to point B. In the case of predictive coding, workflows may generally consist of topics including: (1) creation of seed sets, (2) human review of seed sets to establish baselines, (3) training systems through iterative review of results, (4) application of systems to the overall data population, (5) human review of computer-suggested results, and (6) quality-control validation.
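
The loop of seed-set coding, training, and application can be illustrated with a deliberately simplified Python sketch. This is a toy word-count scorer, not the method used by any commercial predictive-coding product, whose algorithms are considerably more sophisticated. The documents, function names, and scoring rule are all assumptions made for illustration.

```python
from collections import Counter


def train(seed_set):
    """Learn a per-word weight from attorney-coded seed documents.

    seed_set: list of (text, is_responsive) pairs coded by reviewers.
    A word's weight is the number of responsive seed documents that
    contain it minus the number of non-responsive ones that do.
    """
    weights = Counter()
    for text, is_responsive in seed_set:
        for word in set(text.lower().split()):
            weights[word] += 1 if is_responsive else -1
    return weights


def score(weights, text):
    """Sum the learned weights of the distinct words in a document."""
    return sum(weights[word] for word in set(text.lower().split()))


def rank_population(weights, population):
    """Order the population from most to least likely responsive."""
    return sorted(population, key=lambda text: score(weights, text), reverse=True)


# Steps 1-2: attorneys code a small seed set (hypothetical documents).
seed_set = [
    ("hangar roof collapse inspection report", True),
    ("snow load on hangar roof", True),
    ("lunch menu for friday", False),
    ("holiday party schedule", False),
]
weights = train(seed_set)

# Step 4: apply the trained model to the uncoded population.
population = [
    "friday lunch order",
    "engineer notes on roof collapse",
    "party parking schedule",
]
ranked = rank_population(weights, population)

# Step 5: reviewers would examine the top-ranked documents first; their
# corrections can be appended to seed_set and train() rerun (step 3).
print(ranked[0])
```

The point of the sketch is the shape of the workflow: human coding decisions drive the model, the model ranks the remaining documents, and human review of those rankings feeds the next training round.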

Key Themes Relating to Predictive Coding 
After examining the proffered benefits and potential risks of predictive coding, several key issues arise that must be considered when deciding whether to use this technology to reduce the burden of review.

Defensibility is about process and execution. It is not necessarily about the technology, but whether the discovery process was reasonable and implemented in a reasonable manner. While courts generally consider review by trained attorneys to be reasonable, it is not necessarily the most cost-effective choice. Thus, technology can be used to enable the processes.

If predictive coding fails, the most obvious risk is the cost and time of having attorneys manually review large volumes of documents after all. While this may represent a substantial additional cost, risk to the producing party can be mitigated if there is effective cooperation between the opposing parties and an agreement on acceptable search protocols. In matters where a party is looking to use predictive coding unilaterally, great attention should be paid to the workflow, effectiveness of the tool, and quality-control efforts because a failure of the process could lead to sanctions or adverse rulings for failures to disclose.

What level of accuracy is required or even desired in legal review? There is currently no agreed-upon standard set by the courts as to what satisfies the accuracy requirements of the discovery process. While courts have long held that attorney reviewers are the gold standard for accuracy, there are a few studies that challenge this belief and suggest that human reviewers are not as accurate as once believed. A study published in the Richmond Journal of Law and Technology found that manual human review achieved an average of 59.3 percent recall and 31.7 percent precision, while technology-assisted review of the same data achieved averages of 76.7 percent recall and 84.7 percent precision. See Maura R. Grossman & Gordon V. Cormack, “Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review,” XVII Rich. J.L. & Tech. 11 (2011). Courts will be called upon to define what constitutes reasonable efforts and defensible protocols under the Federal Rules of Civil Procedure given the technology-assisted reviews and processes that are being implemented in litigation.

This gets to the heart of the accuracy matter considering that judgment calls are often required during legal review and there is a difference between statistical perfection and quality results. While predictive coding offers the promise of greater accuracy, it is important to use technology in conjunction with attorneys whose experience, training, and judgment can help improve the ultimate quality of the results.

Case-Law Discussion
One of the frustrations facing both corporate litigants and law firms alike is the limited number of courts that have ruled upon the sufficiency and defensibility of predictive coding. Two recent cases have examined predictive coding and offered qualified endorsements of the technology as generally useful, but have left the door open for objections to sufficiency and completeness based upon the actual production derived from predictive coding. In a third case, the court is currently evaluating the appropriateness of using predictive coding at the plaintiff’s request despite the defendant’s considerable investment in using traditional attorney reviewers.

Da Silva Moore v. Publicis Groupe, Case No. 1:11-cv-01279 (S.D.N.Y. Apr. 26, 2012)
In Da Silva Moore, the parties agreed to use predictive coding for the collection and production of ESI, but the plaintiff took issue with the reliability of the method (workflow) by which the defendant planned to use the technology. The plaintiff argued, among other things, that the defendant's predictive-coding method did not include a standard of relevance mutually agreed upon by the parties. Da Silva Moore v. Publicis Groupe, 2012 WL 607412, *8 (S.D.N.Y. Feb. 23, 2012).

Magistrate Judge Andrew Peck authorized the use of predictive coding using the defendant’s proffered methodology, concluding that

the use of predictive coding was appropriate considering: (1) the parties' agreement [to use predictive coding], (2) the vast amount of ESI to be reviewed (more than three million documents), (3) the superiority of computer-assisted review [compared] to the available alternatives (i.e., linear manual review or keyword searches), (4) the need for cost-effectiveness and proportionality under Rule 26(b)(2)(C), and (5) the transparent process [for conducting predictive coding] proposed by [the defendant].

Id. at *9. The court made useful and hopefully precedential comments concerning the viability and reliability of predictive coding. This case continues to progress, and the plaintiff was permitted to revisit the accuracy and sufficiency of data produced by the defendant based on predictive coding.

Global Aerospace Inc. v. Landow Aviation, Case No. CL 61040 (Va. Cir. Ct. Loudoun Co. Apr. 23, 2012)
In a Virginia state court, Landow Aviation was sued over a collapsed airplane hangar, and the defendant filed a motion with the court, requesting either that predictive-coding technology be allowed or that the plaintiff, Global Aerospace, pay the additional costs associated with traditional review. Judge James Chamblin ordered that the defendants could use predictive coding, despite the plaintiff's objections that the technology was less effective than traditional human-only review.

The defendants had offered testimony supporting the use of predictive coding from several experts, including personnel from several e-discovery vendors as well as members of the predictive-coding software manufacturer’s team. Judge Chamblin found that, "[h]aving heard argument with regard to the Motion of Landow Aviation . . . it is hereby ordered [d]efendants shall be allowed to proceed with the use of predictive coding for purposes of processing and production of electronically stored information." Global Aerospace Inc. v. Landow Aviation, Case No. CL 61040, 2012 WL 1230554, at *3 (Va. Cir. Ct. Loudoun Co. Apr. 23, 2012). Judge Chamblin specifically reserved the rights of the plaintiff to question "the completeness of the contents of the production or the ongoing use of predictive coding." Id. at *4.

Kleen Products v. Packaging Corporation of America, Case No. 10 CV 05711 (N.D. Ill. Aug. 21, 2012)
This antitrust case is currently pending in the U.S. District Court for the Northern District of Illinois. The plaintiffs initially asked Judge Nan Nolan to order the defendants to redo their previous productions and conduct all future productions using predictive-coding technology. The plaintiffs argued that, given the complexity of the issues in their antitrust case, the keyword-search strategy employed by the defendants to find responsive documents was deficient because, among other things, it was incapable of identifying variations of key concepts relevant to the case. The plaintiffs claimed that if the defendants had used predictive-coding technology, their production would have been more thorough.

The defendants asserted that the keyword-search method by which they collected and reviewed documents is what courts regularly endorse in commercial litigation, and further argued that they already produced over three million documents and invested thousands of hours in the review process, and it would represent an undue cost burden for them to be required to start the process over using predictive-coding technology.

Magistrate Judge Nolan urged the parties to focus on developing a mutually agreeable keyword-search strategy for e-discovery instead of debating whether other search-and-review methodologies would yield better results. At the moment, there have been no definitive pronouncements from the court on the viability of predictive coding, but this case bears watching due to the unique circumstances.

Predictive coding offers the promise of faster, more accurate e-discovery review at a reduced cost, but the technology is still in the early stages of acceptance within the legal industry, and there are no easy answers that allow for the wholesale replacement of attorney reviewers. The truth, as with most things, lies somewhere in the middle, where technology will be employed to support human decision making and play a part, alongside trained attorneys, in capturing process efficiencies and improving speed while lowering the costs involved in the discovery process.

The real value of predictive coding comes when attorneys can employ strategic uses of technology that benefit the litigant beyond simply capturing time savings and cost efficiencies, recognizing opportunities where technology adds value.

One such possibility is using predictive coding as a form of “smart quality control.” The potential benefits of using technology in this manner could include improved consistency in results, as well as the ability to address documents that were not collected or even contemplated at the start of the case.

In today’s environment of multi-party, multi-national litigation, sometimes with foreign-language collections, the best solutions require a combination of targeted technology tools, subject-matter experts, and experienced project managers. Carefully constructed processes can be designed to solve complex discovery-search projects, and predictive coding can be a highly effective tool in a larger arsenal of solutions, rather than one side of a debate that pits technology against humans. There are rarely absolutes in the legal world, and the discovery process offers a wide spectrum of issues that are not easily resolved with a simple yes or no. The best course of action is to balance technology, people, and processes together to solve complex problems.

Keywords: litigation, corporate counsel, ESI, e-discovery

Mike Flanagan is general counsel for First Advantage Litigation Consulting.