A bias is an inclination or preference for approaching a task or problem in a certain way. When bias is based on objective facts, it saves time otherwise required to constantly reconsider prior decisions. However, bias is less beneficial when decision makers fail to consider relevant underlying facts or newly available options. There are two major types of biases to consider: selection and confirmation.
Selection Bias: Key-Term Searching
Selection bias involves selection methods that ignore a significant part of the population that is supposedly under evaluation. The biggest example in e-discovery is selecting documents to review based on key-term searching. The implicit assumptions underlying a bias toward key-term searching are that (a) text searching can find documents based on the words that are visible in the documents and (b) humans can anticipate the terms needed to find the right documents. Neither assumption is necessarily true. Consider the following potential problems.
Image-only files contain no searchable text. In some document populations, over 25 percent of the documents do not have accurate text available for indexing and searching because they are image-only files. For example, if there are drafts of an agreement in Word format and one scanned copy of the executed final document that did not undergo optical character recognition (OCR)—the process of using software and hardware to make images searchable by the text they display—text search will locate the drafts but not the most legally significant version.
There are many industry sectors and business processes that use and generate image-only documents. In the mortgage industry, for example, loan documentation is often faxed, or scanned and emailed, and some document images are created using cell phone cameras. In engineering, CAD/CAM systems output vector graphics, which may not include any text-searchable characters. In oil and gas and petrochemicals, many document types, such as seismic, land records, and construction documents, are often image-only.
Overcoming the selection bias against image-only files involves considering several different issues: nonrandomness, OCR and metadata work-arounds, quality measurement, and transparency.
- Non-Randomness. Image-only documents are not randomly distributed across an organization; some processes or document sources will have far more image-only documents than others. That means that in some litigation, selection bias against image-only documents is completely harmless: nothing relevant or responsive will be missed. In other cases, it can leave very significant information out of consideration entirely, as illustrated in the preceding examples.
- OCR & Metadata Workarounds for Image-Only Files. One work-around for analyzing image-only files is to OCR them to obtain searchable text. This can be expensive and time-consuming in large document populations. Even then, the text produced may contain extensive OCR errors, making many terms unsearchable. Another work-around is to analyze metadata associated with image-only files. Unfortunately, however, metadata analysis often does not meaningfully differentiate between responsive and unresponsive documents.
- Quality Measurement. Failing to gather nontextual documents is not an option. As discussed later in this article, such a failure can make recall statistics misleading.
- Transparency. The best approach to dealing with image-only files in e-discovery, as well as other text-bias issues, is transparency: the requesting and responding parties should discuss the issues, and both parties should agree ahead of time how they will be handled.
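The OCR-error problem described above can be illustrated with a toy sketch (the misrecognized string is invented for the example): a search for the correctly spelled term finds the clean text but misses text where the OCR engine confused characters.

```python
# Invented OCR output: "m" misread as "rn", a common recognition error.
clean_text = "mortgage application for the property"
ocr_text = "rnortgage application for the property"

print("mortgage" in clean_text)  # True
print("mortgage" in ocr_text)    # False - the OCR error makes the term unsearchable
```

A reviewer looking at the page image would read "mortgage" without difficulty; only the index is blind to it.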
Embedded graphics in partial-image files are not text-searchable. Documents don’t need to be image-only to create search problems. Documents, especially presentations, often include embedded graphics, such as screen captures, desktop publishing graphics, commercial stock illustrations, or cell phone photos. While these embedded graphics are completely legible by their human audience, the text displayed in them is unavailable to text-indexing software.
This is not easily dealt with via OCR. Most OCR packages won’t convert embedded graphics into the text values needed by text-indexing software if there is text available anywhere in the file; the text displayed in the embedded graphics is unavailable for selection purposes. Some final-review platforms may provide the option to extract embedded graphics and make image-only documents from them, and these image-only objects can then be OCRed and indexed for text searching, but that is well past the point of initial selection.
Document boundary issues impact key-term searches. Lawyers often implicitly assume that a “file” is the same as a “document.” That is not necessarily so. Consider these problems:
- Multi-Document Files. In some collections there is a high prevalence of multidocument files. For example, in mortgage loans, an entire loan package with potentially hundreds of documents may be contained in a single PDF file with thousands of pages. The consequence is that searches with logic that is expansive (e.g., Term A OR Term B) are more apt to find these long files even though most of the individual documents in those files do not contain the specified terms; they just get swept into the results because of the words in other documents in those files. It also means that searches with exclusionary logic (e.g., Term C NOT Term D) are more apt to exclude files containing individual documents that should have been included. In short, multidocument files present issues of both underinclusiveness and overinclusiveness.
Document boundary issues also impact how reviewers evaluate combined documents. Most reviewers spend their time on the first few pages in a file and can easily miss documents embedded later in the file.
- Single-Page Files. The opposite of the multidocument file is single-page files. In this scenario, multiple single-page files may be used to represent individual multipage documents. Unless document boundaries are properly established, text search logic will apply at the page, not the document, level, resulting in both underinclusive and overinclusive search results. An example of underinclusiveness would be where there is a search for documents with Term A AND Term B. That search will not locate documents where Term A is in the file for one page of a document and Term B is in a file for a different page. An example of overinclusiveness would be where a search specifies that Term C should NOT be in documents in the result set, but Term C only occurs in the file for one of the pages in a document; any of the other pages could still be included in the result set if they satisfied the other parts of the query.
- No Text Fix. While document boundary issues cause significant text search issues, text-based analysis and AI are not effective in breaking larger documents into smaller ones or deciding how to combine single-page files into logical documents.
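The boundary problems above can be illustrated with a minimal sketch (the pages and terms are hypothetical): the same AND search produces different results depending on whether the two pages of a document are stored as one unitized file or as two single-page files.

```python
# Hypothetical two-page document: "merger" on page 1, "indemnity" on page 2.
doc_pages = ["draft merger agreement", "indemnity clause attached"]

# Stored as one properly unitized file: the AND search succeeds.
single_file = " ".join(doc_pages)
hit_combined = "merger" in single_file and "indemnity" in single_file

# Stored as single-page files: the same AND search fails on every file,
# because no individual page contains both terms.
hits_per_page = [("merger" in p and "indemnity" in p) for p in doc_pages]

print(hit_combined)        # True  - found when pages are unitized into one document
print(any(hits_per_page))  # False - missed when each page is its own file
```

The reverse also holds: a long multidocument file matches an OR search if any one of its constituent documents contains a term, sweeping the rest in with it.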
Language affects key-term searches. As more business is conducted across national boundaries, more languages are used in the files that are evaluated for e-discovery. Searches that might work well in one language can obviously fail completely in another.
In addition to the issue of multiple languages, there is the issue of the vagaries of a single language. Even when the textual representations of documents are perfect (e.g., no image-only or partial-image documents, just one language, and correct document boundaries), text search has challenges resolving inherent language ambiguities. This causes searches to miss some responsive documents while including nonresponsive ones. These flaws generally fall into one of two categories: (a) synonyms, multiple words that convey the same concept; and (b) homonyms, single words with multiple meanings. Synonyms and homonyms are long-standing search challenges. Every lawyer and e-discovery consultant should be familiar with the landmark study of full-text search effectiveness by David C. Blair and M. E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, 28 Comm. of the ACM 289 (Mar. 1985). The short version: lawyers searching documents consistently overestimate the share of the relevant documents their text searches have actually found.
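Both failure modes can be shown in a few lines (the documents and query term are invented for the illustration): a literal search for one term misses a responsive document that uses a synonym and sweeps in a nonresponsive document that uses the same word in a different sense.

```python
# Invented mini-collection: docs 1 and 2 are responsive; doc 3 is not.
docs = {
    1: "the parties executed the agreement on friday",      # responsive (synonym)
    2: "both sides signed the contract after negotiation",  # responsive
    3: "she will contract the flu if exposed",              # homonym, nonresponsive
}

query = "contract"
hits = {doc_id for doc_id, text in docs.items() if query in text}

print(hits)  # {2, 3}: misses doc 1 ("agreement"), includes doc 3 (wrong sense)
```

The synonym causes underinclusion; the homonym causes overinclusion, and neither error is visible from the result set itself.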
Nontextual glyphs do not respond to text searches. Important information can be conveyed by nontextual glyphs (i.e., graphical elements) that will not be located by text search. Examples include the following:
- Logos can be used to identify a company or product without an accompanying text label, such as using the Nike Swoosh logo without the text term Nike.
- At times, it can be desirable to be able to select maps, engineering drawings, plats, and similar documents by specifying which symbols are used in them, such as the symbol for a shut-in oil well or a specific electrical component.
- The black rectangles used to redact or mask underlying content are not directly searchable using text-based systems. Sometimes redactions can be found if the term “redacted” appears in the rectangle and is successfully converted to text, but that is not consistent or reliable.
Selection Bias: PC/TAR
Predictive coding and technology-assisted review technologies (PC/TAR) have helped lower the cost and turnaround time required to select documents for production by analyzing the text patterns in documents classified as responsive or nonresponsive by human reviewers.
With PC/TAR, decisions made on a subset of documents are applied to nonreviewed documents. However, depending on the conceptual approach used, a TAR system will ignore or fail to correctly analyze some types of documents even if all the words they contain are fully text searchable.
Because of the following types of selection bias issues, PC/TAR can leave a significant percentage of documents for linear review:
- Non-Sentence Content. Linguistic systems that depend on having sentences to parse may skip documents that don’t have sentences, such as lists or spreadsheets.
- Numeric Data. The presence or absence of numeric data may not be used at all by many PC/TAR systems.
- Documents with Too Few or Too Many Words. Documents with fewer or more than a specified number of “words” may not be analyzed at all; this can be a real problem with improperly unitized documents, such as multidocument files.
- Languages. Each language may require its own processing.
Confirmation Bias: Text-Based E-Discovery
Confirmation bias is a predisposition to conclude that a favored approach is working. In e-discovery review and production, one of the key measurements of success is “recall,” which is an estimate of the percentage of responsive documents that have been identified as such. Recall is estimated by evaluating samples drawn from the entire document set or from the smaller set of documents marked as nonresponsive. This approach is especially problematic when text search is used as a first-pass filter for responsiveness review. An apt analogy would be estimating the average weight of fish in a lake by collecting a sample using bait that appeals to just one type of fish, or only gathering fish within a few feet of the surface.
It’s easy to get lost in the statistical details of sample sizes, confidence levels, and margins of error and ignore the fact that a text-based collection methodology sometimes overlooks 15 to 25 percent or more of the document population because it is image-only, contains embedded graphics, is in a foreign language, and so on. The initial selection bias is perpetuated by the faulty success measurement. Recall should always be stated as a percentage of the relevant documents collected, not as a percentage of all relevant documents.
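A worked example, with invented numbers, shows how large the gap can be: assume 20 percent of the relevant documents sit in an excluded image-only portion of the population that never reaches the index.

```python
# Invented numbers for illustration only.
relevant_in_searchable = 800   # relevant docs in the text-searchable portion
relevant_in_image_only = 200   # relevant docs in the excluded image-only portion
found = 640                    # relevant docs the text search actually identified

# Recall measured only against the collected (searchable) documents:
recall_collected = found / relevant_in_searchable

# True recall against all relevant documents, excluded portion included:
recall_true = found / (relevant_in_searchable + relevant_in_image_only)

print(f"{recall_collected:.0%} vs {recall_true:.0%}")  # 80% vs 64%
```

A reported recall of 80 percent silently becomes 64 percent once the excluded fifth of the population is counted, which is exactly the fish-near-the-surface problem described above.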
Confirmation Bias: Text-Based Redaction
Confirmation bias can lead to the acceptance of inefficient processes through acclimation to the inefficiency. One example is found in the redaction of personally identifiable information, which is a recurring legal challenge when responding to e-discovery or Freedom of Information Act requests.
The text-bias approach to redaction combines searches with manual review. A search for the low-hanging fruit looks for text-string patterns for things like Social Security account numbers. When found, the characters in the patterns can be programmatically redacted. For example, a search for “NNN-NN-NNNN,” where “N” is any number from 0 to 9, could permit programmatic redaction of the characters in a matching string. Pattern searches can be supplemented by searches for flags or indicators that such PII is included, such as the terms Social Security number or SSAN. Flag searches can then assist people performing manual redactions by pointing to where PII occurs even when it did not match a pattern.
However, pattern and flag searches leave unaddressed the issue of handwritten documents, scanned or partially illegible documents, typographical errors, and a host of other potential problems. Manually redacting documents is time-consuming, expensive, and error-prone. Comprehensive redactions can require page-by-page review.
Classification Alternative. Just as image-only documents are not distributed randomly across an organization, PII data is not distributed randomly among all documents in an organization. For example, credit card applications will almost all have Social Security numbers, while Material Safety Data Sheets will have none. In this scenario, it will be more efficient and result in reusable work product to classify the documents, identify the classifications that regularly display PII, and then redact the zones where the redactions need to occur—even if they are in handwriting.
Lawyers who use text-based technology should at the very least be aware of pro-text biases that can affect how well e-discovery processes work. They may also want to evaluate nontextual information governance technology that can be used to collect, logically unitize, and evaluate e-discovery.
John Martin is the founder and CEO of BeyondRecognition LLC in Houston, Texas.
Copyright © 2018, American Bar Association. All rights reserved. This information or any portion thereof may not be copied or disseminated in any form or by any means or downloaded or stored in an electronic database or retrieval system without the express written consent of the American Bar Association. The views expressed in this article are those of the author(s) and do not necessarily reflect the positions or policies of the American Bar Association, the Section of Litigation, this committee, or the employer(s) of the author(s).