Email Threading
When we email with a group of recipients, we may send many individual messages that make up one long conversation “thread.” Your phone or your email program can keep these emails grouped together if your settings are configured to do so. Yet when we gather emails and load them into a review tool, they can be pulled out of their conversation group and scattered among thousands of other emails. A review is much more efficient if all the emails in a conversation stay grouped together and get reviewed at the same time. Better still, if you gather emails from multiple witnesses, a review is more efficient if you can keep the conversations together across witnesses. Email threading makes this possible.
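Under the hood, threading tools reconstruct conversations from information carried in the emails themselves, such as the reply headers every message contains. Here is a minimal sketch in Python of that grouping step, assuming a simplified email structure; the field names are illustrative, not any particular platform’s schema:

```python
from collections import defaultdict

def thread_emails(emails):
    """Group emails into conversation threads by walking reply chains."""
    by_id = {e["message_id"]: e for e in emails}

    def find_root(email):
        seen = set()
        # Walk up the reply chain until we reach the first message.
        while email["in_reply_to"] in by_id:
            if email["message_id"] in seen:  # guard against malformed loops
                break
            seen.add(email["message_id"])
            email = by_id[email["in_reply_to"]]
        return email["message_id"]

    threads = defaultdict(list)
    for e in emails:
        threads[find_root(e)].append(e)
    return threads

# Example: three messages in one thread, one standalone message.
emails = [
    {"message_id": "<a@x>", "in_reply_to": None,    "subject": "Budget"},
    {"message_id": "<b@x>", "in_reply_to": "<a@x>", "subject": "RE: Budget"},
    {"message_id": "<c@x>", "in_reply_to": "<b@x>", "subject": "RE: Budget"},
    {"message_id": "<d@x>", "in_reply_to": None,    "subject": "Lunch"},
]
for root, thread in thread_emails(emails).items():
    print(root, [m["subject"] for m in thread])
```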
Threading also identifies which emails in a thread contain unique information, such as an attachment that is not attached to other emails in the thread. You can then choose whether to review the entire conversation or to save time by reviewing and producing only the emails with unique content. There are pros and cons to each approach. Reviewing and producing only “inclusive” emails saves time for all parties. However, you lose the metadata (e.g., to/from/cc/bcc/subject line/date sent) for each individual email in the conversation, which reduces the parties’ ability to filter for all emails to or from a certain person; the only metadata available will be that of the inclusive email. Regardless of whether you review and produce only inclusive emails, grouping the emails by conversation still makes a review more efficient and makes it easier to follow the trail of an email if you find one that is particularly relevant to the case.
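To illustrate the “inclusive” idea mentioned above, here is a minimal, hypothetical sketch: treat an email as non-inclusive only if a later email in the thread quotes its full body and carries all of its attachments. Commercial tools apply far more refined logic; this just shows the shape of the test.

```python
def flag_inclusive(thread):
    """Mark each email in a thread as inclusive or not.

    `thread` is a list of dicts with 'body' and 'attachments' keys,
    ordered oldest to newest -- an assumed, simplified structure.
    """
    for i, email in enumerate(thread):
        later = thread[i + 1:]
        # Is this email's full text quoted in a later message?
        body_carried = any(email["body"] in e["body"] for e in later)
        # Are all of its attachments present on later messages?
        later_attachments = {a for e in later for a in e["attachments"]}
        attachments_carried = all(
            a in later_attachments for a in email["attachments"])
        # Inclusive = contributes text or an attachment found nowhere later.
        email["inclusive"] = not (body_carried and attachments_carried)
    return thread
```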
Near-Duplicate Identification
Near-duplicate identification uses the text of a document to find textually similar documents that cannot be matched as exact duplicates. Every electronic document can be assigned an identification number, called a “hash value,” computed from its content. Because of this, identifying exact duplicates is not difficult when all the documents are processed from their native files in the same tool. But if a third party processed a document in a different tool, or did not provide the native hash value along with the production, the hash value will not be available for comparison. Even when it is available, documents that are very close but not identical (e.g., one has an extra period) do not share the same hash value, so it is common to end up reviewing similar documents multiple times. Near-duplicate identification addresses this problem by evaluating the text of a document to determine whether it is a close duplicate of another document, rather than relying on the hash value. This is very useful when you locate a key document and want to find every instance in the database where it appears.
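A minimal sketch of both ideas in Python: hashing flags byte-for-byte duplicates, while comparing overlapping word sequences (“shingles”) catches documents that differ only trivially. The similarity measure here (Jaccard overlap of four-word shingles) is one common choice; commercial tools use their own algorithms and thresholds.

```python
import hashlib
import re

def exact_hash(data: bytes) -> str:
    # MD5 is commonly used to fingerprint documents for exact deduplication.
    return hashlib.md5(data).hexdigest()

def shingles(text: str, size: int = 4) -> set:
    # Overlapping runs of `size` words, ignoring case and punctuation.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a: str, b: str) -> float:
    # Jaccard similarity: shared shingles over total distinct shingles.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

doc1 = "The board approved the merger at the quarterly meeting in May."
doc2 = "The board approved the merger at the quarterly meeting in May!"
print(exact_hash(doc1.encode()) == exact_hash(doc2.encode()))  # False
print(similarity(doc1, doc2))                                   # 1.0
```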
Concept Searching
Email threading and near-duplicate identification are the easier analytics tools to understand and accept because they are based on concrete information about the document, such as where it falls in a conversation or its actual text. Another tool, “conceptual analytics,” is where the ship starts to veer into what some consider uncharted territory. But the technology is not foreign to us. You’ve likely seen a version of it in action on social media sites when the advertisements reflect your interests. Search engines on the Internet also use concept searching to return relevant results. We have all been using concept searching for years; it is just new(ish) to the legal world. Understanding how it works can help take some of the mystery away from an unfamiliar review method.
To take advantage of conceptual analytics, the review program builds an analytics index that charts the relationships of the words in a document to one another (not the words themselves). From those relationships, the computer derives concepts based on how the words in your documents relate to each other. Once the concepts are created, the computer can find documents based on a concept search rather than a search for a particular term. And if you find a document of particular interest, you can search for other documents that are conceptually similar to it. Different programs offer different ways to visualize the concepts in graphic form, making it easier to explore the concepts in your data.
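One common technique behind these tools is latent semantic analysis, which maps words and documents into a shared “concept” space based on how words co-occur. Here is a minimal sketch using scikit-learn, with invented example documents; vendors use their own, more sophisticated variants.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The board approved the merger after reviewing the valuation.",
    "Quarterly earnings beat analyst estimates this year.",
    "Acquisition talks stalled over the purchase price.",
    "The company picnic was rescheduled because of rain.",
]

# Index how words relate across documents, then reduce to "concept" space.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
concepts = svd.fit_transform(tfidf)

# A concept search: no single document contains both query terms, yet the
# deal-related documents tend to score highest in concept space.
query = svd.transform(vectorizer.transform(["acquisition valuation"]))
print(cosine_similarity(query, concepts)[0].round(2))
```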
TAR (Predictive Coding)
TAR takes concept searching to another level. In TAR, a reviewer codes documents as either responsive or not responsive, and the computer extrapolates that coding to the entire data set based on the concepts contained in the documents. Statistics are gathered along the way to back up the process. Depending on the type of TAR, the software’s “training” happens either in rounds of review followed by quality-control rounds, or continuously, with the program reevaluating the entire data set as each new document is coded. The idea is that you can code an entire data set by reviewing only a few thousand documents, plus those documents the computer is not able to categorize.
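At its core, this is a text-classification workflow. Here is a minimal sketch of the idea using scikit-learn, with invented documents standing in for coded and uncoded records; commercial TAR platforms use their own models, features, and workflows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

coded_docs = [
    "merger negotiations and purchase price discussions",  # responsive
    "board approval of the acquisition agreement",         # responsive
    "office holiday party planning and catering",          # not responsive
    "parking garage access badge renewal",                 # not responsive
]
labels = [1, 1, 0, 0]
uncoded_docs = [
    "draft term sheet for the proposed acquisition",
    "cafeteria menu for next week",
]

# Train a classifier on the human-coded seed set.
vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(coded_docs), labels)

# Rank uncoded documents by predicted likelihood of responsiveness. In a
# continuous-active-learning workflow, this reranking would be repeated
# after every newly coded document.
scores = model.predict_proba(vectorizer.transform(uncoded_docs))[:, 1]
for doc, score in sorted(zip(uncoded_docs, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc}")
```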
The goals of TAR projects vary. Sometimes, TAR is used to identify responsive documents that will be produced without further review. Sometimes, the goal is to reach an acceptable level of recall of responsive documents, so that the nonresponsive documents can be set aside and never reviewed; the legal team then manually reviews all the responsive documents. Other times, the goal is simply to prioritize the documents most likely to be relevant so that they are reviewed first.
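“Recall” here means the share of all truly responsive documents that the process actually found. Because that true total is unknown, it is typically estimated by manually reviewing a random sample of the documents slated to be set aside. A small worked example, with invented figures:

```python
# Estimate recall from a random sample of the discard pile.
found_responsive = 9_000       # responsive docs the TAR process surfaced
discard_pile = 90_000          # docs set aside as nonresponsive
sample_size = 1_500            # random sample pulled from the discard pile
responsive_in_sample = 15      # found responsive on manual review

# Project the sample rate across the whole discard pile.
missed_estimate = discard_pile * responsive_in_sample / sample_size
recall = found_responsive / (found_responsive + missed_estimate)
print(f"Estimated missed documents: {missed_estimate:.0f}")  # 900
print(f"Estimated recall: {recall:.0%}")                     # about 91%
```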
No matter what the desired result is, the goal of TAR is not perfection. Relevant documents will be missed. But courts do not require perfection; they require a reasonable search for relevant documents. More and more courts accept (or even prefer) TAR. Our next article will focus exclusively on TAR—how courts view it and how, in greater detail, it can be used to streamline a document review.