Read the first part in this two-part series on e-discovery tools.
Classification Tools
Classification tools organize documents into groups or categories, or predict the likelihood that a document belongs to a category. Simple classification tools (such as document sorting based on date, size, or title/subject) are included in nearly all platforms, and more advanced tools, like email threading, near-deduplication, and predictive coding, are increasingly common. Use them to safely isolate relevant and irrelevant material that can be classified without (or with minimal) manual review.
Unsupervised learning tools can greatly accelerate review in most cases. Unsupervised machine learning tools group documents into pre-determined categories; that is, the categories are supplied by the machine rather than by the user. Many of these tools are designed for a specific and narrow purpose and are easy to use. Some group documents that the user can classify together as a batch, with little or no additional manual review (e.g., all near-duplicate drafts of a contract). Some group documents together in ways that make them easier to understand (e.g., email threading). Others enhance search capabilities so that users can better isolate relevant and irrelevant results (e.g., faceted search, latent semantic indexing, content extraction). Examples include:
- Near-duplicate detection: identifies all near-duplicates of a document (e.g., all drafts of a contract).
- Email threading: organizes emails into their original conversational order. Some software identifies emails that do not need to be separately reviewed because their text is reproduced in the body of a later reply. Users may then limit their review to the remaining “inclusive” emails, which contain the full text of the conversation.
- Find related/smart search/clustering: allows the user to select a document and find items that resemble it (though they are not necessarily near-duplicates) based on selected features.
- Latent semantic indexing: identifies patterns and relationships between terms based on their co-occurrence in the documents. It is used for multiple purposes, including improving Boolean search results (by reducing synonymy and polysemy problems), searching by topic or concept, grouping related documents, and enhancing predictive coding systems.
- Faceted search: a powerful tool that allows the user to explore a corpus of documents and “drill down” to the relevant ones by imposing filters. The filters correspond to “facets” of the documents, which include metadata (such as file type, author, sender, date created, or date sent) and content in the text. It also makes it easier to apply metadata filters across different document types.
- Sentiment analysis: identifies text reflecting particular sentiments, allowing users to filter their search on this basis (e.g., to find all angry emails).
- Content extraction: analyzes text to recognize and extract named entities (proper names of people, places, times/dates, money, events) and identifiers such as social security numbers and phone numbers. Some tools also detect nude images, the geographical origin of data, and more.
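To make the idea behind near-duplicate detection concrete, here is a minimal sketch that compares documents by their overlapping three-word sequences ("shingles") and scores each pair with Jaccard similarity. This is one common textbook approach, not the method of any particular e-discovery platform. The document names, sample text, and the 0.5 threshold are assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5):
    """Return (id, id, similarity) for every pair at or above the threshold."""
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id]) for doc_id in ids}
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = jaccard(sets[first], sets[second])
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs


# Hypothetical documents: two drafts of one clause and an unrelated email.
docs = {
    "draft_v1": "The supplier shall deliver the goods within thirty days of the order date.",
    "draft_v2": "The supplier shall deliver the goods within sixty days of the order date.",
    "email_1": "Lunch on Friday? Let me know what time works for you.",
}

for pair in near_duplicates(docs):
    print(pair)
```

The two drafts differ by a single word and are flagged as a pair, while the unrelated email is not. Comparing every pair is only practical for small sets; tools built for large collections would need a more scalable technique.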
Supervised machine learning tools are extremely effective in appropriate cases. With supervised machine learning tools, the user defines the categories for classification. In other words, the machine learns new categories from the user—e.g., to distinguish between relevant and irrelevant material—by analyzing how the user classified a set of documents.
Supervised machine learning is also known as “predictive coding.” The term does not designate a specific technology, but from the standpoint of the user, these tools all operate in the same basic way. The user classifies a set of documents (e.g., as “relevant” or “irrelevant”), and the machine “learns” from the user’s classifications by analyzing the features of the documents in each category. The algorithm then uses that information either to classify the documents that have not been reviewed or to predict the likelihood that they belong to a given category (e.g., 87% likelihood of relevance). Keep the following points in mind:
- Predictive coding isn’t appropriate for all cases. It doesn’t work well on small datasets; a few hundred thousand documents are typically needed for it to be worthwhile, though the minimum depends on the specific tool.
- Don’t assume you should include every legal issue within the same predictive coding model. It may be better to train multiple models rather than one.
- Don’t confuse relevance and responsiveness during training.
- Select the right documents to include in the training set (e.g., skip spreadsheets).
- Monitor and test your “precision” and “recall” rates.
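As a reminder of what the last point measures: precision is the share of documents the tool marked relevant that really are relevant, and recall is the share of all truly relevant documents that the tool found. The sketch below computes both from a validation sample. The function name and the document IDs are hypothetical and only serve the illustration.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for the 'relevant' class.

    predicted: set of document IDs the tool marked relevant.
    actual: set of document IDs a human reviewer marked relevant.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical validation sample: the tool flags five documents,
# the reviewer finds four relevant, and three appear in both sets.
tool_says_relevant = {1, 2, 3, 4, 5}
reviewer_says_relevant = {1, 2, 3, 6}

precision, recall = precision_recall(tool_says_relevant, reviewer_says_relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sample, three of the five flagged documents are relevant (precision 0.60) and three of the four relevant documents were found (recall 0.75).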
Modern predictive coding systems continue to “learn” as additional documents are classified by users throughout the review. Thus, they are usually easy to integrate into your workflow. You can use them to help prioritize the documents you review or to cull irrelevant material. Regardless of the tools you select, however, you must understand their capabilities and limitations to implement them effectively.
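The sketch below illustrates the general idea of a model that keeps learning during review and is used to prioritize unreviewed documents. Because predictive coding does not designate a specific technology, the choice of a naive Bayes classifier here is an assumption made for simplicity, and the class name, sample documents, and labels are all hypothetical. Commercial tools may work quite differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny multinomial naive Bayes model with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of coded documents
        self.vocab = set()

    def learn(self, text, label):
        """Update the model with one reviewer-coded document."""
        words = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def probability(self, text, label="relevant"):
        """Estimate the probability that text belongs to label."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for lab, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[lab] / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            log_scores[lab] = score
        # Convert log scores into normalized probabilities.
        top = max(log_scores.values())
        exp = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        return exp.get(label, 0.0) / sum(exp.values())


# Hypothetical documents already coded by a reviewer.
model = NaiveBayes()
model.learn("pricing agreement with the distributor", "relevant")
model.learn("revised distributor pricing terms attached", "relevant")
model.learn("office holiday party on friday", "irrelevant")
model.learn("parking garage closed on friday", "irrelevant")

unreviewed = {
    "doc_a": "friday party parking details",
    "doc_b": "distributor pricing agreement draft",
}

# Prioritize review: most likely relevant documents first.
ranked = sorted(unreviewed, key=lambda d: model.probability(unreviewed[d]), reverse=True)
print(ranked)

# The reviewer codes one more document and the scores shift immediately.
before = model.probability(unreviewed["doc_a"])
model.learn("friday call about distributor pricing", "relevant")
after = model.probability(unreviewed["doc_a"])
print(f"doc_a: {before:.2f} -> {after:.2f}")
```

The ranking puts the distributor pricing draft ahead of the party email. After the reviewer codes a relevant document that mentions Friday, the model's estimate for doc_a rises, which shows why scores can change as review continues.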