Every Move You Make . . .
There are more ways to communicate and more ways to collect, store, and transmit data than ever before. There is more data than ever before—by far. A byte of data is the amount of storage space necessary to store a single character, such as the number 1 or the letter A. The World Economic Forum estimates that by 2020, there will be in excess of 40 zettabytes of data—this is the equivalent of 40 times more bytes than there are stars in the observable universe.
Computers and devices gather data that tracks everywhere we go, what we read, what we buy, who we talk to, and what we talk about. Further, the tools to organize and analyze that data are more sophisticated than ever. By combining advanced analytical tools with machine learning algorithms (sets of instructions that let a computer compare information, find patterns, and improve or “learn” based on feedback), we are able to find far more specific information in a large set of data faster and more accurately than ever before. A vast and ever-increasing portion of the economy is based on gathering and analyzing this data. Putting the pieces together and learning from them can tell an amazingly detailed story, and doing so has become an increasingly important driver of economic growth, led by companies such as Google, Amazon, and Facebook, to name just a few.
Notwithstanding all that, in the vast majority of legal matters, the documents and data used to develop the story are analyzed using one of the most limited, imprecise technologies available: keyword searches. The “art” of keyword searching involves guessing what words, or combinations of words, might possibly have been used by people communicating about key events in a case. The documents and communications that match those guesses are then collected and reviewed. This approach is rarely very effective, and academic studies of the accuracy of keyword searching have repeatedly reached that conclusion. Even with more detailed and varied data available than ever before, we look at only a small subset of data types. And the tools most commonly used (often mandated by courts) are primitive and largely ineffective compared to what is used in the wide world outside of litigation.
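To make the limitation concrete, here is a minimal sketch of the guess-and-match process described above. The three documents and the search terms are invented for illustration, and the helper name `keyword_hits` is hypothetical; it is not drawn from any particular review platform.

```python
import re

# Hypothetical documents, invented for this illustration.
documents = {
    "doc1": "We should buy the stock before the merger is announced.",
    "doc2": "Pick up shares ahead of the big news next week.",
    "doc3": "Lunch on Friday? The merger of our calendars is overdue.",
}

# Guessed search terms, as a reviewer might draft them.
keywords = ["stock", "merger"]


def keyword_hits(docs, terms):
    """Return the ids of documents containing any term as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))


hits = keyword_hits(documents, keywords)
print(hits)
```

In this toy set, the search returns the innocuous calendar note because it happens to contain “merger,” and it misses the second message entirely because its author wrote “shares” and “big news” instead of the guessed terms.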
Not only are the standard search tools highly ineffective; the myopic habit of looking at each type of data in isolation, rather than putting different types together, also severely limits effectiveness. The facts of a case consist of more than a set of individual communications such as emails and text messages. The real story also includes related physical actions, interactions, transactions, and events. Suppose two employees in a company who never work together and don’t live in the same city conduct suspicious communications via text and email that may relate to improper insider trading. Would you want to limit yourself to just the text and email messages themselves? Or would you prefer to consolidate and analyze concurrent wire transfers, stock trades, banking records, internet search history, travel patterns, or outside events that may bear on the behavior you are investigating?
To sum up, a two-dimensional approach to understanding complex interactions and events that occurred in the real world, in real time and space, ignores many of the most important elements of the available information.
Why would people intentionally limit themselves to such a restricted view of reality in trying to find the facts necessary to tell the story of their case? The answer is that there is no good reason. It’s probably more a case of “that’s how it’s always been done.” Lawyers tend to be creatures of habit. They learn from day one in law school the importance of following precedent. The most important rule that practicing lawyers learn is that you should never expose your client to unnecessary risk. Let’s just say that this is not exactly a recipe for groundbreaking innovation.
Even within the limited universe of text and email messages, most commercially available technology is limited to analyzing only the text of the communications and certain metadata. Without a doubt, that’s a good start. But such a limited view does not tell the whole story. Limiting yourself in this way is like being a detective who can see only black-and-white, two-dimensional images of physical objects and videos of potential witnesses and suspects. There are many more “layers” or “dimensions” of communication beyond pattern-matching individual words in texts and emails.
Getting a Clue
Some of the most significant additional dimensions used to analyze data outside of the legal industry go beyond simply matching words that appear in communications. These tools can be used individually or in combination to dramatically increase the speed and accuracy of identifying important information in an investigation.
- Sentiment analysis. A lot of information can be gathered from the way people communicate beyond simply looking at the individual words that they use. A whole science has developed, originating in the intelligence community, that can assess the emotional state, mood, and “sentiment” of the person who is communicating. This is based on many factors including word and phrase choice, punctuation, and the length of sentences. This technology has become a mainstay in the advertising and media worlds, where gauging people’s reactions to products, news topics, or other people has become very big business. The added dimension of emotional state can be an important additional avenue of pursuit. Receiving an email that states, “Yes, I understand. Thank you” might have a very different meaning and impact if written in ALL CAPS, RED, BOLD, AND WITH SEVERAL EXCLAMATION POINTS AT THE END!!!!!
- Topic modeling. The same words can mean very different things depending on how they appear in combination, which is another way of describing their context. “Topic modeling” describes technology that uses the frequency and combinations of words in communications to infer what those communications are about, or their “topics.” For example, the words “apple,” “growth,” “climate,” and “price” appearing together can mean very different things depending on whether you are talking about the effect of weather on growing apples or about the global technology company.
- Named entity recognition. As humans, we look at words as symbols that represent things, ideas, and activities in the world, not just as collections of letters. Named entity recognition allows you to look for and identify things by type rather than just matching the letters in each word as in keyword searching. For example, to find everywhere that any person is referred to in a large set of communications, you don’t need to create a list of every single person in the world. Instead, you can have the technology identify every place where an entity of the type “person” appears.
- Concurrent activity. One of the advantages of the vastly increased processing power of computers is that disparate sets of information and data can be analyzed and cross-referenced rapidly. It may be interesting to find out that two people were communicating via email about buying a certain stock based on improper insider information. But wouldn’t it be more informative to concurrently track their financial information, transfers of funds, stock purchases, and the timing and content of specific phone or text conversations?
- Outside events. A frequently overlooked area in an investigation is what events are taking place in the world generally at the same time as the potentially suspicious activity. This can provide useful background information and can sometimes suggest motives or reasons for activity. For example, a potentially suspicious stock trade could have been motivated by insider information, or, if there was a widely discussed recession looming based on a Federal Reserve report, other motives may have driven the activity. Technology available to everyone with a computer and an internet connection, such as Google, can be used in this type of analysis, but it is often overlooked.
- Communication networks. Highly developed tools are available to track and analyze communications between different people over various platforms. These tools can identify patterns based on the timing and frequency of the communications as well as the subject matter. Spikes in the frequency of communications between individuals about specific topics can be helpful in figuring out what may have happened related to an investigation.
- Location/physical movement and activity. In some cases, where people go and what they do can be as important as what they are actually saying. Given the high prevalence of smartphones (a.k.a. human tracking devices) as well as security cameras, card swipe and access control systems, it takes an extraordinary effort to avoid having everywhere you go and everything you do thoroughly tracked and documented. AI-based systems that use machine learning to search and analyze this data may be a readily available source of useful information.
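As a toy illustration of the sentiment analysis item in the list above, the sketch below measures two surface cues of emphasis in a message: the share of letters in capitals and the number of exclamation points. It is only a simplified stand-in under stated assumptions. Real sentiment analysis tools rely on trained models and many more factors, and the function name `emphasis_cues` is hypothetical.

```python
def emphasis_cues(text):
    """Count simple surface cues of emphasis in a message."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        # Fraction of alphabetic characters written in capitals.
        "upper_ratio": upper / len(letters) if letters else 0.0,
        # Number of exclamation points anywhere in the message.
        "exclamations": text.count("!"),
    }


calm = emphasis_cues("Yes, I understand. Thank you")
loud = emphasis_cues("YES, I UNDERSTAND. THANK YOU!!!!!")
print(calm)
print(loud)
```

The words in the two messages are identical, so a keyword search treats them as the same. The cues differ sharply, which is the extra dimension the sentiment item describes. Formatting such as color or bold type is not visible in plain text and would have to come from the message’s markup or metadata.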
Open Your Eyes
The next time you are considering potential evidence in a case, think broadly—don’t limit yourself to emails, texts, and documents. What data may exist about internal and external events surrounding the important actions in your case? There may be databases of information about transactions, physical movements, and actions by key players. Potentially relevant communications may not be contained solely within one platform. Consider combining and analyzing communication threads across different media such as email, text, phone, messaging applications, social media, and so on. Rather than focusing solely on potential keywords, consider the broader categories of information you are seeking (e.g., named entity recognition) or the possible subjects of communication that might be relevant even if you don’t know the specific words used (e.g., topic modeling). Consider the possible significance of the emotional state of the individuals you are investigating (e.g., sentiment analysis). For example, in a case of suspected trade-secret theft, most people, aside from professional criminals, become nervous or agitated when doing something they believe is wrong.
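One way to picture combining communications across platforms and setting them beside transaction data is the sketch below. All of the records are invented for illustration, the field names (`when`, `who`, `channel`, `action`) and both helper functions are hypothetical, and the two-hour window is an arbitrary assumption rather than a recommended setting.

```python
from datetime import datetime, timedelta

# Hypothetical records, invented for this illustration.
emails = [
    {"when": datetime(2019, 3, 4, 9, 15), "who": "A", "channel": "email"},
]
texts = [
    {"when": datetime(2019, 3, 4, 9, 40), "who": "B", "channel": "text"},
    {"when": datetime(2019, 3, 6, 18, 5), "who": "A", "channel": "text"},
]
trades = [
    {"when": datetime(2019, 3, 4, 10, 30), "who": "B", "action": "buy 500 shares"},
    {"when": datetime(2019, 3, 12, 14, 0), "who": "A", "action": "sell 100 shares"},
]


def merged_timeline(*sources):
    """Combine records from several platforms into one list sorted by time."""
    return sorted((r for src in sources for r in src), key=lambda r: r["when"])


def nearby_events(messages, events, window):
    """Pair each message with every event that occurred within `window` of it."""
    pairs = []
    for msg in messages:
        for ev in events:
            if abs(ev["when"] - msg["when"]) <= window:
                pairs.append((msg, ev))
    return pairs


timeline = merged_timeline(emails, texts)
pairs = nearby_events(timeline, trades, timedelta(hours=2))
for msg, ev in pairs:
    print(msg["when"], msg["channel"], "->", ev["when"], ev["action"])
```

With this sample data, the email and the text sent on the morning of March 4 both pair with the share purchase made shortly afterward, while the later text and the later sale pair with nothing. Proximity in time is only a lead for further review, not proof of a connection.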
As data continues to proliferate, the need to find critical information faster will make these approaches essential components of the legal toolkit. All of the technology described above is well established and employed extensively outside of the legal industry. Yet many lawyers continue to restrict themselves to the most basic tools and techniques, such as keyword searches of email and documents. It is imperative that lawyers seek the guidance they need from knowledgeable people within their firms or legal departments, or consult outside experts, to understand all of the available options. After all, every lawyer interested in getting information from data as quickly and accurately as possible (i.e., every lawyer) needs to employ the most accurate, timely, and cost-effective ways to tell their client’s story. It’s elementary.
Tom Barnett is special counsel, chief, data science analysis & investigation with Paul Hastings in Los Angeles, California.