Lawyers know a lot about a wide range of subjects—the result of constantly dealing with a broad variety of factual situations. Nevertheless, most lawyers might not know much about machine learning and how it impacts lawyers in particular. This article provides a short and simple guide to machine learning geared to attorneys.

###### November 01, 2017

# A Simple Guide to Machine Learning

###### By Warren E. Agin

“Artificial intelligence” (AI) usually refers to machine learning in one form or another. It might appear as the stuff of science fiction, or perhaps academia, but in reality machine learning techniques are in wide use today. Such techniques recommend books for you on Amazon, help sort your mail, find information for you on Google, and allow Siri to answer your questions.

In the legal field, products built on machine learning are already starting to appear. Lexis and Westlaw now incorporate machine learning in their natural language search and other features. ROSS Intelligence is an AI research tool that finds relevant “phrases” from within cases and other sources in response to a plain language search. Through the use of natural language processing, you can ask ROSS questions in fully formed sentences. Kira Systems uses machine learning to quickly analyze large numbers of contracts.

These are just two of dozens of new, machine learning–based products. On the surface, these tools might seem similar to those currently available—but they actually do something fundamentally different, making them not only potentially far more efficient and powerful, but also disruptive. For example, machine learning is the “secret sauce” that enables ridesharing services like Uber to efficiently adjust pricing to maximize both the demand for rides and the availability of drivers, predict how long it will take a driver to pick you up, and calculate how long your ride will take. With machine learning, Uber and similar companies are rapidly displacing the traditional taxicab service. Understanding what machine learning is and what it can do is key to understanding its future effects on the legal industry.

## What Is Machine Learning?

Humans are good at *deductive reasoning*. For example, if I told you that a bankruptcy claim for rent was limited to one year’s rent, you would easily figure out the amount of the allowed claim. If the total rent claim was $100,000, but one year’s rent was $70,000, you would apply the rule and deduce that the allowable claim is $70,000. No problem. You can determine the result easily, and you can also easily program a computer to consistently apply that rule to other situations. Now reverse the process. Assume I told you that your client was owed $100,000 and that the annual rent was $70,000, and then told you that the allowable claim was $70,000. Could you figure out how I got that answer? You might guess that the rule is that the claim is limited to one year’s rent, but could you be sure? Perhaps the rule was something entirely different. This is *inductive reasoning*, and it is much more difficult to do.

Machine learning techniques are computational methods for figuring out “the rules,” or at least approximations of the rules, given the factual inputs and the results. Those rules can then be applied to new sets of factual inputs to deduce results in new cases.

For instance, consider number series games. For example:

2 4 6 8 10 ?

The next number is 12, right? Here, the inputs are the series of numbers 2 through 10, and from this we induce the rule for getting the next number—add 2 to the last number in the series. Here is another one:

1 1 2 3 5 ?

The next number is 8. This is a Fibonacci sequence, and the rule is that you add together the last two numbers in the series.

With these games, what you are doing in your head is looking at a series of inputs and answers, and using inductive reasoning to figure out the rule. You then apply that rule to get the next number. Broken down a little, the prior game looks like this:

Input Result

1 1 2

1 1 2 3

1 1 2 3 5

1 1 2 3 5 ?

We look at the group of inputs and induce a rule that gives us the displayed results. Once we have derived a workable rule, we can apply it to the last row to get the result 8, but more importantly we can apply it to any group of numbers in the Fibonacci sequence. This is a simple (very simple) example of what machine learning does.

Let’s take a more complex example. Assume we wanted to predict the amount of a debtor’s counsel’s fees in a Chapter 11 case. We could take a look at cases in the past and get information about each: for example, the number of creditors, the debtor’s market capitalization, where the case was filed, and, of course, the eventual fee awarded to the debtor’s counsel. We might compare these numbers and discover that if we graphed the fee awards against the debtor’s market capitalization, it looks something like figure 1 (purely hypothetically).

There seems to be a trend. The larger the market capitalization (the *x* axis), the higher the legal fee seems to be. In fact, the data points look sort of like a line. We can calculate the line that best fits the data points using a technique called linear regression (see fig. 2).

We can even see the equation that the line represents. You take the market capitalization for the debtor, multiply it by 4.92 percent, and add $116,314 (these two variables are the “weighting mechanisms,” explained in detail below). This is called a “prediction model.” The prediction model might not perfectly fit the data used to create it—after all, not all the data points fall exactly on the line—but it provides a useful approximation. That approximation will provide a pretty good estimate for legal fees in future cases (that’s what the R^{2} number on the graph tells us). For the record, the data here is imaginary, hand-tailored to demonstrate the methodology.

Naturally, real-world problems are more complex. Instead of a short series of numbers as inputs, a real-world problem might use dozens, perhaps thousands, of possible inputs that might be applied to an undiscovered rule to obtain a known answer. We also do not necessarily know which of the inputs the unknown rule uses!

To solve a more complex problem, we might begin by building a database with the relevant points of information about a large number of cases, in each instance collecting the data points that we think might affect the answer. To build our prediction model, we would select cases at random to use as a “training set,” putting the remainder aside to use as a “test set.” Then we would begin to analyze the various relationships among the data points in our training set using statistical methods. Statistical analytics can help us identify the factors that seem to correlate with the known results and the factors that clearly do not matter.

Advanced statistical methods might help us sort through the various relationships and find an equation that takes some of the inputs and provides an estimated result that is pretty close to the actual results. Assuming we find such an equation, we then try it out on the test set to see if it does a good job there as well—predicting results that are close to the real results. If our predictive model works on our test set, then we consider ourselves lucky. We can now predict a debtor’s counsel’s legal fees ahead of time; at least until changing circumstances—perhaps rules changes, a policy change at the U.S. Trustee’s Office, or the effect our very own model has on which counsel get hired for cases—render our model inaccurate. If our model does not work on the test set data, than we consider it flawed and go back to the drawing board.

For real-world problems, this kind of analysis is difficult. The job of collecting the data, cleaning it, and analyzing it for relationships takes a lot of time. Given the large number of potential variables that affect real-world relationships, identifying those that matter is somewhat a process of trial and error. We might get lucky and generate results quickly, we might invest substantial resources without finding an answer at all, or the relationships might simply prove to be too complex for the methods I described to work adequately. Inductive reasoning is difficult to do manually. This brings us to machine learning. Machine learning can efficiently find relationships using inductive reasoning.

As an example of what machine learning can do, consider these images:

Assume we want to set up a computer system to identify these handwritten images and tell us what letter each image represents. Defining a rule set is too difficult for us to do by hand and come up with anything that is remotely usable, but we know there *is* a rule set. The letter *A* is clearly different from the letter *P*, and *C* is different from *G*, but how do you describe those differences in a way a computer can use to consistently determine which image represents which letter?

The answer is that you don’t. Instead, you reduce each image to a set of data points, tell the computer what the image is of, and let the computer induce the rule set that reliably matches all the sets of data points to the correct answers. For the image recognition problem, you might begin by defining each letter as a 20 pixel by 20 pixel image, with each pixel having a different grayscale score. That gives you 400 data points, each with a different value depending on how dark that pixel is. Each of these sets of 400 data points is associated with the answer—the letter they represent. These sets become the training set, and another database of data points and answers is the test set. We then feed that training set into our machine learning algorithm—called a “learner”—and let it go to work.

What does the learner actually do? This is a little more difficult to explain, partially because there are a lot of different types of learners using a variety of methods. Computer scientists have developed a number of different kinds of techniques that allow a computer program to infer rule sets from defined sets of inputs and known answers. Some are conceptually easier to understand than others. In this article, I describe, in simple terms, how one of these techniques works. Machine learning programs will use a variation of one or more of these techniques. The most advanced systems include several techniques, using the one that fits the specific problem best or seems to generate the most accurate answers.

In general, think of a learner as including four components. First, you have the input information from the training set. This might be data from a structured, or highly defined, database, or unstructured data like you might find in a set of discovery documents. Second, you have the answers. With a structured database, a particular answer will be closely identified with the input information. With unstructured information, the answer might be a category, such as which letter an image represents or whether a particular email is spam; or the answer might be part of a relationship, such as text in a court decision that relates to a legal question asked by a researcher. Third, you have the learning algorithm itself—the software code that explores the relationships between the input information and the answers. Finally, you have weighting mechanisms—basically parts of the algorithm that help define the relationships between the input information and the answers, within the confines of the algorithm. Once you have these four components, the learner simply adjusts the weighting mechanisms in a controlled manner until it finds values for the weighting mechanisms that allow the algorithm to accurately match the input information with the known correct answers.

Let’s see how this might work with my hypothetical system for estimating a debtor’s counsel’s fees. In the example (see fig. 2), the market capitalizations are the input information (*X*). The known legal fees for each case are the answers (*Y*). For purposes of illustration, let’s assume the algorithm is *Y* = *aX* + *b* (a vast simplification, but I’m going to use it to demonstrate a point). The weighting mechanisms are the two variables *a* and *b*. Instead of manually calculating the values of *a* and *b* using linear regression, a machine learning program might instead try different values of *a* and *b*, each time checking to see how well the line fits the actual data points mathematically. If a change in *a* or *b* improves the fit of the line, the learner might continue to change *a* and *b* in the same direction, until the changes no longer improve the line’s fit.

Of course, in my example it is easier just to calculate *a* and *b* using linear regression techniques. I don’t even need to have math skills to do it—the functionality is built right into Microsoft Excel and other common software products. Given a spreadsheet with the data, I can perform the calculation with a few mouse clicks. Machine learning programs, however, can figure out the relationships when there are millions of data points and billions of relationships—when modeling the systems is impossible to do by hand because of the complexity. Machine learning systems are limited only by the quality of the data and the power of the computers running them.

Now, let’s look at an example of a machine learning system.

## Neural Networks

The term “neural network” conveys the impression of something obscure and mysterious, but it is probably the easiest form of a machine learning system to explain to the uninitiated. This is because it is made up of layers of a relatively simple construct called a “perceptron.”

Credit: https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/

This perceptron contains four components, the first being one or more inputs represented by the circles on the left. The input is simply a number, perhaps between 0 and 1. It might represent part of our input information, or it might be the output from another perceptron.

Second, each input number is given a weight—a percentage by which the input is multiplied. For example, if the perceptron has four inputs of equal importance, each input is multiplied by 25 percent. Alternatively, one input might be multiplied by 70 percent while the other three are each multiplied by 10 percent, reflecting that one input is far more important than the others.

Third, these weighted input numbers are added to generate a weighted sum—a single number that reflects the weights given the various inputs.

Fourth, the weighted sum is fed into a step function. This is a function that outputs a single number based on the weighted sum. A simple step function might output a 0 if the weighted sum is between 0 and 0.5, and a 1 if the weighted sum is between 0.5 and 1. Usually a perceptron will use a logarithmic step function designed to generate a number between, say, 0 and 1 along a logarithmic scale so that most weighted values will generate a result at or near 0, or at or near 1, but some will generate a result in the middle.

Some systems will include a fifth element: a “bias.” The bias is a variable that is added or subtracted from the weighted sum to bias the perceptron toward outputting a higher or lower result.

In summary, the perceptron is a simple mathematical construct that takes in a bunch of numbers and outputs a single number. By computing the weighted sum of the inputs, running that number through the step function, and adjusting the result using a final bias, the perceptron tells you whether the collection of inputs produces a result above or below a threshold level. This mechanism works much like a switch. The result of that switch might be fed to another perceptron, or it might relate to a particular “answer.” For example, if your learner is doing handwriting recognition, you might have a perceptron that tells you the image is the letter *A* based on whether the output number is closer to a 1 than a 0.

In a neural network, the perceptrons typically are stacked in layers. The first layers receive the input information for the learner, and the last layer outputs the results.

In between are what are called “hidden layers” of perceptrons, each taking in one or more input numbers from a prior layer and outputting a single number to one or more perceptrons in the next layer. By stacking the layers of perceptrons, the “deep learner” acts a little bit like a computer circuit, one whose operations are programmed by the changes in the weights.

The computer scientist building the neural network determines its design—how many perceptrons the system uses, where the input data comes from, how the perceptrons connect, what step function gets used, and how the system interprets the output numbers. However, the learner itself decides what weights are given to each input as the numbers move through the network, and what biases are applied to each perceptron. As the weights and biases change, the outputs will change. The learner’s goal is to keep adjusting the weights and biases used by the system until the system produces answers using the input information that most closely approximate the actual, known answers.

Returning to the handwriting recognition example, remember that we broke down each letter image into 400 pixels, each with a grayscale value. Each of those 400 data points would become a input number into our system and be fed into one or more of the perceptrons in the first input layer. Those outputs would pass through some hidden layers in the middle. Finally, we would have an output layer of 26 perceptrons, one for each letter. The output perceptron with the highest output value will tell us what letter the system thinks the image represents.

Then, we pick some initial values for the weights and biases, run all the samples in our training set through the system, and see what happens. Do the output answers match the real answers? Probably not even close the first time through. So, the system begins adjusting weights and biases, with small, incremental changes, testing against the training set and continuously looking for improvements in the results until it becomes as accurate as it is going to get. Then, the test set is fed into the system to see if the set of weights and biases we just determined produces accurate results. If it does, we now have an algorithm that we can use to interpret handwriting.

It might seem a little like magic, but even a relatively simple neural network, properly constructed, can be used to read handwriting with a high degree of accuracy. Neural networks are particularly good at sorting things into categories, especially when using a discrete set of input data points. What letter is it? Is it a picture of a face or something else? Is a proof of claim filed in a bankruptcy case objectionable or not?

## Machine Learning in Action

These examples are basic, designed to provide some understanding of what are fairly abstract systems. Machine learners come in many flavors—some suitable for performing basic sorting mechanisms, and others capable of identifying and indexing complex relationships among information in unstructured databases. Some systems work using fairly simple programs and can run on a typical office computer, and others are highly complex and require supercomputers or large server farms to accomplish their tasks.

To understand the power of machine learning systems compared with nonlearning analytic tools, let’s revisit an earlier example: ROSS Intelligence. ROSS is built on the IBM Watson system, although it also includes its own machine learning systems to perform many of its tasks. Watson’s search tools employ a number of machine learning algorithms working together to categorize semantic relationships in unstructured textual databases. In other words, if you give Watson a large database of textual material dealing with a particular subject, Watson begins by indexing the material, noting the vocabulary and which words tend to associate with other words. Even though Watson does not actually understand the text’s meaning, it develops, through this analysis, the ability to mimic understanding by finding the patterns in the text.

For example, when you conduct a Boolean search in a traditional service for “definition /s ‘adequate protection,’” the service searches its database for an exact match where the word “definition” occurs in the same sentence as the term “adequate protection.” ROSS does something different. Using the Watson AI systems and its own algorithms, it looks within the search query for word groups it recognizes and then finds the results it has learned to associate with those word groups. If you search for “what is the definition of adequate protection,” the system will associate the query “what is the definition” with similar queries, such as “what is the meaning of” or just “what is.” It will also recognize the term “adequate protection” as a single concept instead of two separate words, and likely, given the context, understand it as a word found in bankruptcy materials. Finally, it will have associated a successful response as being one that gives you certain types of clauses including the term “adequate protection.” It will not understand specifically that you are looking for a definition, but because others who used the system and made similar inquiries preferred responses providing definitions, you will get clauses containing similar language patterns and, viola, you will get your definition.

You should not even have to use the term “adequate protection” to get an answer back discussing the concept when that is the appropriate answer to your question. So long as your question triggers the right associations, the system will, over time, learn to return the correct responses.

The key is that a machine learning system learns. In a way, we do the same thing ROSS does. The first time we research a topic, we might look at a lot of cases and go down a lot of dead ends. The next time, we are more efficient. After dealing with a concept several times, we no longer need to do the research. We remember what the key case is, and at most we check to see if there is anything new. We know how the cases link together, so the new materials are easy to find.

A machine learning–based research tool can do this on a much broader scale. It learns not just from our particular research efforts, but also from those of everyone who uses the system. As the system receives more use, it employs user feedback to assess how its model performs and to allow for periodic retraining. As a result, it will become extremely adept at providing immediate responses to the most common queries by users. It might also be able to eventually give you a confidence level in its answer, comparing the information it provides against the entire scope of reported decisions and its users’ reactions to similar, prior responses, to let you know how reliable the results provided might be. Even though the system doesn’t understand the material in the same manner as a human, its ability to track relationship building over a large scope of content and a large number of interactions allows it to behave as you might, if you had researched a particular point or issue thoroughly many times previously. This provides a research tool far more powerful than existing methodologies.

Legal tools based on machine learning have enormous application. Lawyers are already using learners to help with legal research, categorize document sets for discovery, evaluate pleadings and transactional documents for structural errors or ambiguity, perform large-scale document review in mergers and acquisitions, and identify contracts affected by systemic changes like Brexit. General Motors’ legal department, and likely other large companies, are exploring using machine learning techniques to evaluate and predict litigation outcomes and even help choose which law firms they employ. Machine learning is not the solution for every question, but it can help answer a large number of questions that simply were not answerable in the past, and that is why the advent of machine learning in the legal profession will prove truly transformational. u