Big Data Textualism: The Basics of Corpus Linguistics

Benjamin Reese


  • Corpus linguistics, a method of analyzing language usage through databases of written materials, is a valuable tool for legal interpretation.
  • The process involves searching a database to understand how specific words or phrases were used historically. It is useful for interpreting constitutional provisions, statutes, and contracts, and it offers more empirical grounding than dictionaries do.
  • Several corpus databases are available, including free collections from BYU Law School. The method can be complicated, but it is worth the effort for lawyers.

Yeah, I don’t know what to make of that. That’s . . . that’s something new.

Confronted with statistics about word use and something strange called corpus linguistics, Chief Justice John Roberts spoke for most members of the bar during last term’s oral argument in ZF Automotive US, Inc. v. Luxshare, Ltd., 142 S. Ct. 2078 (2022). But corpus linguistics is neither as complicated nor as newfangled as it sounds, and it’s becoming an increasingly popular way to search for ordinary meaning. So, what is it, and how can it affect your practice?

What Is Corpus Linguistics?

Simply put, a corpus is a database of written materials. This is the case whether it’s a collection of newspaper articles, judicial opinions, books, transcripts, speeches, or something else entirely. The key is that the materials must be incorporated into a single searchable database.

Corpus linguistics, for its part, is the process of searching that database to see how a particular word or phrase is used. Do people really use the phrase bear arms to mean carrying weapons for personal defense? Or do they only use it when describing participation in a militia? If we look only at sources dating from the time of the country’s founding, did people use it in the same way or differently?
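For readers who like to see the mechanics, here is a minimal sketch of that kind of search, written in Python. It is only an illustration: the three sample sentences are invented for the example (they are not real historical sources), and the `concordance` function and its parameters are hypothetical names, not part of any corpus tool. It pulls every use of a phrase, with some surrounding context, from documents within a date range.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    year: int
    left: str    # text just before the match
    match: str   # the phrase as it appears in the document
    right: str   # text just after the match


def concordance(docs, phrase, start_year=None, end_year=None, width=40):
    """Return keyword-in-context hits for `phrase` in (year, text) pairs."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for year, text in docs:
        # Skip documents outside the requested date range.
        if start_year is not None and year < start_year:
            continue
        if end_year is not None and year > end_year:
            continue
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append(Hit(year, left, m.group(0), right))
    return hits


# Invented sample documents, for illustration only (not real sources).
SAMPLE_DOCS = [
    (1790, "Every citizen enrolled in the militia shall bear arms when called."),
    (1795, "He did bear arms against the enemy in the late war."),
    (1995, "The statute limits who may bear arms in public places."),
]

# Show only uses dated 1800 or earlier.
for hit in concordance(SAMPLE_DOCS, "bear arms", end_year=1800):
    print(f"{hit.year}  ...{hit.left}[{hit.match}]{hit.right}...")
```

A real corpus tool does the same thing at scale, across millions of words, and the researcher then reads the resulting lines to see which sense of the phrase each one reflects.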

Corpus-linguistics analysis is very similar to how dictionaries—the other common way of divining ordinary meaning—are constructed. Merriam-Webster, for example, keeps files of clippings from books, magazines, and other print sources showing interesting uses of words, which its lexicographers use to determine what a word means and how it is being used. Increasingly, dictionary makers themselves are turning to corpus analysis as a substitute for this sort of work.

When Can I Use Corpus Linguistics?

Corpus analysis is most useful when interpreting a constitutional provision, statute, or contract. These databases allow users to pull dozens or hundreds of examples showing how a word or phrase is used, identify the most common sense of a word at the time a legal document was drafted, and more. Corpus analysis is also far more empirically grounded than dictionaries, which often do not order word senses (i.e., the various, usually numbered definitions for a word listed under its entry in the dictionary) by their most common or core usage (the American Heritage Dictionary series being a notable exception). And it is certainly more objective than a judge’s or lawyer’s intuitive, “common sense” view of a word’s ordinary meaning. Just be sure to pick a corpus fit for your purpose.

To see how this works in practice, let’s look at an example from the US District Court for the District of Columbia’s 2023 opinion in Pierre-Noel ex rel. KN v. Bridges Public Charter School (1:23-cv-00070, D.D.C. 2023). One of the questions in that case was whether the word “transportation” in the Individuals with Disabilities Education Act (IDEA) included lifting or carrying a child to the bus. After consulting several dictionaries and a related statute, the court turned next to corpus linguistics. Using one of the databases discussed below, the court searched for mentions of “transportation” between 1965 and 1975 (the decade leading up to the passage of the relevant section of IDEA) and found that 30.6 percent of the results referred to systems of transportation and 25 percent referred to vehicular travel. However, it found only three instances (out of a random sample of 288) that used “transportation” in a broad sense that would apply to helping a child to the bus, and none that used the word to describe pedestrian travel. Based in part on this analysis, the district court concluded that using “transportation” in the way the plaintiff student suggested would be “highly anomalous.”
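The arithmetic behind figures like these is simple, and a short Python sketch may help make it concrete. The sample size (288), the two percentages, and the count of three come from the description of the opinion above. The raw counts for the first two senses are back-calculated from those percentages, which is an assumption made for this illustration, and the `share` helper is a hypothetical name. The sketch does not try to account for the rest of the sample, because the description above does not say how those lines were classified.

```python
SAMPLE_SIZE = 288  # random sample size reported above

# Number of sampled lines coded to each sense. The first two counts are
# back-calculated from the reported percentages (an assumption); the
# count of 3 and the count of 0 are stated directly in the text above.
coded = {
    "systems of transportation": round(0.306 * SAMPLE_SIZE),
    "vehicular travel": round(0.25 * SAMPLE_SIZE),
    "broad sense (would cover helping a child to the bus)": 3,
    "pedestrian travel": 0,
}


def share(count, total=SAMPLE_SIZE):
    """Percentage of the sample that was coded to one sense."""
    return 100 * count / total


for sense, count in coded.items():
    print(f"{sense}: {count}/{SAMPLE_SIZE} = {share(count):.1f}%")
```

Seen this way, the broad sense the plaintiff relied on accounts for roughly 1 percent of the sample (3 of 288), which is the kind of figure that supports calling a proposed reading anomalous.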

What Databases Are Available?

There are several corpus databases out there. For example, Brigham Young University Law School has developed a corpus focused on Founding-era sources to help with constitutional interpretation, as well as several other law-focused corpus collections. A BYU linguistics professor has also put together a wider-ranging, modern collection—the Corpus of Contemporary American English—that covers modern sources (1990–2019), as well as his own historical corpus (1820–2019). These resources are available for free online.

Beyond that, a host of other, more specialized corpus collections exist (some free and some not) but are beyond the scope of a general knowledge advice column. If you hope to use corpus-linguistics analysis, you should devote some time to finding the database that works for you.

Do I Need an Expert?

That may sound complicated . . . and it can be. To parse particularly tricky usage questions or help with make-or-break cases, an expert linguist, either as a consultant or testifying expert, could be helpful.

But as Thomas Lee and Stephen Mouritsen point out in their pathbreaking 2018 Yale Law Journal article describing the potential value of corpus linguistics to the law: “In a way, lawyers have been doing corpus analysis for a long time; they scour Westlaw or Lexis to determine how courts have interpreted a phrase or concept.” Corpus linguistics simply extends that sort of analysis to a new type of database and a less law-focused context. “The fact of the matter is that judges and lawyers are linguists. We may not be trained in linguistic methodology, but our work puts us consistently and inevitably in the position of resolving ambiguities in legal language.” In other words, while you might want to spend some time reading the helpful resources available from BYU and other sources (or find a CLE focused on corpus analysis), this is not beyond a lawyer’s skill set in most cases. We lawyers have been doing this all along in our own way.

Corpus Linguistics Is Complicated . . . But Worth It

It’s not possible to give you a step-by-step instruction manual on how to use corpus linguistics most effectively in a short explainer like this one. But if you’re interested, it is well worth diving deeper. After all, at least one lawyer has made a practice out of corpus-linguistics analysis, and very few others are conversant in it. Familiarizing yourself with corpus linguistics will allow any young lawyer to add value to any litigation team.