July 01, 2020 Feature

Corpus Linguistics: Just Another Tool in the Judge’s Toolbox?

By Andria Dorsten Ebert

Corpus linguistics, the practice of using computerized databases of text (corpora) to determine the meaning of a word, has appeared with increasing frequency in state and federal judicial opinions, as well as in briefs to the U.S. Supreme Court.1 Practitioners should seek to understand this emerging topic and be ready for the judge who requests briefing on the issue.2 Equipped with an understanding of the strengths and weaknesses of corpus linguistics, practitioners can provide strong and effective briefing to the court on issues of statutory interpretation.

Justice Thomas R. Lee of the Utah Supreme Court has long been a proponent of corpus linguistics, using it to inform his opinion in In re Adoption of Baby E.Z. in 2011.3 Since then, Justice Lee has promoted corpus linguistics in many judicial opinions, media interviews, and law review articles.4 More recently, state supreme courts in Idaho, Michigan, and Minnesota have followed Utah’s lead in employing corpus linguistics and have issued opinions and reasoning supported by a corpus search and linguistic analysis.5 The Ohio Supreme Court has yet to issue an opinion disclosing corpus linguistics use, but Justice Pat DeWine held a conference for his fellow justices, Ohio appellate judges, and other colleagues on corpus linguistics and its use as a tool of statutory interpretation, stating, “[corpus linguistics] is not something we’re going to use in every case, but in certain cases I think it’s something that can be helpful as part of a judge’s toolkit to get at those thorny issues of statutory interpretation.”6 Federal circuit courts are also starting to use corpus linguistics as a judicial tool, with judges in the Third and Sixth Circuits joining their state colleagues in asking that briefs include statutory analyses using corpus linguistics.7

Even if judges and lawyers are not trained linguists, they often engage in linguistic reasoning when resolving ambiguity in legal texts and interpreting the law. To illuminate and interpret the ordinary meaning of statutory text, judges and scholars often turn to a familiar tool: the dictionary. Generally, using the dictionary to determine the ordinary meaning of statutory text is a relatively straightforward application with little subtlety. But even in these straightforward applications, a legal interpreter can use “curiously unscientific linguistic methods,” judging ordinary meaning “based on intuition, aided by the dictionary.”8 For example, employing a well-recognized, unabridged dictionary appropriate to the time period does not ensure that the meaning applied by the court will be correct. The first meaning listed for a word is not always the word’s primary or most common meaning; some dictionaries simply list the oldest meaning first.9 The ordering of definitions is a “lexical convenience” that does not establish a hierarchy of importance among the definitions listed for a given word.10 As Justice Lee stated in his first opinion utilizing corpus linguistics:

judges seldom provide a rationale for selecting among the alternatives; nor do they explain why one dictionary definition is more “ordinary” than the other. This suggests that such determinations are intuitive rather than principled. . . . But dictionaries and our own intuition may not tell us how words are ordinarily used, and our reliance on both to determine the ordinary meaning of a statutory term in a particular context is problematic.11

Recognizing these issues, scholars have long criticized the use of dictionaries to aid in statutory interpretation. They argue that dictionaries are “detached from ordinary meaning and legislative intent, they are often deliberately devoid of context, they do not purport to describe all semantically acceptable word meanings, they contain definitions that support multiple readings of the statute, and they are often used opportunistically by legal interpreters.”12 But if judges should not use dictionaries, what other tools are available to interpret a text’s ordinary meaning?

Corpus linguistics might provide an answer. Or it might at least provide additional and more rigorous analyses than dictionary definitions alone. Its proponents state that while a dictionary provides a static interpretation of a given word, corpus linguistics can give a much more dynamic and measurable interpretation. As a result, they have turned to corpus linguistics to aid in their interpretation, believing it presents a more pragmatic and transparent tool that can account for differences in historical usage and semantics.13 According to James Heilpern, a senior fellow at BYU who helps run BYU’s Law and Corpus Linguistics Project:

Corpus linguistics can provide judges with empirical evidence about a word or a phrase’s ordinary meaning, or its relative clarity versus ambiguity. . . . A dictionary cannot tell you what the ordinary meaning of a phrase is. In fact, if you read the front matter of a dictionary, dictionaries even say that they can’t answer that question.14

Proponents of corpus linguistics aim to improve upon opaque dictionary definitions, using corpus searches to clarify their legal reasoning. By using corpus linguistics instead of a dictionary definition, legal interpreters can mitigate unconscious bias associated with choosing a particular definition within the dictionary’s acceptable uses.

Corpus linguistics is an approach to studying language that uses digitized, searchable collections of written texts known as corpora. These corpora are built from real-world language used in its original context, such as books, magazines, legal documents, and transcripts of spoken language. One well-known example of a corpus is Google Books, which has more than forty million titles in over 400 languages in its database.15 Another example is Brigham Young University’s Corpus of Contemporary American English (COCA). On its website, COCA states that it is the most widely used English corpus, with more than 600 million words of text from 1990–2019, and a representative sample of spoken word, fiction, popular magazines, newspapers, and academic texts. BYU has also created more specialized corpora, including the Corpus of Founding Era American English, the Corpus of the Supreme Court of the United States, the Corpus of U.S. Caselaw, and the Corpus of Early Modern English. Digitized databases like these allow legal practitioners to analyze language for patterns of usage in a more targeted and transparent way than a dictionary can provide.16 Unlike the static definitions contained in dictionaries, a corpus can be selected and searched to answer specific empirical questions about language use. In other words, corpus linguistics offers an empirical, rather than theoretical and sometimes biased, way to determine the ordinary meaning of a statute.17

Some legal linguists have called corpus linguistics “like Lexis on steroids” for its ability to analyze enormous collections of texts for patterns of usage.18 For example, a legal linguist can use the publicly searchable BYU corpora to discover not only what the word “personal” means, but also how it is actually used as an adjective to modify nouns. In his amicus brief in FCC v. AT&T, Inc., Neal Goldfarb provided extensive detail on his methodology and its results to address whether the word “personal” is merely an adjectival form of the noun “person,” so that “personal privacy” can be extended to include corporate personhood, or whether “personal” has a distinct meaning unaffected by the legal treatment of corporations as persons.19 Goldfarb’s replicable search demonstrated to the Court that the most common phrases in which “personal” modifies a noun include “personal life,” “personal experience,” “personal friend,” and “personal appearance,” supporting the conclusion that “personal privacy” should apply only to people, and not to corporate entities, despite a corporation’s status as a legal “person.”20 A search can be broad, encompassing all of the text within a corpus, or narrowed to usage from a particular era, such as the decade in which the disputed statute was enacted.
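The kind of search described above, counting which words most often follow “personal” and optionally restricting the results to a date range, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Goldfarb’s methodology or the interface of any BYU corpus: the six dated sentences are invented, the function name right_collocates is hypothetical, and the code counts whichever word immediately follows the target, whereas real corpus tools rely on part-of-speech tagging and far larger samples.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: (year, sentence) pairs invented for illustration.
SAMPLE_CORPUS = [
    (1995, "She kept her personal life out of the newspapers."),
    (1998, "He spoke from personal experience about the trial."),
    (2004, "A personal friend of the senator attended."),
    (2006, "Her personal life was not the court's concern."),
    (2012, "The memoir draws on personal experience."),
    (2015, "The statute protects personal privacy."),
]


def right_collocates(corpus, target, start_year=None, end_year=None):
    """Count the word that immediately follows `target` in each sentence.

    Sentences outside the optional [start_year, end_year] range are skipped.
    """
    counts = Counter()
    for year, sentence in corpus:
        if start_year is not None and year < start_year:
            continue
        if end_year is not None and year > end_year:
            continue
        # Lowercase and split into simple word tokens (no part-of-speech tagging).
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts


# Broad search across the whole sample, then a search narrowed to the 1990s.
all_years = right_collocates(SAMPLE_CORPUS, "personal")
nineties = right_collocates(SAMPLE_CORPUS, "personal", 1990, 1999)
print(all_years.most_common(3))
print(nineties.most_common())
```

Even in this toy version, the results depend on choices made by the person running the search: which texts go into the sample, which date range is applied, and what counts as a relevant neighboring word.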

Opponents of corpus linguistics take a more pessimistic view. While corpus linguistics is appealing because it promises to promote replicability and remove personal preferences from statutory interpretation, some doubt whether it can reasonably achieve these lofty goals.21 Just as a dictionary definition can be misapplied, a corpus search can be misdirected. Opponents argue that corpus linguistics is not the panacea of objectivity and transparency that its proponents claim, but instead can lead legal interpreters to define a word radically out of its context.22 To use an extreme example, how does the use of a word in Moby Dick, the King James Bible, or newspaper articles clarify the meaning of a word in an ambiguous present-day statute? In the opponents’ view, corpus linguistics is just as subjective a method of statutory interpretation as the other tools available to judges, and perhaps even more so. Despite its seemingly transparent and scientific nature, corpus linguistics analysis still requires human judgment in choosing which corpus to use, selecting the search terms, and analyzing the results for interpretive application. Instead of pretending that corpus linguistics provides a cover of neutrality, legal interpreters should recognize, and possibly even embrace, the human judgment necessary to interpret statutes. Without human judgment to create the search parameters and to add context and purpose to the terms used, corpus linguistics might not be “the most helpful tool in the toolkit.”23 Its use should therefore be tempered by recognizing that it “brings us no closer to an objective method of statutory interpretation” and involves human “judgment calls.”24 Using corpus linguistics alongside the full range of judicial tools, such as common sense and historic and present-day considerations, can guide the court and avoid corpus linguistics’ limitations.

There are also practical implications to corpus linguistics use. For example, how does one cite a corpus search? Although the Bluebook is regularly updated with changes in legal reporting (the twentieth edition, published in 2015, includes the proper citation for a tweet or Facebook post25), it has yet to include standards for providing the details of a corpus search in a way that could be easily replicated by the reader. Goldfarb, in his amicus brief, helpfully provided the results of his corpus search in the appendices. However, this is still an imperfect solution: results alone do not convey the full search methodology and may not be enough to execute an accurate replication. Moreover, most practitioners, conscious of a brief’s length, will be hesitant to commit too much space to explaining their methodology. So, even though a corpus search seems to be more transparent and replicable, this may not, practically speaking, be the case.

Even committed textualists, such as the late Supreme Court Justice Antonin Scalia, understand that context matters.26 Contextual evidence must be compelling, especially when a judicial opinion seeks to stray from the primary meaning of a word. A word should not be interpreted in a vacuum. Instead, judges should use their human judgment to interpret a word in its full context. “[S]ound interpretation requires paying attention to the whole law, not homing in on isolated words or even isolated sections. Context always matters. Let us not forget, however, why context matters: It is a tool for understanding the terms of the law. . . .”27 But corpus linguistics as a tool of interpretation promotes this “homing in” on one word, outside of its context, and can promote poor reasoning. This homing-in does little to aid understanding of the terms of the law if it is not paired with context and judicial reasoning.

To provide a simple example, when discussing the need for context when interpreting ordinary language, Justice Stephen Breyer stated:

When I see the word “any” in a statute, I immediately know it’s unlikely to mean “anything” in the universe. . . . When my wife says, “there isn’t any butter,” I understand that she’s talking about what is in our refrigerator, not worldwide. We look at context over and over, in life and in law.28

Corpus linguistics, used on its own, cannot distinguish between “in life and in law.” But recognizing the role of human judgment in corpus linguistics can aid in interpreting a statute and can keep ordinary language in context.

“Law requires both a head and a heart,”29 and so, too, do interpretive tools. To interpret laws and legal text, corpus linguistics—the intellectual head—needs to be paired with judicial reasoning—the equitable heart. When making decisions that affect human beings, it is not enough to look for answers through history, text, language, or tradition. While relevant to any interpretation, those tools alone will too often produce a law that is too rigid when applied to human circumstances. Instead, pairing textual tools with human judgment will provide an equitable interpretation. Corpus linguistics can be a valuable tool to “help courts as they roll up their sleeves and grapple with a term’s ordinary meaning.”30 Even so, lawyers and judges utilizing corpus linguistics should recognize that “corpus linguistics is one tool—new to lawyers and continuing to develop—but not the whole toolbox.”31


1. See, e.g., FCC v. AT&T, Inc., 562 U.S. 397 (2011) (Neal Goldfarb provided an amicus brief to the Court that used corpus linguistics to describe how “personal” is typically used to describe human beings and not artificial entities like corporations).

2. In Wright v. Spaulding, Judge Thapar asked the parties to file supplemental briefs on whether the Corpus of Founding Era American English could illuminate the meaning of Article III’s case-or-controversy requirement. 939 F.3d 695, 700 n.1 (6th Cir. 2019).

3. 266 P.3d 702 (Utah 2011). In this opinion, Justice Lee used the Corpus of Contemporary American English (COCA) to review 500 randomized sample sentences, in the context of their articles or transcripts, to determine that “custody” was not often used in the adoption context, but instead generally used in divorce proceedings. Therefore, the custody proceedings covered by the act in question were limited to modifying custody orders in divorce. Id. at 37–38.

4. See, e.g., id.; Thomas R. Lee & Stephen C. Mouritsen, Judging Ordinary Meaning, 127 Yale L.J. 788 (2018); Justice Thomas Lee & Stephen Mouritsen, The Path Forward for Law and Corpus Linguistics, Wash. Post (Aug. 11, 2017).

5. See, e.g., State v. Lantis, 447 P.3d 875 (Idaho 2019); State v. Thonesavanh, 904 N.W.2d 432 (Minn. 2017); People v. Harris, 885 N.W.2d 832 (Mich. 2016).

6. Csaba Sukosd, Justices Get Word About New Tool to Interpret Law, Court News Ohio (Oct. 9, 2019), http://www.courtnewsohio.gov/bench/2019/corpusLinguistics_100919.asp#.XjuKWmhKhPY (last accessed Feb. 10, 2020).

7. See Caesars Entm’t Corp. v. Int’l Union of Operating Engineers Local 68 Pension Fund, 932 F.3d 91 (3d Cir. 2019); Wilson v. Safelite Grp., Inc., 930 F.3d 429 (6th Cir. 2019); Wright v. Spaulding, 939 F.3d 695 (6th Cir. 2019).

8. State v. Rasabout, 356 P.3d 1258, 1285 (Utah 2015).

9. See Muscarello v. United States, 524 U.S. 125, 128 (1998) (Breyer, J., writing the opinion for the Court). In Muscarello, Justice Breyer mistakenly indicated that the first meaning listed in the Oxford English Dictionary was its “primary meaning,” when it was instead the oldest meaning of the word “carry.” See also Antonin Scalia & Bryan A. Garner, A Note on the Use of Dictionaries, 16 Green Bag 2d 419, 423 n.18 (2013).

10. Thomas Lee & Stephen Mouritsen, Corpus Linguistics and a Dictionary-Based Jurisprudence, Wash. Post (Aug. 8, 2017), https://www.washingtonpost.com/news/volokh-conspiracy/wp/2017/08/08/corpus-linguistics-and-a-dictionary-based-jurisprudence/ (quoting Webster’s Third New International Dictionary of the English Language at 17a (1993)).

11. In re Adoption of Baby E.Z., 266 P.3d 702, 726–27 (Utah 2011) (Lee, J., concurring).

12. Evan C. Zoldan, Corpus Linguistics and the Dream of Objectivity, 50 Seton Hall L. Rev. 401 (2020).

13. Lee & Mouritsen, supra note 10.

14. Sukosd, supra note 6.

15. Haimin Lee, 15 Years of Google Books (Oct. 17, 2019), https://www.blog.google/products/search/15-years-google-books/.

16. Ben Zimmer, The Corpus in the Court: “Like Lexis on Steroids,” Atlantic (Mar. 4, 2011), https://www.theatlantic.com/national/archive/2011/03/the-corpus-in-the-court-like-lexis-on-steroids/72054/.

17. Lee & Mouritsen, supra note 4, at 795.

18. Zimmer, supra note 16.

19. Brief for the Project on Gov’t Oversight, the Brechner Ctr. for Freedom of Info. & Tax Analysts as Amici Curiae in Support of Petitioners at 30–32, FCC v. AT&T, Inc., 562 U.S. 397 (2011).

20. Id.

21. See, e.g., Carissa Byrne Hessick, Corpus Linguistics and the Criminal Law, 2017 BYU L. Rev. 1503, 1503 (2018).

22. Zoldan, supra note 12.

23. Wright v. Spaulding, 939 F.3d 695, 700 n.1 (6th Cir. 2019).

24. Wilson v. Safelite Grp., Inc., 930 F.3d 429, 448 (6th Cir. 2019) (Stranch, J., concurring); id. at 441 (Thapar, J., concurring).

25. The Bluebook: A Uniform System of Citation R. 18.2.2, at 184 (Columbia Law Review Ass’n et al. eds., 20th ed. 2015).

26. King v. Burwell, 135 S. Ct. 2480, 2497 (2015) (Scalia, J., dissenting).

27. Id.

28. Eve Gerber, Stephen Breyer on His Intellectual Influences, Five Books, https://fivebooks.com/best-books/stephen-breyer-intellectual-influences/.

29. Id.

30. Wilson v. Safelite Grp., Inc., 930 F.3d 429, 445 (6th Cir. 2019) (Thapar, J., concurring).

31. Id. at 441.


Andria Dorsten Ebert is a third-year student at The Ohio State University Moritz College of Law. She writes for the Ohio State Law Journal’s Sixth Circuit Review.