
Jurimetrics Journal

Jurimetrics: Spring 2023

Epsilon-Differential Privacy, And A Two-Step Test For Quantifying Reidentification Risk

Nathan Reitinger and Amol Deshpande

Summary

  • Differential privacy is one of the most popular data sanitization concepts of the twenty-first century.
  • Statutes regulating data require data to be kept confidential if the data is considered personally identifiable information.
  • The two-step test provides needed confidence to data stewards hosting legally protected data.


Abstract: Sharing data in the twenty-first century is fraught with error. Most com­monly, data is freely accessible, surreptitiously stolen, and easily capitalized in the pur­suit of monetary maximization. But when data does find itself shrouded behind the veil of “personally identifiable information” (PII), it becomes nearly sacrosanct, impenetrable without consideration of ambiguous (yet penalty-rich) statutory law—inhibiting utility. Either choice, unnecessarily stifling innovation or indiscriminately pilfering privacy, leaves much to be desired.

This Article proposes a novel, two-step test for creating futureproof, bright-line rules around the sharing of legally protected data. The crux of the test centers on identifying a legal comparator between a particular data sanitization standard—differential privacy: a means of analyzing mechanisms that manipulate, and therefore sanitize, data—and statutory law. Step one identifies a proxy value for reidentification risk which may be easily calculated from an ε-differentially private mechanism: the guess difference. Step two finds a corollary in statutory law: the maximum reidentification risk a statute tolerates when permitting confidential data sharing. If step one is lower than or equal to step two, any output derived using the mechanism may be considered legally shareable; the mechanism itself may be deemed (statute, ε)-differentially private.

The two-step test provides clarity to data stewards hosting legally or possibly legally protected data, greasing the wheels of advancement in science and technology by providing an avenue for protected, compliant, and useful data sharing.

Citation: Nathan Reitinger & Amol Deshpande, Epsilon-Differential Privacy, and a Two-Step Test for Quantifying Reidentification Risk, 63 Jurimetrics J. 263–317 (2023).

One of the most popular data sanitization concepts of the twenty-first cen­tury is differential privacy. The idea may be stated in one line: purposefully failing to see the trees for the forest. Differential privacy allows one to learn the statistics of a group without also learning the statistics of the individuals making up the group. To illustrate why this type of crowd-but-not-individual learning is necessary in our data-agnostic world, consider the case of Merck’s block­buster drug Vioxx.

A pharmaceutical boon, Vioxx grossed drug maker Merck around eight bil­lion dollars in the approximately four years it was prescribable, 1999–2004; the drug was marketed as a safer alternative to ibuprofen and quickly became heav­ily prescribed. All that changed in September 2004 when Merck voluntarily, abruptly pulled Vioxx from the market, resulting in a nearly twenty-seven per­cent stock drop.

While Vioxx had great success at treating arthritis, it also had great success at causing heart attacks. This was a surprise to many—but not the Food and Drug Administration (FDA). Almost three and a half years before Merck im­posed its to-be-permanent moratorium on Vioxx sales, the FDA possessed data which, if analyzed, would have illuminated this danger, possibly preventing fu­ture harm.

The specter of confidentiality is one of the many reasons why this data never saw the light of day. Releasing legally protected or even potentially legally protected data into the wild is a low-benefit, high-risk endeavor. The benefits rely on unpredictable, open-source engagement while the risks involve highly likely public scrutiny and legal backlash (e.g., the mires of anonymization from the 2000s). To be sure, without tools like differential privacy, the risks may outweigh the benefits, and it may be reasonable to withhold data.

Differential privacy solves the confidentiality problem by offering future­proof guarantees to individuals, and therefore data stewards. Differential pri­vacy can guarantee that releasing your data will not be the cause of any adverse harm—a guarantee which is not matched by any other sanitization technique. However, while removing the risk in the to-release-or-not-to-release dilemma should result in a positive-sum game, differential privacy today is a mixed bag.

To be sure, differential privacy can be (mostly) life changing or some­thing of a dubitante. We see the census going all-in on differential privacy and technical scholars having difficulty mentioning privacy without also discussing differential privacy—like a Marbury v. Madison for a mathematical under­standing of privacy itself. On the other hand, we also see attempts at critiquing differential privacy into the dustbin. Both in academia and in the courts, dif­ferential privacy has seen its fair share of challenges. Pundits and politicians aside, there may be some justification for this polarity.

For one, differential privacy is not a well-understood concept. Math is hard. Additionally, applying differential privacy in a practical, legal setting is in many ways terra incognita. This Article attempts to solve both problems.

The Article begins by teaching differential privacy using a building blocks approach, starting from the most basic and building to a full mathematical def­inition. The Article then proposes a novel means of understanding differential privacy, which intends to be readily applicable in a statutory framework. By default, differential privacy does not directly translate to statutes regulating data. If data protection laws made statements like “it is permissible to share data if the mechanism of release applies differential privacy with an epsilon value of less than or equal to .05,” then this Article would be superfluous. For good reason, this will likely never happen. Instead, most statutes create ambiguous mandates like “remove any information which could lead to identification.” This, however, leaves a data steward in a difficult position: How much sanitiza­tion does a dataset need to undergo before there is no data remaining that could “lead to identification?” Likewise, this leaves differential privacy in a difficult position: When does a differentially private mechanism permit legally shareable data and when does it not?

This translation problem stems equally from issues rooted in both law and technology. To solve it requires finding a common element among data protec­tive statutes that provides a metric against which differential privacy can be measured. Stated otherwise, is there a single, mathematical value that (step one) may be derived from a differentially private mechanism and (step two) is trans­latable to what statutes require for the sharing of confidential data? Yes.

For the legal piece, all data protective statutes, we argue, regulate “reidenti­fication risk.” Statutes go about this by using a variety of unique textual phrases (e.g., “personally identifiable information” (PII) or “personal data”) but what all of these phrases have in common is an intent to reduce the potential for harm that an individual described in a dataset faces when their data is shared. The harm is that someone can look at a statutorily compliant dataset and say “this record is your record.” Some statutes may have a very low threshold for risk (e.g., Health Insurance Portability and Accountability Act (HIPAA)) while others may have a high tolerance (e.g., Video Privacy Protection Act (VPPA)); ultimately, however, all statutes share a common goal of reducing this risk—at varying thresholds.

For the technical piece, the Article looks at identifying a value native to differential privacy which may be called the reidentification risk of the mecha­nism. We call this value the guess difference: the risk of being reidentified in data that comes from an ε-differentially private mechanism. This proxy value for reidentification risk is easy to understand, adds context to an otherwise am­biguous number, and allows differential privacy to be directly compared to what statutes mandate in terms of data confidentiality.

Working together, the two-step test provides much needed confidence to data stewards hosting legally protected data. The test permits easy line drawing around how little or how much sanitization is required before sharing data within a regulatory ecosystem—greasing the wheels on private, useful data shar­ing. Before introducing the primitives on which differential privacy operates, the Article first elaborates on how modern-day privacy leads to, and necessi­tates, differential privacy.

I. Constitutional Heritage

Data is inescapable, revolutionary, and commoditized. Everything you do online and offline is captured. True enough, this has negative side ef­fects—for example, Panopticon-styled chilling effects, overbroad NSA drag­nets, and the transactional cost of reductionism in the pursuit of category-themed, ad-based monetization. But at the same time, big data also has pos­itive side effects—for example, democratized education via massive open online courses, the proliferation of e-commerce, and worldwide, instantaneous communication networks. To be sure, no effect (positive or negative) is without a privacy loss.

A. Privacy Loss: Legal Protections

Privacy loss is a difficult-to-describe harm, but one which, when looking for it, may be easily found in marking the boundary lines of governmental in­trusion. Marriages, procreation, and parenthood have all been subject to fierce protection (or at the very least debate) by the Supreme Court, which has ex­plicitly noted the right to privacy as one of the most valued rights for all citizens. For example, the Court stated in Lawrence v. Texas: “[There is] no legitimate state interest which can justify its intrusion into the individual’s personal and private life.” And the Court observed in Stanley v. Georgia:

If the First Amendment means anything, it means that a State has no business telling a man, sitting alone in his own house, what books he may read or what films he may watch. Our whole constitutional heritage rebels at the thought of giving government the power to control men’s minds.

Though not explicitly tied to the polarizing invasions highlighted in Lawrence v. Texas and Stanley v. Georgia, data itself makes these types of invasions possible. Data permits oppression just as easily as it permits freedom. If not properly harnessed, “data availability” can become “the database of ruin,” transforming worldwide communication networks into worldwide surveillance networks. The bright side is that data sanitization has grown in leaps and bounds since the early days of reidentification awareness.

B. Privacy Loss: Technical Protections

Techniques, tools, and a mountain of interest have matured around the “safe” sharing of data, vis-à-vis standards like k-anonymity, l-diversity, and t-closeness. At a high level, these concepts are all attempts at assessing the privacy preserving qualities of an “anonymized” dataset. In simple terms, one takes a set of raw data, applies some measure of noise (e.g., “suppression” by redacting all zip code digits), and then assesses how privacy preserving the resulting dataset is.

The thorn for each of these standards, however, is that none of them provide guarantees in the same way that differential privacy provides guarantees. For example, consider a dataset that is k-anonymized. In shorthand, what this means is that, assuming a k value of three, for every record (i.e., row in a spreadsheet) there are at least two other records that are identical.

 

If your data were within this anonymized dataset, would you feel com­fortable allowing it to be released publicly, into the wild, forever? Your answer likely depends on the sensitivity of the data, the trustworthiness of the data stew­ard, how the data may be used by others, and the many other potential implica­tions arising from releasing the data. Underneath each of these concerns, however, is a singular risk: How likely is it that you will be “reidentified?” In a worst-case scenario, what is the risk that someone will be able to point at your record and say “this is you.” All other adverse effects stem from this singular, null-privacy end result: reidentification.

Most sanitization standards, like k-anonymity, say nothing about how likely or unlikely the threat of reidentification is. Is a k value of three, four, or five required for a release to be privacy preserving enough to make the risk of reidentification minimal? If “joins” with auxiliary data are off the table (i.e., a privacy attack which matches known information with unknown information—one which k-anonymity was designed to protect against), is the threat of reidenti­fication completely eliminated? Are contractual requirements needed to add teeth to a sanitization technique’s gums or are the technical mechanics of sani­tization a sufficient deterrent? No standard mentioned above provides definitive answers to these questions except for differential privacy.

Differential privacy is no tourist when it comes to guarantees. In fact, dif­ferential privacy was built with these guarantees in mind, requiring that individ­ual data, by itself, be meaningless. As Cynthia Dwork and her coauthors state: “[D]ifferential privacy by definition [protects against] re-identification.”

Similar to k-anonymity, differential privacy looks at a process of sanitiza­tion (e.g., when asked for your age, answer with your real age plus a random number from 1 to 10) and assesses how privacy preserving the output is. The difference occurs, in part, because differential privacy tells you how privacy preserving an output will always be, in a worst-case scenario, no matter what new privacy attacks are identified and no matter what new information an at­tacker learns. True enough, this protection comes at a cost—it is heavy handed, it does not apply to all scenarios, and it creates diminishing usefulness implica­tions for the type of questions that may be answered—but what it provides to the forest for the sake of the trees guarantees privacy like none other.

To more fully understand differential privacy, and how it may be “attacked” in a legal scenario, we must start with a few building blocks. The next Part pro­vides the viewpoint from which differential privacy is most easily accessible, discussing the type of “noise” differential privacy uses to measure privacy and explaining how the mathematical quantity of epsilon is used as an adjustable knob in the privacy–⁠utility tradeoff.

II. Building Blocks of Differential Privacy

Differential privacy requires a particular frame of mind. Similar to how, when considering the rule against perpetuities, it is important to take a step back and understand the perspective driving the rule, differential privacy is most easily understood using the viewpoint from which it operates. This viewpoint may be grouped around three core concepts: (1) differential privacy focuses on “mechanisms” or “algorithms” (i.e., descriptions for how to accept some type of input, engage with that input, and produce some type of output); (2) differential privacy is not a tool used to sanitize data, but is more like a standard, a statement about the privacy preserving abilities of a mechanism itself; and (3) differential privacy lives in a world of datasets, and produces its guarantees by measuring itself against a powerful adversary, quantifying how much information an attacker would, at most, be able to learn. The next section discusses each of these building blocks in turn.

A. Preliminary Cairns

For starters, differential privacy only concerns itself with mechanisms. A mechanism, broadly speaking, is a recipe, like a cooking recipe. Another term used for these recipes is a function or algorithm: a repeatable, consistent way of doing something that takes in a certain type of input and produces a certain type of output. It is best to think of this at the highest level possible: widget A goes in and widget B comes out. For example, a mechanism for making a peanut butter and jelly sandwich would look like this:

 

Figure 1. An Algorithm for Making a Sandwich

We have an input of bread (two slices), peanut butter, and jelly. The algorithm takes this input, executes a sequence of operations (i.e., add peanut butter, add jelly, put the two together), and produces an output: a sandwich.

Differential privacy operates on mechanisms, like Algorithm 1 as shown in Figure 1. In fact, only mechanisms may be deemed ε-differentially private, not the results of the mechanism. We would not say the peanut butter and jelly sandwich (output) is ε-differentially private, but that the algorithm used to make the sandwich is ε-differentially private. The ε (the Greek letter epsilon) is discussed in Section II.D below. In short, it signifies how “private” the output is. Also noteworthy is how differential privacy allows these functions to be made publicly available in a “don’t-roll-your-own-crypto” type of way. This allows for better mechanisms through open-source analysis and reproducibility. For instance, anyone may fully assess Google’s RAPPOR system, which uses differential privacy in the Chrome web browser to check on intimate telemetry details such as users’ default homepages in their browsers.

The second building block concerns differential privacy’s role not as a tool of anonymization, but as a way of reasoning about mechanisms. To be sure, differential privacy is not itself a tool for creating privacy. Unlike the methods of “generalization” (i.e., modifying data by generalizing it) or “suppression” (i.e., modifying data by removing it), differential privacy is a measure of privacy under a particular scenario. In many ways, it is like PII in that it ties to a general concept of privacy, but is not itself a way to achieve privacy. For instance, COPPA defines “personal information” as “individually identifiable information [(e.g., name, social security number, and e-mail address)] about an individual collected online” while VPPA defines PII as any “information which identifies a person.” Neither one of these definitions provides a way to create privacy, yet both provide a standard for evaluating privacy (or lack of privacy) in a specific setting. Similarly, the concept of differential privacy is not the narrow application of a tool to data; to understand the term in a working sense requires a setting, and this is where the third concept comes in.

Lastly, differential privacy lives in a world of datasets (i.e., tables of columnized information) and adversaries (i.e., actors who wish to reidentify individuals in those datasets). In fact, the mathematical definition of differential privacy makes no sense without these two elements. To better understand this world, consider the following table, which lists the names of individuals who ate certain types of cookies. This table would be deemed a “dataset.”

 

Table 1. Original “Raw” Dataset

First   | Last        | Birth-Year | Cookie Eaten
Alice   | Westminster | 1984       | Chocolate Chip
Bob     | Kensington  | 2000       | Gingersnap
Abigale | Westminster | 1989       | Chocolate Chip
Bob     | Chelsea     | 2010       | Gingersnap

 

The “adversary” here would be someone who tries to reidentify the individ­uals described by the dataset, which is a very simple task if Table 1 is shared in its raw form (i.e., just look at the table). Instead, what a data steward who owns the dataset and wants to “release and forget” it into the wild may do is opt (even today) for something like k-anonymity, which would create a sanitized-looking dataset.

 

Table 2. Sanitized-Looking Dataset

First | Last        | Birth-Year | Cookie Eaten
-     | Westminster | 1980s      | Chocolate Chip
Bob   | -           | -          | Gingersnap
-     | Westminster | 1980s      | Chocolate Chip
Bob   | -           | -          | Gingersnap

 

For every row in the table, there is at least one other row in the table with the exact same information (i.e., allowing Alice to “hide in the crowd” with Abigale by making the first and third rows in Table 1 identical). Practically, we are achieving a privacy “crowd” via k-anonymity by applying suppression (i.e., replacing a cell with “–”) and generalization (i.e., making data more general, like replacing the year 1984 with 1980s), all the while keeping our goals in mind.
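To make the suppression-and-generalization step concrete, the sketch below (ours, not the Article’s; the helper names sanitize and is_k_anonymous are hypothetical) applies the same two operations to Table 1 and then checks whether every combination of quasi-identifier values appears at least k times.

```python
from collections import Counter

# Table 1, the raw dataset: (First, Last, Birth-Year, Cookie Eaten).
raw = [
    ("Alice",   "Westminster", 1984, "Chocolate Chip"),
    ("Bob",     "Kensington",  2000, "Gingersnap"),
    ("Abigale", "Westminster", 1989, "Chocolate Chip"),
    ("Bob",     "Chelsea",     2010, "Gingersnap"),
]

def sanitize(record):
    """Mirror Table 2: generalize 1980s birth years, otherwise suppress."""
    first, last, year, cookie = record
    if 1980 <= year <= 1989:
        return ("-", last, "1980s", cookie)   # generalization (plus first-name suppression)
    return (first, "-", "-", cookie)          # suppression of last name and birth year

def is_k_anonymous(records, k, quasi_ids=(0, 1, 2)):
    """True if every combination of quasi-identifier values appears at least k times."""
    counts = Counter(tuple(r[i] for i in quasi_ids) for r in records)
    return all(count >= k for count in counts.values())

sanitized = [sanitize(r) for r in raw]
print(is_k_anonymous(sanitized, k=2))  # True: each row hides in a crowd of at least two
```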

Although the above table looks sanitized, how confident should Alice be that no attacker could reidentify her? Should Alice be comfortable knowing that her data is deidentified? No.

A hypothetical adversary, Mallory, may have an easy time reidentifying Alice if Mallory happens to have access to another dataset (i.e., auxiliary infor­mation, Table 3) with full name and age information. All Mallory would have to do is search for all the Westminsters born in the 1980s to figure out that both Alice and Abigale ate chocolate chip cookies (i.e., PII may have been un­veiled).

 

Table 3. Original Plus Auxiliary Datasets

Original Dataset                                  | Auxiliary Dataset
First | Last        | Birth-Year | Cookie Eaten   | First   | Last        | Birth-Year
-     | Westminster | 1980s      | Chocolate Chip | Alice   | Westminster | 1984
Bob   | -           | -          | Gingersnap     | Bob     | Kensington  | 2000
-     | Westminster | 1980s      | Chocolate Chip | Abigale | Westminster | 1989
Bob   | -           | -          | Gingersnap     | Bob     | Chelsea     | 2010

 

True enough, k-anonymity may preserve privacy if the k value is increased; the kernel in this example, however, is not how the adversary was able to uncover information, but that the strength of a privacy preserving standard is measured against an adversary who is assumed to always exist and always possess the goal of unveiling who is in the dataset. This is the perspective taken by differential privacy, motivating how it is technically defined.

In summary, differential privacy is a way of measuring the privacy of mech­anisms acting on datasets in the face of an adversary. The following three Sec­tions outline the core of why differential privacy works (randomness), what exactly differential privacy guarantees (the adversary), and how differential pri­vacy uses randomness like a knob to increase or decrease privacy (epsilon). To­gether, these Sections represent the building blocks for the mathematical def­inition of differential privacy introduced in Part III. The next Section starts by introducing the randomness that differential privacy uses to purchase privacy.

B. Why Differential Privacy Works: Randomness

Differential privacy works against an age-old quandary: How do you hide information while at the same time reveal information? For differential privacy, privacy is purchased by avoiding real answers in a particular way, providing a veil of “plausible deniability” from the implications of a mechanism’s output.

Consider a hypothetical where Alice does not want anyone to know her real age. When asked her age, Alice responds with a random age near her real age. The recipe or mechanism Alice uses has an input of “what is your age” and an output of {real age plus or minus a random number}.
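A minimal sketch of Alice’s recipe (ours, for illustration only; the ±10-year range is an assumption, since the text only says “a random age near her real age”):

```python
import random

def random_age_response(real_age, spread=10):
    """Alice's recipe: answer with the real age plus or minus a random number."""
    return real_age + random.randint(-spread, spread)

print(random_age_response(33))  # e.g., 27, 38, or even 33, but never trustworthy on its face
```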

To be sure, trusting Alice’s response at face value, given that she uses the random-age mechanism, is unreliable; it is entirely possible and very likely that Alice has not provided her real age. True enough, knowing additional infor­mation about how Alice picks random numbers (to add or subtract from her real age) would help a detective (i.e., adversary) figure out exactly what Alice’s real age is, but assuming that random-age-choice information is off the table, Alice is free to proffer a responsive answer because her provided age is meaningless.

In the same way, differential privacy relies on randomness to attain privacy. In fact, its inventors go so far as to state that “any non-trivial privacy guarantee that holds regardless of all present or even future sources of auxiliary infor­mation . . . . requires randomization.” On the other hand, using randomness, though effectuating privacy, degrades utility—what if we really did want to know Alice’s real age?

1. Truth and Not-Truth: The Privacy-Utility Tradeoff

If privacy cannot be attained without returning an unreal answer, then per­fect privacy may be considered the opposite of perfect utility. We have privacy via randomness, but what if we also want utility?

Imagine trying to determine the ages of everyone in a particular neighbor­hood. If the Alice from our hypothetical, using the random-age generator, lived in this neighborhood, then the age-counts for this neighborhood would be inac­curate, because Alice would most likely lie about her real age. Alice’s privacy is preserved, but the utility or accuracy of the overall count is harmed.

Elegantly, differential privacy uses the privacy-utility tradeoff to its advantage. By concerning itself with large-enough questions, differential privacy is able to play nice with both Alice and the count, preserving Alice’s desire to keep her true age private while also keeping the overall tally close enough to the truth to remain useful.

To see how this plays out, consider a neighborhood of 1,000 people and a specific question: How many people in this neighborhood are 33 years old? Assuming 100 of them are, in truth, 33 years old (including Alice), the real answer to this question is 100 (i.e., 10%). Given the lying mechanism that Alice uses, however, the privacy-preserved answer would most likely be 99 (i.e., 9.9%): out of all the people who live in this neighborhood, the tally reports that 99 of them are 33 years old (Alice’s privacy-preserved, unreliable answer no longer counting toward the total).

The point here is not that 10% (i.e., the answer) is numerically close to 9.9%; rather, the point is that if the question concerns a large enough group, then the whole will be greater than its parts, the truth of the crowd outweighs Alice’s lie. Imagine instead that the neighborhood only consisted of ten people. A lie here has an impressive impact on the outcome—adding an inaccurate 10% mar­gin to any tally looking at age. This would be a large impact on utility.

Differential privacy is powerful because it gets away with adding much more noise than simply one person out of the group lying—in fact, every person in the group receives the same insulation from the truth as Alice. For example, if we took an ASCII art picture of a bike, modified all individual characters in the picture by flipping them blank (i.e., “ ”) or leaving them as is with 50% probability, then we would still be able to discern the overall picture, even though each character is insulated by a 50% chance of being accurate or not.
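The character-flipping thought experiment can be sketched in a few lines of Python (ours; the tiny picture below stands in for the Article’s bike, which is omitted here):

```python
import random

def flip_characters(ascii_art, keep_probability=0.5):
    """Blank each visible character independently, keeping it with 50% probability."""
    return "".join(
        ch if ch in ("\n", " ") or random.random() < keep_probability else " "
        for ch in ascii_art
    )

picture = (
    "  __o   \n"
    " -\\<,   \n"
    "( )/( ) \n"
)
print(flip_characters(picture))
# With a large enough picture, roughly half the strokes survive and the shape still reads.
```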

 

Figure 2. Original ASCII Image and Figure 3. Sanitized ASCII Image

 

So long as responsive answers (e.g., flipping each character or answering the “what is your age” question) are provided in a particular way, nothing will be learned about the individuals making up the group, but a fairly accurate some­thing will be learned about the group as a whole. Stated in more mathematical terms, any output of a differentially private mechanism is nearly as likely re­gardless of whether one individual was “in” the dataset or not.

C. Adversarial Perspective

Taking a step back, it is important to note the why behind differential pri­vacy’s use of noise to provide inaccurate answers. The why here comes directly from the historic perspective of reidentification attacks: deidentification talks more than it walks.

Differential privacy takes a nod from the failings of the Netflix Prize and the AOL search log releases by leaving room for the possibility that someone may try to use any and all auxiliary information (i.e., information unbounded by the instant dataset) in a hodgepodge aimed at reidentification. And this goes beyond the practical attack Professor Sweeney persuasively demonstrated in 1997 (e.g., taking public voter list records and joining them with deidentified medical records). Instead, differential privacy directly addresses the means used to effectuate those attacks: reconstruction attacks.

Simply speaking, reconstruction attacks take advantage of the fact that computer time and human time are different. One of the most magical parts of computing is a computer’s ability to execute computations with blazing speed (e.g., variable a = 1 + 1) and remember those computations (e.g., variable b = 1 + a) in a useful way (e.g., variable b is 3). This is bad news for a chess master like Garry Kasparov, at least when it comes to winning: as long as a problem can be represented mathematically, a computer can blindly work on it for what would be decades of human time but is merely seconds of computer time. Kasparov lost because Deep Blue checked many, though not all, possible combinations of chess moves that could be made (also known as cheating).

The same time difference is leveraged in reconstruction attacks by knowing an output (i.e., you are given an answer, like, the mean age of my classmates is 24) and finding all possible combinations of numbers that could lead to that output (e.g., (23 + 25) ⁄ 2). This may seem impossible given a complicated output, but with unlimited guessing and nearly unlimited storage capacity, it accords with logic to say that the answer will eventually come to the fore.
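A toy version of the reconstruction attack (ours, deliberately small: two classmates, ages 1 to 100) enumerates every combination consistent with a released mean:

```python
from itertools import combinations_with_replacement

def consistent_combinations(released_mean, group_size, ages=range(1, 101)):
    """Brute-force every multiset of ages whose mean equals the released answer."""
    target_sum = released_mean * group_size
    return [
        combo
        for combo in combinations_with_replacement(ages, group_size)
        if sum(combo) == target_sum
    ]

# "The mean age of my classmates is 24," with two classmates, as in the text:
candidates = consistent_combinations(24, 2)
print(len(candidates))   # 24 possible pairs
print(candidates[:3])    # [(1, 47), (2, 46), (3, 45)]
```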

Two key presuppositions can be learned from the reconstruction attack. First, some combinations of numbers are more likely than others. For example, it is unlikely that, if the average age of a group of classmates is 24, and if I know that there are nine classmates in the class, then the ages of the nine classmates are 200, 9, 1, 1, 1, 1, 1, 1, 1—though this makeup does produce an average age of 24. Second, hunches about real answers will improve over time if repeat ques­tions are permitted.

In the first case (i.e., likely combinations), the questions you may ask a function are not all created equal, particularly with respect to an output. Some questions have specific answers, others have general answers. What this means for a reconstruction attack is that some answers are more reconstructable than others because only a few combinations produce the particular output. Differ­ential privacy takes this into consideration when deciding how much noise to add to a function’s output. In fact, differential privacy, using the common La­place method, considers explicitly the maximum range of values there might be when assigning noise. As Part III more fully explains below, the fewer combi­nations there are, the more noise is needed.

And in the second (i.e., repeat questions), if we are playing the guess-this-input-given-that-output game, from the perspective that each time we see an output we come up with a list of combinations that produce that output, then it is easy to see how repeating questions allows for a paring down of possible combinations. In a brute force type of way, if we ask the same question over and over again, we will eventually find the real answer, regardless of the inaccura­cies reported over time. For example, if you give me a random answer which deviates slightly from the real answer each time I ask for it, all I need to do is average the random answers to get better and better hunches of the real answer. If I ask Alice “what is your age” over and over again, and Alice says: 33, 34, 30, 31, 33, 33, 37 then I might start to get the suspicion that Alice’s real age is 33.
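A short sketch (ours) of the averaging intuition, using the seven answers from the text and an assumed ±4-year noise range:

```python
import random
from statistics import mean

# The seven answers Alice gave in the text already average out to 33.
print(mean([33, 34, 30, 31, 33, 33, 37]))  # 33

# More generally, if each answer is the real age plus symmetric noise,
# repetition erodes the protection.
def noisy_age(real_age=33, spread=4):
    return real_age + random.randint(-spread, spread)

print(round(mean(noisy_age() for _ in range(10_000)), 1))  # very close to 33
```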

In a more nuanced sense, each time we reconstruct the possible inputs to produce an observed output, we are producing a set of combinations, and we know that the real answer lies within the overlap of all of these sets of combinations. In this way, the space of possibilities gets smaller and smaller as we ask more and more questions. This is why you may have heard rumors of a privacy budget. The budget runs out the more you ask questions. That said, this is a well-known aspect of differential privacy, and one that can be controlled.

Differential privacy addresses both problems, how much noise to add and how to handle repeated questions, with an adjustable knob known as epsilon. Knowing how to turn this knob depends, essentially, on how privacy sensitive an output is. The following Section discusses how epsilon responds to these two issues in more depth.

D. What the Knob Means—Epsilon

Epsilon is the most important part of differential privacy. The reason for this, however, may not be what you are thinking.

1. Non-Contextual Epsilon

A naïve way to think about epsilon would be to consider it the amount of noise that is added within a mechanism. A lot of noise is added with a small epsilon value (e.g., .01) and almost no noise is added with a large epsilon value (e.g., 10). If the output is privacy sensitive, like the answer to a sensitive ques­tion such as “have you ever had an abortion,” then you will likely want more buffer room between the real answer and the mechanism’s output; but if the question is not considered very sensitive, such as “do you like pizza,” then you might use a larger epsilon value, meaning the provided answer is more likely close to the truthful answer.
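As an illustration only, the sketch below uses the Laplace mechanism (mentioned later in this Article) to show the knob at work on a counting query; the function name and the two epsilon settings are our assumptions:

```python
import numpy as np

def noisy_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism for a counting query: noise scale = sensitivity / epsilon."""
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# The neighborhood question, "how many people here are 33 years old?", true answer 100.
print(noisy_count(100, epsilon=0.01))  # wildly noisy: strong privacy, little utility
print(noisy_count(100, epsilon=10))    # nearly exact: little privacy, strong utility
```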

This is how epsilon works in a mathematical sense, with more nonsensical output associated with low epsilon values and basically real outputs associated with high epsilon values, but the problem with this understanding is that it has no context. What does an almost-real output mean? Why should anyone be com­fortable with a mechanism that used a small epsilon value but nonetheless out­puts a number close to the real answer? Context is necessary and context for differential privacy comes from the adversarial perspective.

2. Contextual Epsilon: Bounding

Epsilon matters because it bounds the threat of privacy loss. Epsilon says that this output (i.e., the mechanism’s answer) can increase a belief that it is correct by no more than some set percentage. In other words, confidence in a guess at the real answer, when seeing the output of a mechanism, will never go beyond the limit set by epsilon. Your answer of “yes I have had an abortion” may only be 2% more likely to be the true answer, which is likely not high enough for me to trust that it is the true answer. Epsilon says that someone seeing your answer to this question will never have more than a 2% confidence boost that this is the real answer.

In more concrete terms, differential privacy guarantees that an attacker, with some predefined, best-guess idea at an outcome, who views the results of a mechanism, cannot learn more, in percentages, than is controlled by epsilon. For low values of epsilon, this means that the attacker’s initial suspicion (e.g., 50%) will not change very much, probability-wise, after seeing the mechanism’s output (e.g., from 50% to 52%). For high values of epsilon, this means the attacker’s initial belief that an outcome is real (e.g., 50%) may grow substantially after seeing an output (e.g., from 50% to 75%). The same is true regardless of the level of initial suspicion. If an attacker knew the real answer was a number between one and ten, then the attacker has a 10% guess out of the gate—but if epsilon were set high, then the attacker may, after seeing an output, believe there is a 95% chance that the observed output is real (i.e., believe that this specific value in a range of possibilities is likely to be the real answer with a 95% chance). And this is why differential privacy is only meaningful in terms of the particular epsilon a mechanism wields—a high epsilon means that there is practically no privacy, the results of the function are almost-but-not-quite right; a low epsilon means that there is practically no usefulness to the data, the results of the function are too incorrect to be useful. This is why we do not call a mechanism (i.e., recipe) differentially private, but ε-differentially private. The following Part takes this understanding one step further by unveiling the mathematical definition of differential privacy.
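The bound can be sketched numerically. Assuming the standard odds-bound reading of epsilon, an updated suspicion of p · e^ε ⁄ (p · e^ε + (1 − p)) reproduces the figures used in this Article (50% to 75% at ε = ln 3; 50% to 52% at ε = .08); the function name below is ours:

```python
import math

def updated_suspicion(initial_suspicion, epsilon):
    """Upper bound on the attacker's belief after seeing one output (odds-bound reading)."""
    boosted = initial_suspicion * math.exp(epsilon)
    return boosted / (boosted + (1 - initial_suspicion))

print(round(updated_suspicion(0.50, math.log(3)), 2))  # 0.75: the randomized response example in Part III
print(round(updated_suspicion(0.50, 0.08), 2))         # 0.52: barely more than the 50% starting point
print(round(updated_suspicion(0.10, 5.0), 2))          # 0.94: a high epsilon leaves near-certainty
```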

III. Definition and Step One

A more formal definition of differential privacy looks like this:

 

ℙ[Mechanism(inputdataset1) = output] ⁄ ℙ[Mechanism(inputdataset2) = output] ≤ e^ε

 

Although this equation may appear jarring, Part I covered its most difficult parts. For notation, the ℙ in both the numerator (top) and denominator (bottom) simply mean “the probability”; in this case, the probability that version one or version two of the mechanism’s input will have a particular (same) output, which has to be less than or equal to e, a number, raised to ε, the epsilon value discussed in Section II.D above. If the numerator were one and the denominator were two, then the equation would simply look like this: 1 ⁄ 2 ≤ e^ε. The number e, Euler’s number, may be simply thought of as approximately 2.71828 (i.e., 1 ⁄ 2 ≤ 2.7^ε).

The two datasets of the mechanism’s input (dataset1 and dataset2, numerator and denominator, respectively) are meant to capture the situation where the data the function operates on differs in a small way, while using the same mechanism. For example, using Table 1’s “cookie eaten” column (mechanism: name the type of cookie eaten), dataset1 would be a dataset with someone eating a gingersnap and dataset2 would be the same dataset, but this time without that person eating the gingersnap. Differential privacy looks at the problem this way because it attempts to capture the reconstruction attack: If my best guess combination to produce an output similar to the mechanism’s output is as good as I can get—i.e., my combination which produces this output is only one missing piece away—then what does that mean for privacy loss?

In summary, at a high level, the mathematical definition of differential pri­vacy requires that a mechanism’s output (e.g., cookie count) on a dataset (e.g., one gingersnap eaten by the individual) be close to the mechanism’s output (e.g., cookie count) on a similar dataset (e.g., zero gingersnaps eaten by the individ­ual). Why differential privacy is able to offer a “privacy guarantee” is because it is able to define close mathematically: the left side of the equation (i.e., fraction) must be equal to or smaller than the right side (i.e., Euler’s number raised to epsilon). In other words, the mechanism makes a similar statement both with and without the data from the person who ate a gingersnap—the gingersnap lover’s data must be, in some ways, meaningless.

For a more technical explanation, which is helpful didactically, we can take a look at a mechanism that has been around for a long time: randomized re­sponse. Indeed, any mechanism, including those created before the invention of differential privacy, may be analyzed with the lens of differential privacy. Differential privacy did not invent privacy preserving algorithms, it is simply a means of measuring one type of privacy loss that an algorithm encumbers. If a mechanism has some measure of randomness, then the mechanism may be proved to have a calculable ε, representing an ε-differentially private algorithm. Randomized response, in the setup given below, has an ε value of ~1.098.

A. Mechanism—Randomized Response: A Teaching Tool

Imagine we are using the following algorithm:

 

 

Figure 4. Algorithm 2

This mechanism has an input of a question and an output of the answer to that question. The mechanism uses coin flips to insulate a respondent’s secrets, similar to Alice’s random-age generator. If the coin lands tails, then the question is answered truthfully, but if the coin lands heads, then the question is answered true or false depending on another coin flip. In this way, the mech­anism buys privacy with the fifty-fifty–tails-heads odds.

We may analyze this mechanism by noting all possible outcomes. We can then find the best-case scenario for an attacker to learn as much as possible from the response. Notably, an attacker’s best-case scenario is the one which has the most likely outcome.

As Figure 4 shows, only two possible outcomes for Algorithm 2 exist: yes or no. You either are or are not a member of the Communist Party. Given the definition of differential privacy from above, we may consider the case where inputdataset1 is a yes—a “dataset” with a person who would answer yes (“real answer”) to the question being asked. Therefore, the only other possibility for inputdataset2 is a “no,” and the person would answer “no” as the real answer. Stated otherwise, what is the probability (numerator) of a “yes” (output) with someone whose real answer is yes, and what is the probability (denominator) of a “yes” (output) with someone whose real answer is no—we are trying to figure out all the ways a yes occurs, letting us know what the probability of seeing a yes is. This gives us the probability of a yes in the best case for the attacker (i.e., the most we can learn when we see the output of the randomized response algo­rithm—the worst case for privacy).

 

probability of a yes given a real answer of yes ⁄ probability of a yes given a real answer of no ≤ e^ε

 

For the numerator (i.e., top line), a yes can occur with a 50% chance of being truthful (line 2, Algorithm 2) or a 50% chance of landing heads and then a 50% chance of landing on heads again (line 6, Algorithm 2). Together, this is a 75% chance (.50 + (.50 * .50)). For the denominator (i.e., bottom line) to be yes with a real no answer, the first flip must be heads (line 4, Algorithm 2) and the second flip must also be heads (line 6, Algorithm 2). This case happens with a 25% chance (.50 * .50). Therefore, assuming we are talking about the probability of a yes in general, we can form the fraction .75 ⁄ .25, which reduces to the whole number three. We plug this into the differential privacy equation:

 

3 ≤ e^ε

 

A math trick allows us to rephrase this statement to make it cleaner: the natural logarithm of three must be less than or equal to epsilon. This number, rounded, is approximately 1.098.

This results in the worst-case scenario for the respondent (i.e., highest prob­ability of seeing a yes), meaning that this value sets our epsilon in this particular algorithm. Algorithm 2 is therefore deemed (1.098)-differentially private. Im­portantly, as Section II.D.2. emphasized above, this is an expression regarding the bounds of what an attacker may learn when seeing the output of a mecha­nism.
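The numbers can be checked by simulation. The sketch below (ours) implements Algorithm 2 as described above and in Figure 4, recovering the .75 and .25 probabilities and an epsilon of roughly ln 3 ≈ 1.098:

```python
import math
import random

def randomized_response(real_answer):
    """Algorithm 2: tails, tell the truth; heads, a second flip picks the answer."""
    if random.random() < 0.5:       # first flip lands tails: answer truthfully
        return real_answer
    return random.random() < 0.5    # first flip lands heads: second flip decides "yes"

trials = 200_000
p_yes_given_yes = sum(randomized_response(True) for _ in range(trials)) / trials
p_yes_given_no = sum(randomized_response(False) for _ in range(trials)) / trials

print(round(p_yes_given_yes, 2), round(p_yes_given_no, 2))   # ~0.75 and ~0.25
print(round(math.log(p_yes_given_yes / p_yes_given_no), 2))  # ~1.10, i.e., roughly ln(3)
```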

B. Differential Privacy Takeaways

Taking a step back and focusing on the task at hand—translating differen­tial privacy into something legally meaningful—a problem is found with the previous Section’s closing statement: it is legally meaningless. Data regula­tion does not speak directly to differential privacy and the idea of bounded pri­vacy loss; instead, statutes regulating data require data to be kept confidential (i.e., not shared) if the data is considered PII. And although one way to trans­form PII into non-PII is to sanitize it, it is difficult to know exactly how san­itized resulting outputs are and how much sanitization a statute requires. Therefore, what needs to be found is a common measurement between what a statute deems sufficient sanitization and what a mechanism technically provides.

Luckily, differential privacy offers one of the most applicable, system-to-system comparisons for privacy that exists: epsilon—privacy by any other name. If properly framed, the attributes inherent to differential privacy allow it to be consistently and repeatably applied to legal questions. In this way, what a mechanism technically provides may be rephrased, legally speaking, as reidentification risk. Before diving into possible options for framing differential privacy in terms of a reidentification risk, however, we must first address the fact that, mathematically speaking, differential privacy says nothing about reidentification risk.

1. Reidentification: Appropriate Overprotection

To clarify, reidentification occurs when an individual’s data found within a dataset is no longer anonymous. An attacker is able to point at a record and say “this is your data,” or, for differential privacy in the query setting discussed so far, see the output of a mechanism (e.g., did Abigale eat a chocolate chip cookie—yes) and know it is real. This is a spectacular failure for a dataset—game over for an individual.

Importantly, differential privacy does protect against reidentification at­tacks, but it also protects against other types of attacks as well. Differential pri­vacy must protect against all types of attacks for its guarantee to hold. For example, in a successful tracing attack, which is covered by differential pri­vacy’s protection guarantee, an attacker merely learns whether an individual is in a dataset, not what the individual’s data is (e.g., is Abigale in the “cookies eaten” dataset).

The problem is that summarizing differential privacy in terms of reidentifi­cation risk inherits this overprotection, and what this means for our legal com­parator, introduced in Section III.C below, is that we will be necessarily overprotecting data. That said, hinging protection on an overinclusive definition has several advantages.

First, using overprotection provides breathing room to an otherwise uneasy ask—releasing protected data into the wild. Understanding that the measure of reidentification risk borne from a differential privacy mechanism assumes a worst-case scenario gives balance to that proposition. Second, and more im­portant, this amount of overprotection is necessary to prevent new, currently unknown attacks from degrading current standards of sanitization (i.e., differ­ential privacy is futureproof). As discussed in Section IV.A.2 below, a thorn for many of the standards in use today is that new attacks are later invented that undermine the assurance of outdated methods to sanitize data—what can you do with anonymized Massachusetts hospital information? With this in mind, we turn to identifying an aspect of an ε-differentially private mechanism that is transferrable to a legal understanding of statutorily mandated data confidential­ity.

2. Legal Comparator

Distilling a legal comparator from an ε-differentially private mechanism first requires an understanding of what various values of epsilon mean for a pri­vacy loss. To aid this understanding, it is helpful to visualize the bounds of Al­gorithm 2’s mechanism (as shown in Figure 4).

That mechanism had an epsilon value of 1.098, which produced an upper bound of 75%. In the best-case scenario for an attacker, an observed output would be known to be real with a 75% confidence. Stated otherwise, if the at­tacker sees that an output to “are you a member of the Communist Party” is yes, then the attacker has a 75% confidence level that this was the participant’s real answer—there is a 75% chance that this person is a member of the Communist Party. A visualization here allows us to more fully contextualize that 75%.

 

Figure 5. Bounding of The Randomized Response Algorithm

 

The attacker in this mechanism has a 50% chance of correctly guessing an output a priori: an answer to a yes or no question is either yes or no. This knowledge is represented as the initial suspicion found along the x axis, at the 0.5 mark. Tracing this initial suspicion value vertically will end at the thick black diagonal line, which represents what may be thought of as the home base position (i.e., the attacker did not learn anything from initial to updated suspicion). Using Figure 4’s Algorithm 2, we found that the attacker’s guess may be adjusted by at most 25% for a highest-possible confidence of 75%. This is represented by vertically adding 25% to that diagonal line, ending at 75% along the y axis. This point on the y axis is the a posteriori confidence, the belief in the correctness of an output after seeing the mechanism’s output.

The final value here is known as the upper bound—an adversary can gain no more confidence when witnessing a mechanism’s output than this percentage (i.e., that a provided answer by a mechanism is the real, truthful answer). The lower bound moves the confidence in the opposite direction and represents the best-case scenario for a respondent. Based on an observed output of a mecha­nism (i.e., a “no” answer in randomized response), the attacker may lose confi­dence in a guess (e.g., you thought there was a 50% chance of something happening, but when seeing a particular output value, your confidence drops to 25%). Another way to think of these two boundaries is that not all answers are equal, some answers may be more likely than others, and therefore an at­tacker’s confidence may change depending on the observed output. This change occurs because of how the mechanism is built.

To practicalize this dance of probabilities, imagine you owned a crystal ball which tells you whether it will rain tomorrow: yes or no. Unfortunately, because the ball is magical, it is regulated, and you are only allowed to access predictions from the ball which have been sanitized using differential privacy. Further assume that you know the mechanism the crystal ball uses has an epsilon value of 1.098. Given that there is, at baseline, a 50% chance that it will rain tomorrow, if your crystal ball answers “yes,” then you can be 75% confident that it will rain tomorrow—and this might be high enough for you to carry an umbrella. The output of the mechanism, even though differential privacy is be­ing used, greatly impacted your decision to carry an umbrella.

On the other hand, assume the crystal ball were using a (.08)-differentially private mechanism to sanitize its future-predicting outputs. If you had an a priori guess that it would rain tomorrow, 50%, and the crystal ball said “yes”—i.e., the same setup from before, with a revised epsilon value—then you would only have gained a 2% boost in confidence. You are now able to say there is a 52% chance of rain tomorrow—and that might not be high enough for you to take an umbrella. In other words, learning the output of this particular (.08)-differentially private mechanism does very little for your choice in umbrella en­cumbrance.

This fluidity in confidence is what must be translated into legal language. At a high level, lower epsilons mean that the data provided by a mechanism is more sanitized, and a statute that is highly sensitive to the risk of a privacy loss (i.e., risk of reidentification) would be more likely to approve the mechanism’s outputs. However, there are a few important nuances not captured by such a cursory view. Three options may exist for the accurate and portable packaging of a mechanism’s risk of reidentification. Each of these options is discussed in turn.

a. Epsilon Alone

One possibility for translating a mechanism’s legal risk is simply using ep­silon alone. On the positive side, this approach places the focus on an easily adjustable quantity, allowing simple changes in epsilon to reposition the legal viewpoint of a mechanism’s sanitization abilities. The downside, however, is that this approach is not very granular. Low epsilons may be considered more private, as “small [epsilons] are happy epsilons,” but distinguishing between an epsilon of .01 versus .05 versus 1.0 would be practically difficult. At the same time, this could impact a decision by a court given that not all data are created equal, and the purposes of data exploration are also not equal (i.e., some objec­tives are more worthwhile than others). If a court has trouble distinguishing between “small” epsilons, then it could lead to permissible sharing when the risk is, in reality, too high.

Additionally, there is no context provided when considering an epsilon value by itself, which may produce a rubber-stamping effect on certain mecha­nisms. The quantity being assessed here should be the mechanism’s ability to provide an attacker with a lot or a little information. Simply looking at epsilon alone does not provide a sense for how much information is being gained by the attacker. Indeed, an epsilon value of 1.098 may seem low, but comports with a 25% boost in confidence when observing some outputs. Depending on the par­ticular scenario and an initial suspicion probability, a 25% boost could be an untenable amount of privacy loss. Therefore, epsilon alone is likely a nonideal fit for a legally portable understanding of differential privacy.

b. Upper Bounds

A second option for a legal comparator may be to consider the upper bound produced by a mechanism (e.g., the 75% in Algorithm 2). This approach has the benefit of capturing the worst-case scenario for any users’ data that may be in the dataset. As not all answers provided by a mechanism carry the same amount of risk (e.g., in the randomized response mechanism discussed in Section III.A above, observing a “yes” answer carried the most risk, with an upper bound of 75%), this quantity appropriately captures all possible output, best case and worst case for the attacker.

The downside to this approach, however, is that only the upper limit is taken into consideration. In this way, this measurement may oversell the adversary, leading to a court being more wary of a situation that presents less risk than perceived. For example, at an initial suspicion level of 75% and an epsilon value of one, the attacker ends with an 89.08% upper bound percentage. Although 75% is fairly high to begin with, the epsilon value being used here is in some ways low. Despite this, a nearly 90% upper bound probability is unlikely to be approved by a court looking to protect a user’s data.

In summary, regardless of how it may be beneficial to consider the worst-case scenario given that we would be matching this number with the maximum risk permitted by a statute, this comparator ignores important context like a pri­ori guessing ability, which provides useful context for a court to consider. For this reason, the upper bounds are less likely to be the best fit for the type of legal comparator we are looking for.

c. Guess Difference

A final possibility is to use what we deem the guess difference. The guess difference is the difference between the initial suspicion and the upper bound; in short, taking out what the attacker already knew and only keeping what was learned from the algorithm’s output in the best-case scenario for the attacker. For example, an initial suspicion of 50% with epsilon 1.098 (i.e., Algorithm 2) produces an upper bound of 75%; therefore, we have a guess difference of 25%, the difference between the initial suspicion and the upper bound.
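In code, the guess difference is one line once the upper bound is written out (a sketch under the same odds-bound reading used above; the function name is ours):

```python
import math

def guess_difference(initial_suspicion, epsilon):
    """Step one's proxy for reidentification risk: upper bound minus initial suspicion."""
    boosted = initial_suspicion * math.exp(epsilon)
    upper_bound = boosted / (boosted + (1 - initial_suspicion))
    return upper_bound - initial_suspicion

print(round(guess_difference(0.50, math.log(3)), 2))  # 0.25: Algorithm 2's 25-point gain
print(round(guess_difference(0.50, 0.08), 2))         # 0.02: the crystal ball's 2% boost
```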

This approach allows us to take into consideration the fact that some ques­tions are more privacy sensitive than others by relying on the upper bound, but tempers this by removing the default guessability of a query. To be sure, in this way, the guess difference may undersell the attacker’s overall guessing ability. For instance, it might seem odd that a high initial suspicion and low epsilon value nonetheless produces a low guess difference score, despite the fact that the attacker had a high likelihood of guessing initially and that guess was only made stronger after seeing the output of a mechanism. Looking closely at the aims of differential privacy, however, shows that this is likely a moot point.

Differential privacy does not concern itself with information not gleaned via the dataset. Imagine that an individual who has a particular disease par­ticipates in an experimental drug study where the data from the study is pro­tected using differential privacy. Further imagine that the published results of the study are that the experimental drug increased life expectancy rates by one year. Would we say that differential privacy failed to protect this individual if the individual’s insurance rates are increased after the insurance provider learns of this exact study and its conclusion? No.

The insurance company learned from the broad result published by the study, which differential privacy does not claim to protect. If, on the other hand, the insurance provider increased the individual’s rates after querying the dataset and coming up with some confidence level that the individual was “in” this dataset, meaning the individual had the potentially life-threatening disease, then we would say that differential privacy failed to protect the individual. Differential privacy allows us to draw hard lines around how much the insurance company may learn from the data—and guess difference captures that ability. For instance, we may say that the insurance company will never be able to increase a blind guess likelihood by more than 2%; a blind guess that this individual is a smoker cannot be confirmed by querying the data because the likelihood that that guess is correct will never be increased by more than 2%, no matter what result is found in the dataset. Stated otherwise, it would be illogical to conclude, based on the results of any query on this dataset—which the individual is in fact “in”—that the individual’s rates should be increased. The insurance company may nonetheless increase the individual’s rates, but would not be basing this decision on a reliable fact learned from the dataset.

Overall, the guess difference approach provides a singular, but context-filled, legal comparator. This quantity highlights differences in risk when epsilon is small, allowing a court to meaningfully interpret the .01 to .05 to 1.0 epsilon range; it incorporates the worst-case scenario for any user who is in a dataset, by working with the upper bound set by a particular epsilon value; and it accords with preexisting considerations of reidentification risk, as discussed further in Section IV.A below. Therefore, we conclude that out of the three options discussed above, guess difference should be the quantity used to interpret the sanitization abilities of an ε-differentially private mechanism from a legal vantage. The following Section generalizes the guess difference as a proxy for a mechanism's risk of reidentification.

C. Step One: Reidentification Risk vis-à-vis the Guess Difference

Taking these options together leads to the conclusion that guess difference is the most appropriate legal comparator—guess difference may be considered a proxy value for the reidentification risk a mechanism carries. This option adequately balances the attacker's best-case scenario, but tempers that confidence with the a priori guessability of the query. In this way, the measurement does not oversell or undersell the sanitization abilities of a mechanism. This metric will therefore form step one of our two-step test, permitting the comparison between what differential privacy provides and what data-protecting regulation mandates.

1. Epsilon Visualized

With that in mind, we may visualize a range of popular epsilon values in terms of the guess difference each mechanism provides:

 

Figure 6. Guess Difference Visualized

 

Figure 6 shows epsilon values (i.e., the curved lines) ranging from .03 to 5, with smaller epsilon values found closer to the thick black diagonal line. The diagonal line, as discussed in Section III.B.2 above, may be thought of as the home base for an inquiry when visualizing a mechanism this way (i.e., nothing is learned from initial suspicion to updated suspicion—perfect privacy).

To find a guess difference using Figure 6, first locate an initial suspicion value provided by a mechanism along the x axis (i.e., along the bottom). Then, take note of the epsilon value of the mechanism under consideration. Each epsilon value is associated with a resulting line drawn across the top half and bottom half of the figure (upper and lower bound, respectively). This line may be called the epsilon line. Trace the initial suspicion value along the x axis vertically until you hit the thick black diagonal line. Take note of this point (what may be called home base) and then keep tracing until you cross the epsilon line. The point where the epsilon line meets the vertical line drawn by the initial suspicion value is called the updated suspicion. Guess difference is calculated by subtracting the home base position (i.e., where the initial suspicion line meets the thick black diagonal line) from the updated suspicion number. As an example, in Figure 6, we can see that the initial suspicion of .5 meets the epsilon line at .52 when using an epsilon value of .08. This would be a guess difference of .02, or 2%.

Overall, one can visually see the guess difference by looking at the distance between the thick black diagonal line at the home base position and the updated suspicion value. As a whole, this visualization allows us to see how larger values of epsilon affect the guess difference, with an epsilon value of 1.098 being much farther from the diagonal line than the .03 or .08 epsilon values.
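For readers who prefer a table to a figure, the guess differences behind Figure 6 can be tabulated directly. The following sketch is ours; the epsilon values are chosen from the figure's .03-to-5 range, and the initial suspicion of .5 reflects a blind guess:

```python
import math

def bounds(prior: float, epsilon: float) -> tuple[float, float]:
    """Lower and upper bounds on updated suspicion for a single query."""
    boost = math.exp(epsilon)
    upper = (boost * prior) / (boost * prior + (1 - prior))
    lower = prior / (prior + boost * (1 - prior))
    return lower, upper

prior = 0.50  # a blind guess
for epsilon in (0.03, 0.08, 1.098, 5.0):
    lower, upper = bounds(prior, epsilon)
    print(f"epsilon={epsilon:<5}  lower={lower:.2f}  upper={upper:.2f}  "
          f"guess difference={upper - prior:.2f}")
```

The .08 line lands at an upper bound of roughly .52 (a 2% guess difference) and the 1.098 line at .75 (a 25% guess difference), matching the walkthrough above.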

2. Takeaways

Assessing an ε-differentially private mechanism from a legal vantage may be easily accomplished by considering the guess difference—what may be deemed the risk of reidentification a mechanism accommodates. This value is found by knowing: (1) the epsilon value associated with the mechanism; (2) the initial suspicion value provided by a mechanism (i.e., the likelihood of guessing a "real" output without seeing a mechanism's output); and (3) the updated suspicion value at the upper bound of the mechanism. Subtracting the initial suspicion from the updated suspicion (itself derived from the initial suspicion and the epsilon value) yields the guess difference; essentially, the risk of reidentification a mechanism permits. From an attacker's perspective, we can guarantee that there is no more than a guess difference chance that an attacker will be able to take the output of a mechanism and say: "This is the real answer."

Step one adds context to a mechanism and provides a legally framed benchmark that may be measured against a variety of statutes to assess whether the mechanism produces private-enough data to permit sharing. The next Part introduces the legal corollary against which the guess difference is measured: a statute's threshold for reidentification risk.

IV. Step Two

The following Part examines step two: a statute's maximum allowance for reidentification risk. Step two, practically, requires a statute-by-statute inspection, which is in many ways lackluster when attempted from the armchair. That said, two arguments are necessary first: why the risk of reidentification is at the heart of all data-protective statutes (i.e., the applicability of step two), and why the quantity discussed in step one speaks the same language as step two. Following these two arguments, we look at how HIPAA may be interpreted under the two-step test.

A. Statutory Privacy

Wearing a legal hat while considering the implications of differential privacy gives rise to two primary obstacles. First, statutes regulating data do not speak in terms of a measurable privacy loss. Instead, shareable data is protected under explicit terms like "remove identifiers" or ambiguous terms like "remove any information which could lead to identification." Regardless of the phrasing, however, both terms belie what sits at bottom: protection against the risk of reidentification. Second, when a statute does find itself associated with a measurable reidentification risk, one which sets the bar for permissible data sharing, the end result has been, in many ways, meaningless—the permissible risk changes depending on the question being asked and the invention of novel, adversarial techniques, not to mention how these approaches are difficult to apply across a variety of statutes. For these reasons, before illustrating how our two-step test would stack up against a statute, it is necessary to: (1) illustrate how the risk of reidentification is at the heart of all statutes built to protect data; and (2) evidence how and why current measurements of reidentification risk fall short.

1. The Heart of Statutory Privacy

When drafting a statute intended to protect data, a common approach is to hinge that protection on the definition of PII. It is impermissible to share PII, but permissible to share non-PII. VPPA's prohibition on sharing "information which identifies a person" and COPPA's prohibition on sharing "individually identifiable information about an individual collected online" are par for the course. This is true even for regulations which seem to swallow any and all data—for example, the GDPR.

The difficulty with sharing data while trying to comply with these regulations, however, is that it creates a red herring, a "find-the-gaps" exercise that obfuscates the intent of the regulation. The exercise plays out like this: it is permissible to share data as long as the data does not include a specific set of attributes that could, would, or do link to an identity; or it is permissible to share data as long as the actors (i.e., a specific type of entity which is regulated, as opposed to an unregulated entity) or substance (i.e., a specific type of data which is regulated, as opposed to unregulated data) are not subject to the law. Unfortunately, this exercise provides seemingly simple answers (i.e., look for the gaps when trying to share protected data) which break when considering far-reaching statutes like the GDPR.

The GDPR's reach on regulated data is one of the broadest, swallowing any data "relating to an identified or identifiable natural person." This means that, absent statutorily prescribed exceptions, data may not be shared. Even pseudonymized data (i.e., data which has undergone privacy-protective measures, but which may nonetheless be joined with auxiliary information and lead to the identification of a person) is unshareable without statutory prescriptions like consent and minimization. The regulatory line does stop, however, at anonymized data: "The principles of data protection should therefore not apply to anonymous information [i.e.,] . . . personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable." This statement highlights the core issue: How much sanitization is enough; when is data anonymized to the point where the individuals it describes are no longer identifiable? In some light, the GDPR's anonymization requirement may seem impossible—100% anonymization would require 0% utility in certain use cases.

That the GDPR's reach stops short of anonymized data highlights the impetus behind finding the gaps—the gaps highlight permissibly shareable areas because these are areas the drafters felt needed little or no privacy protection. These are areas where there is little or no risk of a privacy loss, a reidentification. Stated otherwise, regulating the risk of privacy loss sits on a spectrum. Along this spectrum are those sanitization techniques which produce mitigated, but not eliminated, privacy loss (i.e., pseudonymization), and also those techniques which reduce the risk of privacy loss to the point where it is merely theoretical (i.e., anonymization). To be sure, no method of data release is completely risk free, even differential privacy. Nonetheless, all data-protective statutes do regulate risk, the "risk" (i.e., the spectrum) of identifying individuals within a dataset. PII is merely a proxy describing that reidentification risk—so too, as we introduce, is the guess difference, albeit a technical understanding of that risk.

A good way to measure this spectrum is to waypost the ease with which reidentification occurs, the point at which anyone can point at a record and say "I know who that is." Raw data would be at one end of the spectrum and anonymized data would be at the other, with pseudonymized data and data with stripped identifiers (e.g., HIPAA) lying closer to anonymized data. Once an individual has been identified, regardless of where this occurs along the spectrum, there is no additional loss that might occur; it is merely a question of how likely it is that released data could reach that point, and this is what statutes regulate—where on the spectrum the to-be-released data must fall.

This theme is not without support from the courts. In Pub. Citizen Health Research Grp. v. Pizzella, when discussing the plaintiff's argument that an OSHA requirement more privacy protective than its predecessor was not sufficiently justified by the record, the court found ample support for the new regulations because the risk of reidentification under the previous requirements was too high. In other words, OSHA's position on requiring further privacy protections was reasonable because of the ultimate harm OSHA sought to protect against, reidentification.

Furthermore, several courts focus directly on reidentification risk when considering what a statute requires to release protected—but sanitized—data. Partially, this comes from HIPAA's statutory language that strikes very close to the risk of reidentification (e.g., "the risk is very small that the information could be used . . . to identify an individual"), but courts have also come to this conclusion on their own. In Sander v. Superior Court, when discussing whether records could be released pursuant to a FOIA-themed statute, the court made its determination in large part based on the risk of reidentification that a release would incur; in Steinberg v. CVS, the court, when providing guidance on whether the sharing of deidentified, but possibly reidentifiable, records would be permissible under HIPAA, found that an expert assessment of the reidentification risk the released records bore would be necessary; and in Cohan v. Ayabe, the court interpreted HIPAA's expert-deemed safe harbor to rest on a determination that the reidentification risk was "very small."

In summary, statutes protecting data are foundationally regulating the risk of reidentification. Statutes may go about this task with a variety of artisanal linguistic options, but the core of what is being regulated is privacy loss, which may be quantifiably expressed as the risk or likelihood of reidentification. The more privacy sensitive a statute is, the less risk is tolerated; the less privacy sensitive a statute is, the more risk is tolerated. The next question, therefore, concerns the permissible level of risk, quantitatively, that a statute allows. Here, unfortunately, despite seeming clarity, we find an unworkable standard.

2. Moving Targets

Statutes like HIPAA, which have been subjected to a fair amount of technical interpretation regarding whether data is "sanitized enough" to meet the statute's reidentification risk threshold, have fallen into a rut when it comes to defining permissible reidentification risk. The crux of this rut centers on how the technical literature has acquiesced to a definition of reidentification that was stated loosely at first, but which has, over time, grown to take on a meaning of its own. In turn, this definition has worked its way into the courts as fact. The incorrect statement looks like this: HIPAA allows for data to be released if there is a .04% to 25% risk of reidentification.

To begin, only two ways exist to release regulated data under HIPAA. The first option is for a data steward to strip the record of a series of explicit attributes like name, email address, and social security number. The second is to rely on an expert to certify that "the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information." To be sure, neither of these options mentions a number between .01% and 25%; yet, a short survey of what "very small" means to a statistical audience would yield that range: "Based on the nationally accepted standard of re-identification risk no greater than . . . 0.04" and "[w]hen the redacted data contained the exact birth year, as allowed by HIPAA Safe Harbor, we correctly identified [25 percent in a subsection of the dataset] . . . . In comparison, earlier studies found unique re-identification rates in data that adhered to the level prescribed by HIPAA Safe Harbor to be much lower, namely 0.013 percent and 0.04 percent." In these examples, researchers are taking it as a fact that HIPAA requires some level of reidentification risk to permissibly share data. This alone is not a problem, and accords with the two-step approach we argue for in this Article. The problem occurs, however, because of how that reidentification risk is being calculated—the number keeps changing.

These percentages come from attempts to read hard numbers into HIPAA, but with non-futureproof methods. The original idea was to scrub records pursuant to HIPAA's safe harbor provision (i.e., remove the series of identifiers), check how many resulting records were nonetheless unique, and then report that number as the inferred reidentification risk. The argument would go like this: this data release is permissible because it is the same level of sanitization that is accepted under HIPAA's explicit safe harbor provision.

Professor Sweeney debuted this method over two decades ago, finding that "0.04% . . . of the population of the United States is likely to be uniquely identified by values of {gender, year of birth, ZIP}." Since then, a line of work has sprung up reapplying the approach—but the numbers have changed drastically over time, ranging anywhere from 0.01% to 25%. The reason for this moving target is a property of differential privacy that uniqueness-based metrics like the .04% rule lack: it is futureproof.
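To make the contrast concrete, the uniqueness-style calculation behind these percentages can be sketched in a few lines. The sketch below is ours and assumes a hypothetical pandas DataFrame with columns standing in for the {gender, year of birth, ZIP} quasi-identifiers; it simply reports the share of records that are unique on those attributes, which is exactly why the reported number moves whenever the dataset, the attribute list, or the available auxiliary information changes:

```python
import pandas as pd

def uniqueness_risk(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Share of records that are unique on the chosen quasi-identifiers."""
    group_sizes = df.groupby(quasi_identifiers).size()
    unique_records = int((group_sizes == 1).sum())
    return unique_records / len(df)

# Hypothetical, made-up records for illustration only.
records = pd.DataFrame({
    "gender":     ["F", "F", "M", "M", "F"],
    "birth_year": [1960, 1960, 1982, 1982, 1947],
    "zip":        ["20740", "20740", "20740", "20742", "20742"],
})
print(uniqueness_risk(records, ["gender", "birth_year", "zip"]))  # 0.6: three of five records are unique
```

Unlike the guess difference, nothing in this number accounts for future auxiliary data or novel attacks; it is a snapshot of one dataset under one set of assumptions.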

Differential privacy, in assuming the worst-case scenario for privacy, considers all information that is currently available or may be available in the future. This is why the definition of differential privacy focuses on the two versions of the mechanism's input which differ in a small way. This is also why differential privacy, according to some, is like using a sledgehammer to crack a nut (i.e., a great loss of utility at the gain of privacy).

Yes, differential privacy is in some ways excessive, but this property also allows differential privacy to make guarantees in perpetuity. The .04% rule, on the other hand, only looks to the narrow situation of a particular identifier-stripping provision and a particular (unique/total) equation. Using vague definitions of "unique" and "total" means the resulting calculation will oscillate as definitions change over time. Moreover, this equation was written for HIPAA as an interpretation of the identifier-stripping provision; statutes which lack such a provision would require additional inferential leaps to make the argument that the equation generalizes.

Ultimately, however, despite the reasons why this approach may be ill-suited for statutory interpretation and why better solutions might exist, the approach is nonetheless making its way into the judicial branch. As seen in Sander, one of the only cases to debate methods of data sanitization in terms of what level of sanitization is required by a statute, the court held in dicta that any proposed method of sanitization, as offered by the plaintiff to access Bar admissions data, carried too high a risk of reidentification when measured against the HIPAA standard of ".02% to .22%." Though not precedential, the reasoning is persuasive, suggesting that other courts may follow suit in trying to apply a unique-record-count fraction as a proxy for risk of reidentification.

This is wrong. What should a court do if tomorrow the target is moved from .02% to 2%? Yesterday’s sanitization method may have failed, but ex post facto, it succeeds; the meaning of the risk of reidentification, as a result, is diluted. A better definition of risk of reidentification is needed; differential privacy’s “guess difference” is needed.

In summary, current methods of analyzing the risk of reidentification point in the right direction, but fall short. A better approach is to rely on differential privacy for its futureproof property. In this way, the legal comparator introduced in Section III.C will hold despite any number of auxiliary pieces of information which come to the fore, and despite new attacks which attempt to pierce privacy's veil.

B. Step Two: A Statute’s Measurement

Given that (1) all privacy-protective statutes, at bottom, aim to regulate the risk of reidentification; and (2) the risk of reidentification may be understood as the "guess difference" value, the next question to ask is: What amount of reidentification risk does a particular statute permit? Due to statutory diversity, it would be shallow to argue for a definitive and generalizable answer in an Article like this. Instead, the Article provides a hypothetical revolving around HIPAA, discussing what risk the statute likely tolerates and what settings of epsilon would likely meet that risk threshold.

Consider a set of publicly accessible hospital records and a sample query: "How many individuals in this dataset have Crohn's disease?" Further imagine that the interaction with this dataset is filtered through a differentially private algorithm (i.e., the query mode of differential privacy), hence the reason it is publicly accessible. To be sure, raw data is transformed into noisy data before its receipt by the individual making the query.

For concreteness, the table below visualizes this information, both with a real answer of one and a real answer of 5,000; either one individual in the dataset has Crohn's disease or 5,000 do (i.e., consider this the ground truth). Both epsilon values, .08 and 1.098 (see Figure 6, visualizing epsilon values and guess difference), are compared across a sampling of ten possible answers a mechanism might provide, with averages noted in the last row. The data has been "post-processed" by rounding to nonnegative whole numbers, and the Laplace method was used to generate noise.
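Because the mechanics are easier to see in code than in prose, here is a minimal sketch (ours, using Python and NumPy) of the kind of Laplace counting query that could generate outputs like those in Table 4; the draws are random, so any particular run will differ from the table, but the pattern of large noise at epsilon .08 and small noise at epsilon 1.098 will hold:

```python
import numpy as np

rng = np.random.default_rng()

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> int:
    """One Laplace-mechanism answer to a counting query, post-processed by
    rounding and clipping at zero, as in Table 4."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return max(0, round(true_count + noise))

for true_count in (1, 5_000):
    for epsilon in (0.08, 1.098):
        samples = [noisy_count(true_count, epsilon) for _ in range(10)]
        print(f"real={true_count:>5}  epsilon={epsilon:<5}  samples={samples}")
```

Smaller epsilon values produce wider Laplace noise (scale of sensitivity divided by epsilon), which is why the .08 column strays so far from a real answer of one.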

 

Table 4. Epsilon Affecting Differentially Private Queries

| Sample | Real Answer: 1, ε = 0.08 | Real Answer: 1, ε = 1.098 | Real Answer: 5,000, ε = 0.08 | Real Answer: 5,000, ε = 1.098 |
|---|---|---|---|---|
| 1 | 23 | 1 | 5,014 | 5,000 |
| 2 | 3 | 2 | 5,003 | 5,000 |
| 3 | 8 | 1 | 4,994 | 5,001 |
| 4 | 12 | 1 | 5,020 | 4,999 |
| 5 | 2 | 2 | 4,970 | 4,999 |
| 6 | 1 | 2 | 5,000 | 5,000 |
| 7 | 2 | 1 | 4,983 | 5,000 |
| 8 | 9 | 1 | 4,996 | 4,998 |
| 9 | 12 | 0 | 4,993 | 5,000 |
| 10 | 3 | 1 | 5,005 | 5,000 |
| Average | 4 | 1 | 4,998 | 5,000 |

 

Would a mechanism using an epsilon value of 1.098, assuming this particular data setup, and answering this particular question, run afoul of HIPAA?

Pursuant to our two-step test, we first ask what risk of reidentification (i.e., what guess difference) a (1.098)-differentially private mechanism affords. The answer is 25%, according to Section III.B.2. With this in mind, we turn to step two, the maximum risk of reidentification a statute permits. Because HIPAA would apply here, we may look to the "very small" language found in the statute regarding expert-deemed "safe" data release. Is "very small" a term typically associated with 25%? If there were a one-in-four chance of rain tomorrow, would it be reasonable to call that a "very small" chance? In the balancing act a court would engage in, a one-in-four chance that an individual is reidentified is likely too high for HIPAA. Therefore, this mechanism, with this epsilon value, under this statute, would likely not produce legally compliant outputs.

If, on the other hand, the epsilon value were .08, would this change the outcome? In this case, the first step, the guess difference of the (.08)-differentially private mechanism, is 2%. An attacker would gain a mere 2% increase in confidence that a provided answer is the real answer; from something like a 50% chance that an output is truthful to a 52% chance. The second step would be inquiring whether 2% is less than or equal to HIPAA's maximum permitted risk of reidentification. Is this setup likely to be permitted by HIPAA? Yes. Although a court would have to balance the competing interests and risks being presented, a 2% chance of reidentification—especially when HIPAA does not require a 100% free-from-all-harm guarantee—is likely sufficient. A few points from this short hypothetical are notable.

For starters, this approval would cover a .08 epsilon value and lower. What this means is that a data steward would be able to interpret a stamp of approval on .08 to also mean that .07, .05, or .01 epsilon values are all appropriate, as sketched below. This provides freedom to adjust mechanisms to suit individual use cases while maintaining compliance.
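This point can be made concrete by running the guess-difference formula in reverse: given a risk ceiling, solve for the largest epsilon that stays under it. A sketch, assuming the blind-guess initial suspicion of .5 used throughout this Part:

```python
import math

def max_epsilon(statute_max_risk: float, initial_suspicion: float = 0.50) -> float:
    """Largest epsilon whose guess difference stays at or below the ceiling."""
    prior = initial_suspicion
    permitted_upper = prior + statute_max_risk  # the highest tolerable updated suspicion
    return math.log(permitted_upper * (1 - prior) / (prior * (1 - permitted_upper)))

print(round(max_epsilon(0.02), 3))  # ~0.08: this value and anything below it complies
print(round(max_epsilon(0.25), 3))  # ~1.099: roughly the Algorithm 2 setting
```

Because the guess difference grows monotonically with epsilon, a stamp of approval at .08 necessarily covers every smaller epsilon value as well.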

Secondly, it is notable that four, the average of the answers provided in the .08 epsilon column in Table 4, is fairly removed from the real answer of one. In fact, many of the responses in the .08 epsilon column appear to be inaccurate, though very privacy preserving. In short, this occurs because the real count is too small for the amount of noise differential privacy requires. If instead the real answer to this question were 5,000, then the sampling of likely outputs becomes more practical. These answers appear much closer to the real answer, yet privacy is still preserved. This example, therefore, highlights the non-panacea nature of differential privacy: it is workable only in certain settings with certain assumptions, one of which is that large numbers are needed to maintain accuracy in the face of the type of noise differential privacy requires. If granular accuracy is a necessity, differential privacy may not be the best tool for the job.

Finally, a likely counterargument would be that using an epsilon value of 1.098, assuming a real answer of 5,000, nonetheless appears privacy preserving, with answers like 4,998 and 4,999. That these responses are possible, however, does not affect the type of information an attacker would be learning from viewing these responses. A 1.098 epsilon value gives the attacker more assurance that any answer provided will be closer to the real answer, a feature we quantify with step one's guess difference. This high guess difference is validated when we look at the sampling of responses the mechanism would likely provide, averaging out to 5,000. Though the position is arguable, the knowledge the attacker gains from witnessing outputs at this particular epsilon value likely reveals too much (according to HIPAA) about how many individuals in this dataset have Crohn's disease, making it more likely that HIPAA would not approve this type of data sharing.

In summary, HIPAA would likely not permit the sharing of data under a mechanism using a 1.098 epsilon value, but likely would permit the sharing if the mechanism instead used a .08 epsilon value. The risk of reidentification—guess difference—at 25% is too high for the statute to stomach, but a 2% risk of reidentification—a no-greater-than 2% boost in confidence—is likely to see a green light. In this way, this mechanism is (HIPAA, .08)-differentially private.
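For a data steward who wants to operationalize this comparison, the whole two-step test reduces to a single inequality. The following sketch is ours and treats HIPAA's "very small" language as a hypothetical 2% ceiling purely for illustration; a court or expert would have to supply the actual number:

```python
import math

def guess_difference(epsilon: float, initial_suspicion: float = 0.50) -> float:
    """Step one: the reidentification risk an epsilon-DP mechanism permits."""
    prior = initial_suspicion
    boost = math.exp(epsilon)
    return (boost * prior) / (boost * prior + (1 - prior)) - prior

def is_statute_compliant(epsilon: float, statute_max_risk: float) -> bool:
    """Step two: compare step one against the statute's maximum tolerated risk."""
    return guess_difference(epsilon) <= statute_max_risk

HIPAA_MAX_RISK = 0.02  # hypothetical reading of "very small"
print(is_statute_compliant(1.098, HIPAA_MAX_RISK))  # False: a 25% risk is too high
print(is_statute_compliant(0.08, HIPAA_MAX_RISK))   # True: (HIPAA, .08)-differentially private
```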

C. Grease

Finally, it is worth being explicit about a few of the advantages, and limitations, our two-step test provides. True enough, the above analysis permitting the sharing of health data using a (.08)-differentially private mechanism is contrived. This hypothetical may not, and likely does not, fully capture the nuances of a legally sufficient case if such a case were to arise organically. Additionally, the application of mathematical answers to legal questions, and particularly mathematical answers to statutory deidentification-type questions, has been opposed, including by the U.S. Department of Health and Human Services explicitly, and with good reason—ambiguity is often helpful when debating these types of questions.

That said, this data is out there; it is being purchased and sold right now. Like a spile to a tree, companies are siphoning off profit from personal, sensitive data in any and every way that is monetarily feasible. What this means for privacy is similar to what Deep Blue meant for Kasparov: the game has changed; a brave new reality has already taken hold.

The surveillance economy should not be ignored by the same statutes that seek to regulate it, albeit in disjoint dollops. Instead, public policy should directly embrace the data-driven economy with the aim of promoting clear, black-letter guidance. That the above test lacks a dose of justiciability does not offset the fact that it allows a data steward to reason about a privacy-preserving mechanism as it would be measured against a statute—in turn giving rise to compliance-inspired confidence, something with monumental side effects for societal advancement, which are worth noting explicitly.

For one, this confidence encourages the liquidity of privacy-protective data, allowing for breakthroughs in science and technology with reduced privacy harms. Second, the test permits clear guidance on exactly how much sanitization to require for data sharing to become legal, which has additive incentives when paired with the iterative approach of the common law. More specifically, when interpreting statutes, assuming no further administrative guidance is offered, the law builds on itself iteratively (e.g., the common law is marked with judicial precedent developing an understanding in a particular area). In this way, an epsilon value applied to a specific statutory situation may be deemed legal, in turn offering standard-setting effects. This is the same process that less mechanical concepts undergo: under the Fourth Amendment, police can conduct a stop and frisk of someone on the street if there is reasonable suspicion of criminal activity, but police cannot conduct a stop and frisk if there is no reasonable suspicion of criminal activity. Differential privacy would run similarly: under HIPAA, you can release data using an epsilon value of .08, but you cannot release data using an epsilon value of 1.098. This would provide guidance for the societal decision making involved in setting epsilon, as Professor Dwork has emphasized in her work on epsilon and risk-balancing decisions.

Conclusion

Differential privacy is well equipped to do exactly what it says it will do, mathematically speaking. As defined, the concept happily and routinely meets the guarantees it espouses—an ε-differentially private algorithm acting on a dataset with one gingersnap eaten versus the same algorithm acting on a dataset with no gingersnaps eaten produces a very similar output. Assuming you were the one who ate the gingersnap, and you knew that anyone could inspect the differentially private mechanism's source code and access its outputs, you may be assured that the chance you will be reidentified is low—an e^ε type of low. What this means for a cookie-eating regulatory statute, however, is anything but well defined.

This Article introduces a novel, two-step test which may be easily applied to statutes regulating data. Step one looks at the best-case scenario for an attacker, that is, someone who, ultimately, wants to reidentify the gingersnap-eating epicure. The result of this first step is a single percentage, a legally comparable quantity representing the risk of reidentification an ε-differentially private mechanism accommodates. This percentage may then be measured against step two—the highest risk of reidentification a statute permits. If step one is lower than or equal to step two, the mechanism may be deemed (statute, ε)-differentially private. For example, if a court were to deem HIPAA as permitting no more than a 2% reidentification risk (i.e., setting epsilon at .08, for a guess difference of 2%), then a mechanism could be deemed compliant: (HIPAA, .08)-differentially private.

That this outcome may not perfectly capture the exact percentage the legislature had in mind when it used the "very small" language found in HIPAA is beside the point; to be sure, a court, possibly with the help of an expert witness, will be able to assess the risks and weigh the benefits of data release given a justiciable case. Rather, the true benefit of this type of test lies in its ability to provide a black-letter line to data stewards, a line which is able to tout the same guarantees differential privacy touts—giving rise to confident, safe, and useful data sharing. The law and legislature, in turn, would be well served to grease the wheels on this type of compliance-inspired confidence, not continue to hinder technological progress by shielding data behind ambiguous requirements without a definitive means of meeting those requirements.
