So, what is the relationship between an exasperated German, a sanguine Frenchman, and a pile of dirt? Some, like the German, would have you believe that the best way to preserve privacy is to defer to technologists and sequester data under lock and key. Others, like the Frenchman, are more sanguine about the technical challenges and prefer to maximize the data’s utility. Finally, the pile of dirt is a reminder that novel legal issues are fertile ground for critical thinking, provided we steer clear of the oversimplification that buzzwords can introduce.
This article outlines how data scientists use “differential privacy” technology to protect personal and sensitive data. A critical input to their work is the “privacy budget,” which establishes how much privacy society is willing to exchange for increased utility. In addition to introducing the emerging field of data privacy, I hope to convince you that safeguarding privacy requires skills (e.g., balancing competing values, fact-finding) at which judges excel.
The Paradox of Privacy
Balancing the insights gained from a large pool of data with the privacy of its participants is a delicate challenge. Anonymizing data by excluding names or sharing only aggregate figures like averages may seem sufficient, but sophisticated computing can link these data back to individuals. Indeed, the “Fundamental Law of Information Recovery” holds that any detail shared from a study, no matter how minor it appears, erodes participant privacy to some extent. The more we share, the greater the intrusion on personal privacy. Given the stakes, how do we quantify privacy loss and protect against it?
Learning from Failure
To understand how to measure privacy, it’s helpful to start by imagining how an outsider might take the published results of a study and try to reconstruct the private information of its participants. Attempts to steal or access data by hacking are not considered here, as those attacks depend on breaches of physical security or cybersecurity.
By contrast, we can mathematically shield against attacks that apply vast computing power to published statistics in search of the likeliest participant responses. The method resembles guessing a padlock combination through countless trials, except the attacker hunts for the “answer combinations” most consistent with public demographic data like the census. The closer a guessed combination aligns with the known statistics, the more confident the attacker becomes; the less it aligns, the lower the attacker’s confidence.
Consider the following example. In a quest to understand the impact of lifestyle choices on health, medical researchers interview gym members in a small rural town. The anonymized dataset includes age, nutritional preferences, and workout frequency. The researchers’ challenge is to ensure the privacy of the seven dedicated participants in the face of potential statistical inference attacks.
The researchers want to collaborate, so they publish the following anonymous data:
- Participant 1: 24 years, Vegetarian, Gym 4x/week
- Participant 2: 30 years, Keto, Gym 2x/week
- Participant 3: 45 years, Vegan, Gym 5x/week
- Participant 4: 38 years, Omnivore, Gym 3x/week
- Participant 5: 52 years, Paleo, Gym 6x/week
- Participant 6: 27 years, Vegetarian, Gym 2x/week
- Participant 7: 41 years, Vegan, Gym 4x/week
Combing through the social media profiles of the town’s residents, a bad actor assembles the following public information in hopes of identifying the seven participants:
- Only one 24-year-old vegetarian is known to frequent the gym 4 times a week (Probability 1/1).
- Among the 30-year-olds, three follow a Keto diet, but only one pairs it with twice-weekly gym visits (Probability 1/3).
- The local vegan society’s chairperson, who is 45, is an avid gym-goer, present 5 times a week (Probability 1/1).
- For the 38-year-olds, omnivorous diets are common, making this a more difficult match (Probability 1/20).
- The 52-year-old who endorses the Paleo diet and works out 6 times a week is a notable local athlete (Probability 1/1).
- The 27-year-old vegetarian demographic is fairly large; however, gym frequency refines this (Probability 1/10).
- A 41-year-old vegan gym enthusiast is featured in a recent wellness blog, hinting at a match (Probability 1/2).
With this approach, we can see how public data might reveal identities when interwoven with survey results. For participants 1, 3, and 5, their unique habits make reidentification a certainty.
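To make the arithmetic concrete, here is a minimal Python sketch of the attacker’s reasoning. The records simply restate the published rows above; the `matching_residents` table and the printed risk labels are hypothetical conveniences for illustration, not part of any real attack tool.

```python
# A minimal, hypothetical sketch of a linkage attack on the published rows.
# Each published record lists quasi-identifiers; the attacker's outside
# knowledge is how many townspeople match each combination.

published_rows = [
    {"id": 1, "age": 24, "diet": "Vegetarian", "gym_per_week": 4},
    {"id": 2, "age": 30, "diet": "Keto",       "gym_per_week": 2},
    {"id": 3, "age": 45, "diet": "Vegan",      "gym_per_week": 5},
    {"id": 4, "age": 38, "diet": "Omnivore",   "gym_per_week": 3},
    {"id": 5, "age": 52, "diet": "Paleo",      "gym_per_week": 6},
    {"id": 6, "age": 27, "diet": "Vegetarian", "gym_per_week": 2},
    {"id": 7, "age": 41, "diet": "Vegan",      "gym_per_week": 4},
]

# Outside knowledge gleaned from social media: number of residents matching
# each (age, diet, gym frequency) profile.  These counts reproduce the
# probabilities listed above.
matching_residents = {1: 1, 2: 3, 3: 1, 4: 20, 5: 1, 6: 10, 7: 2}

for row in published_rows:
    candidates = matching_residents[row["id"]]
    risk = 1 / candidates  # chance a single guess names the right person
    label = "REIDENTIFIED" if candidates == 1 else f"1 in {candidates}"
    print(f"Participant {row['id']}: {label} (risk {risk:.0%})")
```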
Using more sophisticated statistical techniques, real-world datasets involving sensitive personal data have been compromised. Notable examples include the following:
- In 1997, Massachusetts made state employee health data research accessible, removing identifiers for privacy. Yet, Governor William F. Weld’s records were returned to him by Latanya Sweeney, a Massachusetts Institute of Technology student, who matched remaining data like birthdates and ZIP codes with voter information.
- In 2008, researchers highlighted the dangers of a National Institutes of Health (NIH) policy requiring the release of aggregate genomic data. The researchers identified the diseased participants by comparing mutation frequency in the released data versus the general population. Following this revelation, the NIH reversed its policy.
- In 2011, researchers showed that it’s possible to extract personal information about Amazon.com purchases through its product recommendation system, which provides aggregate-level recommendations like, “Customers who bought this item also bought A, B, and C.” By monitoring how recommendations changed over time and cross-referencing them with customers’ public reviews, researchers could infer precisely which item a customer purchased on a particular day—even before the customer wrote a review.
- In 2021, the U.S. Census Bureau was able to reconstruct block-level microdata (e.g., name, age, race) with 80 percent accuracy by combining the anonymized 2010 census and commercial demographic information (e.g., white pages).
As these examples illustrate, our intuition about privacy is often flawed. Not only are humans biased toward present benefits, but computers are increasingly adept at extracting personal information from sources that might seem innocuous to the untrained eye. The answer, then, is not to rely on intuition but to quantify privacy.
Measurable Privacy
Privacy loss from sharing data can be measured by the certainty an attacker could attain. When the possible combinations of factors (e.g., age, gender, race) are similarly plausible, an attacker cannot distinguish among the possibilities to draw inferences, and the data remain protected.
In the pursuit of privacy, it’s paramount that all hypothetical outcomes within a dataset carry a similar weight of credibility, avoiding any distinct “plausibility peaks” that might betray underlying truths. Picture this concept as a graph: probabilities along the vertical, combinations along the horizontal. An even distribution of potential outcomes yields a gently wavering line, not perfectly straight, but consistent.
However, when a few select scenarios seem more plausible than the rest, they create a noticeable bulge in our graphical landscape, akin to a dromedary’s hump, drawing attention and suggesting proximity to actual data points.
We mathematically assess the graph’s maximum slope to measure and manage the risk of revealing such distinguishing peaks. Keeping this gradient moderate ensures no particular outcome disproportionately suggests accuracy. In practice, we introduce a technique known as “jittering”—a strategic sprinkling of randomness into the data—to blur the steepness of any slopes. This method effectively disguises the likelihood of any one outcome, thwarting attempts to discern the actual from the possible.
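For readers who want to see jittering in action, the sketch below adds noise drawn from a Laplace distribution, the workhorse of differential privacy, to a single count. The count of seven gym-goers echoes the earlier example; the epsilon values are placeholders chosen only to show how smaller budgets produce blurrier answers.

```python
import numpy as np

def jitter_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to a counting query.

    A count changes by at most 1 when one person is added or removed
    (sensitivity = 1), so noise with scale 1/epsilon suffices.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical example: seven gym-goers, published at different budgets.
true_count = 7
for epsilon in (0.1, 1.0, 10.0):  # smaller epsilon = more privacy, more blur
    noisy = jitter_count(true_count, epsilon)
    print(f"epsilon={epsilon:>4}: published count = {noisy:.1f}")
```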
In a courtroom, jittering would be unacceptable. For example, if a witness willfully described an assailant as a 21-year-old male wearing a red shirt, and the assailant was, in fact, a 23-year-old male wearing a blue shirt, we would say the witness was lying. Contemplate, however, a medical study trying to determine the role of shirt-wearing on skin cancer rates in males in their twenties. In this context, the jittering seems reasonable because the adjustments are not large enough to change the conclusions drawn from the data.
The concept of a “privacy budget” helps navigate the delicate balance between preserving confidentiality and retaining the usefulness of data. For instance, if one researcher wishes to provide another with an age estimate without revealing the exact number, they might randomly adjust the actual age within a range of, say, five years. This method ensures that age remains a protected piece of information, albeit with a window of uncertainty.
To further enhance privacy, expanding the range of random adjustment to, for example, 10 years widens the field of plausible ages. This increase comes with the cost of accuracy, reflecting the inherent trade-off between privacy and precision. The more we obscure the data, the less precise it becomes. When sharing information, the goal is to find the sweet spot where data remain informative, yet personal privacy is respected.
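A toy version of that trade-off, under the simplifying assumption that the adjustment is drawn uniformly from the stated range (real differential-privacy mechanisms shape the noise more carefully), might look like this:

```python
import random

def report_age(true_age: int, spread_years: int) -> int:
    """Report an age randomly adjusted within +/- spread_years of the truth."""
    return true_age + random.randint(-spread_years, spread_years)

true_age = 45  # hypothetical participant
for spread in (5, 10):
    samples = [report_age(true_age, spread) for _ in range(5)]
    print(f"+/-{spread} years: possible reports {samples}")

# Widening the range from 5 to 10 years makes each report less precise,
# but it roughly doubles the number of true ages an attacker must consider.
```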
As datasets grow, striking a balance between robust privacy and reliable accuracy becomes increasingly achievable. With larger datasets, it becomes easier to hide individual details while still spotting overall patterns, which helps keep personal information private without losing the general accuracy of the data. This is akin to an Impressionist painting. Even though artistic license (what mathematicians would call “noise”) makes it less accurate than a photograph, you can still get the overall picture even if you cannot tell what any particular dab of paint represents.
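A quick numerical illustration of why size helps, reusing the same kind of Laplace noise as before (the counts and budget are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0  # the same privacy guarantee for every release
for true_count in (10, 1_000, 100_000):
    noisy = true_count + rng.laplace(scale=1.0 / epsilon)
    relative_error = abs(noisy - true_count) / true_count
    print(f"true={true_count:>7}  noisy={noisy:>10.1f}  "
          f"relative error={relative_error:.4%}")

# The noise has the same absolute size in every row, so its relative
# impact shrinks as the population being described grows.
```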
It’s important to calibrate the random noise added to protect a dataset rather than sprinkling in arbitrary amounts. To see why, flip to the front cover of this magazine and picture it shrunken down to the size of a postage stamp. Now imagine a sheet of these stamps, each a slightly noisy copy of the cover, with the sheet the same dimensions as the original. Releasing many noisy copies of the same image, even very small ones, does little to protect privacy, because the original can be recovered by averaging the copies: the noise cancels out while the picture remains.
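The postage-stamp point can be demonstrated with a few lines of code. In this hypothetical sketch, the same secret number is released many times, each time with fresh noise of a fixed size; averaging the releases all but erases the protection:

```python
import numpy as np

rng = np.random.default_rng(1)
secret = 42
# 10,000 releases of the same secret, each with independent Laplace noise.
releases = secret + rng.laplace(scale=5.0, size=10_000)

print(f"one noisy release : {releases[0]:.1f}")
print(f"average of copies : {releases.mean():.2f}")  # converges back to 42

# Because the noise was not scaled up with the number of releases, the
# average cancels it out and the original value reappears.
```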
Managing a Finite Budget
Differential privacy offers a mathematical approach to understanding how privacy loss accumulates when publishing statistics. For instance, if we publish both the average and the median ages of a group, each with a privacy loss of 3, the total privacy loss doesn’t exceed 6. This allows us to allocate our privacy budget strategically across multiple data releases, choosing between several lower-impact releases or a single, more accurate one.
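A back-of-the-envelope version of that bookkeeping, using the purely illustrative numbers above:

```python
# Basic (sequential) composition: privacy losses add across releases.
total_budget = 6.0
releases = {"average age": 3.0, "median age": 3.0}  # loss spent per release

spent = sum(releases.values())
print(f"spent {spent} of {total_budget}")  # 6.0 of 6.0
assert spent <= total_budget, "privacy budget exceeded"

# The curator could instead publish one statistic with the whole budget
# (more accurate) or several smaller releases (each less accurate).
```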
Navigating the intricate balance between privacy preservation and the integrity of shared data is complex. It calls for a nuanced understanding of the societal value placed on the knowledge gleaned against the privacy risks involved. Conveying these concepts of “accuracy” and “privacy loss” in accessible terms is essential for public engagement.
We cannot share personal data without affecting privacy. To mitigate this, we may alter the data slightly—a process known as “jittering”—which affects their accuracy. For example, if we adjust the true population of a town, we might aim for our altered figure to fall within 10 people of the actual number 98 percent of the time. This keeps the data useful while still protecting privacy. The essence of privacy loss is how easily an individual could be identified from the data shared.
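To connect that accuracy promise to a privacy budget, here is a rough calculation that assumes Laplace noise on a simple population count (which changes by at most one person when someone joins or leaves); the 10-person and 98 percent targets come from the example above.

```python
import math

target_error = 10    # stay within 10 people ...
confidence = 0.98    # ... 98 percent of the time

# For Laplace noise with scale b, P(|noise| <= t) = 1 - exp(-t / b).
# Solving 1 - exp(-target_error / b) = confidence for b:
scale = target_error / math.log(1 / (1 - confidence))

# A population count has sensitivity 1, so the implied budget is 1 / scale.
epsilon = 1 / scale
print(f"noise scale = {scale:.2f}, implied privacy loss (epsilon) = {epsilon:.2f}")
```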
Protecting privacy is a matter of limiting obvious inferences from the data we share. It’s about ensuring that no single interpretation of the data stands out distinctly. As individuals participating in research, we should advocate for and expect robust privacy safeguards.
Data curators are tasked with understanding the community’s preferences regarding privacy versus accuracy, making judicious use of the privacy budget, and ensuring that the most accurate responses are reserved for the most critical inquiries. As the lead data scientist of the U.S. Census Bureau has opined, it is important not to waste an inherently limited privacy budget by publishing accurate answers to unimportant questions.
You Know More Than You Think
Differential privacy, particularly the concept of a privacy budget, shares several notable similarities with the duties of a judge. Here are 10 major points of similarity:
- Balancing Competing Interests: Just as a judge in a civil lawsuit must balance the competing interests of the parties involved, ensuring fairness and justice, differential privacy requires balancing the need to protect individual privacy with the societal benefits of data analysis.
- Using Discretion in Decision-Making: A judge exercises discretion in interpreting laws and making decisions based on the unique circumstances of each case. Similarly, in differential privacy, decisions about allocating a privacy budget involve discretion in determining how much “noise” to add to a dataset or how much privacy risk is acceptable in a particular context.
- Allocating Resources Judiciously: Managing a privacy budget in differential privacy is akin to a judge managing courtroom resources and judicial power—both require careful allocation to ensure fairness and efficiency.
- Guarding Against Bias: Judges must remain impartial and guard against bias, similar to how differential privacy mechanisms are designed to prevent biases in data analysis that could compromise privacy.
- Adhering to Predefined Parameters: Judges operate within the confines of legal frameworks, procedural rules, and precedents, much like how differential privacy operates within the constraints of its mathematical framework. The privacy budget in differential privacy is a predefined parameter that guides how much information can be disclosed, analogous to how guidelines and precedents shape judges’ decisions.
- Assessing Incremental Impact of Decisions: Just as each decision a judge makes in a case can incrementally affect the outcome of the lawsuit, each query made against a dataset consumes a portion of the privacy budget in differential privacy. This cumulative impact requires careful management and foresight, similar to how a judge must consider the cumulative impact of their rulings throughout a case.
- Weighing Ethical Considerations: Ethical considerations are paramount for judges in making fair decisions, just as they are crucial in applying differential privacy, ensuring that all subsets of data are represented fairly without compromising individual data points.
- Applying Objective Standards for Decision-Making: Judges are expected to apply laws and legal principles objectively, without personal bias, ensuring fair and consistent decisions. In differential privacy, the concept of a privacy budget is grounded in mathematical principles that provide an objective standard for decision-making. This ensures that privacy is protected in a consistent and unbiased manner, not unlike the expectation of objectivity in judicial decisions.
- Making an Impact Assessment: Just as a judge must consider the broader impact of their rulings, managing a privacy budget in differential privacy involves assessing the impact of data release on privacy—now and in the future.
- Evolving with Changing Standards: The law and judicial decisions evolve over time, just as the field of differential privacy evolves with new research and methodologies for better privacy management.
These similarities highlight the importance of careful, balanced decision-making within a defined framework to achieve privacy protection.
Conclusion
In light of the complexities and potential vulnerabilities associated with data privacy, differential privacy stands out as a beacon of hope, a mathematical fortress designed to protect individual information in an era when data are both currency and commodity. It offers a nuanced approach that does not simply obscure but intelligently masks the presence of individual data. Using a privacy budget—much like a financial budget—enables data curators to manage the balance between the richness of data utility and the sanctity of privacy.
As the concept of privacy continues to evolve alongside technology, differential privacy principles will remain essential, ensuring that personal information is safeguarded against the relentless progression of data analysis techniques. The promise of differential privacy, therefore, is not just in its current applications but in its potential to adapt and provide robust privacy protections for generations to come. Just like the law.