February 28, 2020 Feature

Big Data Analytics: Ethical Considerations Make a Difference

By Cathy Petrozzino

Big data is ubiquitous; when used by governments, it can have significant, sometimes existential consequences for individuals, communities, or even the nation. This article is adapted from a study of the ethical challenges faced by institutions—particularly agencies of the government—in reconciling the diverse legal obligations to agency “missions” with sometimes conflicting requirements of the Federal Privacy Act and other statutory and policy protections and overarching ethical obligations confronting stewards of big data sets and their counsel. What considerations apply to tools and techniques used to analyze this data? How should big data stewards address these differing requirements and obligations associated with the data life cycle—the collection, stewardship, use, and dissemination of data, particularly through passive collection and predictive algorithms and other machine learning tools?

In 2011, big data was listed on Gartner’s Hype Cycle for Emerging Technologies1 in the “On the Rise” category. By 2015, big data had dropped off Gartner’s Emerging Technologies Hype Cycle altogether.2 According to Betsy Burton, who authored the 2015 study, “big data has quickly moved over the hype curve’s ‘Peak of Inflated Expectations’ . . . and has become prevalent in our lives. . . .”3 Although some believe this was a premature declaration, data science based on big data is playing an increasingly important role in organizations’ efforts to, for example, better understand the habits of their customers or users and increase operational efficiencies.

What is big data? Although big data does not have an authoritative definition, there are three commonly agreed-upon attributes:

  • Volume—it has a large amount of data.
  • Velocity—the data is ingested rapidly; in some cases, in real time.
  • Variety—the data has varying types (e.g., structured and unstructured) and may come from disparate sources.

The volume, velocity, and variety of the data make it unsuitable for processing by traditional relational database applications. This article is focused on practices with the following three characteristics:

  • Big data (defined above) is processed using big data tools (e.g., Hadoop).
  • The data has personal information.
  • Algorithms and modeling are used to derive “hidden, meaningful” information from the data. To clarify, an algorithm is a set of rules that are followed in order to solve a problem. A model is built by using an underlying algorithm and is shaped by the training data.4

Many of the challenges identified in this article are also relevant to data projects more generally; big data is highlighted because of its challenging attributes, such as the variety of data and amount of personal information involved.

In addition to private-sector organizations collecting, storing, and conducting analytics on massive amounts of personal and personal health information, the public sector at every level—federal, state, local, and tribal—also has benefited from its creation of big data collections and applications of data science.

An example of this is the Veterans Health Administration (VHA), which employs the Care Assessment Needs (CAN) score, a weekly analytic predicting the likelihood that a patient is at a high risk of hospitalization or death.5 The Social Security Administration’s (SSA’s) Office of Anti-Fraud Program and fraud prevention units use data analytics for early potential fraud detection to assist experts who are trying to identify abusers of the system.6

At the same time, concerns have been raised about the use of data science to drive decision making in the government sector. Is the diversity of the population fairly and accurately represented? Is individuals’ privacy respected, and are individuals permitted control over their information? Is there bias in the system that serves to disparately impact different segments of the population? Is there sufficient transparency to support democratic principles that are needed to support a government of the people, by the people, and for the people?

Some of these concerns have sounded more like howls of protest: different segments of the population, many of whom are woefully underrepresented in the high-tech employment sphere, worry that once again people unlike them are developing “disruptive” solutions that could result in further inequality, whether in the form of benefits that disproportionately favor the developers’ demographic or of harms that unintentionally target already marginalized populations.

An example of where these concerns of bias or discrimination arise is the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS)7 tool, which is used by local judicial systems to predict offenders’ recidivism risk. After a questionnaire about the offender is completed, the responses are used to generate a recidivism risk score. A ProPublica report found that COMPAS scored black offenders more harshly than white offenders who had similar or even more negative backgrounds. This was followed by a rejoinder from the independent Community Resources for Justice association, which argued that the analysis underlying the ProPublica article was faulty.8 To add further pepper to the stew, different studies question whether COMPAS works: does it achieve its mission objective of improving recidivism prediction accuracy? All this suggests that COMPAS needs more careful ethical forethought to ensure demonstrably useful, fair results.

There are other examples of problematic or questionable data models used in the public sector. Predictive policing in particular has been subject to scrutiny and has been criticized as a bias-afflicted self-fulfilling prophecy.

These concerns are at odds with the eagerness of the public sector to jump on the data analytics bandwagon. In June 2019, the Office of Management and Budget issued Memorandum M-19-18, a memo outlining the Federal Data Strategy with the following objective9:

The mission of the Federal Data Strategy is to leverage the full value of Federal data for mission, service, and the public good by guiding the Federal Government in practicing ethical governance, conscious design, and a learning culture.

Although “ethics and privacy” is one of the three main principles of the Federal Data Strategy, the Strategy’s emphasis is clearly on the missions of federal agencies: promoting and enabling data sharing. The General Services Administration (GSA) is assigned the responsibility for providing ethical guidance. As such, GSA “will publish and promote the Data Ethics Framework for the Federal Government.” Given that over the past 45 years the Privacy Act has faced challenges in meaningfully protecting U.S. citizens’ information, and given the lack of a robust legal foundation for big data ethics in the public sector, GSA’s efforts, although a start, will not fully allay concerns.

What is it about big data analytics that raises ethical concerns? To fully understand the underlying issues giving rise to evidence of bias or discrimination and thus provide bases for examining these data tools’ methodologies under an ethical framework, it is important to analyze the data science life cycle and the ethical (including privacy) concerns that arise during the life cycle. This analysis uses the Cross Industry Standard Process for Data Mining (CRISP-DM)10 for the life-cycle definition. CRISP-DM consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. These phases are detailed below (some are combined) along with ethical considerations.

The principles underpinning the ethical analysis are grounded in The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research,11 which extends the ethical principles established in the 1979 Belmont Report12 and the Fair Information [Privacy] Practices or Principles (FIPs).13

Business Understanding Phase

The Business Understanding phase identifies the target problem that a particular agency initiative is trying to address. In this phase, the problem definition and what constitutes “fairness” should be analyzed from an ethical perspective. Ideally, it should take into account such factors as the nature and elements of diversity of the populations reflected in the data on such issues as gender, race, age, and other categories that often implicate concerns of discrimination.

Examples abound of ethically questionable data science initiatives. A 2019 New York Times Opinion piece14 describes facial recognition software that is used by one commercial company to analyze job candidates (evaluation categories include “personal stability” and “conscientiousness and responsibility”), and in academic research to predict sexuality (i.e., “gaydar”).

There are also more subtle ethical issues. In a 2017 HHS paper on Predictive Analytics in Child Welfare,15 MITRE’s Christopher Teixeira and Matthew Boyas present an example where “one stakeholder prefers to model directly from birth, using information from birth certificates as the base set of information to inform the models about a child’s parents, their health, and their home environment. . . . Other stakeholders contend that it is not ethical to predict on such information from a child who has yet to be abused, instead preferring to focus their modeling on children who are already active cases.”16

On the flip side, there’s the question of agency policy and obligation in responding to this type of predictive algorithm. If there’s a red flag (or pinkish?), is the agency ethically or legally obligated to act? Does the agency have legal authority, substantive expertise, and adequate resources to act on all flags raised by the data science program?

The question of “fairness” is a factor in the problem definition, modeling, and deployment stages. For an organization that is exploring the use of predictive analytics to determine how to group individuals (e.g., those who are or are not likely to be repeat offenders, fraudsters, or terrorists), there are nuanced definitions of fairness. This will be discussed more in the later life-cycle phases.

Data Understanding and Preparation Phases

The Data Understanding phase is tightly coupled with the Business Understanding phase. When formulating the data analytics to address the business problem, it is critical to understand what data is available and will be used to support the analysis. The Data Preparation phase typically involves collating data from different sources and addressing known quality issues that exist with the different data sources. This includes interpreting outliers and incomplete data sets. Assessing what to do about outliers can have profound impacts, especially if the initiative’s goal is focused on outliers.

This phase is rife with privacy-related ethical concerns. In some cases, data that is collected as part of providing a service is repurposed for an entirely different reason. Long legal or privacy notices that are vehicles for transparency are often difficult for individuals to understand, may be vague in terms of personal information use (e.g., “research”), and place the burden squarely on the individual to spend the time and effort to navigate.

These concerns are amplified with the use of “publicly available” (e.g., social media) data in big data analytics. Danah Boyd and Kate Crawford make the following argument: “Data may be public (or semi-public) but this does not simplistically equate with full permission being given for all uses.”17 This is especially true given the frequent lack of full transparency regarding agency-originated collections or use of public data and the ignorance of individuals on what’s collected and shared, as well as the specific intended use of the data for more consequential purposes such as behavioral predictions. Beyond that, a review of social media users’ attitudes towards data being used for research showed that people do care and want autonomy in how their data is used; more concern was expressed about government and corporate uses versus uses by nonprofit entities.18

The use of de-identified data sets is sometimes deemed sufficient for allowing use of individuals’ data even with sensitive analytics. But de-identification is a risk mitigation measure; it is not an absolute solution for ensuring privacy. The risk of re-identification grows as disparate data sets are combined, which is common with data science programs using big data.
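The linkage risk described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the records are fabricated, and the choice of ZIP code, birth year, and sex as quasi-identifiers is hypothetical rather than drawn from any data set discussed in this article.

```python
# Toy illustration of re-identification by linking two data sets.
# All records below are fabricated for this example.

# "De-identified" health records: names removed, quasi-identifiers kept.
health_records = [
    {"zip": "02139", "birth_year": 1961, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1988, "sex": "F", "diagnosis": "hypertension"},
]

# A separate, publicly available list that includes names.
public_list = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1961, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "zip": "60601", "birth_year": 1990, "sex": "F"},
]

# Hypothetical quasi-identifiers shared by both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Build the linkage key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(deidentified, named):
    """Return (name, diagnosis) pairs where a record's key matches exactly one named person."""
    names_by_key = {}
    for person in named:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = []
    for record in deidentified:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(health_records, public_list)
for name, diagnosis in reidentified:
    print(f"{name}: {diagnosis}")
```

In this toy case, two of the three "de-identified" records are tied back to a name once the second data set is available, even though neither data set alone pairs a name with a diagnosis.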

Beyond privacy concerns, there are ethical concerns surrounding the use of historical data that embeds the societal biases that surrounded its creation. A recent article from the Harvard Business Review highlighted bias in hiring algorithms. According to the author: “Unfortunately, we found that most hiring algorithms will drift toward bias by default.”19

It also has been noted that the absence of a field associated with a protected class (e.g., race, gender) does not mean that the protected class cannot be derived from the data set. Address can be a proxy for race; income and title can be proxies for race and gender. Thus, a tool that can discern or synthesize such data may support discriminatory analyses even if it was not designed for that purpose, or even imagined to be capable of abuse.
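A small sketch can show how a proxy works in practice. The records below are fabricated, and the rule of guessing a person's group from the most common group in his or her neighborhood is a deliberately simple assumption, not a method described in the sources cited here.

```python
from collections import Counter, defaultdict

# Fabricated records: (neighborhood, protected_group). Invented for illustration.
records = [
    ("north", "A"), ("north", "A"), ("north", "A"), ("north", "B"),
    ("south", "B"), ("south", "B"), ("south", "B"), ("south", "A"),
]


def proxy_accuracy(rows):
    """Guess each person's group as the most common group in their neighborhood
    and return the share of guesses that are correct."""
    groups_by_area = defaultdict(Counter)
    for area, group in rows:
        groups_by_area[area][group] += 1
    # Within each neighborhood, the majority-group count is the number of correct guesses.
    correct = sum(counts.most_common(1)[0][1] for counts in groups_by_area.values())
    return correct / len(rows)


accuracy = proxy_accuracy(records)
print(f"Group recovered from neighborhood alone: {accuracy:.0%}")
```

On this toy data the two groups are equally sized, so a blind guess would be right half the time; using neighborhood alone recovers the removed field for three quarters of the records.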

Another challenge with data is appropriate representation of different populations in the data itself. The push for evidence-based medicine highlights the criticality of representation. For decades, there have been concerns about minority representation in research. The 1993 National Institutes of Health (NIH) Revitalization Act20 required all federally funded clinical research to prioritize the inclusion of women and minorities. Yet a 2015 study showed that since 1993 less than 2% of the 10,000-plus cancer studies have included enough minorities to be relevant, as did less than 5% of the respiratory studies.21

Such data bias and representation challenges are examples of a broader challenge with data—the available data may not be ideal for a particular data science initiative. Modeling requires a large amount of the “right data” to function accurately; this data may not exist. For example, there may be insufficient data that reflects the outcome that is being studied for predictive analytics, especially outcomes that are low frequency (e.g., who is at high risk to be a terrorist). An initiative may need other somewhat related data that introduces additional ethical concerns and reliability questions.

Modeling and Evaluation Phases

During the Modeling phase, algorithms are applied to the data to look for significant or useful patterns in addressing the target problem, models are created to reflect these patterns, and promising models are further analyzed. Human domain expertise should be used to help shape the modeling analysis, although the human expert may be a source of inadvertent bias in the modeling. As Cathy O’Neil noted, “models are opinions embedded in mathematics.”22 The Evaluation phase assesses how accurately the model performs based on statistical analysis.

Often big data models are developed to answer an organization’s question on how to categorize or classify individuals into groups, a concept known as social sorting.23 The purpose of social sorting is to manage defined populations for particular purposes. There are ethical concerns in both the construction of the groups and how they are eventually used. While the sorting of populations by a commercial advertiser may be done for a predominantly benign purpose such as for marketing analysis and product targeting, the same techniques also support sorting of populations by government agencies for the much more significant purpose of applying or excluding populations from benefits or other government programs. It is during the modeling phase that the rules to define these groups are established through discovered patterns. The very definition of what groups are of interest and the inferences that are used to form these groups can be shaped by biases. Individuals, as members of these groups, are impacted by how these groups are formed.

Big data modeling often does not follow the traditional scientific model of hypothesis, experimentation, and repeatability. Even if a hypothesis is established (and there is an ongoing debate in the modeling community on the value of establishing a hypothesis), it can be difficult to independently validate the model due to the “volume, variety, and velocity” of the data. Without validation, there’s a higher risk that the results are unreliable, which can have profound implications for individuals and the mission.

Along these lines, an observation made by a number of critics is that with enough data, patterns will emerge. Big data “is the kind of data that encourages the practice of apophenia: seeing patterns where none actually exist, simply because massive quantities of data can offer connections that radiate in all directions.”24 This is not to say that meaningful patterns don’t emerge—the challenge is how to delineate a pattern that is meaningful and how to analyze ethical issues associated with spurious patterns.
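The point that patterns emerge from enough data can be demonstrated with pure noise. The sketch below is an illustration under stated assumptions: the sample size, the number of features, and the random seed are arbitrary choices, and every value is randomly generated rather than taken from real data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_people=20, n_features=200, seed=0):
    """Generate unrelated random data and return the largest |r| between
    any single feature and the outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_people)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


best_r = strongest_spurious_correlation()
print(f"Strongest correlation found in pure noise: {best_r:.2f}")
```

None of the features has any real relationship to the outcome, yet searching across enough of them reliably turns up at least one that looks meaningfully correlated. Raising `n_features` makes the effect stronger; raising `n_people` weakens it.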

Concern has also been raised about models that influence the outcome that they are trying to predict. Predictive policing models have been criticized for this reason. “While proponents advocate that such systems remove subjective biases in favor of objective empirical fact, data on crime is often biased because arrests are more likely to occur in neighborhoods that are monitored more heavily to begin with.”25

As the sensitivity of the inference increases, and the gulf between the context of the source data and the model grows, so can ethical concerns. There are privacy concerns surrounding the use of the data beyond the purpose for which it was collected. This is not just a transparency and trust issue; there are concerns about the appropriateness (including integrity) of using personal information out of context for an entirely different purpose than that for which it was initially collected. This is heightened in those situations where the individual has little choice in participating in the collection (e.g., ubiquitous surveillance cameras, digital breadcrumbs). There is also a loss of autonomy; individuals lack control over the sensitive information that has been inferred and cannot represent their self-interests. In some cases, as the earlier-referenced New York Times Opinion piece highlights, sensitive inferences may appear to be all but random and nonsensical.

There are recognized issues with the fairness and transparency of algorithms and modeling; the topic is an active research field, especially in the area of machine learning (ML) (or artificial intelligence (AI)) algorithms. ML algorithms use training data to automatically develop models that fit the data. Different algorithms are used to model the data, and the most promising models are evaluated against test data.

An immediate challenge is that there is currently no single agreed-upon definition of “fair.” There are different definitions, and which is best often depends on subjective and pragmatic considerations, including the available data, the goals of the program, and the applicable policy/regulatory framework(s) that shape the initiative. This should be addressed early in the Business Understanding phase.
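To make the disagreement concrete, the sketch below compares two definitions that appear in the machine learning fairness literature: equal rates of being flagged across groups (often called demographic parity) and equal false positive rates across groups. The counts are fabricated, and the two groups and the "high risk" label are hypothetical.

```python
def make_group(group, true_pos, false_neg, false_pos, true_neg):
    """Build toy records of the form (group, actual_outcome, predicted_high_risk)."""
    return (
        [(group, 1, 1)] * true_pos
        + [(group, 1, 0)] * false_neg
        + [(group, 0, 1)] * false_pos
        + [(group, 0, 0)] * true_neg
    )


# Fabricated counts for two hypothetical groups of ten people each.
records = make_group("A", 4, 1, 0, 5) + make_group("B", 2, 0, 2, 6)


def flagged_rate(rows, group):
    """Share of the group predicted high risk (compared across groups for demographic parity)."""
    members = [r for r in rows if r[0] == group]
    return sum(pred for _, _, pred in members) / len(members)


def false_positive_rate(rows, group):
    """Share of the group's actual negatives that were wrongly flagged as high risk."""
    negatives = [r for r in rows if r[0] == group and r[1] == 0]
    return sum(pred for _, _, pred in negatives) / len(negatives)


for g in ("A", "B"):
    print(g, flagged_rate(records, g), false_positive_rate(records, g))
```

Both groups are flagged at the same rate, so the model looks fair under the first definition. Under the second it does not: no one in group A is wrongly flagged, while a quarter of the people in group B who would not have had the outcome are flagged anyway.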

AI algorithms may also lack transparency, particularly black-box algorithms where it is unclear how data is being utilized. With the advent of complex, deep-learning, neural network algorithms, it can be very difficult for a human to fully understand how the resulting models work, and the field is divided on the need for transparency when a model is shown to be highly accurate. Yet transparency remains a fundamental ethical and privacy principle and is key for autonomy. It is also essential for the operations of the federal government.26

Even from an Evaluation perspective, ethical concerns can arise. Prior to using algorithms to attempt the normalization of diverse data or for conducting analytics, a target accuracy rate may be established by a combination of domain experts and the analytics team. If the model falls short by a few percentage points, will the program be abandoned? Or after the costs that are often involved in getting through the Modeling phase, will the accuracy rate be close enough? Who is responsible and held accountable for making this type of decision?

Deployment Phase

In the Deployment phase, the model is integrated into the organization. This includes technical considerations, business process and practice considerations, and possibly even policy considerations: for example, does the organization establish policies for when a human should become involved? Technical considerations include ongoing validation of the model to ensure it is still functioning within acceptable accuracy parameters. Without this potentially costly validation step, the model may generate false results without anyone knowing. Google Flu Trends is a well-documented example of how an initially accurate model can drift over time27 and become inaccurate.
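Ongoing validation of this kind can be sketched as a periodic comparison of predictions with observed outcomes. The example below is a minimal illustration: the reporting periods, the predictions, the outcomes, and the 75% accuracy threshold are all invented, and a real program would choose its own metric and threshold.

```python
def accuracy(predictions, actuals):
    """Share of predictions that matched what actually happened."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


def periods_needing_review(batches, threshold):
    """Return the labels of periods whose accuracy fell below the agreed threshold."""
    return [
        label
        for label, predictions, actuals in batches
        if accuracy(predictions, actuals) < threshold
    ]


# Fabricated monitoring data: (period, model predictions, observed outcomes).
batches = [
    ("2019-Q1", [1, 0, 1, 1, 0], [1, 0, 1, 1, 0]),  # 5 of 5 correct
    ("2019-Q2", [1, 0, 1, 0, 0], [1, 0, 1, 1, 0]),  # 4 of 5 correct
    ("2019-Q3", [1, 1, 0, 0, 0], [1, 0, 1, 1, 0]),  # 2 of 5 correct
]

flagged = periods_needing_review(batches, threshold=0.75)
print(flagged)
```

The check itself is simple; the cost lies in collecting reliable observed outcomes for each period and in deciding, as a matter of policy, who acts when a period is flagged.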

A societal concern that has been expressed with analytics is that a result that may be based on data of varying provenance, quality, and representation, and a model generated from inferred (and sometimes black box) relationships, rapidly transforms into fundamental knowledge. One report notes: “[T]he rise of data science, which emphasizes finding meaning in patterns, has begun to threaten elements of traditional scientific method. . . . Without these characteristics of science we may create vulnerabilities in our knowledge base.”28

As summed up in the “Humility” principle of the Commission on Evidence-Based Policymaking, “[c]are should be taken not to over-generalize from findings that may be specific to a particular study or context.”29 In the world of data science, this principle goes beyond humility to mitigating the risk that modeling results lead to harm through poor operational use, through reuse or repurposing in downstream analytics, or through sharing with other organizations. Ultimately, these risks may result in observable harm to individual citizens. The caveats associated with the data and the modeling, including the nuance of appropriate versus inappropriate use, are often neither captured nor communicated in an effective way to leadership (and other stakeholders).30 The designers of the Allegheny Family Screening Tool (AFST), which identifies children at high risk for severe maltreatment, were deliberate in explaining that “[t]he AFST is supposed to support, not supplant, human decision-making in the call center. And yet, in practice, the algorithm seems to be training the intake workers.”31

Another fundamental challenge with big data is using it as a tool to track individuals and draw conclusions at the individual level. Much of big data’s utility is based on probability: on the likelihood that “patterns” and other conclusions emerging among very large-scale sets, reflecting large populations, will support valid conclusions based on statistical analyses. But some models are applied to individuals with potentially dire consequences. For example, a Pearson correlation coefficient of .9 between two factors is generally considered a strong correlation in correlation-based analytics. However, a strong correlation across a population does not indicate whether a specific individual will follow the pattern. If, for illustration, a pattern holds for 90% of a group of 18 million people (about the number of U.S. veterans), then the remaining 10% represents 1.8 million people. In the words of Duncan Watts, a former Microsoft researcher, reflecting on the inability of analytics to predict individual outcomes based on years of data collected for the Fragile Families Study:32

We find exactly the same pattern everywhere we look, which is that there’s a lot of white space. There’s a lot that cannot be explained whether it’s a tweet or a person. When you’re talking about individual outcomes, there’s a lot of randomness.33
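The scale behind the veterans example that precedes the quotation can be checked with simple arithmetic. The sketch below uses the article's approximate figure of 18 million and, as an added assumption, also computes the share of variance that a correlation of .9 leaves unexplained in a simple linear relationship.

```python
def people_outside_pattern(population, percent_following_pattern):
    """Number of people for whom the pattern does not hold (integer percentages)."""
    return population * (100 - percent_following_pattern) // 100


def unexplained_variance(r):
    """Share of variance not accounted for by a linear relationship with correlation r."""
    return 1 - r ** 2


veterans = 18_000_000  # approximate figure used in the article
outside = people_outside_pattern(veterans, 90)
print(f"{outside:,} people fall outside the pattern")
print(f"{unexplained_variance(0.9):.0%} of the variance is unexplained at r = 0.9")
```

Even a small percentage becomes a very large number of individuals at population scale, which is why a model that performs well in aggregate can still be a poor basis for consequential decisions about one person.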

Such unfortunate realities of unschooled or inappropriate application of sophisticated data analytics give rise to an additional concern with the deployment of data science: the lack of individual recourse in dealing with the results of data science analyses. Often an individual may be unaware that big data analytics contributed to a negative decision, particularly in the case of omission, where the individual did not receive a benefit or offer that some of her peers did. Even if the individual is aware of data analytics as a contributing factor, why she falls inside or outside of a particular group may be difficult to understand or to explain with any degree of accuracy (particularly with black-box algorithms). An explanation that relies on an inference-based grouping without supporting rationale may prove unsatisfactory.34

Big data analytics may also impact an individual’s agency to meaningfully participate in accessing and correcting his or her information. The Privacy Act provides individuals the ability to access and correct data in a federal government System of Records. It is likely that no such guarantees exist for personal information in unstructured data, and the ability to access and correct data in the model is uncertain. It is possible to update a model based on changed data, but that comes with a cost of the time it takes to reprocess data and compare results to see if the underlying change had an impact. Ultimately, it can depend on who is the decision maker: the business, data scientist, or modeler, or some combination of roles.


Exploring the data analytics life cycle phase by phase in detail sheds light on the complexity of the ethical landscape, and how this unmanaged complexity can also impact the reliability of the results. Of additional concern is the interplay between and among the different phases.

It is unrealistic to expect that a team of well-intentioned public-sector data experts, scientists, and modelers will be able to holistically identify and manage these concerns. The ethical complexity makes it important to establish a multidisciplinary team for data science initiatives to identify and manage ethical issues. In addition to the scientific and technical staff, the team should have representatives from legal, compliance, the business operations, and leadership with clear roles, responsibilities, and communication paths for each.

An important member of the multidisciplinary team is a knowledgeable legal representative. Formal legal authority provides the foundational policy guidance on potential substantive issues involving privacy, civil rights, and civil liberties. Participation by a legal expert who is familiar with this substance, and who also possesses the ethics background needed to help the project team think through the ethical challenges, is essential to a successful implementation of such initiatives.

Importantly, however, the incorporation of legal counsel on such multidisciplinary teams carries its own risk: a potential conflict between counsel’s sworn obligations to the agency and its mission and his or her oath as an attorney. The ethical considerations associated with the collection or use of a data set contemplated by a big data initiative may set the attorney’s departmental or organizational roles and duties against the specific ethical considerations imposed by the Canons and their legacy, as expressed in the individual’s oath and obligations as an attorney.

While such a conflict is not inevitable—and much more prevalent in areas such as national security practice and intelligence community data collections—those developing big data–dependent analytical initiatives should encourage the observation of potential competing obligations of law and ethics, including the express identification of overt sources of potential “conflict” between explicit legal authority and guidance, and ethical considerations embedded in individuals’ experience in their decision making. These risks and conflicts of obligation for those providing legal advice on data use would appear to be most impactful when they arise in situations of data use and sharing with other noncollecting agencies and other third parties, where the dissemination and exchange of agency-held data could have unanticipated consequences for the data subject or other individuals, and where the originating data steward is in the best position, or perhaps the only party in a position, to exercise control and limitation on data dissemination carrying the potential of harm to an individual data subject.

Data-sharing agreements between a host big data set steward agency and those with which it shares data should provide some basis of recourse to individual subjects. Principles and criteria expressed in any data-sharing agreements could also be important foundational elements for the multidisciplinary team, supporting any initiative and providing criteria to evaluate the tools, algorithms, and other elements of the data system to identify and mitigate sources of bias and potential discrimination and other elements of the initiative that might conflict with the rights of individual subjects.

The enthusiasm of organizations to use big data should be matched by appropriate analysis of the potential impact on individuals, groups, and society. Without this analysis, the potential issues are numerous and can be substantively damaging to the organization, its mission, and its external stakeholders.


1. Hung LeHong & Jackie Fenn, Hype Cycle for Emerging Technologies, 2011, Gartner Research (July 28, 2011), https://www.gartner.com/en/documents/1754719/hype-cycle-for-emerging-technologies-2011.

2. Alex Woodie, Why Gartner Dropped Big Data off the Hype Curve, datanami (Aug. 26, 2015), https://www.datanami.com/2015/08/26/why-gartner-dropped-big-data-off-the-hype-curve.

3. Id.

4. Difference Between Model and Algorithm, Fin. Train (n.d.), https://financetrain.com/difference-between-model-and-algorithm.

5. P’ship for Pub. Serv. & IBM Ctr. for the Bus. of Gov’t, From Data to Decisions III: Lessons from Early Analytics Programs (Nov. 2013), https://ourpublicservice.org/wp-content/uploads/2013/11/c775c3a46c80b2ab7f20cae68ee3cfb5-1396889821.pdf.

6. Soc. Sec. Admin., No. 05-10004, Social Security Protects Your Investment 2019 (Sept. 2019), https://www.ssa.gov/pubs/EN-05-10004.pdf.

7. COMPAS was produced by Northpointe, which became Equivant (www.equivant.com).

8. Anthony W. Flores, Christopher T. Lowenkamp & Kristin Bechtel, False Positives, False Negatives, and False Analyses: A Rejoinder to “Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks.”, Cmty. Res. for Justice (2017), http://www.crj.org/assets/2017/07/9_Machine_bias_rejoinder.pdf.

9. Memorandum from Russell T. Vought, Acting Dir., Office of Mgmt. & Budget, for Heads of Exec. Dep’ts & Agencies, M-19-18 (June 4, 2019), https://www.whitehouse.gov/wp-content/uploads/2019/06/M-19-18.pdf.

10. John D. Kelleher & Brendan Tierney, Data Science (MIT Press Essential Knowledge Series 2018).

11. Sci. & Tech., Dep’t of Homeland Sec., The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research (Aug. 2012), https://www.dhs.gov/sites/default/files/publications/CSD-MenloPrinciplesCORE-20120803_1.pdf.

12. Dep’t of Health, Educ. & Welfare, The Belmont Report: Ethical Principles & Guidelines Involving Human Subjects (Apr. 1979), https://www.hhs.gov/ohrp/sites/default/files/the-belmont-report-508c_FINAL.pdf.

13. Robert Gellman, Fair Information Practices: A Basic History (Apr. 10, 2017), https://bobgellman.com/rg-docs/rg-FIPshistory.pdf.

14. Sahil Chinoy, Opinion, The Racist History Behind Facial Recognition, N.Y. Times (July 10, 2019), https://www.nytimes.com/2019/07/10/opinion/facial-recognition-race.html.

15. Christopher Teixeira & Matthew Boyas, MITRE Corp., Case No. 17-0679, Predictive Analytics in Child Welfare: An Assessment of Current Efforts, Challenges, and Opportunities (Oct. 2017), https://aspe.hhs.gov/system/files/pdf/257841/PACWAnAssessmentCurrentEffortsChallengesOpportunities.pdf.

16. Id. at 10–11.

17. Danah Boyd & Kate Crawford, Six Provocations for Big Data, A Decade in Internet Time: Symposium on the Dynamics of the Internet & Soc’y (Sept. 21, 2011), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431.

18. Su Golder et al., Attitudes Toward the Ethics of Research Using Social Media: A Systematic Review, 19 J. Med. Internet Res. no. 6, 2017, at e195, http://www.jmir.org/2017/6/e195.

19. Miranda Bogen, All the Ways Hiring Algorithms Can Introduce Bias, Harv. Bus. Rev. (May 6, 2019), https://hbr.org/2019/05/all-the-ways-hiring-algorithms-can-introduce-bias.

20. NIH Revitalization Act of 1993, Pub. L. No. 103-43, 107 Stat. 122, https://www.ncbi.nlm.nih.gov/books/NBK236531.

21. Kristen Bole, Diversity in Medical Research Is a Long Way Off, Study Shows, Univ. of Cal. San Francisco (Dec. 15, 2015), https://www.ucsf.edu/news/2015/12/401156/diversity-medical-research-long-way-study-shows.

22. Cathy O’Neil, Weapons of Math Destruction (Broadway Books 2017).

23. David Lyon, Surveillance as Social Sorting, Computer Codes and Mobile Bodies, in Surveillance as Social Sorting: Privacy, Risk and Digital Discrimination 13, 13–30 (David Lyon ed., Routledge 2003).

24. Boyd & Crawford, supra note 17, at 2.

25. Jasmine Liu, Big Data and the Creation of a Self-Fulfilling Prophecy, Stan. Daily (Apr. 5, 2017), https://www.stanforddaily.com/2017/04/05/big-data-and-the-creation-of-a-self-fulfilling-prophecy.

26. The goal of the Defense Advanced Research Projects Agency (DARPA) Explainable AI (XAI) program is to produce more explainable results that are accessible to humans.

27. David & Ryan, What We Can Learn from the Epic Failure of Google Flu Trends, Wired (Oct. 1, 2015), https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends.

28. Kate Crawford et al., Big Data, Communities and Ethical Resilience: A Framework for Action, at 7 (Bellagio Ctr. Report, Oct. 24, 2013), https://assets.rockefellerfoundation.org/app/uploads/20131024184336/71b4c457-cdb7-47ec-81a9-a617c956e6af.pdf.

29. Comm’n on Evidence-Based Policymaking, Report: The Promise of Evidence-Based Policymaking 17 (Sept. 2017), https://cep.gov/report/cep-final-report.pdf.

30. Ida Asadi Someh et al., Ethical Implications of Big Data Analytics (Research-in-Progress Papers 24, 2016), https://aisel.aisnet.org/ecis2016_rip/24.

31. Virginia Eubanks, Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor 142 (St. Martin’s Press 2017).

32. Fragile Families & Child Wellbeing Study, https://fragilefamilies.princeton.edu.

33. Invisibilia: The Pattern Problem (NPR radio broadcast Mar. 30, 2018), https://www.npr.org/templates/transcript/transcript.php?storyId=597779735.

34. The GDPR includes a provision for “human intervention” when a substantive decision about a subject is made based on “automated” processing. The U.S. has a narrower right to explanation, specifically in the context of credit scores.


Cathy Petrozzino is a Principal Cybersecurity Engineer within the National Security Engineering Center, the federally funded research and development center (FFRDC) operated by MITRE for the U.S. Department of Defense. Approved for public release. Distribution unlimited. Case number: 19-3743. ©2020 The MITRE Corporation. All rights reserved.