Big data raises a wide range of issues, from cutting-edge legal questions to ethical, social, and moral challenges. Whether society truly gains anything from the immense collection of data about consumers and businesses alike depends on its ability to address these issues thoughtfully while preserving the flexibility necessary for innovation. The ABA Antitrust Section’s report Artificial Intelligence & Machine Learning: Emerging Legal and Self-Regulatory Considerations endeavors to address both the proliferation of mass data collection and the legal and social implications of its use. Focusing on existing regulatory regimes regarding privacy and consumer harm, and on the potential for self-regulation, the report highlights the need to move quickly as artificial intelligence and machine learning become more ubiquitous throughout society. The following excerpts from the Section’s report touch on the major issues it addresses.
Data Collection and Infrastructure
First-, Second-, and Third-Party Data
A multitude of different types of information fuel the modern data economy. Electronic devices of all kinds; individuals from every walk of life and every corner of the earth; entities of every type, whether the largest corporations or the smallest nonprofits; every government in the world—all are continually generating data. Virtually every organization is now generating first-party data, whether intentionally or not. First-party data include information that a company collects directly from its customers and users, its processes and finances, as well as any information digitized in the normal course of business. First-party data may be sold to or shared with known partners (meaning the data become what are sometimes referred to as second-party data), or collected or purchased by third parties, including high-volume data aggregators (third-party data). Data aggregators and data brokers may acquire data either through an arrangement with the company that collected the data or by entering into an arrangement to allow the third party to collect the data directly from the company’s customers. Alternatively, data aggregators may gather data from public-facing pages using a variety of means, such as bots, scrapers, or other technological measures.1
Along with the proximity of the data to their origin, another key element of data collection is the type of infrastructure used to derive value from the data. This includes high-capacity hardware specifically designed to store data in a secure and accessible format, the analytical software to extract meaningful insights from the data, and separate, specialized analytical hardware that can efficiently retrieve data from the storage hardware and execute the mathematical operations for the analytical software.
Types of Information and Data Sources
In addition to the first-, second-, and third-party modes of data collection and disclosure, and relevant types of infrastructure necessary to analyze collected data, two additional taxonomies may apply when characterizing data and considering potential regulatory issues: (1) personal versus nonpersonal information and (2) government versus private data sources. The various legal regimes regulating data often focus on personal information, or various subcategories of personal information. The precise definition of personal information can vary significantly based on the relevant specific legal scheme(s), but, in general, these definitions include a wide range of information that may be linkable to an individual on its own or in combination with other information. These data may contain sensitive personal information (e.g., credit card numbers, details of medical treatment) or information from which it is possible to extrapolate sensitive insights (e.g., rideshare information concerning a user’s schedule).
Similar to the private sector, government entities also may engage in the collection and generation of large amounts of data. However, the legal regimes governing government data can differ dramatically from those governing the private sector. Constitutional provisions that have significant implications concerning the collection and regulation of information—chiefly, the First and Fourth Amendments—generally apply only to the government and not the private sector. Because government-created information is, in some circumstances, a public good to which the government must provide free and open access, governmental data are often treated differently than private sector data, which are typically protected as proprietary or trade secret information.
What Makes Big Data “Big”?
Big data is “big” in two ways: long and wide. First, it is “long” because it entails many different observations (e.g., millions of observations of an individual’s choices regarding whether or not to buy a particular product after exposure to an ad). Second, it is “wide” because it contains many different variables for each observation (as one example, for a given individual, one may have the characteristics of different ads the individual has seen, the individual’s past purchases, past searching behavior, demographic data, political data, location information, social media posts, etc.). Modern data analytical techniques are well suited for dealing with “wide” data because they can determine whether and how to use each new variable to best predict the applicable outcome. Where traditional analytical techniques require that researchers select the relevant variables and model specifications, modern techniques can automatically select and synthesize important variables from very wide datasets to maximize accuracy.2
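To make the “long” and “wide” distinction concrete, the following minimal sketch (not drawn from the report; the data and variable indices are hypothetical) builds a wide synthetic dataset in which only two of fifty variables actually drive the outcome, then screens every variable by its correlation with that outcome. Correlation screening is a deliberately crude stand-in for the more sophisticated automatic variable selection (e.g., regularized regression) that modern techniques employ.

```python
import random

random.seed(0)

# A synthetic dataset that is "long" (200 observations) and "wide"
# (50 variables per observation). Only variables 3 and 17 actually
# influence the outcome; the other 48 are pure noise.
n_obs, n_vars = 200, 50
X = [[random.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
y = [row[3] * 2.0 - row[17] * 1.5 + random.gauss(0, 0.1) for row in X]

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

# Score every variable by the strength of its association with the
# outcome, mimicking (crudely) how automated methods screen wide data
# without a researcher pre-selecting the relevant variables.
scores = [(abs(correlation([row[j] for row in X], y)), j)
          for j in range(n_vars)]
top_two = sorted(j for _, j in sorted(scores, reverse=True)[:2])
print(top_two)  # the two truly informative variables
```

The screening step recovers the two informative variables from among fifty candidates; a traditional workflow would instead require the analyst to specify those variables in advance.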
Modern data analysis methods span many approaches, including “artificial intelligence,” “machine learning,” and “neural networks,” each of which can have distinct analytical applications and limitations. Various data analytical techniques can provide potential insights for a variety of purposes, such as pattern recognition, predictive analysis, and causal inference. Artificial intelligence (AI), speaking broadly, is the overarching term for algorithm-powered computer processes that learn to perform actions that correspond to and even surpass human abilities.3 An AI process can include predictive models or decision-making algorithms that ensure that the entity “acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome.”4
Instead of relying solely on human instruction, some current AI programs incorporate machine learning to develop their algorithms. Machine learning occurs when a program can adapt in response to new observations. AI based on machine learning, once trained, can make determinations or decisions through algorithms that are driven by what has been learned from the data, rather than being dependent on programmed or preset inputs.
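A minimal sketch can illustrate what “adapting in response to new observations” means in practice. The example below (hypothetical, not from the report) trains a simple perceptron: the program is never told the rule it is meant to apply; instead, it adjusts its internal weights each time it sees a labeled example, so its eventual behavior is driven by the data rather than by preset instructions.

```python
# A toy "learning from data" example: a perceptron updates its weights
# in response to each new labeled observation it encounters.

def train_perceptron(samples, epochs=20, lr=0.1):
    """Learn weights and a bias term from (inputs, label) pairs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), label in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = label - pred
            w[0] += lr * err * x1  # adapt in response to the observation
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# The program is never given the rule "output 1 only when both inputs
# are 1" (logical AND); it infers that rule from the examples alone.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
         for (x1, x2), _ in data]
print(preds)  # → [0, 0, 0, 1], matching the training labels
```

After training, the model reproduces the labels it was shown; its decisions flow from the learned weights, not from any programmed rule, which is the essential distinction the paragraph above draws.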
Benefits and Drawbacks of Data Collection and Use
The collection of wide swaths of data, when combined with the capabilities of recent advances in data analytics, especially machine learning, promises substantial and wide-ranging benefits for consumers, businesses, other institutions, and society at large. At the same time, these technological advances raise legitimate concerns, including in relation to consumer privacy; the maintenance of competitive markets; equity and fairness; economic opportunity; the potential uses of such technologies for state surveillance; and the need to maintain human control over autonomous systems, among others. Regulation of big data, whether self-imposed or external, is therefore important to limit the harms associated with its use.
Data Use and Regulation
The Federal Trade Commission and Other Law
As the nation’s primary regulator of consumer privacy and data security, the Federal Trade Commission (FTC) is likely to play a significant role in shaping AI-related data practices. The FTC has held workshops and hearings on the subject and has issued a report on discriminatory practices resulting from the use of “big data.” The FTC’s authority under Section 5 of the FTC Act gives the agency the ability to ensure transparency about big data practices and to police consumer harm, such as discriminatory outcomes that may violate the Act. The FTC also expects companies to enact reasonable security measures for data and appropriate vendor management to prevent unauthorized access and use of big data.
As companies continue to use data to automate eligibility decisions, the improper use of such data may also implicate consumer protection and civil rights laws, such as the Fair Credit Reporting Act, the Equal Credit Opportunity Act, and the Fair Housing Act. Like the FTC Act, these laws impose requirements of accuracy, transparency, and nondiscrimination. Common law regimes, such as tort law, may also provide an avenue for addressing consumer harm, although how they apply will depend on how courts extend current precedent to new data uses and harms.
The collection, storage, and use of big data also implicate various privacy laws, including the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Of particular importance under the GDPR is its regulation of “automated decision-making” about a consumer, which only allows such processing if (1) it is necessary for the completion of a contract, (2) it is authorized by law, or (3) an individual gives explicit consent to the processing.5 The intent of the provisions regarding automated processing is to promote individual autonomy and empowerment6 and the creation of opportunities for public engagement.7
In the United States, although there is no national privacy law, the CCPA will likely impact the use of big data, especially because the law regulates the sale of broadly defined “personal information.” The CCPA also prohibits businesses from discriminating against consumers who exercise their rights by engaging in actions such as the denial of goods or services or charging different rates, subject to exceptions.8 These privacy considerations will continue to shape the use of big data both in the United States and internationally. Separately, U.S. companies will also have to deal with the evolution of civil liability associated with the use of big data in new ways.
One way to limit the need for legal regulation is to establish a strong self-regulatory body to oversee big data creation and use. Self-regulation may potentially serve as an alternative framework that provides companies with certain parameters that can be more quickly implemented and that do not unduly inhibit innovation. The complex technology and speed of development of AI and big data may make the flexible and adaptable nature of industry self-regulation a particularly appropriate approach for regulating certain risks created by the technology.
Individual companies, industry organizations, consumer advocates, civil rights groups, academics, and governmental organizations have put forth several sets of principles or ethical frameworks specific to AI. Although self-regulation in the big data and AI space could take many forms and may vary across industry sectors or within specific applications, the end products likely will reflect a common set of industry-agreed-upon principles. And while the key concepts that make up the core principles of current technology-focused self-regulatory frameworks—transparency, choice, accountability, accuracy, and security—address many of the primary areas of concern with AI and big data as well, AI self-regulatory efforts have some distinct features that warrant particular consideration.
Companies that commit to a set of principles or a self-regulatory regime, or that more broadly seek to comply with applicable laws and regulations, will need to create internal policies and procedures, along with other measures as part of a formal compliance program. Still, self-regulation may be an effective tool in determining the metes and bounds of AI as machine-learning uses expand.
1. Some of these techniques have been and are the subject of litigation. See, e.g., Bradley Saacks, Hedge Funds Are Watching a Key Lawsuit Involving LinkedIn to See If They Can Spend Billions on Web-Scraped Data, Bus. Insider (Mar. 14, 2019, 9:48 AM), https://www.businessinsider.com/hedge-funds-watching-linkedin-lawsuit-on-web-scraped-data-2019-3.
2. For example, researchers used a dataset of 7.4 million Medicare beneficiaries, and more than 3,000 variables within Medicare claims, to predict mortality rates in the first 12 months after hip or knee replacement. Such a predictive model allows doctors to allocate scarce joint replacements to patients who are predicted to benefit most from the procedure. See Jon Kleinberg et al., Prediction Policy Problems, 105 Am. Econ. Rev.: Papers & Procs. 491 (2015). In a more extreme case, image classification problems commonly involve datasets with hundreds of thousands of pixel values for each individual observation, yet successful analytical frameworks are able to highlight specific pixel groups that inform accurate predictions. See Alex Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, in 25 Advances in Neural Information Processing Systems (Fernando Pereira et al. eds., 2012). But see, e.g., Susan Athey on Machine Learning, Big Data, and Causation, Libr. Econ. & Liberty (Sept. 12, 2016), www.econtalk.org/susan-athey-on-machine-learning-big-data-and-causation (“The problem with ‘data mining’ [big data], or looking at lots of different models is that you might find, if you look hard enough you’ll find the result you are looking for.”).
3. There is no settled definition of artificial intelligence. In Artificial Intelligence: A Modern Approach (3d ed. 2010), Stuart J. Russell and Peter Norvig provide four approaches to the definition of artificial intelligence and discuss the limitations of each approach to fully define AI.
4. Id. at 4. This definition uses the common “rational agent” understanding of AI. Multiple understandings/definitions of the field exist. Id. at 1–2.
5. General Data Protection Regulation (EU) 2016/679 (GDPR), art. 22.
6. See High-Level Expert Group on Artificial Intelligence, Ethics Guidelines for Trustworthy AI 16 (2019).
7. Int’l Conference of Data Prot. & Privacy Comm’rs, Declaration on Ethics and Data Protection in Artificial Intelligence (2018).
8. See Cal. Civ. Code § 1798.125.