©2020. Published in Landslide, Vol. 13, No. 1, September/October 2020, by the American Bar Association. Reproduced with permission. All rights reserved. This information or any portion thereof may not be copied or disseminated in any form or by any means or stored in an electronic database or retrieval system without the express written consent of the American Bar Association or the copyright holder.
Machine learning (ML) systems have a fundamental vulnerability from a legal perspective: there is no rewind button. That is, once information is incorporated into an ML model, there is virtually no way to remove it. This vulnerability is inherent to the technological framework of ML systems and could be fatal from a legal perspective. If ingested data is tainted (e.g., via trade secret misappropriation or violation of license restrictions), there may be no way to comply with a court order to remove the offending data. This article reviews some of the technology and legal dynamics underlying this conundrum and offers some preventative guidelines to avoid this trap.
No Rewind Button
Products using ML are all around us, and as ML frameworks mature and become better understood, more and more products will incorporate them. Everyday examples of ML in products include image recognition, facial recognition, speech recognition (as in Siri and Alexa), and recommendation engines.
ML is conceptually simple. In the training phase, engineers collect a data set (training set) to be used to train the ML algorithms: for example, a set of images that contain a dog and a set of images that do not contain a dog. The ML framework then digitizes or otherwise prepares the training set so that an iterative mathematical process can find the distinguishing common characteristics (“signals”). The output of this process is called a model, which is typically embedded in the product along with a runtime. A runtime is simply code that knows how to use the model.
It is important to understand that after training, an ML system does not retain the images used to train it. In the dog example, there is no database of dog images in the product against which the system would match a new instance. That database approach would be hopelessly slow and memory-intensive. Rather, ML systems, through the use of the model obtained through training, “know” if an image is a dog, much like humans “know” if an image is a dog. This “knowledge” is the result of training the system over thousands or millions of examples, each labeled true or false according to whether it shows a dog. The collective impact of all these true/false training examples is a generated signal that corresponds to the essential characteristics of a dog.
In database systems, it is easy enough to remove ill-gotten data from a system. Not so with ML systems. Rather, the model ingests each instance of data, and “learns” from each successive true/false training session. Generally speaking, it is impossible to go back and “remove” the imprint of the training from a particular set of data that was once ingested. Thus, there is no “rewind button” for backing up and deleting traces of misappropriated data. In addition, in many systems, user feedback is used to further improve the model, making it even harder to disambiguate where any data came from and remove that data.
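The point can be illustrated with a minimal sketch in Python. It assumes a toy two-feature classifier rather than any real product, and the data values and the "tainted" example are hypothetical. The trained model is nothing more than a short list of numbers, with no copy of the training examples inside it, so the only reliable way to exclude an example is to retrain without it.

```python
import math

def train(examples, epochs=200, lr=0.1):
    """Fit a tiny logistic-regression model with plain gradient descent.

    `examples` is a list of (features, label) pairs. The return value is the
    "model": a list of weights followed by a bias, and nothing else.
    """
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - label
            # Each example nudges every weight; the nudges are not stored.
            weights = [w - lr * err * x for w, x in zip(weights, features)]
            bias -= lr * err
    return weights + [bias]

def predict(model, features):
    """The "runtime": code that knows how to apply the model to new input."""
    *weights, bias = model
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5

# Hypothetical training data: label 1 means "dog", label 0 means "not a dog".
clean = [([1.0, 0.9], 1), ([0.9, 1.0], 1), ([0.1, 0.2], 0), ([0.2, 0.1], 0)]
tainted = [([0.8, 0.3], 1)]  # hypothetical misappropriated example

model_with = train(clean + tainted)
# There is no "subtract this example" operation on the weights.
# Excluding the tainted example means retraining from the clean data.
model_without = train(clean)
```

The two models differ in every weight, yet neither contains anything that identifies which examples produced it. That is why the retraining route depends on still having the clean training data on hand.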
ML systems require ingesting massive amounts of data. Obtaining these troves of data can be a challenge. Often, there are legal restrictions tied to the ingested data. Data could be misappropriated from a competitor through trade secret misappropriation (for example, through employees leaving a prior employer). Data might be harvested from “public” sources like the web, Facebook, and Google and used in a manner outside the terms and conditions governing those sources. Even data inside a company might be used in improper ways, such as if the extraction of data from a proprietary enterprise resource planning (ERP) system violates the terms of service of that ERP provider. Privacy concerns, such as those arising under Europe’s General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA), also pose risks. In any of these situations, and countless permutations, legally encumbered data may become ingested into your company’s ML models, with difficult consequences if detected. Following is more elaboration around each of these potential pitfalls.
Trade Secret Misappropriation via Departing Employees
Employees leaving one employer and bringing data to their next job provide endless fodder for lawsuits. These lawsuits may result from malicious misappropriation or may result more innocently when a departing employee fails to purge cloud-based repositories, such as Dropbox, Google Drive, or email accounts, that contain the prior employer’s data. In either situation, the prior employer’s confidential information can be traced to the next employer. Depending on the facts, there may be a credible allegation that the offending data was ingested into the new employer’s ML model.
One recent example is the case of WeRide Corp. v. Huang, in which an executive and a technical employee of one self-driving car company joined a rival.1 It was undisputed that the technical employee downloaded large amounts of data before hopping companies. As alleged, this data was brought to the new employer, whose capabilities for detecting pedestrians suddenly improved. On a motion for a preliminary injunction, the Northern District of California issued an order enjoining the defendants from using the former employer’s confidential information. That is standard relief in a trade secret case. The wrinkle in the ML context, however, is how to comply with that order. If indeed the misappropriated data was ingested into the new employer’s model, it is unclear how to stop using that data. Answering that question will have to wait for another case, as the WeRide litigation was resolved through terminating sanctions.
Violations of Copyright and Terms of Service from Public-Facing Sources
An easy-to-access source of vast repositories of data is, of course, the internet. Some data may be truly free for the taking. Other sources are tightly controlled. In between these poles is a broad continuum of legal restrictions. This public-facing data might be ingested through any number of techniques, including through scraping and application programming interface (API) access. “Scraping” typically refers to the use of automated processes (e.g., a “bot” or “web crawler”) to extract data from web pages for later retrieval or analysis. Scraping is not per se illegal, and in one high-profile case, the Ninth Circuit Court of Appeals found that hiQ had not violated the Computer Fraud and Abuse Act when it scraped professional biographies from LinkedIn’s site.2 Nonetheless, the Ninth Circuit reserved ruling on whether other laws may have been broken.
Copyright remains untested in the ML context, including with respect to scraping. Depending on the approach taken, it is likely that a “copy” of data is generated and reproduced through the scraping process and by ingesting the data into an ML model. Fair use may or may not provide a defense against copyright liability in this context. To date, no reported cases evaluate whether generating and using an ML engine is fair use of copyrighted works and whether the outputs of ML engines are derivative works of the original copyrighted matter. Thus, companies run considerable risk in trying to guess where the copyright fair use boundaries may lie when ingesting scraped data from the internet.
Contractual restrictions arise from the terms and conditions of website user policies, which commonly govern the collection and use of the data. These policies are generally enforceable when a visitor to the site must affirmatively assent to those policies, either by signing up as a user or otherwise clicking a box to manifest assent to the terms and conditions.3 If enforceable, those terms and conditions will usually prohibit users from scraping data from a website and making unauthorized use of the data.4 Additional contractual restrictions may arise when companies allow the limited collection and use of data through APIs. For example, Twitter allows the collection of data through an API to facilitate that data transfer. Of course, use of that API entails its own set of contractual restrictions, the violation of which may trigger loss of any licensed use of the content.
Potential Misuse of Internal Corporate Data
Companies’ own data may be encumbered with surprising entanglements. If data is stored in proprietary ERP systems, for example, the provider of those systems may exercise control over how the stored data can be used. Because those ERP providers make money by providing data analytics studies for the benefit of the enterprises they serve, those ERP providers may be resistant to allowing the enterprises to perform their own data mining. If the enterprises attempt to extract the underlying data from the ERP systems for running ML studies, those extractions may run afoul of API agreements or may be found to violate the copyrights of the ERP providers in the metadata that was used to structure the data. For example, in Madison River Management Co. v. Business Management Software Corp., the court found that when a company copied its data to extract it from a proprietary database structure, there may have been a violation of the database provider’s copyrighted metadata.5
Privacy Risks: CCPA
The CCPA allows for private rights of action, including for injunctive relief, in the event of unauthorized access of personal data.6 Furthermore, the California attorney general may bring a civil action with civil penalties up to $7,500 “for each violation” of any provision in the CCPA, which would include the unauthorized sale of consumer data.7 These enforcement provisions are subject to a 30-day cure period, but because ML systems cannot be “rewound,” a company may be unable to cure a wrongful ingestion of consumer data into its ML models within that window, exposing it to novel and potentially astronomical penalties. The injunctive relief provisions in civil actions could provide plaintiffs with powerful leverage if companies are unable to dissociate the wrongfully ingested data from their systems.
Privacy Risks: GDPR
U.S. companies may be subject to Europe’s GDPR restrictions, either as a controller of European data or as a processor. To date, while it does not appear that the European authorities have cracked down on wrongful use of personal data in ML systems, there are specific provisions in the GDPR restricting the use of such data in automated decision-making systems.8 Furthermore, the transparency requirements under the GDPR are at odds with the “black box” architectures of ML systems. Therefore, the potential remains that stiff penalties available under the GDPR could be brought to bear against unauthorized ML systems.
The examples listed above are only a handful of the scenarios in which companies might ingest tainted data into their ML systems. If tainted data is detected, complying with a court order to cease the use of that data raises novel and thorny questions. Because there is typically no rewind button to remove traces of the misappropriated data, courts could potentially order a complete stop to the use of that ML model. Courts may be unaware of the new challenges posed by ML systems; therefore, litigants may need to take special care to alert judges that customary language for rulings may be impracticable in the ML context.
Companies should plan ahead to avoid getting boxed in by court orders that may be impossible to obey. Recognizing that it may be infeasible to retract ill-gotten data from ML systems, the following guidelines could prevent the problem or at least limit the fallout from having been caught ingesting tainted data into an ML model.
License Your Data Sets
Any data sets that are used to train your model should be licensed correctly, i.e., you should have the right to use them. There are a number of publicly available data sets. The terms under which these data sets are made available vary wildly. Carefully document your licensing scheme, and enforce it through the employee ranks, to ensure that the entire team is complying with the plan.
Preserve Data Lineage
Companies developing ML models should preserve thorough and accurate records memorializing the sources of input data. For a company later accused of having misappropriated data, there is nothing as effective as being able to point to records identifying the source of the data that was actually ingested. Maintaining those records could be useful in demonstrating that the model was generated free of any tainted sources.
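One way such records might look is sketched below in Python. It is an illustration under stated assumptions, not a prescribed format: the `LineageRecord` fields, the `record_ingestion` helper, and the source and license labels are all hypothetical. The idea is to store, for every file ingested, where it came from, the terms governing it, and a cryptographic fingerprint of the exact bytes.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LineageRecord:
    """One entry in a hypothetical data-lineage log."""
    source: str          # where the data came from
    license_terms: str   # terms under which it may be used
    acquired_on: str     # ISO date the data was obtained
    sha256: str          # fingerprint of the exact bytes ingested

def record_ingestion(payload: bytes, source: str, license_terms: str,
                     acquired_on: str) -> LineageRecord:
    """Fingerprint the payload and bind it to its source and license."""
    digest = hashlib.sha256(payload).hexdigest()
    return LineageRecord(source, license_terms, acquired_on, digest)

def to_log_line(record: LineageRecord) -> str:
    """Serialize a record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)

# Hypothetical usage: log a file before it enters the training pipeline.
sample = b"image-bytes-go-here"
entry = record_ingestion(sample, "vendor-data-set-A", "commercial-license",
                         "2020-01-15")
log_line = to_log_line(entry)
```

Because the fingerprint is computed from the data itself, a record of this kind can later be matched against a disputed file to show whether that file was, or was not, among the inputs.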
Keep Versions of Your Training Data Sets
Training data sets are not static. Over time, additional data may be added to these data sets to improve the outcomes (for example, by reducing the error rate), expand the scope, and cover edge cases. In software engineering, it is common to use version control to manage the evolution of software source code. Any change to the software is committed to a repository, which contains every point-in-time version of the software. Companies should consider doing the same for data sets. Having access to various versions of the data used to train the model, as well as the model itself, would enable the company to revert to an earlier state in the event of an adverse court order.
For example, a company may have just acquired a startup, without fully knowing the source of that startup’s data, and may want to use that data to improve some of the models. Or a company might hire an employee from a rival, where there is a risk of a trade secret lawsuit arising from luring away the new hire. In these circumstances, a good preventative measure would be to preserve a version of the ML model as it existed before onboarding the new data.
Having complete versions of the training data is useful in an M&A context too: if complete, the model could, in theory, be regenerated from the versioned data, proving that no misappropriation occurred.
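The version-control idea can be sketched in a few lines of Python. This is a simplified illustration, assuming small data sets that fit in memory. The `DatasetHistory` class and the source labels are hypothetical, and a real system would more likely rely on a dedicated data-versioning tool. Each commit stores a complete snapshot under an identifier derived from its contents, so an earlier state can be retrieved exactly.

```python
import hashlib
import json

class DatasetHistory:
    """Keeps every committed version of a training set (hypothetical helper)."""

    def __init__(self):
        self._blobs = {}   # version id -> serialized snapshot
        self._order = []   # version ids, oldest first

    def commit(self, records):
        """Store an immutable snapshot and return its content-derived id."""
        blob = json.dumps(records, sort_keys=True)
        version = hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]
        if version not in self._blobs:
            self._blobs[version] = blob
            self._order.append(version)
        return version

    def checkout(self, version):
        """Return a fresh copy of the data set exactly as it was committed."""
        return json.loads(self._blobs[version])

    def versions(self):
        """List version ids in the order they were first committed."""
        return list(self._order)

history = DatasetHistory()
baseline = [{"id": 1, "source": "licensed-set-A"}]
v1 = history.commit(baseline)
# Hypothetical: data of uncertain origin arrives with an acquisition.
v2 = history.commit(baseline + [{"id": 2, "source": "acquired-startup"}])
# After an adverse order, retrain from the pre-acquisition snapshot.
reverted = history.checkout(v1)
```

Reverting the data is only half of the remedy: the model must then be retrained from the reverted snapshot, or an archived copy of the model trained on that snapshot must be restored.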
Track Ephemeral Data
Commonly, data that powers ML models cannot be archived. For example, self-driving cars stream a constant torrent of data to central servers. This data may be simply ephemeral, and once run through the ML systems, the data is lost. Thus, it may be impossible to save that data to prove how a particular facet of the ML models was generated. In that situation, maintaining robust records of the architecture of the data intake protocol, along with the licensing scheme for data to ensure proper use, may be the best approach.
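As one possible illustration of such record keeping, the Python sketch below logs facts about a stream without retaining the stream itself. The `IntakeLedger` class and its field names are assumptions introduced here, and the running digest is a technique chosen for this example rather than anything the scenario above requires. For each batch, the ledger updates a count, a byte total, and a chained fingerprint, and the batch can then be discarded.

```python
import hashlib

class IntakeLedger:
    """Summarizes a data stream without storing it (hypothetical helper)."""

    def __init__(self, source, license_terms):
        self.source = source
        self.license_terms = license_terms
        self.batches = 0
        self.bytes_seen = 0
        self._chain = hashlib.sha256()

    def observe(self, batch: bytes):
        """Fold one batch into the running totals; the batch is not kept."""
        self._chain.update(hashlib.sha256(batch).digest())
        self.batches += 1
        self.bytes_seen += len(batch)

    def summary(self):
        """Return the record that would be retained for later review."""
        return {
            "source": self.source,
            "license_terms": self.license_terms,
            "batches": self.batches,
            "bytes_seen": self.bytes_seen,
            "chain_digest": self._chain.hexdigest(),
        }

# Hypothetical usage: two sensor batches pass through and are discarded.
ledger = IntakeLedger("fleet-sensor-feed", "internal-use-only")
for batch in (b"frame-0001", b"frame-0002"):
    ledger.observe(batch)
record = ledger.summary()
```

A summary like this cannot reproduce the lost data, but it documents which feed was ingested, under what terms, and in what volume.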
Due Diligence Demands
Conversely, companies acquiring targets should demand records of what data was ingested to develop the targets’ ML models. As part of M&A due diligence, acquirers should take pains to ascertain that the ingested data was free and clear of legal restrictions.
Companies should be prepared to prove that their ML models are generated from proper sources and to structure an “out” if their data may be tainted. While a rewind button in ML models might not make sense technologically, it may be a legal imperative to structure ML systems to be able to revert to an earlier time period in the event of a finding of misappropriation. Proactively, through documentation and scrutiny in due diligence, companies can minimize their exposure to a court order that may be otherwise impossible to obey. Litigants should be prepared to educate judges about the unique characteristics of ML systems that may render impracticable some of the standard language that courts are accustomed to using in their orders.
1. 379 F. Supp. 3d 834 (N.D. Cal. 2019).
2. hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019).
3. Mere “browsewrap” policies, when a viewer of a website is not required to affirmatively agree to the terms, are commonly found not enforceable.
4. See, e.g., Sw. Airlines Co. v. Roundpipe, LLC, 375 F. Supp. 3d 687 (N.D. Tex. 2019) (enforcing terms of service against a scraper that had signed up as a user of Southwest’s website).
5. 387 F. Supp. 2d 521 (M.D.N.C. 2005).
6. Cal. Civ. Code § 1798.150.
7. Id. § 1798.155.
8. See Council Regulation 2016/679, art. 22, 2016 O.J. (L 119) 46 (EU).