©2015. Published in Landslide, Vol. 8, No. 2, November/December 2015, by the American Bar Association. Reproduced with permission. All rights reserved. This information or any portion thereof may not be copied or disseminated in any form or by any means or stored in an electronic database or retrieval system without the express written consent of the American Bar Association or the copyright holder.
Imagine you own a lemonade stand, and you want to spark business by offering free samples. Most people take one or two. Now, imagine that someone takes all the free samples from the counter, and leaves without payment, recognition—or even a thank you. Few customers would do so, and merchants certainly would not let such a customer walk out the door without paying. However, this is the Internet age; it is much easier to load up on free samples.
Website data are the free samples, and “web scraping” or “screen scraping” is the method of taking the free samples. It is the practice of automatically extracting large amounts of data from publicly available websites using bots. Bots are automated software programs that scour websites, such as those of airlines, ticket brokers, news organizations, and online merchants, just to name a few, in order to “load up on free samples” of information.1 These are sophisticated software programs that simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding a full-fledged web browser, such as Safari, Google Chrome, or Mozilla Firefox.
Web scraping is a persistent issue—rising exponentially.2 In fact, in 2013, web scraping accounted for 18 percent of site visitors and 23 percent of Internet traffic. In 2014, 22 percent of visitors to websites were identified as scrapers, with a 17 percent increase in scraping attacks across all industries and accounting for 27 percent of all Internet traffic.3
Because there are growing numbers of people who depend on scraping for profit, content theft is becoming a major concern,4 affecting the competitiveness of businesses. Scraping tools extract valuable information that yields competitive business intelligence, which can be used to undercut a competitor’s prices. Even though companies, such as airlines, still receive the sale from customer bookings through a third-party website, the company may miss out on customer data or website traffic.5 Because repeated scraping can use up bandwidth and lead to network crashes, scraping can also have the unintended, or intentional, consequence of slowing access to the scraped site and disrupting service to consumers.
Whether or not scraping is bad depends on the situation. On one hand, website owners do not want their information to be used by competitors to their detriment.6 Additionally, whether owners view web scraping unfavorably depends on the nature of the information, such as whether the information is general corporate information or time-sensitive and competitively sensitive information. Website operators may incur damages resulting from, among other things, increased bandwidth usage, network crashes, the need to employ antispam and filtering technology, user complaints, reputational damage, and costs of mitigation.
Some scraping is tolerated—even welcomed. In one aspect, the website owner is offering its website to a target audience and wants to ensure that it reaches that target audience. Owners want to be scraped by search engines, such as Google or Yahoo, because they want people to find them. Even if owners do not want Google to scrape their information, Google provides guidance on how to block search indexing with meta tags.7 Also, companies with limited resources may use scraping to access large amounts of data, spurring innovation and allowing such companies to identify and fill areas of consumer demand. For example, Mint.com uses screen scraping to aggregate information from bank websites, which allows users to track spending and finances. In essence, some website owners want to be scraped for their benefit.8
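For illustration, the meta tag mechanism can be sketched in a few lines of Python. The sketch assumes the page opts out of indexing through a robots meta tag whose content includes the "noindex" (or "none") directive; the class and function names are hypothetical and come from neither Google nor any scraping tool. It shows only how a well-behaved bot might check a page for that signal.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def allows_indexing(html_text):
    """Return False if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    # "none" is treated here as shorthand that includes "noindex".
    return not ({"noindex", "none"} & parser.directives)


blocked_page = '<html><head><meta name="robots" content="noindex"></head></html>'
open_page = "<html><head><title>Lemonade</title></head></html>"
print(allows_indexing(blocked_page))
print(allows_indexing(open_page))
```

Such a tag is only a request. It binds no one technically, which is why the legal theories discussed below matter when a bot ignores it.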
Despite its business application, is web scraping stealing when one takes data without permission from a publicly available website, or is it simply making use of resources that are made available to you? How does one tell?
Laws in the United States
Web scraping may be actionable in the United States as copyright infringement, trespass to chattel, contract claims, or under the Computer Fraud and Abuse Act (CFAA).9 While outright duplication—of a substantial part—of a website will be copyright infringement, duplication of facts is allowed.10 Some actions, such as misappropriation and copyright infringement, have already been litigated against use of “scraper tools” that are used to collect pricing information.11
In eBay, Inc. v. Bidder’s Edge, Inc., the court ordered Bidder’s Edge to stop accessing, collecting, and indexing auctions from eBay’s website to aid in Bidder’s Edge’s business of automatically placing bids on items for sale on eBay’s website.12 The court stated that in order to succeed on a trespass to chattels claim under California common law, the plaintiff must demonstrate that the “(1) defendant intentionally and without authorization interfered with plaintiff’s possessory interest in the computer system; and (2) defendant’s unauthorized use proximately resulted in damage to plaintiff.”13
In Craigslist Inc. v. 3Taps Inc., Craigslist sued 3Taps, Padmapper, and Lovely for scraping rental listings posted on Craigslist and republishing the listings on their own sites, alleging that Padmapper (through the use of a company called 3Taps) violated § 1030(a)(2)(C) of the CFAA,21 as well as committed trespass, breach of contract, and copyright infringement.22 3Taps replicates the entire Craigslist website, Padmapper provides real estate listings, consisting of postings originally posted to Craigslist, and Lovely provides real estate listings, including Craigslist content that it receives from 3Taps.
Although Craigslist is a publicly available website, the court held that Padmapper accessed the site “without authorization” because Craigslist had sent the parties a cease and desist letter and implemented IP address blocking measures, which Padmapper intentionally circumvented. This case may be an outlier because Padmapper intentionally circumvented Craigslist’s anti-scraping mechanisms.23
As for the trespass claim, the access to the system must cause “actual damage,” such as impairment to the “condition, quality, or value” of the system or deprivation of its use “for a substantial time.”24 The court stated that “it is plausible that such access could divert sufficient computing and communications resources to impair the website’s and servers’ functionality.”25 Accordingly, the court denied a motion to dismiss the trespass claim, reasoning that actual damage is a question of fact more appropriate for determination on summary judgment or at trial than on a motion to dismiss.
Interestingly, 3Taps and Padmapper argued that copyright protection does not apply to anything on Craigslist’s site because it is all information in the public domain.26 Although limited because of the nature of the website, Craigslist is a place for its users to post their own listings, written in their own words, and thus the users are the authors and copyright owners. To assert copyright protections, Craigslist must have an exclusive right to the content under a license giving it standing to sue for infringement.27 However, without express words actually granting an exclusive license, Craigslist does not have the right to sue for copyright infringement.28
Laws in Europe
Web scraping has not just been judicially considered in the United States. Earlier this year, the Court of Justice of the European Union (CJEU) ruled on both the intellectual property infringement aspect of web scraping and the contractual breach aspect of a website’s “terms and conditions.”29 The CJEU held that where a website operator cannot establish intellectual property rights in its database, an operator may still be able to rely on its website’s terms and conditions to prohibit scraping.30
Ryanair Ltd. sued PR Aviation BV for infringement of database rights under the Database Directive (96/9/EC)31 and breach of its website’s terms and conditions, seeking an order against PR Aviation to refrain from any further infringement and for PR Aviation to pay damages.32 PR Aviation operates a website allowing consumers to search flight data of low-cost air companies. It obtained the necessary data to respond to an individual search by automated means from a dataset linked to the Ryanair website. Ryanair claimed that PR Aviation infringed copyright law and the database sui generis right, and that it had acted contrary to the terms and conditions of Ryanair’s website. PR Aviation successfully argued that Ryanair could not rely on copyright protection because its database was not sufficiently original to qualify for copyright protection, and that there had been insufficient investment by Ryanair, in compiling its database, to claim the sui generis right provided for in the European Database Directive.
In turn, the CJEU upheld Ryanair’s claim that PR Aviation had breached the website’s terms and conditions. Ryanair’s terms and conditions explicitly prohibited “screen scraping.”33 PR Aviation argued that the prohibition against screen scraping was not enforceable because, under article 15 of the Database Directive, any contractual provisions that are contrary to articles 6 and 8 are null and void. The CJEU ruled that the limitations on rights introduced by the Database Directive do not apply to databases that are not protected by the directive. Articles 6, 8, and 15 of the Database Directive do not preclude a website operator from laying down contractual limits on the use of a database without prejudice to applicable national law.34 The case was remanded to Dutch courts to decide the enforceability of the website’s terms and conditions.35
The ruling certainly impacts third-party business models that rely on mining data from websites and social media platforms without permission.36 Businesses using such models would be prudent to determine whether their screen scraping breaches any contractual provisions. In principle, this decision underlines the importance of having clear contractual terms and conditions in force if a website operator wishes to limit such screen scraping.
Extrapolating from the cases in the United States and the European Union, it appears the strongest legal weapon against web scraping is an explicit prohibition against scraping in a site’s terms and conditions. However, the site’s terms and conditions must be conspicuous because scrapers could claim they never saw such restrictions and thus were not bound by them.37 To combat scraping, a website owner should include click-through agreements that require users to click an “I agree” button before gaining access to valuable data.
Another weapon is to technologically block bots from scraping. As Craigslist did, a website operator could also issue cease and desist letters along with IP address blocks to preserve its CFAA and DMCA claims.38 If a scraper disregards a cease and desist letter and circumvents IP address blocks, then a court would likely find that the scraper violated § 1030(a)(2)(C) of the CFAA as well as any applicable DMCA provisions.
Another possibility, although a technical one, is to add authentication procedures (think CAPTCHA39) that limit whether bots can access publicly available information. Also, the website operator could limit the number of searches allowed within a period of time. This essentially blocks large bandwidth uses and monitors any large outflow of data. For several companies, this has proven effective in the short term until scrapers are able to discover a workaround. If scrapers circumvent these technological blockades, similar to IP blocks, the website operator could assert a DMCA claim against them as long as no exceptions apply, as well as a claim under the CFAA based on its “without authorization” element.40
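A limit on the number of searches per period can be sketched as a sliding-window counter, shown here in Python. This is a minimal illustration and does not describe any particular company's system. The class name, the per-client identifier (which might be an IP address or account name), and the thresholds are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SearchRateLimiter:
    """Allows each client at most max_requests searches per window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._history = defaultdict(deque)  # client_id -> request times

    def allow(self, client_id):
        """Record a search and return True, or return False if over limit."""
        now = self.clock()
        history = self._history[client_id]
        # Forget requests that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False  # over the limit: refuse, log, or show a CAPTCHA
        history.append(now)
        return True


# Hypothetical policy: at most 3 searches per client per 60 seconds.
limiter = SearchRateLimiter(max_requests=3, window_seconds=60)
for attempt in range(5):
    print(attempt, limiter.allow("203.0.113.7"))
```

A human visitor rarely reaches such a threshold, while a bot issuing hundreds of searches is cut off quickly. As the paragraph above notes, a determined scraper may work around the limit, for example by spreading requests across many addresses.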
The law is cloudy.41 For the most part, companies have succeeded in stopping unwanted scraping, at least in part. But it is far too early to call this an area of settled law. After all, if a company is giving away information for free, can it stop someone from taking too much? And how much is too much?
Congress has spent a decade debating greater legal protection for publicly disseminated databases, such as stock prices or real estate listings. Reed Elsevier, Martindale-Hubbell’s parent company, lobbied for database protection in Congress, along with eBay (comparison shopping), the National Association of Realtors (online listings), and the Newspaper Association of America (classified advertisements).42 On the other side, coalitions that include Google and Yahoo, along with financial services firms that collect information about companies, libraries, and scientists, traditionally want unfettered free access to information.43
But database owners have yet to gain more control over who uses their data and how they use it. It is too soon to tell, but more cases will likely arise as more startup operations begin scraping data and compiling it as a service to their consumer base, especially in the age of “big data.”
Any case involving the rights and wrongs of website scraping is interesting because there have been so few of them. Still, such cases are becoming more common in the Internet age and the age of “big data.” Researchers and startup operations, particularly search engines and comparison websites, scrape other people’s sites to provide a service. This may be actionable. The law is not 100 percent certain in this area, in part because many sites do not want the negative public relations of being seen to challenge people’s access to their sites. Only time will tell, but be careful when scraping from a publicly accessible website. And remember that just because something is posted on the Internet does not mean it is “free.”
1. Brian Kladko, Screen Scrapers: Web Sites Fight to Stop Theft of Free Data, N.J. Rec., Apr. 3, 2005, available at 2005 WLNR 26680149.
2. ScrapeSentry issued a Scraping Threat Report in April 2014, warning that web scraping is on the rise. Web Scraping and Data Theft on the Increase: Sentor ScrapeSentry Scraping Report Identifies the Risks to Online Businesses, Marketwired (Apr. 10, 2014), http://www.marketwired.com/press-release/web-scraping-and-data-theft-on-the-increase-1898358.htm.
3. ScrapeSentry, The Scraping Threat Report 2015, at 3 (2015); Tara Seals, Data Theft Watch: Web Scraping Attacks Almost Double, Info-Security Mag. (June 23, 2015), http://www.infosecurity-magazine.com/news/data-theft-watch-web-scraping/.
4. Also, scraping can have data privacy implications. Spokeo.com, an information-scraping website that includes estimates of financial worth and education, name, addresses, family members, interests, ethnicity, age, etc., scrapes publicly available databases and aggregates the information in one location. Recently, the United States Supreme Court granted certiorari in Spokeo, Inc. v. Robins, 135 S. Ct. 1892 (2015). The ruling will be significant as it will determine whether plaintiffs whose personal information is exposed may be able to sustain legal action against defendants under certain federal information privacy laws, even when the plaintiffs suffer no real harm. Christin McMeley et al., Supreme Court Grants Cert in Spokeo v. Robins, Privacy & Security Law Blog (Apr. 27, 2015), http://www.privsecblog.com/2015/04/articles/marketing-and-consumer-privacy/supreme-court-grants-cert-in-spokeo-v-robins/.
5. Ryanair has been fighting against screen scrapers across Europe to prevent its customers from being subjected to extra charges and to ensure Ryanair has appropriate contact details to communicate with its customers. Many of the screen scraping websites cause problems for customers and fail to pass on vital information to both passengers and Ryanair regarding issues such as flight changes, web check-in, and special needs assistance. See Ryanair Wins EU Court Case against PR Aviation, Ryanair (Jan. 15, 2015), http://corporate.ryanair.com/news/news/150115-ryanair-wins-eu-court-case-against-pr-aviation/?market=en.
6. See Kladko, supra note 1.
7. Block Search Indexing with Meta Tags, Google Console Help, https://support.google.com/webmasters/answer/93710?hl=en (last visited Sept. 21, 2015).
8. Jerome S. Osteryoung, Web Scraping Analysis Is a Business Necessity, NetMarketZine (Feb. 10, 2011), http://netmarketzine.com/4430/web-scraping-analysis-is-a-business-necessity/; Ritesh Sanghani, Web Research—The Most Effective Methods to Catch on the Nerve of Volatile Market, Bizcommunity.com (Mar. 9, 2015), http://www.bizcommunity.com/Article/100/16/125380.html.
9. The CFAA imposes liability on “[w]hoever . . . intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains . . . information from any protected computer.” Although the CFAA is a criminal statute, it provides a civil remedy where a plaintiff suffers more than $5,000 in aggregate losses during any one-year period. 18 U.S.C. § 1030(a)(2), (4).
10. Feist Publ’ns, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991).
11. EF Cultural Travel BV v. Explorica, Inc., 274 F.3d 577 (1st Cir. 2001) (holding that use of “scraper” program “exceed[ed] authorized access” within the meaning of the CFAA); Sw. Airlines Co. v. Farechase, Inc., 318 F. Supp. 2d 435 (N.D. Tex. 2004) (finding “scraping” illegal under misappropriation theory). But see Tamburo v. Dworkin, 974 F. Supp. 2d 1199, 1216 (N.D. Ill. 2013) (stating in dicta that the allegedly aggrieved site owner is obligated to state use restrictions on its site, additionally indicating that the copying of mere data would not likely, under Feist, constitute infringement).
12. 100 F. Supp. 2d 1058 (N.D. Cal. 2000).
13. Id. at 1069–70.
15. 739 F. Supp. 2d 927 (E.D. Va. 2010).
16. Id. at 932.
18. Id. at 936–37.
19. Id. at 937.
20. This is contrary to the ruling in Ryanair Ltd. v. Billigfluege.de GmbH,  IEHC 47, where Ireland’s High Court ruled Ryanair’s clickwrap agreement to be legally binding because it was plainly visible to the consumer, and held that placing the onus on the user to agree to terms and conditions in order to gain access to online services is sufficient to comprise a contractual relationship. The case is currently under appeal.
21. Section 1030(a)(2)(C) imposes liability on one who “intentionally accesses a computer without authorization . . . and thereby obtains . . . information from any protected computer.”
22. 942 F. Supp. 2d 962, 966 (N.D. Cal. 2013).
23. It is important to note that not only does Padmapper’s access satisfy the “without authorization” element of the CFAA, but it also may implicate the Digital Millennium Copyright Act (DMCA) as it contains provisions regarding the circumvention of some technological barriers to copying intellectual property; thus, if there is some “technological measure that effectively controls access to a work,” it is a violation of the DMCA to circumvent that measure—with exceptions. 17 U.S.C. § 1201(a).
24. Craigslist, 942 F. Supp. 2d at 980.
26. The court denied the defendant’s motion to dismiss Craigslist’s copyright claims on the basis that the Craigslist website is a noncopyrightable compilation. The court found that Craigslist in “deciding which categories to include and under what name” displays some minimal level of creativity. Id. at 972.
27. See id. at 973; Silvers v. Sony Pictures Entm’t, Inc., 402 F.3d 881, 889 (9th Cir. 2005) (en banc).
28. “The meaning of the phrase ‘You also expressly grant and assign to [Craigslist] all rights’ was the subject of some debate at the hearing on these motions, but the ‘all rights’ language relates specifically to enforcement rights—not rights to the content of the posts. The language assigning rights to the content did not use the phrase ‘all rights,’ and did not specify that the rights granted were ‘exclusive.’” Craigslist, 942 F. Supp. 2d at 974.
29. Case C-30/14, Ryanair Ltd. v. PR Aviation BV (Jan. 15, 2015).
30. See id.
31. Database rights are a form of unregistered intellectual property rights introduced by the Database Directive in 1996 and implemented into national law across the European Union. The Database Directive provides two forms of protection: article 3(1) establishes that “databases which, by reason of the selection or arrangement of their contents, constitute the author’s own intellectual creation shall be protected as such by copyright”; and article 7 provides protection where there “has been qualitatively and/or quantitatively a substantial investment in either the obtaining, verification or presentation of the contents [of a database].” However, article 6 allows lawful users to make a copy of a copyright-protected database without consent where it is necessary to do so in order to access its contents. Further, article 8 permits lawful users of a publicly available database to extract and/or reuse insubstantial parts of its contents, as long as this use does not conflict with normal exploitation of the database or unreasonably prejudice the legitimate interest of the database’s author.
32. Typical in the price-comparison business model, consumers book with the third party and the third party receives a commission. Price-comparison websites rely on information obtained by scraping publicly available data from other websites.
33. Ryanair’s terms and conditions provided: “The use of automated systems or software to extract data from this website or www.bookryanair.com for commercial purposes, (‘screen scraping’) is prohibited unless the third party has directly concluded a written license agreement with Ryanair in which permits it access to Ryanair’s price, flight and timetable information for the sole purpose of price comparison.” Case C-30/14, ¶ 16.
34. Id. ¶ 45.
35. The Ryanair judgment does not answer the question of whether a screen scraper would ever be able to rely on the lawful use exceptions set out in article 6 or 8 of the Database Directive if the database owner were able to establish copyright protection or the sui generis right in the database.
36. Mark Weston, US Web Site Scraping Case Proceeds to Trial, Matthew Arnold & Baldwin LLP (May 17, 2010), http://www.mablaw.com/2010/05/us-web-site-scraping-case-proceeds-to-trial/.
37. See Cvent, Inc. v. Eventbrite, Inc., 739 F. Supp. 2d 927 (E.D. Va. 2010).
38. See Craigslist Inc. v. 3Taps Inc., 942 F. Supp. 2d 962, 966–67 (N.D. Cal. 2013).
39. CAPTCHA means “Completely Automated Public Turing test to tell Computers and Humans Apart.” CAPTCHA requires that the user type letters or numbers of a distorted image, sometimes with the addition of an obscured sequence of letters or digits that appears on the screen.
40. 17 U.S.C. § 1201(a).
41. Scraping can be further muddied. Currently, in the e-discovery context, the industry lacks a standard production deliverable format for web content; thus, there is room for interpretation about what format is considered ordinary course of business. If web scraping methods are employed, it is important to know the particular site’s terms of service. If the site does not allow web scraping, then the collection could be considered a breach. Also, when dealing with scraped data, issues of privacy and security arise; and web scrapers will need to tread extremely carefully in order to avoid problems under applicable privacy laws.
42. See Kladko, supra note 1.
43. See id.
44. Craigslist Inc. v. Eventbrite, Inc., 942 F. Supp. 2d 962, 967 (E.D. Va. 2010).
45. In February 2010, Ireland’s An Ard-Chúirt delivered a verdict that illustrates the inchoate state of developing case law. See, e.g., Case C-30/14, Ryanair Ltd. v. PR Aviation BV (Jan. 15, 2015).