Indexing Is Fair Use

Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015) · Read the opinion (Justia)

Google scanned twenty million books, built a full-text search index, and displayed short snippets in results. The Authors Guild called it infringement. The Second Circuit called it fair use. The reasoning: the purpose of a book is to be read; the purpose of an index is to be searched. Different purpose, different analysis.

Twenty million books

Starting in 2004, Google partnered with major research libraries, including Michigan, Harvard, Stanford, Oxford, and the New York Public Library, to scan their collections. By the time the courts ruled, Google had digitized roughly twenty million volumes, many of them still under copyright. The scans were not made available for reading in full. They fed a search index. A user who searched for a phrase would see which books contained it, along with a short "snippet" in context. Each snippet was about an eighth of a page, and a search showed at most three snippets per book for a given term.
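The mechanics are simple enough to sketch. What follows is a minimal illustration in Python, not Google's implementation: `build_index`, `search`, `max_snippets`, and `window` are hypothetical names, and the limits are placeholder values. It shows the point the court later relied on: the index has to read every word of every book, while a search only ever displays a few short fragments.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(books):
    """Build an inverted index over the full text of every book.

    Returns (index, tokens): index maps word -> list of (book_id, position),
    tokens maps book_id -> list of words. The whole text is consumed here
    even though it is never displayed in full.
    """
    tokens = {book_id: tokenize(text) for book_id, text in books.items()}
    index = defaultdict(list)
    for book_id, words in tokens.items():
        for pos, word in enumerate(words):
            index[word].append((book_id, pos))
    return index, tokens


def search(index, tokens, term, max_snippets=3, window=3):
    """Return at most max_snippets short contexts per book for one term.

    max_snippets and window are illustrative limits (assumptions), standing
    in for whatever display caps a real service applies.
    """
    results = defaultdict(list)
    for book_id, pos in index.get(term.lower(), []):
        if len(results[book_id]) < max_snippets:
            words = tokens[book_id]
            start = max(0, pos - window)
            results[book_id].append(" ".join(words[start:pos + window + 1]))
    return dict(results)
```

With a book that mentions a term five times, `search` still returns only three fragments for that book, and a book that never mentions the term does not appear in the results at all.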

The Authors Guild sued in 2005, claiming mass infringement. The case wound through courts for a decade. A proposed settlement, which would have created a comprehensive digital library with revenue sharing, was rejected in 2011. The case then turned on fair use alone: in 2013 the district court granted summary judgment for Google, and the Authors Guild appealed.

Different purpose, different analysis

The Second Circuit affirmed the district court: Google Books is fair use. The opinion, written by Judge Leval, rested on a single analytical move. Copying an entire book to build a search index is transformative because the copy serves a fundamentally different purpose than the original.

Google's making of a digital copy to provide a search function is a transformative use, which augments public knowledge by making available information about Plaintiffs' books without providing the public with a substantial substitute for matter protected by the Plaintiffs' copyright interests.

The word "transformative" carried the analysis. Google did not alter the books. It did not add commentary. It copied them wholesale. But the purpose of the copy was different from the purpose of the original. A book is written to be read. An index is built to be searched. The transformation was not in the content but in the function.

The four-factor analysis

Factor 1: Purpose and character of the use. Transformative. Google's use "communicates something new and different from the original" by enabling search across a corpus no human could survey. The commercial nature of Google's enterprise did not defeat the claim. Transformative purpose can outweigh commercial motive.

Factor 2: Nature of the copyrighted works. The plaintiffs' books were published works of nonfiction. The court gave this factor little weight in either direction, noting that it rarely plays a significant role. When the use is transformative, the nature of the original matters less.

Factor 3: Amount and substantiality of the portion used. Google copied entire books. Ordinarily that weighs heavily against fair use. But the court held that copying the whole work was necessary for the transformative purpose. You cannot build a search index from partial copies. What mattered was how much of the text reached the public, and that was limited to snippets. The amount used was justified by the function it served.

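Factor 4: Effect on the market. No evidence that snippet display meaningfully substituted for buying books. The Second Circuit acknowledged that snippets might cost some sales but found any such loss unlikely to be significant. The district court had gone further, finding that Google Books likely increases book sales by helping readers discover works they would not otherwise have found. The snippet view was too fragmentary to serve as a replacement for reading.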

Google's provision of snippet view is not a "market substitute" because Google does not provide a substantial enough portion of the protected text to threaten the rights-holders' interest.

HathiTrust: the companion case

A year earlier, the same circuit decided Authors Guild v. HathiTrust (755 F.3d 87, 2d Cir. 2014). HathiTrust was a consortium of research libraries that received digital copies from Google and used them for full-text search and accessibility services for the visually impaired. The Second Circuit ruled both uses were fair use.

Together, the two cases established a pattern. Digitizing copyrighted works for search and accessibility is fair use when the purpose is transformative and the display is limited. The library's traditional role, making knowledge findable, survived the transition from card catalogs to full-text indexes. The legal framework bent to accommodate the new technology rather than breaking.

The open question

The court said copying for indexing is fair use because indexing serves a different purpose than reading. But the opinion's logic does not stop at search indexes. If copying for indexing is transformative, what about copying for embedding? For training?

An embedding model converts text into a vector. The vector preserves semantic relationships but is not built to reproduce the original text. The transformation is arguably more radical than an index: an index still contains the original words; an embedding contains none of them. If a different purpose justifies copying whole works, and embeddings serve a purpose even further removed from reading than indexes do, the opinion's logic points in a clear direction.
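The storage difference can be made concrete with a toy comparison. This is a sketch under loud assumptions: `toy_embedding` is a hypothetical hashed bag-of-words vector, not a learned embedding model, and it captures no semantics. It illustrates only what each structure keeps: the index stores the words themselves as keys, while the vector stores a fixed number of floats regardless of how long the text is. It says nothing about whether text can be recovered from real embeddings.

```python
import hashlib
import math
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def inverted_index(text):
    """Map each word to its positions. The keys are the original words."""
    index = {}
    for pos, word in enumerate(tokenize(text)):
        index.setdefault(word, []).append(pos)
    return index


def toy_embedding(text, dim=16):
    """Hashed bag-of-words vector of length dim, L2-normalized.

    A stand-in for an embedding model (assumption for illustration only):
    each word increments one bucket chosen by a stable hash, so the output
    holds numbers and no word strings.
    """
    vec = [0.0] * dim
    for word in tokenize(text):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))
```

Calling `inverted_index("A book is written to be read")` returns a dictionary whose keys include `"book"` and `"read"`. Calling `toy_embedding` on the same sentence returns sixteen floats. Word order is discarded entirely, so two texts with the same words in a different order produce the same vector.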

The courts have not reached that question yet. The pending cases—New York Times v. OpenAI, Thomson Reuters v. Ross Intelligence, and others—will test whether the "transformative purpose" doctrine extends to machine learning. The Second Circuit built the framework. Whether it holds under the weight of generative AI is the next chapter of this story.

Neighbors

Wikipedia: Authors Guild v. Google
Wikipedia: Authors Guild v. HathiTrust
Wikipedia: Google Books
Wikipedia: Fair use
Wikipedia: Transformative use
🔑 Logic — the legal reasoning structure: if indexing is transformative, and embedding is more transformative than indexing, then the doctrine extends—unless the court draws a line the opinion does not contain

Blog connection: The court said indexing is fair use because it serves a different purpose. Canon indexes copyleft works by meaning. The derivative-work question—does compiled output carry the prose's copyleft obligation?—is the frontier the courts have not reached. Canon

by june.kim