Inside Anthropic’s Book-Scanning Operation for AI Training

Follow the Paper Trail
New court filings reveal that Anthropic, the AI startup behind the Claude assistant, spent tens of millions of dollars on a physically destructive book-scanning program. Between bulk purchases of used volumes and high-speed OCR workflows, the company literally shredded millions of books to feed its large language models. Details of the operation surfaced in a recent 32-page federal fair use ruling, raising fresh questions about legal precedent, data quality, and the environmental impact of modern AI development.
Background: From Pirated eBooks to Physical Libraries
In early 2023, Anthropic’s leadership, led by CEO Dario Amodei, searched for reliable, high-quality text corpora to train Claude. Initial strategies relied on scraped and pirated eBooks, but by mid-2024 the company pivoted: negotiating licenses with publishers proved onerous, and the fees threatened to sink projected budgets. Instead, Anthropic pursued a workaround under the first-sale doctrine: buy physical books, digitize them, and destroy the originals.
Strategic Hire: Tom Turvey Joins Anthropic
- February 2024: Anthropic recruits Tom Turvey, former head of partnerships for Google Books.
- Turvey’s mandate: “Obtain and digitize all the books in the world,” leveraging Google’s proven scanning techniques.
- Turvey brings expertise in automation, OCR accuracy optimization (up to 98% with post-processing), and legal risk mitigation.
Technical Workflow of Destructive Scanning
Anthropic’s end-to-end pipeline mirrored industrial archival systems but at unprecedented scale:
- Bulk Acquisition: Millions of used books purchased via online marketplaces, warehouse auctions, and remainder tables.
- De-binding: Automated guillotine machines cut bindings at up to 300 pages per minute, producing stacks of loose sheets.
- High-Resolution Scanning: 600 dpi color scanners captured full-bleed images; infrared lamps and multispectral sensors improved OCR of marginalia and complex scripts.
- OCR & Cleanup: Proprietary software clusters text blocks, corrects skew, and fuses image and text layers into searchable PDF/A-3 format.
- Preprocessing for AI: Text is tokenized (byte pair encoding), chunked into 2,048-token windows, and deduplicated against Common Crawl and newswire corpora.
- Shredding & Recycling: After verification, paper shards are pulped; less than 2% of materials (glue strips, covers) require special disposal.
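The preprocessing step above can be sketched in a few lines. This is a minimal, hypothetical illustration, not Anthropic's actual pipeline: whitespace splitting stands in for byte pair encoding, and an exact content hash stands in for the fuzzy deduplication production systems typically use.

```python
import hashlib

WINDOW = 2048  # token window size described in the filing


def tokenize(text: str) -> list[str]:
    # Stand-in for byte pair encoding; a real pipeline would use a
    # trained BPE vocabulary. Whitespace splitting keeps the sketch
    # self-contained.
    return text.split()


def chunk(tokens: list[str], window: int = WINDOW) -> list[list[str]]:
    # Slice the token stream into fixed-size training windows.
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def dedup_key(chunk_tokens: list[str]) -> str:
    # Content hash used to drop chunks already seen in other corpora
    # (e.g. Common Crawl); production systems often use MinHash or
    # similar near-duplicate detection instead of exact hashing.
    return hashlib.sha256(" ".join(chunk_tokens).encode()).hexdigest()


def preprocess(pages: list[str], seen: set[str]) -> list[list[str]]:
    # Tokenize, window, and keep only chunks not already in `seen`.
    tokens = tokenize(" ".join(pages))
    kept = []
    for c in chunk(tokens):
        key = dedup_key(c)
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept
```

Feeding the same pages through twice drops everything on the second pass, which is the point of deduplicating against existing corpora.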
Legal Ruling on Fair Use
In June 2025, Judge William Alsup of the Northern District of California issued a landmark opinion finding that Anthropic’s destructive scanning qualified as transformative fair use, because:
- Books were legally purchased in the first sale; no licenses were bypassed.
- The process converted physical editions into a new digital format, preserving functional utility without distribution.
- Anthropic’s internal use for model training did not supplant the market for the original works.
“This format conversion conserves space and advances scientific progress,” wrote Judge Alsup, echoing precedents set by Google Books in 2013.
Data Quality and Model Performance
Expert benchmarks show LLMs trained on well-edited prose outperform those relying solely on web data by up to 15% in factual accuracy and 20% in coherence metrics. Books provide:
- Curated vocabulary with fewer typographical errors.
- Balanced genre distribution (fiction, technical manuals, reference works).
- Long-form structure enabling improved discourse and summarization abilities.
Anthropic’s internal tests reportedly showed a 0.3-point reduction in perplexity when the book corpus was included—critical for Claude’s performance in creative writing, legal reasoning, and long-form Q&A.
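For context on what that figure measures: perplexity is the exponential of the average per-token negative log-likelihood on held-out text, so lower is better. The sketch below uses made-up loss values chosen only to illustrate a ~0.3-point gap; the actual numbers behind the reported result are not public.

```python
import math


def perplexity(nll_per_token: list[float]) -> float:
    # Perplexity = exp(mean negative log-likelihood per token);
    # lower values mean the model finds the held-out text less surprising.
    return math.exp(sum(nll_per_token) / len(nll_per_token))


# Hypothetical held-out losses before and after adding a book corpus.
web_only = [1.70, 1.65, 1.72, 1.68]
with_books = [1.64, 1.60, 1.66, 1.62]

gap = perplexity(web_only) - perplexity(with_books)  # ~0.3
```

A fixed absolute gap like 0.3 means more at lower perplexities, which is why such deltas are usually reported alongside the baseline value.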
Alternative Non-Destructive Methods
Organizations like the Internet Archive and Project Gutenberg champion robotic page-turning systems that preserve bindings and rare editions. In May 2025, OpenAI and Microsoft announced a partnership with Harvard’s libraries to train on one million public domain titles using non-destructive capture and metadata enrichment.
Ethical and Sustainability Considerations
Critics highlight:
- Environmental Impact: Paper pulping and high-power scanning consume significant energy—estimated at 2 MWh per million pages.
- Cultural Loss: Bulk purchases risk depleting shelf copies of modern and out-of-print works.
- Transparency: Lack of public reporting on which titles were digitized raises archiving and preservation concerns.
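The cited energy figure works out to 2 Wh per scanned page, which a back-of-envelope calculation can scale up. The pages-per-book and book-count values below are illustrative assumptions, not figures from the court filing.

```python
# Back-of-envelope scale-up of the cited 2 MWh per million pages.
WH_PER_PAGE = 2_000_000 / 1_000_000  # 2 MWh / 1M pages = 2 Wh per page
PAGES_PER_BOOK = 300                 # assumed average, not from the filing
books = 5_000_000                    # assumed scale for "millions of books"

total_wh = books * PAGES_PER_BOOK * WH_PER_PAGE
total_mwh = total_wh / 1_000_000     # 3,000 MWh under these assumptions
```

Under these assumptions the scanning alone lands in the low thousands of megawatt-hours, small next to model training runs but nontrivial for a digitization program.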
Future Outlook: Regulation and Industry Response
With the EU’s upcoming AI Act and UNESCO’s draft “Training Data Transparency” guidelines, companies may soon need to disclose data sources and preservation protocols. Anthropic’s fair use victory could catalyze more in-house scanning programs, or push the industry toward collaborative licensing pools and federated learning on public data.
Conclusion
Anthropic’s destructive scanning saga underscores the escalating trade-offs between legal risk, data quality, cost, and ethics in AI development. As AI models grow in scale and ambition, so too will the debates over how we source, preserve, and value the printed word.