Meta’s Torrenting Relevant in Llama AI Copyright Case

Case Background and Current Status
In a pivotal order dated June 26, 2025, U.S. District Judge Vince Chhabria partially granted Meta’s motion for summary judgment in a lawsuit brought by 13 bestselling authors—including Sarah Silverman and Junot Díaz—over the use of their copyrighted works to train Meta’s Llama large language models (LLMs). While Meta prevailed on most copyright infringement claims, one outstanding issue remains: whether torrenting pirated e-books from shadow libraries like LibGen, amounting to roughly 80.6 terabytes of data, bears on the fair use analysis.
Chhabria ordered the parties to meet on July 11 to chart next steps for the authors’ separate allegation that Meta unlawfully distributed their books during the BitTorrent process. Although discovery on this topic remains scant—the torrenting claim was raised late—the judge rejected Meta’s argument that torrenting is “irrelevant” to fair use.
Legal Analysis: Fair Use Factors and Torrenting Relevance
The U.S. Copyright Act codifies four fair use factors at 17 U.S.C. § 107. Chhabria identified at least three ways that Meta’s torrent-based acquisition could influence the analysis:
1. Character of the Use and Bad Faith
“The law is in flux about whether bad faith is relevant to fair use.” – Judge Vince Chhabria
According to the authors, Meta first approached traditional publishers for licenses. After negotiations stalled, CEO Mark Zuckerberg allegedly “escalated” efforts, resorting to peer-to-peer piracy. Chhabria suggested that downloading without a license could demonstrate bad faith under Factor 1, even though courts are split on whether intent matters.
2. Impact on the BitTorrent Ecosystem
If Meta’s clients or servers seeded material back into the network, they may have “benefitted those who created the libraries and thus supported their unauthorized distribution,” potentially deepening infringement. Chhabria noted precedent finding most P2P file-sharing to be infringing, and observed that some of the libraries Meta tapped have previously been found liable.
3. Transformative Use Connection
Meta maintained that its ultimate use—training Llama’s neural networks—was “highly transformative.” The judge countered that torrenting is part and parcel of that same pipeline: downloading undertaken to enable training must be evaluated in light of the training’s transformative purpose.
Technical Deep Dive: Torrenting Mechanics and Data Scale
- Data Volume: Expert analysis estimates Meta acquired roughly 80.6 TB—equivalent to 20 billion pages—via BitTorrent over multiple magnet links.
- BitTorrent Protocol: The process relies on a Distributed Hash Table (DHT) and peer exchange (PEX) to discover peers holding pieces of the files. Torrent clients verify each downloaded piece against a SHA-1 hash from the torrent’s metadata before reassembly.
- Seeding and Swarms: If Meta’s infrastructure seeded after downloading, it increased swarm health and may itself have constituted unauthorized distribution of the works. Authors may subpoena server logs to confirm seeding activity.
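The piece-verification step described above can be sketched minimally as follows. This is an illustrative fragment, not an actual client: in BitTorrent v1, each fixed-size piece of the download is hashed with SHA-1 and compared against the digest recorded in the .torrent metadata before the piece is accepted; the sample data here is hypothetical.

```python
import hashlib

def verify_piece(piece: bytes, expected_sha1_hex: str) -> bool:
    """Recompute a downloaded piece's SHA-1 digest and compare it to the
    hash recorded in the torrent metadata (BitTorrent v1 semantics)."""
    return hashlib.sha1(piece).hexdigest() == expected_sha1_hex

# Hypothetical piece; in a real swarm this arrives from peers in 16 KiB blocks.
piece = b"example piece data" * 1024
expected = hashlib.sha1(piece).hexdigest()  # would come from the .torrent file

assert verify_piece(piece, expected)              # intact piece accepted
assert not verify_piece(piece + b"x", expected)   # corrupted piece rejected
```

Because every piece is hash-checked this way, forensic experts can match files recovered from a server against known shadow-library torrents with high confidence.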
AI Model Training Pipeline and Data Requirements
Training state-of-the-art LLMs like Llama 2 or Llama 3 requires massive, diverse corpora. Industry benchmarks show:
- Token Count: Leading models use 1–3 trillion tokens, sourced from books, web crawls, and code repositories.
- Compute Infrastructure: Meta employs GPU clusters of NVIDIA A100 and H100 Tensor Core GPUs, running PyTorch to train and fine-tune transformer architectures.
- Preprocessing: E-books undergo OCR correction, HTML/XML parsing, and Unicode normalization. Torrent-sourced files often require additional cleanup due to inconsistent formatting.
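The cleanup stage described above can be sketched roughly as follows. Meta’s actual pipeline is not public; the tag stripping, Unicode normalization, and whitespace collapsing below are illustrative of the kind of preprocessing torrent-sourced e-books typically need.

```python
import re
import unicodedata
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content while dropping HTML/XML e-book markup."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_ebook_text(raw: str) -> str:
    extractor = TextExtractor()
    extractor.feed(raw)
    text = "".join(extractor.parts)
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed chars
    text = re.sub(r"\s+", " ", text).strip()   # collapse irregular whitespace
    return text

# Decomposed "e" + combining acute becomes a single "é" after NFC.
print(clean_ebook_text("<p>Caf\u0065\u0301   ligature</p>"))
```

Normalization matters at this scale: the same book ripped by different uploaders can differ byte-for-byte while being textually identical, and deduplication only works after such canonicalization.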
Industry Implications and Future Licensing Markets
Chhabria suggested that large-scale licensing markets for AI training data are likely to emerge, and that unauthorized training that displaces such markets could weigh against fair use. Publishers may need to renegotiate subsidiary rights (e.g., digital text mining). Key developments include:
- Collective Licensing Platforms: Analogous to music rights organizations (ASCAP, BMI), new entities could aggregate text-mining licenses for book authors.
- Rights Metadata Standards: Implementation of ONIX for Books extensions specifying AI training rights and data use constraints.
- Recent Deals: OpenAI’s settlement with news publishers (March 2025) included pay-per-token rates—suggestive of a pricing model for book text licensing.
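A back-of-envelope calculation shows how pay-per-token pricing might translate to book licensing. Every number below is hypothetical—the actual deal terms are not public, and the per-token rate, tokens-per-word ratio, and book length are illustrative assumptions only.

```python
# Hypothetical pay-per-token licensing math; none of these figures come
# from any actual deal.
RATE_PER_MILLION_TOKENS = 1.00   # USD, assumed rate
TOKENS_PER_WORD = 1.3            # rough ratio for English text under common tokenizers
WORDS_PER_BOOK = 90_000          # typical trade-book length

tokens_per_book = WORDS_PER_BOOK * TOKENS_PER_WORD
cost_per_book = tokens_per_book / 1_000_000 * RATE_PER_MILLION_TOKENS
print(f"{tokens_per_book:,.0f} tokens -> ${cost_per_book:.2f} per book")
```

Under assumptions in this range, a single book licenses for well under a dollar per training pass, which is why collective platforms that aggregate millions of titles would be the economically meaningful unit.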
Regulatory Landscape and Emerging Standards
Across jurisdictions, regulators are beginning to weigh in on AI training:
- U.S. Copyright Office: Proposed rulemaking on text and data mining exemptions, expected Q4 2025.
- EU AI Act: Requires general-purpose AI providers to publish summaries of copyrighted material used in training and to comply with EU copyright law, directly affecting LLM datasets.
- WIPO AI Treaties: Discussions on international IP frameworks for AI-generated content continue, with proposals to clarify permissible training data.
Next Steps and Outlook
As discovery unfolds, authors may seek server logs and network traffic records, or retain digital forensics experts to trace Meta’s BitTorrent nodes. If evidence emerges that Meta seeded or otherwise supported the P2P network, the remaining fair use claim could tip in the authors’ favor. Regardless of the outcome, this case is already spurring innovation in licensing infrastructure and could redefine how AI developers source copyrighted text.