Meta’s Llama 3.1 70B Memorizes Half of Harry Potter

Overview
In June 2025, a multidisciplinary team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University published a detailed analysis showing that Meta's open-weight Llama 3.1 70B model memorized 42 percent of Harry Potter and the Sorcerer's Stone, in the sense that it can reproduce 50-token excerpts verbatim with greater than 50 percent probability. That rate far exceeds those measured for comparable open-weight and commercial models, and it injects fresh momentum into copyright litigation over generative AI.
Background: Copyright Suits Against AI Labs
- December 2023: The New York Times sued OpenAI, demonstrating that GPT-4 could reproduce exact-match passages from Times articles.
- February 2024: Authors including Richard Kadrey filed a class action against Meta over Llama models.
- April 2025: EU Parliament advanced the AI Act, mandating transparency and watermarking for generative outputs.
Key Findings
- Llama 3.1 70B: Memorization rate of 42 percent for the first book in the Harry Potter series.
- Llama 1 65B (Feb 2023): Only 4.4 percent memorization on the same title.
- Microsoft’s model and EleutherAI’s GPT-NeoX: Memorization rates ranged from 1 percent to 8 percent.
- Popular vs. obscure titles: Llama 3.1 70B reproduced ~35 percent of The Hobbit and 30 percent of 1984, but only 0.13 percent of Sandman Slim.
Technical Methodology: Measuring Memorization
The researchers partitioned each of 36 books into overlapping 100-token windows, used the first 50 tokens of each window as a prompt, and computed the joint probability of the next 50 tokens directly from the model's logits rather than by sampling outputs. A passage counts as "memorized" if P(reproduction) > 50 percent, which implies an average per-token probability of roughly 98.5 percent. Because the logits expose the full next-token distribution at each step, probabilities can be read off directly, yielding precise estimates without generating astronomical numbers of samples.
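Once per-token probabilities have been read off the logits, the scoring itself is simple arithmetic: the probability of reproducing the whole continuation is the product of the per-token probabilities. A minimal sketch of that step, assuming the 50 per-token probabilities have already been extracted (the function names here are illustrative, not the paper's):

```python
import math

def joint_logprob(per_token_probs):
    """Log-probability of reproducing the whole continuation:
    the sum of per-token log-probabilities (product in prob space)."""
    return sum(math.log(p) for p in per_token_probs)

def is_memorized(per_token_probs, threshold=0.5):
    """The paper's criterion: P(verbatim reproduction) > 50 percent."""
    return math.exp(joint_logprob(per_token_probs)) > threshold

# 99 percent per token over 50 tokens: 0.99**50 ~ 0.605 -> memorized.
print(is_memorized([0.99] * 50))
# 97 percent per token is not enough: 0.97**50 ~ 0.218.
print(is_memorized([0.97] * 50))
```

Working in log space, as above, avoids numerical underflow when spans get long.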
Why 50 Tokens?
- Statistical significance: Longer spans reduce false positives from random generation.
- Legal threshold: Courts may consider any contiguous reproduction > 50 tokens as substantial copying.
- Efficiency: 50-token sequences strike a balance between detectability and computational cost.
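The first point is easy to quantify: the shorter the span, the lower the per-token bar that random generation must clear. A quick sketch of the arithmetic (the helper name is mine, not the paper's):

```python
def p_min(n, target=0.5):
    """Minimum average per-token probability so that the joint
    probability over n tokens exceeds target: target ** (1/n)."""
    return target ** (1.0 / n)

for n in (10, 50, 100):
    print(f"{n:>3} tokens: per-token probability must average {p_min(n):.4f}")
```

At 10 tokens a model only needs ~93 percent per token, which fluent text often achieves by chance; at 50 tokens the bar rises to ~98.6 percent, which essentially requires memorization.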
Deep Dive: Training Regimen and Data Sources
Meta reported training Llama 3.1 70B on ~15 trillion tokens drawn from mixed sources: CommonCrawl, code repositories, and Books3, a curated corpus of ~200,000 English-language books. The roughly tenfold increase in training tokens from Llama 1 to Llama 3 likely exacerbated memorization. Two working hypotheses explain the Harry Potter spike:
- Dataset duplication: Books3 may have been over-sampled or re-used, increasing example frequency.
- Secondary sources: Online fan forums, book reviews, and educational sites quoting large passages.
“If secondary quotations were the sole cause, you’d expect scattered citations rather than near-complete reproduction,” said Professor Mark Lemley (Stanford). “The data suggests the full text was present in training.”
Additional Analysis: Model Architecture and Hyperparameters
- Parameter count: 70 billion parameters across an 80-layer Transformer stack.
- Context window: 128,000 tokens, enabling long-range dependencies.
- Sampling strategy: Top-k=50, temperature=0.7 by default; tests used temperature=0 to evaluate peak probabilities.
- Regularization: minimal weight decay and no differential privacy; DP training can curb memorization of rare examples, but at a cost in utility.
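How the sampling hyperparameters above interact, and why temperature = 0 reduces to a deterministic argmax suited to probing peak probabilities, can be sketched with stdlib-only code (an illustration, not Meta's implementation):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=50, rng=random):
    """Top-k sampling with temperature. temperature=0 degenerates to
    greedy argmax decoding, which always picks the peak-probability token."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the top_k highest-scoring tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 0.5, -1.0, 3.0]
print(sample_token(logits, temperature=0))  # greedy: index of the max logit, 3
```

With top_k=2 on these logits, only indices 3 and 0 can ever be emitted, which is the point of top-k: low-probability tails are cut off entirely.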
Potential Mitigation Strategies
To address unwanted memorization, practitioners are exploring:
- Data deduplication at scale, using locality-sensitive hashing to remove near-identical passages.
- Differentially private SGD, imposing noise on gradient updates to limit memorization of rare examples.
- Adaptive curriculum learning, exposing the model to high-frequency examples less often.
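The first strategy can be illustrated with MinHash, the signature scheme that locality-sensitive hashing builds on: near-identical passages agree on most positions of their signatures, so a dedup pipeline can bucket them together and drop the copies. A minimal sketch (illustrative only; a production system would band signatures into LSH buckets rather than compare pairs directly):

```python
import hashlib

def shingles(text, k=8):
    """Character k-grams; word n-grams work equally well."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_perm=64):
    """One minimum per salted hash function. Two near-identical passages
    agree on most positions of this signature."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return sig

def est_jaccard(a, b):
    """Fraction of agreeing signature positions estimates Jaccard similarity."""
    sig_a, sig_b = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

A passage and a lightly edited copy of it score near 1.0; unrelated passages score near 0, so a threshold around 0.8 catches near-duplicates without pairwise set comparison.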
Regulatory and Ethical Landscape
As of mid-2025, the EU AI Act requires watermarking of AI-generated text and audits for memorization. In the U.S., the Copyright Office is evaluating safe-harbor provisions for model training, while pending bills in Congress aim to define fair use boundaries for LLM pretraining.
Future Research Directions
Key open questions for academia and industry:
- Can rate-limiting and token-masking during training reduce verbatim reproduction without degrading generalization?
- What role do advanced memorization detectors like Rolling Winnowing play in model governance?
- How can open benchmarks like the Memorization Attribution Suite (MAS) standardize testing across closed and open-weight models?
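For context, classic winnowing (Schleimer, Wilkerson, and Aiken, 2003) fingerprints a text by keeping the minimum hash in each window of consecutive k-gram hashes; whether the "Rolling Winnowing" detector named above works exactly this way is my assumption. A minimal sketch of the classic algorithm:

```python
import hashlib

def kgram_hashes(text, k=5):
    """64-bit hash of every character k-gram in the text."""
    return [int.from_bytes(hashlib.sha1(text[i:i + k].encode()).digest()[:8], "big")
            for i in range(len(text) - k + 1)]

def winnow(text, k=5, w=4):
    """Winnowing fingerprint: the minimum hash in every window of w
    consecutive k-gram hashes. Fingerprints shared between a model's
    output and a source text flag likely verbatim reuse."""
    hashes = kgram_hashes(text, k)
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

src = "It was a bright cold day in April, and the clocks were striking thirteen."
out = "...and the clocks were striking thirteen."
overlap = winnow(src) & winnow(out)
print(len(overlap) > 0)  # shared fingerprints reveal the copied span
```

The guarantee that makes winnowing useful for governance: any shared substring longer than w + k - 1 characters is certain to contribute at least one shared fingerprint, so sufficiently long verbatim reproduction cannot slip through.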
Legal Implications: Three Theories of Liability
In U.S. copyright law, liability theories include:
- Unauthorized reproduction during the training data ingestion phase.
- Derivative work creation, where copyrighted text is embedded in model weights.
- Infringing output, when the model generates protected text.
Meta may invoke a fair use defense, citing Authors Guild v. Google (2015), but the sheer scale of memorized text complicates any claim of "transformative use." Closed-weight providers (OpenAI, Anthropic, Google) can deploy output filters and largely avoid external audits, while open-weight models face heightened scrutiny precisely because anyone can inspect and probe their weights.
Conclusion
This study underscores that memorization is not a fringe behavior but a quantifiable phenomenon that varies dramatically across models, datasets, and hyperparameters. As litigation advances and regulation tightens, AI labs must adopt rigorous data-governance protocols and invest in privacy-preserving training techniques to balance innovation with copyright compliance.