Meta’s Llama 3.1 70B Memorizes Half of Harry Potter

Overview
In June 2025, a multidisciplinary team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University published a detailed analysis showing that Meta's open-weight Llama 3.1 70B model memorized 42 percent of Harry Potter and the Sorcerer's Stone, in the sense that it can reproduce 50-token excerpts verbatim with greater than 50 percent probability. That rate far exceeds those measured for comparable open-weight and commercial models, and it injects fresh momentum into copyright litigation over generative AI.
Background: Copyright Suits Against AI Labs
- December 2023: The New York Times sued OpenAI, demonstrating that GPT-4 could reproduce exact-match passages from Times articles.
- February 2024: Authors including Richard Kadrey filed a class action against Meta over Llama models.
- April 2025: EU Parliament advanced the AI Act, mandating transparency and watermarking for generative outputs.
Key Findings
- Llama 3.1 70B: Memorization rate of 42 percent for the first book in the Harry Potter series.
- Llama 1 65B (Feb 2023): Only 4.4 percent memorization on the same title.
- Microsoft’s model and EleutherAI’s GPT-NeoX: Memorization rates ranged from 1 percent to 8 percent.
- Popular vs. obscure titles: Llama 3.1 70B reproduced ~35 percent of The Hobbit and 30 percent of 1984, but only 0.13 percent of Sandman Slim.
Technical Methodology: Measuring Memorization
The researchers partitioned each of 36 books into overlapping 100-token windows, used the first 50 tokens of each window as a prompt, and computed the joint probability of the next 50 tokens directly from the model's logits rather than by sampling outputs. A passage counts as "memorized" if P(reproduction) > 50 percent, which implies an average per-token probability of roughly 98.5 percent. Because the logits expose the full next-token distribution at each step, probabilities can be read off directly, yielding precise estimates without generating astronomical numbers of samples.
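Once per-token probabilities have been read off the logits, the scoring itself is simple arithmetic: the probability of reproducing the whole continuation is the product of the per-token probabilities. A minimal sketch of that step, assuming the 50 per-token probabilities have already been extracted (the function names here are illustrative, not the paper's):

```python
import math

def joint_logprob(per_token_probs):
    """Log-probability of reproducing the whole continuation:
    the sum of per-token log-probabilities (product in prob space)."""
    return sum(math.log(p) for p in per_token_probs)

def is_memorized(per_token_probs, threshold=0.5):
    """The paper's criterion: P(verbatim reproduction) > 50 percent."""
    return math.exp(joint_logprob(per_token_probs)) > threshold

# 99 percent per token over 50 tokens: 0.99**50 ~ 0.605 -> memorized.
print(is_memorized([0.99] * 50))
# 97 percent per token is not enough: 0.97**50 ~ 0.218.
print(is_memorized([0.97] * 50))
```

Working in log space, as above, avoids numerical underflow when spans get long.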
Why 50 Tokens?
- Statistical significance: Longer spans reduce false positives from random generation.
- Legal threshold: Courts may consider any contiguous reproduction > 50 tokens as substantial copying.
- Efficiency: 50-token sequences strike a balance between detectability and computational cost.
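The first point is easy to quantify: the shorter the span, the lower the per-token bar that random generation must clear. A quick sketch of the arithmetic (the helper name is mine, not the paper's):

```python
def p_min(n, target=0.5):
    """Minimum average per-token probability so that the joint
    probability over n tokens exceeds target: target ** (1/n)."""
    return target ** (1.0 / n)

for n in (10, 50, 100):
    print(f"{n:>3} tokens: per-token probability must average {p_min(n):.4f}")
```

At 10 tokens a model only needs ~93 percent per token, which fluent text often achieves by chance; at 50 tokens the bar rises to ~98.6 percent, which essentially requires memorization.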
Deep Dive: Training Regimen and Data Sources
Meta reported training Llama 3.1 70B on ~15 trillion tokens drawn from mixed sources: CommonCrawl, code repositories, and Books3, a curated corpus of ~200,000 English-language books. The roughly tenfold increase in training tokens from Llama 1 to Llama 3 likely exacerbated memorization. Two working hypotheses explain the Harry Potter spike:
- Dataset duplication: Books3 may have been over-sampled or re-used, increasing example frequency.
- Secondary sources: Online fan forums, book reviews, and educational sites quoting large passages.
“If secondary quotations were the sole cause, you’d expect scattered citations rather than near-complete reproduction,” said Professor Mark Lemley (Stanford). “The data suggests the full text was present in training.”
Additional Analysis: Model Architecture and Hyperparameters
- Parameter count: 70 billion parameters across an 80-layer Transformer stack.
- Context window: 128,000 tokens, enabling long-range dependencies.
- Sampling strategy: Top-k=50, temperature=0.7 by default; tests used temperature=0 to evaluate peak probabilities.
- Regularization: minimal weight decay and no differential privacy; DP training can curb memorization of rare examples, but at a cost in utility.
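How the sampling hyperparameters above interact, and why temperature = 0 reduces to a deterministic argmax suited to probing peak probabilities, can be sketched with stdlib-only code (an illustration, not Meta's implementation):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=50, rng=random):
    """Top-k sampling with temperature. temperature=0 degenerates to
    greedy argmax decoding, which always picks the peak-probability token."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the top_k highest-scoring tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 0.5, -1.0, 3.0]
print(sample_token(logits, temperature=0))  # greedy: index of the max logit, 3
```

With top_k=2 on these logits, only indices 3 and 0 can ever be emitted, which is the point of top-k: low-probability tails are cut off entirely.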
Potential Mitigation Strategies
To address unwanted memorization, practitioners are exploring:
- Data deduplication at scale, using locality-sensitive hashing to remove near-identical passages.
- Differentially private SGD, imposing noise on gradient updates to limit memorization of rare examples.
- Adaptive curriculum learning, exposing the model to high-frequency examples less often.
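The first strategy can be illustrated with MinHash, the signature scheme that locality-sensitive hashing builds on: near-identical passages agree on most positions of their signatures, so a dedup pipeline can bucket them together and drop the copies. A minimal sketch (illustrative only; a production system would band signatures into LSH buckets rather than compare pairs directly):

```python
import hashlib

def shingles(text, k=8):
    """Character k-grams; word n-grams work equally well."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_perm=64):
    """One minimum per salted hash function. Two near-identical passages
    agree on most positions of this signature."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return sig

def est_jaccard(a, b):
    """Fraction of agreeing signature positions estimates Jaccard similarity."""
    sig_a, sig_b = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

A passage and a lightly edited copy of it score near 1.0; unrelated passages score near 0, so a threshold around 0.8 catches near-duplicates without pairwise set comparison.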
Regulatory and Ethical Landscape
As of mid-2025, the EU AI Act requires watermarking of AI-generated text and audits for memorization. In the U.S., the Copyright Office is evaluating safe-harbor provisions for model training, while pending bills in Congress aim to define fair use boundaries for LLM pretraining.
Future Research Directions
Key open questions for academia and industry:
- Can rate-limiting and token-masking during training reduce verbatim reproduction without degrading generalization?
- What role do advanced memorization detectors like Rolling Winnowing play in model governance?
- How can open benchmarks like the Memorization Attribution Suite (MAS) standardize testing across closed and open-weight models?
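For context, classic winnowing (Schleimer, Wilkerson, and Aiken, 2003) fingerprints a text by keeping the minimum hash in each window of consecutive k-gram hashes; whether the "Rolling Winnowing" detector named above works exactly this way is my assumption. A minimal sketch of the classic algorithm:

```python
import hashlib

def kgram_hashes(text, k=5):
    """64-bit hash of every character k-gram in the text."""
    return [int.from_bytes(hashlib.sha1(text[i:i + k].encode()).digest()[:8], "big")
            for i in range(len(text) - k + 1)]

def winnow(text, k=5, w=4):
    """Winnowing fingerprint: the minimum hash in every window of w
    consecutive k-gram hashes. Fingerprints shared between a model's
    output and a source text flag likely verbatim reuse."""
    hashes = kgram_hashes(text, k)
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

src = "It was a bright cold day in April, and the clocks were striking thirteen."
out = "...and the clocks were striking thirteen."
overlap = winnow(src) & winnow(out)
print(len(overlap) > 0)  # shared fingerprints reveal the copied span
```

The guarantee that makes winnowing useful for governance: any shared substring longer than w + k - 1 characters is certain to contribute at least one shared fingerprint, so sufficiently long verbatim reproduction cannot slip through.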
Legal Implications: Three Theories of Liability
In U.S. copyright law, liability theories include:
- Unauthorized reproduction during the training data ingestion phase.
- Derivative work creation, where copyrighted text is embedded in model weights.
- Infringing output, when the model generates protected text.
Meta may invoke a fair use defense, citing Authors Guild v. Google (2015), but the sheer scale of memorized text complicates any claim of "transformative use." Closed-weight providers (OpenAI, Anthropic, Google) can deploy output filters and largely avoid external audits, while open-weight models face heightened scrutiny precisely because anyone can inspect and probe their weights.
Conclusion
This study underscores that memorization is not a fringe behavior but a quantifiable phenomenon that varies dramatically across models, datasets, and hyperparameters. As litigation advances and regulation tightens, AI labs must adopt rigorous data-governance protocols and invest in privacy-preserving training techniques to balance innovation with copyright compliance.