OpenAI and Media Outlet Differ on Scope of Data Needs

End of Discussion?
OpenAI has submitted a revised proposal in its ongoing copyright suit with The New York Times, offering to produce 20 million ChatGPT conversation logs for review—one-sixth of the 120 million logs sought by the newspaper and other news plaintiffs. The move follows last month’s court order granting broad access to user data and reflects OpenAI’s attempt to balance transparency with user privacy, technical feasibility, and data-security concerns.
Case Background and Recent Developments
In July 2025, U.S. District Judge Ona Wang granted The New York Times and allied plaintiffs permission to search millions of ChatGPT logs for potentially infringing outputs. OpenAI immediately appealed the scope, branding the demand as “mass surveillance” without statistical justification. After losing that initial skirmish, OpenAI pivoted to negotiating a narrower, statistically valid data sample.
“A 20 million-log sample, selected via stratified random sampling, is sufficient to estimate the prevalence of regurgitation at a 95% confidence level with ±0.5% margin of error,” said Taylor Berg-Kirkpatrick, associate professor of computer science at the University of California, San Diego, and the sole data expert to file an amicus brief.
Despite Berg-Kirkpatrick’s recommendation, the NYT and co-plaintiffs countered with an “extraordinary request” for the complete set of 120 million logs spanning a 23-month period. OpenAI’s Wednesday filing outlines the prohibitive technical and operational burdens this would impose.
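For context, the textbook sample-size arithmetic behind a confidence bound like the one Berg-Kirkpatrick cites can be sketched in a few lines. This is a simplification: it assumes simple random sampling of a binary "regurgitation" indicator and ignores the stratification and rare-event corrections that push real-world requirements far higher.

```python
import math

def sample_size(margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Minimum n to estimate a proportion p within +/- margin at the
    confidence level implied by z (z = 1.96 corresponds to 95%)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Worst-case variance (p = 0.5), +/-0.5% margin at 95% confidence
print(sample_size(0.005))  # -> 38416
```

Note that this worst-case bound is orders of magnitude below 20 million logs; stratifying across time periods and model versions, and estimating a prevalence far below 50%, are presumably what drive the proposed sample into the millions.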
Technical Challenges of Retrieving Deleted Chats
OpenAI stores ChatGPT logs in cold-tier object storage (AWS S3 Glacier Deep Archive), compressed in Apache Parquet format. Each log is an unstructured JSON blob, often exceeding 5,000 words of content and metadata even when the underlying user prompt is brief. To prepare logs for external review, OpenAI engineers must:
- Locate each log among tens of billions via an index service (Elasticsearch).
- Invoke decompression and format conversion pipelines (PyArrow-based ETL jobs).
- Run de-identification routines to scrub PII (passwords, email addresses, IP addresses) using differential privacy libraries.
- Transfer sanitized data into a locked-down, FIPS 140-2–compliant enclave and provision read-only access.
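Of these stages, the de-identification pass is the simplest to illustrate. The sketch below uses hypothetical field names, and its regex patterns merely stand in for the dedicated PII-scrubbing and privacy tooling the filing describes:

```python
import json
import re

# Hypothetical patterns; production pipelines use dedicated PII tooling,
# not hand-rolled regexes.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scrub(text: str) -> str:
    """Replace each matched PII span with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

def sanitize_log(raw: str) -> dict:
    """Parse one JSON log blob and scrub the message contents.
    The 'messages'/'content' schema here is assumed, not documented."""
    log = json.loads(raw)
    log["messages"] = [
        {**m, "content": scrub(m["content"])} for m in log.get("messages", [])
    ]
    return log

example = json.dumps({"messages": [
    {"role": "user", "content": "Reach me at jane@example.com from 192.168.0.1"}
]})
print(sanitize_log(example)["messages"][0]["content"])
# -> Reach me at [REDACTED_EMAIL] from [REDACTED_IPV4]
```

At the scale described in the filing, a routine like this would run inside the PyArrow-based ETL jobs rather than over one blob at a time.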
At current AWS pricing, OpenAI estimates that processing 20 million logs would consume roughly one petabyte-hour of compute and require 12 weeks of continuous ETL pipeline operation; scaling to 120 million would at least triple both the time and the cost.
Statistical Sampling and Legal Standards
OpenAI argues that the NYT’s demands far exceed the Federal Rules of Civil Procedure’s proportionality principles. Key points include:
- Disproportionate Burden: 120 million logs expand data exposure risk and increase the window for potential breach or leak.
- Marginal Utility: Additional logs beyond a statistically valid sample add diminishing returns in proving paywall circumvention.
- Precedent Concerns: Granting full access could set a broad data-discovery precedent for AI services worldwide.
“Requiring access to every monthly snapshot to analyze trends over time is an extraordinary grant beyond any comparable digital-discovery case,” OpenAI’s filing states.
Implications for Data Privacy and Security
Privacy advocates caution that expanded data retention and exposure—especially of chats users believed deleted—could undermine trust in conversational AI. OpenAI’s recent policy updates now retain logs for up to 30 days even after user deletion requests, pending legal holds. The proposed “AI privilege” concept, championed by CEO Sam Altman, seeks to establish a legal doctrine treating user–bot dialogs as confidential communications.
Microsoft’s Parallel Dispute Over ChatExplorer
Co-defendant Microsoft is concurrently battling the NYT over internal ChatExplorer logs. The newspaper resists producing logs of more than 80,000 interactions by its journalists and legal staff, arguing that extraneous material would be swept in. Microsoft contends the request is integral to defending against copyright claims aimed at its own AI offerings.
Additional Analysis
1. Data Governance and Compliance Strategies
Large AI providers must implement robust data-governance frameworks. Best practices include data minimization, automated PII redaction, and end-to-end encryption in transit and at rest. Experts recommend using attribute-based access control (ABAC) and secure multi-party computation (SMPC) to limit exposure during legal reviews.
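An ABAC check of the kind recommended here reduces to evaluating a request's attributes against a policy before any log is released for review. A minimal sketch, with entirely hypothetical roles and attribute names:

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    role: str          # who is asking
    clearance: str     # legal basis for access
    purpose: str       # why the data is needed
    environment: str   # where the data will be viewed

# Hypothetical ABAC policy: every predicate must hold for access to be
# granted. Real systems express these rules in a policy language, not
# Python lambdas.
POLICY = [
    lambda r: r.role in {"reviewing_counsel", "court_expert"},
    lambda r: r.clearance == "litigation_hold",
    lambda r: r.purpose == "discovery_review",
    lambda r: r.environment == "secure_enclave",
]

def allowed(request: AccessRequest) -> bool:
    """Grant access only if every policy predicate is satisfied."""
    return all(rule(request) for rule in POLICY)

print(allowed(AccessRequest("reviewing_counsel", "litigation_hold",
                            "discovery_review", "secure_enclave")))  # True
print(allowed(AccessRequest("engineer", "litigation_hold",
                            "discovery_review", "laptop")))          # False
```

The design point is that access decisions depend on the conjunction of attributes (role, legal basis, purpose, environment) rather than on role alone, which is what lets a discovery enclave deny even otherwise-privileged users outside the approved context.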
2. Future of AI Privacy Litigation
This case may become a bellwether for how courts handle AI log discovery. Legal scholars foresee a potential rise in specialized e-discovery tools architected for AI datasets, incorporating homomorphic encryption and zero-knowledge proofs to satisfy both discovery obligations and privacy safeguards.
3. Industry Reactions and Expert Opinions
Dr. Emily Bender, professor at the University of Washington, notes: “The intersection of AI transparency and user privacy is a major regulatory frontier. How this suit resolves will guide policy in the EU’s AI Act and forthcoming U.S. privacy legislation.”
Next Steps
Both parties have agreed to a confidential conference on August 7 to hammer out sample-size parameters. Observers anticipate that a compromise sample—potentially with tiered escalation—may emerge, balancing evidentiary needs against technical and privacy constraints.