MUNCH AI Tool Fails on VA Contract Cancellations

In early 2025, facing a mandate to review 90,000 federal contracts in just 30 days, the Department of Government Efficiency (DOGE) turned to an in-house AI prototype to identify ‘nonessential’ Veterans Affairs agreements. Built under an extreme time crunch by developers lacking procurement domain expertise, the system generated widespread inaccuracies and risked undermining veteran care.
Background and Goals
The Trump administration’s February 2025 executive order instructed all cabinet-level agencies to assess the utility and cost of existing contracts. With the VA holding over 76,000 active contracts worth nearly $100 billion annually, manual review was deemed impossible within 30 days. DOGE, overseen by Elon Musk until his departure in April, proposed an AI-driven ‘contract munching’ tool leveraging off-the-shelf large language models (LLMs).
Key Objectives
- Rapidly classify contracts as ‘MUNCHABLE’ or essential
- Minimize human workload by prefiltering low-value deals
- Provide transparency through open-source code release
Technical Architecture
The tool’s core was a Python-based pipeline using GPT-3.5-Turbo via a FedRAMP-approved API for text classification. Data ingestion relied on bulk downloads from the Federal Procurement Data System (FPDS) in CSV format, parsed with pandas. Preprocessing included basic OCR for scanned PDFs and truncation to the first 2,500 words of each document, a cutoff chosen to fit within the token window of the chosen model.
Model Selection and Limitations
DOGE engineer Sahil Lavingia selected GPT-3.5-Turbo to reduce costs, at roughly $0.06 per 1,000 tokens processed. However, the model has a context window of only 4,096 tokens and no domain-specific fine-tuning for procurement terminology. As a result, the system frequently misread numerical values and contract scopes, in some cases producing confident but incorrect figures, the failure mode commonly called hallucination.
Prompt Engineering and Scoring
Each contract snippet was fed a structured prompt instructing the model to:
- Extract the contract number and stated total value
- Determine if the service supports direct patient care or is a back-office function
- Assign a binary label: ‘MUNCHABLE’ or ‘SAFE’
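A plausible reconstruction of such a prompt is sketched below. The exact wording in the released code is not reproduced here, so the template text is an assumption:

```python
# Hypothetical prompt template; the wording is illustrative,
# not DOGE's actual prompt.
PROMPT_TEMPLATE = """You are reviewing a federal contract for the VA.

Contract text:
{snippet}

1. Extract the contract number and the stated total value.
2. State whether the service supports direct patient care or is a
   back-office function.
3. Answer with exactly one label: MUNCHABLE or SAFE.
"""

def build_prompt(snippet: str) -> str:
    """Fill the template with a truncated contract snippet."""
    return PROMPT_TEMPLATE.format(snippet=snippet)
```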
Output was parsed with regular expressions and no robust validation, so documents containing multiple monetary figures frequently yielded the wrong value.
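That failure mode is easy to reproduce. Below is a sketch, assuming a naive first-match regex of the kind described, next to a version that anchors the figure to an explicit label (pattern and sample text are illustrative):

```python
import re

MONEY = re.compile(r"\$([\d,]+(?:\.\d{2})?)")

def naive_value(text: str):
    """Grab the first dollar figure found -- the failure mode described above."""
    m = MONEY.search(text)
    return m.group(1) if m else None

def labeled_value(text: str):
    """Safer: anchor the figure to an explicit 'total ... value' label."""
    m = re.search(r"[Tt]otal\s+\w*\s*value:?\s*\$([\d,]+)", text)
    return m.group(1) if m else None

sample = "Option year ceiling: $34,000,000. Total obligated value: $35,000."
# naive_value(sample) returns the $34M ceiling; labeled_value(sample)
# returns the $35,000 obligation.
```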
Flaws and Real-World Impact
A ProPublica analysis found over 2,000 contracts flagged as ‘MUNCHABLE’, including:
- A gene sequencing equipment maintenance contract the tool valued at $34 million (actual value: $35,000)
- Blood sample analysis services essential to ongoing VA cancer research
- Software licenses for patient data monitoring tools used by VA nurses
At least two dozen flagged contracts were officially canceled before human review could catch the errors, endangering research continuity and potentially delaying veteran care.
Expert Opinions and Governance Concerns
Cary Coglianese, Penn Law professor specializing in AI regulation, warned that general-purpose LLMs lack the reliability for complex procurement decisions. Former Treasury IT contracting head Waldo Jaquith described the approach as ‘deeply problematic’. NIST’s AI Risk Management Framework, updated in late 2024, recommends rigorous human oversight and model testing — a process DOGE bypassed.
‘AI gives convincing looking answers that are frequently wrong. There needs to be humans whose job it is to do this work.’ — Waldo Jaquith
Security, Compliance, and Ethical Considerations
Handling VA contracts involves Protected Health Information (PHI), invoking HIPAA and FedRAMP requirements. The tool’s open-source release on GitHub lacked a data handling policy, raising security and privacy concerns. Experts recommend model cards and datasheets for datasets to improve transparency and accountability.
Technical Analysis Deep Dive
Detailed logs show the pipeline making up to five API calls per contract, each incurring 300–500 ms of latency. No rate limiting or retry logic was implemented, so large PDF files produced timeouts and silent failures. Data parsing relied solely on naive regex patterns rather than structured-extraction libraries such as Apache Tika or PDFMiner.
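A minimal retry wrapper with exponential backoff, the kind of safeguard the logs show was missing, might look like the sketch below; `fn` stands in for the LLM API call, and the delay values are illustrative:

```python
import random
import time

def call_with_retry(fn, *args, retries: int = 4, base_delay: float = 0.5):
    """Retry a flaky API call with exponential backoff plus jitter.

    A minimal sketch of the logic the pipeline lacked; without it,
    a single timeout silently drops a contract from the review.
    """
    for attempt in range(retries):
        try:
            return fn(*args)
        except TimeoutError:
            if attempt == retries - 1:
                raise  # surface the failure instead of swallowing it
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```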
Policy Recommendations and Future Directions
- Adopt domain-specific models or fine-tune existing LLMs on VA procurement data
- Implement robust data validation pipelines using FPDS API endpoints and deterministic parsing
- Establish an AI governance board involving procurement, legal, and veteran care experts
- Leverage cloud-based MLOps platforms (AWS GovCloud or Azure Government) for secure model training, monitoring, and auditing
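As one concrete element of such a validation pipeline, a model-extracted dollar figure could be cross-checked against the authoritative FPDS record before any cancellation decision. The interface and tolerance below are assumptions, not part of any existing system:

```python
def needs_human_review(llm_value: float, fpds_value: float,
                       tolerance: float = 0.05) -> bool:
    """Flag a contract for human review when the LLM-extracted value
    diverges from the authoritative FPDS figure by more than `tolerance`.

    Signature and 5% tolerance are illustrative assumptions.
    """
    if fpds_value == 0:
        return llm_value != 0
    return abs(llm_value - fpds_value) / fpds_value > tolerance
```

Under this check, the $34 million misreading of a $35,000 contract described above would have been routed to a human instead of a cancellation queue.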
Conclusion
The missteps of the ‘MUNCH’ tool underscore the risks of deploying unvetted AI solutions in high-stakes government environments. As agencies increasingly embrace AI to drive efficiency, integrating technical rigor, domain knowledge, and robust oversight will be critical to safeguarding public trust and ensuring essential services are not inadvertently compromised.