Crunchy Audio Issues in Google’s Veo 3 AI Video Upgrade

Introduction
On May 21, 2025, Google unveiled Veo 3, its most advanced AI-driven video synthesis system to date. For the first time in a publicly accessible model, Veo 3 delivers synchronized high-definition video and audio in eight-second clips that include coherent dialogue, music, and sound effects. Almost immediately, AI enthusiasts gravitated toward the now-famous “Will Smith eating spaghetti” benchmark, only to discover a curious quirk: the pasta sounds suspiciously crunchy.
The Spaghetti Benchmark: From Horror to Hilarity
The spaghetti test traces back to March 2023, when an open-source model called ModelScope generated grotesquely distorted, silent, nightmarish frames of actor Will Smith slurping pasta. Smith himself parodied the clip in February 2024, cementing its place in AI folklore. Even though Runway’s proprietary Gen-2 model had produced superior visuals by then, the ModelScope meme endured as a yardstick for progress in video synthesis.
Veo 3 in Action: Pushing the Technical Envelope
Veo 3 represents a major architectural leap. Under the hood, it couples a Vision Transformer backbone (ViT-Veo) with a parallel Audio Transformer (AuT-512) and a Cross-Modal Fusion layer. During inference, Veo 3 operates at:
- Resolution: 1080p at 24 fps
- Clip length: up to 8 seconds
- Audio rates: 48 kHz stereo, 16-bit PCM
- Model size: 22 billion parameters
- Latency: ~3.5 seconds per clip on TPU v5
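Those audio figures pin down the raw payload of a clip. As a quick sanity check (plain PCM arithmetic, not a published Google number), the uncompressed audio for one maximum-length clip works out to about 1.5 MB:

```python
sample_rate = 48_000        # Hz (48 kHz)
channels = 2                # stereo
bytes_per_sample = 2        # 16-bit PCM
clip_seconds = 8            # maximum clip length

# Raw, pre-compression audio payload for one full-length clip.
audio_bytes = sample_rate * channels * bytes_per_sample * clip_seconds
print(audio_bytes)          # 1536000 bytes, i.e. ~1.5 MB of raw audio per clip
```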
Google’s team employed a multimodal contrastive learning regimen, training Veo 3 on 2 million hours of paired video–audio data, including film, news broadcasts, and user-generated content. A dedicated audio corpus enriched the model’s ability to generate environmental soundscapes, lip-synced speech, and musical accompaniments.
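Google has not published Veo 3’s exact training objective, but multimodal contrastive learning typically means a CLIP-style symmetric loss that pulls paired video and audio embeddings together while pushing mismatched pairs apart. A minimal NumPy sketch, with all shapes, names, and the temperature value purely illustrative:

```python
import numpy as np

def contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/audio embeddings."""
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature      # (batch, batch); matches on the diagonal
    n = len(logits)

    def xent(l):
        # Cross-entropy where the true match sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the video->audio and audio->video directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))                 # 4 clips, 8-dim embeddings
matched = contrastive_loss(video, video)        # correctly paired batch
scrambled = contrastive_loss(video, video[::-1].copy())  # misaligned pairs
```

Correctly paired batches should score a much lower loss than scrambled ones, which is what drives the two modalities into a shared embedding space.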
Crunchy Audio Glitch Explained
- Imbalanced Data Distribution: The training set contained many instances of crunchy sounds (chips, nuts, biting) but relatively few examples of slurping or chewing soft foods.
- Prediction Bias: As a generative predictor, Veo 3 often substitutes the nearest high-energy mouth sound, hence crunchy noises when prompted with “eating spaghetti”.
- Domain Mismatch: Audio tokens for food interactions cluster together in the latent space, causing crossover between crunchy and non-crunchy categories.
“Veo 3’s audio model is learning statistical correlations, not biomechanics,” explains Dr. Lina Chen, an audio ML specialist at Stanford’s Center for Computer Research in Music and Acoustics. “When the corpus over-represents crunchy mouth sounds, the model defaults to that pattern.”
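The failure mode Chen describes, a majority pattern swallowing a nearby minority one, can be reproduced in miniature with a toy nearest-neighbour “sound picker” over a deliberately skewed corpus. Everything here is synthetic and illustrative; Veo 3’s real latent space is not public:

```python
import random

random.seed(42)
# Toy 2-D "latent space" of mouth-sound embeddings: crunchy sounds
# are over-represented 20:1, mirroring the imbalance described above.
crunch = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(200)]
slurp = [(random.gauss(1.5, 1.0), random.gauss(1.5, 1.0)) for _ in range(10)]

def predict(query, k=5):
    """k-nearest-neighbour vote over the imbalanced corpus."""
    data = [("crunch", p) for p in crunch] + [("slurp", p) for p in slurp]
    data.sort(key=lambda lp: (lp[1][0] - query[0]) ** 2
                             + (lp[1][1] - query[1]) ** 2)
    votes = [label for label, _ in data[:k]]
    return max(set(votes), key=votes.count)

# A neutral query is dominated by the majority class...
print(predict((0.0, 0.0)))
# ...and because the clusters overlap and crunch points are 20x denser,
# even a query near the slurp centre is often still labelled "crunch".
print(predict((1.5, 1.5)))
```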
Expert Insights
We reached out to several AI researchers and audio engineers for their take on Veo 3’s performance:
- Dr. Markus Feldman, Senior Researcher at DeepMind: “Veo 3 is the first large-scale system to jointly optimize video frames and audio waveforms at high fidelity. The remaining artifacts are well within expectations for a model of this size and scope.”
- Alex Rodriguez, Lead Acoustic Engineer at Dolby Laboratories: “Crunchy spaghetti is a data imbalance issue. Fine-tuning with targeted audio clips of slurping and chewing could quickly resolve the glitch.”
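Rodriguez’s suggested fix, fine-tuning on a rebalanced corpus, often begins with simple oversampling of the under-represented categories. A minimal sketch using hypothetical clip names:

```python
import random

def rebalance(corpus, seed=0):
    """Oversample under-represented sound categories to match the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for label, clip in corpus:
        by_label.setdefault(label, []).append(clip)
    target = max(len(clips) for clips in by_label.values())
    balanced = []
    for label, clips in by_label.items():
        # Keep every original clip, then draw extras with replacement.
        extras = [rng.choice(clips) for _ in range(target - len(clips))]
        balanced += [(label, c) for c in clips + extras]
    return balanced

corpus = ([("crunch", f"crunch_{i}.wav") for i in range(200)]
          + [("slurp", f"slurp_{i}.wav") for i in range(10)])
balanced = rebalance(corpus)  # every category now appears 200 times
```

Oversampling with replacement is the crudest rebalancing strategy; in practice one would also collect new targeted recordings, as Rodriguez suggests, rather than only duplicating the few existing ones.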
Technical Specifications in Detail
Beyond the headline metrics, Veo 3’s infrastructure features:
- Tensor Processing Units (TPU) v5 Pods with 512 TFLOPS per core
- Mixed-precision training (bfloat16) to balance memory and compute efficiency
- Cross-modal attention windows spanning 512 video tokens and 1,024 audio tokens
- An API endpoint via Google Cloud AI Platform, enabling batch jobs and real-time queries
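Google has not detailed the fusion layer, but cross-modal attention over windows of 512 video tokens and 1,024 audio tokens can be sketched as standard scaled dot-product attention, with video tokens as queries and audio tokens as keys and values. Projection sizes and random weights here are illustrative stand-ins:

```python
import numpy as np

def cross_modal_attention(video_tokens, audio_tokens, d_k=64, seed=0):
    """Single-head scaled dot-product cross-attention:
    video tokens are the queries; audio tokens supply keys and values."""
    rng = np.random.default_rng(seed)
    d_vid, d_aud = video_tokens.shape[1], audio_tokens.shape[1]
    Wq = rng.normal(scale=0.02, size=(d_vid, d_k))   # stand-in learned weights
    Wk = rng.normal(scale=0.02, size=(d_aud, d_k))
    Wv = rng.normal(scale=0.02, size=(d_aud, d_k))
    Q, K, V = video_tokens @ Wq, audio_tokens @ Wk, audio_tokens @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_video, n_audio)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over audio tokens
    return weights @ V                               # audio-conditioned video features

rng = np.random.default_rng(1)
video = rng.normal(size=(512, 128))    # one 512-token video window
audio = rng.normal(size=(1024, 128))   # one 1,024-token audio window
fused = cross_modal_attention(video, audio)
```

Each of the 512 video tokens ends up with a feature vector that is a softmax-weighted mixture of the 1,024 audio tokens, which is what lets lip movements and sound effects stay synchronized.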
Content Filters and Ethical Considerations
Google has implemented a celebrity face filter to block direct impersonations of public figures like Will Smith. However, as demonstrated by early testers such as Javi Lopez on X, adversarial prompting can sometimes circumvent these restrictions. This raises important ethical questions:
- Deepfake Misuse: Realistic voice and likeness replication could enable misinformation campaigns.
- Privacy Violations: Generating non-consensual imagery or audio of private individuals.
- Regulatory Pressure: Governments in the EU and US are considering legislation to mandate watermarking or provenance tracking for synthetic media.
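Provenance tracking of the kind regulators are weighing can be as simple as binding a cryptographic hash of the media to its generation metadata. A toy sketch, loosely in the spirit of content-credential schemes such as C2PA; these field names are illustrative, not any standard’s actual schema:

```python
import hashlib
import json

def provenance_manifest(media_bytes, generator, created):
    """Bind a SHA-256 digest of the media to its generation metadata."""
    return json.dumps({
        "sha256": hashlib.sha256(media_bytes).hexdigest(),
        "generator": generator,
        "created": created,
    }, sort_keys=True)

def verify(media_bytes, manifest):
    """True only if the media is byte-identical to what was manifested."""
    expected = json.loads(manifest)["sha256"]
    return hashlib.sha256(media_bytes).hexdigest() == expected

manifest = provenance_manifest(b"\x00fake-mp4-bytes",
                               "example-generator", "2025-05-21")
```

A real scheme must also cryptographically sign the manifest and survive transcoding; a bare hash breaks the moment the file is re-encoded, which is why standards work in this area is ongoing.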
Future Directions
AI video synthesis is evolving rapidly. Veo 3’s next major updates are expected to include:
- Adaptive audio tokenization to reduce crossover artifacts.
- Longer clip support (up to 60 seconds) with dynamic scene transitions.
- Integrated text-to-speech (TTS) fine-tuning for personalized voices.
Meanwhile, competitors such as Meta’s Video LLaMA and Adobe’s Project Firefly Video are racing to match or exceed Veo 3’s capabilities. Ensemble approaches combining multiple models are also gaining traction in the research community.
Conclusion
Google’s Veo 3 has set a new benchmark in AI video synthesis by achieving synchronized, high-quality audio and visuals. While the infamous crunchy spaghetti glitch exposes the challenges of balanced training data and pattern-based prediction, targeted fine-tuning and ongoing research promise rapid improvement. As these systems become more capable and accessible via cloud APIs, the cultural and technical implications will only intensify, bringing a new era of digital content creation ever closer.
Bon appétit (just beware of the crunch)!