AI Models’ Challenges in Predicting Gene Activity

Some AI tools don’t understand biology yet
Despite spectacular AI successes in protein folding and enzyme design, recent work reveals that foundation models trained on single-cell transcriptomic data fail to outperform simplistic baselines when predicting gene expression changes. This finding underscores the extraordinary complexity of cellular regulation and cautions against overhyping AI’s capabilities in genomics.
AI and Gene Activity: The Current Landscape
Genome-scale datasets—ranging from bulk RNA-seq to single-cell RNA-seq and multiplexed Perturb-seq experiments—provide an unprecedented view of how genes are regulated under diverse conditions. Researchers have leveraged transformer architectures, graph neural networks, and variational autoencoders to build so-called single-cell foundation models. These models are pre-trained on millions of cells’ gene expression profiles in an unsupervised manner, with the aim of capturing a generalizable representation of cellular states.
“Foundation models hold promise for in silico hypothesis generation, but their predictive power in perturbation assays remains marginal,” says Dr. Avanti Rao, a computational biologist at the Broad Institute.
Underwhelming Performance in Perturb-seq Predictions
Constantin Ahlmann-Eltze, Wolfgang Huber, and Simon Anders (Nature Methods, 2025) systematically evaluated multiple leading AI packages for their ability to predict gene expression changes following CRISPR-mediated activation of one or two genes. They compared model outputs against two naïve baselines:
- Null baseline: predict no change in gene expression.
- Additive baseline: sum the individual effects of each single-gene perturbation.
Across 100 single-gene and 62 dual-gene activation experiments, all foundation models exhibited substantially higher mean-squared error than the additive baseline. In particular, they rarely captured synergistic or antagonistic interactions—key hallmarks of gene regulatory networks.
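To make the two baselines concrete, here is a minimal NumPy sketch (illustrative synthetic fold-changes, not the authors' evaluation code): the null baseline predicts zero change everywhere, while the additive baseline predicts the dual-perturbation response as the sum of the two single-perturbation effects, and each is scored by mean-squared error against the observed response.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 5

# Hypothetical log fold-changes after perturbing gene A alone and gene B alone.
effect_a = rng.normal(0, 1, n_genes)
effect_b = rng.normal(0, 1, n_genes)
# Observed response to the dual perturbation: here constructed to be
# near-additive, with a small non-additive residual.
observed_ab = effect_a + effect_b + rng.normal(0, 0.1, n_genes)

null_pred = np.zeros(n_genes)        # null baseline: predict no change
additive_pred = effect_a + effect_b  # additive baseline: sum of single effects

def mse(pred, obs):
    """Mean-squared error between predicted and observed fold-changes."""
    return float(np.mean((pred - obs) ** 2))

print("null MSE:    ", mse(null_pred, observed_ab))
print("additive MSE:", mse(additive_pred, observed_ab))
```

When responses are close to additive, as in this toy example, the additive baseline's error is dominated by the small interaction residual, which is exactly why it is so hard for a learned model to beat.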
Technical Challenges in Modeling Gene Regulatory Networks
Several factors underlie this shortfall:
- Dimensionality and sparsity: Single-cell data often spans roughly 20,000 genes, yet any individual cell expresses—and sequencing detects—only a fraction of them, yielding high-dimensional, sparse count matrices that challenge deep learning optimization.
- Nonlinear dynamics: Feedback loops, post-translational modifications, and epigenetic states introduce complex, time-dependent behaviors that static snapshot models cannot easily capture.
- Lack of multi-omics integration: Foundation models trained solely on RNA counts neglect chromatin accessibility (ATAC-seq), histone modifications, and spatial context, which collectively shape transcriptional outcomes.
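The sparsity point above can be seen in a toy cell-by-gene count matrix; the numbers below are made up to mimic the low per-cell capture typical of droplet-based scRNA-seq.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy count matrix: 100 cells x 2,000 genes with low Poisson rates,
# mimicking the dropout-heavy counts seen in real single-cell data.
counts = rng.poisson(0.1, size=(100, 2000))

# Fraction of zero entries: the quantity that makes these matrices "sparse".
sparsity = float(np.mean(counts == 0))
print(f"fraction of zero entries: {sparsity:.2f}")
```

In practice such matrices are stored in compressed sparse formats rather than dense arrays, since the overwhelming majority of entries are zero.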
Incorporating Multi-Omics and Spatial Data
Emerging research suggests that integrating orthogonal data types can significantly enhance prediction accuracy:
- Chromatin and epigenetics: Models like EpiGenFormer incorporate ATAC-seq and ChIP-seq profiles via attention mechanisms to infer regulatory element interactions.
- Spatial transcriptomics: New graph-based frameworks embed location coordinates to model cell–cell signaling effects on gene expression.
- Proteomics and metabolomics: Hybrid multimodal networks merge transcript and protein abundance to reflect post-transcriptional regulation.
A recent preprint from Stanford’s CZI labs reports up to a 30% improvement in perturbation prediction accuracy when multi-omics features are integrated into a graph transformer backbone (bioRxiv, 2025).
Expert Opinions and Industry Perspectives
“We need hybrid models that combine mechanistic network inference with deep learning,” argues Dr. Maria Strunz, head of computational genomics at Genentech. “Purely data-driven approaches hit a ceiling unless they reason about causality.”
Companies like Deep Genomics and Insitro are now embedding biochemical prior knowledge—such as transcription factor binding motifs and enhancer–promoter contacts—into their AI pipelines to improve interpretability and predictive power.
Future Directions: Hybrid and Physics-Informed AI
To bridge the gap between abstract embeddings and biological realism, researchers are exploring:
- Physics-informed neural networks: encode differential equations governing gene regulatory kinetics into the loss function.
- Causal inference frameworks: leverage do-calculus and structural equation models to disentangle direct from indirect gene interactions.
- Transfer learning: fine-tune pre-trained models on small, high-quality perturbation datasets to capture rare but critical synergies.
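The physics-informed idea above can be sketched as a composite loss: a data-fit term plus a penalty on the residual of an assumed regulatory ODE. The linear dynamics dx/dt = Wx − decay·x used here are purely illustrative, not from any published model, and the residual is estimated by finite differences rather than autodiff.

```python
import numpy as np

def physics_informed_loss(x_pred, x_obs, t, W, decay, lam=1.0):
    """Data-fit MSE plus an ODE-residual penalty.

    x_pred, x_obs: (timepoints, genes) expression trajectories.
    t: (timepoints,) sampling times; W: (genes, genes) interaction matrix.
    """
    data_term = np.mean((x_pred - x_obs) ** 2)
    # Finite-difference estimate of dx/dt along the predicted trajectory.
    dxdt = np.gradient(x_pred, t, axis=0)
    # Residual of the assumed dynamics dx/dt = W @ x - decay * x.
    residual = dxdt - (x_pred @ W.T - decay * x_pred)
    ode_term = np.mean(residual ** 2)
    return float(data_term + lam * ode_term)
```

In a real physics-informed network, this loss would be minimized jointly over network weights (and possibly W and decay), so that predictions are pulled toward both the data and the assumed kinetics.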
As the field advances, collaboration between experimental biologists and AI specialists will be crucial. Large-scale consortium efforts, such as the Human Cell Atlas and ENCODE, are generating richer multi-modal datasets that promise to fuel next-generation models.
Conclusion
While foundation models have transformed areas like protein structure prediction, their current incarnations are not yet ready to replace bench experiments in transcriptomics. Continued innovation in model architectures, data integration, and causality-driven approaches will be essential to unlock AI’s full potential in understanding and manipulating gene activity.