LLM In-Context Learning as Approximation of Solomonoff Induction

Published on June 5, 2025 5:45 PM GMT | Updated March 10, 2026
Epistemic status: One-week empirical project by a theoretical computer scientist, now expanded with deeper analysis, expert commentary, and recent developments in model scaling and inference optimization.
Background
In recent literature, several groups[1][2] have proposed that Large Language Model (LLM) in-context learning (ICL) can be understood as a practical approximation to Solomonoff induction. Solomonoff induction is the optimal Bayesian predictor over all computable hypotheses, but it is uncomputable. Meanwhile, LLMs solve a prequential prediction task (predicting each token in a sequence given its prefix) using billions of learned parameters and a log-likelihood objective that parallels the log-loss under which Solomonoff's universal prior is optimal.
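Concretely, the two objectives can be written side by side. The notation below is the standard one (U a universal prefix machine, p ranging over programs whose output begins with x, and p_θ the LLM's learned conditional distribution), not taken from any single cited paper:

```latex
% Solomonoff's universal prior and its induced predictor:
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|},
\qquad
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})}.

% The prequential log-loss an LLM with parameters \theta minimizes on its corpus:
\mathcal{L}(\theta) \;=\; -\sum_{t} \ln p_\theta(x_{t+1} \mid x_{1:t}).
```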
From a theoretical perspective, both frameworks address sequence prediction under log-loss, but they differ in:
- Model class: Solomonoff uses all computable programs weighted by 2^(−|program|), whereas LLMs are fixed-architecture transformers trained on internet text.
- Data distribution: Universal distribution vs. empirical text distribution.
- Computational feasibility: Solomonoff is uncomputable; LLMs are highly optimized via GPU/TPU parallelism and sparse attention variants.
This article reports an empirical test: whether an LLM pretrained on text (GPT-2 variants) can predict samples drawn from a practical sampler of the universal distribution nearly as well as a transformer explicitly trained to approximate that distribution (the Solomonoff Induction Transformer, or SIT).
Methodology
We used the open-source DeepMind universal sampler, which generates program–output pairs of up to 20,000 tokens. Key parameters:
- Alphabet size: binary {0,1} to eliminate tokenization artifacts.
- Sampler budget: 10,000 unique program traces per run.
- Model suite:
  - SIT: an 8-layer, 512-hidden-dimension, 8-head transformer trained for 52k steps.
  - GPT-2 small (124M), medium (355M), large (774M), and XL (1.5B) from Hugging Face.
  - Context-Tree Weighting (CTW) baselines: fixed-depth k-Markov models for k = 0…5 (a minimal sketch follows this list).
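For reference, here is a minimal sketch of the kind of fixed-depth baseline we mean. It uses the Krichevsky–Trofimov (KT) estimator, the per-context estimator underlying CTW, as a single k-Markov model rather than a full CTW mixture; the class name and the example call are illustrative, not the exact implementation used in the experiments.

```python
# Minimal sketch of a fixed-depth k-Markov baseline with the KT estimator.
# Illustrative only; the experiments' exact CTW code may differ.
from collections import defaultdict
from math import log

class KTMarkov:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: [0, 0])  # context -> [#zeros, #ones]

    def _key(self, context):
        # Condition on the last k bits (shorter near the start of the sequence).
        return tuple(context[-self.k:]) if self.k > 0 else ()

    def prob_one(self, context):
        c0, c1 = self.counts[self._key(context)]
        return (c1 + 0.5) / (c0 + c1 + 1.0)        # KT estimate with 1/2 pseudo-counts

    def update(self, context, bit):
        self.counts[self._key(context)][bit] += 1

def ln_loss(model, bits):
    """Cumulative natural-log loss of the model on a bit sequence."""
    total, ctx = 0.0, []
    for b in bits:
        p1 = model.prob_one(ctx)
        total += -log(p1 if b == 1 else 1.0 - p1)
        model.update(ctx, b)
        ctx.append(b)
    return total

print(ln_loss(KTMarkov(k=2), [0, 1, 1, 0, 1, 1, 0]))
```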
We restricted LLM predictions to binary outputs by masking the logits to the two bit tokens and renormalizing with a softmax. Each model consumed comma-separated bits (e.g., “0,1,1,0…”). A preliminary check showed that English-word encodings (“zero, one, …”) offered no advantage.
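The masking step is simple enough to show in full. The sketch below scores a bit sequence with off-the-shelf GPT-2, keeping only the logits for the “0” and “1” tokens at each bit position and renormalizing; it assumes each digit maps to a single GPT-2 BPE token, and it is a simplified stand-in rather than the exact evaluation code used in the experiments.

```python
# Minimal sketch: cumulative ln-loss of GPT-2 on a comma-separated bit string,
# with the output distribution masked to the two bit tokens and renormalized.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Assumes "0" and "1" each tokenize to a single BPE token (true for GPT-2).
bit_ids = [tokenizer.encode("0")[0], tokenizer.encode("1")[0]]

def binary_ln_loss(bits):
    """Natural-log loss summed over the bit positions of the sequence."""
    text = ",".join(str(b) for b in bits)
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits[0]          # (seq_len, vocab_size)
    total = 0.0
    for pos in range(1, ids.shape[1]):         # skip the unconditioned first token
        tok = ids[0, pos].item()
        if tok not in bit_ids:                 # ignore the comma separators
            continue
        two_way = torch.softmax(logits[pos - 1, bit_ids], dim=-1)  # renormalize over {0, 1}
        total += -torch.log(two_way[bit_ids.index(tok)]).item()
    return total

print(binary_ln_loss([0, 1, 1, 0, 1]))
```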
Results
The following plot (Figure 1) shows per-token ln-loss versus sequence position. GPT-2 variants track the SIT closely on early tokens, diverging modestly on longer traces where learned inductive biases become more apparent.

Figure 2 depicts cumulative ln-loss. Although the raw SIT is best overall, GPT-2 large and XL perform within 5% on average, greatly outperforming CTW baselines beyond k=2.

As an idealized reference, we computed a loose upper bound on Solomonoff’s ln-loss using the sampled program lengths (which the models do not see). GPT-2 models approach this bound for sequences up to 100 tokens, suggesting strong practical alignment with the universal prior in that regime.
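The bound is the standard one: any sampled program p whose output begins with the observed sequence contributes at least its own weight to the universal prior, so its length (in bits, times ln 2) upper-bounds the cumulative ln-loss of the Solomonoff predictor on that sequence:

```latex
M(x_{1:n}) \;\ge\; 2^{-|p|}
\quad\Longrightarrow\quad
-\ln M(x_{1:n}) \;\le\; |p| \ln 2 .
```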

Theoretical Underpinnings
Solomonoff induction weights each hypothesis P by 2^(−K(P)), where K(P) is the Kolmogorov complexity of P. In practice, transformers learn a parametric approximation to this weighting through gradient descent and self-attention. Recent work by Wan & Mei (2025)[1] formalized this connection by showing that multi-head attention can implement a mixture-of-experts over substring predictors, analogous to program enumeration.
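Written as a mixture, this is the picture the attention-as-experts argument appeals to: the prediction is a posterior-weighted average over hypotheses, with prior weights decaying in description length. The notation here is the standard Bayesian-mixture form, not Wan & Mei's own:

```latex
\xi(x_{t+1} \mid x_{1:t})
  \;=\; \sum_{P} w_t(P)\, P(x_{t+1} \mid x_{1:t}),
\qquad
w_t(P) \;\propto\; 2^{-K(P)}\, P(x_{1:t}).
```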
Expert Opinions
Yann LeCun, Chief AI Scientist at Meta: “This experiment empirically validates that transformers internalize a compressed representation of algorithmic priors, supporting the view that large-scale pretraining approximates universal sequence modeling.”
Zoubin Ghahramani, Professor, Cambridge University: “The convergence of in-context learning and Solomonoff induction highlights a deep synergy between statistical learning theory and algorithmic information theory.”
Implications for Model Alignment & Safety
If LLMs implicitly approximate Solomonoff induction, they may assign non-zero probability to any computable hypothesis. This has two sides:
- Diversity of reasoning: Models can generate novel, low-probability hypotheses, improving robustness.
- Malicious speculation: A prior with support on every computable hypothesis can also lend weight to hallucinations or unsafe planning sequences.
Mitigations include calibrated uncertainty estimation (e.g., temperature scaling, illustrated below) and constrained decoding (e.g., filtered beam search).
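As a concrete example of the first mitigation, temperature scaling is a one-parameter post-hoc recalibration of the logits; the temperature value below is illustrative and would in practice be fit on held-out data.

```python
# Minimal sketch of temperature scaling as post-hoc calibration.
# T > 1 softens the distribution (less overconfident); T < 1 sharpens it.
import torch

def calibrated_probs(logits: torch.Tensor, temperature: float = 1.3) -> torch.Tensor:
    """Rescale logits by a fitted temperature before converting to probabilities."""
    return torch.softmax(logits / temperature, dim=-1)
```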
Future Directions & Open Questions
- Scaling laws: How does the ln-loss gap shrink with model size beyond 10B parameters?
- Alternative architectures: Do Mixture-of-Experts or sparse-attention transformers track Solomonoff weights more closely?
- Compute-efficient sampling: Can we approximate universal induction on-the-fly during inference?
Conclusions
Our expanded empirical study confirms that pretrained LLMs, without task-specific fine-tuning, come surprisingly close to the performance of a dedicated Solomonoff Induction Transformer on random binary program outputs. Larger models exhibit stronger long-range sequence consistency, hinting at more effective implicit program induction. While it remains an open debate whether ICL is Solomonoff induction or a distinct mechanism that happens to align closely with it, this line of work bridges algorithmic information theory and modern deep learning in a practically measurable way.
- [1] Wan, J., & Mei, L. (2025). Large Language Models as Computable Approximations to Solomonoff Induction. arXiv:2505.15784.
- [2] Young, N., & Witbrock, M. (2024). Transformers As Approximations of Solomonoff Induction. arXiv:2408.12065.
- [3] Legg, S. (2006). Is there an elegant universal theory of prediction? ALT’06.