AXRP Episode 41: Lee Sharkey on Attribution-based Parameter Decomposition

Published June 3, 2025
Introduction
Interpretability in deep learning remains a key research frontier. In Episode 41 of AXRP, host Daniel Filan sits down with Lee Sharkey (Goodfire) to unpack Attribution-based Parameter Decomposition (APD), a novel framework that detects latent computational mechanisms by decomposing weight parameters rather than activations.
APD Fundamentals
APD reframes mechanistic interpretability around a three-term optimization objective (the terms are combined schematically in the sketch after this list):
- Faithfulness: Decomposed components must sum back to the original model weights (reconstruction MSE ≤ 1e-5).
- Minimality: Only the top-k causally influential components activate per input, enforced via dual forward/backward passes.
- Simplicity: Each component minimizes a continuous proxy for matrix rank (the Schatten 1-norm), promoting low-dimensional linear transforms.
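Schematically, and assuming the three terms are simply combined as a weighted sum (the coefficients below are illustrative rather than the paper's exact formulation), the objective looks like:

```latex
\mathcal{L}(\Delta W_1,\dots,\Delta W_C)
  \;=\; \Big\lVert \sum_{c=1}^{C} \Delta W_c - W \Big\rVert_2^2
  \;+\; \lambda_{\mathrm{min}}\, \mathcal{L}_{\mathrm{out}}\!\left(\text{top-}k\ \text{components only}\right)
  \;+\; \lambda_{\mathrm{simp}} \sum_{c=1}^{C} \lVert \Delta W_c \rVert_{S_1}
```

Here W is the flattened weight vector of the trained model, L_out measures how much the outputs change when only the top-k attributed components are used, and ‖·‖_{S₁} is the Schatten 1-norm (the sum of singular values).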
Faithfulness in Practice
All layer matrices are flattened into a single vector W. APD initializes C parameter vectors {ΔW₁…ΔW_C} whose sum approximates W. A standard L2 loss on the reconstructed weights keeps the decomposition functionally equivalent to the original network.
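A minimal sketch of this setup in PyTorch (the toy architecture and the number of components C are illustrative placeholders):

```python
import torch
from torch.nn.utils import parameters_to_vector

# Hypothetical target network; any trained model can be flattened the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(40, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 40),
)

# Flatten every parameter matrix into one long vector W.
W = parameters_to_vector(model.parameters()).detach()

# C trainable component vectors ΔW_1 ... ΔW_C, initialized near zero.
C = 16
delta_W = torch.nn.Parameter(1e-2 * torch.randn(C, W.numel()))

# Faithfulness: an L2 (MSE) penalty driving the component sum toward W.
faithfulness_loss = ((delta_W.sum(dim=0) - W) ** 2).mean()
```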
Minimality via Top-k Attribution
APD computes component-level attributions by taking the inner product of each ΔW with the gradient of the network's outputs with respect to the weights. By selecting only the k components with the largest absolute attributions for each batch element, APD enforces a sparse, mechanism-level bottleneck, analogous to sparse autoencoders but in parameter space.
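Continuing the sketch above (reusing model and delta_W), a gradient-based attribution and top-k selection might look like this; scalarizing the output with a plain sum is an illustrative simplification:

```python
import torch
from torch.nn.utils import parameters_to_vector

def topk_components(model, delta_W, x, k):
    # Gradient of the (scalarized) output with respect to all weights.
    output = model(x).sum()
    grads = torch.autograd.grad(output, list(model.parameters()))
    grad_vec = parameters_to_vector(grads)

    # One inner product <grad, ΔW_c> per component; keep the k largest in magnitude.
    attributions = delta_W @ grad_vec
    return torch.topk(attributions.abs(), k).indices

# Example: active = topk_components(model, delta_W, torch.randn(1, 40), k=5)
```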
Simplicity Through Low-Rank Components
Simplicity is quantified by the sum of singular values (the Schatten 1-norm) across each component's weight matrices. Minimizing this norm encourages components to operate over as few dimensions as possible, approximating rank-constrained, interpretable linear operations.
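A sketch of the simplicity term, reshaping one flattened component back into the original per-layer matrices before summing singular values (in practice a differentiable surrogate or careful handling of the SVD gradient may be needed):

```python
import math
import torch

def schatten_1_norm(component_vec, param_shapes):
    # Walk through the flattened component in the same order the weights were
    # flattened, reshape each slice, and sum singular values of every matrix.
    total, offset = torch.zeros(()), 0
    for shape in param_shapes:
        n = math.prod(shape)
        block = component_vec[offset:offset + n].reshape(shape)
        if block.ndim == 2:                  # biases and other 1-D params are skipped
            total = total + torch.linalg.svdvals(block).sum()
        offset += n
    return total

# Example usage with the earlier sketch:
# shapes = [p.shape for p in model.parameters()]
# simplicity_loss = sum(schatten_1_norm(delta_W[c], shapes) for c in range(C))
```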
Experimental Validation
Toy Models of Superposition
In small synthetic autoencoders (5 features → 2 hidden units), APD recovers each ground-truth embedding row exactly as an independent mechanism. Scaling to a 40 → 10 setup, it achieves 98% true-positive recovery of sparse features at k=5 with ≤10% hyperparameter sensitivity.
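For concreteness, here is a minimal sketch of the kind of toy model involved, following the standard toy-models-of-superposition setup; training details and feature sparsity are omitted:

```python
import torch

class ToySuperposition(torch.nn.Module):
    """5 sparse features compressed into 2 hidden dimensions, then reconstructed."""
    def __init__(self, n_features=5, n_hidden=2):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(n_hidden, n_features))
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                        # x: (batch, n_features), mostly zeros
        h = x @ self.W.T                         # project down to 2 dimensions
        return torch.relu(h @ self.W + self.b)   # reconstruct all 5 features

# The ground-truth mechanisms are the 5 feature embedding directions in W;
# APD is expected to recover each one as its own rank-1 parameter component.
```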
Compressed Computation in ReLU Networks
APD also shows that an MLP trained to implement 100 parallel ReLU functions compresses them into 50 hidden neurons by partitioning its parameters into disjoint low-rank subspaces. This confirms the model's ability to compute more functions than its width would naively allow, exploiting sparse activations and non-linear filtering.
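A sketch of the task setup described here, with the target function chosen illustratively as an element-wise ReLU on sparse inputs:

```python
import torch

n_features, n_neurons = 100, 50

# MLP asked to compute 100 element-wise ReLU functions with only 50 neurons.
W_in = torch.nn.Parameter(torch.randn(n_neurons, n_features) / n_features ** 0.5)
W_out = torch.nn.Parameter(torch.randn(n_features, n_neurons) / n_neurons ** 0.5)

def forward(x):                               # x: (batch, 100), mostly zeros
    return torch.relu(x @ W_in.T) @ W_out.T   # 50-neuron bottleneck

# Illustrative training target: y_i ≈ ReLU(x_i) for every feature i. Because the
# inputs are sparse, the network can reuse its 50 neurons across the 100 functions,
# and APD's claim is that the learned parameters split into disjoint low-rank
# subspaces, roughly one per implemented function.
```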
Scalability and Performance Optimizations
The proof-of-concept APD implementation incurs roughly a 4× compute cost (two forward and two backward passes) and stores C full copies of the model's parameters. Ongoing research focuses on:
- Component Merging: Merge near-redundant ΔWᵢ to cut memory usage (one possible merging rule is sketched after this list).
- Causal Attribution: Replace gradients with counterfactual or Shapley-value estimators to mitigate saturation in attention and gating.
- Layer-wise APD: Decompose per-layer or per-module to exploit parallelism and reduce vector dimensionality.
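As one illustration of the component-merging idea, here is a naive greedy rule (an assumption for illustration, not the paper's procedure) that folds together components pointing in nearly the same direction in parameter space:

```python
import torch

def merge_redundant_components(delta_W, threshold=0.95):
    # Cosine similarities are computed on the original (pre-merge) directions.
    delta_W = delta_W.detach()
    normed = torch.nn.functional.normalize(delta_W, dim=1)
    keep = list(range(delta_W.shape[0]))
    merged = delta_W.clone()
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            similarity = (normed[keep[i]] * normed[keep[j]]).sum()
            if similarity > threshold:
                merged[keep[i]] += merged[keep[j]]   # fold component j into i
                keep.pop(j)
            else:
                j += 1
        i += 1
    return merged[keep]
```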
Comparison with Activation-based Methods
Unlike sparse autoencoders or transcoders that operate on hidden activations, APD’s parameter-space approach offers:
- Basis Independence: No assumption of a “neuron” basis—APD discovers its own.
- End-to-End Mechanisms: Components reflect full computational subroutines, not just representational snapshots.
- Architecture Agnosticism: A single method extends to transformers, SSMs, CNNs, and MLPs without bespoke modifications.
Potential Pitfalls & Mitigations
A known limitation is that gradient-based attributions can undercount saturated attention heads or overestimate linear mixing. Mitigations include integrated gradients, path-integrated Shapley sampling, and direct weight-perturbation ablations for more robust influence estimation.
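As an example of the integrated-gradients direction, the sketch below averages parameter-space gradients along a straight path from the zero vector to the trained weights W; the zero baseline and the output scalarization are illustrative assumptions, not a prescribed recipe:

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def integrated_gradient_attributions(model, delta_W, x, steps=16):
    W = parameters_to_vector(model.parameters()).detach()
    accumulated = torch.zeros_like(W)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        # Evaluate the gradient at an interpolated point on the path 0 -> W.
        vector_to_parameters(alpha * W, model.parameters())
        output = model(x).sum()
        grads = torch.autograd.grad(output, list(model.parameters()))
        accumulated += parameters_to_vector(grads)
    vector_to_parameters(W, model.parameters())      # restore the trained weights
    # One path-averaged inner product per component.
    return delta_W @ (accumulated / steps)
```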
Future Directions
Key avenues for APD research:
- Hyperparameter-Free Minimality: Learn per-component activation thresholds from an information-bottleneck perspective, eliminating fixed k.
- Mechanism Verification: Apply APD to real-world circuit discoveries—e.g., induction heads in LLMs, modular-arithmetic MLPs, and group-theoretic transformers.
- Toolchain Integration: Embed APD into MATS Interp and other open-source toolkits for automated circuit discovery and documentation.
Conclusion
APD marks a shift toward weight-centric interpretability, uncovering low-rank, minimally sufficient subroutines inside deep networks. As its robustness and scalability improve, APD could help make opaque AI systems transparent at scale.
Speaker Bios
Lee Sharkey is an interpretability researcher at Goodfire and co-founder of Apollo Research. His early work on sparse autoencoders laid the groundwork for parameter-space analysis.
Host: Daniel Filan, AXRP Podcast.