AXRP Episode 41: Lee Sharkey on Attribution-based Parameter Decomposition

Published June 3, 2025
Introduction
Interpretability in deep learning remains a key research frontier. In Episode 41 of AXRP, host Daniel Filan sits down with Lee Sharkey (Goodfire) to unpack Attribution-based Parameter Decomposition (APD), a novel framework that detects latent computational mechanisms by decomposing weight parameters rather than activations.
APD Fundamentals
APD reframes mechanistic interpretability around a three-term optimization objective (the terms are combined schematically in the sketch after this list):
- Faithfulness: Decomposed components must sum back to the original model weights (reconstruction MSE ≤ 1e-5).
- Minimality: Only the top-k causally influential components activate per input, enforced via dual forward/backward passes.
- Simplicity: Each component minimizes a continuous proxy for matrix rank (the Schatten 1-norm), promoting low-dimensional linear transforms.
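Schematically, and assuming the three terms are simply combined as a weighted sum (the coefficients below are illustrative rather than the paper's exact formulation), the objective looks like:

```latex
\mathcal{L}(\Delta W_1,\dots,\Delta W_C)
  \;=\; \Big\lVert \sum_{c=1}^{C} \Delta W_c - W \Big\rVert_2^2
  \;+\; \lambda_{\mathrm{min}}\, \mathcal{L}_{\mathrm{out}}\!\left(\text{top-}k\ \text{components only}\right)
  \;+\; \lambda_{\mathrm{simp}} \sum_{c=1}^{C} \lVert \Delta W_c \rVert_{S_1}
```

Here W is the flattened weight vector of the trained model, L_out measures how much the outputs change when only the top-k attributed components are used, and ‖·‖_{S₁} is the Schatten 1-norm (the sum of singular values).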
Faithfulness in Practice
All layer matrices are flattened into a single vector W. APD initializes C parameter vectors {ΔW₁…ΔW_C} whose sum approximates W. A standard L2 loss on the reconstructed weights keeps the decomposition functionally equivalent to the original network.
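A minimal sketch of this setup in PyTorch (the toy architecture and the number of components C are illustrative placeholders):

```python
import torch
from torch.nn.utils import parameters_to_vector

# Hypothetical target network; any trained model can be flattened the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(40, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 40),
)

# Flatten every parameter matrix into one long vector W.
W = parameters_to_vector(model.parameters()).detach()

# C trainable component vectors ΔW_1 ... ΔW_C, initialized near zero.
C = 16
delta_W = torch.nn.Parameter(1e-2 * torch.randn(C, W.numel()))

# Faithfulness: an L2 (MSE) penalty driving the component sum toward W.
faithfulness_loss = ((delta_W.sum(dim=0) - W) ** 2).mean()
```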
Minimality via Top-k Attribution
APD computes component-level attributions by taking the inner product of each ΔW with the gradient of the network's outputs with respect to the weights. By selecting only the k components with the largest absolute attributions for each batch element, APD enforces a sparse, mechanism-level bottleneck, analogous to sparse autoencoders but in parameter space.
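Continuing the sketch above (reusing model and delta_W), a gradient-based attribution and top-k selection might look like this; scalarizing the output with a plain sum is an illustrative simplification:

```python
import torch
from torch.nn.utils import parameters_to_vector

def topk_components(model, delta_W, x, k):
    # Gradient of the (scalarized) output with respect to all weights.
    output = model(x).sum()
    grads = torch.autograd.grad(output, list(model.parameters()))
    grad_vec = parameters_to_vector(grads)

    # One inner product <grad, ΔW_c> per component; keep the k largest in magnitude.
    attributions = delta_W @ grad_vec
    return torch.topk(attributions.abs(), k).indices

# Example: active = topk_components(model, delta_W, torch.randn(1, 40), k=5)
```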
Simplicity Through Low-Rank Components
Simplicity is quantified by the sum of singular values (the Schatten 1-norm) across each component's weight matrices. Minimizing this norm encourages components to operate over as few dimensions as possible, approximating rank-constrained, interpretable linear operations.
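A sketch of the simplicity term, reshaping one flattened component back into the original per-layer matrices before summing singular values (in practice a differentiable surrogate or careful handling of the SVD gradient may be needed):

```python
import math
import torch

def schatten_1_norm(component_vec, param_shapes):
    # Walk through the flattened component in the same order the weights were
    # flattened, reshape each slice, and sum singular values of every matrix.
    total, offset = torch.zeros(()), 0
    for shape in param_shapes:
        n = math.prod(shape)
        block = component_vec[offset:offset + n].reshape(shape)
        if block.ndim == 2:                  # biases and other 1-D params are skipped
            total = total + torch.linalg.svdvals(block).sum()
        offset += n
    return total

# Example usage with the earlier sketch:
# shapes = [p.shape for p in model.parameters()]
# simplicity_loss = sum(schatten_1_norm(delta_W[c], shapes) for c in range(C))
```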
Experimental Validation
Toy Models of Superposition
In small synthetic autoencoders (5 features → 2 hidden units), APD recovers each ground-truth embedding row exactly as an independent mechanism. Scaling to a 40 → 10 setup, it achieves 98% true-positive recovery of sparse features at k=5 with ≤10% hyperparameter sensitivity.
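For concreteness, here is a minimal sketch of the kind of toy model involved, following the standard toy-models-of-superposition setup; training details and feature sparsity are omitted:

```python
import torch

class ToySuperposition(torch.nn.Module):
    """5 sparse features compressed into 2 hidden dimensions, then reconstructed."""
    def __init__(self, n_features=5, n_hidden=2):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(n_hidden, n_features))
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                        # x: (batch, n_features), mostly zeros
        h = x @ self.W.T                         # project down to 2 dimensions
        return torch.relu(h @ self.W + self.b)   # reconstruct all 5 features

# The ground-truth mechanisms are the 5 feature embedding directions in W;
# APD is expected to recover each one as its own rank-1 parameter component.
```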
Compressed Computation in ReLU Networks
APD also shows that an MLP trained to implement 100 parallel ReLU functions compresses them into 50 hidden neurons by partitioning its parameters into disjoint low-rank subspaces. This confirms the model's ability to compute more functions than its width would naively allow, exploiting sparse activations and non-linear filtering.
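A sketch of the task setup described here, with the target function chosen illustratively as an element-wise ReLU on sparse inputs:

```python
import torch

n_features, n_neurons = 100, 50

# MLP asked to compute 100 element-wise ReLU functions with only 50 neurons.
W_in = torch.nn.Parameter(torch.randn(n_neurons, n_features) / n_features ** 0.5)
W_out = torch.nn.Parameter(torch.randn(n_features, n_neurons) / n_neurons ** 0.5)

def forward(x):                               # x: (batch, 100), mostly zeros
    return torch.relu(x @ W_in.T) @ W_out.T   # 50-neuron bottleneck

# Illustrative training target: y_i ≈ ReLU(x_i) for every feature i. Because the
# inputs are sparse, the network can reuse its 50 neurons across the 100 functions,
# and APD's claim is that the learned parameters split into disjoint low-rank
# subspaces, roughly one per implemented function.
```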
Scalability and Performance Optimizations
The proof-of-concept APD implementation incurs roughly a 4× compute cost (two forward and two backward passes) and stores C full copies of the model's parameters. Ongoing research focuses on:
- Component Merging: Merge near-redundant ΔWᵢ to cut memory usage (one possible merging rule is sketched after this list).
- Causal Attribution: Replace gradients with counterfactual or Shapley-value estimators to mitigate saturation in attention and gating.
- Layer-wise APD: Decompose per-layer or per-module to exploit parallelism and reduce vector dimensionality.
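As one illustration of the component-merging idea, here is a naive greedy rule (an assumption for illustration, not the paper's procedure) that folds together components pointing in nearly the same direction in parameter space:

```python
import torch

def merge_redundant_components(delta_W, threshold=0.95):
    # Cosine similarities are computed on the original (pre-merge) directions.
    delta_W = delta_W.detach()
    normed = torch.nn.functional.normalize(delta_W, dim=1)
    keep = list(range(delta_W.shape[0]))
    merged = delta_W.clone()
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            similarity = (normed[keep[i]] * normed[keep[j]]).sum()
            if similarity > threshold:
                merged[keep[i]] += merged[keep[j]]   # fold component j into i
                keep.pop(j)
            else:
                j += 1
        i += 1
    return merged[keep]
```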
Comparison with Activation-based Methods
Unlike sparse autoencoders or transcoders that operate on hidden activations, APD’s parameter-space approach offers:
- Basis Independence: No assumption of a “neuron” basis—APD discovers its own.
- End-to-End Mechanisms: Components reflect full computational subroutines, not just representational snapshots.
- Architecture Agnosticism: A single method extends to transformers, SSMs, CNNs, and MLPs without bespoke modifications.
Potential Pitfalls & Mitigations
A known limitation is that gradient-based attributions can undercount saturated attention heads or overestimate linear mixing. Mitigations include integrated gradients, path-integrated Shapley sampling, and direct weight-perturbation ablations for more robust influence estimation.
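As an example of the integrated-gradients direction, the sketch below averages parameter-space gradients along a straight path from the zero vector to the trained weights W; the zero baseline and the output scalarization are illustrative assumptions, not a prescribed recipe:

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def integrated_gradient_attributions(model, delta_W, x, steps=16):
    W = parameters_to_vector(model.parameters()).detach()
    accumulated = torch.zeros_like(W)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        # Evaluate the gradient at an interpolated point on the path 0 -> W.
        vector_to_parameters(alpha * W, model.parameters())
        output = model(x).sum()
        grads = torch.autograd.grad(output, list(model.parameters()))
        accumulated += parameters_to_vector(grads)
    vector_to_parameters(W, model.parameters())      # restore the trained weights
    # One path-averaged inner product per component.
    return delta_W @ (accumulated / steps)
```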
Future Directions
Key avenues for APD research:
- Hyperparameter-Free Minimality: Learn per-component activation thresholds from an information-bottleneck perspective, eliminating fixed k.
- Mechanism Verification: Apply APD to real-world circuit discoveries—e.g., induction heads in LLMs, modular-arithmetic MLPs, and group-theoretic transformers.
- Toolchain Integration: Embed APD into MATS Interp and other open-source toolkits for automated circuit discovery and documentation.
Conclusion
APD marks a shift toward weight-centric interpretability, uncovering low-rank, minimally sufficient subroutines inside deep networks. As its robustness and scalability improve, APD could help make opaque AI systems transparent at scale.
Speaker Bios
Lee Sharkey is an interpretability researcher at Goodfire and co-founder of Apollo Research. His early work on sparse autoencoders laid the groundwork for parameter-space analysis.
Host: Daniel Filan, AXRP Podcast.