Executive Summary
- The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
- Trying to directly solve problems on the critical path to AGI going well[1]
- Carefully choosing problems according to our comparative advantage
- Measuring progress with empirical feedback on proxy tasks
- We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, in both industry and academia, and we call on people to join us
- Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
- Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
- See our companion piece for more on which research areas and theories of change we think are promising
- Why pivot now? We think that times have changed.
- Models are far more capable, bringing new questions within empirical reach
- We have been disappointed by the amount of progress from ambitious mech interp work, both ours and others'[2]
- Most existing interpretability techniques struggle on today's important behaviours, which typically involve large models, complex environments, agentic behaviour and long chains of thought
- Problem: It is easy to do research that doesn't make real progress.
- Our approach: ground your work with a North Star (a meaningful stepping-stone goal towards AGI going well) and a proxy task (empirical feedback that stops you fooling yourself and tracks progress toward the North Star)
- “Proxy tasks” doesn't mean boring benchmarks. Examples include: interpret the hidden goal of a model organism; stop emergent misalignment without changing training data; predict which prompt changes will stop an undesired behaviour.
- We see two main approaches to research projects: focused projects (proxy task driven), and exploratory projects (curiosity-driven, proxy task validated)
- Curiosity-driven work can be very effective, but can also get caught in rabbit holes. We recommend starting in a robustly useful setting, time-boxing your exploration[3], and finding a proxy task as a validation step[4]
- We advocate method minimalism: start solving your proxy task with the simplest methods (e.g. prompting, steering, probing, reading chain-of-thought). Introduce complexity or design new methods only once baselines have failed.
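
To make method minimalism concrete, here is a minimal sketch of one such baseline: a linear probe for a behaviour of interest. The setup is hypothetical; the activations are synthetic stand-ins for residual-stream activations you would cache from a real model, and the labels stand in for per-example annotations of the behaviour your proxy task targets.

```python
# Minimal linear-probe baseline (illustrative setup): given activations X
# (n_examples x d_model) and binary behaviour labels y, fit a logistic-
# regression probe and measure held-out accuracy before trying anything fancier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, d_model = 2000, 512

# Synthetic stand-in data: in practice X would be activations cached from a
# model (e.g. the residual stream at one layer) and y the labels for the
# behaviour your proxy task cares about.
y = rng.integers(0, 2, size=n_examples)
direction = rng.normal(size=d_model)  # hypothetical "behaviour direction"
X = rng.normal(size=(n_examples, d_model)) + np.outer(y, direction)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out probe accuracy: {accuracy_score(y_te, probe.predict(X_te)):.3f}")
```

If a baseline this simple already solves the proxy task, heavier machinery adds little; if it clearly fails, that failure is evidence that a more sophisticated method is genuinely needed.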
Read the full post here, and the companion piece on promising AGI Safety-relevant research directions here
