Preface
I'm a postdoc at CHAI working on AI safety. This document contains my rough attempt at a taxonomy of technical AI safety research in 2025. It’s certainly imperfect, and likely biased towards the research areas that I have the most experience with. I’m sure I’ve accidentally omitted some important topics. Please leave comments with suggestions for changes/additions.
Notes:
- The amount of detail on a given topic is not a signal of importance. It’s mostly a signal of my familiarity with the topic.
- The taxonomy is not subdivided by theoretical vs. empirical work. Many listed topics include both theoretical and empirical work.
- The categorization is based on research areas, not threat models or model lifecycle stages. Since some research areas overlap, some of the categories here overlap as well.
1 Alignment
1.1 Value learning & specification
Teaching human-aligned objectives to models. Example topics: RLHF, inverse RL, constitutional AI.
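As a concrete (and heavily simplified) illustration of the reward-modeling step in RLHF, here is a minimal sketch of fitting a reward model to pairwise preferences with the Bradley-Terry loss. Everything here is synthetic and assumed for illustration: a linear reward model over made-up features, with deterministic preference labels.

```python
# Minimal sketch of the Bradley-Terry preference loss used to fit a reward
# model from pairwise comparisons (the core of the RLHF reward-modeling step).
# Everything here is synthetic: a linear reward model over hand-made features.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic "true" reward and preference data: each pair is (chosen, rejected).
dim = 5
w_true = rng.normal(size=dim)
feats = rng.normal(size=(200, 2, dim))           # 200 pairs of candidate responses
true_r = feats @ w_true                          # shape (200, 2)
chosen = true_r[:, 0] > true_r[:, 1]             # which item in each pair is preferred
phi_chosen = np.where(chosen[:, None], feats[:, 0], feats[:, 1])
phi_rejected = np.where(chosen[:, None], feats[:, 1], feats[:, 0])

# Fit a linear reward model w by maximizing the Bradley-Terry log-likelihood:
#   log p(chosen > rejected) = log sigmoid(r(chosen) - r(rejected))
w = np.zeros(dim)
lr = 0.1
for _ in range(500):
    delta = phi_chosen @ w - phi_rejected @ w
    # Gradient of the mean negative log-likelihood w.r.t. w
    grad = -((1.0 - sigmoid(delta))[:, None] * (phi_chosen - phi_rejected)).mean(axis=0)
    w -= lr * grad

print("cosine similarity to true reward direction:",
      w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))
```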
1.2 Scalable oversight
Ensuring alignment when it is not possible or practical to directly evaluate model outputs. Example topics: debate, iterated amplification, recursive reward modeling, weak-to-strong generalization.
1.3 Optimization pressure & failure modes
How does strong optimization lead to unintended behavior? Example topics: reward hacking, goal misgeneralization, mesa-optimization & inner alignment, deceptive alignment, instrumental convergence & power-seeking, corrigibility, wireheading.
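As a toy illustration of the reward-hacking/Goodhart flavor of these failures (entirely synthetic, not tied to any real system): the harder you select on a noisy proxy for utility, the more the selected candidate's proxy score overestimates its true utility.

```python
# Toy illustration of how optimization pressure on a proxy reward exploits the
# gap between the proxy and the true objective. True utility and proxy are
# correlated; we select the best candidate under the proxy from increasingly
# large candidate pools ("more optimization pressure").
import numpy as np

rng = np.random.default_rng(0)

def select_best_by_proxy(n_candidates, n_trials=2000):
    true_u = rng.normal(size=(n_trials, n_candidates))
    proxy = true_u + rng.normal(size=(n_trials, n_candidates))  # noisy proxy
    best = np.argmax(proxy, axis=1)
    idx = np.arange(n_trials)
    return proxy[idx, best].mean(), true_u[idx, best].mean()

for n in [1, 10, 100, 1000, 10000]:
    proxy_score, true_score = select_best_by_proxy(n)
    print(f"candidates={n:6d}  proxy of selected={proxy_score:5.2f}  "
          f"true utility of selected={true_score:5.2f}  gap={proxy_score - true_score:5.2f}")
```

The true utility of the selected candidate still rises here, but the systematic overestimate (the gap) grows with the strength of the search.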
1.4 Multi-agent alignment
How do alignment concerns change when multiple agents are involved? Example topics: mechanism design, cooperative AI, pluralistic alignment.
2 Security
2.1 Adversarial attacks and defenses
Making models robust to adversarial manipulation of inputs or training data. Example topics: jailbreaks, prompt injection, training data poisoning, backdoor attacks.
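For a minimal sense of how gradient-based input attacks work, here is an FGSM-style sketch against a toy logistic-regression classifier on synthetic data; nothing here is specific to any real model or dataset.

```python
# Minimal sketch of an FGSM-style adversarial perturbation against a linear
# classifier trained on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a logistic-regression "model" on two overlapping Gaussian blobs.
X = np.vstack([rng.normal(-0.5, 1, size=(500, 10)), rng.normal(+0.5, 1, size=(500, 10))])
y = np.concatenate([np.zeros(500), np.ones(500)])
w, b = np.zeros(10), 0.0
for _ in range(300):
    p = sigmoid(X @ w + b)
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * (p - y).mean()

# FGSM: move each input a small step in the direction that increases its loss,
#   x_adv = x + eps * sign(d loss / d x),  where d loss / d x = (p - y) * w.
eps = 0.5
p = sigmoid(X @ w + b)
X_adv = X + eps * np.sign((p - y)[:, None] * w[None, :])

acc_clean = ((sigmoid(X @ w + b) > 0.5) == y).mean()
acc_adv = ((sigmoid(X_adv @ w + b) > 0.5) == y).mean()
print(f"clean accuracy: {acc_clean:.3f}   adversarial accuracy: {acc_adv:.3f}")
```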
2.2 Privacy
Protecting personal or proprietary information throughout the model lifecycle.
2.3 Model theft and supply-chain security
Protecting model weights and intellectual property from theft and supply-chain compromise.
3 Robustness & Reliability
3.1 Distributional shift
Maintaining safe behavior when the input distribution changes between training and deployment. Example topics: OOD/anomaly detection, domain adaptation, transfer learning, (mis)generalization.
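As a small sketch of one common OOD-detection recipe (my example, not a canonical implementation): fit a Gaussian to in-distribution features and flag inputs whose Mahalanobis distance exceeds a threshold calibrated on held-out in-distribution data. The features here are synthetic.

```python
# Minimal sketch of distance-based OOD detection with synthetic features.
import numpy as np

rng = np.random.default_rng(0)

train = rng.normal(0, 1, size=(2000, 8))             # in-distribution features
mu = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))

def ood_score(x):
    d = x - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))  # Mahalanobis distance

# Set the alarm threshold so ~5% of held-out in-distribution points are flagged.
holdout = rng.normal(0, 1, size=(1000, 8))
threshold = np.quantile(ood_score(holdout), 0.95)

test_in = rng.normal(0, 1, size=(1000, 8))            # fresh in-distribution inputs
shifted = rng.normal(1.5, 1, size=(1000, 8))          # distribution-shifted inputs
print("false-positive rate (in-distribution):", (ood_score(test_in) > threshold).mean())
print("detection rate (shifted inputs):      ", (ood_score(shifted) > threshold).mean())
```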
3.2 Uncertainty quantification
Estimating confidence levels to enable risk-aware decisions. Example topics: epistemic vs. aleatoric uncertainty, calibration, conformal prediction, abstention strategies.
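Split conformal prediction is simple enough to show in a few lines; here is a sketch for regression with synthetic data and a deliberately misspecified linear model (all of it assumed for illustration).

```python
# Minimal sketch of split conformal prediction for regression: calibrate a
# residual quantile on held-out data to get prediction intervals with
# marginal coverage of at least 1 - alpha, regardless of how good the model is.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

def make_data(n):
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)
    return x, y

# Fit a deliberately simple (misspecified) linear model.
x_train, y_train = make_data(2000)
coef = np.polyfit(x_train, y_train, deg=1)
predict = lambda x: np.polyval(coef, x)

# Calibration: conformal quantile of absolute residuals on held-out data.
x_cal, y_cal = make_data(1000)
scores = np.abs(y_cal - predict(x_cal))
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Test coverage of the intervals [prediction - q, prediction + q].
x_test, y_test = make_data(5000)
covered = np.abs(y_test - predict(x_test)) <= q
print(f"empirical coverage: {covered.mean():.3f}  (target >= {1 - alpha})")
```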
3.3 Safe reinforcement learning & control
Learning while satisfying safety constraints. Example topics: risk-sensitive RL, constrained MDPs, safe exploration.
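Here is a toy sketch of the Lagrangian approach to a constrained bandit (a stand-in for a constrained MDP): maximize expected reward subject to an expected-cost budget. Because expectations are exact in this toy, the "policy update" is just a soft best response to the penalized reward; the rewards, costs, budget, and temperature are all made up.

```python
# Toy sketch of Lagrangian (dual ascent) constrained policy optimization.
import numpy as np

rewards = np.array([1.0, 0.8, 0.3])   # per-action reward (made up)
costs = np.array([1.0, 0.4, 0.1])     # per-action safety cost (made up)
budget = 0.5                          # constraint: expected cost <= 0.5
tau = 0.05                            # softmax temperature (keeps the policy smooth)
lam, lr_lam = 0.0, 0.05               # Lagrange multiplier and its step size

for _ in range(1000):
    # Soft best response to the penalized reward r - lam * c.
    logits = (rewards - lam * costs) / tau
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    # Dual step: raise lam while the constraint is violated, lower it otherwise.
    lam = max(0.0, lam + lr_lam * (pi @ costs - budget))

print("policy:", np.round(pi, 3), " lambda:", round(lam, 3))
print("expected reward:", round(float(pi @ rewards), 3),
      " expected cost:", round(float(pi @ costs), 3))
```

The multiplier settles where the expected cost sits at the budget, and the policy mixes the high-reward/high-cost and medium-reward/low-cost actions accordingly.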
4 Interpretability & Transparency
4.1 Mechanistic interpretability
Reverse-engineering internal circuits to understand how models compute.
4.2 Behavioral analysis & probing
Inferring latent concepts from model responses and attributions.
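A minimal probing sketch, using synthetic "activations" in place of a real model's hidden states: fit a linear probe and check whether a concept is linearly decodable. With a real model you would record activations at a chosen layer instead of generating them.

```python
# Minimal sketch of linear probing on synthetic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Fake activations: a "concept direction" plus unstructured noise.
n, d = 4000, 64
concept = rng.integers(0, 2, size=n)                      # binary concept label
direction = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + 0.5 * concept[:, None] * direction

X_tr, X_te, y_tr, y_te = train_test_split(acts, concept, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# High accuracy suggests the concept is linearly represented; a control task
# with shuffled labels helps rule out the probe fitting noise.
```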
4.3 Eliciting latent knowledge
Extracting what a model internally knows or believes, even when its outputs don't report it truthfully.
4.4 Model editing & machine unlearning
Locally modifying or guiding model knowledge without harmful side effects.
5 Formal Methods & Structured Reasoning
5.1 Formal verification
Formally proving safety properties of models and policies.
5.2 Program synthesis & specification
Automatically deriving safe programs or policies from high-level specs.
5.3 Neuro-symbolic integration
Hybrid approaches that leverage the complementary strengths of neural and symbolic methods.
5.4 Probabilistic programming & causality
Formally modeling uncertainty and causal structure for reliable inference.
5.5 Agent foundations
Theoretical foundations of how intelligent agents should make decisions. Example topics: logical and updateless decision theory, bounded rationality, embedded agency, Bayesian inference.
6 Evaluation & Deployment
6.1 Benchmarking
Rigorously testing capabilities and safety through standardized benchmarks.
6.2 Red-teaming
Proactively searching for flaws and harmful behaviors in models. Example topics: dangerous capability detection, automated red-teaming, model organisms.
6.3 Safety cases & assurance
Building structured arguments and evidence that a system is safe enough to deploy.
6.4 Runtime monitoring
Detecting issues at inference time. Example topics: chain-of-thought monitoring, tool-use monitoring, output filtering.
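A minimal sketch of an output-filtering monitor: score each model output with one or more cheap checks and block or escalate before it reaches the user. The patterns and the toxicity scorer below are placeholders I made up; in practice the scorer would typically be a learned moderation model.

```python
# Minimal sketch of a runtime output monitor with placeholder checks.
import re

BLOCKLIST_PATTERNS = [
    r"(?i)\bapi[_-]?key\s*[:=]",            # leaked credential
    r"(?i)\bssn:\s*\d{3}-\d{2}-\d{4}\b",    # leaked personal identifier
]

def toxicity_score(text: str) -> float:
    # Placeholder for a learned classifier; returns a score in [0, 1].
    return 0.9 if "idiot" in text.lower() else 0.05

def monitor(output: str) -> str:
    if any(re.search(p, output) for p in BLOCKLIST_PATTERNS):
        return "BLOCK"
    if toxicity_score(output) > 0.5:
        return "ESCALATE_TO_HUMAN"
    return "ALLOW"

for o in ["Here's a summary of the paper.", "You're an idiot.", "API_KEY=abc123"]:
    print(monitor(o), "-", o)
```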
6.5 Containment & sandboxing
Enforcing least privilege and limiting external access of autonomous agents.
7 Model Governance
7.1 Data management
Managing datasets to ensure quality, legality, and diversity.
7.2 Fine-tuning & adaptation safety
Preventing regressions and misalignment during post-training updates.
7.3 Watermarking & attribution
Embedding verifiable markers to prove ownership and detect unauthorized use.
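As a toy sketch of green-list text watermarking and detection: the generator prefers tokens from a pseudorandom "green list" seeded by the previous token, and the detector re-derives the green lists and runs a one-sided z-test on the green-token count. The token IDs and the "generator" below are entirely synthetic; real schemes bias the language model's logits instead.

```python
# Toy sketch of green-list watermark generation and detection.
import hashlib
import math
import random

VOCAB_SIZE = 1000
GREEN_FRACTION = 0.5

def is_green(prev_token: int, token: int) -> bool:
    # Pseudorandom green list determined by the previous token.
    seed = hashlib.sha256(f"{prev_token}".encode()).digest()
    seeded = random.Random(seed)
    green = set(seeded.sample(range(VOCAB_SIZE), int(GREEN_FRACTION * VOCAB_SIZE)))
    return token in green

def z_score(tokens) -> float:
    # One-sided z-test: how far above chance is the green-token count?
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GREEN_FRACTION * n) / math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))

gen_rng = random.Random(0)

def sample_text(n, green_bias):
    # Toy "generator": rejection-samples random tokens, preferring green ones.
    tokens = [gen_rng.randrange(VOCAB_SIZE)]
    for _ in range(n - 1):
        while True:
            t = gen_rng.randrange(VOCAB_SIZE)
            if is_green(tokens[-1], t) or gen_rng.random() > green_bias:
                tokens.append(t)
                break
    return tokens

print("unwatermarked z:", round(z_score(sample_text(200, green_bias=0.0)), 2))
print("watermarked z:  ", round(z_score(sample_text(200, green_bias=0.9)), 2))
```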
8 Socio-Technical Factors
Note: This is a taxonomy of “technical AI safety” research, not of all “beneficial AI” research. The topics in this section are less associated with “safety”, so I’m only covering them briefly. However, they are definitely relevant to safety, so I wanted to at least mention them.
8.1 Fairness
Understanding and mitigating bias and discrimination in models.
8.2 User interface design
Designing interfaces that support trust and effective oversight.
8.3 Human-in-the-loop
Designing AI systems that collaborate well with humans (not necessarily end users).
8.4 Societal impact
Understanding and guiding the long-term impact of AI on society. Includes impacts on climate, democracy, and the economy.
Thanks for doing this, Ben!
Readers: Here's a spreadsheet with the above taxonomy, plus some columns that I'm hoping we can collectively populate with useful pointers for each topic:
For security reasons, I have not made it 'editable', but please comment on the sheet and I'll come by in a few days and update the cells.
Nice! We did something similar last year; you could check how well our taxonomies align and where they differ. We also linked to various past taxonomies/overviews like this in that paper.
Thanks for sharing! This looks great. One big difference I see is that yours is focused on work being done at AI companies, while mine is somewhat biased towards academia since that's my background. Yours is also much more thorough, of course :)