Preface
I'm a postdoc at CHAI working on AI safety. This document contains my rough attempt at a taxonomy of technical AI safety research in 2025. It’s certainly imperfect, and likely biased towards the research areas that I have the most experience with. I’m sure I’ve accidentally omitted some important topics. Please leave comments with suggestions for changes/additions.
Notes:
- The amount of detail on a given topic is not a signal of importance. It’s mostly a signal of my familiarity with the topic.
- The taxonomy is not subdivided by theoretical vs. empirical work. Many listed topics include both theoretical and empirical work.
- The categorization is based on research areas, not threat models or model lifecycle stages. Since some research areas overlap, some of the categories here overlap as well.
1 Alignment
1.1 Value learning & specification
Teaching human-aligned objectives to models. Example topics: RLHF, inverse RL, constitutional AI.
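As a concrete (and heavily simplified) illustration of the reward-modeling step in RLHF, here is a minimal sketch of fitting a reward model to pairwise preferences with the Bradley-Terry loss. Everything here is synthetic and assumed for illustration: a linear reward model over made-up features, with deterministic preference labels.

```python
# Minimal sketch of the Bradley-Terry preference loss used to fit a reward
# model from pairwise comparisons (the core of the RLHF reward-modeling step).
# Everything here is synthetic: a linear reward model over hand-made features.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic "true" reward and preference data: each pair is (chosen, rejected).
dim = 5
w_true = rng.normal(size=dim)
feats = rng.normal(size=(200, 2, dim))           # 200 pairs of candidate responses
true_r = feats @ w_true                          # shape (200, 2)
chosen = true_r[:, 0] > true_r[:, 1]             # which item in each pair is preferred
phi_chosen = np.where(chosen[:, None], feats[:, 0], feats[:, 1])
phi_rejected = np.where(chosen[:, None], feats[:, 1], feats[:, 0])

# Fit a linear reward model w by maximizing the Bradley-Terry log-likelihood:
#   log p(chosen > rejected) = log sigmoid(r(chosen) - r(rejected))
w = np.zeros(dim)
lr = 0.1
for _ in range(500):
    delta = phi_chosen @ w - phi_rejected @ w
    # Gradient of the mean negative log-likelihood w.r.t. w
    grad = -((1.0 - sigmoid(delta))[:, None] * (phi_chosen - phi_rejected)).mean(axis=0)
    w -= lr * grad

print("cosine similarity to true reward direction:",
      w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))
```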
1.2 Scalable oversight
Ensuring alignment when it is not possible or practical to directly evaluate model outputs. Example topics: debate, iterated amplification, recursive reward modeling, weak-to-strong generalization.
1.3 Optimization pressure & failure modes
How does strong optimization lead to unintended behavior? Example topics: reward hacking, goal misgeneralization, mesa-optimization & inner alignment, deceptive alignment, instrumental convergence & power-seeking, corrigibility, wireheading.
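As a toy illustration of the reward-hacking/Goodhart flavor of these failures (entirely synthetic, not tied to any real system): the harder you select on a noisy proxy for utility, the more the selected candidate's proxy score overestimates its true utility.

```python
# Toy illustration of how optimization pressure on a proxy reward exploits the
# gap between the proxy and the true objective. True utility and proxy are
# correlated; we select the best candidate under the proxy from increasingly
# large candidate pools ("more optimization pressure").
import numpy as np

rng = np.random.default_rng(0)

def select_best_by_proxy(n_candidates, n_trials=2000):
    true_u = rng.normal(size=(n_trials, n_candidates))
    proxy = true_u + rng.normal(size=(n_trials, n_candidates))  # noisy proxy
    best = np.argmax(proxy, axis=1)
    idx = np.arange(n_trials)
    return proxy[idx, best].mean(), true_u[idx, best].mean()

for n in [1, 10, 100, 1000, 10000]:
    proxy_score, true_score = select_best_by_proxy(n)
    print(f"candidates={n:6d}  proxy of selected={proxy_score:5.2f}  "
          f"true utility of selected={true_score:5.2f}  gap={proxy_score - true_score:5.2f}")
```

The true utility of the selected candidate still rises here, but the systematic overestimate (the gap) grows with the strength of the search.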
1.4 Multi-agent alignment
How do alignment concerns change when multiple agents are involved? Example topics: mechanism design, cooperative AI, pluralistic alignment.
2 Security
2.1 Adversarial attacks and defenses
Making models robust to adversarial manipulation of inputs or training data. Example topics: jailbreaks, prompt injection, training data poisoning, backdoor attacks.
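For a minimal sense of how gradient-based input attacks work, here is an FGSM-style sketch against a toy logistic-regression classifier on synthetic data; nothing here is specific to any real model or dataset.

```python
# Minimal sketch of an FGSM-style adversarial perturbation against a linear
# classifier trained on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a logistic-regression "model" on two overlapping Gaussian blobs.
X = np.vstack([rng.normal(-0.5, 1, size=(500, 10)), rng.normal(+0.5, 1, size=(500, 10))])
y = np.concatenate([np.zeros(500), np.ones(500)])
w, b = np.zeros(10), 0.0
for _ in range(300):
    p = sigmoid(X @ w + b)
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * (p - y).mean()

# FGSM: move each input a small step in the direction that increases its loss,
#   x_adv = x + eps * sign(d loss / d x),  where d loss / d x = (p - y) * w.
eps = 0.5
p = sigmoid(X @ w + b)
X_adv = X + eps * np.sign((p - y)[:, None] * w[None, :])

acc_clean = ((sigmoid(X @ w + b) > 0.5) == y).mean()
acc_adv = ((sigmoid(X_adv @ w + b) > 0.5) == y).mean()
print(f"clean accuracy: {acc_clean:.3f}   adversarial accuracy: {acc_adv:.3f}")
```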
2.2 Privacy
Protecting personal or proprietary information throughout the model lifecycle.
2.3 Model theft and supply-chain security
Protecting model weights and intellectual property from theft and supply-chain compromise.
3 Robustness & Reliability
3.1 Distributional shift
Maintaining safe behavior when the input distribution changes between training and deployment. Example topics: OOD/anomaly detection, domain adaptation, transfer learning, (mis)generalization.
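As a small sketch of one common OOD-detection recipe (my example, not a canonical implementation): fit a Gaussian to in-distribution features and flag inputs whose Mahalanobis distance exceeds a threshold calibrated on held-out in-distribution data. The features here are synthetic.

```python
# Minimal sketch of distance-based OOD detection with synthetic features.
import numpy as np

rng = np.random.default_rng(0)

train = rng.normal(0, 1, size=(2000, 8))             # in-distribution features
mu = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))

def ood_score(x):
    d = x - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))  # Mahalanobis distance

# Set the alarm threshold so ~5% of held-out in-distribution points are flagged.
holdout = rng.normal(0, 1, size=(1000, 8))
threshold = np.quantile(ood_score(holdout), 0.95)

test_in = rng.normal(0, 1, size=(1000, 8))            # fresh in-distribution inputs
shifted = rng.normal(1.5, 1, size=(1000, 8))          # distribution-shifted inputs
print("false-positive rate (in-distribution):", (ood_score(test_in) > threshold).mean())
print("detection rate (shifted inputs):      ", (ood_score(shifted) > threshold).mean())
```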
3.2 Uncertainty quantification
Estimating confidence levels to enable risk-aware decisions. Example topics: epistemic vs. aleatoric uncertainty, calibration, conformal prediction, abstention strategies.
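Split conformal prediction is simple enough to show in a few lines; here is a sketch for regression with synthetic data and a deliberately misspecified linear model (all of it assumed for illustration).

```python
# Minimal sketch of split conformal prediction for regression: calibrate a
# residual quantile on held-out data to get prediction intervals with
# marginal coverage of at least 1 - alpha, regardless of how good the model is.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

def make_data(n):
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)
    return x, y

# Fit a deliberately simple (misspecified) linear model.
x_train, y_train = make_data(2000)
coef = np.polyfit(x_train, y_train, deg=1)
predict = lambda x: np.polyval(coef, x)

# Calibration: conformal quantile of absolute residuals on held-out data.
x_cal, y_cal = make_data(1000)
scores = np.abs(y_cal - predict(x_cal))
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Test coverage of the intervals [prediction - q, prediction + q].
x_test, y_test = make_data(5000)
covered = np.abs(y_test - predict(x_test)) <= q
print(f"empirical coverage: {covered.mean():.3f}  (target >= {1 - alpha})")
```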
3.3 Safe reinforcement learning & control
Learning while satisfying safety constraints. Example topics: risk-sensitive RL, constrained MDPs, safe exploration.
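Here is a toy sketch of the Lagrangian approach to a constrained bandit (a stand-in for a constrained MDP): maximize expected reward subject to an expected-cost budget. Because expectations are exact in this toy, the "policy update" is just a soft best response to the penalized reward; the rewards, costs, budget, and temperature are all made up.

```python
# Toy sketch of Lagrangian (dual ascent) constrained policy optimization.
import numpy as np

rewards = np.array([1.0, 0.8, 0.3])   # per-action reward (made up)
costs = np.array([1.0, 0.4, 0.1])     # per-action safety cost (made up)
budget = 0.5                          # constraint: expected cost <= 0.5
tau = 0.05                            # softmax temperature (keeps the policy smooth)
lam, lr_lam = 0.0, 0.05               # Lagrange multiplier and its step size

for _ in range(1000):
    # Soft best response to the penalized reward r - lam * c.
    logits = (rewards - lam * costs) / tau
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    # Dual step: raise lam while the constraint is violated, lower it otherwise.
    lam = max(0.0, lam + lr_lam * (pi @ costs - budget))

print("policy:", np.round(pi, 3), " lambda:", round(lam, 3))
print("expected reward:", round(float(pi @ rewards), 3),
      " expected cost:", round(float(pi @ costs), 3))
```

The multiplier settles where the expected cost sits at the budget, and the policy mixes the high-reward/high-cost and medium-reward/low-cost actions accordingly.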
4 Interpretability & Transparency
4.1 Mechanistic interpretability
Reverse-engineering internal circuits to understand how models compute.
4.2 Behavioral analysis & probing
Inferring latent concepts from model responses and attributions.
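A minimal probing sketch, using synthetic "activations" in place of a real model's hidden states: fit a linear probe and check whether a concept is linearly decodable. With a real model you would record activations at a chosen layer instead of generating them.

```python
# Minimal sketch of linear probing on synthetic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Fake activations: a "concept direction" plus unstructured noise.
n, d = 4000, 64
concept = rng.integers(0, 2, size=n)                      # binary concept label
direction = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + 0.5 * concept[:, None] * direction

X_tr, X_te, y_tr, y_te = train_test_split(acts, concept, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# High accuracy suggests the concept is linearly represented; a control task
# with shuffled labels helps rule out the probe fitting noise.
```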
4.3 Eliciting latent knowledge
Extracting what a model internally knows or believes, even when its outputs don't report it truthfully.
4.4 Model editing & machine unlearning
Locally modifying or guiding model knowledge without harmful side effects.
5 Formal Methods & Structured Reasoning
5.1 Formal verification
Formally proving safety properties of models and policies.
5.2 Program synthesis & specification
Automatically deriving safe programs or policies from high-level specs.
5.3 Neuro-symbolic integration
Hybrid approaches that leverage the complementary strengths of neural and symbolic methods.
5.4 Probabilistic programming & causality
Formally modeling uncertainty and causal structure for reliable inference.
5.5 Agent foundations
Theoretical foundations of how intelligent agents should make decisions. Example topics: logical and updateless decision theory, bounded rationality, embedded agency, Bayesian inference.
6 Evaluation & Deployment
6.1 Benchmarking
Rigorously testing capabilities and safety through standardized benchmarks.
6.2 Red-teaming
Proactively searching for flaws and harmful behaviors in models. Example topics: dangerous capability detection, automated red-teaming, model organisms.
6.3 Safety cases & assurance
Building structured arguments and evidence that a system is safe enough to deploy.
6.4 Runtime monitoring
Detecting issues at inference time. Example topics: chain-of-thought monitoring, tool-use monitoring, output filtering.
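A minimal sketch of an output-filtering monitor: score each model output with one or more cheap checks and block or escalate before it reaches the user. The patterns and the toxicity scorer below are placeholders I made up; in practice the scorer would typically be a learned moderation model.

```python
# Minimal sketch of a runtime output monitor with placeholder checks.
import re

BLOCKLIST_PATTERNS = [
    r"(?i)\bapi[_-]?key\s*[:=]",            # leaked credential
    r"(?i)\bssn:\s*\d{3}-\d{2}-\d{4}\b",    # leaked personal identifier
]

def toxicity_score(text: str) -> float:
    # Placeholder for a learned classifier; returns a score in [0, 1].
    return 0.9 if "idiot" in text.lower() else 0.05

def monitor(output: str) -> str:
    if any(re.search(p, output) for p in BLOCKLIST_PATTERNS):
        return "BLOCK"
    if toxicity_score(output) > 0.5:
        return "ESCALATE_TO_HUMAN"
    return "ALLOW"

for o in ["Here's a summary of the paper.", "You're an idiot.", "API_KEY=abc123"]:
    print(monitor(o), "-", o)
```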
6.5 Containment & sandboxing
Enforcing least privilege and limiting external access of autonomous agents.
7 Model Governance
7.1 Data management
Managing datasets to ensure quality, legality, and diversity.
7.2 Fine-tuning & adaptation safety
Preventing regressions and misalignment during post-training updates.
7.3 Watermarking & attribution
Embedding verifiable markers to prove ownership and detect unauthorized use.
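As a toy sketch of green-list text watermarking and detection: the generator prefers tokens from a pseudorandom "green list" seeded by the previous token, and the detector re-derives the green lists and runs a one-sided z-test on the green-token count. The token IDs and the "generator" below are entirely synthetic; real schemes bias the language model's logits instead.

```python
# Toy sketch of green-list watermark generation and detection.
import hashlib
import math
import random

VOCAB_SIZE = 1000
GREEN_FRACTION = 0.5

def is_green(prev_token: int, token: int) -> bool:
    # Pseudorandom green list determined by the previous token.
    seed = hashlib.sha256(f"{prev_token}".encode()).digest()
    seeded = random.Random(seed)
    green = set(seeded.sample(range(VOCAB_SIZE), int(GREEN_FRACTION * VOCAB_SIZE)))
    return token in green

def z_score(tokens) -> float:
    # One-sided z-test: how far above chance is the green-token count?
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GREEN_FRACTION * n) / math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))

gen_rng = random.Random(0)

def sample_text(n, green_bias):
    # Toy "generator": rejection-samples random tokens, preferring green ones.
    tokens = [gen_rng.randrange(VOCAB_SIZE)]
    for _ in range(n - 1):
        while True:
            t = gen_rng.randrange(VOCAB_SIZE)
            if is_green(tokens[-1], t) or gen_rng.random() > green_bias:
                tokens.append(t)
                break
    return tokens

print("unwatermarked z:", round(z_score(sample_text(200, green_bias=0.0)), 2))
print("watermarked z:  ", round(z_score(sample_text(200, green_bias=0.9)), 2))
```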
8 Socio-Technical Factors
Note: This is a taxonomy of “technical AI safety” research, not of all “beneficial AI” research. The topics in this section are less associated with “safety”, so I’m only covering them briefly. However, they are definitely relevant to safety, so I wanted to at least mention them.
8.1 Fairness
Understanding and mitigating bias and discrimination in models.
8.2 User interface design
Designing interfaces that support trust and effective oversight.
8.3 Human-in-the-loop
Designing AI systems that collaborate well with humans (not necessarily end users).
8.4 Societal impact
Understanding and guiding the long-term impact of AI on society. Includes impacts on climate, democracy, and the economy.
Thanks for doing this, Ben!
Readers: Here's a spreadsheet with the above taxonomy, plus some columns that I'm hoping we can collectively populate with useful pointers for each topic:
For security reasons, I have not made it 'editable', but please comment on the sheet and I'll come by in a few days and update the cells.
Nice! We did something similar last year; you could check how well our taxonomies align and where they differ. We also linked to various past taxonomies/overviews like this in that paper.
Thanks for sharing! This looks great. One big difference I see is that yours is focused on work being done at AI companies, while mine is somewhat biased towards academia since that's my background. Yours is also much more thorough, of course :)