
Interpretability is the degree to which humans or other outside observers can understand the decision processes and inner workings of AI and machine learning systems.[1]

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model's output, but the model cannot tell you why it produced that output. This makes it hard to determine the causes of biases in ML models.[1]
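The contrast between using a model's output and understanding why it was produced can be illustrated with a minimal sketch. The toy model, weights, and the gradient-times-input attribution below are illustrative assumptions, not drawn from any source cited here; for a linear model the gradient is simply the weight vector, which is what makes the example easy to inspect.

```python
import numpy as np

# Toy linear "model" with made-up weights, for illustration only.
weights = np.array([2.0, -1.0, 0.0])  # one weight per input feature
x = np.array([1.0, 1.0, 1.0])         # an example input

def predict(x):
    # Black-box view: a single number, with no explanation attached.
    return float(weights @ x)

y = predict(x)

# White-box view: gradient-times-input attribution, a simple
# interpretability technique. For this linear model the gradient
# with respect to the input is just the weight vector, so each
# feature's contribution to the output can be read off directly.
attribution = weights * x

print(y)                  # the opaque output
print(attribution)        # per-feature contributions that sum to it
```

Deep neural networks are far harder to interpret than this linear toy, which is precisely the gap that interpretability research aims to close.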

Interpretability is a central focus of Chris Olah's and Anthropic's work, though most AI alignment organisations, such as Redwood Research, work on it to some extent.[2]


