Scheming AIs: Will AIs fake alignment during training in order to get power?

This is an EA Forum sequence version of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”, available on arXiv here: https://arxiv.org/pdf/2311.08379.pdf. It’s a long report, and I’m hoping that having shorter sections available as separate posts will make them easier to digest, reference, and comment on.

The first post in the sequence contains a summary of the full report. The summary covers most of the main points and technical terms, and I'm hoping it will provide much of the context necessary to understand individual sections of the report on their own.

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Varieties of fake alignment (Section 1.1 of “Scheming AIs”)

A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)

Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)

On “slack” in training (Section 1.5 of “Scheming AIs”)

Situational awareness (Section 2.1 of “Scheming AIs”)

Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)

Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)

How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of “Scheming AIs”)

The goal-guarding hypothesis (Section 2.3.1.1 of “Scheming AIs”)

Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of “Scheming AIs”)

Non-classic stories about scheming (Section 2.3.2 of “Scheming AIs”)

Arguments for/against scheming that focus on the path SGD takes (Section 3 of “Scheming AIs”)

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Speed arguments against scheming (Sections 4.4-4.7 of “Scheming AIs”)

Summing up “Scheming AIs” (Section 5)

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)