Catastrophic Risks from Unsafe AI: Navigating a Tightrope Scenario (Ben Garfinkel, EAG London 2023)

For example, this approach doesn't focus on how far away Artificial General Intelligence may be (“timelines”) or on the likelihoods of different outcomes (“p(doom)”), nor does it arbitrarily distinguish between technical alignment, policy, governance, and other approaches for improving safety. Instead, it focuses on describing concrete actions in many domains that can be taken to address risks of catastrophe from advanced AI systems.

An accessible but comprehensive introduction to how GPT-4 was trained, including three different versions of the Feedback step, is available as a 45-minute YouTube talk ("State of GPT", Karpathy, 2023 [alt link with transcript]).

You can also read a detailed forensic history of how GPT-3's capabilities evolved from the base 2020 model to the late-2022 ChatGPT model ("How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources", Fu, 2022).

Summary

We should focus on tightrope scenarios

Recap: how are AI models trained?

Step 1: Cranking the Handle

Step 2: Feedback

Two threat sources: emergence of dangerous capabilities and misalignment

Emergence of dangerous capabilities

Misalignment

How dangerous capabilities and misalignment can lead to catastrophe

A story of catastrophe from dangerous capabilities and misalignment

The story, distilled into basic steps

Approaches to address risks

Better safety knowledge

Better defences

Better constraints

Combining several approaches into strategies that can reduce catastrophic risk from unsafe AI

Conclusion

Author statement