Intent alignment should not be the goal for AGI x-risk reduction

johnjnay

Oct 26, 2022

Why does solving intent alignment not lower x-risk sufficiently?

If we solve the intent alignment problem between a human, H, and an AI, A, then A implements H’s intentions with super-human intelligence and skill.

There are multiple Hs and multiple As.

By the very nature of humans, there are conflicts in the intentions of the Hs.

Humans have conflicting preferences about the behavior of other humans and about states of the world more broadly. Intent-aligned As would thus have different intentions from one another.

The As execute actions furthering their Hs' intentions far too quickly for those conflicts to be resolved through any existing human-driven conflict resolution process. Conflicts are thus likely to spiral out of control.

Any ultimate conflict resolution mechanism needs to be human-driven. No A can conduct the conflict resolution work because it does not have buy-in from all Hs (or their intent-aligned As). Affected Hs need to endorse the process and respect the outcome. That only happens with democratic procedures.

Therefore, if we solve intent alignment, we do not solve the problem of AGI being sufficiently beneficial to humans. We do not drastically reduce P(misalignment x-risk), because there will still be misalignment between many of the AGI systems and many of the humans. That level of conflict among powerful agents could be existential for humanity as a whole.
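The argument above can be illustrated with a toy sketch. This is not a model from the post: the outcome names, the two rankings, and the `ranking_overlap` helper are all hypothetical, and the overlap measure is only loosely inspired by the Askell et al. definition quoted in Appendix A. It shows that when each A perfectly mirrors its own H, the As disagree with each other exactly as much as the Hs do.

```python
from itertools import combinations


def ranking_overlap(rank_a, rank_b):
    """Fraction of outcome pairs that two rankings order the same way."""
    pos_a = {outcome: i for i, outcome in enumerate(rank_a)}
    pos_b = {outcome: i for i, outcome in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    agree = sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs
    )
    return agree / len(pairs)


# Hypothetical preferences: two humans who rank the same outcomes in
# opposite order (an assumption made purely for illustration).
human_rankings = {
    "H1": ["outcome_x", "outcome_y", "outcome_z"],
    "H2": ["outcome_z", "outcome_y", "outcome_x"],
}

# Assumed "solved" intent alignment: each A adopts its own H's ranking.
agent_rankings = {
    "A1": list(human_rankings["H1"]),
    "A2": list(human_rankings["H2"]),
}

h1_a1 = ranking_overlap(human_rankings["H1"], agent_rankings["A1"])
a1_a2 = ranking_overlap(agent_rankings["A1"], agent_rankings["A2"])
h2_a1 = ranking_overlap(human_rankings["H2"], agent_rankings["A1"])

print(f"H1 vs A1: {h1_a1}")  # each A fully aligned with its own H
print(f"A1 vs A2: {a1_a2}")  # the As inherit the Hs' conflict
print(f"H2 vs A1: {h2_a1}")  # A1 is misaligned with the other human
```

In this sketch, perfect alignment within each H-A pair does nothing to reduce disagreement across pairs, which is the sense in which intent alignment leaves the conflict between Hs untouched.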

Then what should we be aiming for?

To minimize P(misalignment x-risk | AGI), we should work on technical solutions to societal-AGI alignment, in which As internalize a distilled and routinely updated constellation of shared values. Those values are determined by deliberative democratic processes and authoritative conflict resolution mechanisms, both driven entirely by humans (and not AI). Humans already have these processes and mechanisms (and they are well-developed in the nation with the highest probability of producing AGI, the U.S.).

We need to do the work to internalize these things in AI systems. At best, work toward intent alignment diverts resources from technical work on societal-AGI alignment. At worst, if intent-aligned AGI is developed first, it actively makes finishing the societal-AGI alignment work harder.

If societal-AGI alignment is solved before intent alignment is solved, then there will be powerful societally-aligned AGI that can reduce the probability of intent-aligned AGIs being developed and/or having negative impacts.

Appendix A: What is intent-AGI alignment?

Cullen O’Keefe summarized intent alignment well in this Alignment Forum post.

The standard definition of "intent alignment" generally concerns only the relationship between some property of a human principal H and the actions of the human's AI agent A:

Jan Leike et al. define the "agent alignment problem" as "How can we create agents that behave in accordance with the user's intentions?"
Amanda Askell et al. define "alignment" as "the degree of overlap between the way two agents rank different outcomes."
Paul Christiano defines "AI alignment" as "A is trying to do what H wants it to do."
Richard Ngo endorses Christiano's definition.

Iason Gabriel does not directly define "intent alignment," but provides a taxonomy wherein an AI agent can be aligned with:

1. "Instructions: the agent does what I instruct it to do."
2. "Expressed intentions: the agent does what I intend it to do."
3. "Revealed preferences: the agent does what my behaviour reveals I prefer."
4. "Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
5. "Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
6. "Values: the agent does what it morally ought to do, as defined by the individual or society."

All but (6) concern the relationship between H and A. It would therefore seem appropriate to describe them as types of intent alignment.

Appendix B: What is societal-AGI alignment?

Two examples from Alignment Forum posts:

Coherent Extrapolated Volition is a non-democratic version of societal alignment, where "an AI would predict what an idealized version of us would want, 'if we knew more, thought faster, were more the people we wished we were, had grown up farther together'. It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge."

Law-Informed AI is a democratic version of societal alignment where AGI learns societal values from democratically developed legislation, regulation, court opinions, legal expert human feedback, and more.
