P(misalignment x-risk | AGI) is high.
Intent alignment should not be the goal for AGI x-risk reduction. If AGI is developed, and we solve AGI intent alignment, we will not have lowered x-risk sufficiently, and we may even have raised it above what it would otherwise have been.
P(misalignment x-risk | intent-aligned AGI) >> P(misalignment x-risk | societally-aligned AGI).
The goal of AI alignment should be alignment with (democratically determined) societal values (because these have broad buy-in from humans).
P(misalignment x-risk | AGI) is higher if intent alignment is solved before societal-AGI alignment.
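One way to make this claim explicit is the following sketch. The notation is an assumption introduced for illustration and is not part of the original argument: let x stand for misalignment x-risk, let I be the event that intent alignment is solved first and S the event that societal-AGI alignment is solved first, and assume for simplicity that exactly one of the two happens. The law of total probability then gives:

```latex
\[
P(x \mid \mathrm{AGI})
  = P(x \mid I, \mathrm{AGI})\, P(I \mid \mathrm{AGI})
  + P(x \mid S, \mathrm{AGI})\, P(S \mid \mathrm{AGI}),
\qquad
P(I \mid \mathrm{AGI}) + P(S \mid \mathrm{AGI}) = 1 .
\]
```

Under this simplification, if P(x | I, AGI) > P(x | S, AGI), as argued here, then any shift of probability from I to S lowers P(x | AGI).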
Most technical AI alignment research is currently focused on solving intent alignment. The (usually implicit, sometimes explicit) assumption is that solving intent alignment will help subsequently solve societal-AGI alignment. This would only be the case if all the humans that had access to intent-aligned AGI had the same intentions (and did not have any major conflicts between them); that is highly unlikely.
Solving intent alignment is likely to make practically implementing societal-AGI alignment harder. If we solve intent alignment before societal alignment, humans with intent-aligned AGIs are likely to be incentivized to inhibit the development and roll-out of societal-AGI alignment techniques, because adopting them would mean giving up significant power. Furthermore, humans with intent-aligned AGIs would suddenly have significantly more power, and their advantages over others would likely compound.
To minimize P(misalignment x-risk | AGI) we should work on technical solutions to societal-AGI alignment, in which AIs internalize a distilled and routinely updated constellation of shared values, as determined by deliberative democratic processes and authoritative conflict-resolution mechanisms, both driven entirely by humans (and not AI). Humans already have these processes and mechanisms (and they are well developed in the nation with the highest probability of producing AGI, the U.S.).
We need to do the work to internalize these things in AI systems. Work toward intent alignment diverts resources from technical work on societal-AGI alignment (at best), and, if intent-aligned AGI is developed first, actively makes finishing that work harder (at worst).
If societal-AGI alignment is solved before intent alignment, then there will be powerful societally-aligned AGI that can reduce the probability of intent-aligned AGIs being developed and/or having negative impacts.
We don't yet have a solution for societal-AGI-alignment or intent-AGI-alignment, and both are very hard problems. This post is intended to raise questions about where/when to devote development resources.
Cullen O’Keefe summarized intent alignment well in this Alignment Forum post.
The standard definition of "intent alignment" generally concerns only the relationship between some property of a human principal H and the actions of the human's AI agent A:
- Jan Leike et al. define the "agent alignment problem" as "How can we create agents that behave in accordance with the user's intentions?"
- Amanda Askell et al. define "alignment" as "the degree of overlap between the way two agents rank different outcomes."
- Paul Christiano defines "AI alignment" as "A is trying to do what H wants it to do."
- Richard Ngo endorses Christiano's definition.
Iason Gabriel does not directly define "intent alignment," but provides a taxonomy wherein an AI agent can be aligned with:
1. "Instructions: the agent does what I instruct it to do."
2. "Expressed intentions: the agent does what I intend it to do."
3. "Revealed preferences: the agent does what my behaviour reveals I prefer."
4. "Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
5. "Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
6. "Values: the agent does what it morally ought to do, as defined by the individual or society."
All but (6) concern the relationship between H and A. It would therefore seem appropriate to describe them as types of intent alignment.
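To make the Askell et al. definition quoted above ("the degree of overlap between the way two agents rank different outcomes") more concrete, here is a minimal sketch of one possible way to score such overlap. It is an illustrative assumption, not the measure those authors use: the function name `rank_overlap` and the pairwise-agreement scoring are hypothetical choices made for this example.

```python
from itertools import combinations


def rank_overlap(ranking_a, ranking_b):
    """Fraction of outcome pairs that both agents order the same way.

    Each ranking is a list of the same outcomes, most preferred first.
    Hypothetical illustration only; not a measure taken from the sources.
    """
    pos_a = {outcome: i for i, outcome in enumerate(ranking_a)}
    pos_b = {outcome: i for i, outcome in enumerate(ranking_b)}
    if set(pos_a) != set(pos_b):
        raise ValueError("both agents must rank the same set of outcomes")

    pairs = list(combinations(ranking_a, 2))
    if not pairs:
        return 1.0  # zero or one outcome: nothing to disagree about

    # A pair counts as agreement when both agents prefer the same member.
    agreements = sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs
    )
    return agreements / len(pairs)


human = ["cure disease", "status quo", "catastrophe"]
agent = ["cure disease", "catastrophe", "status quo"]
score = rank_overlap(human, agent)  # two of the three pairs agree
```

A score of 1.0 means the two agents order every pair of outcomes identically, and 0.0 means they disagree on every pair. Note that this only compares one principal with one agent, which is exactly the H-to-A relationship the definitions above are concerned with.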
Two examples from Alignment Forum posts:
Related post: AGI misalignment x-risk may be lower due to an overlooked goal specification technology.