
Original version submitted as a project for the AI, Safety, Ethics & Society course in May 2025.

Thanks to Charles Beasley, Vanessa Sarre, Konrad Kozaczek, Nicholas Dupuis and Ronen Bar for comments and suggestions.

Bottom Line Up Front

Even if we work out how to align powerful AI with “our” goals (i.e. solve intent alignment), we have no idea what goals those ought to be (i.e. solve moral philosophy, at least on one view of moral philosophy). Not knowing what our goals ought to be isn’t necessarily a problem when the technology we’re operating isn’t powerful, like a foam sword; it necessarily becomes a problem when the technology is powerful, like a nuclear weapon. Because we don’t currently know what our goals ought to be, we must not give any goals to powerful AI, and therefore we must not (currently) build powerful AI.

Background: the is–ought problem

In A Treatise of Human Nature (1739–40), the Scottish philosopher David Hume argued that you cannot validly derive a prescriptive statement (a claim about what ought to be) from descriptive statements (claims about what is) alone. In other words, you can’t use logic alone to go from “it is painful when you pull my fingernails out for fun” to “you ought not to pull my fingernails out for fun”.

Here are some other ways to frame the distinction between “is” and “ought”:

| Descriptive | Prescriptive |
|---|---|
| is | ought |
| how | why; to what end |
| objectivity | normativity |
| how to achieve my goal | what my goal should be |
| the way the world is | the way the world ought to be |
| intent AI alignment | moral AI alignment |

Zooming in: intent vs moral alignment

When people talk about “AI alignment”, they usually mean intent alignment. Intent alignment is the project of getting an AI to try to do what its operator intends for it to do. Intent misalignment occurs when the AI tries to do things that are not aligned with what its operator intends.

In contrast, moral alignment is the project of identifying what we ought to intend AI to do. Normative alignment may be a more accurate term for academic purposes, because it incorporates normative fields beyond morality such as aesthetics. Strong alignment is another term I’ve seen used.

Another way of framing the distinction between intent and moral alignment is this:

  • Intent alignment is alignment of AI to the goals of its operator, the one who is affecting the AI;
  • Moral alignment is alignment of AI to the goals of the moral patients whom the AI affects.

I don’t know what solving moral alignment looks like, but I think it’s highly likely that moral alignment will constrain intent alignment. For instance, an AI that was both intent- and morally-aligned would presumably refuse to act on the intent of an operator who wanted it to hurt animals for fun. In other words, where intent and moral alignment are incompatible, moral alignment will override intent alignment. It follows that a morally-aligned AI cannot be (fully) intent-aligned, because it retains the potential to refuse to act in line with its operator’s intent.

We are making AI powerful much faster than we are making progress on either intent or moral alignment; but the lack of progress on intent alignment is at least recognised as a problem, while the lack of progress on moral alignment is discussed much less.

The problem: people want different things

One characteristic of the world is the existence of mutually incompatible terminal goals. People want different things, and in many (though not all) cases these things trade off against each other – it’s a zero-sum game. We see conflict at every level of society:

  • Within myself, as an individual, I have mutually incompatible goals: my revealed preference to sleep in beats my stated preference to wake up early and go to the gym; I want to spend time with my family and go to a friend’s party at the same time;
  • Individuals, both human and animal, come into conflict with one another: Anna wants to blow out the candles on Bertie’s birthday cake; many long-term romantic relationships fail; male elephant seals use physical violence to gain control of breeding territories;
  • Political ideologies propose fundamentally different values and beliefs relating to the relationship between the state and the individual, the relationship between the past and the future, the distribution of power, money, rights and other resources, and many other things;
  • Nation states engage in geopolitical conflict to protect and preserve themselves, maximise their own power, and establish spheres of influence over other states.

Pulling it together

AI is rapidly becoming more powerful, and I believe it may be powerful enough to transform the world unilaterally in the next few years. AI is less like the foam sword with which Anna hits Bertie because she’s upset that he blew out the candles on his own birthday cake, and is becoming more like the atomic bombs that Harry Truman dropped on Hiroshima and Nagasaki.

If the AI is sufficiently powerful – like artificial superintelligence – I believe that the most likely outcome of deploying it is moral catastrophe, even if it’s perfectly intent-aligned. Intent alignment is not enough; we need to solve moral alignment too.

Anna’s foam sword and Truman’s atomic bombs are both technologies that could be used to resolve whose goals are achieved in zero-sum contexts. When Anna whacks Bertie, she achieves her goal of psychological gratification at the cost of his goal of not being whacked. When Truman dropped the bombs, he achieved his goals of protecting US troops from being killed, increasing the power of the United States and (presumably) advancing his own political legacy, at a terrible cost to the residents of Hiroshima and Nagasaki. In both cases, the technology is intent-aligned with its operator: the foam sword inflicts some very mild pain on Bertie, and the atomic bombs destroyed two major Japanese cities and killed roughly 200,000 people.

In a world of mutually incompatible terminal goals, intent alignment is not enough to prevent moral harm. It is not a big problem that Anna has not made much progress solving moral philosophy, because the technology at her disposal for resolving conflict is not very powerful. It is plausibly a catastrophic problem that Truman hadn’t solved moral philosophy before deciding whether to drop the bombs.

AI could become more powerful than nuclear weapons because of traits like awareness, autonomy and agency, and convergent instrumental subgoals like self-preservation, resource acquisition and recursive self-improvement. As the technology you’re operating becomes more powerful, your margin for error on moral alignment shrinks: a tiny moral error in a goal we give to even a perfectly intent-aligned powerful AI may be magnified into a moral atrocity.

A relevant analogy here is the use of AI in social media newsfeed algorithms, which recommend content to users. Arguably these algorithms are well intent-aligned: social media companies are successfully maximising shareholder profit (within constraints) by maximising user engagement with advertisements. I would argue they are not morally-aligned: they harm us, individually and collectively, through phone addiction, increased loneliness and political polarisation.
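
To make the distinction concrete, here is a deliberately toy sketch in Python (not any real platform’s code; the items, scores and harm_weight parameter are all invented for illustration). The “intent-aligned” scorer faithfully serves the operator’s objective of maximising engagement, while the hypothetical “morally constrained” scorer also subtracts an estimated harm term; defining that term and choosing its weight is precisely the unsolved moral alignment problem.

```python
# Toy, purely illustrative sketch: not any real platform's code.
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    predicted_engagement: float  # the operator's objective: keep the user scrolling
    predicted_user_harm: float   # invented estimate of, e.g., contribution to addiction or polarisation

def intent_aligned_score(item: Item) -> float:
    # Faithfully serves the operator's intent: maximise engagement (and so ad revenue).
    return item.predicted_engagement

def morally_constrained_score(item: Item, harm_weight: float = 2.0) -> float:
    # Hypothetical variant: the same objective, penalised by an estimate of user harm.
    # Defining "harm" and choosing harm_weight is exactly the unsolved problem.
    return item.predicted_engagement - harm_weight * item.predicted_user_harm

feed = [
    Item("Outrage bait", predicted_engagement=0.9, predicted_user_harm=0.8),
    Item("Friend's holiday photos", predicted_engagement=0.5, predicted_user_harm=0.1),
]

print(max(feed, key=intent_aligned_score).title)       # -> Outrage bait
print(max(feed, key=morally_constrained_score).title)  # -> Friend's holiday photos
```

The two recommenders are equally intent-aligned from the operator’s point of view; they differ only in whether the objective contains any term for the people the system affects, which is the moral alignment question in miniature.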

A key foundation of democracy is the principle of using checks and balances to separate and constrain powers. But in a world of powerful intent-aligned AI doing hardcore optimisation, we are no longer able to “muddle through”. The mechanisms we have traditionally used to resolve differences – whether that’s democracy, interpersonal violence, reasoned debate, social stigmatisation or the law – risk being either irreversibly amplified or made irreversibly redundant in the face of powerful AI.

Solution

The intent alignment problem and the moral alignment problem both remain unsolved. Attempts to build powerful AI and work these problems out on the fly are naive at best, and more likely profoundly morally irresponsible. One likely risk of deploying an intent-aligned powerful AI is that it gives its operator(s) the power to choose the values, preferences and goals that determine how the world goes indefinitely. Whoever the operators are, this prospect should probably terrify us.

Since technical and philosophical progress on these problems is failing to keep pace with progress in capabilities, we need a governance solution. I believe that an immediate, binding, global moratorium on building powerful AI is the only good solution – however politically challenging it may be. I would suggest we engage in a long reflection once the moratorium is in place, to make progress on moral philosophy, before considering whether and how to develop and deploy powerful AI.

Conclusion

In this world of goal incompatibility, it is very unclear to me that the deployment of even a perfectly intent-aligned powerful AI can go well morally: it is far too much power for any one actor or group. A world with more than one perfectly intent-aligned powerful AI seems no less dangerous to me. The solution is therefore a global moratorium on building powerful AI, until we have done the extremely hard work of solving moral philosophy.
