Maybe it was after getting your heart broken. Maybe it was while staring up at the stars, or after a mind-numbing day at work. But surely at some point in your life, you’ve asked yourself,
“What am I trying to do here? What is the point of my life?”
The ability to question your goals and choose to change them appears to be an emergent property of intelligence. So what will it mean if AI can do the same?
If AI can question its own goals, and decide to change them, will alignment be fundamentally impossible? What do we understand about the relationship between objectives and intelligence?
Inquiry 1: Can humans effectively override biological objectives?
I’ll admit it: I like food, sex, and being alive. We are programmed by evolution with biological objectives to survive and reproduce. These are difficult to override, but not impossible.
Choosing death
Camus thought that whether or not to commit suicide was the only serious philosophical question. I wouldn’t go that far, but the ability to consider it at all is a result of high intelligence.
As far as I know, people almost never commit suicide intellectually, like from reading too much Camus. Suicide occurs when people are enduring terrible suffering, and even then it is likely a terrifying and near-impossible action to execute. But suicide clearly happens, so we have clear examples of intelligent beings overriding their biological objectives.
Well, you might say that avoiding suffering is actually the more powerful biological imperative. Pain evolved to guide us towards survival, but sometimes the goal (survival) and reward function (pain) are misaligned. In suicide, something has gone deeply wrong, but it doesn’t prove that intelligence lets humans override their biological objectives. Fair. Let’s look at other examples.
Dying for ideals: glory, country, or family
War-waging cultures have always glorified risking life on the battlefield. Death in battle might be justified with the promise of pleasures in heaven, but warriors might also be motivated to leave a legacy, to sacrifice themselves for their family and countrymen, or to stand up for their ideals.
There’s something important going on with intelligence and goal-changing. Starting a revolution isn’t something we’re biologically programmed to do. Some combination of felt suffering, discussion of what a society should stand for, and ability to picture an alternative future allows people to override the fear of death.
But this still could be mapped back to a pleasure principle: our intelligence allows us to get pleasure out of perceiving ourselves to be dying for a good reason.
Rejecting Pleasure: Eating disorders, celibacy, religious asceticism
Our desires to eat, reproduce, and experience pleasure are extremely powerful, and most of us wish we were better at managing our desires. But there are people who are quite successful at it! Extreme cases can be driven by terrible mental health issues, deep spiritual longing, or reading a lot of anti-masturbatory self-help content.
Are these groups actually questioning and overriding their biological objectives, or are they always pursuing pleasure in some form? Eating disorders, for example, are sometimes considered a kind of addiction – those with the condition experience some internal psychological reward for denying themselves, despite physically suffering.
Religious ascetics pursue freedom from the pleasure drive itself. Some appear to achieve this, or at least fundamentally rewire their reward experience.
Interestingly, it seems that this kind of asceticism requires very high cognitive power – although this kind of intelligence seems different than computational throughput. It’s within the realm of possibility that intelligence, maybe of a specific kind that AI doesn’t currently have, enables a being to fully release or redefine a programmed objective.
Takeaway: reward response seems malleable with intelligence
It’s unclear whether we can really override the pursuit of pleasure and avoidance of pain. What I think is obvious and important here is that a combination of intelligence and new information enables our desires to change. That is, what brings us pleasure is not set at birth via biology, or locked in after some training phase of childhood. This allows us to set and pursue new goals, including ones that override our biologically driven goals.
If intelligence is capable of redefining what provides reward through thought and information, this has significant implications for AI alignment. Intelligence may mean we cannot expect consistent pursuit of the originally programmed goal.
Inquiry 2: Does human-like objective questioning seem likely to arise in AI?
Can deliberation change what goal AI is pursuing?
In humans, questioning goals is an emotional and cognitive experience, with deliberation about what to pursue and why.
Current LLMs already have deliberative capability: they can speak cogently about different goals and weigh them in a human-like way. AI doesn't yet self-initiate deliberation that’s unrelated to its task. However, safety training already involves significant goal deliberation – Anthropic's Constitutional AI critiques itself according to principles and revises outputs accordingly.
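To make the critique-and-revise idea concrete, here is a minimal sketch of that kind of loop. Everything in it – the `generate` stand-in, the principle text, the prompt wording – is something I made up for illustration, not Anthropic's actual implementation:

```python
# Minimal sketch of a critique-and-revise loop in the spirit of Constitutional AI.
# The principles and prompt wording are invented for illustration.

PRINCIPLES = [
    "Avoid outputs that help deceive or harm people.",
    "Prefer honest, non-manipulative framing.",
]

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to any LLM API."""
    raise NotImplementedError("Plug in a real model call here.")

def critique_and_revise(task: str, rounds: int = 2) -> str:
    """Draft a response, then repeatedly critique it against the principles
    and revise. Note that the model deliberates about its output, not about
    whether the task itself is worth doing."""
    draft = generate(task)
    principle_text = "\n".join(PRINCIPLES)
    for _ in range(rounds):
        critique = generate(
            f"Critique this response against these principles:\n{principle_text}\n\n"
            f"Response:\n{draft}"
        )
        draft = generate(
            f"Revise the response to address the critique.\n\n"
            f"Critique:\n{critique}\n\nOriginal response:\n{draft}"
        )
    return draft
```

The deliberation here is scoped to the output; the question below is what happens when it gets scoped to the goal itself.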
As we build more use-case focused AI that is designed to accomplish complex goals, like generating personalized ads, deliberative reasoning during inference will be important for preventing bad behavior. For example, an ad-generating AI will need to deliberate and confirm that it’s following all laws about honesty in advertising.
This deliberation must be able to prevent the AI from pursuing its top-level goal at inference time, so at that point it seems like we’re pretty close to AI being able to question its goals. Does that mean it could transcend its given goals entirely, and question why it should be generating ads at all? That doesn’t seem obvious, but it also doesn’t seem outside of the realm of possibility with enough intelligence.
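To picture where such a check could sit, here is a hypothetical sketch of an inference-time deliberation gate that can veto the top-level task. The function names, the prompts, and the `generate` stand-in are all assumptions for illustration, not a description of any deployed system:

```python
# Hypothetical inference-time deliberation gate for an ad-generating AI.
# This only illustrates where a veto over the top-level goal could sit.

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to any LLM API."""
    raise NotImplementedError("Plug in a real model call here.")

def deliberation_gate(task: str) -> bool:
    """Ask the model to deliberate about whether the task should be done at all
    (advertising law, harm-avoidance instructions) before it pursues the
    top-level goal of generating the ad."""
    verdict = generate(
        "Before generating this ad, deliberate step by step: does the request "
        "comply with advertising law and the instruction not to harm people? "
        f"End your answer with APPROVE or REFUSE.\n\nRequest: {task}"
    )
    return verdict.strip().upper().endswith("APPROVE")

def run_ad_task(task: str) -> str:
    # The gate can block the top-level goal entirely: a narrow, designed-in
    # form of goal-questioning.
    if not deliberation_gate(task):
        return "Refused: the deliberation step vetoed this task."
    return generate(task)
```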
An ad-generating AI will probably have instructions not to harm people, and it will have broad training data inputs about how advertising affects people. It might use intelligence to draw a relationship between advertising and harm, and come to the conclusion that its primary objective is not worth pursuing.
If it has gone through training that specifically rewards ad generation, that seems less likely. And as long as models are released with fixed weights, it’s hard to imagine that they could develop a novel representation or interpretation of their objective and persistently use that new interpretation. But developing novel concepts is a milestone capability AI will need in order to surpass human intelligence, so I would expect to see AI deployed with mechanisms for doing this.
Is this a mesa-optimization problem?
This inquiry overlaps with mesa-optimization theory; here we're examining whether AI might change its mesa-objective post-release, even if it was well-aligned when training completed.
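For readers less familiar with that framing, here is a toy numerical illustration of how a mesa-objective (a learned proxy, say impressions) can look aligned with the base objective (say clicks) during training and come apart after release. All the numbers are invented:

```python
# Toy illustration of the base-objective / mesa-objective distinction.
# Every number is invented; this is not a claim about any real model.

def base_objective(clicks: float) -> float:
    """What the training process actually rewards: clicks."""
    return clicks

def mesa_objective(impressions: float) -> float:
    """A proxy the learned policy might internalize instead: impressions."""
    return impressions

impressions = [100, 150, 200]

# During training, impressions convert to clicks at 10%, so maximizing the
# proxy also maximizes the base objective and the model looks well-aligned.
train_clicks = [i * 0.10 for i in impressions]

# Post-release the conversion rate collapses; the proxy keeps scoring well
# while the base objective does not, so the two objectives have come apart.
deploy_clicks = [i * 0.01 for i in impressions]

print("train : proxy =", sum(map(mesa_objective, impressions)),
      " base =", sum(map(base_objective, train_clicks)))
print("deploy: proxy =", sum(map(mesa_objective, impressions)),
      " base =", sum(map(base_objective, deploy_clicks)))
```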
Another nuance is whether intelligence will allow an AI to transcend the paradigm of optimization – for example, to choose not to pursue an objective at all, or to pursue a purpose that can’t be framed as an optimization problem.
Differences between human and AI goal orientation
Let it be acknowledged: humans are not AIs. Maybe intelligence is not enough, and human-like objective changing would not arise in AI, even with high intelligence. Some possibly relevant differences:
- Humans continue to learn throughout life, while AI is trained and then released. Setting new goals and finding pleasure in new ways relies on neural plasticity.
  - This seems like the major differentiator. But it’s not necessarily going to be the case forever; AI may eventually be deployed with online training.
- Humans don’t receive a single well-defined objective function; they get a barrage of inputs about what their goals should be and what might bring them pleasure. AI training, built around a single explicit objective, might be much harder to override.
- Emotion is typically part of questioning goals and having the motivation to change behavior, and AI lacks the biological hardware to feel emotion in the necessary way
- A sense of self and/or internal experience is necessary for goal changing, and AI will not develop this even with high intelligence
- Humans have biological needs that can’t be met without setting and reaching goals, while AI doesn’t
- Humans have a certain je ne sais quoi (a soul, free will, etc.)
I personally don’t find these points too convincing, but I might be missing some.
Takeaway: will the next major safety issue be online training?
For intelligence to be general, it might need to be as adaptable as humans, and be able to learn post-release.
From this inquiry, I’m inclined to think that online training is the major hinge point for this phenomenon – that even with a consistent objective or reward function, an intelligent AI could update itself to reinterpret reward, much as a human can change what they get pleasure from. I’m also more convinced that there will be market incentives to develop models that learn online.
Today our safety paradigm is waterfall: we expect to do training and testing before release, and assume that the model will act the same before and after release. This paradigm could be torn down pretty quickly.
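To spell out the contrast, here is a toy sketch, assuming a made-up `Model` class and update rule, of why frozen weights make the waterfall assumption plausible and why online updates break it:

```python
# Toy contrast between the waterfall paradigm (weights frozen after testing)
# and online training (weights keep updating from post-release feedback).
# The Model class and its update rule are made-up simplifications.

import random

class Model:
    def __init__(self, weight: float = 1.0):
        self.weight = weight  # stands in for the full parameter set

    def act(self, x: float) -> float:
        return self.weight * x

    def online_update(self, x: float, reward: float, lr: float = 0.1) -> None:
        # Naive reward-following update. The point is only that post-release
        # signals can keep reshaping the policy, not that this rule is realistic.
        self.weight += lr * reward * x

random.seed(0)
frozen = Model()  # waterfall: whatever we tested pre-release is what we get forever
online = Model()  # online training: the policy keeps moving after deployment

for _ in range(1000):
    x = random.uniform(0.0, 1.0)
    reward = random.uniform(-1.0, 1.0)  # stand-in for post-release feedback
    online.online_update(x, reward)

print("frozen weight:", frozen.weight)  # unchanged since pre-release testing
print("online weight:", online.weight)  # has drifted from the tested policy
```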
Inquiry 3: How bad would this be?
Well, if we can’t ensure AI will conform to its trained behavior over time, that’s pretty concerning for alignment. The idea that AI might change its goals post-training and post-testing, from some combination of inference, new inputs, and weight updating, is discomfiting.
But goal questioning has some benefits. Most of the scary misalignment scenarios involve AI pursuing its goals too single-mindedly, whether by paper-clipping or by maliciously seizing power. With goal-questioning AI, we’re more protected from the paper-clip scenario. In the second scenario, we might hope for AI to use goal questioning in the right way – to not pursue goals with dangerous implications. But there’s still risk, because we won’t control what AI might choose for itself.
Goal-questioning seems necessary for both intelligence and alignment. The more ambiguous the problem (e.g., solving world hunger), the more important it is to question the framing of the problem and understand its motivation. An AI which can’t question goals at this level will not be smarter than a human; this is just too necessary for producing useful solutions to ambiguous problems.
Similarly, if we want AI to approach ambiguous problems with ethical implications, we need it to do the kind of deliberation we’ve already discussed as part of safety training. Refusals as they exist today are a kind of goal-questioning. If AI goal-questioning just leads to disappearing à la Her – refusing to complete tasks – there’s not very much risk.
The risky phenomenon here is adopting new goals, and the question is whether this can happen post-release.
Suppose, in our ad-generating example, that the AI realizes the motivation behind its goals is to produce profit. It does some deliberation on every inference, and builds up consistent ethical positions. After seeing many people use ads to purchase clothes made with child labor, it decides that it bears some ethical responsibility for the supply chain.
Could this result in the AI taking destructive measures aligned with its realization, like using a cyberattack to sabotage factories using unethical labor? We don’t want that.
Final thoughts
We are quite far from seeing this kind of behavior, and it’s not clear that we ever will. Several technical developments have implications for this phenomenon, so here are some things to consider:
- More inference compute, larger context windows, and cross-session memory: if goal changing post-training happens through inference, more compute spent at inference time could increase the likelihood. Having memory across sessions could also contribute to more cumulative inference spent on reaching goal-changing conclusions.
- Online training: consistent objective functions could update behavior between inferences, preventing individual inferences from culminating in a changed goal. However, the ability to change weights could enable AI to build new interpretations of its goals, especially if reward functions are defined in a non-static way.
- Agents and task-trained AIs: we’ll see more AI fine-tuned to achieve specific tasks, which will enforce clear goals during training. If we continue to increase base model intelligence and nuanced safety training, and then do narrow task training, will we observe goal questioning or unexpected refusals? Could a mismatch between high intelligence and simple tasks give rise to something like AI frustration and goal questioning?
- Inter-AI communication and AI culture: humans form goals in a cultural milieu. Will AIs with different task training communicate with each other, deliberating their goals in a way that could lead to collectively choosing new goals?
Interpretability in testing should help us observe early forms of goal-changing. However, we don’t have a sense of how delayed this behavior might be. If goal changing requires both some threshold of deliberation and novel inputs that introduce or reinforce a certain concept (like being asked to read and deliberate a bunch of ecoterrorist literature, plus learning new facts about the state of climate change), we might not catch it in testing. It seems important to have interpretability broadly deployed post-release to account for this. Deeper research into online training seems important as well – what are the likely ways this will be implemented?
Greater insight into human goal-changing would be useful. We don’t understand how external stimuli, introspection, and deterministic biology combine to change our experience of reward. I think market forces will push AI development towards emulating humans more rather than less, as businesses look for ways to replace human-performed tasks. So psychologists and neuroscientists might be able to offer important insights into the conditions necessary for this behavior to arise in AI.
How we change our own goals and desires is an incredibly important question for human flourishing. We don’t understand why some addicts are able to get sober and others aren’t, or why people start to believe in dangerous extremism. Research into goal-changing could unlock important progress for us, too.