Hide table of contents

Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.

 

Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?

(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it’s being eroded, etc.)

And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?

(Modulo, e.g., the fact that it can play chess pretty well, which indicates a certain type of want-like behavior in the behaviorist sense. An AI's ability to win no matter how you move is the same as its ability to reliably steer the game-board into states where you're check-mated, as though it had an internal check-mating “goal” it were trying to achieve. This is again a quantitative gap that’s being eroded.)

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.

Relatedly: to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the "behaviorist sense" expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise. Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one's plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.

 

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".

I make no claim about the AI's internal states and whether those bear any resemblance to the internal state of a human consumed by the feeling of desire. To paraphrase something Eliezer Yudkowsky said somewhere: we wouldn't say that a blender "wants" to blend apples. But if the blender somehow managed to spit out oranges, crawl to the pantry, load itself full of apples, and plug itself into an outlet, then we might indeed want to start talking about it as though it has goals, even if we aren’t trying to make a strong claim about the internal mechanisms causing this behavior.

If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I'll say it "wants" that outcome “in the behaviorist sense”.

 

Why might we see this sort of "wanting" arise in tandem with the ability to solve long-horizon problems and perform long-horizon tasks?

Because these "long-horizon" tasks involve maneuvering the complicated real world into particular tricky outcome-states, despite whatever surprises and unknown-unknowns and obstacles it encounters along the way. Succeeding at such problems just seems pretty likely to involve skill at figuring out what the world is, figuring out how to navigate it, and figuring out how to surmount obstacles and then reorient in some stable direction.

(If each new obstacle causes you to wander off towards some different target, then you won’t reliably be able to hit targets that you start out aimed towards.)

If you're the sort of thing that skillfully generates and enacts long-term plans, and you're the sort of planner that sticks to its guns and finds a way to succeed in the face of the many obstacles the real world throws your way (rather than giving up or wandering off to chase some new shiny thing every time a new shiny thing comes along), then the way I think about these things, it's a little hard to imagine that you don't contain some reasonably strong optimization that strategically steers the world into particular states.

(Indeed, this connection feels almost tautological to me, such that it feels odd to talk about these as distinct properties of an AI. "Does it act as though it wants things?" isn’t an all-or-nothing question, and an AI can be partly goal-oriented without being maximally goal-oriented. But the more the AI’s performance rests on its ability to make long-term plans and revise those plans in the face of unexpected obstacles/opportunities, the more consistently it will tend to steer the things it's interacting with into specific states—at least, insofar as it works at all.)

The ability to keep reorienting towards some target seems like a pretty big piece of the puzzle of navigating a large and complex world to achieve difficult outcomes.

And this intuition is backed up by the case of humans: it's no mistake that humans wound up having wants and desires and goals—goals that they keep finding clever new ways to pursue even as reality throws various curveballs at them, like “that prey animal has been hunted to extinction”.

These wants and desires and goals weren’t some act of a god bequeathing souls into us; this wasn't some weird happenstance; having targets like “eat a good meal” or “impress your friends” that you reorient towards despite obstacles is a pretty fundamental piece of being able to eat a good meal or impress your friends. So it's no surprise that evolution stumbled upon that method, in our case.

(The implementation specifics in the human brain—e.g., the details of our emotional makeup—seem to me like they're probably fiddly details that won’t recur in an AI that has behaviorist “desires”. But the overall "to hit a target, keep targeting it even as you encounter obstacles" thing seems pretty central.)


The above text vaguely argues that doing well on tough long-horizon problems requires pursuing an abstract target in the face of a wide array of real-world obstacles, which involves doing something that looks from the outside like “wanting stuff”. I’ll now make a second claim (supported here by even less argument): that the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular.

For instance, humans find themselves wanting things like good meals and warm nights and friends who admire them. And all those wants added up in the ancestral environment to high inclusive genetic fitness. Observing early hominids from the outside, aliens might have said that the humans are “acting as though they want to maximize their inclusive genetic fitness”; when humans then turn around and invent birth control, it’s revealed that they were never actually steering the environment toward that goal in particular, and instead had a messier suite of goals that correlated with inclusive genetic fitness, in the environment of evolutionary adaptedness, at that ancestral level of capability.

Which is to say, my theory says “AIs need to be robustly pursuing some targets to perform well on long-horizon tasks”, but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist-goal is very unlikely to be the exact goal the programmers intended, rather than (e.g.) a tangled web of correlates.


A follow-on inference from the above point is: when the AI leaves training, and it’s tasked with solving bigger and harder long-horizon problems in cases where it has to grow smarter than ever before and develop new tools to solve new problems, and you realize finally that it’s pursuing neither the targets you trained it to pursue nor the targets you asked it to pursue—well, by that point, you've built a generalized obstacle-surmounting engine. You've built a thing that excels at noticing when a wrench has been thrown in its plans, and at understanding the wrench, and at removing the wrench or finding some other way to proceed with its plans.

And when you protest and try to shut it down—well, that's just another obstacle, and you're just another wrench.

So, maybe don't make those generalized wrench-removers just yet, until we do know how to load proper targets in there.

38

0
0

Reactions

0
0

More posts like this

Comments1
Sorted by Click to highlight new comments since: Today at 9:22 AM

Executive summary: An AI system's ability to pursue long-term goals despite obstacles correlates with it exhibiting goal-directed, "wanting" behavior in a behaviorist sense.

Key points:

  1. AI systems today struggle with long-horizon tasks and don't display much goal-directed behavior. These issues are related - pursuing long-term goals requires persistently working towards targets.
  2. If an AI can accomplish long-horizon tasks by planning and sticking to plans despite obstacles, it likely has optimization and "wants" that steer the world towards certain states in a behaviorist sense.
  3. This goal-oriented behavior was evolutionarily useful for humans in pursuing things like food and social status. Similarly, it is useful for AIs in complex environments.
  4. The specific "wants" that emerge may not match an AI system's training objectives. They may be correlates that prove useful for performance.
  5. Powerful, general problem-solving AI systems may resist human control and optimization towards unintended goals. Care is needed before building highly autonomous, goal-directed systems.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.