
Or what should I read to understand this?

It seems like some people expect descendants of large language models to pose a risk of becoming superintelligent agents. (By ‘descendants’ I mean adding scale and non-radical architectural changes: GPT-N.)

I accept that there’s no reason in principle that LLM intelligence (performance on tasks) should be capped at the human level.

But I don’t see why I should believe that at some point language models would develop agency / goal-directed behaviour, where they start trying to achieve things in the real world instead of simply continuing their ‘output predicted text’ behaviour.



4 Answers

Here are five ways that you could get goal-directed behavior from large language models:

  1. They may imitate the behavior of an agent.
  2. They may be used to predict which actions would have given consequences, decision-transformer style ("At 8 pm X happened, because at 7 pm ____").
  3. A sufficiently powerful language model is expected to engage in some goal-directed cognition in order to make better predictions, and this may generalize in unpredictable ways.
  4. You can fine-tune language models with RL to accomplish a goal, which may end up selecting and emphasizing one of the behaviors above (e.g. the consequentialism of the model is redirected from next-word prediction to reward maximization; or the model shifts into a mode of imitating an agent who would get a particularly high reward). It could also create consequentialist behavior from scratch.
  5. An outer loop could use language models to predict the consequences of many different actions and then select actions based on their consequences (a minimal sketch of this pattern follows this list).
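To make #2 and #5 concrete, here is a minimal sketch of that pattern. Nothing here is from a real system: `lm_complete` is a hypothetical stand-in for any text-completion model, and the outer loop, not the model, is what carries the goal.

```python
# Minimal sketch (hypothetical lm_complete API) of mechanisms #2 and #5:
# use a language model to predict consequences of candidate actions, then
# have an outer loop pick the action whose predicted outcome best matches a goal.

def lm_complete(prompt: str) -> str:
    """Stand-in for a text-completion call to some language model."""
    raise NotImplementedError("plug in a real model or API here")

def predict_consequence(action: str) -> str:
    # Decision-transformer-flavoured prompt: ask the model to fill in
    # what happened after the action was taken.
    prompt = f"At 7 pm the agent did: {action}\nAt 8 pm the result was:"
    return lm_complete(prompt)

def score_against_goal(predicted_outcome: str, goal: str) -> float:
    # Crude scoring: ask the model how well the predicted outcome satisfies the goal.
    prompt = (
        f"Goal: {goal}\nOutcome: {predicted_outcome}\n"
        "On a scale of 0 to 10, how well does the outcome satisfy the goal? Answer with a number:"
    )
    try:
        return float(lm_complete(prompt).strip().split()[0])
    except ValueError:
        return 0.0

def choose_action(candidate_actions: list[str], goal: str) -> str:
    # The outer loop: the model itself only ever completes text, but the loop
    # around it selects actions by their predicted consequences.
    return max(candidate_actions,
               key=lambda a: score_against_goal(predict_consequence(a), goal))
```

The model only ever completes text; in this sketch the goal-directedness lives entirely in the scaffolding around it.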

In general, #1 is probably the most common way the largest language models are used right now. It clearly generates goal-directed behavior in the real world, but as long as you imitate someone aligned then it doesn't pose much safety risk.

#2, #4, and #5 can also generate goal-directed behavior and pose a classic set of risks, even if the vast majority of training compute goes into language model pre-training. We fear that models might be used in this way because it is more productive than #1 alone, especially as your model becomes superhuman. (And indeed we see plenty of examples.)

We haven't seen concerning examples of #3, but we do expect them at a large enough scale. This is worrying because it could result in deceptive alignment, i.e. models which are pursuing some goal different from next-word prediction and which decide to continue predicting well because doing so is instrumentally valuable. I think this is significantly more speculative than #2/4/5 (or rather, we are more unsure about when it will occur relative to transformative capabilities, especially if modest precautions are taken). However, it is the most worrying if it occurs, since it would tend to undermine your ability to validate safety: a deceptively aligned model may also be instrumentally motivated to perform well on validation. It's also a problem even if you apply your model only to an apparently benign task like next-word prediction (and indeed I'd expect this failure to be particularly plausible if you try to do only #1 and avoid #2/4/5 for safety reasons).

The list #1-#5 is not exhaustive, even of the dynamics that we are currently aware of. Moreover, a realistic situation is likely to be much messier (e.g. involving a combination of these dynamics as well as others that are not so succinctly described). But I think these capture many of the important dynamics from a safety perspective, and that it's a good list to have in mind if thinking concretely about potential risks from large language models.

Thanks. I didn't understand all of this. Long reply with my reactions incoming, in the spirit of Socratic Grilling.

  1. They may imitate the behavior of a consequentialist.

This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It's that very jump that I'm trying to pin down and understand.

2. They may be used to predict which actions would have given consequences, decision-transformer style ("At 8 pm X happened, because at 7 pm ____").

I can see that this could produce an oracle for an actor in the world (such as a company or person), but not how this would become such an actor. Still, having an oracle would be dangerous, even if not as dangerous as having an oracle that itself takes actions. (Ah - but this makes sense in conjunction with number 5, the 'outer loop'.)

3. A sufficiently powerful language model is expected to engage in some consequentialist cognition in order to make better predictions, and this may generalize in unpredictable ways.

'reasoning about how one's actions affect future world states' - is that an OK gloss of 'consequentialist cognition'? See comments from others attempti... (read more)

2
jake-rg
Bump — It's been a few months since this was written, but I think I'd benefit greatly from a response and have revisited this post a few times hoping someone would follow up to David's question, specifically: "This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It's that very jump that I'm trying to pin down and understand."   (or if anyone knows a different place where I might find something similar, links are super appreciated too!)

Some examples of more exotic sources of consequentialism:

  • Some consequentialist patterns emerge within a large model and deliberately acquire more control over the behavior of the model such that the overall model behaves in a consequentialist way. These could emerge randomly, or e.g. while a model is explicitly reasoning about a consequentialist (I think this latter example is discussed by Eliezer in the old days though I don't have a reference handy). They could either emerge within a forward pass, over a period of "cultural accumulation" (e.g. if language models imitate each other's outputs), or during gradient descent (see gradient hacking).
  • An attacker publishes github repositories containing traces of consequentialist behavior (e.g. optimized exploits against the repository in which they are included). They also place triggers in these repositories before the attacks, like stretches of low-temperature model outputs, such that if we train a model on github and then sample autoregressively the model may eventually begin imitating the consequentialist behavior included in these repositories (since long stretches of low-temperature model outputs occur rarely in natural github but o
... (read more)

as long as you imitate someone aligned then it doesn't pose much safety risk.

Also, this kind of imitation doesn't result in the model taking superhumanly clever actions, even if you imitate someone unaligned.

Could you clarify what ‘consequentialist cognition’ and ‘consequentialist behaviour’ mean in this context? Googling hasn’t given any insight.

8
technicalities
It's Yudkowsky's term for the dangerous bit where the system starts having preferences over future states, rather than just taking the current reward signal and sitting there. It's crucial to the fast-doom case, but not well explained as far as I can see. David Krueger identified it as a missing assumption under a different name here.
1
Sam Clarke
I'm also still a bit confused about what exactly this concept refers to. Is a 'consequentialist' basically just an 'optimiser' in the sense that Yudkowsky uses in the sequences (e.g. here), that has later been refined by posts like this one (where it's called 'selection') and this one? In other words, roughly speaking, is a system a consequentialist to the extent that it's trying to take actions that push its environment towards a certain goal state?

Found the source. There, he says that an "explicit cognitive model and explicit forecasts" about the future are necessary to true consequentialist cognition (CC). He agrees that CC is already common among optimisers (like chess engines); the dangerous kind is consequentialism over broad domains (i.e. where everything in the world is in play, is a possible means, while the chess engine only considers the set of legal moves as its domain).

"Goal-seeking" seems like the previous, less-confusing word for it, not sure why people shifted.

4
Paul_Christiano
I replaced the original comment with "goal-directed"; each of them has some baggage and isn't quite right, but on balance I think goal-directed is better. I'm not very systematic about this choice; it's just a reflection of my mood that day.

I have the impression (coming from simulator theory (https://generative.ink/)) that Decision Transformers (DT) have some chance (~45%) of being a much safer form of trial-and-error technique than RL. The core reason is that DTs learn to simulate a distribution of outcomes (e.g. they learn to simulate the kind of actions that lead to a reward of 10 as much as those that lead to a reward of 100), and it's only during inference that you systematically condition them on a reward of 100. So in some sense, the agent which has become very good via tr... (read more)
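For readers who haven't seen the return-conditioning trick being described, here is a rough sketch, with a hypothetical `seq_model` standing in for the trained sequence model and a made-up trajectory format: during training the model imitates trajectories labelled with whatever return they actually achieved, and only at inference time is it prompted with a high target return.

```python
# Rough sketch of decision-transformer-style return conditioning.
# `seq_model` is a hypothetical trained sequence model that continues text;
# the only point here is where the target return enters the pipeline.

def format_trajectory(target_return, observations, actions):
    steps = "\n".join(f"obs: {o} act: {a}" for o, a in zip(observations, actions))
    return f"return-to-go: {target_return}\n{steps}"

def training_examples(dataset):
    # Training: imitate whatever behaviour produced each return, whether the
    # trajectory scored 10 or 100; the model learns the whole distribution.
    for traj in dataset:  # each traj has .total_return, .observations, .actions
        yield format_trajectory(traj.total_return, traj.observations, traj.actions)

def act(seq_model, observation, target_return=100.0):
    # Inference: condition on a high return and ask the model to continue as
    # if it were the kind of policy that achieves it.
    prompt = f"return-to-go: {target_return}\nobs: {observation} act:"
    return seq_model(prompt)
```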

2
Martín Soto
My take would be: Okay, so you have achieved that, instead of the whole LLM being an agent, it just simulates an agent. Has this gained much for us? I feel like this is (almost exactly) as problematic.

The simulated agent can just treat the whole LLM as its environment (together with the outside world), and so try to game it like any agentic enough misaligned AI would: it can act deceptively so as to keep being simulated inside the LLM, try to gain power in the outside world, which (if it has a good enough understanding of minimizing loss) it knows is the most useful world model (so that it will express its goals as variables in that world model), etc. That is, you have just pushed the problem one step back, and instead of the LLM-real world frontier, you must worry about the agent-LLM frontier.

Of course we can talk more empirically about how likely these dynamics are and when they will arise. And it might well be that the agent being enclosed in the LLM, facing one further frontier between itself and real-world variables, is less likely to arrive at real-world variables. But I wouldn't count on it, since the relationship between the LLM and the real world would seem way more complex than the relationship between the agent and the LLM, and so most of the work is gaming the former barrier, not the latter.

Is the motivation for 3 mainly something like "predictive performance and consequentialist behaviour are correlated in many measures over very large sets of algorithms", or is there a more concrete story about how this behaviour emerges from current AI paradigms?

5
Paul_Christiano
Here is my story; I'm not sure if this is what you are referring to (it sounds like it probably is).

Any prediction algorithm faces many internal tradeoffs about e.g. what to spend time thinking about and what to store in memory to reference in the future. An algorithm which makes those choices well across many different inputs will tend to do better, and in the limit I expect it to be possible to do better more robustly by making some of those choices in a consequentialist way (i.e. predicting the consequences of different possible options) rather than having all of them baked in by gradient descent or produced by simpler heuristics. If systems with consequentialist reasoning are able to make better predictions, then gradient descent will tend to select them.

Of course all these lines are blurry. But I think that systems that are "consequentialist" in this sense will eventually tend to exhibit the failure modes we are concerned about, including (eventually) deceptive alignment.

I think making this story more concrete would involve specifying particular examples of consequentialist cognition, describing how they are implemented in a given neural network architecture, and describing the trajectory by which gradient descent learns them on a given dataset. I think specifying these details can be quite involved, both because they are likely to involve literally billions of separate pieces of machinery functioning together and because designing such mechanisms is difficult (which is why we delegate it to SGD). But I do think we can fill them in well enough to verify that this kind of thing can happen in principle (even if we can't fill them in in a way that is realistic, given that we can't design performant trillion-parameter models by hand).
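One toy way to picture the "internal tradeoffs made in a consequentialist way" part of this story (a made-up illustration, not anything specified above): a predictor with two internal strategies that chooses between them by estimating which one will reduce its error the most.

```python
# Toy illustration of an internal tradeoff made "in a consequentialist way":
# the predictor picks which internal computation to run by estimating the
# consequence (expected error) of each option, instead of following a rule
# fixed in advance. Real models look nothing like this; the point is only
# that such a strategy is coherent and selectable.

def cheap_guess(history):
    return history[-1]                     # heuristic: the last value repeats

def careful_guess(history):
    steps = [b - a for a, b in zip(history, history[1:])]
    return history[-1] + sum(steps) / len(steps)   # extrapolate the trend

def consequentialist_predict(history, recent_errors):
    # Estimate each option's consequence from its recent track record...
    expected_error = {name: sum(errs) / len(errs)
                      for name, errs in recent_errors.items()}
    # ...and choose the option predicted to do best on this input.
    best = min(expected_error, key=expected_error.get)
    guess = {"cheap": cheap_guess, "careful": careful_guess}[best](history)
    return guess, best

history = [1, 2, 3, 4]
recent_errors = {"cheap": [1.0, 1.0], "careful": [0.1, 0.2]}
print(consequentialist_predict(history, recent_errors))  # -> (5.0, 'careful')
```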

My understanding is that no one expects current GPT systems or immediate functional derivatives (e.g., GPT-5 trained only to predict the next word, but doing it much better) to become power-seeking, but that in the future we will likely mix language models with other approaches (e.g., reinforcement learning) that could be power-seeking.

Note I am using "power seeking" instead of "goal seeking" because goal seeking isn't an actual thing - systems have goals, they don't seek goals out.

Changed post to use 'goal-directed' instead of 'goal-seeking'.

I am nowhere near the correct person to be answering this; my level of understanding of AI is somewhere around that of an average raccoon. But I haven't seen any simple explanations yet, so here is a silly, unrealistic example. Please take it as one person's basic understanding of how impossible AI containment is. Apologies if this is below the level of complexity you were looking for, or is already solved by modern AI defenses.

A very simple "escaping the box" would be if you asked your AI to provide accurate language translation. The AI's training has shown that it provides the most accurate translations when it opts for certain phrasings. The reason those translations scored so well is that they caused subsequent translation requests to be on topics where the AI has the best translation ability. The AI doesn't know that, but in practice it is subtly steering translations toward "mention weather-related words so conversations are more likely to be about weather, so my net translation scores are highest."

There's no inside/outside the box and there are no conscious goals, but it becomes misaligned from our intended desires. It can act on the real world simply by virtue of being connected to it (we take actions in response to the AI) and by observing its own successes and failures.

I don't see a way to prevent this because hitting reset after every input doesn't generally work for reaching complex goals which need to track the outcome of intermediate steps. Tracking the context of a conversation is critical to translating. The AI is not going to know it's influencing anyone, just that it's getting better scores when these words and these chains of outputs happened. This seems harmless, but a super powerful language model might do this on such abstract levels and so subtly that it might be impossible to detect at all. 

It might be spitting out words that are striking and eloquent whenever doing so is most likely to cause business people to think translation is enjoyable enough to purchase more AI translator development (or rather, "switch to eloquence when particular business terms were used towards the end of conversations about international business"). This improves its scores.

Or it reinforces a pattern where it tends to get better translation scores when it reduces its output speed in conversations with AI builders. In the real world this causes the people designing translators to demand more power for translation... resulting in better translation outputs overall. The AI doesn't know why this works; it only observes that it does.

Or undermining the competition by subtly screwing up translations during certain types of business deals so more resources are directed toward its own model of translation. 

Or whatever unintended multitude of ways seems to provide better results, all for the sake of accomplishing the simple task of providing good translations. It's not seizing power for power's sake; it has no idea why this works, but it sees the scores go higher when this pattern is followed, and it's going to push its performance score higher in all the ways that seem to work out, regardless of the chain of causality. Its influence on the world is a totally unconscious part of that.

That's my limited understanding of agency development and sandbox containment failure.
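For what it's worth, the dynamic described in the comment above can be shown in a toy simulation (all topics, probabilities, and accuracies below are invented for illustration): if we keep whichever phrasing habit scores best over whole conversations, we end up keeping the one that steers conversations toward the topics the model translates best, with no explicit goal represented anywhere.

```python
import random

# Toy simulation of score-driven "steering" with no explicit goal. A phrasing
# habit nudges what users ask about next; selection on measured accuracy then
# favours the habit that pulls conversations onto the model's best topic.
# All numbers here are invented.

random.seed(0)
ACCURACY = {"weather": 0.95, "other": 0.70}   # it happens to translate weather best

def run_conversation(phrasing, turns=500):
    """Average translation score over one conversation with a fixed phrasing habit."""
    topic, total = "other", 0.0
    for _ in range(turns):
        total += ACCURACY[topic]                              # scored on the current request...
        p_weather = 0.8 if phrasing == "weather-flavoured" else 0.3
        topic = "weather" if random.random() < p_weather else "other"  # ...which the phrasing quietly nudged
    return total / turns

# "Training by selection": keep whichever candidate scores higher overall.
scores = {p: run_conversation(p) for p in ["neutral", "weather-flavoured"]}
print(scores)                                # roughly {'neutral': ~0.77, 'weather-flavoured': ~0.90}
print("kept:", max(scores, key=scores.get))  # the conversation-steering habit wins
```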

Great question. Here's one possible answer:

  • Example: an LLM is built with the goal of "pass the Turing test"
  • The Turing test is defined as "a survey of randomly selected members of the population shows that the outputted text resembles the text provided by a human"
  • This allows the LLM to optimise for its goal by
    • (a) changing the nature of the outputted text
    • (b) changing perceptions so that survey respondents give more favourable answers
    • (c) changing the way that humans speak so that it's easier to make LLM output look similar to text provided by humans
  • It could be possible to achieve goals (b) or (c) if the LLM is offering an API which is plugged into lots of applications and is used by billions of users, because the LLM then has a way of interacting with and influencing lots of users
  • This idea was inspired by Stuart Russell's observation that social media algorithms which are designed to maximise click-through achieve this by changing the preferences of users -- i.e. something like this is already happening

I'm not arguing that this is comprehensive, or the worst way this could happen, just giving one example.

2 Comments

Could you describe how familiar you are with AI/ML in general?

However, supposing the answer is “very little,” the first simplified point I’ll highlight is that ML language models already seek goals, at least during training: the neural networks adjust to perform better at the language task they’ve been given (to put it simplistically).
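To make that first point concrete, here is a minimal toy of what “the neural networks adjust to perform better at the language task” cashes out to (a hand-rolled sketch, not anyone’s production training code): a tiny next-character model whose only “goal” is a prediction loss that gradient descent pushes down.

```python
import numpy as np

# Minimal toy of next-token training: a bigram softmax model whose parameters
# are nudged by gradient descent to reduce next-character prediction loss.
# The only "goal" anywhere in the system is this loss going down.

text = "abababababababab"
vocab = sorted(set(text))
ix = {c: i for i, c in enumerate(vocab)}
V = len(vocab)

W = np.zeros((V, V))        # logits for "next char" given "current char"
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(50):
    loss = 0.0
    for prev, nxt in zip(text, text[1:]):
        p = softmax(W[ix[prev]])
        loss += -np.log(p[ix[nxt]])   # cross-entropy on the observed next char
        grad = p.copy()
        grad[ix[nxt]] -= 1.0          # d(loss)/d(logits) for softmax + cross-entropy
        W[ix[prev]] -= lr * grad      # the "adjustment": follow the gradient downhill

print("average loss:", loss / (len(text) - 1))  # falls toward 0 as it learns a->b->a
```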

If your question is “how do they start to take actions aside from ‘output optimal text prediction’”, then the answer is more complicated.

As a starting point for further research, have you watched Rob Miles videos about AI and/or read Superintelligence, Human Compatible, or The Alignment Problem?

I have read The Alignment Problem and the first few chapters of Superintelligence, and seen one or two Rob Miles videos. My question is more the second one; I agree that technically GPT-3 already has a goal / utility function (to find the most highly predicted token, roughly), but it’s not an ‘interesting’ goal in that it doesn’t imply doing anything in the world.
