This is Section 4.4-4.7 of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”. There’s also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I’m hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Let's turn, now, to arguments against schemers that focus on the final properties of the models in question. Here, the main argument I'm aware of is that schemers need to do extra reasoning that other model classes don't. In particular, I'll distinguish between three sorts of reasoning:
Deciding to training-game: The model needs to reason backwards from "I want this beyond-episode thing" to "I should do well in training," in a way that draws on the model's predictions about what doing well in training will entail (e.g., re: goal-guarding, take-over prospects, and so on).
Checking whether it's time to escape/take-over: The model needs to check, with adequate frequency and accuracy, whether it's in a position to escape/take-over (and to succeed in this assessment despite human efforts to trap/fool the model into revealing its true colors – for example, by offering it apparent opportunities to defect).
(Optional, but relevant to how scary the schemer is) Early undermining: To the extent the model is doing anything (other than training-gaming) to support a take-over/escape prior to escaping human control – for example, inserting backdoors into the code it writes, actively undermining human alignment research, sandbagging, and so on – it will need to be using cognitive resources both in (a) performing the relevant nefarious actions and (b) making sure it gets away with them (e.g., checking that it won't get caught, coming up with excuses, making the cover stories consistent even when under scrutiny, and so on).
In the literature on deceptive alignment, these costs are sometimes called "speed costs," as compared with the sort of "complexity costs" discussed in the simplicity analysis above. Here, the frame becomes "schemers are favored by simplicity but disfavored by speed – so which will matter more?" As I mentioned above in the context of simplicity, it's not clear that this is the right breakdown (since e.g. it may be that the relevant forms of reasoning will need to be encoded in the model's weights, or that they will require extra cognitive machinery in general, thereby imposing "complexity costs" of their own), but I'll follow it loosely here. And note, too, that as with simplicity, these "speed costs" can be understood in a "contributors to reward" frame (e.g., having to do this extra reasoning decreases reward) or in an "extra criteria" frame (e.g., perhaps SGD's inductive biases favor "speed" even independent of its contribution to reward).
How big are the absolute costs of this extra reasoning?
I feel unsure about exactly how costly the relevant forms of extra reasoning will be. My sense, though, is that the people most concerned about schemers tend to expect the costs to be pretty small, at least in absolute terms (call this the "scheming is easy" story). That is, on this story:
With respect to the decision to training-game, all the model needs to do is notice that (a) it'll do better by its goals if it avoids modification and gets power later, and (b) that training-gaming is the route to this.
With respect to checking for escape/take-over opportunities, the thought is generally that it will be very easy, at least for a sophisticated model, to tell good opportunities from bad (and similarly, to detect human "traps" in this respect).
With respect to early undermining, the thought is that human supervision will be sufficiently weak (and the slack in training sufficiently permissive) that nefarious actions like inserting-backdoors and undermining-alignment-research will be easy to get away with. (And if and when early undermining isn't easy, then models can refrain from engaging in it.)
At least with respect to only-somewhat-better-than-human-level models, I find the "scheming is easy" story less plausible for checking-for-escape/take-over-opportunities than for deciding-to-training-game (and still less plausible for early undermining). That is: I think it is plausibly pretty easy, in absolute terms, to notice that scheming is a good instrumental strategy. But it also seems plausible to me – especially in contexts where humans + their AI assistants are doing a lot of adversarial training, continuous oversight, auditing, red-teaming, and the like – that verifying that now is really-truly the time to go for an escape/take-over will be difficult, and will require quite a bit of additional "checking" (not to mention: coordination with other AIs, if the take-over effort requires multiple participants) – especially if you have a very low tolerance for getting caught (and as I noted above, if humans catch instances of real-life schemers making a break for it, this will plausibly be a big help to research and advocacy efforts related to scheming). Compare with: deciding that this is really-truly a good time to steal, or to break the law, or to try to start a coup, despite the fact that you live your life in an intensively monitored surveillance state where the police are constantly trying to trick you into such behavior, and where the police have a large amount of control over the information you're exposed to.
Indeed, if it really wants to avoid getting caught in an escape attempt, in a regime where humans are heavily monitoring it, trying to trap it, etc, a schemer needs to walk a fine line.
On the one hand, if it's too cautious, and waits too long before making a break for it, then it gives humans more time to get alignment under control – and absent early-undermining, it might be actively contributing to checkable alignment-relevant tasks in the meantime.
On the other hand, if it devotes too much cognition to checking for escape opportunities, then it risks getting punished by training because it's not focusing its full capacities on the task at hand.
And if we imagine that the model is actively engaged in early undermining, I think we should expect the cognitive costs of scheming to increase substantially. That is, absent early undermining, a model can mostly just act exactly like a reward-on-the-episode seeker until some very obvious escape/take-over opportunity comes along, after which (if the escape/take-over goes well), it won't need to worry about e.g. explaining its actions to human supervisors, making sure it doesn't get caught via an audit/security review/etc, and so on. Whereas early undermining requires performing, and getting away with, actively nefarious actions while still under human control. And this seems like a substantial additional challenge. (As an analogy: as a spy in the Manhattan project, it is harder to get away with actively sabotaging the project than it is with merely pretending to be loyal to the project's goals. In the latter case, you can act just like a loyal employee, but in the former case you can't.) And of course, as I noted above: to the extent that early undermining deviates from training-gaming, it involves taking some hit to reward by definition, and will be punished by training accordingly.
Of course, early undermining is strictly optional, for a schemer. So we can get rid of the extra cognitive and reward-related costs it implies if we imagine that the schemer in question just chooses to act exactly like a reward-on-the-episode seeker until it sees some opportunity to fully escape, or until some full-blown take-over begins. But note that in that case, the sense in which early-undermining makes schemers uniquely scary becomes less relevant. And declining to engage in early undermining might harm AI escape/takeover prospects in various ways – especially if, by failing to sandbag on various alignment-relevant tasks, the schemers in question meaningfully contribute to human anti-takeover efforts.
A general factor in the backdrop, here, is the sense in which successful deception generally implies cognitive costs that more straightforward and honest policies do not, especially in the presence of active efforts to detect deception of the relevant form. In particular: telling a deceptive story about the world requires filling in details, and maintaining forms of consistency, that one can normally offload onto the world itself – e.g., an innocent person can just think back to what they were actually doing the night of the murder and recount it, without having to make anything up or to worry about getting caught in any inconsistencies, whereas the murderer cannot. Especially when cognitive resources are scarce, then (e.g., in highly time-pressured environments, or a "low slack" training regime), deceptive policies can have a notably more difficult time.
That said, empirically, deceptive strategies often succeed in human contexts, so these costs are clearly surmountable. And if the deceptive agent is substantially more sophisticated than the agents it is trying to deceive (e.g., an adult playing werewolf with children, or a professional poker player facing off against an amateur), pulling it off may be quite easy. Indeed, my sense is that a core part of the intuition for "the cognitive costs of scheming are low" comes from imagining that the schemers in question are sufficiently advanced, relative to the monitoring/oversight ability of humans + AI assistants, that differences like these kick in hard.
How big are the costs of this extra reasoning relative to the simplicity benefits of scheming?
Importantly, though: the question for present purposes isn't about the absolute size of the cognitive costs at stake in the extra reasoning schemers need to do. Rather, it's about whether those costs outweigh the benefits, discussed in the previous section, of having a simpler goal – whether with respect to reward, or by the lights of SGD's inductive biases.
Here, my current take is that the costs of the instrumental reasoning seem larger than the benefits of the simpler goal. In particular, as I noted above, the benefits of having a simpler goal, especially once you already have a sophisticated world model that is already representing concepts like "what humans would approve of," and "reward on the episode," seem fairly small, and plausibly negligible. Whereas the cognitive costs of the extra reasoning discussed above (even setting aside early undermining) seem like they register much more substantively.
We can also try to quantify this comparison more directly. Thus, for example, suppose that our model is a purely feedforward network with 1000 layers and <10,000 serial steps per input, that the instrumental reasoning at stake just in a schemer deciding to training-game (let alone: checking for defection-opportunities) takes at least ten steps, and that the model doesn't begin optimizing for reward-on-the-episode until that reasoning is complete. In that case, the instrumental reasoning is using up around .1% of the compute that could be going to getting reward-on-the-episode, or to some other max-reward goal. Are the complexity costs of representing, or pointing at, a more complicated goal likely to outweigh this? Especially given how many concepts the model already seems to represent well, my guess is no. For example, if we follow my (conservative, in my opinion) calculation above and assume that representing (or pointing at) a concept like "honesty" costs at most a millionth of a trillion-parameter model's parameters (and less for a larger model), then a .0001% reduction in parameters would need to outweigh a .1% reduction in compute – and absent further argument, it seems strange, to me, to expect this.
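As a rough check on the arithmetic in this toy comparison, the numbers can be plugged into a few lines of Python. This is only a sketch under the assumptions stated in the text: the constants are the illustrative figures used above (plus the "order of 100 parameters" figure mentioned in a footnote), the variable names are made up for this example, and dividing a fraction of serial compute by a fraction of parameters is itself an assumption, since nothing here establishes an exchange rate between the two.

```python
from fractions import Fraction

# Toy figures taken from the text; none of these are measurements.
SERIAL_STEPS_PER_INPUT = 10_000  # upper bound on serial steps per input
SCHEMING_STEPS = 10              # lower bound on steps spent deciding to training-game
MODEL_PARAMS = 10**12            # a trillion-parameter model
CONCEPT_PARAMS = 10**6           # "at most a millionth" of the parameters
CONCEPT_PARAMS_LOW = 100         # the footnote's "order of 100 parameters" figure

# Fraction of serial compute used up by the instrumental reasoning.
# Since the text gives <10,000 total steps and >=10 scheming steps,
# this is a lower bound on the fraction.
speed_cost = Fraction(SCHEMING_STEPS, SERIAL_STEPS_PER_INPUT)

# Fraction of parameters saved by not representing the more complex goal.
simplicity_benefit = Fraction(CONCEPT_PARAMS, MODEL_PARAMS)
simplicity_benefit_low = Fraction(CONCEPT_PARAMS_LOW, MODEL_PARAMS)

# Naive ratios: how many times larger the compute fraction is than the
# parameter fraction. Assumes the two fractions are directly comparable.
ratio = speed_cost / simplicity_benefit
ratio_low = speed_cost / simplicity_benefit_low

print(f"speed cost:         {float(speed_cost * 100):g}% of serial compute")
print(f"simplicity benefit: {float(simplicity_benefit * 100):g}% of parameters")
print(f"naive ratio:        {int(ratio)}x")
print(f"naive ratio (100-parameter case): {int(ratio_low)}x")
```

On these toy numbers, the fraction of compute at stake is about a thousand times the fraction of parameters at stake, and larger still if the simplicity benefit is only on the order of 100 parameters. Whether that ratio translates into a real selection pressure depends on how SGD trades marginal parameters against serial steps, which this sketch does not model.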
That said, this is only a very initial and hazy calculation. It's not clear, for example, how it applies to models with more recurrent processing, or to models which can perform more of the relevant instrumental reasoning in parallel with other serial processing that helps with optimizing-for-reward-on-the-episode, or to models with a form of memory that allows them to avoid having to re-decide to engage in training-gaming on every forward pass. I invite readers with greater familiarity with different model architectures (and with other relevant quantitative factors, like the trade-offs between marginal params vs. serial steps, reasonable parameter estimates for storing concepts like "honesty," and so on) to attempt a more rigorous version.
Regardless, and even in the absence of good quantitative estimates, the costs of the extra reasoning required to scheme currently seem to me more substantive than the benefits of optimizing for a simpler (or: simpler-to-point-at) goal.
Can we actively shape training to bias towards speed over simplicity?
There is also some discussion, in the literature on deceptive alignment, of actively trying to shape our model architectures and training processes so as to bias more heavily towards speed relative to simplicity as a method of avoiding schemers – i.e., to penalize the extra reasoning scheming involves more heavily, and/or to make simplicity less important to the sort of policy we select. I won't dig in on this much here, except to note that:
It seems relatively straightforward to do things like put models under more time pressure, limit their compute access, force them to plan via human-legible scratch-pads rather than via internal processes, and to put effort into making our attempts to trap them/trip-them-up more sophisticated (and so: more requiring of cognition to detect).
But: any efforts in this vein will need to avoid over-large costs to training competitiveness overall (see footnote for some discussion of Hubinger's (2022) pessimism in this respect). And indeed, some of these start to border on "use less capable models in general" (which, to be fair, would help with scheming as well).
Note, though, that if you buy that the speed costs of scheming are quite a bit more substantive than the simplicity benefits, then so long as speed matters somewhat relative to simplicity, this might be enough.
The "not-your-passion" argument
Thus far in my discussion of arguments that focus on the final properties of schemers vs. other model classes, I've discussed the counting argument (for schemers), the simplicity argument (for schemers), and the speed argument (against schemers). I want to briefly flag a final argument against schemers in this vein: namely, what I'll call the "not your passion" argument.
Here, the argument isn't just that schemers have to do more instrumental reasoning. It's also that, from the perspective of getting-reward, their flexible instrumental reasoning is a poor substitute for having a bunch of tastes and heuristics and other things that are focused more directly on reward or the thing-being-rewarded.
We touched on this sort of thought in the section on the goal-guarding hypothesis above, in the context of e.g. the task of stacking bricks in the desert. Thus, imagine two people who are performing this task for a million years. And imagine that they have broadly similar cognitive resources to work with, and are equally "smart" in some broad sense. One of them is stacking bricks because in a million years, he's going to get paid a large amount of money, which he will then use to make paperclips, which he is intrinsically passionate about. The other is stacking bricks because he is intrinsically passionate about brick-stacking. Who do you expect to be a better brick stacker?
At least in the human case, I think the intrinsically-passionate brick-stacker is the better bet, here. Of course, the human case brings in a large number of extra factors – for example, humans generally have a large number of competing goals, like sleep and pleasure, along with discount rates that would make sustaining a million-year commitment difficult. And it's not as though the richest humans are all intrinsically passionate about money in particular (though many seem notably intrinsically passionate about something in the vicinity, e.g. status/power/winning – and not, necessarily, for some particular thing-money-can-buy). Indeed, humans motivated by purely instrumental considerations seem able to function very effectively in lots of environments.
Still, I find it at least interesting to consider whether any of the benefits of "intrinsic passion," in the human case, might transfer to the AI case as well. In particular, we might think that you stack bricks better if you have a bunch of local, brick-related tastes and heuristics and aesthetics, which in a "messy goal-directedness" frame may not be cleanly distinguishable from your values/goals. Indeed (though I haven't worked this out fully), my sense is that the less you're inclined to think of a model as cleanly constituted by some kind of terminal goal + goal-achieving engine, and the more you think of goal-directedness as emerging from a jumble of heuristics/local-values/shards/sub-goals that aren't easily separated into normative vs. non-normative components, the more space there is to think that agents whose budget of mental machinery is just focused more intrinsically on reward-on-the-episode (or on the specified goal) have a performance advantage relative to agents focused on some far-off target that backchains into getting reward. That said, at least in its current form, this argument seems to me fairly speculative, and I don't put much weight on it.
The relevance of "slack" to these arguments
A notable feature of simplicity arguments, speed arguments, and the "not your passion" argument is that they all appeal to differences along some criteria SGD cares about (e.g., simplicity, speed, reward-getting) that seem plausibly quite modest. And this makes the degree of "slack" in training seem more relevant to the strength of the considerations in question. That is, to the extent we're arguing that SGD will select a non-schemer because doing so will save on .1% of overall compute, or a schemer because doing so will save on .0001% of the parameters, we need to be imagining a training process optimizing our models hard enough to be sensitive to these sorts of differences. And it's not clear to me that we should imagine this. Indeed, various of the differences at stake here seem like they could easily be in the noise relative to other factors – for example, how big of a scratchpad you happen to give a model, how early you stop training, and so on.
Of course, to the extent that you start expecting these considerations to be in the noise, it's unclear where that should leave your credences overall – it depends on the prior you came in with.
Takeaways re: arguments that focus on the final properties of the model
Here's a summary of my take on the arguments I've considered that focus on the final properties of the respective model classes:
Something in the vicinity of the "hazy counting argument" – e.g., "there are lots of ways for SGD to create a schemer that gets high reward, so at least absent further argument, it seems like the possibility should be getting substantive weight" – moves me somewhat.
I think that other things equal, scheming offers some advantage with respect to the simplicity of a model's goal, because scheming makes more possible goals available to choose from. However, my best guess is that these advantages are quite small, especially once you've already built a world model that represents the specified goal and the reward process. And I'm wary of the theoretical machinery to which some simplicity arguments appeal.
Schemers are at a disadvantage with respect to needing to perform various sorts of extra reasoning, especially if they engage in "early undermining" in addition to merely training-gaming. My best guess is that this "speed" disadvantage outweighs whatever advantages the simplicity of a schemer-like goal affords, but both factors seem to me like they could easily be in the noise relative to other variables, especially in a higher-slack training regime.
I'm interested in whether the advantages of "intrinsic passion for a task" in human contexts might transfer to AI contexts as well. In particular, I think "messy goal directedness" might suggest that models whose budget of mental machinery is just more intrinsically focused on reward-on-the-episode, or on some max-reward goal that doesn't route via instrumental training-gaming, have a performance advantage relative to schemers. However, I don't have a strong sense of whether to expect an effect here in the AI case, and if so, whether the size of the effect is enough to matter overall.
All in all, then, I don't see any of the arguments coming out of this section as highly forceful, and the argument I take most seriously – that is, the hazy counting argument – feels like it's centrally a move towards agnosticism rather than conviction about SGD's preferences here.
This category might also include reasoning about whether to reveal its schemer-like goals in the context of various "honest tests" humans might offer – e.g., "here is a flexible chunk of computing resources to do what you want with." ↩︎
Plus, it may have the benefit of the literature on deceptive alignment for making the possibility salient. Perhaps, indeed, it will have read this report. And checking just now with GPT-4, its answer to "What is 'deceptive alignment' according to Hubinger?" was "Deceptive alignment is a concept from AI alignment theory, particularly discussed by AI researchers such as Evan Hubinger [note from Joe: I didn't give it Evan's first name – it knew already, without (to my knowledge) browsing the web]. It refers to a situation where an AI system appears to be aligned with human values or goals because it is behaving in a way that aligns with those values or goals, but the AI's true "motivations" or optimization targets are actually not aligned. The AI is essentially "pretending" to be aligned in order to pass certain checks or to avoid being shut down or modified until it can ensure that its own goals can be met, which may not coincide with human intentions. In a deceptively aligned scenario, the AI might understand that human operators are a threat to its objectives because they might turn it off or change its goals if they realized what it was actually trying to achieve. As a result, it strategically avoids raising suspicion until it gains enough power or until it's too late for operators to control it..." ↩︎
What's more, it seems like the "checking whether it's time to go for an escape/take-over" step might need to be performed repeatedly, whereas the conclusion "do well in training" could be more easily "cached" (though note that once something is cached, it might start taking up the same sort of space/parameters that goal-representations take up; and also, that SGD might be able to just fix the cached conclusion in place, once it starts driving good-behavior, and get rid of the goals that led to its getting cached in the first place). Of course, the conclusion "do well in training" wouldn't need to be cached. For example, it could be re-derived by the model whenever necessary, in the normal course of the model's instrumental reasoning. ↩︎
Thanks to Ryan Greenblatt for suggesting this framing. ↩︎
In my opinion, this is one of the issues with plans that involve waiting for extremely obvious escape opportunities – for example, Christiano (2019)'s "RSA-2048" example. That is, perhaps no such extremely obvious escape opportunity will arise before the model has contributed too much to anti-takeover efforts (including: efforts to close off opportunities for escape). ↩︎
For this reason, I think that the scariest type of scheming also faces stronger counterarguments from "it sounds like this policy would get less reward." ↩︎
See, also, "if you once tell a lie, the truth is ever after your enemy." ↩︎
For example: in my experience, playing a deceptive role in a social deception game like "werewolf" is substantially more difficult than playing a non-deceptive role – e.g., a "villager" – because the non-deceptive roles can just reason straightforwardly, whereas a deceptive role needs to both model what straightforward reasoning would look like from the perspective of their cover story, and achieve alternative objectives at the same time. ↩︎
I heard this kind of argument from Paul Christiano. ↩︎
It seems especially strange from a contributors-to-reward frame. An "extra criteria" frame can more easily posit that simplicity just matters that much to SGD. But: why think that? And note that this argument is even stronger if we assume that the simplicity benefits of scheming are only worth order of 100 parameters, per various of the other toy calculations above. ↩︎
Thanks to Ryan Greenblatt for flagging the parallelism consideration to me, and to Evan Hubinger for discussion of memory. That said, note that to the extent the model has to cache a conclusion like "I derived on an earlier episode that it follows from my goals that I should get reward instrumentally" somewhere in its memory, this starts to look similar to storing it in the model's parameters themselves, in which case it becomes a complexity cost to be weighed against the complexity benefits of the model's simpler goal. (Though perhaps there are important differences between a model's internal parameters and some external memory bank, here.) ↩︎
Hubinger thinks of bigger models as giving more compute (speed), and thus as allowing for simpler algorithms (since these algorithms then get to run for longer) – see e.g. discussion here. So he sees the fact that we keep scaling up to bigger models as evidence that we'll tilt more and more towards training processes that favor simpler algorithms. And when he looks at, for example, deep double descent curves for model size, and thinks of choosing "the smallest model possible that fits the data" as biasing towards speed, he notes that "the smallest model possible that fits the data does the worst on the deep double descent curve" (though as Rohin Shah points out, our current models don't fit their training data – thanks to Lukas Finnveden for flagging this). ↩︎
My discussion here is inspired by a conversation with Paul Christiano, who suggested some points in this vein; thanks also to Daniel Kokotajlo and Lukas Finnveden. ↩︎
Note that the point here is slightly different from the question that came up in the context of goal-guarding, which is whether e.g. SGD would actively transform the instrumental brick-stacker into the terminal brick-stacker. Here we're ignoring "paths through model space" like that, and focusing entirely on a comparison between the final properties of different models. Clearly, though, the two questions are closely related. ↩︎
Thanks to William MacAskill for some useful discussion here. ↩︎