3001 karmaJoined


Senior research analyst at Open Philanthropy. Doctorate in philosophy at the University of Oxford. Opinions my own.


Otherness and control in the age of AGI
Scheming AIs: Will AIs fake alignment during training in order to get power?


Bleh, sorry, looks like this cross-posted twice from LW, but has only one LW copy (and can only be edited via that copy). I'm going to try to keep this version and get rid of the other one. Edit: resolved.

(Copying over my response from LessWrong)

Thanks for writing this -- I’m very excited about people pushing back on/digging deeper re: counting argumentssimplicity arguments, and the other arguments re: scheming I discuss in the report. Indeed, despite the general emphasis I place on empirical work as the most promising source of evidence re: scheming, I also think that there’s a ton more to do to clarify and maybe debunk the more theoretical arguments people offer re: scheming – and I think playing out the dialectic further in this respect might well lead to comparatively fast progress (for all their centrality to the AI risk discourse, I think arguments re: scheming have received way too little direct attention). And if, indeed, the arguments for scheming are all bogus, this is super good news and would be an important update, at least for me, re: p(doom) overall. So overall I’m glad you’re doing this work and think this is a valuable post. 

On other note up front: I don’t think this post “surveys the main arguments that have been put forward for thinking that future AIs will scheme.” In particular: both counting arguments and simplicity arguments (the two types of argument discussed in the post) assume we can ignore the path that SGD takes through model space. But the report also discusses two arguments that don’t make this assumption – namely, the “training-game independent proxy goals story” (I think this one is possibly the most common story, see e.g. Ajeya here, and all the talk about the evolution analogy) and the “nearest max-reward goal argument.” I think that the idea that “a wide variety of goals can lead to scheming” plays some role in these arguments as well, but not such that they are just the counting argument restated, and I think they’re worth treating on their own terms. 

On counting arguments and simplicity arguments

Focusing just on counting arguments and simplicity arguments, though: Suppose that I’m looking down at a superintelligent model newly trained on diverse, long-horizon tasks. I know that it has extremely ample situational awareness – e.g., it has highly detailed models of the world, the training process it’s undergoing, the future consequences of various types of power-seeking, etc – and that it’s getting high reward because it’s pursuing some goal (the report conditions on this). Ok, what sort of goal? 

We can think of arguments about scheming in two categories here. 

  • (I) The first tries to be fairly uncertain/agnostic about what sorts of goals SGD’s inductive biases favor, and it argues that given this uncertainty, we should be pretty worried about scheming. 
    • I tend to think of my favored version of the counting argument (that is, the hazy counting argument) in these terms. 
  • (II) The second type focuses on a particular story about SGD’s inductive biases and then argues that this bias favors schemers.
    • I tend to think of simplicity arguments in these terms. E.g., the story is that SGD’s inductive biases favor simplicity, schemers can have simpler goals, so schemers are favored.

Let’s focus first on (I), the more-agnostic-about-SGD’s-inductive-biases type. Here’s a way of pumping the sort of intuition at stake in the hazy counting argument:

  1. A very wide variety of goals can prompt scheming.
  2. By contrast, non-scheming goals need to be much more specific to lead to high reward.
  3. I’m not sure exactly what sorts of goals SGD’s inductive biases favor, but I don’t have strong reason to think they actively favor non-schemer goals.
  4. So, absent further information, and given how many goals-that-get-high-reward are schemer-like, I should be pretty worried that this model is a schemer. 

Now, as I mention in the report, I'm happy to grant that this isn't a super rigorous argument. But how, exactly, is your post supposed to comfort me with respect to it? We can consider two objections, both of which are present in/suggested by your post in various ways.

  • (A) This sort of reasoning would lead to you giving significant weight to SGD overfitting. But SGD doesn’t overfit, so this sort of reasoning must be going wrong, and in fact you should have low probability on SGD having selected a schemer, even given this ignorance about SGD's inductive biases.
  • (B): (3) is false: we know enough about SGD’s inductive biases to know that it actively favors non-scheming goals over scheming goals. 

Let’s start with (A). I agree that this sort of reasoning would lead you to giving significant weight to SGD overfitting, absent any further evidence. But it’s not clear to me that giving this sort of weight to overfitting was unreasonable ex ante, or that having learned that SGD-doesn't-overfit, you should now end up with low p(scheming) even given your ongoing ignorance about SGD's inductive biases.

Thus, consider the sort of analogy I discuss in the counting arguments section. Suppose that all we know is that Bob lives in city X, that he went to a restaurant on Saturday, and that town X has a thousand chinese restaurants, a hundred mexican restaurants, and one indian restaurant. What should our probability be that he went to a chinese restaurant? 

In this case, my intuitive answer here is: “hefty.”[1] In particular, absent further knowledge about Bob’s food preferences, and given the large number of chinese restaurants in the city, “he went to a chinese restaurant” seems like a pretty salient hypothesis. And it seems quite strange to be confident that he went to a non-chinese restaurant instead. 

Ok but now suppose you learn that last week, Bob also engaged in some non-restaurant leisure activity. For such leisure activities, the city offers: a thousand movie theaters, a hundred golf courses, and one escape room. So it would’ve been possible to make a similar argument for putting hefty credence on Bob having gone to a movie. But lo, it turns out that actually, Bob went golfing instead, because he likes golf more than movies or escape rooms.

How should you update about the restaurant Bob went to? Well… it’s not clear to me you should update much. Applied to both leisure and to restaurants, the hazy counting argument is trying to be fairly agnostic about Bob’s preferences, while giving some weight to some type of “count.” Trying to be uncertain and agnostic does indeed often mean putting hefty probabilities on things that end up false. But: do you have a better proposed alternative, such that you shouldn’t put hefty probability on “Bob went to a chinese restaurant”, here, because e.g. you learned that hazy counting arguments don’t work when applied to Bob? If so, what is it? And doesn’t it seem like it’s giving the wrong answer?

Or put another way: suppose you didn’t yet know whether SGD overfits or not, but you knew e.g. about the various theoretical problems with unrestricted uses of the indifference principle. What should your probability have been, ex ante, on SGD overfitting? I’m pretty happy to say “hefty,” here. E.g., it’s not clear to me that the problem, re: hefty-probability-on-overfitting, was some a priori problem with hazy-counting-argument-style reasoning. For example: given your philosophical knowledge about the indifference principle, but without empirical knowledge about ML, should you have been super surprised if it turned out that SGD did overfit? I don’t think so. 

Now, you could be making a different, more B-ish sort of argument here: namely, that the fact that SGD doesn’t overfit actively gives us evidence that SGD’s inductive biases also disfavor schemers. This would be akin to having seen Bob, in a different city, actively seek out mexican restaurants despite there being many more chinese restaurants available, such that you now have active evidence that he prefers mexican and is willing to work for it. This wouldn’t be a case of having learned that bob’s preferences are such that hazy counting arguments “don’t work on bob” in general. But it would be evidence that Bob prefers non-chinese.

I’m pretty interested in arguments of this form. But I think that pretty quickly, they move into the territory of type (II) arguments above: that is, they start to say something like “we learn, from SGD not overfitting, that it prefers models of type X. Non-scheming models are of type X, schemers are not, so we now know that SGD won’t prefer schemers.”

But what is X? I’m not sure your answer (though: maybe it will come in a later post). You could say something like “SGD prefers models that are ‘natural’” – but then, are schemers natural in that sense? Or, you could say “SGD prefers models that behave similarly on the training and test distributions” – but in what sense is a schemer violating this standard? On both distributions, a schemer seeks after their schemer-like goal. I’m not saying you can’t make an argument for a good X, here – but I haven’t yet heard it. And I’d want to hear its predictions about non-scheming forms of goal-misgeneralization as well.  

Indeed, my understanding is that a quite salient candidate for “X” here is “simplicity” – e.g., that SGD’s not overfitting is explained by its bias towards simpler functions. And this puts us in the territory of the “simplicity argument” above. I.e., we’re now being less agnostic about SGD’s preferences, and instead positing some more particular bias. But there’s still the question of whether this bias favors schemers or not, and the worry is that it does. 

This brings me to your take on simplicity arguments. I agree with you that simplicity arguments are often quite ambiguous about the notion of simplicity at stake (see e.g. my discussion here). And I think they’re weak for other reasons too (in particular, the extra cognitive faff scheming involves seems to me more important than its enabling simpler goals). 

But beyond “what is simplicity anyway,” you also offer some other considerations, other than SGD-not-overfitting, meant to suggest that we have active evidence that SGD’s inductive biases disfavor schemers. I’m not going to dig deep on those considerations here, and I’m looking forward to your future post on the topic. For now, my main reaction is: “we have active evidence that SGD’s inductive biases disfavor schemers” seems like a much more interesting claim/avenue of inquiry than trying to nail down the a priori philosophical merits of counting arguments/indifference principles, and if you believe we have that sort of evidence, I think it’s probably most productive to just focus on fleshing it out and examining it directly. That is, whatever their a priori merits, counting arguments are attempting to proceed from a position of lots of uncertainty and agnosticism, which only makes sense if you’ve got no other good evidence to go on. But if we do have such evidence (e.g., if (3) above is false), then I think it can quickly overcome whatever “prior” counting arguments set (e.g., if you learn that Bob has a special passion for mexican food and hates chinese, you can update far towards him heading to a mexican restaurant). In general, I’m very excited for people to take our best current understanding of SGD’s inductive biases (it’s not my area of expertise), and apply it to p(scheming), and am interested to hear your own views in this respect. But if we have active evidence that SGD’s inductive biases point away from schemers, I think that whether counting arguments are good absent such evidence matters way less, and I, for one, am happy to pay them less attention.

(One other comment re: your take on simplicity arguments: it seems intuitively pretty non-simple to me to fit the training data on the training distribution, and then cut to some very different function on the test data, e.g. the identity function or the constant function. So not sure your parody argument that simplicity also predicts overfitting works. And insofar as simplicity is supposed to be the property had by non-overfitting functions, it seems somewhat strange if positing a simplicity bias predicts over-fitting after all.)

A few other comments

Re: goal realism, it seems like the main argument in the post is something like: 

  1. Michael Huemer says that it’s sometimes OK to use the principle of indifference if you’re applying it to explanatorily fundamental variables. 
  2. But goals won’t be explanatorily fundamental. So the principle of indifference is still bad here.

I haven’t yet heard much reason to buy Huemer’s view, so not sure how much I care about debating whether we should expect goals to satisfy his criteria of fundamentality. But I'll flag I do feel like there’s a pretty robust way in which explicitly-represented goals appropriately enter into our explanations of human behavior – e.g., I have buying a flight to New York because I want to go to New York, I have a representation of that goal and how my flight-buying achieves it, etc. And it feels to me like your goal reductionism is at risk of not capturing this. (To be clear: I do think that how we understand goal-directedness matters for scheming -- more here -- and that if models are only goal-directed in a pretty deflationary sense, this makes scheming a way weirder hypothesis. But I think that if models are as goal-directed as strategic and agentic humans reasoning about how to achieve explicitly represented goals, their goal-directedness has met a fairly non-deflationary standard.)

I’ll also flag some broader unclarity about the post’s underlying epistemic stance. You rightly note that the strict principle of indifference has many philosophical problems. But it doesn’t feel to me like you’ve given a compelling alternative account of how to reason “on priors” in the sorts of cases where we’re sufficiently uncertain that there’s a temptation to spread one’s credence over many possibilities in the broad manner that principles-of-indifference-ish reasoning attempts to do. 

Thus, for example, how does your epistemology think about a case like “There are 1000 people in this town, one of them is the murderer, what’s the probability that it’s Mortimer P. Snodgrass?” Or: “there are a thousand white rooms, you wake up in one of them, what’s the probability that it’s room number 734?” These aren’t cases like dice, where there’s a random process designed to function in principle-of-indifference-ish ways. But it’s pretty tempting to spread your credence out across the people/rooms (even if in not-fully-uniform ways), in a manner that feels closely akin to the sort of thing that principle-of-indifference-ish reasoning is trying to do. (We can say "just use all the evidence available to you" -- but why should this result in such principle-of-indifference-ish results?)

Your critique of counting argument would be more compelling to me if you had a fleshed out account of cases like these -- e.g., one which captures the full range of cases where we’re pulled towards something principle-of-indifference-ishsuch that you can then take that account and explain why it shouldn’t point us towards hefty probabilities on schemers, a la the hazy counting argument, even given very-little-evidence about SGD’s inductive biases.

More to say on all this, and I haven't covered various ways in which I'm sympathetic to/moved by points in the vicinity of the one's you're making here.  But for now: thanks again for writing, looking forward to future installments. 

(Also copied from LW. And partly re-hashing my response from twitter.)

I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform. 

(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers seems like it might well be a productive frame. I do also think that questions about whether models will be bottlenecked on serial computation, and/or whether "shallower" computations will be selected for, are pretty relevant here, and the report includes a rough calculation in this respect in section 4.4.2 (see also summary here).)

Indeed, I think that maybe the strongest single argument against scheming is a combination of 

  1. "Because of the extra reasoning schemers perform, SGD would prefer non-schemers over schemers in a comparison re: final properties of the models" and 
  2. "The type of path-dependence/slack at stake in training is such that SGD will get the model that it prefers overall." 

My sense is that I'm less confident than you in both (1) and (2), but I think they're both plausible (the report, in particular, argues in favor of (1)), and that the combination is a key source of hope. I'm excited to see further work fleshing out the case for both (including e.g. the sorts of arguments for (2) that I took you and Nora to be gesturing at on twitter -- the report doesn't spend a ton of time on assessing how much path-dependence to expect, and of what kind).

Re: your discussion of the "ghost of instrumental reasoning," "deducing lots of world knowledge 'in-context,' and "the perspective that NNs will 'accidentally' acquire such capabilities internally as a convergent result of their inductive biases" -- especially given that you only skimmed the report's section headings and a small amount of the content, I have some sense, here, that you're responding to other arguments you've seen about deceptive alignment, rather than to specific claims made in the report (I don't, for example, make any claims about world knowledge being derived "in-context," or about models "accidentally" acquiring flexible instrumental reasoning). Is your basic thought something like: sure, the models will develop flexible instrumental reasoning that could in principle be used in service of arbitrary goals, but they will only in fact use it in service of the specified goal, because that's the thing training pressures them to do? If so, my feeling is something like: ok, but a lot of the question here is whether using the instrumental reasoning in service of some other goal (one that backchains into getting-reward) will be suitably compatible with/incentivized by training pressures as well. And I don't see e.g. the reversal curse as strong evidence on this front. 

Re: "mechanistically ungrounded intuitions about 'goals' and 'tryingness'" -- as I discuss in section 0.1, the report is explicitly setting aside disputes about whether the relevant models will be well-understood as goal-directed (my own take on that is in section 2.2.1 of my report on power-seeking AI here). The question in this report is whether, conditional on goal-directedness, we should expect scheming. That said, I do think that what I call the "messyness" of the relevant goal-directedness might be relevant to our overall assessment of the arguments for scheming in various ways, and that scheming might require an unusually high standard of goal-directedness in some sense. I discuss this in section 2.2.3, on "'Clean' vs. 'messy' goal-directedness," and in various other places in the report.

Re: "long term goals are sufficiently hard to form deliberately that I don't think they'll form accidentally" -- the report explicitly discusses cases where we intentionally train models to have long-term goals (both via long episodes, and via short episodes aimed at inducing long-horizon optimization). I think scheming is more likely in those cases. See section 2.2.4, "What if you intentionally train the model to have long-term goals?" That said, I'd be interested to see arguments that credit assignment difficulties actively count against the development of beyond-episode goals (whether in models trained on short episodes or long episodes) for models that are otherwise goal-directed. And I do think that, if we could be confident that models trained on short episodes won't learn beyond-episode goals accidentally (even irrespective of mundane adversarial training -- e.g., that models rewarded for getting gold coins on the episode would not learn a goal that generalizes to caring about gold coins in general, even prior to efforts to punish it for sacrificing gold-coins-on-the-episode for gold-coins-later), that would be a significant source of comfort (I discuss some possible experimental directions in this respect in section 6.2).

Thanks for this thoughtful comment, Ben. And also, for putting the "The Gold Lily" and "Mother and Child" on my radar -- they hadn't been before. I agree that "Mother and Child" evokes a sort some kind of sort of intergenerational project in the way you describe -- "it is your turn to address it." It seems related to the thing I was trying to talk about at the end of the post -- e.g., Gluck asking for some kind of directness and intensity of engagement with life. 

Thanks! Re: one in five million and .01% -- thanks, edited. And thanks for pointing to the Augenblick piece -- does look relevant (though my specific interest in that footnote was in constraints applicable to a model where you can only consider some subset of your evidence at any given time).

I'm sorry to hear about this, Nathan. As I say in the post, I do think that the question how to do gut-stuff right from a practical perspective is distinct from the epistemic angle that the post focuses on, and I think it's important to attend to both.

Noting that the passage you quote in your appendix from my report isn't my definition of the type of AI I'm focused on. I'm focused on AI systems with the following properties:

  • Advanced capability: they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation). 
  • Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world. 
  • Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.

See section 2.1 in the report for more in-depth description.

Load more