2779 karma, joined Nov 2016


Senior research analyst at Open Philanthropy. Doctorate in philosophy at the University of Oxford. Opinions my own.


Otherness and control in the age of AGI
Scheming AIs: Will AIs fake alignment during training in order to get power?


(Also copied from LW. And partly re-hashing my response from twitter.)

I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform. 

(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers seems like it might well be a productive frame. I do also think that questions about whether models will be bottlenecked on serial computation, and/or whether "shallower" computations will be selected for, are pretty relevant here, and the report includes a rough calculation in this respect in section 4.4.2 (see also summary here).)

Indeed, I think that maybe the strongest single argument against scheming is a combination of 

  1. "Because of the extra reasoning schemers perform, SGD would prefer non-schemers over schemers in a comparison re: final properties of the models" and 
  2. "The type of path-dependence/slack at stake in training is such that SGD will get the model that it prefers overall." 

My sense is that I'm less confident than you in both (1) and (2), but I think they're both plausible (the report, in particular, argues in favor of (1)), and that the combination is a key source of hope. I'm excited to see further work fleshing out the case for both (including e.g. the sorts of arguments for (2) that I took you and Nora to be gesturing at on twitter -- the report doesn't spend a ton of time on assessing how much path-dependence to expect, and of what kind).

Re: your discussion of the "ghost of instrumental reasoning," "deducing lots of world knowledge 'in-context,'" and "the perspective that NNs will 'accidentally' acquire such capabilities internally as a convergent result of their inductive biases" -- especially given that you only skimmed the report's section headings and a small amount of the content, I have some sense, here, that you're responding to other arguments you've seen about deceptive alignment, rather than to specific claims made in the report (I don't, for example, make any claims about world knowledge being derived "in-context," or about models "accidentally" acquiring flexible instrumental reasoning). Is your basic thought something like: sure, the models will develop flexible instrumental reasoning that could in principle be used in service of arbitrary goals, but they will only in fact use it in service of the specified goal, because that's the thing training pressures them to do? If so, my feeling is something like: ok, but a lot of the question here is whether using the instrumental reasoning in service of some other goal (one that backchains into getting-reward) will be suitably compatible with/incentivized by training pressures as well. And I don't see e.g. the reversal curse as strong evidence on this front.

Re: "mechanistically ungrounded intuitions about 'goals' and 'tryingness'" -- as I discuss in section 0.1, the report is explicitly setting aside disputes about whether the relevant models will be well-understood as goal-directed (my own take on that is in section 2.2.1 of my report on power-seeking AI here). The question in this report is whether, conditional on goal-directedness, we should expect scheming. That said, I do think that what I call the "messiness" of the relevant goal-directedness might be relevant to our overall assessment of the arguments for scheming in various ways, and that scheming might require an unusually high standard of goal-directedness in some sense. I discuss this in section 2.2.3, on "'Clean' vs. 'messy' goal-directedness," and in various other places in the report.

Re: "long term goals are sufficiently hard to form deliberately that I don't think they'll form accidentally" -- the report explicitly discusses cases where we intentionally train models to have long-term goals (both via long episodes, and via short episodes aimed at inducing long-horizon optimization). I think scheming is more likely in those cases. See section 2.2.4, "What if you intentionally train the model to have long-term goals?" That said, I'd be interested to see arguments that credit assignment difficulties actively count against the development of beyond-episode goals (whether in models trained on short episodes or long episodes) for models that are otherwise goal-directed. And I do think that, if we could be confident that models trained on short episodes won't learn beyond-episode goals accidentally (even irrespective of mundane adversarial training -- e.g., that models rewarded for getting gold coins on the episode would not learn a goal that generalizes to caring about gold coins in general, even prior to efforts to punish it for sacrificing gold-coins-on-the-episode for gold-coins-later), that would be a significant source of comfort (I discuss some possible experimental directions in this respect in section 6.2).

Thanks for this thoughtful comment, Ben. And also, for putting "The Gold Lily" and "Mother and Child" on my radar -- they hadn't been before. I agree that "Mother and Child" evokes some kind of intergenerational project in the way you describe -- "it is your turn to address it." It seems related to the thing I was trying to talk about at the end of the post -- e.g., Gluck asking for some kind of directness and intensity of engagement with life.

Thanks! Re: one in five million and .01% -- thanks, edited. And thanks for pointing to the Augenblick piece -- does look relevant (though my specific interest in that footnote was in constraints applicable to a model where you can only consider some subset of your evidence at any given time).

I'm sorry to hear about this, Nathan. As I say in the post, I do think that the question of how to do gut-stuff right from a practical perspective is distinct from the epistemic angle that the post focuses on, and I think it's important to attend to both.

Noting that the passage you quote in your appendix from my report isn't my definition of the type of AI I'm focused on. I'm focused on AI systems with the following properties:

  • Advanced capability: they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation). 
  • Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world. 
  • Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.

See section 2.1 in the report for more in-depth description.

Hi Jake, 

Thanks for this comment. I discuss this sort of case in footnote 33 here -- I think it's a good place to push back on the argument. Quoting what I say there:

there is, perhaps, some temptation to say “even if I should be indifferent to these people burned alive, I’m not! Screw indifference-ism world! Sounds like a shitty objective normative order anyway – let’s rebel against it.” That is, it feels like indifference-ism worlds have told me what the normative facts are, but they haven’t told me about my loyalty to the normative facts, and the shittyness of these normative facts puts that loyalty even more in question.  

And perhaps, as well, there’s some temptation to think that “Well, indifference-ism world is morally required to be indifferent to my overall decision-procedure as well – so I’ll use a decision-procedure that isn’t indifferent to what happens in indifference-ism world. Indifference-ism world isn't allowed to care!”  

These responses might seem dicey, though. If they (or others) don't end up working, ultimately I think that biting the bullet and taking this sort of deal is in fact less bad than doing so in the nihilism-focused version or the original. So it’s an option if necessary – and one I’d substantially prefer to biting the bullet in all of them.

That is, I'm interested in some combination of: 

  • Not taking the deal because you're uncertain of your loyalty to the normative facts (e.g., something about internalism/externalism etc)
  • Not taking the deal because indifference-ism world is indifferent to your decision procedure (or to your actions more generally), so whatever, let's save my family in those worlds. 
  • Biting the bullet and taking the deal if it comes to that, but not taking it in the other cases discussed in the post. 

Adding a few more thoughts, I think part of what I'm interested in here is the question of what you would be "trying" to do (from some kind of "I endorse this" perspective, even if the endorsement doesn't have any external backing from the normative facts) conditional on a given world. If, in indifference-ism world, you wouldn't be trying, in this sense, to protect your family, such that your representative from indifference-ism world would indeed be like "yeah, go ahead, burn my family alive," then taking the deal looks more OK to me. But if, conditional on indifference-ism, you would be trying to protect your family anyway (maybe because: the normative facts are indifferent, so might as well), such that your representative from indifference-ism world would be like "I'm against this deal," then taking the deal looks worse to me. And the second thing seems more like where I'd expect to end up.

A few questions about this: 

  1. Does this view imply that it is actually not possible to have a world where e.g. a machine creates one immortal happy person per day, forever, who then form an ever-growing line?
  2. How does this view interpret cosmological hypotheses on which the universe is infinite? Is the claim that actually, on those hypotheses, the universe is finite after all? 
  3. It seems like lots of the (countable) worlds and cases discussed in the post can simply be reframed as never-ending processes, no? And then similar (identical?) questions will arise? Thus, for example, w5 is equivalent to a machine that creates a1 at -1, then a3 at -1, then a5 at -1, etc. w6 is equivalent to a machine that creates a1 at -1, then a2 at -1, a3 at -1, etc. What would this view say about which of these machines we should create, given the opportunity? How should we compare these to a w8 machine that creates b1 at -1, b2 at -1, b3 at -1, b4 at -1, etc?

Re: the Jaynes quote: I'm not sure I've understood the full picture here, but in general, to me it doesn't feel like the central issues here have to do with dependencies on "how the limit is approached," such that requiring that each scenario pin down an "order" solves the problems. For example, I think that a lot of what seems strange about Neutrality-violations in these cases is that even if we pin down an order for each case, the fact that you can re-arrange one into the other makes it seem like they ought to be ethically equivalent. Maybe we deny that, and maybe we do so for reasons related to what you're talking about - but it seems like the same bullet. 
