This is a linkpost for Imitation Learning is Probably Existentially Safe by Michael Cohen and Marcus Hutter.
Abstract
Concerns about extinction risk from AI vary among experts in the field. But AI encompasses a very broad category of algorithms. Perhaps some algorithms would pose an extinction risk, and others wouldn’t. Such an observation might be of great interest to both regulators and innovators. This paper argues that advanced imitation learners would likely not cause human extinction. We first present a simple argument to that effect, and then we rebut six different arguments that have been made to the contrary. A common theme of most of these arguments is a story for how a subroutine within an advanced imitation learner could hijack the imitation learner’s behavior toward its own ends. But we argue that each argument is flawed and each story implausible.
1 Introduction
While many theorists have come to share the view that sufficiently advanced AI systems might pose a threat to the continued existence of humanity [Hinton et al., 2023, Cohen et al., 2022, Russell, 2019, Bostrom, 2014], it is important, if we are to make progress in thinking about this issue, to be clear about which types of AI pose the genuine threats. That way we can focus on where the danger actually lies. This paper aims to refute claims that imitation learning algorithms present such a threat. While we do think there are types of AI we should be worried about, that does not extend to all types of AI. So in what follows, we will examine arguments that have been put forward that imitation learners present an extinction risk to humanity, and explain why we think they go wrong.
First, we’ll offer a simple argument that a sufficiently advanced supervised learning algorithm, trained to imitate humans, would very likely not gain total control over humanity (to the point of making everyone defenseless) and then cause or allow human extinction from that position.
No human has ever gained total control over humanity. It would be a very basic mistake to think anyone ever has. Moreover, if they did so, very few humans would accept human extinction. An imitation learner that successfully gained total control over humanity and then allowed human extinction would, on both counts, be an extremely poor imitation of any human, and easily distinguishable from one, whereas an advanced imitation learner will likely imitate humans well.
This basic observation should establish that any conclusion to the contrary should be very surprising, and so a high degree of rigor should be expected from arguments to that effect. If a highly advanced supervised learning algorithm is directed to the task of imitating a human, then powerful forces of optimization are seeking a target that is fundamentally existentially safe: indistinguishability from humans. Stories about how such optimization might fail should be extremely careful in establishing the plausibility of every step.
In this paper, we’ll rebut six different arguments we’ve encountered that a sufficiently advanced supervised learning algorithm, trained to imitate humans, would likely cause human extinction. These arguments originate from Yudkowsky [2008] (the Attention Director Argument), Christiano [2016] (the Cartesian Demon Argument), Krueger [2019] (the Simplicity of Optimality Argument), Branwen [2022] (the Character Destiny Argument), Yudkowsky [2023] (the Rational Subroutine Argument), and Hubinger et al. [2019] (the Deceptive Alignment Argument). Note: Christiano only thinks his argument is possibly correct, rather than likely correct, for the advanced AI systems that we will end up creating. And Branwen does not think his hypothetical is likely, only plausible enough to discuss. But maybe some of the hundreds of upvoters on the community blog LessWrong consider it likely.
In all cases, we have rewritten the arguments originating from those sources (some of which are spread over many pages with gaps that need to be filled in). For Christiano [2016] and Hubinger et al. [2019], our rewritten versions of their arguments are shorter, but the longer originals are no stronger at the locations that we contest. And for the other four sources, the original text is no thorougher than our characterization of their argument. None of the arguments have been peer reviewed, and to our knowledge, only Hubinger et al. [2019] was reviewed even informally prior to publication. However, we can assure the reader they are taken seriously in many circles.
8 Conclusion
The existential risk from imitation learners, which we have argued is small, stands in stark contrast to the existential risk arising from reinforcement learning agents and similar artificial agents planning over the long term, which are trained to be as competent as possible, not as human-like as possible. Cohen et al. [2022] identify plausible conditions under which running a sufficiently competent long-term planning agent would make human extinction a likely outcome. Regulators interested in designing targeted regulation should note that imitation learners may safely be treated differently from long-term planning agents. It will be necessary to restrict proliferation of the latter, and such an effort must not become stalled by bundling it with overly burdensome restrictions on safer algorithms.
Thanks for the comment, Matthew!
My understanding is that the authors are making 2 points in the passage you quoted:
In my mind, very few humans would want to pursue capabilities which are conducive to gaining control over humanity. There are diminishing returns to having more resources. For example, if you give 10 M$ (0.00001 % of global resources) to a random human, they will not have much of a desire to take risks to increase their wealth to 10 T$ (10 % of global resources), which would be helpful to gain control over humanity. To increase their own happiness and that of their close family and friends, they would do well by investing their newly acquired wealth in exchange-traded funds (ETFs). A good imitator AI would share our disposition of not gaining capabilities beyond a certain point, and therefore (like humans) never get close to having a chance of gaining control over humanity.
I think humans usually acquire power fairly gradually. A good imitator AI would be mindful that acquiring power too fast (suddenly fooming) would go very much against what humans usually do.
No human has ever had control over all humanity, so I agree there is a sense in which we have "zero data" about what humans would do under such conditions. Yet, I am still pretty confident that the vast majority of humans would not want to cause human extinction. A desire to be praised by others is a major reason humans like to gain power. There would be no one to praise or be praised given human extinction, so I think very few humans would want it if they suddenly gained control over all humanity.
I do not think this is the best comparison. There would arguably be many imitator AIs, and these would not gain near-omnipotent abilities overnight. I would say both of these greatly constrain the level of subjugation. Historically, innovations and new inventions have spread out across the broader economy, so I think there should be a strong prior against a single imitator AI suddenly gaining control over all the other AIs and humans.
From the 1st part of the sentence, it looks like you agree with what I said above about a good imitator AI sharing our disposition of not gaining capabilities beyond a certain point. As for the 2nd part, I agree there would be a significant risk of tyranny and oppression if a random human suddenly gained control over all humanity, but this seems very unlikely to me because of what I said above.
How long-run are you talking about here? Humans 500 years ago arguably had little control over current humans, but this alone does not imply a high existential risk 500 years ago. As Robin Hanson said:
I expect you agree with some of this.