Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya

I think this is a good piece, and I’m glad you wrote it. One disagreement I have is that I think the behaviour of a deep learning agent trained by HFDT is less predictable than this piece suggests. My guesses would be something like:

1. 30%: the behaviour of HFDT-AGI is basically fine/instability is easy to manage
2. 30%: the behaviour of HFDT-AGI is unstable in a hard-to-manage way and includes seriously bad but not catastrophic behaviours
3. 30%: the behaviour of HFDT-AGI is unstable and includes catastrophic behaviours
4. 10%: the behaviour of HFDT-AGI converges to some kind of catastrophic behaviour

These numbers aren’t the product of much reflection; they’d probably change substantially with a couple of hours’ thought. Also, note that catastrophe is overall highly likely for the bottom 40% of outcomes, as people will probably make multiple attempts at AGI and consequently explore a range of possible behaviours.
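To make the "multiple attempts" point concrete, here is a rough sketch. It assumes, purely for illustration, that each attempt at AGI independently has the same chance `p` of producing catastrophic behaviour. The independence assumption, the function name and the example values are all hypothetical and are not part of the estimates above.

```python
def prob_at_least_one_catastrophe(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is catastrophic.

    p is a hypothetical per-attempt probability; n is the number of attempts.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Complement of "every one of the n attempts avoids catastrophe".
    return 1.0 - (1.0 - p) ** n


# Illustrative values only: a per-attempt chance of 0.3, a few attempt counts.
for attempts in (1, 3, 10):
    print(attempts, round(prob_at_least_one_catastrophe(0.3, attempts), 3))
```

Under these assumptions the chance of at least one catastrophe rises quickly with the number of attempts, which is the intuition behind the claim; correlated attempts would change the numbers.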

I interpret your piece as arguing that option 4 actually has more than 50% weight, and this doesn’t seem right to me.

We seem to agree that an agent trained by episodic reward might behave in a way that is interpretable as “strategically maximising reward”, and might behave in a different way. In particular, if we assume that training episodes are IID from some distribution, then it seems reasonable to assume that the agent behaves as a strategic reward maximiser in episodes sampled from the same distribution, relative to that distribution. However, in the case we’re interested in, we’re interested in behaviour in a more general environment, and it’s much less clear whether the behaviour over time will be interpretable in this way.

I’m not sure about the following: you spend substantial time discussing the actions of a strategic reward maximiser, and argue that this follows from a premise of “correct generalisation”, but also that it’s “very plausible” that its behaviour cannot ultimately be interpreted as strategic reward maximisation. Does your overall assessment of likely AI takeover depend on strategic reward maximisation being quite likely? I think it is substantially more likely that the AI’s behaviour cannot be interpreted as strategic reward maximisation. I’ll say a bit more about this later.

If strategic reward maximisation is not likely, you argue “Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control”. I think this section depends on a strong and, in my view, not especially compelling assumption. I’m going to try to spell it out. The structure of your argument seems to be:

HFDT-AGI’s behaviour will be interpretable as strategic pursuit of some utility function
In your preferred reference class of utility functions, this implies catastrophic behaviour
- I’m going to call this class “simple utility functions”, though I think this phrasing is not especially epistemically hygienic because it suggests a sharper definition than we actually have

(I’m putting words in your mouth a bit here, so please let me know if this isn’t what you meant).

This depends on the assumption:

HFDT-AGI’s behaviour will be interpretable as strategic pursuit of some utility function which is, with high probability, a simple utility function

This assumption is strong. Many behaviours can be rationalised as maximising some utility function, but for most behaviours the relevant utility function is one that appears quite crazy – of the form “I value taking this action in these exact circumstances” – and not dangerous. However, adding a constraint to the class of relevant utility functions will render almost all behaviours unable to be interpreted as utility maximising.
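A minimal sketch of the rationalisation point, assuming a toy setting with a handful of discrete states and actions. The names `rationalising_utility`, `best_action` and `erratic_policy` are hypothetical and chosen only for illustration. Given any behaviour at all, one can write down a "crazy" utility function of the form "I value taking this action in these exact circumstances" that the behaviour maximises.

```python
from typing import Callable, Hashable, Iterable

State = Hashable
Action = Hashable


def rationalising_utility(
    policy: Callable[[State], Action],
) -> Callable[[State, Action], float]:
    """Build a utility that is 1 when the action matches the policy, else 0."""

    def utility(state: State, action: Action) -> float:
        return 1.0 if action == policy(state) else 0.0

    return utility


def best_action(
    utility: Callable[[State, Action], float],
    state: State,
    actions: Iterable[Action],
) -> Action:
    """Return the action with the highest utility in the given state."""
    return max(actions, key=lambda a: utility(state, a))


ACTIONS = ["left", "right", "wait"]


def erratic_policy(state: int) -> str:
    """A hypothetical, arbitrary behaviour with no obvious goal behind it."""
    return ACTIONS[(state * 7 + 3) % 3]


u = rationalising_utility(erratic_policy)
# The arbitrary behaviour is exactly what a maximiser of u would do.
assert all(best_action(u, s, ACTIONS) == erratic_policy(s) for s in range(20))
```

The constructed utility carries no information beyond the behaviour itself, which is why it is not dangerous in the relevant sense. Restricting attention to some narrower class of utility functions removes this trivial construction.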

It seems to me (and I’ve only thought about it a little, and I certainly don’t have a proof) that there’s basically only one way that the behaviour of HFDT-AGI ends up satisfying this assumption, and that’s if HFDT can (in the limit of high competence) be well described by an algorithm that does something like:

Pick a function that specifies “true utility”
- According to a prior that corresponds to “simple utilities” updated by reinforcement
Pick a hypothesis about how to maximise it

But it seems rather unlikely to me that this is how HFDT works insofar as just about any description of this level of specificity of how HFDT works seems unlikely, and this one doesn’t seem particularly good. I think this also holds for the special case where the function specifying “true utility” is, in some sense or other, the reward allocated to HFDT during training.

I think my argument for why to doubt this assumption is weak, but I hope you can at least appreciate that it is a strong assumption.

technicalities

Epistemic status: pun

Alexander: From ἀλέξω (aléxō, “to repel”) + ἀνήρ (anḗr, “man”) +‎ -ος (-os, thing).

~~That Which Drives Off Man.

[anonymous]

I'm an ML researcher, and I would give the probability of baseline HFDT leading to a PASTA set of capabilities as approximately 0, and my impression is that this is the experience of the majority of ML researchers.

Baseline HFDT seems to be the single most straightforward vision that could plausibly work to train transformative AI very soon. From informal conversations, I get the impression that many ML researchers would bet on something like this working in broadly the way I described in this post, and multiple major AI companies are actively trying to scale up the capabilities of models trained with something like baseline HFDT.

In other words, I disagree with this, and therefore it seems unclear what to take away from the rest of the post, which is well-reasoned iff you agree with the starting assumptions.

Remember that PASTA is actually a very strong criterion: it requires AI to be able to do all activities in the scientific research loop, including evaluating whether ideas are good and generating creative new ideas, skills which I think are general-intelligence-complete. I've no doubt that a HFDT system can automate some parts of the scientific discovery process, such as writing draft papers, or controlling some robot to do lab experiments in constrained environments. But ultimately this subset of PASTA only makes such a system a research assistant, one which like AlphaFold may make some research more efficient, but which does not complete the full loop.

kokotajlod

Can you say more about why you think this? Both why you think there's 0 chance of HFDT leading to a system that can evaluate whether ideas are good and generate creative new ideas, and why you think this is what the majority of ML researchers think?

(I've literally never met a ML researcher with your view before to my knowledge, though I haven't exactly gone around asking everyone I know & my environment is of course selected against people with your view since I'm at OpenAI.)

[anonymous]

The rough shape of the argument is that I think a PASTA system requires roughly human-level general intelligence, and that implies some capabilities which HFDT as described in this post does not have the ability to learn. Using Karnofsky's original PASTA post, let's look at some of the requirements:

1. Consume research papers.
2. Identify research gaps and open problems.
3. Choose an open problem to focus on.
4. Generate hypotheses to test within a problem to potentially solve it.
5. Generate experiment ideas to test the hypotheses.
6. Judge how good the experiment ideas are and select experiments to perform.
7. Carry out the experiments.
8. Select a hypothesis based on experiment results.
9. Write up a paper.
10. Judge how good the paper is.

A system which can do all of the above is a narrower requirement than full general intelligence, especially if each task is decomposed into separate models, but many of these points seem impractical given what we currently know about RL with human feedback. Crucially, a PASTA system requires the ability to do these tasks autonomously in order to have any chance of transformatively accelerating scientific discovery - Ajeya specifies that Alex is an end-to-end system. It's not sufficient, for example, to make a system which can generate a bunch of experiment or research ideas in text but relies on humans to evaluate them. In particular, I would identify 4, 5, 6, 8, and 10 as the key parts of the research loop which seem to require significant ML breakthroughs to achieve, and which I think may be general-intelligence-complete. One thing these tasks have in common is that good feedback is hard for humans, the rewards are sparse, and the episode length is potentially extremely long.

And yet current work on RL with human feedback shows that these techniques work well in highly constrained environments with relatively short episode length, and accurate & frequent reward signal and gradients in the human feedback. Sparse feedback immediately decreases performance significantly (see the ReQueST paper). To me, this suggests a significant number of fundamental breakthroughs remain to achieve PASTA, and that HFDT as described here does not have this capability, even if scaled + minor improvements.

Regarding my impression of the opinions of other ML researchers, I'm a PhD student at Cambridge and have spoken to many academics (both senior and junior), some people at DeepMind and some at Google Brain about their guesses for pathways to AGI, and the vast majority don't seem to think that scaling current RL algorithms + LLMs + human feedback gets us much further towards AGI than we currently are, and they think there are many missing pieces of the puzzle. I'm not surprised to hear that OpenAI has a very different set of opinions though - the company has bet heavily on a particular approach to AGI so would naturally attract people with a different view to the one I've described.

One way to see this is to look at timelines to human level general intelligence - I think most ML researchers would not put this at within 10 or 20 years, based on previous surveys. Yet as Ajeya describes in the post, if training PASTA with baseline HFDT is possible, it seems very likely to happen within 10 years, and I agree with her that "the sooner transformative AI is developed, the more likely it is to be developed in roughly this way". I think that if a full PASTA system can be made, then we are likely to have solved almost all of the bottlenecks to create an AGI. Therefore I think that the timelines conflict with the hypothesis that scaling current techniques which don't require new fundamental breakthroughs is enough for PASTA.

Xander123

I'm pretty unconvinced that your "suggests a significant number of fundamental breakthroughs remain to achieve PASTA" is strong enough to justify the odds being "approximately 0," especially when the evidence is mostly just expecting tasks to stay hard as we scale (something which seems hard to predict, and easy to get wrong). Though it does seem that innovation in certain domains may lead to long episode lengths and inaccurate human evaluation, it also seems like innovation in certain fields (e.g., math) could easily not have this problem (i.e., in cases where verifying is much easier than solving).

[anonymous]

I gave the comment a strong upvote because it's super clear and informative. I also really appreciate it if people spell out their reasons for "scale is not all you need", which doesn't happen that often.

That said, I don't agree with the argument or conclusion. Your argument, at least as stated, seems to be "tasks with the following criteria are hard for current RL with human feedback, so we'll need significant fundamental breakthroughs". The transformer was published 5 years ago. Back then, you could have used a very analogous argument about language models to argue that language models will never do this or that task; but for many of these tasks, language models can perform them now (emergent properties).

[anonymous]

Thank you for the comment - it's a fair point about the difficulty of prediction. In my post I attempted to point to some heuristics which suggest strongly to me that significant fundamental breakthroughs are needed. Other people have different heuristics. At the same time though, it seems like your objection is a fully general argument against fundamental breakthroughs ever being necessary at any point, which seems quite unlikely.

I also think that even the original Attention Is All You Need paper gave some indication of the future direction by testing a large and small transformer and showing greatly improved performance with the large one, while RLHF's early work does not appear to have a similar immediately obvious way to scale up and tackle the big RL challenges like sparse rewards, problems with long episode length, etc.

[anonymous]

At the same time though, it seems like your objection is a fully general argument against fundamental breakthroughs ever being necessary at any point, which seems quite unlikely.

Sorry, what I wanted to say is it seems unclear if fundamental breakthroughs are needed. They might be needed, or not. I personally am pretty uncertain about this and think that both options are possible. I think it's also possible that any breakthroughs that will happen won't change the general picture described in the OP much.

I agree on the rest of your comment!

David Johnston

Independently of my other comment, I think you might be under-utilising your premises, in particular the "racing forward" assumption. Why would AI companies be racing to make the first AGI instead of the best AGI? The most likely answer seems to be: they expect large first-mover advantages that outweigh the benefits of taking more time to build a better system.

The fact that AI companies believe this is evidence that the first AGI past some capability threshold actually does have a particularly large impact. Furthermore, the fact that AI companies are chasing the first-mover advantage might also indicate that they're investing specifically in building the kind of AGI that captures this advantage.

Phil Tanny

I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.

Can we think through "an explosion in science and technology" together? Do you mean a "new, qualitatively different future" would be better, is that your point? Or?

What limits, if any, do you see to our ability to successfully manage an explosion of technology?

Do you feel that human beings can successfully manage any amount of knowledge and power delivered at any rate?

Do you feel that human factors like judgement, maturity, wisdom, and morality etc are relevant parts of the equation? Do you feel that such factors can advance at the same rate as the science and technology?

How do you feel about the science explosion of the 20th century, a known body of evidence we can reference? On one hand the science explosion of that era brought forth countless miracles. On the other hand a single human being can now destroy the modern world in minutes. In your view, what relevance does this pattern have to the future?

Locke

What if instead we replace "world could plausibly develop “transformative AI”" with "someone invents the philosopher's stone" and we do a similar analysis of the implications? That'd be a change of pace.

^{^}

Even though this vision doesn’t involve using exclusively human feedback, I’ll nonetheless refer to it as human feedback on diverse tasks (HFDT) to highlight that human judgments are central to training. In practice, I expect that powerful models will be trained with a combination of human feedback and automated reward signals (e.g. “did the code written by the AI compile?” or “how much computation did the AI use to perform this task?”). But I expect human feedback to be especially important, and I’ll focus on the dangers associated with the human feedback component of the feedback signal since there is more contention among the ML community about whether human feedback is dangerous. Many researchers agree that fully automated reward signals carry a high risk of misalignment, while being mostly optimistic about human feedback. While I agree incorporating human feedback is better than using fully-automated signals, I still think that the default version of human feedback is highly dangerous.

^{^}

For example, we could imagine trying to produce transformative AI by training a large number of small models to do specialized tasks which we manually chain together, rather than training one large model to simultaneously do well at many tasks. We could also imagine using 100% automated reward signals, or “hand-programming” transformative AI rather than using any form of “training.”

^{^}

For example, maybe the required model size is far too large to be affordable anytime soon, or we are unable to generate sufficiently challenging and diverse datasets and environments, or we run into optimization difficulties, etc.

^{^}

By “incentivized,” I mean: “Alex would be rewarded more for playing the training game than it would be for alternative patterns of behavior.” There are further questions about whether Alex’s training procedure would successfully find this pattern of behavior; I believe it would, and discuss this more in the section "Maybe inductive bias or path dependence favors honest strategies?".

^{^}

For example, lying about empirical questions which are politically-sensitive will often receive more reward than telling the truth; consider a version of Alex trained by the Catholic Church in the 1400s who is asked how the solar system works or how life was formed.

^{^}

What exactly “maximizing reward” means is unclear; however, I think this basic conclusion applies to most plausible ways of operationalizing this.

^{^}

This means that we’re bracketing scenarios in which Alex “escapes from the box” during the training phase; in fact I think such scenarios are plausible, which is a way in which risk is greater than this story represents.

^{^}

This could be accomplished through a variety of training signals -- e.g., more human feedback and automated signals, as well as more outcomes-based signals such as the compute efficiency of any chips they design, the performance of any new models they train, whether their scientific hypotheses are borne out by experiments, how much money they make for Magma per day, etc.

^{^}

It’d be reasonable to call this model “an AGI,” whether or not it’s able to do literally every possible task as well as humans; however, I’ll generally be avoiding that terminology in this post.

^{^}

Note that it may do this by first quickly building more powerful and more specialized AI systems who then develop these technologies.

^{^}

The exact timescale depends on many technical and economic factors, and is the subject of ongoing investigation at Open Philanthropy. Overall we expect that (absent deliberate intervention to slow things down) the time between “millions of copies of Alex are deployed” and “galaxy-scale civilization is feasible” is more likely to be on the order of 2-5 years rather than 10+ years.

^{^}

The most important possible exception is that if the world contains a number of models almost as intelligent as Alex at the time that Alex is developed, we may be able to use those models to help supervise Alex, and they may do a better job than humans would. However, I think it’s far from straightforward to translate this basic idea into a viable strategy for ensuring that Alex doesn’t play the training game, and (given that this strategy relies on quite capable models) it’s very unclear how much time there will be to try to pull it off before models get too capable to easily control.

^{^}

In reality, I expect it would have multiple input and output channels which use special format(s) more optimized for machine consumption, rather than having just one input channel which is directly analogous to human vision and one output channel which is directly analogous to human typing -- those are probably not the most efficient way for a neural network to interact with external computers. Going forward, I’ll simply refer to inputs as observations and outputs as actions, though I’ll use the shorthand of “screen image” and “keystroke” in the diagrams.

^{^}

Or mouse movement / click.

^{^}

Like today’s neural networks, Alex is trained using gradient descent. Roughly speaking, gradient descent repeatedly modifies an ML model to perform better and better at some task(s). Researchers feed Alex an instance of a task that they want it to perform well on, Alex produces some output in response, and gradient descent is used to slightly perturb the model so it performs a bit better on that example. This is repeated many times until Alex is very good on average at the kinds of tasks it saw during training. See here for a more technical explanation.
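As a toy illustration of the loop described above, the sketch below fits a two-parameter linear model with per-example gradient descent. Everything in it is an assumption made for illustration (the function name `train_linear`, the squared-error loss, the learning rate and the data); it is nowhere near the scale or architecture of a model like Alex, but the feed, respond, perturb cycle is the same in spirit.

```python
def train_linear(examples, steps=1000, lr=0.05):
    """Fit y = w * x + b by gradient descent on squared error.

    examples: list of (x, y) pairs standing in for "instances of a task".
    """
    w, b = 0.0, 0.0
    for step in range(steps):
        x, y = examples[step % len(examples)]  # feed the model one instance
        pred = w * x + b                       # the model produces an output
        err = pred - y                         # how far off was it?
        # Gradient of 0.5 * err**2 is err * x for w and err for b;
        # nudge each parameter slightly in the direction that reduces the error.
        w -= lr * err * x
        b -= lr * err
    return w, b


# Hypothetical training data drawn from the target rule y = 2x + 1.
examples = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = train_linear(examples)
print(w, b)
```

Each individual update is small, and it is only the repetition over many examples that brings the parameters close to the rule that generated the data.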

^{^}

This tends to make it easier to train Alex to do more useful tasks down the road, so is often called “pre-training.”

^{^}

If we imagine observations are images of screens and actions are keystrokes, the training dataset for this might come from screen captures and keylogs on Magma employees’ laptops.

^{^}

As above, different models may be fine-tuned on different mixtures of these tasks in order to specialize into different niches.

^{^}

In this appendix, I discuss a bit more why I think this strategy is a “baseline” and what some non-baseline strategies may look like.

^{^}

In terms of time, computation, and/or other metrics like “amount of human help used” or “number of lines in the proof.”

^{^}

This is closely related to the concept of deceptive alignment, as introduced in Hubinger et al 2019.

^{^}

Using more intelligent and thoughtful human reviewers would fix this particular problem, but not fix the general problem that Alex would be incentivized to play on their biases. See this appendix for more discussion.

^{^}

Magma may itself have stakeholders it needs to please, which introduces more bias. For example, if one thing Alex is supposed to do is interact with customers, then it may need to be trained not to be offensive to those customers.

^{^}

An important family of training strategies involves using different copies of Alex (or different “heads” on the same model) each with incentives to point out lying and manipulation done by the other. There are a large number of possible approaches like this, but they all seem to require significant caution if we want to be sure they’ll work.

^{^}

The number of copies of Alex that can be run immediately will depend on how much computation it took to train it -- the more computation it took to train Alex, the more chips are available to run it once it’s trained. This multiple is likely to be large because training a model generally requires running it for a huge number of subjective years.

^{^}

For example, it could be given the ability to spend money to make trades, investments, or hires if that’s important for quickly increasing Magma’s capital; it could be given the ability to remotely operate machinery if that’s important for robotics research; etc.

^{^}

In “softer takeoff” scenarios (which I consider more plausible overall), Magma will have already been using the somewhat-less-powerful predecessors of Alex to automate various parts of R&D before this point, meaning that the pace of research and innovation would already be going quickly by the time a model of Alex’s level of capability is deployed. This makes analysis more complicated, but on net I’m not convinced that it makes the situation significantly safer.

^{^}

Currently, the ratio of computation to memory in GPUs and TPUs is much higher than in brains. The A100 GPU (state-of-the-art as of March 2021) performs ~1.5e14 FLOP/s (in floating point 16 operations). Joe Carlsmith’s central estimate for the amount of computation it would take to match the human brain is ~1e15 FLOP/s, implying that ~10 A100 GPUs could match the total computation of a human brain. However, the human brain seems to store ~1e14 bytes of memory (taking a synapse to be equivalent to a byte and assuming that synapses dominate the memory in the brain), while the A100 only has a storage capacity of ~8e10 bytes. That implies it would take ~1,000 A100 GPUs to store a neural network whose parameters contained as much information as the human brain. That system of 1,000 GPUs would then have enough computation to use each parameter ~100 times more often per second than the human brain uses each synapse. I expect computation intensity would continue to be significantly higher in ML chips than in biological brains, allowing for this speedup.
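The arithmetic in this footnote can be checked with a few lines. The four input figures are the ones quoted above; the variable names are illustrative, and the results are only meant to be read to the nearest order of magnitude, which is how the footnote rounds them.

```python
# Figures quoted in the footnote above.
A100_FLOPS = 1.5e14   # FLOP/s per A100 (16-bit floating point)
BRAIN_FLOPS = 1e15    # central estimate for matching the human brain
BRAIN_BYTES = 1e14    # memory, taking one synapse as roughly one byte
A100_BYTES = 8e10     # storage capacity per A100

# GPUs needed to match the brain's computation: about 6.7, i.e. "~10".
gpus_for_compute = BRAIN_FLOPS / A100_FLOPS

# GPUs needed to hold as many bytes as the brain: 1250, i.e. "~1,000".
gpus_for_memory = BRAIN_BYTES / A100_BYTES

# A cluster sized for memory has far more computation than the brain,
# so each parameter can be used many more times per second than a synapse.
cluster_flops = gpus_for_memory * A100_FLOPS
speedup = cluster_flops / BRAIN_FLOPS

print(gpus_for_compute, gpus_for_memory, speedup)
```

With the unrounded GPU count the ratio comes out somewhat above 100 (and at 150 if the cluster is rounded down to 1,000 GPUs), which agrees with the "~100 times" figure at the order-of-magnitude level.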

^{^}

In this case, Alex continues to be trained in the deployment setting, but the training signals are systematically different from the training signals in the lab setting due to humans’ decreased understanding and control.

^{^}

In deployment (more so than in the lab), it’s not exactly clear what “maximizing reward” means precisely. But there will still be some reward signals coming into the various copies of Alex, and any plausible notion of “maximizing” those signals would be bad for humans.

^{^}

Or video, or written feedback, etc that is used to ultimately calculate reward.

^{^}

Or writing positive things, or appearing happy in front of a camera, etc.

^{^}

This feels more realistic if you consider that Alex itself would have written almost all of the code for the datacenter, and designed almost all of the hardware.

^{^}

That is, unless gradient descent for some reason “got stuck” at the benevolent strategy and was not able to push Alex toward the more effective strategy; I discuss this briefly here.

^{^}

This is similar to the argument made in Ilyas et al 2019 (Adversarial examples are not bugs, they are features).

^{^}

They could also respond by attempting to shut down Alex entirely, but this could be an extremely costly choice in the face of competitive pressure -- assuming they do not have an arbitrary lead on other AI companies, refraining from deploying Alex indefinitely is likely to simply mean that somebody else deploys a similar system (perhaps even a more dangerous one). Holden Karnofsky discusses how to balance “improving the safety of one’s own models” against “ensuring that other actors do not deploy even-more-unsafe models” in his forthcoming posts about how we could address the issue I lay out in this post.

^{^}

See Treacherous turns in the wild (Luke Muehlhauser), Objective robustness in deep reinforcement learning (Koch et al), The effects of reward misspecification (Pan et al), Specification gaming examples in AI (DeepMind Safety blog post which links to an aggregated spreadsheet), The surprising creativity of digital evolution (Lehman et al), this gwern blog post, and more. Many of these lists overlap with each other.

^{^}

We could certainly try including artificial scenarios designed to appear as if models could successfully take over from humans, but it seems very hard to be sure that models will be fooled by that and couldn’t tell it apart from a genuine opportunity to take over. For example, models could look for signs that are highly likely to happen eventually but would be too expensive for humans to generate as a training example. As a simple example, Paul Christiano suggested “someone factors an extremely large number” (which would be too expensive to factor just to include as part of the training distribution, but would nonetheless likely get factored eventually).
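The factoring example relies on an asymmetry: producing the factors of a very large number is expensive, while checking a claimed factorisation is a single multiplication. A minimal sketch of the cheap side follows. The function name is hypothetical, the two primes are small stand-ins for "extremely large", and the check deliberately does not test whether the factors are themselves prime.

```python
from typing import Sequence


def is_valid_factorisation(n: int, factors: Sequence[int]) -> bool:
    """Check a claimed non-trivial factorisation of n by multiplying it out."""
    if len(factors) < 2 or any(f <= 1 for f in factors):
        return False  # reject trivial claims such as [1, n] or [n]
    product = 1
    for f in factors:
        product *= f
    return product == n


# Two known Mersenne primes, used here only as convenient large-ish factors.
p = 2**61 - 1
q = 2**89 - 1
n = p * q

print(is_valid_factorisation(n, [p, q]))      # a correct claim
print(is_valid_factorisation(n, [p, q + 2]))  # an incorrect claim
```

A model that encounters such a number alongside its factors could confirm the claim almost for free, whereas generating a comparable example as a training input would require actually doing the factoring.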

^{^}

For example, stronger arguments that benign generalizations are especially “natural” for gradient descent, enough to make up for the fact that playing the training game would get higher reward; stronger arguments that Alex would “generalize habits” rather than “generalize goals” from training to deployment.

^{^}

Say the mice were dropped off on an earlier spaceship.

^{^}

Thanks to Buck Shlegeris for suggesting this analogy.

^{^}

For one thing, our mouse-brain-sized models have a different and more human profile of abilities than actual mice (e.g. they can talk).

^{^}

(who either deliver reward directly or write code that delivers reward)

^{^}

Humans would consider egregious reward-grabbing acts to be more blatant and scary violations of expectations than softer ways of playing to human biases, so if Alex is later caught it is likely to receive a large negative reward and/or trigger a larger retraining project. I discuss below why I think that isn’t likely to be sufficient to change its motives to grab reward.

^{^}

See Turning reflection up to 11 for a similar proposal.

^{^}

For example, Eric Jang, citing Connor Leahy on Twitter, writes: “Just asking the AI to be nice sounds flippant, but after seeing DALL-E and other large-scale multi-modal models that seem to generalize better as they get bigger, I think we should take these simple, borderline-naive ideas more seriously.”

^{^}

Note that Alex is likely to be motivated to maximize the “final recorded reward” even if it’s ultimately interested in pursuing some other goal. For example, if Alex is trying to have some kind of lasting effect on the world that extends beyond the current episode (e.g. “discover all the secrets of the universe”), it is probably much more likely to accomplish that goal if the future contains more models like it -- which in turn is more likely to happen if it gets a very high reward.

^{^}

Humans find lying difficult and unnatural due to our specific evolutionary history. Alex’s design and training wouldn’t necessarily replicate those kinds of evolutionary pressures. Still, it seems plausible that it would, especially since we’d be trying to give it feedback to encourage honesty.

^{^}

On a philosophical level, it’s not exactly clear what it means to talk about what actions taken in deployment “would have” gotten a high reward. But on a practical level, Alex is likely to continue to receive rewards on the deployment distribution, particularly if it’s not acting the way researchers want it to act, so the meaning of “reward on the deployment distribution” is fairly well-pinned-down.

^{^}

The high-level picture doesn’t change if we imagine it taking in an attention window of a number of observations instead.

This prediction probably won’t be in the form of separately predicting each pixel of the next screen -- it’s more likely to be something more complicated and abstract, e.g. an English-language description of the next screen.

That said, other known sequence-processing architectures, such as transformers, are also currently limited in how long they can “remember” things (albeit sometimes for different structural reasons).

This may also reduce the risk that power-seeking misalignment emerges.

	Lab setting	Deployment setting
Basic situation	Alex is trained with human feedback on diverse tasks designed to increase its capabilities and intelligence; it’s not given direct access to the internet or “real world,” so the scope of its actions is limited.	Many copies of Alex are given the ability to interact with the real world; they quickly advance the scientific/technological frontier; humans still deliver rewards based on results but their understanding of the world drops and their control over Alex loosens.
Sources of rewards	Performing well on training tasks (e.g. winning games or proving theorems), being helpful and useful to humans, carefully crafting an image of friendliness and trustworthiness, telling humans what they want to hear, manipulating humans into thinking it did a better job than it really did, etc.	Delivering good outcomes (e.g. profits or discoveries), arranging events or changing records to make it seem that outcomes are good, threatening or coercing humans into delivering higher reward, directly accessing and editing the reward and preventing that from being overwritten later, etc.
Alex’s strategy	Performs well on all tasks -- including appearing very safe/aligned and passing all safety tests -- but takes opportunities to lie and manipulate whenever that would increase reward.	Continues the strategy from the lab setting while amassing power, and later launches a takeover attempt when it expects to succeed. After a successful takeover, either “maximizes reward forever” or pursues other more complex “interests.”

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Premises of the hypothetical situation

Basic setup: an AI company trains a “scientist model” very soon

“Racing forward” assumption: Magma tries to train the most powerful model it can

“HFDT scales far” assumption: Alex is trained to achieve excellent performance on a wide range of difficult tasks

Why do I call Alex’s training strategy “baseline” HFDT?

What are some training strategies that would not fall under baseline HFDT?

“Naive safety effort” assumption: Alex is trained to be “behaviorally safe”

Key properties of Alex: it is a generally-competent creative planner

Alex builds robust, very-broadly-applicable skills and understanding of the world

Alex learns to make creative, unexpected plans to achieve open-ended goals

How the hypothetical situation progresses (from the above premises)

Alex would understand its training process very well (including human psychology)

A spectrum of situational awareness

Why I think Alex would have very high situational awareness

While humans are in control, Alex would be incentivized to “play the training game”

Naive “behavioral safety” interventions wouldn’t eliminate this incentive

Maybe inductive bias or path dependence favors honest strategies?

As humans’ control fades, Alex would be motivated to take over

Deploying Alex would lead to a rapid loss of human control

In this new regime, maximizing reward would likely involve seizing control

Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control

Giving negative rewards to “warning signs” would likely select for patience

Why this simplified scenario is worth thinking about

Acknowledgements

Appendices

What would change my mind about the path of least resistance?

“Security holes” may also select against straightforward honesty

Simple “baseline” behavioral safety interventions

Using higher-quality feedback and extrapolating feedback quality

Using prompt engineering to emulate more thoughtful judgments

Requiring Alex to provide justification for its actions

Making the training distribution more diverse

Adversarial training to incentivize Alex to act conservatively

“Training out” bad behavior

“Non-baseline” interventions that might help more

Examining arguments that gradient descent favors being nice over playing the training game

Maybe telling the truth is more “natural” than lying?

Maybe path dependence means Alex internalizes moral lessons early?

Maybe gradient descent simply generalizes “surprisingly well”?

A possible architecture for Alex

Plausible high-level features of a good architecture