Hide table of contents

There are many things I feel like the post authors miss, and I want to share a few thoughts that seem good to communicate.

I'm going to focus on controlling superintelligent AI systems: systems powerful enough to solve alignment (in the CEV sense) completely, or to kill everyone on the planet. 

In this post, I'm going to ignore other AI-related sources of x-risk, such as AI-enabled bioterrorism, and I'm not commenting on everything that seems important to comment on.

I'm also not going to point at all the slippery claims that I think can make the reader generalize incorrectly, as it'd be nitpicky and also not worth the time (examples of what I'd skip- I couldn't find evidence that GPT-4 has undergone any supervised fine-tuning; RLHF shapes chatbots' brains into the kind of systems that produce outputs that make human graders click on thumbs-up/"I prefer this text", smart systems that do that are not themselves necessarily "preferred" by human graders; one footnote[1]).


many people are worried that we will lose control of artificial intelligence, leading to human extinction or a similarly catastrophic “AI takeover.” We hope the arguments in this essay make such an outcome seem implausible. But even if future AI turns out to be less “controllable” in a strict sense of the word— simply because, for example, it thinks faster than humans can directly supervise— we also argue it will be easy to instill our values into an AI, a process called “alignment.” 

This misrepresents the worry. Saying "but even if" makes it look like: people worrying about x-risk place credence on "loss of control leads to x-risk no matter/despite alignment"; and these people wrong, as the post shows "this outcome" to be implausible; and, separately, that even if they're right about loss of control, they're wrong about x-risk, as it'll be fine because of alignment.

But mostly, people (including the leading voices) are worried specifically about capable misaligned systems leading to human extinction. I don't know anyone in the community who'd say it's a bad thing that leads to extinction if a CEV-aligned superintelligence grabs control.

Since each generation of controllable AIs can help control the next generation, it looks like this process can continue indefinitely, even to very high levels of capability

I expect it to be easy to reward-shape AIs below a certain level[2] of capability, and I worry about controlling AIs above that level. I believe you need a superhumanly capable system to design and oversee a superhumanly capable system so that it doesn't kill everyone. The current ability of subhuman systems to oversee other subhuman systems such that these systems don't kill everyone is something that I predicted, and that doesn't provide a lot of evidence for subhuman systems being able to oversee superhuman systems.[3]

To solve the problem of aligning superhuman systems, you need some amount of complicated human thought/hard high-level work. If a system can output that much hard high-level work in a short amount of time, I consider this system to be superhuman, and the problem of aligning it to be "alignment-complete" in the sense that if you solve any of the problems in this class, you essentially solve alignment down the line and probably avoid x-risk, but solving any of these problems requires a lot of hard human work, and safely automating so much the hard work is an alignment-complete problem.

There needs to be an argument for why one can successfully use a subhuman system to control a complicated superhuman system, as otherwise, having generations of controllable subhuman systems doesn't matter.


Let's talk about the goals specific neural networks will be pursuing.

Many “alignment problems” we routinely solve, like raising children or training pets, seem much harder than training a friendly AI

Note that evolution has had "white-box" access to our architecture, optimising us for inclusive genetic fitness, and getting something that optimizes for similar collections of things. Consider that humans are so alignable because of that. Children are already wired to easily want chocolate, politics, and cooperation; if instead you get an alien child wired to associate goodness with eating children or sorting pebbles, giving this child rewards can make them learn your language, but won't necessarily make them not want to eat children or sort pebbles.

If you have a child, you don't need to specify, in math, everything that you value: they're probably not going to be super-smart about causing you to give them a reward, and they're already wired to want stuff that's similar to the kinds of things you want.

When you create AI, you do need to have a target of optimisation: what you hope the AI is going to try to do, a utility function safe to optimize for even with superintelligent optimization power. We don't know how to safely specify a target like that.

And then, even if you somehow design a target like that, you need to somehow find an AI that actually tries to achieve that target, and not something else, the process of achieving which was correlated with the target during training. 

the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance

I'm not sure what their assumptions are around the inner alignment problem. This is false: we expect that a smart AI with a wide range of goals can perform well on a wide range of reward functions that can be used, and gradient descent won't optimize the terminal goals that AI is actually trying to pursue. 

I fully expect gradient descent to successfully optimize artificial neural networks to achieve low loss; I just don't expect the loss function they can design to represent what we value, and I expect gradient descent to find neural networks that try to achieve something different from what was specified in the reward function.

If gradient descent finds an agent that tries to maximize something completely unrelated to humanity, and understands that for this, it needs to achieve a high score on our function, the agent will successfully achieve a high score. Gradient descent will optimize its ability to achieve a high score on our function - it will optimize the structure that makes up the agent - but won't really care about the goal contents of the current structure. If after training is finished, this structure optimizes for anything weird about the future of the universe and plans to kill us, this doesn't retroactively make the gradient change it- there is no known way for us to specify a loss function that trains away parameters that in the future might plan to kill us.


Being able to conduct experiments doesn't mean we can get demonstrations of all potential problems in advance. If the AI is smart enough and already wants something different enough from what we want, and we don't understand its cognitive architecture, we're not going to be able to trick it into believing its simulated environment is the real world where it can finally take over. Simply having read/write access to the weights and activations doesn't allow us to control what AI thinks about[4]. Techniques to shape the behaviour of subhuman systems aren't going to let us keep control of smarter systems.

Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control

Yes, but this is not an argument of the x-risk community.

Crucially, this doesn’t mean humans are “aligned to evolution”— see Evolution provides no evidence for the sharp left turn by Quintin Pope for a debunking of that analogy.

AFAIK, Nate Soares wouldn't claim that humans are aligned with evolution. Unfortunately, the authors of this or the linked post don't mechanistically understand the dynamics of the sharp left turn.

"AI control research is easier"

research on improving AI controllability is much easier than research on improving human controllability, so we should expect AIs to get more controllable faster than humans

(I'm going to assume both control and alignment are meant by "control".) Ways it's easier to test AI control techniques than human control techniques are listed. Valid for subhuman systems but isn't relevant or inapplicable to superhuman systems, as:

  • We don't know how to systematically suggest techniques that could realistically work to control superintelligent AI systems;
  • We don't know how to see what a superintelligent AI would do when it can verify it's in the real world; if it knows it's not, we don't know whether the technique actually allows us to control it in the real world;
  • Many at the major labs and outside propose making superintelligent systems that aren't single models, but instead a combination of many models working in a complicated system, where they oversee and report each other; if all compute for training a model is now being used to run copies of the model to make up a superintelligent system, the cost and scalability consideration don't really apply, as you have only a single expensive system.

"Values are easy to learn"

If an AI learns morality first, it will want to help us ensure it stays moral as it gets more powerful

True if the AI is smart and coherent enough to be able to do that. But if it's not yet a CEV-aligned superintelligence, having learnt what humans want doesn't incentivise gradient descent to change it in ways that move it towards CEV-aligned superintelligence. I expect understanding human values to, indeed, be easy for a smart AI, and to make it easier to play along; but it doesn't automatically make human values an optimisation target. Knowing what humans want doesn't make AI care unless you solve the problem of making AI care.

The behaviour of subhuman models that seems "aligned" corresponds to a messy collection of stuff that kind of optimises for what humans give rewards for; but every time gradient descent make the model grok more general optimisation/agency, the fuzzy thing that a messy collection of stuff had been optimised for is not going to influence the goal content of the new architecture gradient descent installs. There isn't a reason for gradient descent to preserve the goals and values of algorithms implemented by the neural network in the past: new, smarter AI algorithms implemented by the neural network can achieve a high score with a wider range of possible goals and values.

Since values are shared and understood by almost everyone in a society, they cannot be very complex. Unlike science and technology, where division of labor enables the accumulation of ever more complex knowledge, values must remain simple enough to be learned by children within a few years.

I'd guess the description of human values is probably shorter than a gigabyte of information or something; AI can learn what they are; but they're not simple enough for us to easily specify them as an optimization target- see The Hidden Complexity of Wishes.

current language models are already very capable of morally evaluating complex actions that a superintelligence might be capable of

They're capable of evaluating the consequences presented to them- but not more capable than humans. That said, 

  • subhuman LLMs won't be capable of evaluating plans generated by superhuman AIs, just like humans wouldn't be able to, as seeing all consequences of actions requires intelligence and not just understanding what humans would say;
  • there are failure modes I'd expect to be obvious to the authors of the post. I invite the reader to think about what happens if we automatically evaluate plans generated by superhuman AIs using current LLMs and then launch plans that our current LLMs look at and say, "this looks good". 


There are many reasons to expect that AIs will be easy to control and easy to align with human values

Unfortunately, in this post, I have not seen evidence that superintelligent AIs will be easy to control or align with human values. If a neural network implements a superhuman AI agent that wants something different from what we want, the post has not presented any evidence for thinking we'd be able to keep control over the future despite the impact of what this agent does, or to change it to implement a superhuman AI agent aligned with human values in the CEV sense, or even just to notice that something is wrong with the agent until it's too late.

While we directly optimize the weights of our AI systems to get rewards, and changes in human brains in response to rewards are less clear and transparent, we do not know how to use it to make a superintelligent AI want something we'd wish it wanted.

  1. ^

    Future AIs will exhibit emotions and desires in ways that deserve serious ethical consideration

    By default, superhuman AI systems that wipe out humanity won't have emotions. They're going to be extremely good optimizers. But seems important to note that if we succeed at not dying in the next 20 years from extremely good optimizers, I'd want us to build AI systems with emotions only intentionally and after understanding how to design new minds. See Nonsentient Optimizers and Can't Unbirth a Child.

  2. ^

    I focus on generally subhuman vs. generally superhuman systems, as this seems like a relevant distinction while being simpler to focus on, even though it's loses some nuance. It seems that with inference being cheaper than training, once you trained a human-level system, you can immediately run many copies of it, which can together make up a superhuman system (smart enough to solve alignment in a relatively short amount of time if it wanted to, and also capable enough to kill everyone). Many copies of sub-human systems, put together, won't be able to solve alignment, or any problems requiring a lot of best human cognition. So, I imagine a fuzzy threshold around the human level and focus on it in this post.

  3. ^

    There's also an open, more general problem, that I don't discuss here, of weaker systems steering stronger systems (not getting gamed and preserving preferences). We don't know how to do that. 

  4. ^

    And unfortunately, we don't know what each of the weights represents, and we don't have much transparency into the algorithms they implement; we don't understand the thought process and we wouldn't know how to influence it in a way that'd work despite various internal optimization pressures





More posts like this

Sorted by Click to highlight new comments since: Today at 3:23 AM

Executive summary: The post argues controlling AI systems will be easy but misses key issues around aligning superintelligent systems.

Key points:

  1. The post misrepresents concerns about AI safety as just loss of control, while the core issue is misalignment.
  2. Evidence of controlling subhuman systems doesn't readily transfer to controlling superhuman AI.
  3. Optimization techniques shape behavior but don't necessarily instill human values as a goal.
  4. Techniques for manipulating subhuman systems likely won't work on superintelligent systems.
  5. Learning human values doesn't automatically make an AI adopt them as optimisation targets.
  6. Current language models can't evaluate plans of superintelligent systems.



This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Thanks for the article! Even though I have a few disagreements, let me start off by saying that I really, sincerely hope you are right. If you're right, and I'm wrong, the world is better for everyone. 

With that said, I think you staking your argument on two key asumptions: 

  1. We would have any substantial insight into the workings of a superintelligence
  2. The burden of proof is on ASI skeptics to prove x-risk exists, rather than on those who claim alignment of an ASI is possible 

I'd have to challenge both of these assumptions. 

First of all, we could have no direct insight into -- let alone control over -- the mental processes of any being more intelligent than us. We could make hypotheses based on our own best interpretations of evidence available to us, but make no mistake that an actually existing ASI (aligned or otherwise) is a black box. And because of this, we could never be certain that it didn't pose an x-risk to us. For example, a below-human-intelligence* AGI could genuinely possess what you've termed CEV values, but it could drift away from that as it increases in intelligence with no human awareness of this. Even if its neural circuits are rewired, it can store multiple copies of its plan in various places to avoid such a goal interruption, and if it is more intelligent than us, it's safe to assume it has already done this. Such an AGI would be indistinguishable from an aligned model, but it would factually be misaligned. Humans just aren't intuitively good at eliminating specific thoughts in the same way we could find, for example, a bad line of code in a traditional program. Therefore, I kind of have to conclude that the only thing that could reliably align an AGI is an equally intelligent AGI--and if this intelligence level is superhuman, the actions of both systems (the misaligned and the aligner) would be an absolute black box to us. I'm sorry, but this just doesn't seem safe to me. (epistemic status -- ~80%)

Secondly, the burden of proof rests with the person making a positive claim, with skepticism being the default position. It seems more than rational to assume the existence of a superintelligent artificial system anywhere in our lightcone poses some x-risk to us, which means the assumption of x-risk with ASI is the default position. Therefore, the burden lies with pro-AGI advocates such as yourself to demonstrate, with hard data, that alignment is not only possible but the most likely outcome. (epistemic status -- ~90%)

I really hope this doesn't come off as too harsh, that's really not my intent at all! 

*I don't refer to any potentially conscious system as "subhuman," all sentient beings (human / animal, or artificial) are intrinsically valuable

pro-AGI advocates such as yourself

I am somewhat confused by your comment. Are you replying to my EA Forum post, the linked article I was replying to, or both? I am certainly not a pro-AGI advocate in the sense you seem to imply: while I think we're ought to create AGI eventually, after there's a scientific consensus it'd be safe to, I'm certainly not suggesting to do this now. I'm the author of moratorium.ai, a resource advocating for an AI moratorium.

I am not making an assumption that we'd have any substantial insight into the workings of a superintelligence. The way we develop AI now, we don't know or understand what the billions-trillions of numbers that make them up represent, and don't have a way to extract and understand the cognitive architecture that runs on these numbers.

In my model, the default way we develop ASI is via deep learning, and I expect us to not understand that ASI and die shortly afterwards (~80%, and the 20% comes primarily from international governance delaying ASI until we solve all the related problems).

I'd certainly hope we'd instead develop the architecture of ASI manually, thoroughly designing and understanding all its inner workings, not doing any gradient descent; unfortunately, at the moment, this doesn't seem realistic (although I believe possible in principle).

I have to say that I can imagine how it'd be possible, in principle, to make a safe AGI with deep learning. It'd require a research direction like Infra-Bayesianism to produce results and insights that could be used as constrains for a training run; and it'd require many other research results to avoid inner misalignment; but I think it's not literally impossible to be relatively confident in safety of an AGI trained with deep learning. This is not something I expect to happen, at all, but it seems theoretically possible.

You raise an important problem of stability under reflection. I don't think it makes much sense to talk about a subhuman AGI performing CEV (it is a procedure that requires being more capable than humans). But designing a coherent AI system in a way that safely maximises humanity's CEV, without drifting away from it even as it increases its capabilities, is indeed a complicated problem. I expect, again, that it is solvable in principle, even though I wouldn't be surprised at all if it takes generations of thousands to millions of scientists to solve.

Even if its neural circuits are rewired, it can store multiple copies of its plan in various places to avoid such a goal interruption, and if it is more intelligent than us, it's safe to assume it has already done this

I am not sure what you mean by that, as we don't program these systems and they themselves can't really rewrite their source code, and storing something internally requires something like gradient hacking; in any case, while generally I expect alignment to not be preserved as a system's capabilities increase, there are systems that we could, theoretically, get to (although I don't expect us to), that would care about CEV in a way that doesn't change as they increase their capabilities. 

Humans just aren't intuitively good at eliminating specific thoughts in the same way we could find, for example, a bad line of code in a traditional program. Therefore, I kind of have to conclude that the only thing that could reliably align an AGI is an equally intelligent AGI--and if this intelligence level is superhuman, the actions of both systems (the misaligned and the aligner) would be an absolute black box to us.

Humanity is capable of producing complicated systems. We usually need to understand laws that govern them first, but we didn't need rockets capable of getting to the Moon to design the first rocket capable of getting to the Moon. I think the problem of understanding some parts of the space of possible minds is solvable in principle; and it is possible to understand the laws that govern those parts of the space enough to come up with a target - with a mind design that would be safe - and then actually design and launch a corresponding mind, even if it is smarter than humans. Not that I expect any of that to happen within the time constraints.

the burden of proof rests with the person making a positive claim, with skepticism being the default position. It seems more than rational to assume the existence of a superintelligent artificial system anywhere in our lightcone poses some x-risk to us, which means the assumption of x-risk with ASI is the default position

I disagree with that. No previously existing technology wiped out humanity; for anthropic reasons, it's not obvious how much of evidence this is, but most new technologies certainly haven't wiped us out, and it seems like the default outside view to take about a new technology. The claim that ASI is likely to kill everyone is extraordinary and requires good reasons.

I think we have really good reasons to think that, and I summarised some of them on moratorium.ai, in this post, and in my other posts. I think it is actively harmful to do advocacy that focuses on "ASI = x-risk by default, burden of proof is on those who build it", without technical arguments. Policymakers are going to ask Meta about x-risk, and if the policymakers are not already familiar with our technical arguments, they're going to believe whatever Meta is saying, as they seem to know what they're talking about and don't say anything flawed enough for a policymaker not familiar with our arguments to see. To avoid that, we need to explain the technical reasons why ASI is likely to literally kill everyone, in ways that'd, e.g., allow policymakers to call bullshit on what Meta representatives might be saying.

pro-AGI advocates such as yourself to demonstrate, with hard data, that alignment is not only possible but the most likely outcome

Sorry, but I'm confused about who it is addressed to, as I think alignment is extremely unlikely in the current situation. I think the probability of doom conditional on AGI before 2040 is >98%. I think I have good arguments for why it's the case. I think there's ~80% everyone will die, and I really hope I'm wrong. The burden on pro-AGI-ASAP advocates is to not just demonstrate that alignment is likely to someone new, but to refute my arguments and arguments of those I agree with, and establish a scientific consensus that'd agree with them.


I don't refer to any potentially conscious system as "subhuman," all sentient beings (human / animal, or artificial) are intrinsically valuable

By default, ASI is not going to be sentient (in the sense of having qualia), and I wouldn't consider it to be intrinsically valuable. I also hope we won't make sentient AIs until we're fully ready to, see this post's first footnote for some links.

I'm talking about intelligence as I define it here. The quality of being generally subhuman or generally superhuman on this axis seems important.)

I apologize, I must have misunderstood the post you quoted and confused it with your own position. I retract that part of my post.