

  • Rohin Shah – PhD student at the Center for Human-Compatible AI, UC Berkeley
  • Asya Bergal – AI Impacts
  • Robert Long – AI Impacts
  • Sara Haxhia – Independent researcher


We spoke with Rohin Shah on August 6, 2019. Here is a brief summary of that conversation:

  • Before taking into account other researchers’ opinions, Shah guesses an extremely rough ~90% chance that even without any additional intervention from current longtermists, advanced AI systems will not cause human extinction by adversarially optimizing against humans. He gives the following reasons, ordered by how heavily they weigh in his consideration:
    • Gradual development and take-off of AI systems is likely to allow for correcting the AI system online, and AI researchers will in fact correct safety issues rather than hacking around them and redeploying.
      • Shah thinks that institutions developing AI are likely to be careful because human extinction would be just as bad for them as for everyone else.
    • As AI systems get more powerful, they will likely become more interpretable and easier to understand because they will use features that humans also tend to use.
    • Many arguments for AI risk go through an intuition that AI systems can be decomposed into an objective function and a world model, and Shah thinks this isn’t likely to be a good way to model future AI systems.
  • Shah believes that conditional on misaligned AI leading to extinction, it almost certainly goes through deception.
  • Shah very uncertainly guesses that there’s a ~50% chance that we will get AGI within two decades:
    • He gives a ~30% – 40% chance that it will be via essentially current techniques.
    • He gives a ~70% chance that, conditional on the two previous claims, it will be a mesa optimizer.
    • Shah’s model for how we get to AGI soon has the following features:
      • AI will be trained on a huge variety of tasks, addressing the usual difficulty of generalization in ML systems
      • AI will learn the same kinds of useful features that humans have learned.
      • This process of research and training the AI will mimic the ways that evolution produced humans who learn.
      • Gradient descent is simple and inefficient, so in order to do sophisticated learning, the outer optimization algorithm used in training will have to produce a mesa optimizer.
  • Shah is skeptical of more ‘nativist’ theories where human babies are born with a lot of inductive biases, rather than learning almost everything from their experiences in the world.
  • Shah thinks there are several things that could change his beliefs, including:
    • If he learned that evolution actually baked a lot into humans (‘nativism’), he would lengthen the amount of time he thinks there will be before AGI.
    • Information from historical case studies or analyses of AI researchers could change his mind around how the AI community would by default handle problems that arise.
    • Having a better understanding of the disagreements he has with MIRI:
      • Shah believes that slow takeoff is much more likely than fast takeoff.
      • Shah doesn’t believe that any sufficiently powerful AI system will look like an expected utility maximizer.
      • Shah believes less in crisp formalizations of intelligence than MIRI does.
      • Shah has more faith in AI researchers fixing problems as they come up.
      • Shah has less faith than MIRI in our ability to write proofs of the safety of our AI systems.

This transcript has been lightly edited for concision and clarity.


Asya Bergal: We haven’t really planned out how we’re going to talk to people in general, so if any of these questions seem bad or not useful, just give us feedback. I think we’re particularly interested in skepticism arguments, or safe by default style arguments– I wasn’t sure from our conversation whether you partially endorse that, or you just are familiar with the argumentation style and think you could give it well or something like that.

Rohin Shah: I think I partially endorse it.

Asya Bergal: Okay, great. If you can, it would be useful if you gave us the short version of your take on the AI risk argument and the place where you feel you and people who are more convinced of things disagree. Does that make sense?

Robert Long: Just to clarify, maybe for my own… What’s ‘convinced of things’? I’m thinking of the target proposition as something like “it’s extremely high value for people to be doing work that aims to make AGI more safe or beneficial”.

Asya Bergal: Even that statement seems a little imprecise because I think people have differing opinions about what the high value work is. But that seems like approximately the right proposition.

Rohin Shah: Okay. So there are some very obvious ones which are not the ones that I endorse, but things like, do you believe in longtermism? Do you buy into the total view of population ethics? And if your answer is no, and you take a more standard version, you’re going to drastically reduce how much you care about AI safety. But let’s see, the ones that I would endorse-

Robert Long: Maybe we should work on this set of questions. I think this will only come up with people who are into rationalism. I think we’re primarily focused just on empirical sources of disagreement, whereas these would be ethical.

Rohin Shah: Yup.

Robert Long: Which again, you’re completely right to mention these things.

Rohin Shah: So, there’s… okay. The first one I had listed is that continual or gradual or slow takeoff, whatever you want to call it, allows you to correct the AI system online. And also it means that AI systems are likely to fail in not extinction-level ways before they fail in extinction-level ways, and presumably we will learn from that and not just hack around it and fix it and redeploy it. I think I feel fairly confident that there are several people who will disagree with exactly the last thing I said, which is that people won’t just hack around it and deploy it– like fix the surface-level problem and then just redeploy it and hope that everything’s fine.

I am not sure what drives the difference between those intuitions. I think they would point to neural architecture search and things like that as examples of, “Let’s just throw compute at the problem and let the compute figure out a bunch of heuristics that seem to work.” And I would point at, “Look, we noticed that… or, someone noticed that AI systems are not particularly fair and now there’s just a ton of research into fairness.”

And it’s true that we didn’t stop deploying AI systems because of fairness concerns, but I think that is actually just the correct decision from a societal perspective. The benefits from AI systems are in fact– they do in fact outweigh the cons of them not being fair, and so it doesn’t require you to not deploy the AI system while it’s being fixed.

Asya Bergal: That makes sense. I feel like another common thing, which is not just “hack around and fix it”, is that people think that it will fail in ways that we don’t recognize and then we’ll redeploy some bigger cooler version of it that will be deceptively aligned (or whatever the problem is). How do you feel about arguments of that form: that we just won’t realize all the ways in which the thing is bad?

Rohin Shah: So I’m thinking: the AI system tries to deceive us, so I guess the argument would be, we don’t realize that the AI system was trying to deceive us and instead we’re like, “Oh, the AI system just failed because it was off distribution or something.”

It seems strange that we wouldn’t see an AI system deliberately hide information from us. And then we look at this and we’re like, “Why the hell didn’t this information come up? This seems like a clear problem.” And then do some sort of investigation into this.

I suppose it’s possible we wouldn’t be able to tell it’s intentionally doing this because it thinks it could get better reward by doing so. But that doesn’t… I mean, I don’t have a particular argument why that couldn’t happen but it doesn’t feel like…

Asya Bergal: Yeah, to be fair I’m not sure that one is what you should expect… that’s just a thing that I commonly hear.

Rohin Shah: Yes. I also hear that.

Robert Long: I was surprised at your deception comment… You were talking about, “What about scenarios where nothing seems wrong until you reach a certain level?”

Asya Bergal: Right. Sorry, that doesn’t have to be deception. I think maybe I mentioned deception because I feel like I often commonly also see it.

Rohin Shah: I guess if I imagine “How did AI lead to extinction?”, I don’t really imagine a scenario that doesn’t involve deception. And then I claim that conditional on that scenario having happened, I am very surprised by the fact that we did not notice this deception in any earlier scenario that didn’t lead to extinction. And I don’t really get people’s intuitions for why that would be the case. I haven’t tried to figure that one out though.

Sara Haxhia: So do you have no model of how people’s intuitions differ? You can’t see it going wrong aside from if it was deceptively aligned? Why?

Rohin Shah: Oh, I feel like most people have the intuition that conditional on extinction, it happened by the AI deceiving us. [Note: In this interview, Rohin was only considering risks arising because of AI systems that try to optimize for goals that are not our own, not other forms of existential risks from AI.]

Asya Bergal: I think there’s another class of things which is something not necessarily deceiving us, as in it has a model of our goals and intentionally presents us with deceptive output, and just like… it has some notion of utility function and optimizes for that poorly. It doesn’t necessarily have a model of us, it just optimizes the paperclips or something like that, and we didn’t realize before that it is optimizing. I think when I hear deceptive, I think “it has a model of human behavior that is intentionally trying to do things that subvert our expectations”. And I think there’s also a version where it just has goals unaligned with ours and doesn’t spend any resources in modeling our behavior.

Rohin Shah: I think in that scenario, usually as an instrumental goal, you need to deceive humans, because if you don’t have a model of human behavior– if you don’t model the fact that humans are going to interfere with your plans– humans just turn you off and nothing, there’s no extinction.

Robert Long: Because we’d notice. You’re thinking in the non-deception cases, as with the deception cases, in this scenario we’d probably notice.

Sara Haxhia: That clarifies my question. Great.

Rohin Shah: As far as I know, this is an accepted thing among people who think about AI x-risk.

Asya Bergal: The accepted thing is like, “If things go badly, it’s because it’s actually deceiving us on some level”?

Rohin Shah: Yup. There are some other scenarios which could lead to us not being deceived and bad things still happen. These tend to be things like, we build an economy of AI systems and then slowly humans get pushed out of the economy of AI systems and… 

They’re still modeling us. I just can’t really imagine the scenario in which they’re not modeling us. I guess you could imagine one where we slowly cede power to AI systems that are doing things better than we could. And at no point are they actively trying to deceive us, but at some point they’re just like… they’re running the entire economy and we don’t really have much say in it.

And perhaps this could get to a point where we’re like, “Okay, we have lost control of the future and this is effectively an x-risk, but at no point was there really any deception.”

Asya Bergal: Right. I’m happy to move on to other stuff.

Rohin Shah: Cool. Let’s see. What’s the next one I have? All right. This one’s a lot sketchier-

Asya Bergal: So sorry, what is the thing that we’re listing just so-

Rohin Shah: Oh, reasons why AI safety will be fine by default.

Asya Bergal: Right. Gotcha, great.

Rohin Shah: Okay. These two points were both really one point. So then the next one was… I claimed that as AI systems get more powerful, they will become more interpretable and easier to understand, just because they’re using– they will probably be able to get and learn features that humans also tend to use.

I don’t think this has really been debated in the community very much and– sorry, I don’t mean that there’s agreement on it. I think it is just not a hypothesis that has been promoted to attention in the community. And it’s not totally clear what the safety implications are. It suggests that we could understand AI systems more easily and sort of in combination with the previous point it says, “Oh, we’ll notice things– we’ll be more able to notice things than today where we’re like, ‘Here’s this image classifier. Does it do good things? Who the hell knows? We tried it on a bunch of inputs and it seemed like it was doing the right stuff, but who knows what it’s doing inside.'”

Asya Bergal: I’m curious why you think it’s likely to use features that humans tend to use. It’s possible the answer is some intuition that’s hard to describe.

Rohin Shah: Intuition that I hope to describe in a year. Partly it’s that in the very toy straw model, there are just a bunch of features in the world that an AI system can pay attention to in order to make good predictions. When you limit the AI system to make predictions on a very small narrow distribution, which is like all AI systems today, there are lots of features that the AI system can use for that task that we humans don’t use because they’re just not very good for the rest of the distribution.

Asya Bergal: I see. It seems like implicitly in this argument is that when humans are running their own classifiers, they have some like natural optimal set of features that they use for that distribution?

Rohin Shah: I don’t know if I’d say optimal, but yeah. Better than the features that the AI system is using.

Robert Long: In the space of better features, why aren’t they going past us or into some other optimal space of feature world?

Rohin Shah: I think they would eventually.

Robert Long: I see, but they might have to go through ours first?

Rohin Shah: So A) I think they would go through ours, B) I think my intuition is something like the features– and this one seems like more just raw intuition and I don’t really have an argument for it– but the features… things like agency, optimization, want, deception, manipulation seem like things that are useful for modeling the world.

I would be surprised if an AI system went so far beyond that those features didn’t even enter into its calculations. Or, I’d be surprised if that happened very quickly, maybe. I don’t want to make claims about how far past those AI systems could go, but I do think that… I guess I’m also saying that we should be aiming for AI systems that are like… This is a terrible way to operationalize it, but AI systems that are 10x as intelligent as humans, what do we have to do for them? And then once we’ve got AI systems that are 10x smarter than us, then we’re like, “All right, what more problems could arise in the future?” And ask the AI systems to help us with that as well.

Asya Bergal: To clarify, the thing you’re saying is… By the time AI systems are good and more powerful, they will have some conception of the kind of features that humans use, and be able to describe their decisions in terms of those features? Or do you think inherently, there’ll be a point where AI systems use the exact same features that humans use?

Rohin Shah: Not the exact same features, but broadly similar features to the ones that humans use.

Robert Long: Where examples of those features would be like objects, cause, agent, the things that we want interpreted in deep nets but usually can’t.

Rohin Shah: Yes, exactly.

Asya Bergal: Again, so you think in some sense that that’s a natural way to describe things? Or there’s only one path through getting better at describing things, and that has to go through the way that humans describe things? Does that sound right?

Rohin Shah: Yes.

Asya Bergal: Okay. Does that also feel like an intuition?

Rohin Shah: Yes.

Robert Long: Sorry, I think I did a bad interviewer thing where I started listing things, I should have just asked you to list some of the features which I think-

Rohin Shah: Well I listed them, like, optimization, want, motivation before, but I agree causality would be another one. But yeah, I was thinking more the things that safety researchers often talk about. I don’t know, what other features do we tend to use a lot? Object’s a good one… the conception of 3D space is one that I don’t think these classifiers have and that we definitely have.

And the concept of 3D space seems like it’s probably going to be useful for an AI system no matter how smart it gets. Currently, they might have a concept of 3D space, but it’s not obvious that they do. And I wouldn’t be surprised if they don’t.

At some point, I want to take this intuition and run with it and see where it goes. And try to argue for it more.

Robert Long: But I think for the purposes of this interview, I think we do understand how this is something that would make things safe by default. At least, in as much as interpretability conduces to safety. Because we could be able to interpret them and still fuck shit up.

Rohin Shah: Yep. Agreed. Cool.

Sara Haxhia: I guess I’m a little bit confused about how it makes the code more interpretable. I can see how if it uses human brains, we can model it better because we can just say, “These are human things and this means we can make predictions better.” But if you’re looking at a neural net or something, it doesn’t make it more interpretable.

Rohin Shah: If you mean the code, I agree with that.

Sara Haxhia: Okay. So, is this kind of like external, like you being able to model that thing?

Rohin Shah: I think you could look at the… you take a particular input to neural net, you pass it through layers, you see what the activations are. I don’t think if you just look directly at the activations, you’re going to get anything sensible, in the same way that if you look at electrical signals in my brain you’re not going to be able to understand them.

Sara Haxhia: So, is your point that the reason it becomes more interpretable is something more like, you understand its motivations?

Rohin Shah: What I mean is… Are you familiar with Chris Olah’s work?

Sara Haxhia: I’m not.

Rohin Shah: Okay. So Chris Olah does interpretability work with image classifiers. One technique that he uses is: Take a particular neuron in the neural net, say, “I want to maximize the activation of this neuron,” and then do gradient descent on your input image to see what image maximally activates that neuron. And this gives you some insight into what that neuron is detecting. I think things like that will be easier as time goes on.
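
[Note: The neuron-maximization technique described above can be sketched in a few lines. This is a minimal illustration, not Chris Olah’s actual method or code: real feature visualization operates on trained image classifiers and uses additional regularization. The single tanh “neuron”, its weight vector `w`, the starting input `x0`, and the step size are all hypothetical choices made for the example. Maximizing an activation is gradient ascent on the input, which is the same as gradient descent on the negated activation.]

```python
import numpy as np


def neuron_activation(x, w):
    """Activation of a single toy neuron: tanh of a weighted sum of the input."""
    return float(np.tanh(w @ x))


def activation_gradient(x, w):
    """Gradient of tanh(w . x) with respect to the input x (not the weights)."""
    return (1.0 - np.tanh(w @ x) ** 2) * w


def maximize_activation(w, x0, steps=200, lr=0.1):
    """Search for the input that most strongly activates the neuron.

    The weights stay fixed; only the input is updated. After each ascent
    step the input is rescaled to have norm at most 1, a stand-in for the
    bounded pixel range of a real image.
    """
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + lr * activation_gradient(x, w)   # step uphill on the activation
        x = x / max(1.0, np.linalg.norm(x))      # keep the input bounded
    return x
```

Calling `maximize_activation(np.array([0.6, -0.8]), np.array([0.1, 0.1]))` moves the input toward the direction of `w`, which is the toy analogue of “the image that maximally activates that neuron”: inspecting the result shows what pattern the neuron responds to.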

Robert Long: Even if it’s not just that particular technique, right? Just the general task?

Rohin Shah: Yes.

Sara Haxhia: How does that relate to the human values thing? It felt like you were saying something like it’s going to model the world in a similar way to the way we do, and that’s going to make it more interpretable. And I just don’t really see the link.

Rohin Shah: A straw version of this, which isn’t exactly what I mean but sort of is the right intuition, would be like maybe if you run the same… What’s the input that maximizes the output of this neuron? You’ll see that this particular neuron is a deception classifier. It looks at the input and then based on something, does some computation with the input, maybe the input’s like a dialogue between two people and then this neuron is telling you, “Hey, is person A trying to deceive person B right now?” That’s an example of the sort of thing I am imagining.

Asya Bergal: I’m going to do the bad interviewer thing where I put words in your mouth. I think one problem right now is you can go a few layers into a neural network and the first few layers correspond to things you can easily tell… Like, the first layer is clearly looking at all the different pixel values, and maybe the second layer is finding lines or something like that. But then there’s this worry that later on, the neurons will correspond to concepts that we have no human interpretation for, so it won’t even make sense to interpret them. Whereas Rohin is saying, “No, actually the neurons will correspond to, or the architecture will correspond to some human understandable concept that it makes sense to interpret.” Does that seem right?

Rohin Shah: Yeah, that seems right. I am maybe not sure that I tie it necessarily to the architecture, but actually probably I’d have to one day.

Asya Bergal: Definitely, you don’t need to. Yeah.

Rohin Shah: Anyway, I haven’t thought about that enough, but that’s basically that. If you look at current late layers in image classifiers they are often like, “Oh look, this is a detector for lemon tennis balls,” and you’re just like, “That’s a strange concept you’ve got there, neural net, but sure.”

Robert Long: Alright, cool. Next way of being safe?

Rohin Shah: They’re getting more and more sketchy. I have an intuition that… I should rephrase this. I have an intuition that AI systems are not well-modeled as, “Here’s the objective function and here is the world model.” Most of the classic arguments are: Suppose you’ve got an incorrect objective function, and you’ve got this AI system with this really, really good intelligence, which maybe we’ll call it a world model or just general intelligence. And this intelligence can take in any utility function, and optimize it, and you plug in the incorrect utility function, and catastrophe happens.

This does not seem to be the way that current AI systems work. It is the case that you have a reward function, and then you sort of train a policy that optimizes that reward function, but… I explained this the wrong way around. But the policy that’s learned isn’t really… It’s not really performing an optimization that says, “What is going to get me the most reward? Let me do that thing.”

It has been given a bunch of heuristics by gradient descent that tend to correlate well with getting high reward and then it just executes those heuristics. It’s kind of similar to… If any of you are fans of the sequences… Eliezer wrote a sequence on evolution and said… What was it? Humans are not fitness maximizers, they are adaptation executors, something like this. And that is how I view neural nets today that are trained by RL. They don’t really seem like expected utility maximizers the way that it’s usually talked about by MIRI or on LessWrong.

I mostly expect this to continue, I think conditional on AGI being developed soon-ish, like in the next decade or two, with something kind of like current techniques. I think it would be… AGI would be a mesa optimizer or inner optimizer, whichever term you prefer. And that that inner optimizer will just sort of have a mishmash of all of these heuristics that point in a particular direction but can’t really be decomposed into ‘here are the objectives, and here is the intelligence’, in the same way that you can’t really decompose humans very well into ‘here are the objectives and here is the intelligence’.

Robert Long: And why does that lead to better safety?

Rohin Shah: I don’t know that it does, but it leads to not being as confident in the original arguments. It feels like this should be pushing in the direction of ‘it will be easier to correct or modify or change the AI system’. Many of the arguments for risk are ‘if you have a utility maximizer, it has all of these convergent instrumental sub-goals’ and, I don’t know, if I look at humans they kind of sort of pursued convergent instrumental sub-goals, but not really.

You can definitely convince them that they should have different goals. They change the thing they are pursuing reasonably often. Mostly this just reduces my confidence in existing arguments rather than gives me an argument for safety.

Robert Long: It’s like a defeater for AI safety arguments that rely on a clean separation between utility…

Rohin Shah: Yeah, which seems like all of them. All of the most crisp ones. Not all of them. I keep forgetting about the… I keep not taking into account the one where your god-like AI slowly replaces humans and humans lose control of the future. That one still seems totally possible in this world.

Robert Long: If AGI is through current techniques, it’s likely to have systems that don’t have this clean separation.

Rohin Shah: Yep. A separate claim that I would argue for separately– I don’t think they interact very much– is that I would also claim that we will get AGI via essentially current techniques. I don’t know if I should put a timeline on it, but two decades seems plausible. Not saying it’s likely, maybe 50% or something. And that the resulting AGI will look like a mesa optimizer.

Asya Bergal: Yeah. I’d be very curious to delve into why you think that.

Robert Long: Yeah, me too. Let’s just do that because that’s fast. Also your… What do you mean by current techniques, and what’s your credence in that being what happens?

Sara Haxhia: And like what’s your model for how… where is this coming from?

Rohin Shah: So on the meta questions, first, the current techniques would be like deep learning, gradient descent broadly, maybe RL, maybe meta-learning, maybe things sort of like it, but back propagation or something like that is still involved.

I don’t think there’s a clean line here. Something like, we don’t look back and say: That. That was where the ML field just totally did a U-turn and did something else entirely.

Robert Long: Right. Everything that’s involved in the building of the AGI is something you can roughly find in current textbooks or like conference proceedings or something. Maybe combined in new cool ways.

Rohin Shah: Yeah. Maybe, yeah. Yup. And also you throw a bunch of compute at it. That is part of my model. So that was the first one. What is current techniques? Then you asked credence.

Credence in AGI developed in two decades by current-ish techniques… Depends on the definition of current-ish techniques, but something like 30, 40%. Credence that it will be a mesa optimizer, maybe conditional on this being… The previous thing being true, the credence on it being a mesa optimizer, 60, 70%. Yeah, maybe 70%.
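
[Note: The two credences above combine by simple multiplication. The sketch below is an editorial illustration, not a figure Rohin stated: it assumes the 30–40% is his probability of AGI within two decades via current-ish techniques, and the 70% is his probability of a mesa optimizer conditional on that. The function name is hypothetical.]

```python
def joint_credence(p_agi_current_techniques, p_mesa_given_agi):
    """P(AGI via current-ish techniques in two decades AND it is a mesa optimizer).

    Uses the product rule: P(A and B) = P(A) * P(B | A).
    """
    return p_agi_current_techniques * p_mesa_given_agi


# Endpoints of the stated 30-40% range, each combined with the 70% conditional.
low = joint_credence(0.30, 0.70)   # roughly 0.21
high = joint_credence(0.40, 0.70)  # roughly 0.28
```

Under these assumptions, the implied credence in “AGI within two decades, by current-ish techniques, and a mesa optimizer” is roughly 21% to 28%.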

And then the actual model for why this is… it’s sort of related to the previous points about features wherein there are lots and lots of features and humans have settled on the ones that are broadly useful across a wide variety of contexts. I think that in that world, what you want to do to get AGI is train an AI system on a very broad… train an AI system maybe by RL or something else, I don’t know. Probably RL.

On a very large distribution of tasks or a large distribution of something, maybe they’re tasks, maybe they’re not like, I don’t know… Human babies aren’t really training on some particular task. Maybe it’s just a bunch of unsupervised learning. And in doing so over a lot of time and a lot of compute, it will converge on the same sorts of features that humans use.

I think the nice part of this story is that it doesn’t require that you explain how the AI system generalizes– generalization in general is just a very difficult property to get out of ML systems if you want to generalize outside of the training distribution. You mostly don’t require that here because, A) it’s being trained on a very wide variety of tasks and B) it’s sort of mimicking the same sort of procedure that was used to create humans. Where, with humans you’ve also got the sort of… evolution did a lot of optimization in order to create creatures that were able to work effectively in the environment, the environment’s super complicated, especially because there are other creatures that are trying to use the same resources.

And so that’s where you get the wide variety or, the very like broad distribution of things. Okay. What have I not said yet?

Robert Long: That was your model. Are you done with the model of how that sort of thing happens or-

Rohin Shah: I feel like I’ve forgotten aspects, forgotten to say aspects of the model, but maybe I did say all of it.

Robert Long: Well, just to recap: One thing you really want is a generalization, but this is in some sense taken care of because you’re just training on a huge bunch of tasks. Secondly, you’re likely to get them learning useful features. And one-

Rohin Shah: And thirdly, it’s mimicking what evolution did, which is the one example we have of a process that created general intelligence.

Asya Bergal: It feels like implicit in this sort of claim for why it’s soon is that compute will grow sufficiently to accommodate this process, which is similar to evolution. It feels like there’s implicit there, a claim that compute will grow and a claim that however compute will grow, that’s going to be enough to do this thing.

Rohin Shah: Yeah, that’s fair. I think actually I don’t have good reasons for believing that, maybe I should reduce my credences on these a bit, but… That’s basically right. So, it feels like for the first time I’m like, “Wow, I can actually use estimates of human brain computation and it actually makes sense with my model.”

I’m like, “Yeah, existing AI systems seem more expensive to run than the human brain… Sorry, if you compare dollars per hour of human brain equivalent. Hiring a human is what? Maybe we call it $20 an hour or something if we’re talking about relatively simple tasks. And then, I don’t think you could get an equivalent amount of compute for $20 for a while, but maybe I forget what number it came out to, I got to recently. Yeah, actually the compute question feels like a thing I don’t actually know the answer to.

Asya Bergal: A related question– this is just to clarify for me– it feels like maybe the relevant thing to compare to is not the amount of compute it takes to run a human brain, but like-

Rohin Shah: Evolution also matters.

Asya Bergal: Yeah, the amount of compute to get to the human brain or something like that.

Rohin Shah: Yes, I agree with that, that that is a relevant thing. I do think we can be way more efficient than evolution.

Asya Bergal: That sounds right. But it does feel like that’s… that does seem like that’s the right sort of quantity to be looking at? Or does it feel like-

Rohin Shah: For training, yes.

Asya Bergal: I’m curious if it feels like the training is going to be more expensive than the running in your model.

Rohin Shah: I think the… It’s a good question. It feels like we will need a bunch of experimentation, figuring out how to build essentially the equivalent of the human brain. And I don’t know how expensive that process will be, but I don’t think it has to be a single program that you run. I think it can be like… The research process itself is part of that.

At some point I think we build a system that is initially trained by gradient descent, and then the training by gradient descent is comparable to humans going out in the world and acting and learning based on that. A pretty big uncertainty here is: How much has evolution put in a bunch of important priors into human brains? Versus how much are human brains actually just learning most things from scratch? Well, scratch or learning from their parents.

People would claim that babies have lots of inductive biases, I don’t know that I buy it. It seems like you can learn a lot with a month of just looking at the world and exploring it, especially when you get way more data than current AI systems get. For one thing, you can just move around in the world and notice that it’s three dimensional.

Another thing is you can actually interact with stuff and see what the response is. So you can get causal intervention data, and that’s probably where causality becomes such an ingrained part of us. So I could imagine that these things that we see as core to human reasoning, things like having a notion of causality or having a notion, I think apparently we’re also supposed to have as babies an intuition about statistics and like counterfactuals and pragmatics.

But all of these are done with brains that have been in the world for a long time, relatively speaking, relative to AI systems. I’m not actually sure if I buy that this is because we have really good priors.

Asya Bergal: I recently heard… Someone was talking to me about an argument that went like: Humans, in addition to having priors, built-ins from evolution and learning things in the same way that neural nets do, learn things through… you go to school and you’re taught certain concepts and algorithms and stuff like that. And that seems distinct from learning things in a gradient descenty way. Does that seem right?

Rohin Shah: I definitely agree with that.

Asya Bergal: I see. And does that seem like a plausible thing that might not be encompassed by some gradient descenty thing?

Rohin Shah: I think the idea there would be, you do the gradient descenty thing for some time. That gets you an AI system that now has inside of it a way to learn. That’s sort of what it means to be a mesa optimizer. And then that mesa optimizer can go and do its own thing to do better learning. And maybe at some point you just say, “To hell with this gradient descent, I’ll turn it off.” Probably humans don’t do that. Maybe humans do that, I don’t know.

Asya Bergal: Right. So you do gradient descent to get to some place. And then from there you can learn in the same way– where you just read articles on the internet or something?

Rohin Shah: Yeah. Oh, another reason that I think this… Another part of my model for why this is more likely– I knew there was more– is that, exactly that point, which is that learning probably requires some more deliberate active process than gradient descent. Gradient descent feels really relatively dumb, not as dumb as evolution, but close. And the only plausible way I’ve seen so far for how that could happen is by mesa optimization. And it also seems to be how it happened with humans. I guess you could imagine the meta-learning system that’s explicitly trying to develop this learning algorithm.

And then… okay, by the definition of mesa optimizers, that would not be a mesa optimizer, it would be an inner optimizer. So maybe it’s an inner optimizer instead if we use-

Asya Bergal: I think I don’t quite understand what it means that learning requires, or that the only way to do learning is through mesa optimization.

Rohin Shah: I can give you a brief explanation of what it means to me in a minute or two. I’m going to go and open my summary because that says it better than I can.

Learned optimization, that’s what it was called. All right. Suppose you’re searching over a space of programs to find one that plays tic-tac-toe well. And initially you find a program that says, “If the board is empty, put something in the center square,” or rather, “If the center square is empty, put something there. If there’s two in a row somewhere of yours, put something to complete it. If your opponent has two in a row somewhere, make sure to block it,” and you learn a bunch of these heuristics. Those are some nice, interpretable heuristics but maybe you’ve got some uninterpretable ones too.

But as you search more and more, eventually someday you stumble upon the minimax algorithm, which just says, “Play out the game all the way until the end. See whether in all possible moves that you could make, and all possible moves your opponent could make, and search for the path where you are guaranteed to win.”

And then you’re like, “Wow, this algorithm, it just always wins. No one can ever beat it. It’s amazing.” And so basically you have this outer optimization loop that was searching over a space of programs, and then it found a program, so one element of the space, that was itself performing optimization, because it was searching through possible moves or possible paths in the game tree to find the actual policy it should play.

And so your outer optimization algorithm found an inner optimization algorithm that is good, or it solves the task well. And the main claim I will make, and I’m not sure if… I don’t think the paper makes it, but the claim I will make is that for many tasks if you’re using gradient descent as your optimizer, because gradient descent is so annoyingly slow and simple and inefficient, the best way to actually achieve the task will be to find a mesa optimizer. So gradient descent finds parameters that themselves take an input, do some sort of optimization, and then figure out an output.
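[Editor's illustration] The minimax procedure described above can be sketched in a few lines of Python. This is only a minimal sketch of the inner optimizer in the tic-tac-toe example, not of the outer search over programs. The board encoding (a 9-character string read row by row), the names `LINES`, `winner`, and `minimax`, and the +1 / 0 / -1 scoring are assumptions made for this example; none of them come from the conversation or from the learned optimization paper.

```python
from functools import lru_cache

# Assumed encoding: the board is a 9-character string, read row by row,
# with 'X', 'O', or ' ' (empty) in each cell.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Search the whole game tree from `board` with `player` to move.

    Returns (value, move), where value is the outcome under best play
    from X's point of view (+1 X wins, 0 draw, -1 O wins) and move is
    the index of a best move (None if the game is already over).
    """
    won = winner(board)
    if won is not None:
        return (1 if won == 'X' else -1), None
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0, None  # board full, nobody won: draw

    other = 'O' if player == 'X' else 'X'
    best_value, best_move = None, None
    for m in moves:
        child = board[:m] + player + board[m + 1:]
        value, _ = minimax(child, other)
        # X picks the highest value, O picks the lowest.
        better = (best_value is None
                  or (player == 'X' and value > best_value)
                  or (player == 'O' and value < best_value))
        if better:
            best_value, best_move = value, m
    return best_value, best_move
```

Calling `minimax(' ' * 9, 'X')` searches every line of play from the empty board. Strictly, minimax guarantees the best achievable outcome rather than a win: in tic-tac-toe that is a draw against a perfect opponent. The point of the example is that the program found by the outer search is itself running a search.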

Asya Bergal: Got you. So I guess part of it is dividing into sub-problems that need to be optimized and then running… Does that seem right?

Rohin Shah: I don’t know that there’s necessarily a division into sub problems, but it’s a specific kind of optimization that’s tailored for the task at hand. Maybe another example would be… I don’t know, that’s a bad example. I think the analogy to humans is one I lean on a lot, where evolution is the outer optimizer and it needs to build things that replicate a bunch.

It turns out having things replicate a bunch is not something you can really get by heuristics. What you need to do is to create humans who can themselves optimize and figure out how to… Well, not replicate a bunch, but do things that are very correlated with replicating a bunch. And that’s how you get very good replicators.

Asya Bergal: So I guess you’re saying… often the gradient descent process will– it turns out that having an optimizer as part of the process is often a good thing. Yeah, that makes sense. I remember them in the mesa optimization stuff.

Rohin Shah: Yeah. So that intuition is one of the reasons I think that… It’s part of my model for why AGI will be a mesa optimizer. Though I do– in the world where we’re not using current ML techniques I’m like, “Oh, anything can happen.”

Asya Bergal: That makes sense. Yeah, I was going to ask about that. Okay. So conditioned on current ML techniques leading to it, it’ll probably go through mesa optimizers?

Rohin Shah: Yeah. I might endorse the claim with much weaker confidence even without current ML techniques, but I’d have to think a lot more about that. There are arguments for why mesa optimization is the thing you want– is the thing that happens– that are separate from deep learning. In fact, the whole paper doesn’t really talk about deep learning very much.

Robert Long: Cool. So that was digging into the model of why and how confident we should be on current technique AGI, prosaic AI I guess people call it? And seems like the major sources of uncertainty there are: does compute actually go up, considerations about evolution and its relation to human intelligence and learning and stuff?

Rohin Shah: Yup. So the Median Group, for example, will agree with most of this analysis… Actually no. The Median Group will agree with some of this analysis but then say, and therefore, AGI is extremely far away, because evolution threw in some horrifying amount of computation and there’s no way we can ever match that.

Asya Bergal: I’m curious if you still have things on your list of like safety by default arguments, I’m curious to go back to that. Maybe you covered them.

Rohin Shah: I think I have covered them. The way I’ve listed this last one is ‘AI systems will be optimizers in the same way that humans are optimizers, not like Eliezer-style EU maximizers’… which is basically what I’ve just been saying.

Sara Haxhia: But it seems like it still feels dangerous… if a human had loads of power, it could do things that… even if they aren’t maximizing some utility.

Rohin Shah: Yeah, I agree, this is not an argument for complete safety. I forget where I was initially going with this point. I think my main point here is that mesa optimizers don’t nice… Oh, right, they don’t nicely factor into utility function and intelligence. And that reduces my credence in existing arguments, and there are still issues which are like, with a mesa optimizer, your capabilities generalize with distributional shift, but your objective doesn’t.

Humans are not really optimizing for reproductive success. And arguably, if someone had wanted to create things that were really good at reproducing, they might have used evolution as a way to do it. And then humans showed up and were like, “Oh, whoops, I guess we’re not doing that anymore.”

I mean, the mesa optimizers paper is a very pessimistic paper. In their view, mesa optimization is a bad thing that leads to danger and that’s… I agree that all of the reasons they point out for mesa optimization being dangerous are in fact reasons that we should be worried about mesa optimization.

I think mostly I see this as… convergent instrumental sub-goals are less likely to be obviously a thing that this pursues. And that just feels more important to me. I don’t really have a strong argument for why that consideration dominates-

Robert Long: The convergent instrumental sub-goals consideration?

Rohin Shah: Yeah.

Asya Bergal: I have a meta credence question, maybe two layers of them. The first being, do you consider yourself optimistic about AI for some random qualitative definition of optimistic? And the follow-up is, what do you think is the credence that by default things go well, without additional intervention by us doing safety research or something like that?

Rohin Shah: I would say relative to AI alignment researchers, I’m optimistic. Relative to the general public or something like that, I might be pessimistic. It’s hard to tell. I don’t know, credence that things go well? That’s a hard one. Intuitively, it feels like 80 to 90%, 90%, maybe. 90 feels like I’m being way too confident and like, “What? You only assign 10%, even though you have literally no… you can’t predict the future and no one can predict the future, why are you trying to do it?” It still does feel more like 90%.

Asya Bergal: I think that’s fine. I guess the follow-up is sort of like, between the sort of things that you gave, which were like: Slow takeoff allows for correcting things, things that are more powerful will be more interpretable, and I think the third one being, AI systems not actually being… I’m curious how much do you feel like your actual belief in this leans on these arguments? Does that make sense?

Rohin Shah: Yeah. I think the slow takeoff one is the biggest one. If I believe that at some point we would build an AI system that within the span of a week was just way smarter than any human, and before that the most powerful AI system was below human level, I’m just like, “Shit, we’re doomed.”

Robert Long: Because there it doesn’t matter if it goes through interpretable features particularly.

Rohin Shah: There I’m like, “Okay, once we get to something that’s super intelligent, it feels like the human ant analogy is basically right.” And unless we… Maybe we could still be fine because people thought about it and put in… Maybe I’m still like, “Oh, AI researchers would have been able to predict that this would’ve happened and so were careful.”

I don’t know, in a world where fast takeoff is true, lots of things are weird about the world, and I don’t really understand the world. So I’m like, “Shit, it’s quite likely something goes wrong.” I think the slow takeoff is definitely a crux. Also, we keep calling it slow takeoff and I want to emphasize that it’s not necessarily slow in calendar time. It’s more like gradual.

Asya Bergal: Right, like ‘enough time for us to correct things’ takeoff.

Rohin Shah: Yeah. And there’s no discontinuity between… you’re not like, “Here’s a 2X human AI,” and a couple of seconds later it’s now… Not a couple of seconds later, but like, “Yeah, we’ve got 2X AI,” for a few months and then suddenly someone deploys a 10,000X human AI. If that happened, I would also be pretty worried.

It’s more like there’s a 2X human AI, then there’s like a 3X human AI and then a 4X human AI. Maybe this happens from the same AI getting better and learning more over time. Maybe it happens from it designing a new AI system that learns faster, but starts out lower and so then overtakes it sort of continuously, stuff like that.

So that I think, yeah, without… I don’t really know what the alternative to it is, but in the one where it’s not human level, and then 10,000X human in a week and it just sort of happened, that I’m like, I don’t know, 70% of doom or something, maybe more. That feels like I’m… I endorse that credence even less than most just because I feel like I don’t know what that world looks like. Whereas on the other ones I at least have a plausible world in my head.

Asya Bergal: Yeah, that makes sense. I think you’ve mentioned, in a slow takeoff scenario that… Some people would disagree that in a world where you notice something was wrong, you wouldn’t just hack around it, and keep going.

Asya Bergal: I have a suggestion which it feels like maybe is a difference and I’m very curious for your take on whether that seems right or seems wrong. It seems like people believe there’s going to be some kind of pressure for performance or competitiveness that pushes people to try to make more powerful AI in spite of safety failures. Does that seem untrue to you or like you’re unsure about it?

Rohin Shah: It seems somewhat untrue to me. I recently made a comment about this on the Alignment Forum. People make this analogy between AI x-risk and risk of nuclear war, on mutually assured destruction. That particular analogy seems off to me because with nuclear war, you need the threat of being able to hurt the other side whereas with AI x-risk, if the destruction happens, that affects you too. So there’s no mutually assured destruction type dynamic.

You could imagine a situation where for some reason the US and China are like, “Whoever gets to AGI first just wins the universe.” And I think in that scenario maybe I’m a bit worried, but even then, it seems like extinction is just worse, and as a result, you get significantly less risky behavior? But I don’t think you get to the point where people are just literally racing ahead with no thought to safety for the sake of winning.

I also don’t think that you would… I don’t think that differences in who gets to AGI first are going to lead to you win the universe or not. I think it leads to pretty continuous changes in power balance between the two.

I also don’t think there’s a discrete point at which you can say, “I’ve won the race.” I think it’s just like capabilities keep improving and you can have more capabilities than the other guy, but at no point can you say, “Now I have won the race.” I suppose if you could get a decisive strategic advantage, then you could do it. And that has nothing to do with what your AI capability… If you’ve got a decisive strategic advantage that could happen.

I would be surprised if the first human-level AI allowed you to get anything close to a decisive strategic advantage. Maybe when you’re at 1000X human level AI, perhaps. Maybe not a thousand. I don’t know. Given slow takeoff, I’d be surprised if you could knowably be like, “Oh yes, if I develop this piece of technology faster than my opponent, I will get a decisive strategic advantage.”

Asya Bergal: That makes sense. We discussed a lot of cruxes you have. Do you feel like there’s evidence that you already have pre-computed that you think could move you in one direction or another on this? Obviously, if you’ve got evidence that X was true, that would move you, but are there concrete things where you’re like, “I’m interested to see how this will turn out, and that will affect my views on the thing?”

Rohin Shah: So I think I mentioned the… On the question of timelines, they are like the… How much did evolution actually bake in to humans? It seems like a question that could put… I don’t know if it could be answered, but maybe you could answer that one. That would affect it… I lean on the side of not really, but it’s possible that the answer is yes, actually quite a lot. If that was true, I just lengthen my timelines basically.

Sara Haxhia: Can you also explain how this would change your behavior with respect to what research you’re doing, or would it not change that at all?

Rohin Shah: That’s a good question. I think I would have to think about that one for longer than two minutes.

As background on that, a lot of my current research is more trying to get AI researchers to be thinking about what happens when you deploy, when you have AI systems working with humans, as opposed to solving alignment. Mostly because I for a while couldn’t see research that felt useful to me for solving alignment. I think I’m now seeing more things that I can do that seem more relevant and I will probably switch to doing them possibly after graduating because thesis, and needing to graduate, and stuff like that.

Rohin Shah: Yes, but you were asking evidence that would change my mind-

Asya Bergal: I think it’s also reasonable to be not sure exactly about concrete things. I don’t have a good answer to this question off the top of my head.

Rohin Shah: It’s worth at least thinking about for a couple of minutes. I think I could imagine getting more information from either historical case studies of how people have dealt with new technologies, or analyses of how AI researchers currently think about things or deal with stuff, could change my mind about whether I think the AI community would by default handle problems that arise, which feels like an important crux between me and others.

I think currently my sense is if the like… You asked me this, I never answered it. If the AI safety field just sort of vanished, but the work we’ve done so far remained and conscientious AI researchers remained, or people who are already AI researchers and already doing this sort of stuff without being influenced by EA or rationality, then I think we’re still fine because people will notice failures and correct them.

I did answer that question. I said something like 90%. This was a scenario I was saying 90% for. And yeah, that one feels like a thing that I could get evidence on that would change my mind.

I can’t really imagine what would cause me to believe that AI systems will actually do a treacherous turn without ever trying to deceive us before that. But there might be something there. I don’t really know what evidence would move me, any sort of plausible evidence I could see that would move me in that direction.

Slow takeoff versus fast takeoff… I feel like MIRI still apparently believes in fast takeoff. I don’t have a clear picture of their reasons; I expect those reasons would move me towards fast takeoff.

Oh, on the expected utility max or the… my perception of MIRI, or of Eliezer and also maybe MIRI, is that they have this position that any AI system, any sufficiently powerful AI system, will look to us like an expected utility maximizer, therefore convergent instrumental sub-goals and so on. I don’t buy this. I wrote a post explaining why I don’t buy this.

Yeah, there’s a lot of just like… MIRI could say their reasons for believing things and that would probably cause me to update. Actually, I have enough disagreements with MIRI that they may not update me, but it could in theory update me.

Asya Bergal: Yeah, that’s right. What are some disagreements you have with MIRI?

Rohin Shah: Well, the ones I just mentioned. There is this great post from maybe not a year ago, but in 2018, called ‘Realism about Rationality’, which is basically this perspective that there is the one true learning algorithm or the one correct way of doing exploration, or just, there is a platonic ideal of intelligence. We could in principle find it, code it up, and then we would have this extremely good AI algorithm.

Then there is like, to the extent that this was a disagreement back in 2008, Robin Hanson would have been on the other side saying, “No, intelligence is just like a broad… just like conglomerate of a bunch of different heuristics that are all task specific, and you can’t just take one and apply it on the other space. It is just messy and complicated and doesn’t have a nice crisp formalization.”

And, I fall not exactly on Robin Hanson’s side, but much more on Robin Hanson’s side than the ‘rationality is a real formalizable natural thing in the world’.

Sara Haxhia: Do you have any idea where the cruxes of disagreement are at all?

Rohin Shah: No, that one has proved very difficult to…

Robert Long: I think that’s an AI Impacts project, or like a dissertation or something. I feel like there’s just this general domain specificity debate, how general is rationality debate…

I think there are these very crucial considerations about the nature of intelligence and how domain specific it is and they were an issue between Robin and Eliezer and no one… It’s hard to know what evidence, what the evidence is in this case.

Rohin Shah: Yeah. But I basically agree with this and that it feels like a very deep disagreement that I have never had any success in coming to a resolution to, and I read arguments by people who believe this and I’m like, “No.”

Sara Haxhia: Have you spoken to people?

Rohin Shah: I have spoken to people at CHAI, I don’t know that they would really be on board this train. Hold on, Daniel probably would be. And that hasn’t helped that much. Yeah. This disagreement feels like one where I would predict that conversations are not going to help very much.

Robert Long: So, the general question here was disagreements with MIRI, and then there’s… And you’ve mentioned fast takeoff and maybe relatedly, the Yudkowsky-Hanson–

Rohin Shah: Realism about Rationality is how I’d phrase it. There’s also the– are AI researchers conscientious? Well, actually I don’t know that they would say they are not conscientious. Maybe they’d say they’re not paying attention or they have motivated reasoning for ignoring the issues… lots of things like that.

Robert Long: And this issue of do advanced intelligences look enough like EU maximizers…

Rohin Shah: Oh, yes. That one too. Yeah, sorry. That’s one of the major ones. Not sure how I forgot that.

Robert Long: I remember it because I’m writing it all down, so… again, you’ve been talking about very complicated things.

Rohin Shah: Yeah. Related to the Realism about Rationality point is the use of formalism and proof. Not formalism, but proof at least. I don’t know that MIRI actually believes that what we need to do is write a bunch of proofs about our AI system, but it sure sounds like it, and that seems like a too difficult, and basically impossible task to me, if the proofs that we’re trying to write are about alignment or beneficialness or something like that.

They also seem to… No, maybe all the other disagreements can be traced back to these disagreements. I’m not sure.






I can’t really imagine what would cause me to believe that AI systems will actually do a treacherous turn without ever trying to deceive us before that. But there might be something there. I don’t really know what evidence would move me, any sort of plausible evidence I could see that would move me in that direction.

Perhaps this?

In that sentence I meant "a treacherous turn that leads to an existential catastrophe", so I don't think the example you link updates me strongly on that.

While Luke talks about that scenario as an example of a treacherous turn, you could equally well talk about it as an example of "deception", since the evolved creatures are "artificially" reducing their rates of reproduction to give the supervisor / algorithm a "false belief" that they are bad at reproducing. Another example along these lines is when a robot hand "deceives" its human overseer into thinking that it has grasped a ball, when it is in fact in front of the ball.

I think really though these examples aren't that informative because it doesn't seem reasonable to say that the AI system is "trying" to do something in these examples, or that it does some things "deliberately". These behaviors were learned through trial and error. An existential catastrophe style treacherous turn would presumably not happen through trial and error. (Even if it did, it seems like there must have been at least some cases where it tried and failed to take over the world, which seems like a clear and obvious warning shot, that we for some reason completely ignored.)

(If it isn't clear, the thing that I care about is something like "will there be some 'warning shot' that greatly increases the level of concern people have about AI systems, before it is too late".)

That makes sense. Thanks for the comment!

Shah thinks there are several things that could change his beliefs, including:

If he learned that evolution actually baked a lot into humans (‘nativism’), he would lengthen the amount of time he thinks there will be before AGI.

Tooby and Cosmides are big advocates for the "massive modularity" view--a huge amount of human cognition takes place in specialized, task-tailored modules rather than on one big, domain-general "computer". Common examples of these sorts of modules are:

  • Chomsky's universal grammar: There's not enough language data for children to learn languages in the absence of inductive biases.
  • Social exchange: People perform much better at the Wason selection task when the domain is social exchange rather than fully abstract.

Unfortunately, I don't know of any review collecting and examining evidence for the massive modularity view.

(Not sure how much of this Shah already knows.)

(Not sure how much of this Shah already knows.)

Not much, sadly. I don't actually intend to learn about it in the near future, because I don't think timelines are particularly decision-relevant to me (though they are to others, especially funders). Thanks for the links!

Tooby and Cosmides are big advocates for the "massive modularity" view--a huge amount of human cognition takes place in specialized, task-tailored modules rather than on one big, domain-general "computer".

On my view, babies would learn a huge amount about the structure of the world simply by interacting with it (pushing over an object can in principle teach you a lot about objects, causality, intuitive physics, etc), and this leads to general patterns that we later call "inductive biases" for more complex tasks. For example, hierarchy is a very useful way to understand basically any environment we are ever in; perhaps babies develop a sense of "hierarchy" which then gets applied to language, explaining how children learn languages so fast.

From the Wikipedia page you linked, challenges to a "rationality" based view:

1. Evolutionary theories using the idea of numerous domain-specific adaptions have produced testable predictions that have been empirically confirmed; the theory of domain-general rational thought has produced no such predictions or confirmations.

I wish they said what these predictions were. I'm not going to chase down this reference.

2. The rapidity of responses such as jealousy due to infidelity indicates a domain-specific dedicated module rather than a general, deliberate, rational calculation of consequences.

This is a good point; in general emotions are probably not learned, for the most part. I'm not sure what's going on there.

3. Reactions may occur instinctively (consistent with innate knowledge) even if a person has not learned such knowledge.

I agree that reflexes are "built-in" and not learned; reflexes are also pretty different from e.g. language. Obviously not everything our bodies do is "learned", reflexes, breathing, digestion, etc. all fall into the "built-in" category. I don't think this says much about what leads humans to be good at chess, language, plumbing, soccer, gardening, etc, which is what I'm more interested in.

It seems likely to me that you might need the equivalent of reflexes, breathing, digestion, etc. if you want to design a fully autonomous agent that learns without any human support whatsoever, but we will probably instead design an agent that (initially) depends on us to keep the electricity flowing, to fix any wiring issues, to keep up the Internet connection, etc. (In contrast, human parents can't ensure that the child keeps breathing, so you need an automatic, built-in system for that.)

perhaps babies develop a sense of "hierarchy" which then gets applied to language, explaining how children learn languages so fast.

Though if we are to believe this paper at face value (I haven't evaluated it), babies start learning in the womb. (The paper claims that the biases depend on which language is spoken around the pregnant mother, which suggests that it must be learned, rather than being "built-in".)

Chomsky's universal grammar: There's not enough language data for children to learn languages in the absence of inductive biases.

I think there's more recent work in computational linguistics that challenges this. Unfortunately I can't summarize it since I only took an overview course a long time ago. I've been wondering whether I should read up on language evolution at some point. Mostly because it seems really interesting, but also because it's a field I haven't seen being discussed in EA circles, and it seems potentially useful to have this background when it comes to evaluating/interpreting AI milestones and so on. In any case, if someone understands computational linguistics, language evolution and how it relates to the nativism debate, I'd be extremely interested in a summary!

For reference, here's the post on realism about rationality that Rohin mentioned several times.
