
In this report (which I'm linking from the Alignment Forum) I have attempted to put together the most compelling case for why the development of AGI might pose an existential threat. It stems from my dissatisfaction with existing arguments about the potential risks from AGI. Early work tends to be less relevant in the context of modern machine learning; more recent work is scattered and brief. I originally intended to just summarise other people's arguments, but as this report has grown, it's become more representative of my own views and less representative of anyone else's. So while it covers the standard ideas, I also think that it provides a new perspective on how to think about AGI - one which doesn't take any previous claims for granted, but attempts to work them out from first principles.

The report is primarily aimed at people who already understand the basics of machine learning, but most of it should also make sense to laypeople. It's roughly 15,000 words in total, split into six sections: the first and last are short framing sections, while the middle four correspond to the four premises of the core argument. The brief introductory section appears below; find the rest on the Alignment Forum.

AGI safety from first principles

The key concern motivating technical AGI safety research is that we might build autonomous artificially intelligent agents which are much more intelligent than humans, and which pursue goals that conflict with our own. Human intelligence allows us to coordinate complex societies and deploy advanced technology, and thereby control the world to a greater extent than any other species. But AIs will eventually become more capable than us at the types of tasks by which we maintain and exert that control. If they don’t want to obey us, then humanity might become only Earth's second most powerful "species", and lose the ability to create a valuable and worthwhile future.

I’ll call this the “second species” argument; I think it’s a plausible argument which we should take very seriously.[1] However, the version stated above relies on several vague concepts and intuitions. In this report I’ll give the most detailed presentation of the second species argument that I can, highlighting the aspects that I’m still confused about. In particular, I’ll defend a version of the second species argument which claims that, without a concerted effort to prevent it, there’s a significant chance that:

  1. We’ll build AIs which are much more intelligent than humans (i.e. superintelligent).
  2. Those AIs will be autonomous agents which pursue large-scale goals.
  3. Those goals will be misaligned with ours; that is, they will aim towards outcomes that aren’t desirable by our standards, and trade off against our goals.
  4. The development of such AIs would lead to them gaining control of humanity’s future.

While I use many examples from modern deep learning, this report is also intended to apply to AIs developed using very different models, training algorithms, optimisers, or training regimes than the ones we use today. However, many of my arguments would no longer be relevant if the field of AI moves away from focusing on machine learning. I also frequently compare AI development to the evolution of human intelligence; while the two aren’t fully analogous, humans are the best example we currently have to ground our thinking about generally intelligent AIs.

That's the introduction; to continue reading, here's the next section, on Superintelligence. In addition to reframing existing arguments, here are a few of the more novel claims made in the rest of the report:

  1. When training AIs which can perform well on a range of novel tasks, we shouldn’t think of objective functions as specifications of our desired goals, but rather as tools to shape our agents’ motivations and cognition.
  2. Interactions between many AGIs (specifically via replication, cultural learning, and recursive improvement) will be important during the transition from human-level AGI to superintelligence.
  3. Existing frameworks for thinking about goal-directed agency don’t help us to predict the types of goals AGIs will have. To do so, we should identify specific cognitive capacities AGIs would need to be capable of pursuing goals, and how those might develop.
  4. The likelihood of inner misalignment occurring depends on whether instrumentally convergent subgoals will be present during training, and how complex they will be compared with the outer objective.
  5. We should plan towards building intent aligned AGIs which are better than humans at safety and governance research. Up to that point, we can increase our chances of retaining control via coordination to deploy transparent systems in constrained ways.

See also Rohin's summary for the Alignment Newsletter here.


  1. Stuart Russell also refers to this as the “gorilla problem” in his recent book, Human Compatible. ↩︎

Comments



In the report, you say:

We can start with Legg’s well-known definition, which identifies intelligence as the ability to do well on a broad range of cognitive tasks.


However, this is importantly different from Legg's actual definition (p. 12), which is:

Intelligence measures an agent’s ability to achieve goals in a wide range of environments

I'm curious whether this change is intentional, or perhaps I'm looking at the wrong link?

I think the two definitions are meaningfully different (consider the case of a child completing Raven's Progressive Matrices without any particular interest or inclination to do so). The definition you use is more common when people refer to human intelligence, and the definition from Legg and Hutter is more common when people refer to machine intelligence.

I intended mine to be a slight rephrasing of Legg and Hutter's definition to make it more accessible to people without RL backgrounds. One thing that's not obvious from the way they use "environments" is that the goal is actually built into the environment via a reward function, so describing each environment as a "task" seems accurate.

A second non-obvious thing is that the body the agent uses is also defined as part of the environment, so that the agent only performs the abstract task of sending instructions to that body. A naive reading of Legg and Hutter's definition would interpret a physically stronger agent as being more intelligent. Adding "cognitive" I think rules this out, while also remaining true to the spirit of the original definition.
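For readers who want the formal version: if I'm remembering Legg and Hutter's formalisation correctly, their universal intelligence measure is roughly

$$\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}$$

where E is the set of computable environments, K(μ) is the Kolmogorov complexity of environment μ, and V^π_μ is the expected cumulative reward that policy π achieves in μ. Both the reward function and (as noted above) the agent's body live inside μ, which is why reading each μ as a "task" seems faithful to the original.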

Curious if you still disagree, and if so why - I don't really see what you're pointing at with the Raven's Matrices example.

One thing that's not obvious from the way they use "environments" is that the goal is actually built into the environment via a reward function, so describing each environment as a "task" seems accurate.

Thanks, I was not aware of this issue. 

Curious if you still disagree, and if so why - I don't really see what you're pointing at with the Raven's Matrices example.

I'm less sure I disagree now. I think my intuitive sense of intelligence (as defined for humans) is that it's used as a (sometimes single-variable) summary of a broad cluster of somewhat distinct but in practice fairly correlated traits like pattern recognition, working memory, etc., while Legg's definition is carefully written to define intelligence purely in terms of outcomes. The fact that the reward is built into the environment is not something I was previously aware of, and I need to think harder about whether I still have reservations.

One thing I'm confused about is whether Legg's definition (or your rephrasing) allows for situations where it's in principle possible that being smarter is ex ante worse for an agent (obviously ex post it's possible to follow the correct decision procedure and be unlucky). My intuition is that most naive definitions of intelligence allow for this to at least theoretically be possible in various ways, but I'm not sure (and currently lean against) Legg's definition allowing for this.

One thing I'm confused about is whether Legg's definition (or your rephrasing) allows for situations where it's in principle possible that being smarter is ex ante worse for an agent (obviously ex post it's possible to follow the correct decision procedure and be unlucky).

There definitely are such cases - e.g. Omega penalises all smart agents. Or environments where there are several crucial considerations which you're able to identify at different levels of intelligence, so that as intelligence increases, your success increases and decreases.

But in general I agree with your complaint about Legg's definition being defined in behavioural terms, and how it'd be better to have a good definition of intelligence in terms of the cognitive processes involved (e.g. planning, abstraction, etc). I do think that starting off in behaviourist terms was a good move, back when people were much more allergic to talking about AGI/superintelligence. But now that we're past that point, I think we can do better. (I don't think I've written about this yet in much detail, but it's quite high on my list of priorities.)

There definitely are such cases ... Or environments where there are several crucial considerations which you're able to identify at different levels of intelligence, so that as intelligence increases, your success increases and decreases.

Sorry, I'm confused about this claim as stated. Assume that all environments have 3 levels of abstraction, with this ultimate {action -> expected utility} mapping:

  A -> 10 expected utils
  B -> -10 expected utils
  C -> 20 expected utils

It seems to me that by the definition:

Intelligence measures an agent’s ability to achieve goals in a wide range of environments

Then, by definition, the strategy that outputs C is smarter than the strategy that outputs A, which in turn is smarter than the strategy that outputs B. So B < A < C.

This is true even if, cognitively, the algorithm that outputs B is more sophisticated than the one that outputs A (e.g. it takes more crucial considerations into account, or is literally the same learning algorithm but with more compute).
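To spell the ranking out numerically, here's a trivial sketch using the toy numbers above (the "sophistication" scores are made up, and deliberately never used):

```python
# Toy illustration: the behavioural definition ranks strategies purely by the
# expected utility they achieve; how sophisticated the underlying algorithm is
# never enters the calculation. All numbers are the made-up ones from above.

expected_utils = {"A": 10, "B": -10, "C": 20}

# Hypothetical sophistication scores (B comes from the most sophisticated
# algorithm); note these are never consulted in the ranking below.
sophistication = {"A": 1, "B": 3, "C": 2}

behavioural_ranking = sorted(expected_utils, key=expected_utils.get, reverse=True)
print(behavioural_ranking)  # ['C', 'A', 'B'], i.e. B < A < C regardless of sophistication
```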

Am I confused here? 

Ah, I see. I thought you meant "situations" as in "individual environments", but it seems like you meant "situations" as in "possible ways that all environments could be".

In that case, I think you're right, but I don't consider it a problem. Why might it be the case that adding more compute, or more memory, or something like that, would be net negative across all environments? It seems like either we'd have to define the set of environments in a very gerrymandered way, or else there's something about the change we made that lands us in a valley of bad thinking. In the former case, we should use a wider set of environments; in the latter case, it seems easier to bite the bullet and say "Yeah, turns out that adding more of this usually-valuable trait makes agents less intelligent."

Hmm, I'm probably not phrasing this well, but the point I'm trying to get across is that the Legg definition builds in that intelligence is always monotonically good, as a matter of definition rather than of fact. I actually agree with you that, empirically, "smarts (as usually defined) lead to good outcomes" seems like the most natural hypothesis, but I'd have preferred a definition of intelligence which leaves this open as an empirical question, over one that assumes it by definition.

I realize that "empirical hypothesis" is weird because of No Free Lunch, so I guess by a range of environments I mean something like "environments that plausibly reflect actual questions that might naturally arise in the real world" (Not very well-stated).

For example, another thing that I'm sort of interested in is multiagent situations where credibly proving you're dumber makes you a more trustworthy agent, where it feels weird for me to claim that the credibly dumber agent is actually on some deeper level smarter than the naively smarter agent (whereas an agent smart enough to credibly lie about their dumbness is smarter again on both definitions). 
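As a toy illustration of the situation I mean (all payoffs made up): suppose a principal will only delegate a task to an agent it can verify is incapable of exploiting it.

```python
# Toy delegation game (hypothetical payoffs): the principal delegates only to
# agents it can verify are incapable of exploitation, and capable agents exploit
# whenever they are delegated to.

PAYOFFS = {
    "not_delegated": 0,       # principal keeps the task; agent gets nothing
    "delegate_cooperate": 5,  # agent does the task honestly
    "delegate_exploit": 10,   # agent exploits the principal's trust
}

def agent_payoff(can_exploit: bool, credibly_harmless: bool) -> int:
    """Expected payoff to the agent under the assumptions in the comment above."""
    if not credibly_harmless:
        return PAYOFFS["not_delegated"]
    return PAYOFFS["delegate_exploit"] if can_exploit else PAYOFFS["delegate_cooperate"]

print(agent_payoff(can_exploit=False, credibly_harmless=True))   # 5: demonstrably "dumb" agent
print(agent_payoff(can_exploit=True,  credibly_harmless=False))  # 0: naively smarter agent
print(agent_payoff(can_exploit=True,  credibly_harmless=True))   # 10: smart enough to fake harmlessness
```

On these made-up numbers the behavioural definition would rank the credibly incapable agent above the naively smarter one, which is the weirdness I'm gesturing at.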

(I don't think the Omega hates smartness example, or for that matter a world where anti-induction is the correct decision procedure, is very interesting, relatively speaking, because they feel contrived enough to be a relatively small slice of realistic possibilities). 

Ah, I like the multiagent example. So to summarise: I agree that we have some intuitive notion of what cognitive processes we think of as intelligent, and it would be useful to have a definition of intelligence phrased in terms of those. I also agree that Legg's behavioural definition might diverge from our implicit cognitive definition in non-trivial ways.

I guess the reason why I've been pushing back on your point is that I think that possible divergences between the two aren't the main thing going on here. Even if it turned out that the behavioural definition and the cognitive definition ranked all possible agents the same, I think the latter would be much more insightful and much more valuable for helping us think about AGI.

But this is probably not an important disagreement.

I see the issue now. To restate it in my own words: both of us agree that cognitive definitions are plausibly more useful than behavioral definitions (and you are probably more confident in this claim than I am). For me, though, the cruxes lie where the cognitive and behavioral definitions diverge in non-trivial ways in how they rank agents, and in those cases the divergences are important and interesting; whereas in your case, you'd consider the cognitive definitions more insightful for thinking about AGI even if it can later be shown that the divergences are only trivial.

Upon reflection, I'm not sure if we disagree. I'll need to think harder about whether I'd consider using the cognitive definitions (which presumably will suffer a bit of an elegance tax) to still be a generally superior way of thinking about AGI than using the behavioral definition if there are no non-trivial divergences.

I also agree that as stated this is probably not an important disagreement. 

[I'm doing a bunch of low-effort reviews of posts I read a while ago and think are important. Unfortunately, I don't have time to re-read them or say very nuanced things about them.]

There's been a variety of work over the last few years focused on examining the arguments for focusing on AI alignment. I think this is one of the better and more readable ones. It's also quite long and not-really-on-the-Forum. Not sure what to do with that. The last post has a bunch of comment threads, which might be a good way of demonstrating EA reasoning.
