Comment Permalink

The paper is an interesting read, but I think that it unfortunately isn't of much practical value down to the omission of a crucial consideration:

The paper rests on the assumption that alignment/control of artificial superintelligence (ASI) is possible. This has not been theoretically established, let alone assessed to be practically likely in the time we have before an intelligence explosion. As far as I know, there aren't any sound supporting arguments for the assumption (and you don't reference any), and in fact there are good arguments on the other side for why aligning or controlling ASI is fundamentally impossible.

AI Takeover is listed first in the Grand Challenges section, but it trumps all the others because it is the default outcome. You even say “we should expect AIs that can outsmart humans”, and “There are reasonable arguments for expecting misalignment, and subsequent takeover, as the ‘default’ outcome (without concerted efforts to prevent it).”, and “There is currently no widely agreed-upon solution to the problems of aligning and controlling advanced AI systems, and so leading experts currently see the risk of AI takeover as substantial.” I still don’t understand where the ~10% estimates are coming from though; [fn 93:] “just over 50% of respondents assigned a subjective probability of 10% or more to the possibility that, “human inability to control future advanced Al systems causing human extinction or similarly permanent and severe disempowerment of the human species” (Grace et al., ‘Thousands of AI Authors on the Future of AI’.)”]. They seem logically unfounded. What is happening in the other ~90%? I didn’t get any satisfactory answers when asking here a while back.

You say “In this paper, we won’t discuss AI takeover risk in depth, but that’s because it is already well-discussed elsewhere.” It’s fine that you want to talk about other stuff in the paper, but that doesn’t make it any less of a crucial consideration that overwhelms concern for all of the other issues!

You conclude by saying that “Many are admirably focused on preparing for a single challenge, like misaligned AI takeover... But focusing on one challenge is not the same as ignoring all others: if you are a single-issue voter on AI, you are probably making a mistake.” I disagree, because alignment of ASI hasn’t been shown to even be solvable in principle! It is the single most important issue by far. The others don't materialise because they assume humans will be in control of ASI for the most part (which is very unlikely to happen). The only practical solution (which is also dissolves nearly all the other issues identified in the paper) is to prevent ASI from being built^[1]. We need a well enforced global moratorium on ASI as soon as possible.

^{^}
At least until either it can be built safely, or the world collectively decides to take whatever risk remains after a consensus on an alignment/control solution is reached. At which point the other issues identified in the paper become relevant.

Showing 3 of 4 replies (Click to show all)

Greg_Colbourn ⏸️

Mar 24

Thanks for the reply. This is a stupid analogy! (Traffic accidents aren't very likely.) A better analogy would be "all the preparations for a wedding would be undermined if the couple weren't able to to be together because one was stranded on Mars with no hope of escape. This justifies spending all the wedding budget on trying to rescue them." Or perhaps even better: "all the preparations for a wedding would be undermined if the couple probably won't be able to be together, because one taking part in a mission to Mars that half the engineers and scientists on the guest list are convinced will be a death trap (for detailed technical reasons). This justifies spending all the wedding budget on trying to stop the mission from going ahead." I think Wei Dei's reply articulates my position well: Your next point seems somewhat of a straw man? No, the correct reply is that dolphins won't run the world because they can't develop technology down to their physical form (no opposable thumbs etc), and they won't be able to evolve their physical form in such a short time (even with help from human collaborators)[1]. i.e. an object level rebuttal. No, but they had sound theoretical arguments. I'm saying these are lacking when it comes to why it's possible to align/control/not go extinct from ASI. I'd say ~90% (and the remaining 10% is mostly exotic factors beyond our control [footnote 10 of linked post]). But it's worse than this, because the only viable solution to avoid takeover is to stop building ASI, in which case the non-takeover work is redundant (we can mostly just hope to luck out with one of the exotic factors). 1. ^ And they won't be able to be helped by ASIs either, because the control/alignment problem will remain unsolved (and probably unsolvable, for reasons x, y, z...)

finm

Mar 26

Oh, I didn't mean to imply that I think AI takeover risk is on par with traffic accident-risk. I was just illustrating the abstract point that the mere presence of a mission-ending risk doesn't imply spending everything to prevent it. I am guessing you agree with this abstract point (but furthermore think that AI takeover risk is extremely high, and as such we should ~entirely focus on preventing it). Maybe I'm splitting hairs, but “x-risk could be high this century as a result of AI” is not the same claim as “x-risk from AI takeover is high this century”, and I read you as making the latter claim (obviously I can't speak for Wei Dai). That's right, and I do think the dolphin example was too misleading and straw-man-ish. The point I was trying to illustrate, though, is not that there is no way to refute the dolphin theory, but that failing to adequately describe the alternative outcome(s) doesn't especially support the dolphin theory, because trying to accurately describe the future is just generally extremely hard. Got it. I guess I see things as messier than this — I see people with very high estimates of AI takeover risk advancing arguments, and I see others advancing skeptical counter-arguments (example), and before engaging with these arguments a lot and forming one's own views, I think it's not obvious which sets of arguments are fundamentally unsound. Makes sense.

Greg_Colbourn ⏸️ Mar 312

I am guessing you agree with this abstract point (but furthermore think that AI takeover risk is extremely high, and as such we should ~entirely focus on preventing it).

Yes (but also, I don't think the abstract point is adding anything, because of the risk actually being significant.)

Maybe I'm splitting hairs, but “x-risk could be high this century as a result of AI” is not the same claim as “x-risk from AI takeover is high this century”, and I read you as making the latter claim (obviously I can't speak for Wei Dai).

This does seem like splitting hairs. Mo... (read more)

See in context

Preparing for the Intelligence Explosion

by finm, William_MacAskill, Forethought

Mar 111 min read 15

120

AI safetyIntelligence explosionResearchAI governanceAnnouncements and updatesEvents on the EA ForumExistential Choices Debate WeekOrganization updates

Frontpage

This is a linkpost for https://www.forethought.org/preparing-for-the-intelligence-explosion

This is a linkpost for a new paper called Preparing for the Intelligence Explosion, by Will MacAskill and Fin Moorhouse. It sets the high-level agenda for the sort of work that Forethought is likely to focus on.

Some of the areas in the paper that we expect to be of most interest to EA Forum or LessWrong readers are:

Section 3 finds that even without a software feedback loop (i.e. “recursive self-improvement”), even if scaling of compute completely stops in the near term, and even if the rate of algorithmic efficiency improvements slow, then we should still expect very rapid technological development — e.g. a century’s worth of progress in a decade — once AI meaningfully substitutes for human researchers.
A presentation, in section 4, of the sheer range of challenges that an intelligence explosion would pose, going well beyond the “standard” focuses of AI takeover risk and biorisk.
Discussion, in section 5, of when we can and can’t use the strategy of just waiting until we have aligned superintelligence and relying on it to solve some problem.
An overview, in section 6, of what we can do, today, to prepare for this range of challenges.

Here’s the abstract:

AI that can accelerate research could drive a century of technological progress over just a few years. During such a period, new technological or political developments will raise consequential and hard-to-reverse decisions, in rapid succession. We call these developments grand challenges.
These challenges include new weapons of mass destruction, AI-enabled autocracies, races to grab offworld resources, and digital beings worthy of moral consideration, as well as opportunities to dramatically improve quality of life and collective decision-making.
We argue that these challenges cannot always be delegated to future AI systems, and suggest things we can do today to meaningfully improve our prospects. AGI preparedness is therefore not just about ensuring that advanced AI systems are aligned: we should be preparing, now, for the disorienting range of developments an intelligence explosion would bring.

120 Reactions

Comments15

Sorted by

New & upvoted

Click to highlight new comments since: Today at 11:27 AM

Greg_Colbourn ⏸️ Mar 18*5

^{^}
At least until either it can be built safely, or the world collectively decides to take whatever risk remains after a consensus on an alignment/control solution is reached. At which point the other issues identified in the paper become relevant.

finmMar 2310

Thanks for the comment. I agree that if you think AI takeover is the overwhelmingly most likely outcome from developing ASI, then preventing takeover (including by preventing ASI) should be your strong focus. Some comments, though —

Just because failing at alignment undermines ~every other issue, doesn't mean that working on alignment is the only or overwehelmingly most important thing.^[1] Tractability and likelihood also matters.
I'm not sure I buy that things are so stark as “there are no arguments against AI takeover”, see e.g. Katja Grace's post here. I also think there are cases where someone presents you with an argument that superficially drives toward a conclusion that sounds unlikely, and it's legitimate to be skeptical of the conclusion even if you can't spell out exactly where the argument is going wrong (e.g. the two-envelope “paradox”). That's not to say you can justify not engaging with the theoretical arguments whenever you're uncomfortable with where they point, just that humility deducing bold claims about the future on theoretical grounds cuts both ways.
Relatedly, I don't think you don't need to be able to describe alternative outcomes in detail to reject a prediction about how the world goes. If I tell someone the world will be run by dolphins in the year 2050, and they disagree, I can reply, “oh yeah, well you tell me what the world looks like in 2050”, and their failure to describe their median world in detail doesn't strongly support the dolphin hypothesis.^[2]
“Default” doesn't necessarily mean “unconditionally likely” IMO. Here I take it to mean something more like “conditioning on no specific response and/or targeted countermeasures”. Though I guess it's baked into the meaning of “default” that it's unconditionally plausible (like, ⩾5%?) — it would be misleading to say “the default outcome from this road trip is that we all die (if we don't steer out of oncoming traffic)”.
In theory, one could work on making outcomes from AI takeover less bad, as well as making them less likely (though less clear what this looks like).

Altogether, I think you're coming from a reasonable but different position, that takeover risk from ASI is very high (sounds like 60–99% given ASI?) I agree that kinds of preparedness not focused on avoiding takeover look less important on this view (largely because they matter in fewer worlds). I do think this axis of disagreement might not be as sharp as it seems, though — suppose person A has 60% p(takeover) and person B is on 1%. Assuming the same marginal tractability and neglectedness between takeover and non-takeover work, person A thinks that takeover-focused work is 60× more important; but non-takeover work is 40/99≈0.4 times as important, compared to person B.

^{^}
By (stupid) analogy, all the preparations for a wedding would be undermined if the couple got into a traffic accident on the way to the ceremony; this does not justify spending ~all the wedding budget on car safety.
^{^}
Again by analogy, there were some superficially plausible arguments in the 1970s or thereabouts that population growth would exceed the world's carrying capacity, and we'd run out of many basic materials, and there would be a kind of system collapse by 2000. The opponents of these arguments were not able to describe the ways that the world could avoid these dire fates in detail (they could not describe the specific tech advances which could raise agricultural productivity, or keep materials prices relatively level, for instance).

Greg_Colbourn ⏸️ Mar 24*5

Thanks for the reply.

By (stupid) analogy, all the preparations for a wedding would be undermined if the couple got into a traffic accident on the way to the ceremony; this does not justify spending ~all the wedding budget on car safety.

This is a stupid analogy! (Traffic accidents aren't very likely.) A better analogy would be "all the preparations for a wedding would be undermined if the couple weren't able to to be together because one was stranded on Mars with no hope of escape. This justifies spending all the wedding budget on trying to rescue them." Or perhaps even better: "all the preparations for a wedding would be undermined if the couple probably won't be able to be together, because one taking part in a mission to Mars that half the engineers and scientists on the guest list are convinced will be a death trap (for detailed technical reasons). This justifies spending all the wedding budget on trying to stop the mission from going ahead."

see e.g. Katja Grace's post here

I think Wei Dei's reply articulates my position well:

Suppose you went through the following exercise. For each scenario described under "What it might look like if this gap matters", ask:
Is this an existentially secure state of affairs?
If not, what are the main obstacles to reaching existential security from here?
and collected the obstacles, you might assemble a list like this one, which might update you toward AI x-risk being "overwhelmingly likely". (Personally, if I had to put a number on it, I'd say 80%.)

Your next point seems somewhat of a straw man?

If I tell someone the world will be run by dolphins in the year 2050, and they disagree, I can reply, “oh yeah, well you tell me what the world looks like in 2050”

No, the correct reply is that dolphins won't run the world because they can't develop technology down to their physical form (no opposable thumbs etc), and they won't be able to evolve their physical form in such a short time (even with help from human collaborators)^[1]. i.e. an object level rebuttal.

The opponents of these arguments were not able to describe the ways that the world could avoid these dire fates in detail

No, but they had sound theoretical arguments. I'm saying these are lacking when it comes to why it's possible to align/control/not go extinct from ASI.

Altogether, I think you're coming from a reasonable but different position, that takeover risk from ASI is very high (sounds like 60–99% given ASI?)

I'd say ~90% (and the remaining 10% is mostly exotic factors beyond our control [footnote 10 of linked post]).

I do think this axis of disagreement might not be as sharp as it seems, though — suppose person A has [9]0% p(takeover) and person B is on 1%. Assuming the same marginal tractability and neglectedness between takeover and non-takeover work, person A thinks that takeover-focused work is [9]0× more important; but non-takeover work is 10/99≈0.[1] times as important, compared to person B.

But it's worse than this, because the only viable solution to avoid takeover is to stop building ASI, in which case the non-takeover work is redundant (we can mostly just hope to luck out with one of the exotic factors).

^{^}
And they won't be able to be helped by ASIs either, because the control/alignment problem will remain unsolved (and probably unsolvable, for reasons x, y, z...)

finmMar 262

This is a stupid analogy! (Traffic accidents aren't very likely.)

Oh, I didn't mean to imply that I think AI takeover risk is on par with traffic accident-risk. I was just illustrating the abstract point that the mere presence of a mission-ending risk doesn't imply spending everything to prevent it. I am guessing you agree with this abstract point (but furthermore think that AI takeover risk is extremely high, and as such we should ~entirely focus on preventing it).

I think Wei Dei's reply articulates my position well:

Maybe I'm splitting hairs, but “x-risk could be high this century as a result of AI” is not the same claim as “x-risk from AI takeover is high this century”, and I read you as making the latter claim (obviously I can't speak for Wei Dai).

No, the correct reply is that dolphins won't run the world because they can't develop technology

That's right, and I do think the dolphin example was too misleading and straw-man-ish. The point I was trying to illustrate, though, is not that there is no way to refute the dolphin theory, but that failing to adequately describe the alternative outcome(s) doesn't especially support the dolphin theory, because trying to accurately describe the future is just generally extremely hard.

No, but they had sound theoretical arguments. I'm saying these are lacking when it comes to why it's possible to align/control/not go extinct from ASI.

Got it. I guess I see things as messier than this — I see people with very high estimates of AI takeover risk advancing arguments, and I see others advancing skeptical counter-arguments (example), and before engaging with these arguments a lot and forming one's own views, I think it's not obvious which sets of arguments are fundamentally unsound.

But it's worse than this, because the only viable solution to avoid takeover is to stop building ASI, in which case the non-takeover work is redundant (we can mostly just hope to luck out with one of the exotic factors).

Makes sense.

Greg_Colbourn ⏸️ Mar 312

I am guessing you agree with this abstract point (but furthermore think that AI takeover risk is extremely high, and as such we should ~entirely focus on preventing it).

Yes (but also, I don't think the abstract point is adding anything, because of the risk actually being significant.)

Maybe I'm splitting hairs, but “x-risk could be high this century as a result of AI” is not the same claim as “x-risk from AI takeover is high this century”, and I read you as making the latter claim (obviously I can't speak for Wei Dai).

This does seem like splitting hairs. Most of Wei Dai's linked list is about AI takeover x-risk (or at least x-risk as a result of actions that AI might take, rather than actions that humans controlling AIs might take). Also, I'm not sure where "century" comes from? We're talking about the next 5-10 years, mostly.

I guess I see things as messier than this — I see people with very high estimates of AI takeover risk advancing arguments, and I see others advancing skeptical counter-arguments (example), and before engaging with these arguments a lot and forming one's own views, I think it's not obvious which sets of arguments are fundamentally unsound.

I think there are a number of intuitions and intuition pumps that are useful here: Intelligence being evolutionarily favourable (in a generalised Darwinism sense); there being no evidence for moral realism (an objective ethics of the universe existing independently of humans) being true (->Orthogonality Thesis), or humanity having a special (divine) place in the universe (we don't have plot armour); convergent instrumental goals being overdetermined; security mindset (I think most people who have low p(doom)s probably lack this?).

That said, we also must engage with the best counter-arguments to steelman our positions. I will come back to your linked example.

Greg_Colbourn ⏸️ Mar 18*4

Some other quotes and comments from my notes (in addition to my main comment):

“If GPT-6 were as capable as a human being”

This is conservative. Why not "GPT-5"? (In which case the 100,000x efficiency gain becomes 10,000,000,000x.)

See APM section for how misaligned ASI takeover could lead to extinction. Also

“if nuclear fusion grew to produce half as much power as the solar radiation which falls on Earth”

brings to mind Yudkowsky's "boiling the oceans" scenario.

"Issues around digital rights and welfare interact with other grand challenges, most notably AI takeover. In particular, granting AIs more freedoms might accelerate a ‘gradual disempowerment’ scenario, or make a more coordinated takeover much easier, since AI systems would be starting in a position of greater power. Concerns around AI welfare could potentially limit some methods for AI alignment and control. On the other hand, granting freedoms to digital people (and giving them power to enjoy those freedoms) could reduce their incentive to deceive us and try to seize power, by letting them pursue their goals openly instead and improving their default conditions."

This is important. Something I need to read and think more about.

“If we can capture more of the wealth that advanced AI would generate before it poses catastrophic risks, then society as a whole would behave more cautiously.”

Why is this likely? Surely we need a Pause to be able to do this?

“Unknown unknowns”

Expect these to be more likely to cause extinction than a good future? (Given Vulnerable World).

“One stark and still-underappreciated challenge is that we accidentally lose control over the future to an AI takeover.”

“Here’s a sceptical response you could make to our argument: many of the challenges we list will arise only after the development of superintelligence. If superintelligence is catastrophically misaligned, then it will take over, and the other challenges won’t be relevant.” [my emphasis in bold]

Yes!

“Ensuring that we get helpful superintelligence earlier in time, that it is useable and in fact used by key decision-makers, and that is accessible to as wide a range of actors as possible without increasing other catastrophic risks.” [my emphasis in bold]

It increases takeover risk(!) given lack of progress on (the needed perfect^[1]) alignment and control techniques for ASI.

“6. AGI Preparedness”

This whole section (the whole paper?) assumes that an intelligence explosion is inevitable. There is no mention of “pause” or “moratorium” anywhere in the paper.

“At the moment, the machine learning community has major influence via which companies they choose to work for. They could form a “union of concerned computer scientists” in order to be able to act as a bloc to push development towards more socially desirable outcomes, refusing to work for companies or governments that cross certain red lines. It would be important to do this soon, because most of this influence will be lost once AI has automated machine learning research and development.
Other actors have influence too. Venture capitalists have influence via which private companies they invest in. Consumers have influence through which companies they purchase AI products from. Investigative journalists can have major influence by uncovering bad behaviour from AI companies or politicians, and by highlighting which actors seem to be acting responsibly. Individuals can do similarly by amplifying those messages on social media, and by voting for more responsible political candidates.”

We need much more of this!

“Slowing the intelligence explosion. If we could slow down the intelligence explosion in general, that would give decision-makers and institutions more time to react thoughtfully.”

Yes!

“One route to prevent chaotically fast progress is for the leading power (like the US and allies) to build a strong lead, allowing it to comfortably use stabilising measures over the period of fastest change. Such a lead could even be maintained by agreement, if the leader can credibly commit to sharing power and benefits with the laggards after achieving AGI, rather than using that advantage to dismantle its competition. Because post-superintelligence abundance will be so great, agreements to share power and benefits should strongly be in the leader’s national self-interest: as we noted in the section on abundance, having only 80% of a very large pie is much more desirable than an 80% chance of the whole pie and 20% chance of nothing. Of course, making such commitments credible is very challenging, but this is something that AI itself could help with.”

But could also just lead to Mutually Assured AI Malfunction (MAIM).

“Second, regulations which are sensible on their own terms could also slow peak rates of development. These could include mandatory predeployment testing for alignment and dangerous capabilities, tied to conditions for release; or even welfare-oriented rights for AI systems with a reasonable claim to moral status. That said, regulation along these lines would probably need international agreement in order to be effective, otherwise they could simply advantage whichever countries did not abide by them.”

An international agreement sounds good.

“Third, we could bring forward the start of the intelligence explosion, stretching out the intelligence explosion over time, so that peak rates of change are more manageable. This could give more time to react, and a longer period of time to benefit from excellent AI advice prior to grand challenges. For example, accelerating algorithmic progress now means there would be less available room for improvement in software at the time of the intelligence explosion, and the software feedback loop couldn’t go on for as long before compute constraints kick in.”

This sounds like a terrible and reckless idea! Because we don’t know exactly where the thresholds are for recursive self-improvement to kick in.

“we think an intelligence explosion is more likely than not this century, and may well begin within a decade.”

Yes, unless we stop it happening (and we should!)

“If–then commitments”

Problem is knowing that by the time the “if” is verified to have occurred, it could well be too late to do the “then” (e.g. once a proto-ASI has already escaped onto the internet).

“We shouldn’t succumb to the evidence dilemma: if we wait until we have certainty about the likelihood of the intelligence explosion, it will by then be too late to prepare. It’s too late to buy home insurance by the time you see smoke creeping under the kitchen door.”

Exactly! Need a moratorium now, not unworkable “if-then” commitments!

“challenges around space governance, global governance, missile defence, and nuclear weapons are not directly questions about how to design, build, and deploy AI itself. Rather, AI accelerates and reorders the pace at which these challenges arrive, forcing us to confront them in a world changing at disorienting speed.”

This is assuming ASI is alignable! (The whole Not just misalignment section is).

“And, often, the most important thing to do is to ensure that superintelligence is in fact used in beneficial ways, and as soon as possible.”

This has not been justified in the paper.

^{^}
We need at least 13 9s of safety for ASI, and the best current alignment techniques aren't even getting 3 9s...

SharmakeMar 2810

While finm made a general comment in response to you, I want to specifically focus on the footnote, because I think it's a central crux in why a lot of EAs are way less doomy than you.

Quote below:

We need at least 13 9s of safety for ASI, and the best current alignment techniques aren't even getting 3 9s...

I think the 13 9s can be reduced to something requiring closer to 1-2 9s at the very least, and there are 2 reasons for this:

I think you drastically overestimate how many chances the AI gets at misalignment, because the trillions of executions will use far, far too little compute per single action to lead to a takeover, even in worlds where a software intelligence explosion is possible and large efficiencies in software are gainable. I expect there to be at most millions of such plausible actions, and the more likely scenario is that it's closer to hundreds or thousands of chances to escape, undermining this tweet:

https://x.com/gcolbourn/status/1762224406955216897

2. If we manage to catch an AI doing bad stuff, then it's much, much harder for the AI to escape, because there are a lot of techniques that can be applied to make the issue go away like proliferating the escape example.

More from Ryan Greenblatt here:

https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed

I definitely think alignment needs to be reliable, but I do not think it needs to be so reliable that we cannot achieve it, or that doom is very likely and we can't change the probabilities.

I'd certainly say it's quite scary, but I do think there's a reasonable hope of surviving and going on to thrive such that I think alignment invest is worth the money.

Greg_Colbourn ⏸️ Mar 292

I think you drastically overestimate how many chances the AI gets at misalignment, because the trillions of executions will use far, far too little compute per single action to lead to a takeover

The little compute leads to much more once it has escaped!

If we manage to catch an AI doing bad stuff

The point is that we won't, unless we have many more 9s of reliability in terms of catching such attempts!

SharmakeMar 291

I agree that conditional on escaping/rogue internal deployments like this scenario by Buck, with a lot of contributors, it leads to much larger disasters, and if the AI is unaligned, then unless we have an aligned AI that has somewhat similar capabilities, we lose.

My point is more so that you are way overestimating how many chances the AI has to overthrow us before it is aligned.

https://www.lesswrong.com/posts/ceBpLHJDdCt3xfEok/ai-catastrophes-and-rogue-deployments

But the crux might be that I don't think that we need that much reliability for AI catching, because as long as the AI is caught with non-negligible probability, the AI has to be very cautious, because it is way worse for the AI to be caught than to be successful or the game just ending.

In essence, we probably only need 1, or at most a few times to completely change the game in ways that massively favor alignment success.

You do not need to reliably catch someone if you can make the costs of catching the AI out very painful, and I do suspect we will probably be able to do this (perhaps using trusted AI labor, to use Ryan Greenblatt and Buck's terms).

Greg_Colbourn ⏸️ Mar 312

before it is aligned

This is begging the question! My whole objection is that alignment of ASI hasn't been established to be possible.

as long as the AI is caught with non-negligible probability, the AI has to be very cautious, because it is way worse for the AI to be caught than to be successful or the game just ending.

So it will worry about being in a kind of panopticon? Seems pretty unlikely. Why should the AI care about being caught any more than it should about any given runtime instance of it being terminated?

SharmakeApr 73

This is begging the question! My whole objection is that alignment of ASI hasn't been established to be possible.

A couple of things I'll say here:

You do not need a strong theory for why something must be possible in order to put non-trivial credence on it being possible, and if you hold a prior that scientific difficulty of doing something is often overrated, especially if you believe in the idea that alignment is possibly automatable and that a lot of people overrate the difficulty of automating something, that's enough to cut p(doom) by a lot, arguably 1 OOM, but at the very least nowhere near your 90 p(doom)%. That doesn't mean that we are going to make it out of ASI alive, but it does mean that even in situations where there is no established theory or plan to survive, you can still possibly do something.
If I wanted to make the case that ASI alignment is possible, I'd probably read these 3 posts by Joshua Clymer first on how automated alignment schemes could work (with some discussion by Habryka and Eliezer Yudkowsky and Jeremy Gillen the comments, and Joshua Clymer's responses):

https://www.lesswrong.com/posts/8vgi3fBWPFDLBBcAx/planning-for-extreme-ai-risks

https://www.lesswrong.com/posts/TTFsKxQThrqgWeXYJ/how-might-we-safely-pass-the-buck-to-ai

https://www.lesswrong.com/posts/5gmALpCetyjkSPEDr/training-ai-to-do-alignment-research-we-don-t-already-know

So it will worry about being in a kind of panopticon? Seems pretty unlikely. Why should the AI care about being caught any more than it should about any given runtime instance of it being terminated?

The basic reason for this is that you can gain way more information on the AI once you have escaped, combined with the ability to use much more targeted countermeasures that are more effective once you have caught the AI red handed.

As a bonus, this can also eliminate threat models like sandbagging, if you have found a reproducible signal for when an AI will try to overthrow a lab.

More discussion by Ryan Greenblatt and Buck here:

https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed

finmMar 232

Thanks for these comments, Greg, and sorry for taking a while to get round to them.

This is conservative. Why not "GPT-5"? (In which case the 100,000x efficiency gain becomes 10,000,000,000x.)

Of course there's some ambiguity in what “as capable as a human being” means, since present-day LLMs are already superhuman in some domains (like general knowledge), and before AI systems are smarter in every important way than humans, they will be smarter in increasingly many but not all ways. But in the broader context of the piece, we're interested in AI systems which effectively substitute for a human researcher, and I just don't think GPT-5 will be that good. Do you disagree or were we just understanding the claim differently?

See APM section for how misaligned ASI takeover could lead to extinction.

Are you missing a quotation here?

Why is this [capturing more wealth before AI poses meaning catastrophic risk] likely? Surely we need a Pause to be able to do this?

It's a conditional, so we're not claiming it's more likely than not that AI generates a lot of wealth before reaching very high “catastrophic risk potential” (if ever), but I do think it's plausible. One scenario where this looks likely is the one described by Epoch in this post, where AI services are diffusely integrated into the world economy. I think it would be more likely if we do not see something like a software intelligence explosion (i.e. “takeoff” from automating AI R&D). It would also be made more likely by laws and regulations which successfully restrict dangerous uses of AI.

A coordinated pause might block a lot of the wealth-generating effects of AI, if most those effects come from frontier models. But a pause (generally or on specific applications/uses) could certainly make the scenario we mention more likely (and even if it didn't, that in itself wouldn't make it a bad idea).

Expect these to be more likely to cause extinction than a good future? (Given Vulnerable World)

Not sure how to operationalise that question. I think most individual new technologies (historically and in the future) will make the world better, and I think the best world we can feasibly get to at the current technology level is much less good than the best world we can get to with sustained tech progress. How likely learning more unknown unknowns is (in general) to cause extinction is partly a function of whether there are “recipes for ruin” hidden in the tech tree, and then how society handles them. So I think I'd prefer “a competent and well-prepared society continues to learn new unknown unknowns (i.e. novel tech or other insights)” over “we indefinitely stop the kind of tech progress/inquiry that could yield unknown unknowns” over “a notably incompetent or poorly-prepared society learns lots of new unknown unknowns all at once”.

If superintelligence is catastrophically misaligned, then it will take over, and the other challenges won’t be relevant.

I expect we agree on this at least in theory, but maybe worth noting explicitly: if you're prioritising between some problems, one problem completely undermines everything else if you fail on it, it doesn't follow that you should fully prioritise work on that problem. Though I do think the work going into preventing AI takeover is embarrassingly inadequate to the importance of the problem.

It [ensuring that we get helpful superintelligence earlier in time] increases takeover risk(!)

Emphasis here on the “helpful” (with respect to the challenges we list earlier, and a background level of frontier progress). I don't think we should focus efforts on speeding up frontier progress in the broad sense. This appendix to this report discusses the point that speeding up specific AI applications is rarely if ever worthwhile, because it involves speeding up up AI progress in general.

We need at least 13 9s of safety for ASI, and the best current alignment techniques aren't even getting 3 9s...

Can you elaborate on this? How are we measuring the reliability of current alignment techniques here? If you roughly know the rate of failure of components of a system, and you can build in redundancy, and isolate failures before they spread, you can get away with any given component failing somewhat regularly. I think if you can confidently estimate the failure rates of different (sub-)components of the system, you're already in a good place, because then you can build AIs the way engineers build and test bridges, airplanes, and nuclear power stations. I don't have an informed view on whether we'll reach that level of confidence in how to model the AIs (which is indeed reason to be pretty freaked out).

This whole section (the whole paper?) assumes that an intelligence explosion is inevitable.

Sure — “assumes that” in the sense of “is conditional on”. I agree that most of the points we raise are less relevant if we don't get an intelligence explosion (as the title suggests). Not “assumes” as in “unconditionally asserts”. We say: “we think an intelligence explosion is more likely than not this century, and may well begin within a decade.” (where “intelligence explosion” is informally understood as a very rapid and sustained increase in the collective capabilities of AI systems). Agree it's not inevitable, and there are levers to pull which influence the chance of an intelligence explosion.

But could also just lead to Mutually Assured AI Malfunction (MAIM).

Is this good or bad, on your view? Seems more stabilising than a regime which favours AI malfunction “first strikes”?

This [bringing forward the start of the intelligence explosion] sounds like a terrible and reckless idea! Because we don’t know exactly where the thresholds are for recursive self-improvement to kick in.

I agree it would be reckless if it accidentally made a software intelligence explosion happen sooner or be more likely! And I think it's a good point that we don't know much about the thresholds for accelerating progress from automating AI R&D. Suggests we should be investing more in setting up relevant measures and monitoring them carefully (+ getting AI developers to report on them).

Yes, unless we stop it happening (and we should!)

See comment above! Probably we disagree on how productive and feasible a pause on frontier development is (just going off the fact that you are working on pushing for it and I am not), but perhaps we should have emphasised more that pausing is an option.

Problem is knowing that by the time the “if” is verified to have occurred, it could well be too late to do the “then” (e.g. once a proto-ASI has already escaped onto the internet).

I think we're operating with different pictures in our head here. I agree that naive “if-then” policies could easily kick in too late to prevent deceptively aligned AI doing some kind of takeover (in particular because the deceptively aligned AI could know about and try to avoid triggering the “if” condition). But most “if-then” policies I am imagining are not squarely focused on avoiding AI takeover (nor is most of the piece).

Need a moratorium now, not unworkable “if-then” commitments!

it's not clear to me if-then policies are less “workable” than a blanket moratorium on frontier AI development, in terms of the feasibility of implementing them. I guess you could be very pessimistic about whether any if-then commitments would at all help, which it sounds like you are.

This [challenges downstream of ASI] is assuming ASI is alignable! (The whole Not just misalignment section is).

Again, it's true that we'd only face most the challenges we list if we avoid full-blown AI takeover, but we're not asserting that ASI is alignable with full confidence. I agree that if you are extremely confident that ASI is not alignable, then all these downstream issues matters less. I currently think it's more likely than not that we avoid full-blown AI takeover, which makes me think it's worth considering downstream issues.

Thanks again for your comments!

Greg_Colbourn ⏸️ Mar 244

Do you disagree or were we just understanding the claim differently?

I disagree, assuming we are operating under the assumption that GPT-5 means "increase above GPT-4 relative to the increase GPT-4 was above GPT-3" (which I think is what you are getting at in the paper?), rather than what the thing that will actually be called GPT-5 will be like. And it has an "o-series style" reasoning model built on top of it, and whatever other scaffolding needed to make it agentic (computer use etc).

“a notably incompetent or poorly-prepared society learns lots of new unknown unknowns all at once”

I think that is, unfortunately, where we are heading!

"It [ensuring that we get helpful superintelligence earlier in time] increases takeover risk(!)"
Emphasis here on the “helpful”

I think the problem is the word "ensuring", when there's no way we can ensure it. The result is increasing risk when people take this as a green light to go faster and bring forward the time where we take the (most likely fatal) gamble on ASI.

"We need at least 13 9s of safety for ASI, and the best current alignment techniques aren't even getting 3 9s..."
Can you elaborate on this? How are we measuring the reliability of current alignment techniques here?

I'm going by published results where various techniques are reported, and show things like 80% reduction in harmful outputs, 90% reduction in deception, 99% reduction in jailbreaks etc.

Is this good or bad, on your view? Seems more stabilising than a regime which favours AI malfunction “first strikes”?

Yeah. Although an international non-proliferation treaty would be far better. Perhaps MAIM might prompt this though?

but perhaps we should have emphasised more that pausing is an option.

Yes!

But most “if-then” policies I am imagining are not squarely focused on avoiding AI takeover

They should be! We need strict red lines in the evals program^[1].

I currently think it's more likely than not that we avoid full-blown AI takeover, which makes me think it's worth considering downstream issues.

See replies in the other thread. Thanks again for engaging!

^{^}
That are short of things like "found in the wild escaped from the lab"(!)

Greg_Colbourn ⏸️ Mar 182

Typos:

Footnotes 142-144 are out of order (looks like 2 paragraphs have been swapped without their footnote numbers being swapped)
Footnote 151: “‘What Do I Think about Community Notes?’” isn’t referenced (it’s Vitalik’s essay, I guess?)
Footnote 154 truncated mid-sentence? (also Footnote 154 is Footnote 152 in the pdf..)
“conditions under which it does not make sense to punt on early preparation.” - should this be "does make sense"?

finmMar 262

Thanks!

should this be "does make sense"?