Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination

LintzA

Zach Stein-Perlman

Great post!

A related framing I like involves two 'pillars,' reduce the alignment tax (similar to your pillar 1) and pay the alignment tax (similar to your pillars 2 & 3). (See Current Work in AI Alignment.)

We could also zoom out and add more necessary conditions for the future to go well. In particular, eventually achieving AGI (avoiding catastrophic conflict, misuse, accidents, and non-AI x-risks) and using AGI well (conditional on it being aligned) carve nature close to its joints, I think.

MaxRa

Thanks again for writing this up! Just a random thought, have you considered what happens when you loosen this assumption:

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

I'm thinking about scenarios where humanity is able to keep the first 1 to 2 generations of AGI under control (e.g. by restricting applications, by using sufficiently good interpretability to detect most deception, due to very gradual capability increases).

Some spontaneous thoughts on which additional pillars might be interesting then:

Coordination, but focussed more on labs sharing incidents, insights, tools
Humanity's ability to detect and fight power-seeking agents
- Generic state capacity
- Generic international cooperation
- Cybersecurity to prevent rogue agents getting access to resources and weapons, to prevent debilitating cyberattacks
- Surveillance capabilities
- Robustness against bioweapons

SammyDMartin

Great post!

Check whether the model works with Paul Christiano-type assumptions about how AGI will go.

I had a similar thought reading through your article. My gut reaction is that your setup can be made to work as-is with a more gradual takeoff story (with more precedents, warning shots, and general transformative effects of AI before we get to takeover capability), but it's a bit unnatural and some of the phrasing doesn't quite fit.

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

Paul says rather that e.g.

The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.

On his view (and this is somewhat similar to my view) the background assumption is more like, 'deploying your first critical try (i.e. an AGI that is capable of taking over) implies doom', which is saying that there is an eventual deadline where these issues need to be sorted out, but lots of transformation and interaction may happen first to buy time or raise the level of capability needed for takeover. So something like the following is needed:

Technical alignment research success by the time of the first critical try (possibly AI assisted)
Safety-conscious deployment decisions when we reach the critical point where dangerous AGI could take over (possibly assisted by e.g. convincing public demonstrations of misalignment)
Coordination between potential AI deployers by the critical try (possibly aided by e.g. warning shots)

On the Paul view, your three pillars would still eventually have to be satisfied, to reach a stable regime where unaligned AGI cannot pose a threat. But we would only need to get to those 100 points after a period where less capable AGIs are running around either helping or hindering: motivating us to respond better, or causing damage that degrades our response, to varying extents depending on how we respond in the meantime and exactly how long the AI takeoff period lasts.

Also, crucially, the actions of pre-AGI AI may push the point where the problems become critical to higher AI capability levels, as well as potentially assisting on each of the pillars directly, e.g. by making takeover harder in various ways. But Paul's view isn't that this is enough to postpone the need for a complete solution forever: rather, the effects of pre-AGI AI 'could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal'.

This adds another element of uncertainty and complexity to all of the takeover/success stories that makes a lot of predictions more difficult.

Essentially, the time/level of AI capability at which we must reach 100 points to succeed also becomes a free variable in the model that can move up and down, and we also have to consider the shorter-term effects of transformative AI on each of the pillars as well.

LintzA

Thanks for this!

My thinking has moved in this direction as well somewhat since writing this. I'm working on a post which tells a story more or less following what you lay out above - in doc form here: https://docs.google.com/document/d/1msp5JXVHP9rge9C30TL87sau63c7rXqeKMI5OAkzpIA/edit#

I agree this danger level for capabilities could be an interesting addition to the model.

I do feel like the model remains useful in my thinking, so I might try a re-write + some extensions at some point (but probably not very soon)

Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination — EA Forum

Epistemic status: This model is loosely inspired by a conversation with Nate Soares but has since warped into something perhaps significantly different from his original intent. I think this model is a useful thinking tool when it comes to examining potential interventions to mitigate AI risk and getting a grasp of the problem we face.

Summary

The three pillars model attempts to describe the conditions needed to successfully avoid the deployment of unaligned AGI. It proposes that, to succeed, we need to achieve some sufficient combination of success on all three of the following:

Technical alignment research
Safety-conscious deployment decisions
Coordination between potential AI deployers.

While how difficult success is depends on the difficulty of solving any given pillar, this model points toward why we may well fail to avoid AGI catastrophe: we need to simultaneously succeed at three difficult problems.

More generally, the model aims to help those concerned about AI risk flesh out our mental pictures of what success on managing AGI looks like. In particular, it suggests that a strategy aimed solely at a single pillar is unlikely to be sufficient and that our community might need to take ambitious actions in several directions at once.

The three pillars

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

We need to attain sufficient success on some combination of three pillars in order to attain a future where AGI does not kill us: technical safety, safety-conscious deployment decisions, and coordination between deployers. Success on any given pillar can, to some extent, serve as a substitute for success on another. For example, we could be extremely successful on one pillar, quite successful on two, or fairly successful on all three in order to avoid doom. We do generally need at least a minimal amount of progress on each pillar to succeed — for example, even a low-cost and easy-to-implement technical solution to alignment still needs to be adopted by leading AI developers.

One conceptualization which might be useful for driving intuition is to set the bar for victory at 100 pillar points. We can have partial success on all pillars: e.g. 33, 33, and 34 from pillars 1, 2, and 3 respectively; or we can get almost everything from one pillar: e.g. a 90-5-5 split with pillar 1 supporting most of the weight.
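As a rough illustration of the pillar-points idea, here is a minimal Python sketch. The threshold of 100 comes from the text above; the per-pillar floor of 1 point is purely a hypothetical stand-in for "at least a minimal amount of progress on each pillar", and the function name is made up for this example.

```python
from typing import Sequence

THRESHOLD = 100      # bar for victory, as described in the text
MIN_PER_PILLAR = 1   # hypothetical floor; the text only says "minimal amount"


def avoids_catastrophe(
    points: Sequence[float],
    threshold: float = THRESHOLD,
    floor: float = MIN_PER_PILLAR,
) -> bool:
    """Return True if every pillar clears the floor and the total reaches the threshold."""
    if len(points) != 3:
        raise ValueError("expected one score per pillar (three in total)")
    return all(p >= floor for p in points) and sum(points) >= threshold


print(avoids_catastrophe([33, 33, 34]))  # True: balanced partial success
print(avoids_catastrophe([90, 5, 5]))    # True: pillar 1 carries most of the weight
print(avoids_catastrophe([100, 0, 0]))   # False: no progress at all on two pillars
print(avoids_catastrophe([30, 30, 30]))  # False: total falls short of 100
```

This toy version treats the pillars as simply additive and ignores the interactions between them, so it is a way to hold the intuition rather than a prediction tool.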

In practice, we should expect progress on the pillars to be correlated. Success on each pillar is more likely if we’re in a world with more competence, more well-calibrated AGI risk awareness, a more effective and influential longtermist community, etc. In addition, success on one pillar can increase the return-on-investment in other pillars. For example, if technical success were to give us the foolproof ability to mind-read AGIs, it would be trivial to succeed on deployment decisions because it would be obvious whether an AGI is or isn’t aligned. So, on top of the 60 pillar points we get for achieving robust mind-reading, we could then get 60 points on deployment decisions without much extra effort by spreading knowledge about how to mind-read.

Success could happen in a number of ways, but I have provided some specific examples for each pillar. I’ve chosen to give examples of three levels of success/failure for each pillar to give intuition about the pillar (though note there is actually a continuous success/failure spectrum). These levels are as follows:

Strong success is a success which largely removes the need for success on other pillars. ^[1]
Partial success needs to be matched with other partial successes for overall success to occur
Failure means that we need strong success on one or more of the other pillars in order to survive

Pillar 1: Technical alignment

The extent to which actors are able to make technical progress on the alignment problem (i.e. creating an AGI that won’t kill us all when deployed), and the ability to identify whether we’ve solved it (i.e. recognizing if a system is misaligned before deployment).

Strong success: Someone solves the AGI alignment problem before anyone can deploy. The solution is totally accessible and has no clear drawbacks.
Partial success: A group of highly talented people work on the right problem^[2] and are able to develop a solution a few months or years after the first actor is able to deploy AGI. They have the ability to determine whether or not the solution will indeed produce an aligned AGI.
Failure: We either fail to work on the right problem or else don’t work on the problem at all. A solution to the alignment problem at this pace might take decades after the first actor could deploy AGI.

Pillar 2: Safety-conscious deployment decisions

The extent to which organizations capable of deploying AGI are (a) cognizant of AI risk and (b) able to decide whether to deploy a system (including deciding not to deploy) based on whether or not they think it’s safe to deploy.

Strong success: Actors who could deploy AGI are sensitive to the risks and are willing to act on the belief that AGI is potentially dangerous. Leaders cognizant of the immense complexity and scale of the risks are in control of systems which could be deployed and are always able to discern between true and false solutions to the AGI alignment problem.^[3] They consistently make the right decision of when to not deploy.
Partial success: Actors able to deploy AGI are all fairly competent at discerning true and false solutions to alignment. They listen carefully to their team about which solutions are likely to work and adopt the latest tools for doing so.
Failure: Of the set of actors that (a) end up having the option to deploy unsafe AGI and/or (b) are perceived to have a decent chance of ending up with that option, one or more actors would choose to deploy an AGI system that is in fact unsafe.^[4] The actor might not recognize that the system is unsafe or might not care.

Pillar 3: Coordination between potential AI deployers

The extent to which conditions and relations between potential deployers of AGI favor or disfavor deploying unaligned AGI.

Strong success: The international community is able to agree nearly unanimously that deploying unaligned AGI would be disastrous. They institute strong controls on compute access and limit the number of actors with access to enough compute to build AGI.
Partial success: We attain a situation which doesn’t unduly pressure decision-makers to deploy unsafe AGI. This could include, for instance, cooperation between labs or a strong lead by one lab.
Failure: There are strong pressures to deploy unsafe AGI systems or efforts to do the right thing are blocked.

Hypothetical scenarios (for illustration)

These very rough hypothetical future scenarios are intended both to illustrate what the above pillars mean by sketching examples of how each could succeed or fail, and to illustrate some plausible failure and success scenarios.

Failure on one pillar, and partial success in others, leads to overall failure

Partial technical and deployment decision success, but coordination failure

Scenario 1: DeepMind has a giant team working on alignment. They are properly incentivized and are almost ready to deploy an aligned AGI. DeepMind gets inaccurate intel that Microsoft has an AGI that they are very nearly ready to deploy. Demis Hassabis, CEO of DeepMind, reasons that DeepMind’s AGI is more likely to go right than the one created by Microsoft, and so he chooses to deploy even though he’s not quite sure it’s ready. The unaligned AGI becomes impossible to control and kills us all.

Scenario 2: Google, OpenAI, Anthropic, and a few other major AI companies get together and form a coalition. With their combined resources they are dramatically ahead of the competition. They have competent leadership that can distinguish between true and false solutions to the alignment problem. The US government abolishes the coalition because it violates antitrust law. The various parties split apart. One party chooses to deploy an AGI early, fearing that others will do the same. Humanity loses.

Partial technical and coordination success, but deployment decisions failure

Scenario 3: The US government dominates the AI ecosystem. They’ve managed to create a surveillance system which can tell them who is building AGI, and they attack anyone who does with immense cyber offensive capabilities. They’re quite sure their projects are the only ones capable of AGI. A new president is elected who has optimistic views on AGI safety. He believes that we need only be nice to the AGI while training it and then it will be safe to deploy.^[5] He commands the most submissive lab to train up such an AGI and deploy it. Humanity loses.

Partial coordination and deployment decision success, but technical failure

Scenario 4: The US government dominates the AI ecosystem. They’ve got an eye on all the competing projects and are shutting them down as they crop up. They’ve got people working round the clock on technical safety solutions. The lead AGI lab is fairly responsible, but after several years of being able to deploy AGI and choosing not to, pressure increases. The team doesn’t have access to tools which can tell them whether the AGI will be misaligned, nor does it have a foolproof alignment technique. Under public pressure after a decade of holdups, the leader of the AI lab eventually declares that he’s pretty sure he has trained an AGI which will not kill us all. He deploys and it kills us all.

Modest success in all pillars has unclear results

Sufficient partial success on all pillars

Just a few labs are clearly in the lead. They are fairly sure of their lead, are inclined to spend time aligning AGI before deploying it, and have some degree of cooperation among them. As pressure mounts to deploy AGI one year after the labs are capable of doing so, OpenAI develops a solution to technical alignment that is easily understood and verified by third parties. OpenAI deploys an aligned AGI.

Insufficient partial success on all pillars

Just a few Western labs are clearly in the lead. They are fairly sure of their lead, are inclined to spend time aligning AGI before deploying it, and have some degree of cooperation among them. Quality work is ongoing to solve the technical alignment problem but two years after leading labs are capable of deploying AGI, there is no foolproof solution in sight. Feeling threatened by the potential for a US-dominated post-AGI world, China starts its own secret project to build AGI. They are able to copy a leading lab’s model and, cognizant of the risks, decide it’s worth taking the chance to deploy the model even though alignment is not certain. The AI is unaligned and humanity loses.

Strong success on one pillar is unlikely, but could lead to overall success

Very strong technical success sets a very low bar for deployment decision and coordination success

Scenario 5: Researchers at MIRI stumble across a total solution to the alignment problem. It can be easily adapted to new systems and they send it out to every lab in the world. Despite intense competition and poor leadership, leaders are able to recognize that this is in fact a solution to the problem and that applying it to their systems is a good idea. Labs apply it and Baidu deploys an aligned AGI.

Strong deployment decision success sets a low bar for technical and coordination success

Scenario 6: The US government and Chinese government are totally bought into AGI risk and treat it as a truly existential concern. They get their competence hats on and are able to pick leaders for their national projects who can distinguish between true and false solutions to the alignment problem. But the two countries fail to come to any real between-country agreements, and multilateral efforts fail. Also, barely anyone works on the right problem. A mutually assured destruction (MAD) dynamic develops where the US and China both know they can destroy the world by releasing their AGI but they do not. This leads to a stalemate for decades where AGI deployment is seen as the most dangerous possible thing and hope for alignment is low. After 50 years like this we figure out how to make emulated minds and they are able to figure out the solution to the alignment problem. The US deploys an aligned AGI.

How likely are we to succeed?

A key factor for determining whether things are likely to go well, and on which pillars we should focus our attention, is how easy or difficult each pillar is to solve. It’s possible that one or more pillars are trivial and it’s also possible that one or more pillars are impossible.^[6]
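The footnote's caution about correlation can be made concrete with a toy calculation. The sketch below assumes a hypothetical two-state world: with probability `q` we are in a "competent" world where each pillar succeeds with probability `p_hi`, and otherwise each succeeds with probability `p_lo`. All of these numbers and names are illustrative assumptions, not estimates from this post; they are chosen so that each pillar has a 10% chance of success on its own.

```python
def marginal_and_joint(q: float, p_hi: float, p_lo: float, n_pillars: int = 3):
    """Return (per-pillar success probability, probability that all pillars succeed).

    Pillars are independent *given* the state of the world, but correlated
    overall because they all depend on the same shared state.
    """
    marginal = q * p_hi + (1 - q) * p_lo
    joint = q * p_hi ** n_pillars + (1 - q) * p_lo ** n_pillars
    return marginal, joint


# Hypothetical inputs: 10% chance of a competent world, where pillars usually succeed.
marginal, joint = marginal_and_joint(q=0.1, p_hi=0.82, p_lo=0.02)

print(f"per-pillar success probability: {marginal:.3f}")
print(f"naive product (assumes independence): {marginal ** 3:.4f}")
print(f"joint success with shared world state: {joint:.4f}")
```

Under these made-up inputs, the joint probability comes out far higher than the naive product of the three marginals, because most of the success mass sits in the same worlds. The direction of the effect is the point here; the size depends entirely on the assumed numbers.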

Some variables that affect the difficulty of the pillars

There are many variables which affect the difficulty of solving the pillars, some of which we can intervene on in order to make the problem ahead easier. Below are some examples of variables that could affect the difficulty of pillars, and my rough best-guess at their overall effect:

How long we have: If we have 10 years, it will likely be harder to reach sufficient success than if we have 50. This is because we have less time to solve the technical alignment problem, ensure buy-in on AGI risk in top labs and governments, and set up coordination infrastructure. On the other hand, there would also probably be fewer relevant actors to coordinate and it may be more tractable to influence those actors, thus making coordination potentially easier.
Number of actors: The more actors potentially able to build AGI, the more difficult the coordination problem is and the more individual labs need to succeed at deployment decision-making. The technical problem might move faster, however, with many parties working on it at once.
Ease of high-quality interpretability: If high-quality interpretability is an easy problem, it would be much easier to solve the technical problem and much easier to determine whether a given alignment solution is true. Coordination would probably also be easier because we would be more likely to have clear warning signs to point to about AGI risk.

How to use the model

Testing paths to victory

One way to use this model is to get a sense of what combinations of strategies might be necessary for success. That is, the model could facilitate building more complex theories of victory.

For example, perhaps the longtermist community needs to simultaneously:

Get the US government fully bought-in on AGI risk so that they are willing to coordinate or stop the proliferation of dangerous models
Have someone in the room where deployment decisions are made at the leading labs to ensure proper risk aversion
Create a highly functional AI safety research ecosystem populated by geniuses that we found by scouting the world for talent.

Ideally, our community would have multiple redundant efforts to partially or fully solve each pillar.

Describing strategic views

We can describe a range of strategic views on AI risk with a few parameters:

What return on effort (RoE) you expect to get for each pillar
- Table 2 in this post by Matthijs Maas lists strategies that make sense under different views of technical vs governance tractability.
- E.g. Many technical safety people think we get far more bang for our buck with technical research, and that affecting coordination and deployment decisions is too hard to be worth trying.
Whether you think we’re likely to reach a sufficient level of success on each pillar
- E.g. My understanding is that the MIRI view can be operationalized as believing that the pillars are all very hard to attain in practice, though not terribly hard in principle. A more competent world might well be able to attain sufficient success on the pillars, but we do not live in that world. As such, we’re not on track for total success on any given pillar, and getting sufficient partial success on all three pillars is similarly unlikely given current efforts and approaches.^[7]
- E.g.: Some people in AI governance might focus on coordination because they have faith in institutions’ ability to recognize the importance of AI risk over time, but think that even in worlds where this gives us technical and deployment decision success, humanity might struggle to sufficiently manage competitive pressures.

In practice, teasing out the implications of pillar difficulty is not obvious. If you believe a particular pillar is necessary for success but not very tractable, it might still be worth working on. By contrast, if you think that each pillar is likely to meet the minimum viable threshold, it might make more sense to work on particularly tractable areas. Since we don’t know exactly what minimum viability looks like, prioritization is hard!

Some interventions look more promising when we consider their contributions across multiple pillars. For example, building consensus about the importance of AI safety among technical ML researchers is likely to get some important decision-makers on board, and could make it easier to develop an international “epistemic community” around AI risk reduction, improving international coordination.

Imperfections of the model and future research

I have some intuition that upon further scrutiny this model might break down. However, I do think it’s a useful thinking tool! As the saying goes, “all models are wrong, some models are useful”. The main issue is that all the pillars are deeply interconnected. For instance, progress on technical issues will be crucial for helping decision-makers determine whether a given alignment solution is real. Also, labs that have AGI-concerned leaders are more likely to coordinate.

I think there are a lot of potentially interesting things to do with the model in follow-up work. For example:

Build a formal version of this model that allows for more subtle manipulation and prediction.
Clarify the assumptions. E.g. Can we say that deploying aligned AGI is total success?
Make the pillars more distinct and mutually exclusive.
Dive more deeply into each pillar and explore how different changes to the world affect the likelihood that we can solve the problem.
Test the model against some proposed solutions or bundles of solutions to see how those solutions fare.
Check whether the model works with Paul Christiano-type assumptions about how AGI will go.
What are the thresholds we need to achieve on each pillar? Can the model be characterized as success needing to sum to, e.g., 100 for us to succeed? I.e., could we say that getting 70% of the way to total success on technical, 20% on deployment decisions, and 10% on coordination would be just enough to carry us over the finish line? Or are some pillars notably more important than others? Is the concept of 50% success even meaningful?

Acknowledgements: Ashwin Acharya and Michael Aird for significant feedback, Nate Soares for providing the initial idea, as well as Abi Olvera, Fynn Heide, Max Räuker, and Ben Cottier.

^[1] Though we still can’t drop the ball entirely on the other pillars, we are just at a point where business as usual is probably fine.

^[2] For Nate Soares’ take on what the right problem is, see his post here: On how various plans miss the hard bits of the alignment challenge

^[3] While the ability to determine whether an alignment solution will actually work is a technical problem, the leadership of the organization deciding whether to deploy AGI will face the problem of figuring out who to listen to as to whether the system is aligned and/or doing the technical thinking themselves.

^[4] This set of actors could be kept small by (a) one actor deploying safe AGI in such a way that prevents other actors from deploying any AGI or from deploying unsafe AGI specifically and (b) there being a large (perceived) lead in AI development between one or a small set of actors and all other actors.

The expected behavior of actors who might end up with the option of deploying unsafe AGI matters because that may affect what other actors do, such as how much those other actors cut corners to get to AGI first.

^[5] There are apparently real, powerful people who have this opinion so it’s not as ridiculous as it sounds.

^[6] Note that to calculate a probability of success it’s not as simple as multiplying 10%*10%*10% to get the correct odds. This is because success and failure on different pillars are correlated.

^[7] This framing really helped me understand why MIRI folk tend to be extremely pessimistic about our odds of survival.
