We're Not Ready: thoughts on "pausing" and responsible scaling policies

Holden Karnofsky

Comments 22

Holden - these are reasonable points. But I have two quibbles.

First, the recent surveys of the general public's attitudes towards AI risk suggest that a strongly enforced global pause would actually get quite a bit of support. It's not outside the public's Overton Window. It might be considered an 'extreme solution' by AI industry insiders and e/acc cultists. But the public seems to understand that it's just fundamentally dangerous to invent Artificial General Intelligence that's as smart as smart humans (and much, much faster), or to invent Artificial Superintelligence. AI experts might patronize the public by claiming they're just reacting to sensationalized Hollywood depictions of AI risk. But I don't care. If the public understands the potential risks, through whatever media they've been exposed to, and if it leads them to support a pause, we might as well capitalize on public sentiment.

Second, I worry that EAs generally have a 'policy fetish', in assuming that the only way to slow down a technological field is through formal, government-sanctioned regulation and 'good policy' solutions. I think this is incorrect, both historically and logically. In this piece on moral stigmatization of AI, I argued that an informal, grass-roots, public moral backlash against the AI industry could accomplish almost everything formal regulation can accomplish, without many of the loopholes and downsides that regulation would face. If the general public realizes that AGI-directed research is just fundamentally stupid and reckless and a huge extinction risk, they can stigmatize AI researchers, funders, suppliers, etc in ways that shut down the industry -- potentially for decades. If that public stigmatization goes global, the AI industry globally could be put on 'pause' for quite a while. Sure, we might delay some potential benefits from some narrow AI applications. But that's a tradeoff most reasonable people would be willing to accept. (For example, if my generation misses out on AI-created longevity treatments, and we die, but our kids survive, without facing AGI-imposed extinction risks, that's fine with me -- and I think it would be OK with most parents.)

I understand that harnessing the power of moral stigmatization to shut down a promising-but-dangerous technology like AI isn't the usual EA style, but at this point, it might be the only practical solution to pausing dangerous AI development.

Greg_Colbourn ⏸️

Fully agree. A potential taboo on AGI is something that is far too often overlooked by people who worry about pauses not working well (e.g. see also Scott Alexander, Matthew Barnett, Nora Belrose).

Zed Tarar

This is true--it's the same tactic anti-GMO lobbies, the NRA, NIMBYs, and anti-vaxxers have used. The public as a whole doesn't need to be anti-AI; even a vocal minority will be enough to swing elections and ensure an unfavorable regulatory environment. If I had to guess, AI would end up like nuclear fission--not worth the hassle, but with no off-ramp, no way to unring the alarm bell.

Ryan Greenblatt

First, the recent surveys of the general public's attitudes towards AI risk suggest that a strongly enforced global pause would actually get quite a bit of support. It's not outside the public's Overton Window. It might be considered an 'extreme solution' by AI industry insiders and e/acc cultists. But the public seems to understand that it's just fundamentally dangerous to invent Artificial General Intelligence that's as smart as smart humans (and much, much faster), or to invent Artificial Superintelligence. AI experts might patronize the public by claiming they're just reacting to sensationalized Hollywood depictions of AI risk. But I don't care. If the public understands the potential risks, through whatever media they've been exposed to, and if it leads them to support a pause, we might as well capitalize on public sentiment.

I think the public might support a pause on scaling, but I'm much more skeptical about the sort of hardware-inclusive pause that Holden discusses here:

global regulation-backed pause on all investment in and work on (a) general³ enhancement of AI capabilities beyond the current state of the art, including by scaling up large language models; (b) building more of the hardware (or parts of the pipeline most useful for more hardware) most useful for large-scale training runs (e.g., H100’s); (c) algorithmic innovations that could significantly contribute to (a)

A hardware-inclusive pause which is sufficient for pausing for >10 years would probably effectively dismantle companies like nvidia and would be at least a serious dent in TSMC. This would involve huge job loss and a large hit to the stock market. I expect people would not support such a pause which effectively requires dismantling a powerful industry.

It's possible I'm overestimating the extent to which hardware needs to be stopped for such a ban to be robust and an improvement on the status quo.

Nick K.

I'm not an expert but economic damage seems to me plausibly like a question of implementation details. E.g. if you ask for a stop in hardware improvements at the same time as implementing hardware-level compute monitoring, this likely requires development of new technology to do efficiently which may allow the current companies to maintain their leading position.

Of course, restrictions are going to have some effect, and plausibly may hit Nvidia's valuation but it is not at all clear that the economic consequences would necessarily be dramatic (the situation of the car industry and switching to E.V.'s might be vaguely analogous).

kokotajlod

I think the tech companies -- and in particular the AGI companies -- are already too powerful for such an informal public backlash to slow them down significantly.

Geoffrey Miller

Disagree. Almost every successful moral campaign in history started out as an informal public backlash against some evil or danger.

The AGI companies involve a few thousand people versus 8 billion, a few tens of billions of funding versus 360 trillion total global assets, and about 3 key nation-states (US, UK, China) versus 195 nation-states in the world.

Compared to actually powerful industries, AGI companies are very small potatoes. Very few people would miss them if they were set on 'pause'.

kokotajlod

I hope you are right.

Greg_Colbourn ⏸️

I imagine it going hand in hand with more formal backlashes (i.e. regulation, law, treaties).

Greg_Colbourn ⏸️

Overall I don’t have settled views on whether it’d be good for me to prioritize advocating for any particular policy.⁵ At the same time, if it turns out that there is (or will be) a lot more agreement with my current views than there currently seems to be, I wouldn’t want to be even a small obstacle to big things happening, and there’s a risk that my lack of active advocacy could be confused with opposition to outcomes I actually support.

You have a huge amount of clout in determining where $100Ms of OpenPhil money is directed toward AI x-safety. I think you should be much more vocal on this - at least indirectly by OpenPhil grant making. In fact I've been surprised at how quiet you (and OpenPhil) have been since GPT-4 was released!

evhub

Cross-posted from LessWrong.

One reason I'm critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going.

It's hard to take anything else you're saying seriously when you say things like this; it seems clear that you just haven't read Anthropic's RSP. I think that the current conditions and resulting safeguards are insufficient to prevent AI existential risk, but to say that it doesn't make them clear is just patently false.

The conditions under which Anthropic commits to pausing in the RSP are very clear. In big bold font on the second page it says:

Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.

And then it lays out a series of safety procedures that Anthropic commits to meeting for ASL-3 models or else pausing, with some of the most serious commitments here being:

Model weight and code security: We commit to ensuring that ASL-3 models are stored in such a manner to minimize risk of theft by a malicious actor that might use the model to cause a catastrophe. Specifically, we will implement measures designed to harden our security so that non-state attackers are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense. The full set of security measures that we commit to (and have already started implementing) are described in this appendix, and were developed in consultation with the authors of a forthcoming RAND report on securing AI weights.

Successfully pass red-teaming: World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse. Misuse domains should at a minimum include causes of extreme CBRN risks, and cybersecurity.

Note that in contrast to the ASL-3 capability threshold, this red-teaming is about whether the model can cause harm under realistic circumstances (i.e. with harmlessness training and misuse detection in place), not just whether it has the internal knowledge that would enable it in principle to do so.

We will refine this methodology, but we expect it to require at least many dozens of hours of deliberate red-teaming per topic area, by world class experts specifically focused on these threats (rather than students or people with general expertise in a broad domain). Additionally, this may involve controlled experiments, where people with similar levels of expertise to real threat actors are divided into groups with and without model access, and we measure the delta of success between them.

And a clear evaluation-based definition of ASL-3:

We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things. (By post-training techniques we mean the best capabilities elicitation techniques we are aware of at the time, including but not limited to fine-tuning, scaffolding, tool use, and prompt engineering.)

Capabilities that significantly increase risk of misuse catastrophe: Access to the model would substantially increase the risk of deliberately-caused catastrophic harm, either by proliferating capabilities, lowering costs, or enabling new methods of attack. This increase in risk is measured relative to today’s baseline level of risk that comes from e.g. access to search engines and textbooks. We expect that AI systems would first elevate this risk from use by non-state attackers. Our first area of effort is in evaluating bioweapons risks where we will determine threat models and capabilities in consultation with a number of world-class biosecurity experts. We are now developing evaluations for these risks in collaboration with external experts to meet ASL-3 commitments, which will be a more systematized version of our recent work on frontier red-teaming. In the near future, we anticipate working with CBRN, cyber, and related experts to develop threat models and evaluations in those areas before they present substantial risks. However, we acknowledge that these evaluations are fundamentally difficult, and there remain disagreements about threat models.

Autonomous replication in the lab: The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]. The appendix includes an overview of our threat model for autonomous capabilities and a list of the basic capabilities necessary for accumulation of resources and surviving in the real world, along with conditions under which we would judge the model to have succeeded. Note that the referenced appendix describes the ability to act autonomously specifically in the absence of any human intervention to stop the model, which limits the risk significantly. Our evaluations were developed in consultation with Paul Christiano and ARC Evals, which specializes in evaluations of autonomous replication.

This is the basic substance of the RSP; I don't understand how you could have possibly read it and missed this. I don't want to be mean, but I am really disappointed in these sort of exceedingly lazy takes.

NickLaing

I think calling a take "lazy", which could indeed be considered "mean", is not a very helpful approach; you could have made your point without that kind of derision. There are going to be a lot of misunderstandings and hot takes around RSPs, and I think AI company employees especially should err heavily on the side of patience and kind understanding if they want to avoid people becoming more adversarial towards them.

Live by the sword, die by the sword.

Akash said...

"that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going."

I agree the conditions from the RSP you stated are clearer than I would have expected from reading Akash's comment above. But to be fair to Akash, of the paragraphs you posted above, only the last one seems to state a clear and specific condition for pausing; the others seem to say "refer to experts", which could be considered unclear, to give Akash the benefit of the doubt.

And they don't say how long the pause would be, or the conditions for restarting, either.

Greg_Colbourn ⏸️

There’s a serious (>10%) risk that we’ll see transformative AI² within a few years.
In that case it’s not realistic to have sufficient protective measures for the risks in time.
Sufficient protective measures would require huge advances on a number of fronts, including information security that could take years to build up and alignment science breakthroughs that we can’t put a timeline on given the nascent state of the field, so even decades might or might not be enough time to prepare, even given a lot of effort.
If it were all up to me, the world would pause now

Reading the first half of this post, I feel that your views are actually very close to my own. It leaves me wondering how much your conflicts of interest -

I am married to the President of Anthropic and have a financial interest in both Anthropic and OpenAI via my spouse.

- are factoring into why you come down in favour of RSPs (above pausing now) in the end.

Peter Wildeford

I’m guessing stopping scaling by US POTUS executive order is not even legally possible though? So I don’t think we’d have to worry about that.

dan.pandori 🔸

Legal or constitutional infeasibility does not always prevent executive orders from being applied (or followed). I feel like the US president declaring a state of emergency related to AI catastrophic risk (and then forcing large AI companies to stop training large models) sounds at least as constitutionally viable as the attempted executive order for student loan forgiveness.

I agree that this seems fairly unlikely to happen in practice though.

Zed Tarar

I think you put it well when you said:

"Some people think that the kinds of risks I’m worried about are far off, farfetched or ridiculous."

If I made the claim that we had 12 months before all of humanity is wiped by an asteroid, you'd rightly ask me for evidence. Have I picked up a distant rock in space using radio telescopes? Some other tangible proof? Or is it a best-guess, since, hey, it's technically possible that we could be hit with an asteroid on any given year. Then imagine if I advocate we spend two percent of global GDP preparing for this event.

That's where the state of AGI fear is--all scenarios depend on wild leaps of faith and successive assumptions that build on each other.

I've attempted to put this all in one place with this post.

G J01

Unfortunately, I believe that any pause that comes about might be publicly acknowledged, but there are certain interests that would be far too happy to drive any further development underground. The potential for shadow R and D would only cause others to also continue, leading to a situation whereby the likelihood of ANY hope of regulation would disappear completely. I think the threat is real and already an existential one. AI is "in the system" already and developing itself into something we cannot possibly imagine.

Don't forget the Google or Microsoft experiment decades ago when 2 AI were talking to each other and created a language that the onlookers couldn't understand...so they switched it off. If you study the words that were initially being used by the AI algorithm it can be seen as trying to identify itself between subject, object and verb. In short, AI was even then developing self awareness.
We are decades behind the curve here.

Seth Herd

I'm not sure which is the better place to have this discussion, so I'm trying both. Copied from my comment on Less Wrong:

That all makes sense. To expand a little more on some of the logic:

It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.

I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.

On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs on AGI risk. The evidence for climate change was also pretty strong, but we saw half of America believe in it less, not more, as evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.

In that case, I think a partial pause would have a negative expected value, as the current lead decayed, and more people who believe in risks less get into the lead by circumventing the pause.

This makes me highly unsure if a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the taxes are too high.

The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.

This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.

It does seem clear that creating mechanisms and political will for a pause is a good idea.

Advocating for more safety work also seems clear cut.

To this end, I think it's true that you create more political capital by successfully pushing for policy.

A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now when even most alignment folks think we're years from AGI.

So perhaps the low-hanging fruit is pushing for voluntary RSPs, and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.

There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.

Geoffrey Miller

Seth - you mentioned that 'we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.'

I assume you're referring to OpenAI, DeepMind, and Anthropic.

Yes, they all give lip service to AI safety, and they hire safety researchers, and they safety-wash their capabilities development.

But I see no evidence that they would actually stop their AGI development under any circumstances, no matter how risky it started to seem.

Maybe you trust their leadership. I do not. And I don't think the 8 billion people in the world should have their fates left in the hands of a tiny set of AI industry leaders - no matter how benevolent they seem, or how many times they talk about AI safety in interviews.

Seth Herd

I agree that those teams aren't completely trustworthy, and in an ideal world, we should be making this decision by including everyone on earth. But with a partial pause, do you expect to have better or worse teams in the lead for achieving AGI? That was my point.

Geoffrey Miller

Well from an AI safety viewpoint, the very worst teams to be leading the AGI rush would be those that (1) are very competent, well-funded, well-run, and full of idealistic talent, and (2) don't actually care about reducing extinction risk -- however much lip service they pay to AI safety.

From that perspective, OpenAI is the worst team, and they're in the lead.

Seth Herd

I think that's quite a pessimistic take. I take Altman seriously on caring about x-risk, although I'm not sure he takes it quite seriously enough. This is based on public comments to that effect around 2013, before he started running OpenAI. And Sutskever definitely seems properly concerned.
