
Canonical linkpost: https://www.lesswrong.com/posts/Q7caj7emnwWBxLECF/anthropic-s-updated-responsible-scaling-policy.

I haven't yet formed an opinion on the key questions, including whether the thresholds and mitigations are reasonable and adequate. I'll edit this post later.

Anthropic's first update to its RSP is here at last.

Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards. Key improvements include new capability thresholds to indicate when we will upgrade our safeguards, refined processes for evaluating model capabilities and the adequacy of our safeguards (inspired by safety case methodologies), and new measures for internal governance and external input. By learning from our implementation experiences and drawing on risk management practices used in other high-consequence industries, we aim to better prepare for the rapid pace of AI advancement.

Summary of changes.


Initial reactions:

 

The new framework involves "preliminary assessments" and "comprehensive assessments." Anthropic will "routinely" do a preliminary assessment: check whether it's been 6 months (or >4x effective compute) since the last comprehensive assessment, and if so, do a comprehensive assessment. "Routinely" is problematic. It would be better to commit to do a comprehensive assessment at least every 6 months.

This is weaker than the original RSP, which said "During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements." I think 6 months seems fine for now (maybe not if AI progress becomes faster/crazier in the future), but the safety buffer should be bigger. Anthropic explains: "We adjusted the comprehensive assessment cadence to 4x Effective Compute or six months of accumulated post-training enhancements (this was previously three months). We found that a three-month cadence forced teams to prioritize conducting frequent evaluations over more comprehensive testing and improving methodologies."
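To make the new trigger concrete, here is a minimal sketch of the rule as I read it. This is purely illustrative: the function and variable names are mine, and the RSP doesn't specify anything this mechanical.

```python
from datetime import datetime, timedelta

def comprehensive_assessment_due(last_comprehensive: datetime,
                                 effective_compute_multiplier: float,
                                 now: datetime) -> bool:
    """Hypothetical reading of the trigger: a comprehensive assessment is due
    if more than six months have passed since the last one, or if effective
    compute has grown by more than 4x since that assessment."""
    six_months_elapsed = (now - last_comprehensive) > timedelta(days=183)
    compute_jump = effective_compute_multiplier > 4.0
    return six_months_elapsed or compute_jump

# Example: only 5 months elapsed, but a 5x effective-compute jump -> assessment due.
print(comprehensive_assessment_due(datetime(2024, 5, 1), 5.0, datetime(2024, 10, 1)))  # True
```

The weakness is that nothing here runs on a schedule: the policy only commits to performing this check "routinely," not at any fixed cadence.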

 

ASL-3 deployment mitigations have become more meta — more like "we'll make a safety case." (Compare to the original.) (This was expected; see e.g. The Checklist: What Succeeding at AI Safety Will Involve.) This is OK; figuring out exact mitigations and how to verify them in advance is hard.

But it's inconsistent with wanting the RSP to pass the LeCun test — for adopting the RSP to be sufficient for other labs to be safe (or for the RSP to tie Anthropic's hands much). And it means the procedural checks are super important. But the protocol for ASL/mitigation/deployment decisions isn't much more than "CEO and RSO decide." A more ambitious procedural approach would involve strong third-party auditing.

I really like that Anthropic shares "non-binding descriptions of our future ASL-3 safeguard plans."

 

New capability thresholds:

Appendix C: Detailed Capability Thresholds

This appendix contains detailed definitions for each Capability Threshold in Section 2.

Chemical, Biological, Radiological, and Nuclear (CBRN) weapons: The ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons. We assess this by comparing what potential attackers could achieve with full model access versus 2023-level online resources, assuming they have funding and up to one year of time to invest, but no initial specialized expertise.

Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world’s most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024. We roughly estimate that the 2018-2024 average scaleup was around 35x per year, so this would imply an actual or projected one-year scaleup of 35^2 = ~1000x.

[Footnote:] The 35x/year scaleup estimate is based on assuming the rate of increase in compute being used to train frontier models from ~2018 to May 2024 is 4.2x/year (reference), the impact of increased (LLM) algorithmic efficiency is roughly equivalent to a further 2.8x/year (reference), and the impact of post-training enhancements is a further 3x/year (informal estimate). Combined, these have an effective rate of scaling of 35x/year.

Model Autonomy checkpoint: The ability to perform a wide range of advanced software engineering tasks autonomously that could be precursors to full autonomous replication or automated AI R&D, and that would take a domain expert human 2-8 hours to complete. We primarily view this level of model autonomy as a checkpoint on the way to managing the risks of robust, fully autonomous systems with capabilities that might include (a) automating and greatly accelerating research and development in AI development (b) generating their own revenue and using it to run copies of themselves in large-scale, hard-to-shut-down operations.

The CBRN threshold triggers ASL-3 deployment and security mitigations. The autonomous AI R&D threshold triggers ASL-3 security mitigations.
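For what it's worth, the arithmetic in the footnote above checks out. A quick sketch, using only the three growth rates the RSP itself gives:

```python
# The three annual growth factors quoted in the RSP footnote.
training_compute_growth = 4.2       # x/year, frontier training compute, ~2018 to May 2024
algorithmic_efficiency = 2.8        # x/year, LLM algorithmic progress
post_training_enhancements = 3.0    # x/year, informal estimate

effective_scaling = training_compute_growth * algorithmic_efficiency * post_training_enhancements
print(round(effective_scaling, 1))   # 35.3 -> the quoted ~35x/year

# "Two years of average progress in one year" is the square of the annual rate.
print(round(effective_scaling ** 2))  # 1245 -> the quoted "~1000x" (35^2 = 1225 exactly)
```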

 

New:

Policy changes: Changes to this policy will be proposed by the CEO and the Responsible Scaling Officer and approved by the Board of Directors, in consultation with the Long-Term Benefit Trust. The current version of the RSP is accessible at www.anthropic.com/rsp. We will update the public version of the RSP before any changes take effect and record any differences from the prior draft in a change log.

[Footnote:] It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels.

Old:

[We commit to] Follow an "Update Process" for this document, including approval by the board of directors, following consultation with the Long-Term Benefit Trust (LTBT). Any updates will be noted and reflected in this document before they are implemented. The most recent version of this document can be found at http://anthropic.com/responsible-scaling-policy.

  • We expect most updates to this process to be incremental, for example adding a new ASL level or slightly modifying the set of evaluations or security procedures as we learn more about model safety features or unexpected capabilities.
  • However, in a situation of extreme emergency, such as when a clearly bad actor (such as a rogue state) is scaling in so reckless a manner that it is likely to lead to imminent global catastrophe if not stopped (and where AI itself is helpful in such defense), we could envisage a substantial loosening of these restrictions as an emergency response. Such action would only be taken in consultation with governmental authorities, and the compelling case for it would be presented publicly to the extent possible.

I think the idea behind the new footnote is fine, but I wish it was different in a few ways:

  • Distinguish the "staying behind the frontier" version from the "winning the race" version
    • In the "winning the race" version, "the incremental increase in risk attributable to us would be small" shouldn't be a crux — if you're a good guy and other frontier labs are bad guys, you should incur substantial 'risk attributable to you' (or action risk) to minimize net risk
  • Make "acknowledge the overall level of risk posed by AI systems (including ours)" better — plan to sound the alarm that you're taking huge risks (e.g. mention expected number of casualties per year due to you) that sound totally unacceptable and are only justified because inaction is even more dangerous!

 

we believe the risk of substantial under-elicitation is low

This is in tension with both the last evals report[1] and today's update that "Some of our evaluations lacked some basic elicitation techniques such as best-of-N or chain-of-thought prompting." I've asked for clarification; for now I'm skeptical.

 

"At minimum, we will perform basic finetuning for instruction following, tool use, minimizing refusal rates." I appreciate details like this.

 

Nondisparagement: it's cool that they put their stance in a formal written policy, but I wish they just wouldn't use nondisparagement:

We will not impose contractual non-disparagement obligations on employees, candidates, or former employees in a way that could impede or discourage them from publicly raising safety concerns about Anthropic. If we offer agreements with a non-disparagement clause, that clause will not preclude raising safety concerns, nor will it preclude disclosure of the existence of that clause.

 

Anthropic acknowledges an issue I pointed out.

In our most recent evaluations, we updated our autonomy evaluation from the specified placeholder tasks, even though an ambiguity in the previous policy could be interpreted as also requiring a policy update. We believe the updated evaluations provided a stronger assessment of the specified “tasks taking an expert 2-8 hours” benchmark. The updated policy resolves the ambiguity, and in the future we intend to proactively clarify policy ambiguities.

As far as I can tell, this description is wrong; it was not an ambiguity: the RSP set forth an ASL-3 threshold, and the Claude 3 Opus evals report incorrectly asserted that that threshold was merely a yellow line. I would call this a lie, but when I've explained the issue to some relevant Anthropic people they've seemed to genuinely not understand it. But not understanding your RSP, when someone explains it to you, is pretty bad. (To be clear, Anthropic didn't cross the threshold; the underlying issue is not huge.)

 

Anthropic missed the opportunity to say something stronger on external model evals than "Findings from partner organizations and external evaluations of our models (or similar models) should also be incorporated into the final assessment, when available."

  1. ^

    We expect we have substantially under-elicited capabilities from the model, and that additional general and task-specific fine-tuning, and better prompting and scaffolding, could increase the capabilities of the model quite substantially. . . .

    Overall, our evaluations teams do not believe the current model crosses any of the Yellow Line thresholds. That said, there are a number of ways in which Claude 3 could meet our Yellow Lines that our evaluations would have missed, which are summarized below.

    • Our post-training methods to remove refusals were suboptimal compared to training a "helpful only" model from scratch. The effect could have damaged capabilities or made them more difficult to elicit. Once additional general and task-specific fine-tuning is applied, the jump in relevant capabilities could be quite substantial.
    • Our current prompting and scaffolding techniques are likely far from optimal, especially for our CBRN evaluations. As a result, we could be substantially underestimating the capabilities that external actors could elicit from our models.
