Dangerous capability tests should be harder

Luca Righetti 🔸

6 min read · Aug 20, 2024

Comments 1

SummaryBot

Executive summary: Current tests of AI capabilities in dangerous domains like bioweapons are inadequate; we need much more rigorous and realistic tests to justify taking costly preventative actions.

Key points:

Existing AI capability tests in biology are too easy and don't reflect real-world challenges of creating bioweapons.
As AI improves, companies keep making tests harder, but still not hard enough to conclusively demonstrate danger.
A "gold standard" test would involve a randomized trial of non-experts trying to create a (harmless) virus with AI assistance vs. internet resources alone.
We need to agree in advance on tests that are difficult and realistic enough to clearly justify strong preventative actions if passed.
Designing truly hard capability tests is challenging but crucial to do now, before AI potentially becomes extremely powerful.
Focus should shift from proving current AI safety to determining how to identify if future AIs are truly dangerous.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Open Philanthropy funded the development of this benchmark as part of its RFP on difficult benchmarks for LLM agents. (Ajeya Cotra, who edits this blog, was the grant investigator.)

However, as the Future House study notes, a major limitation was that “human evaluators [...] were permitted to utilize tools, whereas the models were not provided with such resources”. Thus, AIs with web search enabled could do a lot better. The model could also perform much better if it is fine-tuned on similar questions.

Clymer et al. (2024) call this an ‘inability argument’ — a safety case that relies on showing that “AI systems are incapable of causing unacceptable outcomes in any realistic setting.”

In cybersecurity risk, Google Project Zero found that upon moving from GPT-3.5-Turbo (in the original paper) to GPT-4-Turbo (with Naptime), AI’s ability to zero-shot discover and exploit memory safety issues hugely improved – going from scoring 2% to 71% on buffer overflow tests. The authors concluded “To effectively monitor progress, we need more difficult and realistic benchmarks, and we need to ensure that benchmarking methodologies can take full advantage of LLMs' capabilities.” In biorisk, UK AISI reported that its “in-house research team analysed the performance of a set of LLMs on 101 microbiology questions between 2021 and 2023. In the space of just two years, LLM accuracy in this domain has increased from ~5% to 60%.” And, as noted, in 2024 AIs performed as well as PhD students on an even more advanced test. They now need to “assess longer horizon scientific planning and execution” and “also [run] human uplift studies”.

As Narayanan and Kapoor note: “Justification is essential to the legitimacy of government and the exercise of power. A core principle of liberal democracy is that the state should not limit people's freedom based on controversial beliefs that reasonable people can reject. Explanation is especially important when the policies being considered are costly, and even more so when those costs are unevenly distributed among stakeholders.”

For example, to keep participants safe, we might task them with creating a virus that we know will be defective and that would, at worst, cause mild, treatable symptoms – such as RSV. An expert could oversee what they do and intervene before anything harmful happens. Furthermore, it seems plausible to separate out some especially dangerous steps and have these completed by a trusted red team working with law enforcement. Examples include steps that involve ideating dangerous designs or bypassing DNA synthesis screening to obtain especially hazardous materials.

For instance, OpenAI’s blueprint for biorisk had participants complete written tasks, and if an expert scored their answers at least 8/10, it was seen as a sign of increased concern. But the authors note this number was chosen fairly arbitrarily and depends heavily on who is doing the judging. Setting a threshold “turns out to be difficult.”

Even here, I imagine that readers might find objections or disagree on how to set things up. Who counts as non-experts? Some viruses are harder to make than others—how do we know what virus to task people with? Would 5% of people succeeding be scary enough to warrant drastic action? Would 50%?
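To make the threshold question concrete, here is a minimal sketch of what a decision rule agreed in advance could look like for a two-arm trial (AI-assisted vs. internet-only). Everything in it is an assumption for illustration: the function names, the choice of a one-sided Fisher's exact test, and the example values of `min_uplift` and `alpha` are hypothetical and not taken from the post or from any existing study.

```python
from math import comb


def fisher_one_sided(success_ai, n_ai, success_ctrl, n_ctrl):
    """One-sided Fisher's exact test.

    Returns the probability of seeing at least `success_ai` successes in
    the AI-assisted arm if AI access made no difference, holding the
    total number of successes and the arm sizes fixed.
    """
    total_success = success_ai + success_ctrl
    total = n_ai + n_ctrl
    denom = comb(total, n_ai)
    upper = min(total_success, n_ai)
    p_value = 0.0
    for k in range(success_ai, upper + 1):
        # Hypergeometric probability of exactly k successes in the AI arm.
        p_value += comb(total_success, k) * comb(total - total_success, n_ai - k) / denom
    return p_value


def uplift_decision(success_ai, n_ai, success_ctrl, n_ctrl, min_uplift, alpha):
    """Apply a pre-registered rule (hypothetical thresholds).

    The rule triggers only if the observed difference in success rates is
    at least `min_uplift` AND the one-sided p-value is below `alpha`.
    Returns (uplift, p_value, triggered).
    """
    uplift = success_ai / n_ai - success_ctrl / n_ctrl
    p_value = fisher_one_sided(success_ai, n_ai, success_ctrl, n_ctrl)
    triggered = uplift >= min_uplift and p_value < alpha
    return uplift, p_value, triggered


if __name__ == "__main__":
    # Illustrative numbers only: 10 of 50 succeed with AI, 2 of 50 without.
    uplift, p_value, triggered = uplift_decision(10, 50, 2, 50, min_uplift=0.10, alpha=0.05)
    print(f"uplift={uplift:.2f}, p={p_value:.4f}, triggered={triggered}")
```

The point of the sketch is only that both numbers (how large an uplift counts, and how much statistical evidence is required) have to be fixed before the trial runs. The open questions above, such as who counts as a non-expert and which virus to use, determine what the success counts mean in the first place.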