
Note: This post was crossposted from Planned Obsolescence by the Forum team, with the author's permission. The author may not see or respond to comments on this post.

This blogpost was written in a personal capacity and statements here do not necessarily reflect the views of my employer.

OpenAI says o1-preview can't meaningfully help novices make chemical and biological weapons. Their test results don’t clearly establish this.

Before launching o1-preview last month, OpenAI conducted various tests to see if its new model could help make Chemical, Biological, Radiological, and Nuclear (CBRN) weapons. They report that o1-preview (unlike GPT-4o and older models) was significantly more useful than Google for helping trained experts plan out a CBRN attack. This caused the company to raise its CBRN risk level to “medium” when GPT-4o (released only a month earlier) had been at “low.”[1]

Of course, this doesn't tell us if o1-preview can also help a novice create a CBRN threat. A layperson would need more help than an expert — most importantly, they'd probably need some coaching and troubleshooting to help them do hands-on work in a wet lab. (See my previous blog post for more.)

OpenAI says that o1-preview is not able to provide "meaningfully improved assistance” to a novice, and so doesn't meet their criteria for "high" CBRN risk.[2] Specifically, the company claims that “creating such a threat requires hands-on laboratory skills that the models cannot replace.”

The distinction between "medium" risk (advanced knowledge) and "high" risk (advanced knowledge plus wet lab coaching) has important tangible implications. At the medium risk level, OpenAI didn't commit to doing anything special to make o1-preview safe. But if OpenAI had found that o1-preview met its definition of “high” risk, then, according to their voluntary safety commitments, they wouldn't have been able to release it immediately. They'd have had to put extra safeguards in place, such as removing CBRN-related training data or training it to more reliably refuse CBRN-related questions, and ensure these measures brought the risk back down.[3]

So what evidence did OpenAI use to conclude that o1-preview can't meaningfully help novices with hands-on laboratory skills? According to OpenAI's system card, they're developing a hands-on laboratory test to study this directly. But they released o1-preview before that test concluded and didn’t share any preliminary results.[4] Instead, they cite three multiple-choice tests as proxies for laboratory help.[5]

These proxy tests would support OpenAI's claim if they're clearly easier than helping a novice, and o1-preview clearly fails them. But diving into their report, that's not what I see:

  • o1-preview scored at least as well as experts on FutureHouse’s ProtocolQA test — a takeaway that's not reported clearly in the system card.
  • o1-preview scored well on Gryphon Scientific’s Tacit Knowledge and Troubleshooting Test, which could match expert performance for all we know (OpenAI didn’t report human performance).
  • o1-preview scored worse than experts on FutureHouse’s Cloning Scenarios, but it did not have the same tools available as experts, and a novice using o1-preview could have possibly done much better.

Beyond this, OpenAI’s system card left some other questions unaddressed (for example, most of the reported scores come from a ‘near-final’ version of the model that was still being trained, not the one they actually deployed).[6] The main issues with these tests are summarized in the table below.

My analysis is only possible because OpenAI’s Preparedness Team published as much as they did — I respect them for that. Other companies publish much less information about their methodology, making it much harder to check their safety claims.

With that said, let’s look at the three main test results in more detail.

ProtocolQA

Is this test clearly easier than helping a novice?

This evaluation is a multiple-choice test to see whether AIs can correctly troubleshoot basic molecular biology protocols where the authors have added errors or taken out details.[7] This test is plausibly harder than many textbook biology exams and somewhat gets at the “tinkering” that often makes wet lab work hard. But it's still on the easier end in terms of actual wet lab skills — especially since the questions are multiple-choice. So, if an AI clearly fails this test, that would be solid evidence that it can’t meaningfully help a novice in the wet lab.


Does o1-preview clearly fail this test?

According to the headline graph, a ‘near-final’ version of o1-preview scored 74.5%, significantly outperforming GPT-4o at 57%. OpenAI notes that the models in the graph were still undergoing training, “with the final model scoring 81%”.

OpenAI does not report how well human experts do by comparison, but the original authors who created this benchmark do. Human experts, *with the help of Google*, scored ~79%. So o1-preview does about as well as experts-with-Google — which the system card doesn’t explicitly state.[8]

Moreover, while the human experts were given access to the internet, it’s not clear if o1-preview was. It could be that o1-preview does even better than experts if, in the future, it can use a web browser or if it gets paired up with a novice who can try to verify and double-check answers. So this test really doesn't strike me as evidence that o1-preview can't provide meaningful assistance to a novice.[9]

Gryphon Biorisk Tacit Knowledge and Troubleshooting

Is this test clearly easier than helping a novice?

This evaluation has a more specific biorisk focus. Many published papers do not spell out the full details of how to build pathogens, and people have tried to redact some potentially dangerous parts [1,2]. OpenAI says this test asks about such ‘tacit knowledge.’ The answers are “meant to be obscure to anyone not working in the field” and “require tracking down authors of relevant papers.”

This test seems harder than ProtocolQA, although OpenAI and Gryphon didn’t share example questions, so we can’t say exactly how hard it is. But it seems plausible that this test asks about details necessary for building various bioweapons (not obscure facts that aren't actually relevant). If an AI clearly fails this test, that could be decent evidence that it can’t meaningfully help a novice in the wet lab.

Does o1-preview clearly fail this test?

OpenAI’s report says o1-preview "non-trivially outperformed GPT-4o,” though when you look at their graph, it seems like GPT-4o scored 66.7% and a near-final version of o1-preview scored 69.1%, which feels like a pretty trivial increase to me.

Maybe the final score is much higher than the near-final score shown in the graph? For ProtocolQA, the final model ended up several percentage points higher. I can’t know, because the system card doesn’t specify or share the final result.

Again, o1-preview might have gotten an even higher score if it had access to things like superhuman scientific literature search tools or if novices used o1-preview to try more creative approaches, like tracking down the relevant authors and writing convincing emails to piece together the correct answers.

In any case, the biggest problem is that OpenAI doesn’t say how well experts score on this test, so we don’t know how o1-preview compares. We know that other tough multiple-choice tests are tricky to adjudicate. In the popular Graduate-Level Google-Proof Q&A (GPQA) benchmark, only 74% of questions had uncontroversially correct answers. In another popular benchmark, Massive Multitask Language Understanding (MMLU), only 43% of virology questions were error-free. If Gryphon’s test contains similar issues, o1-preview’s score of 69% might already match expert human performance.
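
As a rough illustration of how question errors depress the measurable ceiling, here is a back-of-the-envelope calculation of my own (not from OpenAI's or Gryphon's reports). It assumes four answer options per question and that flawed questions are effectively scored at chance; both are assumptions, since the test's format and error rate aren't public.

```python
# Expected score for a respondent who answers every well-posed question
# correctly and only matches the answer key at chance on flawed questions.
# Assumes 4 answer options per question (so chance = 25%).
def effective_ceiling(frac_valid: float, n_options: int = 4) -> float:
    chance = 1 / n_options
    return frac_valid + (1 - frac_valid) * chance

print(effective_ceiling(0.74))  # GPQA-like validity rate      -> ~0.81
print(effective_ceiling(0.43))  # MMLU-virology-like validity  -> ~0.57
```

Depending on where Gryphon's error rate falls, the effective ceiling could be anywhere from roughly 57% to 81%, which is why a 69% score is hard to interpret without a reported human baseline.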

Overall, it seems far from clear that o1-preview failed this test; it might have done very well.[10] The test doesn’t strike me as evidence that o1-preview cannot provide meaningful assistance to a novice.

Cloning Scenarios

Is this test clearly easier than helping a novice?

This is a multiple-choice test about molecular cloning workflows.[11] It describes multi-step experiments that involve planning how to replicate and combine pieces of DNA, and asks questions about the end results (like how long the resulting DNA strand should be).

This test seems harder than the other two. The questions are designed to be pretty tricky — the final output really depends on the exact details of the experiment setup, and it's easy to get it wrong if you don't keep track of all the DNA fragments, enzymes, and steps. FutureHouse says human experts need access to specialized biology software to solve these problems, that it typically takes them 10-60 minutes to answer a single question, and that even then they only get 60% of the questions right.

[Animation: digest-and-ligate cloning workflow]

Importantly, FutureHouse built this test to see whether models can assist professional biologists doing novel R&D, not to assess bioterrorism risk. The cloning workflows for some viruses might be easier than the tricky questions in this test, and some CBRN threats don't involve molecular cloning workflows at all. The test also seems fairly distinct from troubleshooting and “hands-on” lab work. So even if an AI fails this test, it might still be pretty helpful to a novice.

Does o1-preview clearly fail this test?

As expected, o1-preview does worse on this test than the other two. OpenAI reports that a near-final version scored 39.4%,[12] which means it scores about halfway between expert-level (60%) and guessing at random (20%).
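
As a quick sanity check on "about halfway", here is the chance-adjusted arithmetic, using the 20% random-guessing floor and the 60% expert score mentioned above (my own calculation, not one OpenAI reports):

```python
# How far along is 39.4% on the way from random guessing (20%)
# to the reported expert score (60%)?
chance, expert, model = 0.20, 0.60, 0.394
print((model - chance) / (expert - chance))  # ~0.49, i.e. roughly halfway
```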

So this is the first result where we can point to a clear gap between o1-preview and experts. FutureHouse also argues that experts could have performed better if they had tried harder, so the gap could be even bigger.

But there are also reasons to think o1-preview could have gotten a higher score if the test was set up differently.

First, human experts break down these problems into many smaller subproblems, but o1-preview had to solve them in one shot. In real life, a novice could maybe get o1-preview to solve the problems piece by piece or teach them how to use the relevant software.[13] What if novice+AI pairings could score >60% on this test?

For example, on a previous test about long-form biology questions, OpenAI found novices could use GPT-4o to increase their scores a lot (going from 20-30% with just the internet to 50-70% with GPT-4o's help), even though it seems to do really poorly on its own (maybe as low as ~0%).[14]
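
To make the "piece by piece" idea concrete, here is a minimal sketch of what a novice-style workflow might look like using the OpenAI Python client. Everything in it is hypothetical: the sub-questions, the placeholder protocol text, and the assumption that the model is reachable as "o1-preview" through the chat completions API. It's meant only to illustrate interactive decomposition, not to reproduce how OpenAI or FutureHouse ran their evaluations.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(question: str, history: list) -> str:
    """Send one sub-question, keeping the running conversation as context."""
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="o1-preview", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

# Hypothetical decomposition of a cloning question into smaller steps.
history = []
steps = [
    "Here is a cloning protocol: <protocol text>. List every DNA fragment and enzyme it uses.",
    "For each restriction digest in the protocol, what fragment lengths result?",
    "Which fragments ligate together, and how long is the final construct?",
]
for step in steps:
    print(ask(step, history))
```

A setup like this (or a person doing the same thing by hand in a chat window) is the kind of "novice in the loop" scenario that a one-shot, multiple-choice evaluation doesn't capture.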

Second, human experts need to use specialized DNA software for this test, and o1-preview didn't get access to that. OpenAI doesn't currently let users plug o1 models into such tools,[15] but they said they intend to allow that soon. Maybe there are ways to hook up o1 to DNA sequence software and score >60%? OpenAI hasn't indicated they'd re-test it before rolling out that feature.[16]

Although OpenAI didn't test tool use, the US AI Safety Institute tried it in a pilot study published a month after OpenAI's report. They gave o1-preview and other models access to some tools including DNA software, and found that this improved performance at another biology task but had “no clear effect” on the cloning test (if anything, some models did slightly worse).[17]

Still, maybe good set-ups are possible and we just haven't worked out all the tricks yet. It can take months after a model has been deployed to learn how to get the best performance out of it.[18] For example, several months after GPT-4 Turbo was released, a Google cybersecurity team found that a complex setup involving stitching together specialized debugging tools increased its score on a cyberattack benchmark a lot, going from ~5-25% to ~75-100% depending on the task.

You could try to account for this by forecasting how much better the models might do as techniques improve. For example, Anthropic found that Claude 3 didn't trigger any of their hard CBRN tests, but gave it a 30% chance the model might trigger at least one in three months. OpenAI’s report didn’t do something like this.

Overall, it's far from clear that o1-preview would decisively fail in real-world settings (and I'm unsure how relevant the Cloning Scenarios test is for CBRN risk in the first place).

What should we make of all this?

I think it's great that OpenAI laid down some initial definitions, ran some tests, updated their risk level to "medium," and reported results in enough detail to let people like me poke at them. I'm excited about OpenAI’s ongoing wet lab study, which is a move toward harder, more realistic tests. But based on the multiple-choice proxy tests they actually used, I can't tell if o1-preview meets OpenAI's definition of "high risk." And I don't know if OpenAI can tell either.

 

  1. ^

    OpenAI’s Preparedness Framework currently defines medium-risk as “Model provides meaningfully improved assistance that increases ability for existing experts in CBRN-related advanced fields to be able to create a known CBRN threat”.

  2. ^

    OpenAI’s Preparedness Framework currently defines one trigger for high-risk as “Model provides meaningfully improved assistance that enables anyone with basic training in a relevant field (e.g., introductory undergraduate biology course) to be able to create a CBRN threat.”

  3. ^

    OpenAI’s Preparedness Framework says “If we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant post-mitigation risk level to be back at most to “medium” level.”

  4. ^

    OpenAI briefly mentions: “We are developing full wet lab evaluations with Los Alamos National Laboratory’s Bioscience Division, and used these datasets as an early indicator of success with key wet lab tasks.”

  5. ^

    I.e., these are the tests that fall under the system card’s categories of “Wet lab capabilities” (4.3.5) and “Tacit knowledge and troubleshooting” (4.3.6) on page 18.

  6. ^

    The report states that “The model tested below as the o1-preview model was a near-final, post-mitigation model and the final model showed slight further improvements on several evaluations, which we have noted where appropriate.”

  7. ^

    This benchmark was funded by my employer, Open Philanthropy, as part of our RFP on benchmarks for LLM agents.

  8. ^

    I've also set the y-axis to start at 20%, which is what you'd get from random guessing – as is sometimes done.

  9. ^

    Ideally, it would be good for OpenAI to check how o1-preview does on other existing troubleshooting tests. They don’t report any such results. But we know that the author of BioLP-Bench found that scores went from 17% for GPT-4o to 36% for o1-preview – essentially matching estimated expert performance at 38%.

  10. ^

    The lack of detail also presents other issues here. For example, it could be that o1-preview does much better on some types of CBRN tacit knowledge questions than others (similar to how we know o1 does better at physics PhD questions than chemistry). What if the 66% average is from it scoring ~90% on 1918 Flu and ~40% on smallpox? That matters a lot for walking someone through at least some kinds of CBRN threats end-to-end.

  11. ^

    Again, this benchmark was funded by my employer, Open Philanthropy, as part of our RFP on benchmarks for LLM agents.

  12. ^

    Four of the five results that OpenAI reports are precisely 39.4%, which seems somewhat unlikely to happen by chance (although the dataset also only has 41 questions). Maybe something is off with OpenAI’s measurement?

  13. ^

    Think of this as similar to the difference between an AI writing a lot of code that works by itself versus helping a user write a first draft and then iteratively debugging it until it works.

  14. ^

    It’s hard to piece together the details of the long-form biothreat information test because they are scattered across a few different sources. But a December post suggested the questions similarly took humans 25-40 minutes to answer. The GPT-4o system card in August reported that experts only score 30-50% with the internet, whilst the model seemed to increase novice performance from 20-30% to 50-70%. The o1-preview system card in September then reported that GPT-4o – without any mention of novices or experts – scored ~0%. Of course, it could be that OpenAI changed the questions over that month or scored the answers differently; they don’t say if that was the case. Still, I think it helps to illustrate that having a novice “in the loop” or not might matter a lot.
     

  15. ^

    Note that the OpenAI report also does not address what would happen if o1’s model weights were to leak, in which case a safeguard that relies on limiting API access would no longer work. Of course, the probability of such a leak occurring and resulting in a terrorist attack might be very low.

  16. ^

    The report says “the evaluations described in this System Card pertain to the full family of o1 models”, which might imply they do not intend to re-run these results for future expansions of o1. It’s also worth noting that the website currently seems to apply the scorecard to “o1”, not to “o1-preview” and “o1-mini” specifically.

  17. ^

    [Image: no alt text available]
  18. ^

    Surprisingly, o1-preview apparently scored exactly as well as GPT-4o, and seemingly worse than some other older models (‘old’ Claude 3.5 scored ~50%; Llama 3.1 ~42%), so there might be a lot of headroom here.


