Abstract
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
Summary
When its publicly available weights were fine-tuned to remove safeguards, Llama-2-70B assisted hackathon participants in devising plans to obtain infectious 1918 pandemic influenza virus, even though participants openly shared their (pretended) malicious intentions. Liability laws that hold foundation model makers responsible for all forms of misuse above a set damage threshold that result from model weight proliferation could prevent future large language models from expanding access to pandemic agents and other foreseeable catastrophic harms.
Thanks! This is helpful because it clarifies a few areas where we disagree.
I think future LLMs will likely still be very helpful for such people, since there are more steps to being an effective bioterrorist than just understanding, e.g., existing reverse genetics protocols. I don't want to say much more on that point. That said, I'm personally less concerned about LLMs enhancing the capabilities of people who are already experts in some of these domains than about LLMs enhancing the abilities of non-experts.
I disagree. I think future LLMs will enhance the ability of average people to do something with biology. I expect LLMs will get much better at generating protocols, recommending upskilling strategies, providing lab tutorials, interpreting experimental results, and so on. And they will do all of those things in a much more accessible manner. Also, keep in mind that Fig 1 in our paper shows there is more than one path to obtaining the 1918 virus.
I also think there is an underappreciated point here about LLMs making it more likely that people attempt bioterrorism in the first place. If a malicious actor looking to cause mass harm spends a couple of hours in conversation with an uncensored LLM and learns that biology is a feasible path toward that goal, then I expect more people to try – even if it takes significant time and money.
These examples are indeed nasty ways to cause harm to people, and they do sound significantly easier to carry out. However, the scale of harm achievable with infectious or otherwise exponential biology goes far beyond that of targeted chemical weapons attacks. The potential harm is such that the statement "hardly anyone wants to carry out such attacks" doesn't seem a sufficient reason not to be concerned.