Hide table of contents

Epistemic status: raising concerns, rather than stating confident conclusions.

I’m worried that a lot of work on AI safety evals matches the pattern of “Something must be done. This is something. Therefore this must be done.” Or, to put it another way: I judge eval ideas on 4 criteria, and I often see proposals which fail all 4. The criteria:

1. Possible to measure with scientific rigor.

Some things can be easily studied in a lab; others are entangled with a lot of real-world complexity. If you predict the latter (e.g. a model’s economic or scientific impact) based on model-level evals, your results will often be BS.

(This is why I dislike the term “transformative AI”, by the way. Whether an AI has transformative effects on society will depend hugely on what the society is like, how the AI is deployed, etc. And that’s a constantly moving target! So TAI a terrible thing to try to forecast.)

Another angle on “scientific rigor”: you’re trying to make it obvious to onlookers that you couldn’t have designed the eval to get your preferred results. This means making the eval as simple as possible: each arbitrary choice adds another avenue for p-hacking, and they add up fast.

(Paraphrasing a different thread): I think of AI risk forecasts as basically guesses, and I dislike attempts to make them sound objective (e.g. many OpenPhil worldview investigations). There are always so many free parameters that you can get basically any result you want. And so, in practice, they often play the role of laundering vibes into credible-sounding headline numbers. I'm worried that AI safety evals will fall into the same trap.

(I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.)

2. Provides signal across scales.

Evals are often designed around a binary threshold (e.g. the Turing Test). But this restricts the impact of the eval to a narrow time window around hitting it. Much better if we can measure (and extrapolate) orders-of-magnitude improvements.

3. Focuses on clearly worrying capabilities.

Evals for hacking, deception, etc track widespread concerns. By contrast, evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk. And even they don’t think it’s necessary for risk.

4. Motivates useful responses.

Safety evals are for creating clear Schelling points at which action will be taken. But if you don’t know what actions your evals should catalyze, it’s often more valuable to focus on fleshing that out. Often nobody else will!

In fact, I expect that things like model releases, demos, warning shots, etc, will by default be much better drivers of action than evals. Evals can still be valuable, but you should have some justification for why yours will actually matter, to avoid traps like the ones above. Ideally that justification would focus either on generating insight or being persuasive; optimizing for both at once seems like a good way to get neither.

Lastly: even if you have a good eval idea, actually implementing it well can be very challenging

Building evals is scientific research; and so we should expect eval quality to be heavy-tailed, like most other science. I worry that the fact that evals are an unusually easy type of research to get started with sometimes obscures this fact.

Comments2


Sorted by Click to highlight new comments since:

Thanks for the post!

“Something must be done. This is something. Therefore this must be done.”

1. Have you seen grants that you are confident are not a good use of EA's money?

2. If so, do you think that if the grantmakers had asked for your input, you would have changed their minds about making the grant?

3. Do you think Open Philanthropy (and other grantmakers) should have external grant reviewers?

I remain in favor of people doing work on evals, and in favor of funding talented people to work on evals. The main intervention I'd like to make here is to inform how those people work on evals, so that it's more productive. I think that should happen not on the level of grants but on the level of how they choose to conduct the research.

Curated and popular this week
 ·  · 1m read
 · 
Science just released an article, with an accompanying technical report, about a neglected source of biological risk. From the abstract of the technical report: > This report describes the technical feasibility of creating mirror bacteria and the potentially serious and wide-ranging risks that they could pose to humans, other animals, plants, and the environment...  > > In a mirror bacterium, all of the chiral molecules of existing bacteria—proteins, nucleic acids, and metabolites—are replaced by their mirror images. Mirror bacteria could not evolve from existing life, but their creation will become increasingly feasible as science advances. Interactions between organisms often depend on chirality, and so interactions between natural organisms and mirror bacteria would be profoundly different from those between natural organisms. Most importantly, immune defenses and predation typically rely on interactions between chiral molecules that could often fail to detect or kill mirror bacteria due to their reversed chirality. It therefore appears plausible, even likely, that sufficiently robust mirror bacteria could spread through the environment unchecked by natural biological controls and act as dangerous opportunistic pathogens in an unprecedentedly wide range of other multicellular organisms, including humans. > > This report draws on expertise from synthetic biology, immunology, ecology, and related fields to provide the first comprehensive assessment of the risks from mirror bacteria.  Open Philanthropy helped to support this work and is now supporting the Mirror Biology Dialogues Fund (MBDF), along with the Sloan Foundation, the Packard Foundation, the Gordon and Betty Moore Foundation, and Patrick Collison. The Fund will coordinate scientific efforts to evaluate and address risks from mirror bacteria. It was deeply concerning to learn about this risk, but gratifying to see how seriously the scientific community is taking the issue. Given the potential infoha
 ·  · 1m read
 · 
 ·  · 14m read
 · 
1. Introduction My blog, Reflective Altruism, aims to use academic research to drive positive change within and around the effective altruism movement. Part of that mission involves engagement with the effective altruism community. For this reason, I try to give periodic updates on blog content and future directions (previous updates: here and here) In today’s post, I want to say a bit about new content published in 2024 (Sections 2-3) and give an overview of other content published so far (Section 4). I’ll also say a bit about upcoming content (Section 5) as well as my broader academic work (Section 6) and talks (Section 7) related to longtermism. Section 8 concludes with a few notes about other changes to the blog. I would be keen to hear reactions to existing content or suggestions for new content. Thanks for reading. 2. New series this year I’ve begun five new series since last December. 1. Against the singularity hypothesis: One of the most prominent arguments for existential risk from artificial agents is the singularity hypothesis. The singularity hypothesis holds roughly that self-improving artificial agents will grow at an accelerating rate until they are orders of magnitude more intelligent than the average human. I think that the singularity hypothesis is not on as firm ground as many advocates believe. My paper, “Against the singularity hypothesis,” makes the case for this conclusion. I’ve written a six-part series Against the singularity hypothesis summarizing this paper. Part 1 introduces the singularity hypothesis. Part 2 and Part 3 together give five preliminary reasons for doubt. The next two posts examine defenses of the singularity hypothesis by Dave Chalmers (Part 4) and Nick Bostrom (Part 5). Part 6 draws lessons from this discussion. 2. Harms: Existential risk mitigation efforts have important benefits but also identifiable harms. This series discusses some of the most important harms of existential risk mitigation efforts. Part 1 discus