Here’s Holden Karnofsky:
I tend to think it’s worse than 51/49. I tend to think we’re always going to be prone to overestimate how robustly good our actions are. And the more we learn about all the galaxy-brained considerations that one should have had in one’s head, the more it’s going to be like 50+ε%. I think AI safety is a great cause to work in. I’m excited to work in it. I think it’s high impact. I am doing my best to do things that I will be proud to have done and hope for the best. But I really do have to live with the possibility that my ultimate impact on the utilons or whatever is going to be negative.
I’m not aware of a good list of downside risks for AI safety broadly[1], so I decided to make one.
This is not intended to be fully comprehensive; these are just the ones that I personally take seriously[2][3]:
- AI governance interventions are obviously high-variance: bad regulation can easily make things worse, many interventions could increase the risk of great power conflict, increased political polarization around AI could be really bad, more centralization of power increases authoritarianism risk, more decentralization of power increases misuse risk, and so on. And technical work can have flow-through effects on these variables that outweigh its direct effects.[4]
- Activist work can polarize people against the cause.[5]
- Human takeover might be worse than AI takeover, and many AI safety interventions effectively attempt[6] to make human takeover more likely relative to AI takeover.
- If powerful AI will be well-described as doing humanlike roleplaying, trying to control it could make it eventually dislike its “oppressors”, or make it less “mentally healthy” in some way. And even without that assumption, AI safety work could lead to an adversarial relationship with AI in other ways.
- Future AIs may be moral patients themselves, which would substantially reduce the value of preventing human extinction, and increase the downside risk (including S-risk) of “AI control”-style interventions.
- Misleading or insufficiently useful work could contribute to “safety-washing” or a false sense of security.
- There are cultural concerns around scale, professionalization and “mainstreaming”[7] - decreases in integrity and epistemic virtue could be very bad for achieving good outcomes.
- Capabilities externalities (directly through technical work, or via talent pipelines, funding, or raising awareness) could accelerate AI progress, which many think is bad - people have raised this worry about RLHF historically, and raise it about interpretability and evals nowadays. Most infamously, AI safety activity, to varying extents, contributed to the foundings of all three of DeepMind, OpenAI and Anthropic.
(This list is taken from a previous post of mine, but I thought it deserved its own top-level reference.)
- ^
The closest thing I’m aware of is Safeguarding the Safeguards, but even that is more narrow.
- ^
To be clear, I don’t personally think AI safety has been net negative so far, like some do. I wouldn’t even say that I have a properly considered view about it - maybe 60% that it’s been net positive, with very low credal resilience.
But I do feel a vibe of overconfidence in the discourse here sometimes, and I think this can have downstream consequences, e.g. an action bias.
- ^
Quickly, here are others that I excluded because I don’t personally see them as potentially major factors, and didn’t want to water down the main list by including a bunch of implausible galaxy-brained stuff:
- Differential slowdown of safety-minded actors: This feels somewhat falsified and “out of fashion” now that Anthropic has taken the lead and concern about China passing the US is a bit lower than 1-2 years ago? And the AI safety community also has less relative power now that more and larger forces have gotten involved.
- Putting AI doom stories in the training data: I don’t buy that this could be a major factor, since there’s a lot of stuff in the training data and post-training applies a lot of optimization away from a Simulators-style reproduction of the training data.
- Theoretical concerns about the value of the future, most commonly associated with suffering-focused people: Since AI would most likely expand through the universe too, I don’t see this as an argument that AI safety might be net negative specifically, over and above what’s already in the list (although I do think there are important considerations in general there).
- “Crying wolf” dynamics if doom predictions don’t pan out: I don’t buy this as a major factor, since many safety people are not that overtly/confidently doomy nowadays, and so wouldn’t lose credibility.
- Most of our impact comes from acausal effects, and effects on the base reality if we are in a simulation: I’m confused here like everyone else, but I currently don’t buy this as a major factor because we only know our reality, and therefore the same things that are good here should naively also have good acausal effects in expectation (except that it maybe pushes for somewhat more cooperation and virtue ethics).
- ^
Holden Karnofsky: “Most things that touch policy at all in any way will move us along that spectrum in one direction or another, so therefore have a high chance of being negative [...]
And then most things that you can do in AI at all will have some impact on policy. Even just alignment research: policy will be shaped by what we’re seeing from alignment research, how tractable it looks, what the interventions look like.” (h/t Anthony DiGiovanni)
- ^
Holden Karnofsky: “there’s also a lot of micro ways in which you could do harm. Just literally working in safety and being annoying, you might do net harm. You might just talk to the wrong person at the wrong time, get on their nerves. I’ve heard lots of stories of this. Just like, this person does great safety work, but they really annoyed this one person, and that might be the reason we all go extinct” (h/t Anthony DiGiovanni)
- ^
Among other things.
- ^
I associate these with people like Richard Ngo (and here) and Oliver Habryka.

If I understand correctly, this is the "Extrapolation" response to unawareness I discuss here. What do you think of my response?
I think it's not quite your "Extrapolation", because it's specifically about the acausal mechanism - by definition, the only (EDIT: direct) acausal effect possible is to make other agents take similar actions to us.
(and then the simulation thing I kind of sweep under the rug because the footnote was quickly written, but the argument is somewhat similar (although very vague and I'd like something better): Whatever purpose our simulators have for simulating us, it's probably good for their reality too if we reach a good outcome in the simulation.)
Hmm, these arguments seem too anchored on what we happen to currently be aware of.
OK I read the LW comment you linked and skimmed the post, but I don't see how they show that we should expect lots of crucial considerations to come up specifically - they seem to argue more for "we're clueless about how much we should do ECL"? (but correct me if I'm wrong, I may have missed something). On your example: why should I expect their attempts to do so to backfire if I don't already expect our own attempts to backfire? That seems like it just grounds out in the original debate.
(btw, I also would like to get less confused about the similarity thing)
On the simulators, it just seems like it's hard to think of possible simulator-motivations where us reaching good outcomes in the simulation would be bad for the base reality, and easy to think of ones where it would be neutral or good.
In general, I think the unawareness angle is genuinely interesting, I'm just not moved by it as much as you, probably for a few different reasons that would take some time to articulate.
I think they suggest that there's just a lot of subtlety in working out the implications of acausal decision theories in practice. Which is reason to expect more crucial considerations in this domain generally / reason to doubt your "by definition" argument.
Why should you expect them to be positive in expectation either? (The broader point of the unawareness sequence is that there's an ambiguous pile of positive and negative effects to weigh up.)
But then we're back to the Extrapolation argument, which you claimed you weren't committed to. I'm saying, even if the balance of effects we can think of looks good, we're looking at a super tiny sliver of the set of effects our fully aware selves would be weighing up — and it's a biased sample of such effects, so extrapolating from that sample is dubious.
"But then we're back to the Extrapolation argument, which you claimed you weren't committed to."
No, I didn't claim that - I said the snippet you quoted wasn't the Extrapolation argument, and I stand by that. I'm definitely sympathetic to something like Extrapolation in general.
I'm confused. The snippet I quoted — about acausal stuff and simulations — is what's at issue in this discussion.
Anyway, consistency aside, my response to Extrapolation stands, so I'm interested to hear your counterargument to it.
The original snippet was more about the acausal stuff, and wasn't Extrapolation, and the distinct argument I subsequently mentioned about simulators was Extrapolation.
There's no quick objection I can give to your response specifically (except to simplistically say "no, I don't buy it, we know more than that, we can have some reasonable guesses about simulators' intentions") - properly laying out my disagreements would take a little bit of time and effort, as is usually the case for deep worldview differences.