TLDR:
- AI safety is confusing to navigate, because it is a pre-paradigmatic field composed of people making different, theoretical arguments for why x-risk is likely (or unlikely).
- Arguments that x-risk is likely are unfalsifiable and have little empirical evidence. This does not mean they’re wrong.
- Much of your probability of x-risk boils down to your priors, and to whether you weight theory or empiricism more heavily.
- I think AI safety is important to work on, but I’m optimistic that alignment will be solved through iterative development of technology.
In this post, the cofounders of Mechanize (Tamay Besiroglu, Matthew Barnett, and Ege Erdil) rebut the case that current trends in AI progress will likely produce a misaligned superintelligence that causes human extinction. I will call people who hold this view pessimists (people with p(doom) > 50%). The post does not argue that safety research is unnecessary; rather, the authors expect safety efforts to succeed.
Their first claim: “...there is no standard argument to respond to, no single text that unifies the AI safety community.”
This claim is explained further in this article by the blogger a3orn. For example, many people cite Yudkowsky’s arguments as their main source of concern. His argument roughly goes: under sufficient optimization pressure, we should expect an AI to act as an “optimizer” for certain values. These values are likely to differ from ours because of goal misgeneralization, and even small differences in values result in the AI optimizing for goals that kill everyone. Alex Turner, meanwhile, does not find the “reward optimizer” hypothesis or the “inner/outer misalignment” distinction plausible. Central figures in AI safety have been unable to reach a consensus despite 100+ hours of debate. Richard Ngo identifies five clusters of alignment researchers. Some are focused on LLM safety, while others, like Steve Byrnes, think the big risks lie not with LLMs but with future architectures that will likely be developed.
The implication is that there is no single concrete AI risk argument I can read and respond to. It is difficult to form opinions on AI safety when most experts disagree so strongly about threat models.
Matthew and Ege find it suspicious that safetyists offer many widely different arguments that all arrive at the same conclusion. They suspect motivated reasoning. Ege points out that in most circumstances, a group should have a couple of big arguments for why it expects something. Economists, for example, will give you roughly the same argument for why tariffs are generally bad (comparative advantage, gains from trade, specialization, etc.).
I think tariffs are a cherry-picked example, one on which we have vast amounts of empirical data. Predicting whether alignment will be easy or impossible, and why, is a much harder and more speculative question. It seems to me that many AI researchers a decade ago believed that we would get human-level AI soon, despite disagreeing on the exact mechanism, whether it was reinforcement learning on games, writing a new programming language for Seed AI, or scaling up transformers. However, they were still united by the rough narrative that computational power was increasing and intelligence could be reduced to computation. Shane Legg obviously deserves a huge amount of credit for predicting in 2009 that we’d have AGI by now, even if he didn’t foresee that scaling transformers would take us there.
There is a rough common thread that unites pessimists. I think their basic case is something like:
- We will build superintelligence (ASI)
- ASI will be goal-directed and agentic
- ASI will develop goals during training that are different from what humans want
- ASI will scheme and hide its true values from us in training
- In deployment, ASI will try to optimize for these goals
- The optimization of these goals will be bad. (There’s disagreement on why it will be bad, from sudden extinction, to loss of control over our future, etc.)
For these reasons, the existence of very different arguments does not significantly update me against x-risk; AI safety is a pre-paradigmatic field. However, the disagreement does make me quite wary of assigning high probabilities to doom.
The article’s second claim: “...we’re not saying Y&S [Yudkowsky and Soares] need to provide direct evidence of an already-existing unfriendly superintelligent AI… But their predictions are only credible if they follow from a theory that has evidential support. And if their theory about deep learning only makes predictions about future superintelligent AIs, with no testable predictions about earlier systems, then it is functionally unfalsifiable.”
I agree with this. Pessimists think that we have only one shot at alignment: once AIs reach a certain level of capability, they will scheme and grow more intelligent until they are capable of taking over the world. So if you’re trying to rebut the pessimists’ argument, you are placed in an inconvenient spot. You can provide ample evidence that current models are aligned, but pessimists can always claim that a future superintelligence would be very different from the LLMs we currently imagine, and that LLM alignment therefore provides no evidence.
Or they could argue, in the future, that models only seem aligned but are actually acting deceptively, and that there is no way to uncover this deception because the models are simply too capable. Solving mechanistic interpretability might be a way to falsify this claim, but it sounds like a very high bar. (It also doesn’t address the “future AIs will be very different” critique.)
What does current empirical evidence suggest? I think alignment is a spectrum: there is a lot of room between perfectly aligned AI and rogue AI that leads to catastrophe. This post by Ryan Greenblatt argues that current models are misaligned. However, many of the failure modes he describes sound more like capability issues to me (Claude not being able to assess whether it has completed a task well, sloppy outputs, agents overselling their work, etc.). I expect them to be resolved as capabilities increase and models become better able to judge task completion. I associate misalignment more with lying, intent to cause harm, deliberate scheming, blackmail, and so on. I agree models aren’t perfectly aligned, but they seem fairly aligned to me. Greenblatt also says he expects current (but not future) “misalignment” failures to be solved soon. I think current evidence points to alignment being the default path.
Again, the pessimists might be right in their theoretical claim that under sufficient optimization pressure, superintelligence becomes misaligned, is deceptive, and takes over the world. I might spend more time listening to debates, weighing arguments, and so on. However, people have spent thousands of hours thinking about this, and I would find it hard to tell which detailed, technical, complicated arguments are correct and which are wrong. I’m wary of being swayed too much by abstract reasoning. I don’t think I could win an argument against a smart conspiracy theorist who made logical arguments for why global warming was a myth, even though they’re wrong, unless I spent lots of time becoming a subject matter expert. I’d just dismiss their arguments for outside view reasons.
You might point out that AI x-risk skeptics have similar problems. They all have different reasons for why AI is not going to cause catastrophe, and many don’t agree with each other. Some think alignment is easy; some think alignment may be hard but control is easy. This is true. The common thread that probably unites skeptics is “we will iteratively develop, test, and make safe superintelligence, just like we make any technology safe, even if it’s unclear which specific alignment techniques we will use”. And overall, I think skeptics have empirical evidence on their side: this is how all technology was developed in the past. Additionally, alignment on current LLMs seems to be working pretty well! So one might argue that on priors, we should assume any technology is safe and that the burden of proof is on the people who think x-risk is likely.
In contrast, “x-risk is likely” people think that superintelligence is bad by default. Paraphrasing a blogpost that clarifies the central argument: building an agent more powerful than all humans, which may have different goals, is obviously dangerous. They believe the onus is on skeptics to definitively prove that it will be safe.
Generally, I think much of people’s estimated likelihood of x-risk is just their prior. If someone’s prior is alignment by default, it is easy to dismiss x-risk arguments as theoretical, vague, and not grounded in evidence. If someone’s prior is x-risk, they point out that the AI optimists have no unifying, solid argument for why AI will be safe either, and that this technology is genuinely unprecedented.
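To make the role of priors concrete, here is a minimal sketch of Bayes’ rule in odds form. Everything in it is an illustrative assumption rather than anyone’s actual estimate: the two priors, the likelihood ratio, and the function name `posterior` are all hypothetical. The point is only that when the available evidence is weak (a likelihood ratio close to 1), two people who start far apart stay far apart after seeing the same evidence.

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.

    likelihood_ratio = P(evidence | doom) / P(evidence | no doom).
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical priors on x-risk (made-up numbers, for illustration only).
priors = {"optimist": 0.05, "pessimist": 0.60}

# Hypothetical weak evidence, e.g. "current LLMs look fairly aligned".
# A ratio of 0.8 means the evidence is only slightly more expected
# in worlds without doom than in worlds with doom.
weak_evidence = 0.8

for name, prior in priors.items():
    updated = posterior(prior, weak_evidence)
    print(f"{name}: prior={prior:.2f} -> posterior={updated:.2f}")
```

Under these assumed numbers both people update in the same direction, but the gap between them barely shrinks. Only evidence with a likelihood ratio far from 1 (the kind of verifiable, falsifiable prediction discussed above) would pull the two posteriors close together.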
I want to engage more with the theoretical arguments that Yudkowsky, Soares, Christiano, Ngo, Turner, and others present. I also want to do more outside view thinking about what my priors should be. Admittedly, this post does not have a satisfying conclusion, but at this point, I think it’d be more beneficial to write new posts than to keep editing this one. I would also really like to see verifiable claims from the safety community that would allow me to update either way.
Given these confusing viewpoints, what (weakly held) opinions have I come to? (Inspired by Stephen Casper’s list).
- There is a 70% chance of very impactful AI (defined here as 10% GDP growth per year) coming within the next ten years.
- AI safety is important. When any technology becomes very powerful, and is dual-use, it is very important to make it safe. AI is not an exception to this rule. It is important to guard against misalignment risks and AI takeover (along with a host of other risks).
- There should be some regulations on AI (as with any powerful technology), though I’m unsure which ones are good.
- I agree with the Center for AI Safety’s statement that: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
- Labs should be transparent about their safety frameworks, and should uphold safety as a key priority.
- AI alignment and control will likely succeed.
- I don’t have good judgement on which kinds of safety research are good.
- I have a low probability of existential risk (~10%).
- More people doing alignment research (both in frontier labs and research organizations) and labs spending more on alignment research is good.
- However, speeding up AI capabilities is bad, because it would be good for society to have time to adapt to transformative AI. I also think acceleration makes alignment harder.
- I’d prefer capabilities to go a bit slower (unsure how much), but I’m skeptical of advocating for a pause. I don’t think it’ll work, and it could lead to much worse outcomes.
- AI will probably lead to good outcomes.
