This is a crosspost of If Anything Changes, All Value Dies? by Robin Hanson, originally published on Overcoming Bias on 17 September 2025. I remain open to bets of up to 10 k$ against short AI timelines, or against what they supposedly imply.

Today Yudkowsky & Soares published their book If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All [website]. I spent the day reading it.

Their core arguments (my paraphrase):

Knowing that a mind was evolved by natural selection, or by training on data, tells you little about what it will want outside of that selection or training context. For example, it would have been very hard to predict that humans would like ice cream, sucralose, or sex with contraception. Or that peacocks would like giant colorful tails. Analogously, training an AI doesn’t let you predict what it will want long after it is trained. Thus we can’t predict what the AIs we start today will want later when they are far more powerful, and able to kill us. To achieve most of the things they could want, they will kill us. QED.

Also, mind states that feel happy and joyous, or embody value in any way, are quite rare, and so quite unlikely to result from any given selection or training process. Thus future AIs will embody little value.

These arguments seem to me to prove way too much, as their structure applies to any changed descendants, not just AIs: any descendants who change from how we are today due to something like training or natural selection won’t be happy or joyous, or embody value, and they’ll kill any other creatures less powerful than they are. [I recommend reading the article linked at the start of this post, which I think contains the thrust of Robin's critique.]

Let us break future creatures who change due to selection or training into any two categories of a small us vs a big them. As we can’t predict what they will want later, and they will be much bigger than us later, we can predict that they will kill us later. Thus we must prevent any changed big future they from existing. Except, as neither we nor they are happy or joyous later, who cares?

Some I’ve talked to accept my summary above, but say that the difference with AI is that it might change faster than would other descendants. But culture-mediated non-AI value change should be pretty fast, and I’m not sure why I should care about clock time, relative to the rates of events experienced by key creatures. Others say that humans are just much less pliable in their desires than are AIs, but I see much less difference there; human culture makes us quite pliable.

We can reasonably doubt three strong claims above:

  1. That subjective joy and happiness are very rare. They seem likely to be common to me.
  2. That one can predict nothing at all from prior selection or training experience.
  3. That all influence must happen early, after which it is lost. There might instead be a long period of reacting to and rewarding varying behavior.

Some relevant quotes:

AI companies won’t get what they trained for. They’ll get AIs that want weird and surprising stuff instead. …

The link between what the AI was trained for and what it ends up caring about would be complicated, unpredictable to engineers in advance, and possibly not predictable in principle. …

The link between what a creature is trained to do and what it winds up doing can get pretty twisted and complex, …

But the stuff that AIs really want, that they’d invent if they could? That’ll be weird and surprising, and will bear little resemblance to anything nice. …

There will not be a simple, predictable relationship between what the programmers and AI executives fondly imagine that they are commanding and ordaining, and (1) what an AI actually gets trained to do, and (2) which exact motivations and preferences develop inside the AI, and (3) how the AI later fulfills those preferences once it has more power and ability. …

The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained. …

it may act subservient while it’s young and dumb, but nobody has any idea how to avoid the eventuality of that AI inventing its own sucralose version of subservience if it ever gained the power to do so. …

Most alien species, if they evolved similarly to how known biological evolution usually works, and if given a chance to have things the way they liked them most, probably would not choose a civilization where all their homes contained a large prime number of stones. …

Similarly, most powerful artificial intelligences, created by any method remotely resembling the current methods, would not choose to build a future full of happy, free people. …

We predict the result will be an alien mechanical mind with internal psychology almost absolutely different from anything that humans evolved and then further developed by way of culture. …

Making a future full of flourishing people is not the best, most efficient way to fulfill strange alien purposes. …

It’s easy to imagine that the AI will live a happy and joyous life once we’re gone; that it will marvel at the beauty of the universe and laugh at the humor of it all. But we don’t think it will, any more than it will make sure that all its dwellings contain a “correct” number of stones. We think a mechanical mind could feel joy, that it could marvel at the beauty of the universe, if we carefully crafted it to have that ability. …

The endpoint of modern AI development is the creation of a machine superintelligence with strange and alien preferences.

Added 17Sep: I suspect Yudkowsky & Soares see non-AI-descendant value change as minor or unimportant, perhaps due to seeing culture as minor relative to DNA.

Comments

Contrary to their claim that "it would have been very hard to predict that humans would like ice cream, sucralose, or sex with contraception," I think it was predictable that these preferences would likely result from natural selection under constraints. In each of these examples, a mechanism that evolved to detect the achievement of an instrumentally important subgoal is triggered by a stimulus that (i) is very similar to the stimuli an animal would experience when the subgoal is achieved, and (ii) did not exist in the evolutionary environment. We should expect any (partially or fully) optimized bounded agent to have detectors for the achievement of instrumentally important subgoals. We should expect these detectors to analyze only a limited number of features with limited precision. And we should expect the limited set of comparisons they do perform precisely to be optimized for distinctions that were important for success on the training data.

Given that these failures were predictable, it should be possible to systematically predict many analogous failures that might result from training AI systems on specific data sets or (simulated) environments. If we can predict such failures of generalization beyond the training data, then we might be able to either prevent them, mitigate them, or regulate real-world applications so that AI systems won't be applied to inputs where misclassification is likely and problematic. The latter approach is analogous to outlawing highly addictive drugs that mimic neurotransmitters signalling the achievement of instrumentally important subgoals.
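As a rough illustration of the mechanism Falk describes, here is a minimal toy sketch (my own, not from the comment or the book; the "sweetness"/"calories" framing and all numbers are illustrative assumptions). A detector that only ever senses a proxy feature during training will, predictably, fire on a novel stimulus that carries the proxy but not the thing the proxy stood for:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training environment: the detector can only sense sweetness (a proxy);
# the selection signal (the label) comes from whether the food actually
# contained calories. In training, the two coincide almost perfectly.
n = 500
calories = rng.uniform(0.0, 1.0, n)
sweetness = calories + rng.normal(0.0, 0.02, n)
worth_eating = (calories > 0.5).astype(float)

# A 1-D logistic "detector" fit by plain gradient descent.
w, b = 0.0, 0.0
lr = 1.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(w * sweetness + b)))
    w -= lr * np.mean((p - worth_eating) * sweetness)
    b -= lr * np.mean(p - worth_eating)

# Deployment: a sucralose-like stimulus, very sweet but calorie-free.
# The detector fires anyway, because sweetness was the only feature it
# ever had access to -- a misfire that is foreseeable from the training setup.
activation = 1.0 / (1.0 + np.exp(-(w * 0.9 + b)))
print(f"learned weight on sweetness: {w:.2f}")
print(f"detector activation on sweet, zero-calorie input: {activation:.2f}")
```

In this toy setup the "sucralose" failure is foreseeable from the training distribution and the detector's limited inputs alone, which is the sense in which such generalization failures might be systematically predictable.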

Interesting!

Given that these failures were predictable, it should be possible to systematically predict many analogous failures that might result from training AI systems on specific data sets or (simulated) environments.

Your framework seems to work for simple cases like "ice cream, sucralose, or sex with contraception", but I don't think it works for more complex cases like "peacocks would like giant colorful tails"?

There is so much human behaviour also that would have been essentially impossible to predict just from first principles and natural selection under constraints: poetry, chess playing, comedy, monasticism, sports, philosophy, effective altruism. These behaviours seem further removed from your detectors for instrumentally important subgoals, and/or to have a more complex relationship to those detectors, but they're still widespread and important parts of human life. This seems to support the argument that the relationship between how a mind was evolved (e.g., by natural selection) and what it ends up wanting is unpredictable, possibly in dangerous ways.

Your model might still tell us that generalisation failures are very likely to occur, even if, as I am suggesting, it can't predict many of the specific ways things will misgeneralise. But I'm not sure this offers much practical guidance when trying to develop safer AI systems. Maybe I'm wrong about that?

I think the post The Selfish Machine by Maarten Boudry is relevant to this discussion.

Consider dogs. Canine evolution under human domestication satisfies Lewontin’s three criteria: variation, heritability, and differential reproduction. But most dogs are bred to be meek and friendly, the very opposite of selfishness. Breeders ruthlessly select against aggression, and any dog attacking a human usually faces severe fitness consequences—it is put down, or at least not allowed to procreate. In the evolution of dogs, humans call the shots, not nature. Some breeds, like pit bulls or Rottweilers, are of course selected for aggression (to other animals, not to their guardians), but that just goes to show that domesticated evolution depends on breeders’ desires.

How can we extend this difference between blind evolution and domestication to the domain of AI? In biology, the defining criterion of domestication is control over reproduction. If humans control an animal’s reproduction, deciding who gets to mate with whom, then it’s domesticated. If animals escape and regain their autonomy, they’re feral. By that criterion, house cats are only partly domesticated, as most moggies roam about unsupervised and choose their own mates, outside of human control. If you apply this framework to AIs, it should be clear that AI systems are still very much in a state of domestication. Selection pressures come from human designers, programmers, consumers, and regulators, not from blind forces. It is true that some AI systems self-improve without direct human supervision, but humans still decide which AIs are developed and released. GPT-4 isn’t autonomously spawning GPT-5 after competing in the wild with different LLMs; humans control its evolution.

By and large, current selective pressures for AI are the opposite of selfishness. We want friendly, cooperative AIs that don’t harm users or produce offensive content. If chatbots engage in dangerous behavior, like encouraging suicide or enticing journalists to leave their spouse, companies will frantically try to update their models and stamp out the unwanted behavior. In fact, some language models have become so safe, avoiding any sensitive topics or giving anodyne answers, that consumers now complain they are boring. And Google became a laughing stock when its image generator proved to be so politically correct as to produce ethnically diverse Vikings or founding fathers.

Thanks for the great point, Falk. I very much agree.

I also find it problematic that they end the paragraph with "QED." "QED" is a technical term used to indicate that a mathematical theorem has been proven. The quoted verbal argument clearly does not meet the rigorous standards of mathematical proof. This looks like an attempt to exploit superficial, intuitive heuristics to persuade readers to believe the conclusion with a level of confidence that is unwarranted by the information in the quoted paragraph. 

Who are "they"? If you mean Yudkowsky and Soares, "QED" is something that Hanson (the author of this critique) includes in his paraphrase of Yudkowsky and Soares, but I don't think it's anything Yudkowsky and Soares wrote in their book. The quoted argument is not actually a quote, but a paraphrase.

Thanks for clarifying, Erich. I believe Falk was referring to Yudkowsky and Soares. I have not read their book; I have just listened to podcasts they have done, and skimmed some of their writings. However, I think the broader point stands that they often use language that implies the possibility of human extinction is more certain and robust than their arguments warrant.

I didn't realize the quoted text was a paraphrase rather than an exact quote. I only commented on the paraphrase, not on the book itself. I apologize for the oversight.

I completely agree with what you just stated (although I have not read the post you linked), but I do not understand why it would undermine the broader point I mentioned in my comment.

If you thought Yudkowsky and Soares used overly confident language and would have taken the "QED" as further evidence of that, but this particular example turns out not to have been written by Yudkowsky and Soares, that's some evidence against your hypothesis. But instead of updating away a little, you seemed to dismiss that evidence and double down. (I think you originally replied to the original comment approvingly, or at least non-critically, and then deleted that comment after I replied to it, though I could be misremembering that.)

For what it's worth, I think you're right that Yudkowsky at least uses overly confident language sometimes -- or I should say, is overly confident sometimes, because I think his language generally reflects his beliefs -- but I would've been surprised to see him use "QED" in that way, which is why I reacted to the original comment here with skepticism and checked whether "QED" actually appeared in the book (it didn't). I take that to imply I was better calibrated than anyone who did not so react.

If you thought Yudkowsky and Soares used overly confident language and would have taken the "QED" as further evidence of that, but this particular example turns out not to have been written by Yudkowsky and Soares, that's some evidence against your hypothesis.

I agree.

But instead of updating away a little, you seemed to dismiss that evidence and double down.

I updated away a little, but negligibly so.

I think you originally replied to the original comment approvingly or at least non-critically, but then deleted that comment after I replied to it, but I could be misremembering that.

I deleted a comment which said something like the following: "Thanks, Falk. I very much agree." I did not remember that the "QED" was Robin paraphrasing. However, I think the "QED" is still supposed to represent the authors' level of confidence, in the book, in their arguments for a high risk of human extinction.

I would've been surprised to see him use "QED" in that way, which is why I reacted to the original comment here with skepticism and checked whether "QED" actually appeared in the book (it didn't).

Interesting. I would not have found the use of "QED" surprising. To me it seems that Yudkowsky is often overly confident.


I remain open to bets of up to 10 k$ against short AI timelines, or against what they supposedly imply. Do you see any bet we could make that would be good for both of us, considering that we could invest our money, and that you could take out loans?
