I tend to disagree with most EAs about existential risk from AI. Unfortunately, my disagreements are all over the place. It's not that I disagree with one or two key points: there are many elements of the standard argument that I diverge from, and depending on the audience, I don't know which points of disagreement people think are most important.
I want to write a post highlighting all the important areas where I disagree, and offering my own counterarguments as an alternative. This post would benefit from responding to an existing piece, along the same lines as Quintin Pope's article "My Objections to "We're All Gonna Die with Eliezer Yudkowsky"". Unlike that piece, however, mine would be intended to address the EA community as a whole, since I'm aware many EAs already disagree with Yudkowsky even if they buy the basic arguments for AI x-risk.
My question is: what is the current best single article (or set of articles) that provides a well-reasoned and comprehensive case for believing there is a substantial (>10%) probability of an AI catastrophe this century?
I was considering replying to Joseph Carlsmith's article, "Is Power-Seeking AI an Existential Risk?", since it seemed reasonably comprehensive and representative of the concerns EAs have about AI x-risk. However, I'm a bit worried that the article is not very representative of EAs who assign substantial probabilities to doom, since he originally estimated the total risk of catastrophe at only 5% before 2070. In May 2022, Carlsmith changed his mind and reported a higher probability, but I am not sure whether this is because he was exposed to new arguments, or because he simply came to think the original arguments were stronger than he had initially judged.
I suspect I have both significant moral disagreements and significant empirical disagreements with EAs, and I want to include both in such an article, while mainly focusing on the empirical points. For example, I have the feeling that I disagree with most EAs about:
- How bad human disempowerment would likely be from a utilitarian perspective, and what "human disempowerment" even means in the first place
- Whether there will be a treacherous turn event, during which AIs violently take over the world after previously having been behaviorally aligned with humans
- How likely AIs are to coordinate near-perfectly with each other as a unified front, leaving humans out of their coalition
- Whether we should expect AI values to be "alien" (like paperclip maximizers) in the absence of extraordinary efforts to align them with humans
- Whether the AIs themselves will be significant moral patients, on par with humans
- Whether there will be a qualitative moment when "the AGI" is created, rather than systems incrementally getting more advanced, with no clear finish line
- Whether we get only "one critical try" to align AGI
- Whether "AI lab leaks" are an important source of AI risk
- How likely AIs are to kill every single human if they are unaligned with humans
- Whether there will be a "value lock-in" event soon after we create powerful AI that causes values to cease their evolution over the coming billions of years
- How bad problems related to "specification gaming" will be in the future
- How society is likely to respond to AI risks, and whether it will sleepwalk into a catastrophe
However, I also disagree with points made by many other EAs who have argued against the standard AI risk case. For example:
- I think AIs will eventually become vastly more powerful and smarter than humans, and so will eventually be able to "defeat all of us combined"
- I think a benign "AI takeover" event is very likely even if we align AIs successfully
- I think AIs will likely be goal-directed in the future. I don't think, for instance, that we can just "not give the AIs goals" and then everything will be OK
- I think it's highly plausible that AIs will end up with substantially different values from humans (although I don't think this will necessarily cause a catastrophe)
- I don't think we currently have strong evidence that deceptive alignment will be an easy problem to solve
- I think it's plausible that AI takeoff will be relatively fast, with the world dramatically transformed over a period of several months or a few years
- I think short timelines, meaning a dramatic transformation of the world within the next 10 years, are pretty plausible
I'd like to elaborate on as many of these points as possible, preferably by responding to direct quotes from the representative article arguing for the alternative, more standard EA perspective.
I think you misunderstood the points I was making. Sorry for writing an insufficiently clear comment.
Agreed; that's why I wrote "0.1% to 0.01% reduction in p(doom) per year". I wasn't talking about the absolute level of doom here. I edited my comment to say "0.1% to 0.01% reduction in p(doom) per year of delay", which is hopefully clearer. The expected absolute level of doom is probably notably higher than 0.1% to 0.01%.
I don't. That's why I said "Similarly, I would potentially be happier to turn over the universe to aliens instead of AIs."
Also, note that I think AI takeover is unlikely to lead to extinction.
ETA: I'm pretty low confidence about a bunch of these tricky moral questions.
I would be reasonably happy (e.g. 50-90% of the value relative to human control) to turn the universe over to aliens. The main reduction in value is due to complicated questions about the likely distribution of values among aliens. (E.g., how likely is it that aliens are very sadistic or lack empathy? This is probably still not exactly the right question.) I'd also be pretty happy with (e.g.) uplifted dogs (dogs which are made to be as intelligent as humans while keeping the core of "dog", whatever that means) so long as the uplifting process was reasonable.
I think the exact same questions apply to AIs; I just have empirical beliefs that AIs which end up taking over are likely to do predictably worse things with the cosmic endowment (e.g. realizing 10-30% of the value). This doesn't have to be true; I can imagine learning facts about AIs which would make me feel a lot better about AI takeover. Note that conditioning on the AI taking over is important here. I expect to feel systematically better about smart AIs with long-horizon goals which are either not quite smart enough to take over or don't take over (for various complicated reasons).
More generally, I think I basically endorse the views here, which discuss the question of when you should cede power, etc.
Note that in my ideal future it seems really unlikely that we end up spending a non-trivial fraction of future resources running literal humans instead of finding better things to spend computational resources on (e.g. beings with experiences that are wildly better than our experiences, or beings which are vastly cheaper to run).
(That said, we can and should let all humans live for as long as they want and dedicate some fraction of resources to basic continuity of human civilization insofar as people want this. 1/10^12 of the resources would easily suffice from my perspective, but I'm sympathetic to making this more like 1/10^3 or 1/10^6.)
I think "identify" is the wrong word from my perspective. The key question is "what would the smart behavioral clone do with the vast amount of future resources". That said, I'm somewhat sympathetic to the claim that this behavioral clone would do basically reasonable things with future resources. I also feel reasonably optimistic about pure imitation LLM alignment for somewhat similar reasons.
Am I ignoring this case? I just think we should treat "what do I terminally value?"[1] and "what is the best route to achieving that?" as mostly separate questions. So, we should talk about whether "high discount rates due to epistemic uncertainty" is a good reasoning heuristic for achieving my terminal values separately from what my terminal values are.
Separately, I think a high per-year discount rate due to epistemic uncertainty seems pretty clearly wrong. I'm pretty confident that I can influence, at least to a small degree (e.g. I can affect the probability by >10^-10, probably much more), whether or not the moral equivalent of 10^30 people are tortured in 10^6 years. It seems like a very bad idea from my perspective to put literally zero weight on this because of a 1% annual discount rate.
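(To make the arithmetic concrete, as a rough illustration using the numbers above: a 1% annual discount compounded over 10^6 years leaves a weight of

$$0.99^{10^6} = e^{10^6 \ln 0.99} \approx e^{-10050} \approx 10^{-4365},$$

so even an expected stake on the order of 10^-10 × 10^30 = 10^20 person-equivalents would be multiplied down to effectively nothing under that discount.)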
For less specific things like "does a civilization descended from and basically endorsed by humans exist in 10^6 years", I think I have considerable influence. E.g., I can affect the probability by >10^-6 (in expectation). (This influence is distinct from the question of how valuable this is to influence, but we were talking about epistemic uncertainty here.)
My guess is that we end up with basically a moderate fixed discount on very-long-run future influence due to uncertainty over how the future will go, but this is more like 10% or 1% than 10^-30. And, because the long-run future still dominates in my views, this just multiplies through all calculations and ends up not mattering much for decision making. (I think acausal trade considerations implicitly mean that I would be willing to trade off long-run considerations in favor of things which look good as weighted by current power structures (e.g. helping homeless children in the US) if I had a 1,000x-10,000x opportunity to do this. E.g., if I could stop 10,000 US children from being homeless with a day of work and couldn't do direct trade, I would still do this.)
[1] More precisely, what would my CEV (Coherent Extrapolated Volition) want, and how do I handle uncertainty about what my CEV would want?