As will be very clear from my post, I'm not a computer scientist. However, I am reasonably intelligent and would like to improve my understanding of AI risk.

As I understand it (please do let me know if I've got this wrong), the risk is that:

  • an AGI could rapidly become many times more intelligent and capable than a human: so intelligent that its relation to us would be analogous to our own relation to ants. 
  • such an AGI would not necessarily prioritise human wellbeing, and could, for example, decide that its objectives were best served by the extermination of humanity.

And the mitigation is:

  • working to ensure that any such AGI is "aligned," that is, is functioning within parameters that prioritise human safety and flourishing. 

What I don't understand is why we (the ants in this scenario) think our efforts have any hope of being successful. If the AGI is so intelligent and powerful that it represents an existential risk to humanity, surely it is definitionally impossible for us to rein it in? And therefore surely the best approach would be either to prevent work to develop AI (honestly this seems like a nonstarter to me, I can't see e.g. Meta or Google agreeing to it), or to accept that our limited resources would be better applied to more tractable problems?

Any thoughts very welcome, I am highly open to the possibility that I'm simply getting this wrong in a fundamental way.

Epistemic status: bewitched, bothered and bewildered.


Let's explore the ant analogy further. 

The first thing to note is that we haven't killed all the ants. We haven't even tried. We kill ants only when they are inconvenient to our purposes. There is an argument that any AGI would always kill us all in order to tile the universe or whatever, but this is unproven and, IMO, false, for reasons I will explore in an upcoming post.

Secondly, we cannot communicate with ants. If we could, we could actually engage in mutually beneficial trade with them, as this post notes. 

But the most important difference between the ant situation and the AI situation is that the ants didn't design our brains. Imagine if ants had managed to program our brains in such a way that we found ants as cute and loveable as puppies, and found causing harm to ants to be as painful as touching a hot stove. Would ants really have much to fear from us in such a world? We might kill some ants when it was utterly and completely necessary, but mostly we would just do our own thing and leave the ants alone.

I recognise that the brains of any AI will have been designed by humans, but the gap in puissance between humans and the type of AGI imagined and feared by people in EA (as outlined in this blog post, for example) is so extreme that the fact of us having designed the AGI doesn't seem hugely relevant.

Like if a colony of ants arranged its members to spell out in English "DONT HURT US WE ARE GOOD" humans would probably be like huh, wild, and for a few days or weeks there would be a lot of discussion about it, and vegans would feel vindicated, and Netflix would greenlight a ripoff of the Bachelor where the bachelor was an ant, but in general I think we would just continue as we were and not take it very seriously. Because the ants would not be communicating in a way that made us believe they were worthy of being taken seriously. And I don't see why it would be different between us and an AGI of the type described at the link above.

The thing is, the author of that post kind of agrees with you. In other places he has put the probability of AI extinction at 1, and is desperately trying to come up with any way to prevent it.

On the other hand, I think the model of AI put forward in that post is absurdly unlikely, and the risk of AI extinction is orders of magnitude lower. AI will not be a single-minded, fanatical utilitarian focused on a fixed goal, and is likely to absorb at least a little bit of our values.

Oh, no, to be clear I find the post extremely unpersuasive - I am interested in it only insofar as it seems to represent received wisdom within the EA community.

I think the idea is that if we can get the A.I. to have the right values, then it won't matter if it could theoretically take over and overpower us, because it won't want to. A more sinister variant of this, which I suspect a lot of people at MIRI believe in, and perhaps Bostrom also (though I have no direct evidence of this, other than a vague sense from things I've seen said over the years), is that if we can get an A.I. with the right values, it would be great if it took over and optimized everything towards those values (sure, power would corrupt humans, but that's not a fact about all possible minds, and genuinely having the right values would prevent this). I am not terribly worried in itself about MIRI people believing the latter, because I don't think they'll build AGI, but I am a little worried about people at DeepMind (who I think take MIRI people, or at least Yudkowsky, more seriously than you'd intuitively guess) taking up these ideas. (Though I am much less confident than most EAs that world-changing A.I. is imminent.)

Thank you, that is helpful. I still don't see, I think, why we think an AGI would be incapable of assessing its own values and potentially altering them, if it's intelligent enough to be an existential risk to humanity - but we're hoping that the result of any such assessment would be "the values humans instilled in me seem optimal"? Is that it? Because then my question is which values exactly we're attempting to instill. At the risk of being downvoted to hell I will share that the thought of a superpowerful AI that shares the value system of e.g. LessWrong is slightly terrifying to me. Relatedly(?) I studied a humanities subject :)

Thank you again!

I think the idea is that it will only change its values in a particular direction if doing so would help it realise its current values. So it won't change its values if doing so would mean that it would do horrible things according to its current values. A philosophical thing lurking in the background is that you can't work out the correct values just by good thinking; rather, basic starting values are thinking-independent, as long as you're consistent: no amount of intelligence and reasoning will make you arrive at the correct ones. (They call this the "orthogonality thesis", but a similar idea is known in academic philosophy as Humeanism about moral motivation. It's quite mainstream but not without its critics.)

'...the thought of a superpowerful AI that shares the value system of e.g. LessWrong is slightly terrifying to me.'

Old post, but I've meant to say this for several months: Whilst I am not a fan of Yudkowsky, I do think that his stuff about this showed a fair amount of sensitivity to the idea that it would be unfair if a particular group of people just programmed their values into the AI, taking no heed of the fact that humans disagree. (Not that that means there is no reason to worry about the proposal to build a "good" AI that runs everything.)

His original (since abandoned, I think) proposal was that we would get the AI to have a goal like 'maximize things all or pretty much all fully informed humans would agree are good, minimize things all or almost all fully informed humans would agree are bad, and where humans would disagree on whether something is good or bad even after being fully informed of all relevant facts, try to minimize your impact on that thing and leave it up to humans to sort out amongst themselves.' (Not an exact rendition, but close enough for present purposes.) Of course, there's a sense in which that still embodies liberal democratic values about what is fair, but I'm guessing that if you're a contemporary person with a humanities degree, you probably share those very broad and abstract values.

If the AGI is so intelligent and powerful that it represents an existential risk to humanity, surely it is definitionally impossible for us to rein it in? And therefore surely the best approach would be ... to prevent work to develop AI

I'm starting to think that this intuition may be right (further thoughts in linked comment thread).

Cheers! Here's to being first against the wall when the basilisk comes.

The case for risk that you sketch isn't the only case that one can lay out, but if we are focussing on this case, then your response is not unreasonable. But do you want to give up or do you want to try? The immediate response to your last suggestion is surely: Why devote limited resources to some other problem if this is the one that destroys humanity anyway?

You might relate to the following recent good posts:

"But do you want to give up or do you want to try?"

I suppose my instinctive reaction is that if there's very little reason to suppose we'll succeed we'd be better off allocating our resources to other causes and improving human life while it exists. But I recognise that this isn't a universal intuition.

Thank you for the links, I will have a look :)
