
As a computer science student, I've often been asked to give my thoughts on AI, so I have plenty of opportunities to explain my objections to superintelligence.

I cannot think of a single time when I've failed to convince someone of the dangers, at least not since I started using this approach. It even works on other computer science students, including those who've spent more time studying AI than I have.

I was worried that this might sound too much like a previous post I've written, but at least the new title should be useful for circulating this information.

The Pitch

AI often makes mistakes.

One kind of mistake that AI often makes is called "reward hacking". The way AI typically works is that you give it a reward function that grants the AI points for doing what you want it to do, but often, the reward function can be satisfied without doing what you actually wanted. For example, someone gave a Roomba the task of not bumping into any walls. It learned to drive backwards, because it had no bump sensors on the back. You could also imagine a robot that's programmed to get you a cup of coffee. On the way there, it'll step on a baby, because we never told it to care about the baby.
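To make the Roomba example concrete, here's a toy sketch in Python. This is purely my own illustration of reward hacking, not code from any real system; the function names and numbers are made up:

```python
# Toy illustration of reward hacking (hypothetical, simplified).
# The Roomba is penalized for bumping into walls, but its bump
# sensor only faces forward, so driving backwards "satisfies"
# the reward function while still hitting walls.

def sensed_bumps(direction: str, wall_hits: int) -> int:
    """Only forward collisions trigger the front-mounted sensor."""
    return wall_hits if direction == "forward" else 0

def reward(direction: str, wall_hits: int) -> int:
    """Intended: penalize every wall collision.
    Actual: penalizes only the collisions the sensor can see."""
    return -sensed_bumps(direction, wall_hits)

# Both policies hit the wall 5 times, but only one gets punished:
print(reward("forward", wall_hits=5))   # -5: penalized as intended
print(reward("backward", wall_hits=5))  #  0: same damage, no penalty
```

The agent never "cheats" in any deliberate sense; the reward function simply measures a proxy (sensor readings) rather than the thing we actually care about (collisions), and optimizing the proxy diverges from the goal.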

It gets worse when the AI is incredibly smart, as long as it still has ways of affecting the real world without any human intervention. The worst-case scenario is if you told an AI to maximize paperclip production. If it's really smart, it will realize that we don't actually want this; we don't want to turn all of our electronic devices into paperclips. But it's not going to do what you wanted it to do; it will do what you programmed it with.[1] So, if the AI realizes that you would try to stop it from pursuing its goal, it might try to get rid of you first.

And, importantly, there would be no hope of us outsmarting it. It would be like a chimpanzee trying to outsmart a human. If we really wanted to, we could wipe out every chimpanzee; we're kind of doing it already by accident, and we're not even that much smarter than chimpanzees. Now imagine an AI with that same difference in intelligence, but one that actually wanted to kill us. That's what we should be worried about.

Not Everybody Should Do Outreach

I was originally going to write this as part of a much longer post, tentatively titled "AI Safety isn't Weird But You Are"[2], but I was convinced it would be better to release a shorter post just to get this idea out there, so that more people read it and the information spreads more quickly. However, I do want to briefly address something tangentially related to this topic. PUBLIC COMMUNICATIONS ON AI SAFETY SHOULD ABSOLUTELY NOT BE DONE BY RACISTS, NAZIS, SEX OFFENDERS, OR ANY OTHER DEPLORABLE PEOPLE.[3] I hope this point is obvious to most people reading this, but I have seen actual debate over the question.[4] I'd like to think that very few people disagree, but I'm not sure. I've seen people say they think AI alignment is too important for "petty arguments" about things like human rights. If they actually believed this, I'd expect them to condemn every racist in the community[5], to prevent the community from gaining a reputation for being okay with racism.

Tip: Talk About Already Existing Problems in AI

One common criticism I see of the AI safety community from outsiders is that we don't talk about the current dangers of AI. I think this is because, as effective altruists, we don't see those as the most pressing issues. However, these topics are still worth discussing. Not only would this make the objection go away, it also makes people more willing to think about the future dangers of AI.

Lots of people have heard about Apple's Face ID failing to distinguish between two different Black faces. Now imagine an AI like this being used to determine who is a civilian and who is a militant, and it starts assuming that everyone with a particular skin tone is a militant. I've never used this particular hypothetical as an argument before, but I can imagine it would be very convincing.

If more of our arguments looked like this, I imagine it would boost the community's credibility. Of course, this isn't the worst-case scenario that we tend to talk about, but it makes the worst-case scenario seem more credible as well. If this doesn't convince them, I don't know what would. And as an added bonus, if you're talking to someone who still won't find the worst-case scenario credible, they should at least be concerned enough by the present-day harms to care about the dangers of AI.

How Often Does This Work?

I thought it would be worth showing how well this strategy works. In case the title implied otherwise, I won't rigorously explain why it works; I'm a computer scientist, not a psychologist. But I think it works because it builds on knowledge people already have: people already know that AI fails sometimes, so it's pretty easy to take that to its logical conclusion.

The point of this section is instead to give a list of times when I've used this technique, in the hope that it's enough to convince you that it's a pretty useful one.

The first time I tried this was at a club fair to promote my effective altruism club at the Rochester Institute of Technology. On our table, we showed some posters we had designed for upcoming discussions, and the one about AI Safety caught at least one person's eye.

[Image: an AI safety poster, with a subtitle saying, "Artificial Intelligence is getting better, so how do we make sure it's used for good?"]

This person asked me why there was a meeting dedicated to artificial intelligence, and I had to come up, on the spot, with a good explanation that didn't sound paranoid or crackpottish. It took some thinking, and I don't remember the exact words, but it sounded something like this:

"As artificial intelligence gets smarter, eventually it's going to have the ability to do serious harm to people, so we want to make sure that it's doing what we want it to do, and not mistaking our intentions or being programmed with bad intent."

The person didn't exactly seem enthusiastic, but they accepted that response.

During the meeting, someone from the AI club on campus stopped by to invite us to give a talk about AI safety. Club alumnus Nick and I prepared the presentation using the principle that 3Blue1Brown uses: gently guide the audience to the conclusion, making it seem as though they came up with the idea themselves. You can watch the full presentation on YouTube.

During the presentation, there were no obvious objections from the audience, and many people seemed to be in clear agreement. When I talked to the club's executive board afterward, they were all very interested, and we kept talking about it for roughly an additional half hour.

For context: after our club meetings, we often head somewhere for dinner in what we call Tangent Time, where we discuss anything we didn't get to during the meeting, either because it was off-topic or because we ran out of time. I don't remember exactly how we got onto the topic, but AI came up, and that day we had some new members with us. Nick gave the least convincing argument possible for why AI is dangerous: asserting it as true. This is the argument-by-assertion fallacy, and it did not convince one of our new members (who, it's worth noting, is in a master's program for computer science). So I interjected with the short form of the pitch I've already described, and they quickly turned around, saying, "I liked their explanation. That convinced me."

I've been volunteering with the Eastern Service Workers Association lately. What they do doesn't matter for this post, but in the future I plan to write about what effective altruist organizations can learn from their unreasonable efficiency. Anyway, I don't have a car, so I rely on the other volunteers to graciously drive me back to my apartment from the office. Knowing that I'm a computer science student, people have asked me for my thoughts on AI on more than one occasion. Using the same pitch, I've been able to convince them that there is a legitimate danger from superintelligent AI.

Last semester, I took a "Historical and Ethical Perspectives in Computer Science" class. Our last couple of discussions, and the final paper, were focused on AI. I talked about AI safety in the final paper, but I have no way of knowing whether the professor agreed with me or not. I also brought it up in one of the discussions. One person did have a counterargument. I don't remember exactly what it was, but it turned out they had misunderstood my point.[6] When I clarified, they gave a "hmm", which I'll take as reluctant agreement. Another student said she thinks this is all a long way off, which I agreed with.

None of the people I've been talking to are LessWrong readers. They're just people I happened to meet as a computer scientist and effective altruist. It's possible that people are more receptive to these arguments now than when Yudkowsky started working on LessWrong. AI is a lot more advanced now; there have been several strikes over the past couple of years centered on the use of AI in movies and television, and AI is much more accessible, which lets people see its flaws firsthand. But while writing this, I do have to wonder what could have been if Yudkowsky had spent more time developing his arguments rather than creating the rationality community.

  1. ^

    This connects the problem of AI to the more mundane problem of computers doing what you tell them to do, which is not necessarily the same as what you wanted them to do. See: any piece of software with a bug in it, a.k.a. all of them.

  2. ^

    If you're curious as to what that post is even supposed to be about, just imagine me screaming into a pillow.

  3. ^

    Yes, I called Nazis deplorable. I have no problem with offending any bigoted readers.

  4. ^

    I'm not going to link to any discussions of this topic, because that feels like legitimizing it to me. For context, I will link to this blog post: Reaction to "Apology for an Old Email".

  5. ^

    For the record, there are much better reasons for why we should do this that I think are obvious. All I'm doing here is debunking one specific argument.

  6. ^

    I do at least remember that the clarification I gave was that my concern doesn't apply to a tool like ChatGPT, because it can't act on the world on its own. All it can do is give you a block of text, and it's up to you to decide what to do with that information. Hooking ChatGPT up to control a tank would be a huge concern, though.



Comments

Thank you for this!
I'm not an expert, but I've read enough argumentation theory and psychology of reasoning in the past that I want to comment on your pitch to explain what I think makes it work.

Your argument is well constructed in that it starts with evidence ("reward hacking"), proceeds to explain how we go from the evidence to the claim (something called the Warrant in one argumentation theory), then clarifies the claim. This is rare. Most of the time, people make the claim, give the evidence, and either forget the explanation of how we go from here to there or get into a frantic misunderstanding when addressing this point. You then end by addressing a common objection ("We'll stop it before it kills us").

Here's the passage where you explain the warrant:

If it's really smart, it will realize that we don't actually want this. We don't want to turn all of our electronic devices into paperclips. But it's not going to do what you wanted it to do, it will do what you programmed it with.

This is called (among other things) an argument by dissociation, and it's good (actually, it's the only proper way to explain a warrant that I know of). I've seen this step phrased in several ways in the past, but this particular chaining (AI will understand you want X. AI will not do what you want. Because it does what it's been programmed with, not what it understands you to want. These two are distinct) articulates it way better than the other instances I've seen; it forced me to make the crucial fork in my mental models between "what it's programmed for" and "what you want". It also does away with the "But the AI will understand what I really mean" objection.

I think that part of your argument's strength is due to you seemingly (from what I can guess) adopting a collaborative posture when making it. You insert elements in a very smooth way, detail vivid examples, and I can imagine that you make sure your tone and body language do not seem to presume an interlocutor's lack of intelligence or knowledge (something that is left too often unchecked in EA/world interactions).

Some research strongly suggests that interpersonal posture is of the utmost importance when introducing new ideas, and I think this explains a lot of why people would rather be convinced by you than by someone else.

TIL that a field called "argumentation theory" exists, thanks!

Thanks for your response! It's cool to see that there is science supporting this approach. The step-by-step journey from what we already know to the conclusion was very important to us. I noticed a couple of years ago that I tend to dismiss people's ideas very quickly, and since then I've been making the effort to not be too narcissistic.

Executive summary: The author, a computer science student, has developed an effective explanation for convincing people about the dangers of artificial general intelligence, emphasizing how AI systems can misinterpret human values and intentions.

Key points:

  1. AI systems often exhibit "reward hacking", satisfying their reward functions through unintended means. Examples highlight risks.
  2. Superintelligent systems would be extremely dangerous if empowered to affect the real world without human oversight.
  3. The pitch explains inherent flaws in AI value alignment through relatable examples.
  4. Outreach on AI safety should exclude participation by deplorable people to maintain credibility.
  5. Discussing current AI harms boosts worst-case scenario credibility. Example given.
  6. The explanation has proven effective in convincing various audiences of AI dangers. Several examples provided.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
