The Case for Superintelligence Safety As A Cause: A Non-Technical Summary

HunterJay


7 min read · May 21, 2019


beth

Thank you for this nice summary of the argument in favour of AI Safety as a cause. I am not convinced, but I appreciate your write-up. As you asked for counterarguments, I'll try to describe some of my gripes with the AI Safety field. Some of them concern the apparent lack of awareness of results in adjacent fields, which makes me doubt whether any of it would stand up to scrutiny from people more knowledgeable in those areas. I also have a number of issues with the argument itself.

Where does it end? Well, eventually, at the theoretical limits of computation. These theoretical limits are very, very high - without even getting close to the limit, a 10kg computer could do more computation every hour than 10 billion human brains could do in a million years. (And a superintelligence wouldn’t be limited to just 10kg). At that point, we are talking about something that can essentially do anything that is allowed by the laws of physics - something so incredibly smart it’s comparable to a civilisation millions of years ahead of us.

The theoretical limits of computation are upper bounds; we don't know if it is possible to achieve them for any kind of computation, let alone for general computation. Moreover, having a lot of computational power probably doesn't mean that you can calculate everything. Many real-world problems are so hard to approximate that adding more computational power doesn't meaningfully help, for example computing approximate Nash equilibria or finding good layouts for microchip design. It is not clear that having a lot of computing power translates into relevant superior capabilities.

We don’t yet know how to program any high-level human concept like morality, love, or happiness - the difficulty is in nailing down the concept to the kind of mathematical language a computer can understand before it becomes superintelligent.

There is a growing literature on making algorithms fair, accountable and transparent. This is a collaborative effort between researchers in computer science, law and many other fields. There are so many similarities between this and the professed goals of the AI Safety community that it is strange that no cross-fertilization is happening.

The problem is Instrumental Convergence.

You can't just ask the AI to "be good", because the whole problem is getting the AI to do what you mean instead of what you ask. But what if you asked the AI to "make itself smart"? On the one hand, instrumental convergence implies that the AI should make itself smart. On the other hand, the AI will misunderstand what you mean, hence not making itself smart. Can you point the way out of this seeming contradiction?

So a superintelligence could be super powerful and super dangerous if and when we are able to build it.

AI Safety would be a worthy cause if a superintelligence were powerful and dangerous enough to be an issue but not so powerful and dangerous as to be uncontrollable. A solution has to be necessary, but it also has to exist. Thus, there is a tension between scale and tractability here. Both Bostrom and Yudkowsky only ever address one thing at a time, never acknowledging this tension.

If it takes off slow enough, we’ll have time to figure out how to make it safe after we create the first superintelligence, which would be very handy indeed. Unfortunately, it turns out nobody agrees on that either.

Most estimates on take-off speed start counting from the point that the AI is superintelligent. Why wait until then? A computer can be reset, so if you had a primitive AGI specimen you'd have unlimited tries to spot problems and make it behave.

I'd say that a 0.0001% chance of a superintelligence catastrophe is a huge over-estimate. Hence, AI Safety would be an ineffective cause area if you hold a person-affecting view. If you don't, then at least this opens the way for the kind of counterarguments used against Pascal's Mugging.

Tetraspace

You can't just ask the AI to "be good", because the whole problem is getting the AI to do what you mean instead of what you ask. But what if you asked the AI to "make itself smart"? On the one hand, instrumental convergence implies that the AI should make itself smart. On the other hand, the AI will misunderstand what you mean, hence not making itself smart. Can you point the way out of this seeming contradiction?

(Under the background assumptions already being made in the scenario where you can "ask things" to "the AI":) If you try to tell the AI to be smart, but fail and instead give it some other goal (let's call it being smart'), then in the process of becoming smart' it will also try to become smart, because no matter what smart' actually specifies, becoming smart will still be helpful for that. But if you want it to be good and mistakenly tell it to be good', it's unlikely that being good will be helpful for being good'.

HunterJay

Sorry for the delay on this reply. It’s been a very busy week.

Okay, so, to be clear -- I am making the argument that superintelligence safety is an important area that is underfunded today, and you are arguing that extinction caused by superintelligence is so unlikely that it shouldn’t be a concern. Is that accurate?

With that in mind, I’ll go through your points here one by one, and then attempt to address some of the arguments in your blog posts (though the first post was unavailable!).

The theoretical limits of computation are upper bounds; we don't know if it is possible to achieve them for any kind of computation, let alone for general computation. Moreover, having a lot of computational power probably doesn't mean that you can calculate everything. Many real-world problems are so hard to approximate that adding more computational power doesn't meaningfully help, for example computing approximate Nash equilibria or finding good layouts for microchip design. It is not clear that having a lot of computing power translates into relevant superior capabilities.

I agree with you here. My reason for bringing this up in the main post was to show that superintelligence is possible under today’s understanding of physics. Raw computation is not intelligence by itself, we agree, but it is one requirement for it. I was just pointing out that the computation that could be done in a small amount of matter is much larger than the computation done in the brain (and that the brain’s computation is arranged in a pattern we call general intelligence).

There is a growing literature on making algorithms fair, accountable and transparent. This is a collaborative effort between researchers in computer science, law and many other fields. There are so many similarities between this and the professed goals of the AI Safety community that it is strange that no cross-fertilization is happening.

I didn’t mention a lot of good research relevant to safety, and progress is certainly being made in many independent directions. I agree; I would also like to see more of a crossover, though I really don’t know how much the two areas already build on each other’s progress. I’d be surprised if it were zero. Regardless, even if it were zero, that would show poor communication rather than say anything about the concerns being wrong.

You can't just ask the AI to "be good", because the whole problem is getting the AI to do what you mean instead of what you ask. But what if you asked the AI to "make itself smart"? On the one hand, instrumental convergence implies that the AI should make itself smart. On the other hand, the AI will misunderstand what you mean, hence not making itself smart. Can you point the way out of this seeming contradiction?

I mean, there’s no rule that a superintelligence has to misunderstand you. And there’s no certainty instrumental convergence is correct. (I wouldn’t risk my life on either statement!) It’s just that we think being smarter would help achieve most goals, so we probably should expect a superintelligence to try and make itself smarter.

The other part is that we just don’t know how to guarantee that a superintelligence will do what we mean. (If you do know how to do this, that would be a huge relief.) Even in your example of trying to get a superintelligence just to make itself smarter, I certainly wouldn’t be confident it would do it in the way I expect -- I have enough trouble predicting how my programs today will run. For example, suppose I’d written a utility function for ‘smartness’ that actually just measured total bits flipped: I might not realise until afterwards, which wouldn't be good.
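To make the bits-flipped example concrete, here is a minimal hypothetical sketch in Python. The action names, the numbers, and both utility functions are invented for illustration and do not describe any real system: a naive optimiser that picks whichever action scores highest under the mis-specified ‘smartness’ measure ends up choosing an action that does nothing useful.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical self-modification the AI could choose (illustrative only)."""
    name: str
    capability_gain: float  # what we actually meant by "smarter" (invented units)
    bits_flipped: int       # what the mis-specified utility function measures


def proxy_utility(action: Action) -> float:
    """Mis-specified 'smartness' utility: rewards raw bit flips only."""
    return float(action.bits_flipped)


def intended_utility(action: Action) -> float:
    """What the designer had in mind: actual gain in capability."""
    return action.capability_gain


# Illustrative options; all names and numbers are made up for this sketch.
ACTIONS = [
    Action("improve reasoning algorithms", capability_gain=10.0, bits_flipped=1_000),
    Action("learn from more data", capability_gain=5.0, bits_flipped=5_000),
    Action("toggle one memory bank in a tight loop", capability_gain=0.0, bits_flipped=10**9),
]


def choose(actions, utility):
    """A naive optimiser: return whichever action scores highest under `utility`."""
    return max(actions, key=utility)


if __name__ == "__main__":
    print("optimising the proxy picks: ", choose(ACTIONS, proxy_utility).name)
    print("optimising the intent picks:", choose(ACTIONS, intended_utility).name)
```

Nothing in the proxy version raises an error or a warning; the mismatch only becomes visible if someone checks what was actually being optimised.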

AI Safety would be a worthy cause if a superintelligence were powerful and dangerous enough to be an issue but not so powerful and dangerous as to be uncontrollable. A solution has to be necessary, but it also has to exist. Thus, there is a tension between scale and tractability here. Both Bostrom and Yudkowsky only ever address one thing at a time, never acknowledging this tension.

I might be misunderstanding you here. Are you arguing that because superintelligence does not yet exist, it is not yet worthwhile to work on safety? Or are you arguing that we can’t be confident that a solution to alignment will work without a superintelligence to test it on?

If it’s the first, I would argue that there’s a major risk that we won’t find a solution in the period between creating a superintelligence and that superintelligence having enough power to be a big problem. Unless I was super confident this period would be very long, wouldn’t it make more sense to try to find a solution as early as possible?

I’d also argue that finding a solution early would mean it could be built into the design of a superintelligence from the start, rather than relying only on the class of solutions that would fit something that’s already been built.

If it’s the second, I agree -- it would be a much easier problem to solve if we had a ‘mini’-superintelligence to practice on, for sure. Figuring out how to do this is a part of safety research! How can we limit a superintelligence’s capabilities so it stays in this state? How can we predict what will happen as a weak superintelligence grows into a strong one? We still need to figure out how to do that as well, hence my call for research funding.

Most estimates on take-off speed start counting from the point that the AI is superintelligent. Why wait until then? A computer can be reset, so if you had a primitive AGI specimen you'd have unlimited tries to spot problems and make it behave.

I am not sure this is true; I’ve always read takeoff speed estimates as counting from the moment of human-level general intelligence, though I know many people imagine a human-level AGI as having access to current narrow superintelligence (as in, max[human, current computer] abilities at each task). Maybe that’s it.

Regardless, as above, I hope we get that chance, though from the little research that has been done it looks like this might not be as safe as it sounds. We would have to be very, very good at determining the capability of an AGI, be confident that no other project is moving forward faster than us, and be confident that its behaviour will remain the same as its intelligence increases, which might be the trickiest one. For example, a near-human AGI might be able to predict that doing what humans want early on would make it more likely to achieve its goal later on, no matter what the goal actually is. So we haven’t avoided catastrophe, only added an instrumental goal of ‘behaving the way humans want me to until I have enough power to disregard them without being shut down’. Still, this is an open area of research and I hope it gets more funding and attention.

I'd say that a 0.0001% chance of a superintelligence catastrophe is a huge over-estimate. Hence, AI Safety would be an ineffective cause area if you hold a person-affecting view. If you don't, then at least this opens the way for the kind of counterarguments used against Pascal's Mugging.

I’ll get to your arguments for that figure below, but I want to clarify here that my estimate of superintelligence being built this century is in the double digits percentage-wise, and that if it's built before we solve alignment it is almost certain to be dangerous. I'm not relying on very low probabilities of drastic outcomes, so Pascal's Mugging doesn't apply.

Onwards to some limited responses to your blog posts. I wasn’t entirely sure I understood your argument properly, so I’m going to try to list the main points here and see if you agree.

1. You argue that if the probability of an AI-related extinction event were large, and if a single AI-related extinction event could affect any lifeform in the universe ever, one should have already happened somewhere and we shouldn’t exist.

2. You argue that current safety research is ineffective -- we’d be able to work more effectively and cheaply if we waited until we were closer to developing superintelligence.

3. You believe that if a superintelligence was going to be built in the near future, and if it was going to be dangerous, it would probably result in a smaller scale catastrophe that would give us plenty of warning that a bigger catastrophe was coming.

4. You believe that there are numerous psychological reasons people are inclined to believe superintelligence is likely and dangerous, and so increase your skepticism of the claims because of that.

5. You argue that left to its own devices, regular commercial or academic research will be able to solve the problem.

If there’s a major point I’ve missed here, or if I’ve phrased these badly, do correct me! Anyway, let’s go through them.

1. You argue that if the probability of an AI-related extinction event were large, and if a single AI-related extinction event could affect any lifeform in the universe ever, one should have already happened somewhere and we shouldn’t exist.

If the probability of broadcasting radio into space were large, we should have already detected alien radio. (Since radio would also spread at the speed of light in all directions, and be distinct from natural events). I don’t believe this is strong evidence against the hypothesis that superintelligence (or radio) is possible and dangerous, though I suppose it’s evidence that there are no other advanced civilisations within our past light cone.

2. You argue that current safety research is ineffective -- we’d be able to work more effectively and cheaply if we waited until we were closer to developing superintelligence.

It is hard to say how effective current safety research is, for sure. If anything, the limited progress should make us think this problem is very hard and make us way less confident about being able to solve it in a short period of time in the future. Particularly since some aspects of safety get harder to implement the longer we wait -- building culture and institutions that consider the issue when setting up their AGI projects, for instance.

3. You believe that if a superintelligence was going to be built in the near future, and if it was going to be dangerous, it would probably result in a smaller scale catastrophe that would give us plenty of warning to do safety research to prevent a larger one at that point.

If the time period between a small scale catastrophe and a large one is small, we shouldn’t be confident that we can solve safety in time -- especially if you are right about a small scale catastrophe being evidence we are nearing superintelligence.

Additionally, if there exist large scale failure modes that are wholly different to any small scale failure mode, we shouldn’t expect learning from small scale catastrophes to help us prevent larger ones.

Alternatively, we might even make large scale failures harder to detect by patching small failures -- for example, we might think we’ve prevented a superintelligence from trying to escape onto the internet, but we’ve really just made escaping so hard that only a strong superintelligence could manage it.

4. You believe that there are numerous psychological reasons people are inclined to believe superintelligence is likely and dangerous, and so increase your skepticism of the claims because of that.

Humanity’s general lack of concern about climate change or nuclear weapons (before they were caused or created) would indicate to me that the psychological trends go in the other direction, at least for most people. Regardless, I would certainly agree with being really skeptical about extraordinary claims.

I would argue that it’s an extraordinary claim both ways. Either superintelligence is not that hard to build, or there is something so incredibly complicated and special about biological general intelligence that even with billions of dollars of funding per year for a hundred years, we won’t manage to replicate it - even as we replicate other aspects of biological intelligence (like vision, or motor control).

You might argue, fairly, that this is more likely, but do you really believe it is billions of times as likely?

I’m not sure if your main disagreement is with superintelligence being built at all, or with it being dangerous, so let’s look at that quickly too. If we are skeptical of superintelligence being dangerous because it seems extraordinary, we should also be skeptical of the extraordinary claim that a superintelligence would be safe and good by default. (If it is not safe by default, we have already discovered how difficult it is to specify safe behaviour.)

5. You argue that left to its own devices, regular commercial or academic research will be able to solve the problem.

I really hope so.

Commercially, building a superintelligence (or rather, every step towards superintelligence) would be extremely profitable. But since safety research would take some of your best minds away from building it, the incentives are in the wrong direction. Whoever spends the least on safety has the largest proportion of their resources to spend on development.

As far as regular academic research goes, it’s more hopeful, but the number of people working on safety in traditional academia is very, very low. How confident can we be that this low output would be enough to solve the problem prior to building a superintelligence -- especially given how difficult we’ve found it to be so far -- and considering how many ambitious researchers are working on building a superintelligence as soon as possible? Perhaps money would be best spent persuading those researchers to consider safety; I don’t know.

To conclude, I want to lay out what would change my mind:

If progress on computer hardware and software seemed very likely to halt (or slow dramatically) in the near future.

If our current understanding of neuroscience turned out to be wrong, and we could show that simulating general purpose computation required far more computation than the brain’s cells do -- perhaps the brain uses hard-to-compute actions on the level of atoms or smaller, rather than something that could be done in abstract models of cells.

If somebody was able to disprove (or provide very strong evidence against) the orthogonality thesis and instrumental convergence thesis.

If no project was working on building superintelligence.

Otherwise, it seems very much like we could have the capability of simulating and optimising a general intelligence in the near future, and that this could be very dangerous.

beth

Let me try to rephrase this part, as I consider it to be the main part of my argument and it doesn't look like I managed to convey what I intended to:

AI Safety would be a worthy cause if a superintelligence were powerful and dangerous enough to be an issue but not so powerful and dangerous as to be uncontrollable.

The most popular cause evaluation framework within EA seems to be Importance/Neglectedness/Tractability. AI Safety enthusiasts tell a convincing story on importance and neglectedness being good and make an effort at arguing that tractability is as well.

But here is the thing: all arguments given in favour of AI being risky (to establish importance) can be rephrased as arguments against tractability. Similarly for neglectedness.

I'll illustrate this with a caricature, but it takes little effort to transfer this line of thought to the real arguments being made. Let's say the pro-AIS argument is "AGI will become infinitely smart, so it can out-think all humans and avoid all our security measures. Hence AGI is likely to escape any restrictions we put on it, so it will be able to tile the universe with paperclips if it wants to". Obviously, if it can outsmart any security measure, then no sufficient security exists, AI Safety research will never lead to anything, and the problem is intractable.

AI Safety is only effective if you can simultaneously argue for each of importance/neglectedness/tractability without detracting from the others. Moreover, your arguments have to address the exact same scenarios. It is not enough for AIS to be important with 50% probability and tractable with 50% probability; these two properties have to be likely to hold simultaneously. A coin flip has 50% probability of heads and 50% probability of tails, but they will never happen at the same time.
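The coin-flip point is a standard fact of probability: knowing P(A) and P(B) separately only pins P(A and B) down to the range from max(0, P(A) + P(B) - 1) to min(P(A), P(B)), the so-called Fréchet bounds. A minimal Python sketch, where the function name and the 50% figures are purely illustrative and "important"/"tractable" are just labels, not estimates of anything:

```python
def joint_probability_bounds(p_a: float, p_b: float) -> tuple[float, float]:
    """Fréchet bounds: the range of P(A and B) consistent with P(A) and P(B) alone."""
    if not (0.0 <= p_a <= 1.0 and 0.0 <= p_b <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    lower = max(0.0, p_a + p_b - 1.0)  # A and B overlap as little as possible
    upper = min(p_a, p_b)              # one event is contained in the other
    return lower, upper


if __name__ == "__main__":
    # Illustrative numbers from the coin-flip comparison.
    p_important, p_tractable = 0.5, 0.5
    low, high = joint_probability_bounds(p_important, p_tractable)
    print(f"P(important and tractable) lies somewhere in [{low}, {high}]")
    print(f"only if the two are independent is it {p_important * p_tractable}")
    # Heads and tails of a single coin sit at the lower bound: 0.5 each, joint probability 0.
```

With 50% each, the joint probability can be anything from 0 (the heads/tails case) to 50%; the familiar 25% only follows if the two properties are assumed independent, which is exactly the assumption the argument above questions.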

AI Safety can only be an effective cause (on the margin) if solving it is possible (tractability) but not trivial (importance/neglectedness). I think this is a narrow window to hit, and current arguments are all way off-target.

HunterJay

Ah, thanks for rephrasing that. To make sure I’ve got this right - there’s a window between something being ‘easy to solve’ and ‘impossible to solve’ that a cause has to exist in to be worth funding. If it were ‘easy to solve’ it would be solved in the natural course of things, but if it were ‘impossible to solve’ there’s no point working on it.

When I argue that AGI safety won’t be solved in the normal course of AGI research, that is an argument that pushes it towards the ‘impossible’ side of the tractability scale. We agree up to this point, I think.

If I’ve got that right, then if I could show that it would be possible to solve AGI safety with increased funding, you would agree that it’s a worthy cause area? I suppose we should go through all the literature and judge for ourselves if progress is being made in the field. That might be a bit of a task to do here, though.

For the sake of argument, let’s say technical alignment is a totally intractable problem; what then? Give up and let extinction happen? If the problem does turn out to be impossible to solve, then no other cause area matters either because everybody is dead. If the problem is solvable, and we build a superintelligence, then still no other cause area matters because a superintelligence would be able to solve those problems.

This is kind of why I expected your argument to be about whether a superintelligence will be built, and when. Or about why you think that safety is a more trivial problem than I do. If you’re arguing the other way -- that safety is an impossible problem -- then wouldn’t you instead argue for stopping it being built in the first place?

I don’t know how tractable technical alignment will turn out to be. There has been some progress, but my main takeaway has been “We’ve discovered X, Y, and Z won’t work.” If there is still no solution as we get closer to AGI being developed, then at least we’ll be able to point to that failure to try and slow down dangerous projects. Maybe the only safe solution will be to somehow increase human intelligence, rather than creating an independent AGI at all, I don’t know.

On the other hand, it might be totally solvable. It’s theoretical research, we don’t know until it’s done. If it is easily solved, then the problem becomes making sure that all AGI projects implement the solution, which would still be an effective cause. In either case, marginal increases in funding wouldn't be wasted.

beth

Thank you for your response.

Yes, that is what I meant. If you could convince me that AGI Safety were solvable with increased funding, and only solvable with increased funding, that would go a long way in convincing me of it being an effective cause.

In response to your question of giving up: If AGI were a long way off from being built, then helping others now is still a useful thing to do, no matter if either of the scenarios you describe were to happen. Sure, extinction would be bad, but at least from some person-affecting viewpoints I'd say extinction is not worse than existing animal agriculture.

HunterJay

Thanks for your response! I just wanted to let you know I'm taking the time to read your links and write out a well thought out reply, which might take another evening or two.

Aaron Gertler 🔸

Do you still plan to publish a reply at some point?

HunterJay

Yes, apologies for the delay, it's been a hectic week! Will hopefully post tomorrow.
