My personal cruxes for working on AI safety

Buck


Thanks a lot for this great post! I think the part I like the most, even more than the awesome deconstruction of arguments and their underlying hypotheses, is the sheer number of times you said "I don't know" or "I'm not sure" or "this might be false". I feel it places you at the same level as your audience (including me), in the sense that you have more experience and technical competence than the rest of us, but you still don't know THE TRUTH, or sometimes even good approximations to it. And the standard way to present ideas and research clearly is to structure them so that the points we don't know are not the focus. So that was refreshing.

On the more technical side, I had a couple of questions and remarks concerning your different positions.

One underlying hypothesis that was not explicitly pointed out, I think, was that you are looking for priority arguments. That is, part of your argument is about whether AI safety research is the most important thing you could do (it might be so obvious in an EA meeting or the EA forum that it's not worth exploring, but I like making the obvious hypotheses explicit). But that's different from whether or not we should do AI safety research at all. That is one common criticism I have about taking effective altruism career recommendations at face value: we would not have, for example, pure mathematicians, because pure mathematics is never the priority. Whereas you could argue that without pure mathematics, almost all the positive technological progress we have now (from quantum mechanics to computer science) would not exist. (Note that this is not an argument for having a lot of mathematicians, just an argument for having some.)
For the problems-that-solve-themselves arguments, I feel like your examples have very "good" qualities for solving themselves: both personal and economic incentives are against them, they are obvious when one is confronted with the situation, and at the point where the problems become obvious, you can still solve them. I would argue that not all these properties hold for AGI. What are your thoughts about that?
About the "big deal" argument, I'm not sure that another big deal before AGI would invalidate the value of current AI Safety research. What seems weird in your definition of big deal is that if I assume the big deal, then I can make informed guesses and plans about the world after it, no? Something akin to The Age of Em by Hanson, where he starts with ems (whole-brain emulations) and then tries to derive what our current understanding of the various sciences can tell us about this future. I don't see why you can't do this even if there is another big deal before AGI. Maybe the only cost is more and more uncertainty.
The arguments you point out against the value of research now compared to research closer to AGI seem to forget about incremental research. Not all research is a breakthrough, and most if not all breakthroughs build on previous decades or centuries of quiet research work. In this sense, working on it now might be the only way to ensure the necessary breakthroughs closer to the deadline.

Buck

For the problems-that-solve-themselves arguments, I feel like your examples have very "good" qualities for solving themselves: both personal and economic incentives are against them, they are obvious when one is confronted with the situation, and at the point where the problems become obvious, you can still solve them. I would argue that not all these properties hold for AGI. What are your thoughts about that?

I agree that it's an important question whether AGI has the right qualities to "solve itself". To go through the ones you named:

"Personal and economic incentives are aligned against them"--I think AI safety has somewhat good properties here. Basically no-one wants to kill everyone, and AI systems that aren't aligned with their users are much less useful. On the other hand, it might be the case that people are strongly incentivised to be reckless and deploy things quickly.
"they are obvious when one is confronted with the situation"--I think that alignment problems might be fairly obvious, especially if there's a long process of continuous AI progress where unaligned non-superintelligent AI systems do non-catastrophic damage. So this comes down to questions about how rapid AI progress will be.
"at the point where the problems become obvious, you can still solve them"--If the problems become obvious because non-superintelligent AI systems are behaving badly, then we can still maybe put more effort into aligning increasingly powerful AI systems after that and hopefully we won't lose that much of the value of the future.

Buck

One underlying hypothesis that was not explicitly pointed out, I think, was that you are looking for priority arguments. That is, part of your argument is about whether AI safety research is the most important thing you could do (it might be so obvious in an EA meeting or the EA forum that it's not worth exploring, but I like making the obvious hypotheses explicit).

This is a good point.

Whereas you could argue that without pure mathematics, almost all the positive technological progress we have now (from quantum mechanics to computer science) would not exist.

I feel pretty unsure on this point; for a contradictory perspective you might enjoy this article.

adamShimi

I'm curious about the article, but the link points to nothing. ^^

Kirsten

"And it seems to me that the stories that I have for how my work ends up making a difference to the world, most of those just look really unlikely to work if AGI is more than 50 years off. It's really hard to do research that impacts the world positively more than 50 years down the road."

This was nice to read, because I'm not sure I've ever seen anyone actually admit this before.

You say you think there's a 70% chance of AGI in the next 50 years. How low would that probability have to be before you'd say, "Okay, we've got a reasonable number of people to work on this risk, we don't really need to recruit new people into AI safety"?

Buck

This was nice to read, because I'm not sure I've ever seen anyone actually admit this before.

Not everyone agrees with me on this point. Many safety researchers think that their path to impact is by establishing a strong research community around safety, which seems more plausible as a mechanism to affect the world 50 years out than the "my work is actually relevant" plan. (And partially for this reason, these people tend to do different research to me.)

You say you think there's a 70% chance of AGI in the next 50 years. How low would that probability have to be before you'd say, "Okay, we've got a reasonable number of people to work on this risk, we don't really need to recruit new people into AI safety"?

I don't know what the size of the AI safety field is such that marginal effort is better spent elsewhere. Presumably this is a continuous thing rather than a discrete thing. Eg it seems to me that now compared to five years ago, there are way more people in AI safety and so if your comparative advantage is in some other way of positively influencing the future, you should more strongly consider that other thing.

Gordon Seidoh Worley

Regarding the 14% estimate, I'm actually surprised it's this high. I have the opposite intuition, that there is so much uncertainty, especially about whether or not any particular thing someone does will have impact, that I place the likelihood of anything any particular person working on AI safety does producing positive outcomes at <1%. The only reason it seems worth working on to me despite all of this is that when you multiply it against the size of the payoff it ends up being worthwhile anyway.

Eli Rose🔸

I agree with this intuition. I suspect the question that needs to be asked is "14% chance of what?"

RomeoStevens

The chance that the full stack of individual propositions evaluates as true in the relevant direction (work on AI vs work on something else).

Eli Rose🔸

Suppose you're in the future and you can tell how it all worked out. How do you know if it was right to work on AI safety or not?

There are a few different operationalizations of that. For example, you could ask whether your work obviously directly saved the world, or you could ask whether, if you could go back and do it over again with what you knew now, you would still work in AI safety.

The percentage would be different depending on what you mean. I suspect Gordon and Buck might have different operationalizations in mind, and I suspect that's why Buck's number seems crazy high to Gordon.

RomeoStevens

You don't, but that's a different proposition with a different set of cruxes since it is based on ex post rather than ex ante.

Eli Rose🔸

I'm saying we need to specify more than, "The chance that the full stack of individual propositions evaluates as true in the relevant direction." I'm not sure if we're disagreeing, or ... ?

Rohin Shah

I enjoyed this post, it was good to see this all laid out in a single essay, rather than floating around as a bunch of separate ideas.

That said, my personal cruxes and story of impact are actually fairly different: in particular, while this post sees the impact of research as coming from solving the technical alignment problem, I care about other sources of impact as well, including:

1. Field building: Research done now can help train people who will be able to analyze problems and find solutions in the future, when we have more evidence about what powerful AI systems will look like.

2. Credibility building: It does you no good to know how to align AI systems if the people who build AI systems don't use your solutions. Research done now helps establish the AI safety field as the people to talk to in order to keep advanced AI systems safe.

3. Influencing AI strategy: This is a catch all category meant to include the ways that technical research influences the probability that we deploy unsafe AI systems in the future. For example, if technical research provides more clarity on exactly which systems are risky and which ones are fine, it becomes less likely that people build the risky systems (nobody _wants_ an unsafe AI system), even though this research doesn't solve the alignment problem.

As a result, cruxes 3-5 in this post would not actually be cruxes for me (though 1 and 2 would be).

Buck

Yeah, for the record I also think those are pretty plausible and important sources of impact for AI safety research.

I think that either way, it’s useful for people to think about which of these paths to impact they’re going for with their research.

Matthew_Barnett

I like this way of thinking about AI risk, though I would emphasize that my disagreement comes a lot from my skepticism of crux 2 and in turn crux 3. If AI is far away, then it seems pretty difficult to understand how it will end up being used, and I think even when timelines are 20-30 years from now, this remains an issue [ETA: Note that also, during a period of rapid economic growth, much more intellectual progress might happen in a relatively small period of physical time, as computers could automate some parts of human intellectual labor. This implies that short physical timelines could underestimate the conceptual timelines before systems are superhuman].

I have two intuitions that pull me in this direction.

The first is that it seems like if you asked someone from 10 years ago what AI would look like now, you'd mostly get responses that wouldn't really help us that much at aligning our current systems. If you agree with me here, but still think that we know better now, I think you need to believe that the conceptual distance between now and AGI is smaller than the conceptual distance between AI in 2010 and AI in 2020.

The second intuition is that it seems like safety engineering is usually very sensitive to small details of a system that are hard to get access to unless the design schematics are right in front of you.

Without concrete details, the major approach within AI safety (as Buck explicitly advocates here) is to define a relaxed version of the problem that abstracts low level details away. But if safety engineering mostly involves getting little details right rather than big ones, then this might not be very fruitful.

I haven't discovered any examples of real world systems where doing extensive abstract reasoning beforehand was essential for making it safe. Computer security is probably the main example where abstract mathematics seems to help, but my understanding is that the math probably could have been developed alongside the computers in question, and that the way these systems are compromised is usually not due to some conceptual mistake.

Rohin Shah

I broadly agree with this, but I feel like this is mostly skepticism of crux 3 and not crux 2. I think to switch my position on crux 2 using only timeline arguments, you'd have to argue something like <10% chance of transformative AI in 50 years.

Matthew_Barnett

I think to switch my position on crux 2 using only timeline arguments, you'd have to argue something like <10% chance of transformative AI in 50 years.

That makes sense. "Plausibly soonish" is pretty vague so I pattern matched to something more similar to -- by default it will come within a few decades.

It's reasonable that for people with different comparative advantages, their threshold for caring should be higher. If there were only a 2% chance of transformative AI in 50 years, and I was in charge of effective altruism resource allocation, I would still want some people (perhaps 20-30) to be looking into it.

Neel Nanda

Thanks for writing this up! I thought it was really interesting (and this seems a really excellent talk to be doing at student groups :) ). Especially the arguments about the economic impact of AGI, and the focus on what it costs - that's an interesting perspective I haven't heard emphasised elsewhere.

The parts I feel most unconvinced by:

The content in Crux 1 seems to argue that AGI will be important when it scales and becomes cheap, because of the economic impact. But the arguments for the actual research being done seem more focused on AGI as a single monolithic thing, e.g. framings like a safety tax/arms race, comparing costs of building an unaligned AGI vs an aligned AGI.

My best guess for what you mean is that "If AGI goes well, for economic reasons, the world will look very different and so any future plans will be suspect. But the threat from AGI comes the first time one is made", ie that Crux 1 is an argument for prioritising AGI work over other work, but unrelated to the severity of the threat of AGI - is this correct?

The claim that good alignment solutions would be put to use. The fact that so many computer systems put minimal effort into security today seems a very compelling counter-argument.

I'm especially concerned if the problems are subtle. My impression is that a lot of what MIRI in particular thinks about sounds weird, in an "I could maybe buy this, but could maybe not buy it" way. And I have much lower confidence that companies would invest heavily in security for more speculative, abstract concerns.

This seems bad, because intuitively AI Safety research seems more counterfactually useful the more subtle the problems are - I'd expect people to solve obvious problems before deploying AGI even without AI Safety as a field.

Related to the first point, I have much higher confidence AGI would be safe if it's a single, large project eg a major $100 billion deployment, that people put a lot of thought into, than if it's cheap and used ubiquitously.

RomeoStevens

First, doing philosophy publicly is hard and therefore rare. It cuts against Ra-shaped incentives. Much appreciation to the efforts that went into this.

>he thinks the world is metaphorically more made of liquids than solids.

Damn, the convo ended just as it was getting to the good part. I really like this sentence and suspect that thinking like this remains a big untapped source of sharper cruxes between researchers. Most of our reasoning is secretly analogical, with deductive and inductive reasoning back-filled to try to fit it to what our parallel processing already thinks is the correct shape that an answer is supposed to take. If we go back to the idea of security mindset, then the representation that one tends to use will be made up of components, and your type system for uncertainty will be uncertainty about those components varying. So whichever sorts of things your representation uses as building blocks will determine the kinds of uncertainty that you have an easier time thinking about and managing. Going upstream in this way should resolve a bunch of downstream tangles, since the generators for the shape/direction/magnitude (this is an example of such a choice that might impact how I think about the problem) of the updates will be clearer.

This gets at a way of thinking about metaphilosophy. We can ask what more general class of problems AI safety is an instance of, and maybe recover some features of the space. I like the capability amplification frame because it's useful as a toy problem to think about random subsets of human capabilities getting amplified, to think about the non-random ways capabilities have been amplified in the past, and what sorts of incentive gradients might be present for capability amplification besides just the AI research landscape one.

D_M_x

This was great, thank you. I've been asking people about their reasons to work on AI safety as opposed to other world improving things, assuming they want to maximize the world improving things they do. Wonderful when people write it up without me having to ask!

One thing this post/your talk would have benefited from to make things clearer (or well, at least for me) is if you gave more detail on the question of how you define 'AGI', since all the cruxes depend on it.

Thank you for defining AGI as something that can do regularly smart human things and then asking the very important question how expensive that AGI is. But what are those regularly smart human things? What fraction of them would be necessary (though that depends a lot on how you define 'task')?

I still feel very confused about a lot of things. My impression is that AI is much better than humans at quite a few narrow tasks, though this depends on the definition. If AI were suddenly much better than humans at half of all the tasks humans can do, but sucked at the rest, then that wouldn't count as artificial 'general' intelligence under your definition(?), but it's unclear to me whether that would be any less transformative, though this depends a lot on the cost again. Now that I think about it, I don't think I understand how your definition of AGI is different to the results of whole-brain emulation, apart from the fact that they used different ways to get there. I'm also not clear on whether you use the same definition as other people, whether those people usually use the same one, and how much all the other cruxes depend on how exactly you define AGI.

Linch

(Only attempting to answer this because I want to practice thinking like Buck, feel free to ignore)

Now that I think about it, I don't think I understand how your definition of AGI is different to the results of whole-brain emulation, apart from the fact that they used different ways to get there

My understanding is that Buck defines AGI to point at a cluster of things such that technical AI Safety work (as opposed to, e.g., AI policy work or AI safety movement building, or other things he could be doing) is likely to be directly useful. You can imagine that "whole-brain emulation safety" will look very different as a problem to tackle, since you can rely much more on things like "human values", introspection, the psychology literature, etc.

omernevo

Thank you for writing this!

I really appreciate your approach of thoroughly going through potential issues with your eventual conclusion. It's a really good way of getting to the interesting parts of the discussion!

The area where I'm left least convinced is the use of Laplace's Law of Succession (LLoC) to suggest that AGI is coming soonish (that isn't to say there aren't convincing arguments for this, but I think this argument probably isn't one of them).

There are two ways of thinking that make me skeptical of using LLoC in this context (they're related but I think it's helpful to separate them):

1. Given a small number of observations, there's not enough information to "get away" from our priors. So whatever prior we load into the formula, we're bound to get something relatively close to it. This works if we have a good reason to use a uniform prior or in contexts where we're only trying to separate hypotheses that aren't "far enough away" from the uniform prior, which I don't think is the case here:

In my understanding, what we're really trying to do is separate two hypotheses: The first is that the chance of AGI appearing in the next 50 years is non-negligible (it won't make a huge difference to our eventual decision making if it's 40% or 30% or 20%). The second is that it is negligible (let's say, less than 0.1%, or one in a thousand).

When we use a uniform prior (which starts out with a 50% chance of AGI appearing within a year) - we have already loaded the formula with the answer and the method isn't helpful to us.

2. In continuation to the "demon objection" within the text, I think the objection there could be strengthened to become a lot more convincing. The objection is that LLoC doesn't take the specific event it's trying to predict into account, which is strange and sounds problematic. The example given turns out ok: We've been trying to summon demons for thousands of years so the chance of it happening in the next 50 years is calculated to be small.

But of course, that's just not the best example to show that LLoC is problematic in these areas:

Example 1: I have thought up a completely new and original demon. Obviously, no one attempted to summon my new and special demon until this year, when, apparently, it wasn't summoned. The LLoC chance of summoning my demon next year is quite high (and over the next 50 years is incredibly high). It's also larger than the chance of summoning any demon (including my own) over those time periods.

The problem isn't just that I picked an extreme example with a single observation:

Example 2: What is the chance that the movie Psycho is meant to hypnotize everyone watching it and we'll only realize it when Hitchcock takes over the world? Well, turns out that this hasn't yet happened for exactly 60 years. So, it seems like the chance of this happening soon is precisely the same as the chance of AGI appearing.

Next, what is the chance of Hitchcock doing this AND Harper Lee (To Kill a Mockingbird came out in the same year) attempting it in a similar fashion AND Andre Cassagnes (Etch-A-Sketch is also from 1960) doing so (I want to know the chance of all three happening at the exact same time)? Turns out that this specific and convoluted scenario is just as likely, since it could only start happening in 1960… This is both obviously wrong and an instance of the conjunction fallacy.
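
To make the numbers in these examples concrete, here is a minimal sketch of the calculation, assuming each year counts as one trial and a uniform prior on the per-year probability (the function name `prob_event_within` is hypothetical, and the 60-year count is taken from the example above). Under those assumptions, the chance of no success in the next m trials after n failures is (n + 1) / (n + m + 1). Note that the formula only sees the number of failed years, which is why every scenario dated from 1960 gets the same answer.

```python
from fractions import Fraction


def prob_event_within(failures: int, horizon: int) -> Fraction:
    """Chance of at least one success in the next `horizon` trials under
    Laplace's rule of succession (uniform prior on the per-trial
    probability), given `failures` past trials and no successes."""
    # P(no success in the next m trials | n failures) = (n + 1) / (n + m + 1)
    p_none = Fraction(failures + 1, failures + horizon + 1)
    return 1 - p_none


# Example 1: a brand-new demon with a single failed year of summoning.
new_demon_next_year = prob_event_within(failures=1, horizon=1)
new_demon_50_years = prob_event_within(failures=1, horizon=50)

# Example 2: anything that "could only start" 60 years ago.
# The scenario's content never enters the formula, only the count does.
since_1960_50_years = prob_event_within(failures=60, horizon=50)

print(new_demon_next_year)         # 1/3
print(float(new_demon_50_years))   # about 0.96
print(float(since_1960_50_years))  # about 0.45
```

The three-way conjunction in the last example would be fed the same `failures=60`, so it would come out at the same 50/111 as the single-event scenario, which is the oddity being pointed at.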

EdoArad🔸

This reminds me of the discussion around the Hinge of History Hypothesis (and the subsequent discussion between Rob Wiblin and Will MacAskill).

I'm not sure that I understand the first point. What sort of prior would be supported by this view?

The second point I definitely agree with, and the general point of being extra careful about how to use priors :)

omernevo

Sorry, I wasn't very clear on the first point: There isn't a 'correct' prior.

In our context (by context I mean both the small number of observations and the implicit hypotheses that we're trying to differentiate between), the prior has a large enough weight that it affects the eventual result in a way that makes the method unhelpful.
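
As a rough illustration of how much weight the prior carries here, the sketch below generalises the uniform prior to a Beta(a, b) prior on the per-year probability and compares two choices after the same 60 failed years. The function name `prob_within` and the "skeptical" Beta(1, 1000) prior are assumptions picked purely for illustration; they are not taken from the post.

```python
from fractions import Fraction


def prob_within(horizon: int, failures: int, a: int = 1, b: int = 1) -> Fraction:
    """Chance of at least one success in the next `horizon` trials, given
    `failures` past trials with no success and a Beta(a, b) prior on the
    per-trial success probability. a = b = 1 is the uniform prior."""
    p_none = Fraction(1)
    for i in range(horizon):
        # Posterior predictive chance that this trial also fails,
        # given all the failures observed or assumed so far.
        p_none *= Fraction(b + failures + i, a + b + failures + i)
    return 1 - p_none


# Same data (60 failed years), same 50-year horizon, two different priors.
uniform = prob_within(horizon=50, failures=60)                # Beta(1, 1)
skeptical = prob_within(horizon=50, failures=60, a=1, b=1000)  # illustrative

print(float(uniform))    # about 0.45
print(float(skeptical))  # about 0.045
```

With only 60 observations, the two priors give answers a factor of ten apart, so the output mostly reflects whichever prior was loaded in rather than the data.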

MichaelA🔸

Thanks for writing this! As others have commented, I thought the focus on your actual cruxes and uncertainties, rather than just trying to lay out a clean or convincing argument, was really great. I'd be excited to see more talks/write-ups of a similar style from other people working on AI safety or other causes.

I think that long-term, it's not acceptable to have there be people who have the ability to kill everyone. It so happens that so far no one has been able to kill everyone. This seems good. I think long-term we're either going to have to fix the problem where some portion of humans want to kill everyone or fix the problem where humans are able to kill everyone.

This, and the section it's a part of, reminded me quite a bit of Nick Bostrom's Vulnerable World Hypothesis paper (and specifically his "easy nukes" thought experiment). From that paper's abstract:

Scientific and technological progress might change people’s capabilities or incentives in ways that would destabilize civilization. For example, advances in DIY biohacking tools might make it easy for anybody with basic training in biology to kill millions; novel military technologies could trigger arms races in which whoever strikes first has a decisive advantage; or some economically advantageous process may be invented that produces disastrous negative global externalities that are hard to regulate. This paper introduces the concept of a vulnerable world: roughly, one in which there is some level of technological development at which civilization almost certainly gets devastated by default, i.e. unless it has exited the ‘semi-anarchic default condition’. [...] A general ability to stabilize a vulnerable world would require greatly amplified capacities for preventive policing and global governance.

I'd recommend that paper for people who found that section of this post interesting.

MaxDalton

[I'm doing a bunch of low-effort reviews of posts I read a while ago and think are important. Unfortunately, I don't have time to re-read them or say very nuanced things about them.]

I really like the direct, personal, thoughtful style of this talk, and would like to see more posts like it. Seems like maybe one of the best intros-of-this-length to the reasons for working on AI alignment.

Rohin Shah

Planned summary for the Alignment Newsletter:

This post describes how Buck's cause prioritization within an effective altruism framework leads him to work on AI risk. The case can be broken down into a conjunction of five cruxes. Specifically, the story for impact is that 1) AGI would be a big deal if it were created, 2) has a decent chance of being created soon, before any other "big deal" technology is created, and 3) poses an alignment problem that we both **can** and **need to** think ahead in order to solve. His research 4) would be put into practice if it solved the problem and 5) makes progress on solving the problem.

Planned opinion:

I enjoyed this post, and recommend reading it in full if you are interested in AI risk because of effective altruism. (I've kept the summary relatively short because not all of my readers care about effective altruism.) My personal cruxes and story of impact are actually fairly different: in particular, while this post sees the impact of research as coming from solving the technical alignment problem, I care about other sources of impact as well. See this comment for details.

Buck

I think your summary of crux three is slightly wrong: I didn’t say that we need to think about it ahead of time, I just said that we can.

Rohin Shah

My interpretation was that the crux was

We can do good by thinking ahead

One thing this leaves implicit is the counterfactual: in particular, I thought the point of the "Problems solve themselves" section was that if problems would be solved by default, then you can't do good by thinking ahead. I wanted to make that clearer, which led to

we both **can** and **need to** think ahead in order to solve [the alignment problem].

Where "can" talks about feasibility, and "need to" talks about the counterfactual.

I can remove the "and **need to**" if you think this is wrong.

Buck

I'd prefer something like the weaker and less clear statement "we **can** think ahead, and it's potentially valuable to do so even given the fact that people might try to figure this all out later".

adamShimi

On a tangent, what are your issues with quantum computing? Is it the hype? That might indeed be excessive given what we can do now. But the theory is fascinating, there are concrete applications where we should get positive benefits for humanity, and the actual researchers in the field try really hard to clarify what we know and what we don't about quantum computing.

EdoArad🔸

Jaime Sevilla wrote a long (albeit preliminary) and interesting report on the topic.

Aaron Gertler 🔸

This post was awarded an EA Forum Prize; see the prize announcement for more details.

My notes on what I liked about the post, from the announcement:

“I edited [the transcript] for style and clarity, and also to occasionally have me say smarter things than I actually said.”
The “enhanced transcript” format seems very promising for other Forum content, and I hope to see more people try it out!
As for this enhanced transcript: here, Buck reasons through a difficult problem using techniques we encourage — laying out his “cruxes,” or points that would lead him to change his mind if he came to believe they were false. This practice encourages discussion, since it makes it easier for people to figure out where their views differ from yours and which points are most important to discuss. (You can see this both in the Q&A section of the transcript and in comments on the post itself.)
I also really appreciated Buck’s introduction to the talk, where he suggested to listeners how they might best learn from his work, as well as his concluding summary at the end of the post.
Finally, I’ll quote one of the commenters on the post:
I think the part I like the most, even more than the awesome deconstruction of arguments and their underlying hypotheses, is the sheer number of times you said "I don't know" or "I'm not sure" or "this might be false".

[anonymous]

What does AI safety movement building look like? What sorts of projects or tasks does this involve? What are the relevant organizations where one could do AI safety movement building work?

Sam Clarke

On crux 4: I agree with your argument that good alignment solutions will be put to use, in worlds where AI risk comes from AGI being an unbounded maximiser. I'm less certain that they would be in worlds where AI risk comes from structural loss of control leading to influence-seeking agents (the world still gets better in Part I of the story, so I'm uncertain whether there would be sufficient incentive for corporations to use AIs aligned with complex values rather than AIs aligned with profit maximisation).

Do you have any thoughts on this or know if anyone has written about it?
