RG

Ryan Greenblatt

Member of Technical Staff @ Redwood Research
524 karma

Bio

This other Ryan Greenblatt is my old account[1]. Here is my LW account.

  1. ^

    Account lost to the mists of time and expired university email addresses.

Comments (136) · Topic contributions (2)

Ultimately what matters most is what the leadership's views are.

I'm skeptical this is true, particularly as AI companies grow massively and require vast amounts of investment.

It does seem important, but it's unclear whether it matters most.

One key issue with this model is that I expect the majority of x-risk from my perspective doesn't correspond to extinction, but instead to some undesirable group ending up with control over the long-run future (either AIs seizing control (AI takeover) or undesirable human groups).

So, I would reject:

We can model extinction here by n(t) going to zero.

You might be able to recover the model by supposing that n(t) gets multiplied by some constant factor conditional on an x-risk event, rather than going to zero.
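A minimal sketch of this modification, with all function names and numbers being illustrative assumptions rather than anything from the original post: extinction is the special case where the post-event factor c is 0, while "undesirable group ends up in control" corresponds to some 0 < c < 1.

```python
def expected_value(n, p_xrisk_per_year, c, horizon_years):
    """Expected total value when an x-risk event, if it occurs,
    multiplies all subsequent value n(t) by the constant c
    (c = 0 recovers the original extinction model)."""
    total = 0.0
    p_no_event = 1.0  # probability no x-risk event has happened yet
    for t in range(horizon_years):
        # value this year: full weight if no event so far, scaled by c otherwise
        total += (p_no_event + (1 - p_no_event) * c) * n(t)
        p_no_event *= 1 - p_xrisk_per_year
    return total

# Illustrative numbers: constant n(t) = 1, 1%/year risk, 100-year horizon.
value_extinction = expected_value(lambda t: 1.0, 0.01, 0.0, 100)
value_takeover = expected_value(lambda t: 1.0, 0.01, 0.5, 100)
assert value_takeover > value_extinction
```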

(Further, even if AI takeover does result in extinction there will probably still be some value due to acausal trade and potentially some value due to the AI's preferences.)

(Regardless, I expect that if you think the singularity is plausible, the effects of discounting are more complex because we could very plausibly have >10^20 experience years per year within 5 years of the singularity due to e.g. building a Dyson sphere around the sun. If we just look at AI takeover, ignore (acausal) trade, and assume for simplicity that AI preferences have no value, then it is likely that the vast, vast majority of value is contingent on retaining human control. If we allow for acausal trade, then the discount rates of the AI will also be important to determine how much trade should happen.)
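To illustrate the arithmetic behind the parenthetical above (the numbers here are assumptions for the sketch, not claims from the comment): under pure exponential time discounting at rate r, value arriving T years out is scaled by exp(-r·T), so even a fairly aggressive discount rate barely dents 10^20 experience-years arriving within ~5 years.

```python
import math

def discounted(value, r, years):
    """Present value under pure exponential time discounting."""
    return value * math.exp(-r * years)

# 10^20 experience-years arriving 5 years out, at a 5%/year discount rate,
# retains ~78% of its undiscounted value.
post_singularity = discounted(1e20, r=0.05, years=5)
assert post_singularity > 1e19
```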

(Separately, pure temporal discounting seems pretty insane and incoherent with my view of how the universe works.)

I tried to find out if the time-horizons for potential x-risk events have been explicitly discussed in longtermism literature but I didn’t come across anything.

See here

More specifically, is there any good reason to assume that the odds are in favor of humans even by a little bit? If so, what exactly is the argument for that?

There is a good argument from your perspective: human resource utilization is likely to be more similar to your values on reflection than that of a randomly chosen other species.

Is there any specific reason for discounting the possibility that arthropods or reptiles evolving over millions of years to something that equals or surpasses the intelligence of humans that were last alive?

No, I think analysis shouldn't discount this. Unless there is an unknown hard-to-pass point (a filter) between existing mammals/primates and human-level civilization, it seems like life re-evolving is quite likely. (I'd say an 85% chance of a new civilization conditional on human extinction but not primate extinction, and 75% if primates also go extinct.)

There is also the potential for alien civilizations, though I think this has a lower probability (perhaps 50% that aliens capture >75% of the cosmic resources in our light cone if Earth-originating civilizations don't capture these resources).

IMO, the dominant effect of extinction due to bio-risk is that a different Earth-originating species acquires power, and my values on reflection are likely to be closer to humanity's values on reflection than to the other species'. (I also have some influence over how humanity spends its resources, though I expect this effect is not that big.)

If you were equally happy with other species, then I think you still only take a 10x discount from these considerations, because there is some possibility of a hard-to-pass barrier between other life and humans. 10x discounts don't usually seem like cruxes IMO.

I would also note that for AI x-risk, intelligent life re-evolving is unimportant. (I also think AI x-risk is unlikely to result in extinction, because AIs are unlikely to want to kill all humans for various reasons.)

And over time scales of billions, we could enter the possibility of evolution from basic eukaryotes too. 

Earth will be habitable for about 1 billion more years, which probably isn't quite enough for this.

Perceived counter-argument:

My proposed counter-argument loosely based on the structure of yours.

Summary of claims

  • A reasonable fraction of computational resources will be spent based on the result of careful reflection.
  • I expect to be reasonably aligned with the result of careful reflection from other humans.
  • I expect to be much less aligned with the result of AIs-that-seize-control reflecting, due to less similarity and the potential for AIs to pursue relatively specific objectives from training (things like reward seeking).
  • Many arguments that human resource usage won't be that good seem to apply equally well to AIs and thus aren't differential.

Full argument

The vast majority of value from my perspective on reflection (where my perspective on reflection is probably somewhat utilitarian, but this is somewhat unclear) in the future will come from agents who are trying to optimize explicitly for doing "good" things and are being at least somewhat thoughtful about it, rather than those who incidentally achieve utilitarian objectives. (By "good", I just mean what seems to them to be good.)

At present, the moral views of humanity are a hot mess. However, it seems likely to me that a reasonable fraction of the total computational resources of our lightcone (perhaps 50%) will in expectation be spent based on the result of a process in which an agent or some agents think carefully about what would be best in a pretty deliberate and relatively wise way. This could involve eventually deferring to other smarter/wiser agents or massive amounts of self-enhancement. Let's call this a "reasonably-good-reflection" process.

Why think a reasonable fraction of resources will be spent like this?

  • If you self-enhance and get smarter, this sort of reflection on your values seems very natural. The same for deferring to other smarter entities. Further, entities in control might live for an extremely long time, so if they don't lock in something, as long as they eventually get around to being thoughtful it should be fine.
  • People who don't reflect like this probably won't care much about having vast amounts of resources and thus the resources will go to those who reflect.
  • The argument for "you should be at least somewhat thoughtful about how you spend vast amounts of resources" is pretty compelling at an absolute level and will be more compelling as people get smarter.
  • Currently a variety of moderately powerful groups are pretty sympathetic to this sort of view and the power of these groups will be higher in the singularity.

I expect that I am pretty aligned (on reasonably-good-reflection) with the result of random humans doing reasonably-good-reflection, as I am also a human, and many of the underlying arguments/intuitions that seem important to me seem likely to also seem important to many other humans (given various common human intuitions) upon those humans becoming wiser. Further, I really just care about the preferences of (post-)humans who end up caring most about using vast, vast amounts of computational resources (assuming I end up caring about these things on reflection), because the humans who care about other things won't use most of the resources. Additionally, I care "most" about the on-reflection preferences I have which are relatively less contingent and more common among at least humans, for a variety of reasons. (One way to put this is that I care less about worlds in which my preferences on reflection seem highly contingent.)

So, I've claimed that reasonably-good-reflection resource usage will be non-trivial (perhaps 50%) and that I'm pretty aligned with humans on reasonably-good-reflection. Supposing these, why think that most of the value is coming from something like reasonably-good-reflection preferences rather than other things, e.g. not-very-thoughtful indexical-preference (selfish) consumption? Broadly three reasons:

  • I expect huge returns to heavy optimization of resource usage (similar to spending altruistic resources today IMO, and in the future we'll be smarter, which will make this effect stronger).
  • I don't think that (even heavily optimized) not-very-thoughtful indexical preferences directly result in things I care that much about relative to things optimized for what I care about on reflection (e.g. it probably doesn't result in vast, vast, vast amounts of experience which is optimized heavily for goodness/$).
    • Consider how billionaires currently spend money, which doesn't seem to have much direct value, certainly not relative to their altruistic expenditures.
    • I find it hard to imagine that indexical selfish consumption results in things like simulating 10^50 happy minds. See also my other comment. It seems more likely IMO that people with selfish preferences mostly just buy positional goods that involve little to no experience. (Separately, I expect this means that people without selfish preferences get more of the compute, but this is counted in my earlier argument, so we shouldn't double count it.)
  • I expect that indirect value "in the minds of the laborers producing the goods for consumption" is also small relative to things optimized for what I care about on reflection. (It seems pretty small or maybe net-negative (due to factory farming) today (relative to optimized altruism) and I expect the share will go down going forward.)

(Aside: I was talking about not-very-thoughtful indexical preferences. It seems likely to me that doing a reasonably good job reflecting on selfish preferences gets you back to something like de facto utilitarianism (at least as far as how you spend the vast majority of computational resources), because personal identity and indexical preferences don't make much sense, and the thing you end up thinking is more like "I guess I just care about experiences in general".)

What about AIs? I think there are broadly two main reasons to expect that what AIs do on reasonably-good-reflection to be worse from my perspective than what humans do:

  • As discussed above, I am more similar to other humans and when I inspect the object level of how other humans think or act, I feel reasonably optimistic about the results of reasonably-good-reflection for humans. (It seems to me like the main thing holding me back from agreement with other humans is mostly biases/communication/lack of smarts/wisdom given many shared intuitions.) However, AIs might be more different and thus result in less value. Further, the values of humans after reasonably-good-reflection seem close to saturating in goodness from my perspective (perhaps 1/3 or 1/2 of the value of purely my values), so it seems hard for AI to do better.
    • To better understand this argument, imagine that instead of humanity the question was between identical clones of myself and AIs. It's pretty clear I share the same values as the clones, so the clones do pretty much strictly better than AIs (up to self-defeating moral views).
    • I'm uncertain about the degree of similarity between myself and other humans. But mostly the underlying similarity uncertainties also apply to AIs. So, e.g., maybe I currently think that on reasonably-good-reflection humans spend resources 1/3 as well as I would and AIs spend resources 1/9 as well. If I updated to think that other humans after reasonably-good-reflection only spend resources 1/10 as well as I do, I might also update to thinking AIs spend resources 1/100 as well.
  • In many of the stories I imagine for AIs seizing control, very powerful AIs end up directly pursuing close correlates of what was reinforced in training (sometimes called reward seeking, though I'm trying to point at a more general notion). Such AIs are reasonably likely to pursue relatively obviously valueless-from-my-perspective things on reflection. Overall, they might act more like an ultra-powerful corporation that just optimizes for power/money than like our children (see also here). More generally, AIs might in some sense be subject to wildly higher levels of optimization pressure than humans while being able to better internalize these values (no genetic bottleneck), which can plausibly result in "worse" values from my perspective.

Note that we're conditioning on safety/alignment technology failing to retain human control, so we should imagine correspondingly less human control over AI values.

I think the fraction of computational resources of our lightcone used based on the result of a reasonably-good-reflection process seems similar between human control and AI control (perhaps 50%). It's possible to mess this up, of course, either by messing up the reflection or by locking in bad values too early. But when I look at the balance of arguments, humans messing this up seems about as likely as AIs messing this up. So the main question is what the result of such a process would be. One way to put this is that I don't expect humans to differ substantially from AIs in terms of how "thoughtful" they are.

I interpret one of your arguments as being "Humans won't be very thoughtful about how they spend vast, vast amounts of computational resources. After all, they aren't thoughtful right now." To the extent I buy this argument, I think it applies roughly equally well to AIs. So naively, it just divides both sides rather than making AI look more favorable. (At least, if you accept that almost all of the value comes from being at least a bit thoughtful, which you also contest. See my arguments for that.)

In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.

Suppose that a single misaligned AI takes control and it happens to care somewhat about its own happiness while not having any more "altruistic" tendencies that I would care about or you would care about. (I think misaligned AIs which seize control caring about their own happiness substantially seems less likely than not, but let's suppose this for now.) (I'm saying "single misaligned AI" for simplicity, I get that a messier coalition might be in control.) It now has access to vast amounts of computation after sending out huge numbers of probes to take control over all available energy. This is enough computation to run absolutely absurd amounts of stuff.

What are you imagining it spends these resources on which is competitive with optimized goodness? Running >10^50 copies of itself which are heavily optimized for being as happy as possible while spending?

If a small number of agents have a vast amount of power, and these agents don't (eventually, possibly after a large amount of thinking) want to do something which is de facto like the values I end up caring about upon reflection (which is probably, though not certainly, vaguely like utilitarianism in some sense), then from my perspective it seems very likely that the resources will be squandered.

If you're imagining something like:

  1. It thinks carefully about what would make "it" happy.
  2. It realizes it cares about having as many diverse good experience moments as possible in a non-indexical way.
  3. It realizes that heavy self-modification would result in these experience moments being better and more efficient, so it creates new versions of "itself" which are radically different and produce more efficiently good experiences.
  4. It realizes it doesn't care much about the notion of "itself" here and mostly just focuses on good experiences.
  5. It runs vast numbers of such copies with diverse experiences.

Then this is just something like utilitarianism by another name, arrived at via a different line of reasoning.

I thought your view was that step (2) in this process won't go like this. E.g., currently selfish entities will retain indexical preferences. If so, then I don't see where the goodness can plausibly come from.

The fact that our current world isn't well described by the idea that what matters most is the number of explicit utilitarians, strengthens my point here.

When I look at very rich people (people with >$1 billion), it seems like the dominant way they make the world better via spending money (not via making money!) is thoughtful altruistic giving, not consumption.

Perhaps your view is that with the potential for digital minds this situation will change?

(Also, it seems very plausible to me that the dominant effect on current welfare is driven mostly by the effect on factory farming and other animal welfare.)

I expect this trend to further increase as people get much, much wealthier and some fraction (probably most) of them get much, much smarter and wiser with intelligence augmentation.

Additionally, how are you feeling about voluntary commitments from labs (RSPs included) relative to alternatives like mandatory regulation by governments?

This is discussed in Holden's earlier post on the topic here.

Explicit +1 to what Owen is saying here.

(Given that I commented with some counterarguments, I thought I would explicitly note my +1 here.)

In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking unethical actions, allowing us to shape its rewards during training accordingly. After we've aligned a model that's merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.

This reasoning seems to imply that you could use GPT-2 to oversee GPT-4 by bootstrapping from a chain of models of scales between GPT-2 and GPT-4. However, this isn't true: the weak-to-strong generalization paper finds that this doesn't work, and indeed bootstrapping like this doesn't help at all for ChatGPT reward modeling (I believe it helps on chess puzzles and nothing else they investigate).

I think this sort of bootstrapping argument might work if we could ensure that each model in the chain was sufficiently aligned and capable of reasoning that it would carefully reason about what humans would want if they were more knowledgeable, and then rate outputs based on this. However, I don't think GPT-4 is either aligned enough or capable enough that we see this behavior. And I still think it's unlikely to work even under these generous assumptions (though I won't argue for this here).
