In general, to me it seems quite fruitful to examine in more detail whether, in fact, multipolarity of various kinds might alleviate concerns about value fragility. And to those who have the intuition that it would (especially in cases, like Multipolar value fragility, where agent A’s exact values aren’t had by any of agents 1-n), I’d be curious to hear the case spelled out in more detail.
Here's a case that I roughly believe: multipolarity means there's a higher likelihood that one's own values will be represented, because it gives each agent the opportunity to literally live in the world and act in it to bring about the outcomes they personally want.
This case is simple enough, and it's consistent with the ordinary multipolarity the world already experiences. Consider an entirely selfish person. Now, divide the world into two groups: the selfish person (which we call Group A) and the rest of the world (which we call Group B).
Group A and Group B have very different values, even "upon reflection". Group B is also millions or billions of times more powerful than Group A (as it comprises the entire world minus the selfish individual). Therefore, on a naive analysis, you might expect Group B to "take over the world" and then implement its values without any regard whatsoever for Group A. Indeed, because of the vast power differential, it would be "very easy" for Group B to achieve this world takeover. And such an outcome would be very bad according to Group A's values.
Of course, this naive analysis is flawed, because the real world is multipolar in an important respect: usually, Group B will let Group A (the individual) have some autonomy, and let them receive a tiny fraction of the world's resources, rather than murdering Group A and taking all their stuff. They will do this because of laws, moral norms, and respect for their fellow human beings. This multipolarity therefore sidesteps the issues with value fragility, and allows Group A to achieve a pretty good outcome according to their values.
This is also my primary hope with misaligned AI. Even if misaligned AIs are collectively millions or billions of times more powerful than humans (or aligned AIs), I would hope they would still allow the humans or aligned AIs to have some autonomy, leave us alone, and let us receive a sufficient fraction of resources that we can enjoy an OK outcome, according to our values.
I would go even further than the position argued in this paper. The paper focuses on whether we should give agentic AIs certain legal rights (the right to make contracts, hold property, and bring tort claims), but I also think that, as an empirical matter, we probably will do so. I have two main justifications for my position here:
Beyond the basic question of whether AIs should or will receive basic legal rights in the future, there are important remaining questions about how post-AGI law should be structured. For example:
I believe these questions, among others, deserve more attention from those interested in AGI governance.
I think I basically agree with you, and I am definitely not saying we should just shrug. We should instead try to shape the future positively, as best we can. However, I still feel like I'm not quite getting my point across. Here's one more attempt to explain what I mean.
Imagine we developed a technology that enabled us to build physical robots that were functionally identical to humans in every relevant sense, including their observable behavior and their ability to experience happiness and pain in exactly the same way that ordinary humans do. However, there is just one difference between these humanoid robots and biological humans: they are made of silicon rather than carbon, and they look robotic rather than biological.
In this scenario, it would certainly feel strange to me if someone were to suggest that we should be worried about a peaceful robot takeover, in which the humanoid robots collectively accumulate the vast majority of wealth in the world via lawful means.
By assumption, these humanoid robots are literally functionally identical to ordinary humans. As a result, I think we have no intrinsic reason to disprefer them receiving a dominant share of the world's wealth, as compared to some other subset of human-like beings. This remains true even if the humanoid robots are literally "not human", and thus their peaceful takeover counts as "human disempowerment" in a technical sense.
The ultimate reason I think one should not worry about a peaceful robot takeover in this specific scenario is that these humanoid robots have essentially the same moral worth and right to choose as ordinary humans, and therefore we should respect their agency and autonomy just as much as we already do for ordinary humans. Since we normally let humans accumulate wealth and become powerful via lawful means, I think we should allow these humanoid robots to do the same. I hope you would agree with me here.
Now, generalizing slightly, I claim that to be rationally worried about a peaceful robot takeover in general, you should usually be able to identify a relevant moral difference between the scenario I have just outlined and the scenario that you're worried about. Here are some candidate moral differences that I personally don't find very compelling:
Say you're worried about any take-over-the-world actions, violent or not -- in which case this argument about the advantages of non-violent takeover is of scant comfort;
This is reasonable under the premise that you're worried about any AI takeover, no matter whether it is violent or peaceful. But speaking personally, peaceful takeover scenarios, in which AIs accumulate power not by cheating us or killing us with nanobots, but by lawfully beating humans fair and square and ending up with almost all the wealth over time, seem much better to me than violent takeovers, and not very bad in themselves.
I admit the moral intuition here is not necessarily obvious. I concede that there are plausible scenarios in which AIs are completely peaceful and act within reasonable legal constraints, and yet the future ends up ~worthless. Perhaps the most obvious scenario is the "Disneyland without children" scenario where the AIs go on to create an intergalactic civilization, but in which no one (except perhaps the irrelevant humans still on Earth) is sentient.
But when I try to visualize the most likely futures, I don't tend to visualize a sea of unsentient optimizers tiling the galaxies. Instead, I tend to imagine a transition from sentient biological life to sentient artificial life, which continues to be every bit as cognitively rich, vibrant, and sophisticated as our current world—indeed, it could be even more so, given what becomes possible at a higher technological and population level.
Worrying about non-violent takeover scenarios often seems to me to arise simply from discrimination against non-biological forms of life, or perhaps a more general fear of rapid technological change, rather than naturally falling out as a consequence of more robust moral intuitions.
Let me put it another way.
It is often conceded that it was good for humans to take over the world. Speaking broadly, we think this was good because we identify with humans and their aims. We belong to the "human" category, of course; but more importantly, we think of ourselves as part of what might be called the "human tribe", and therefore we sympathize with the pursuits and aims of the human species as a whole. But we could equally identify as part of the "sapient tribe", which would include non-biological life as well as humans, and thus sympathize with the pursuits of AIs, whatever those may be. Under this framing, what reason is there to care much about a non-violent, peaceful AI takeover?
I want to distinguish between two potential claims:
I think claim (1) is clearly true and is supported by your observation that the Neanderthals went extinct, but I intended to argue against claim (2) instead. (Although, separately, I think the evidence that Neanderthals were less intelligent than Homo sapiens is rather weak.)
Despite my comment above, I do not actually have much sympathy for the claim that humans can't possibly go extinct, or that our species is definitely going to survive in a relatively unmodified form over the very long run, for the next billion years. (Indeed, perhaps like the Neanderthals, our best hope of surviving in the long run may come from merging with the AIs.)
It's possible you think claim (1) is sufficient in some sense to establish some important argument. For example, perhaps all you're intending to argue here is that AI is risky, which to be clear, I agree with.
On the other hand, I think that claim (2) accurately describes a popular view among EAs, albeit with some dispute over what counts as a "population" for the purpose of this argument, and what counts as "value-aligned". While important, claim (1) is simply much weaker than claim (2), and consequently implies fewer concrete policy prescriptions.
I think it is important to critically examine (2) even if we both concede that (1) is true.
I'm not sure I fully understand this framework, and thus I could easily have missed something here, especially in the section about "Takeover-favoring incentives". However, based on my limited understanding, this framework appears to miss the central argument for why I am personally not as worried about AI takeover risk as most EAs seem to be.
Here's a concise summary of my own argument for being less worried about takeover risk:
A major counterargument to my argument seems well summarized by this hypothetical statement (which is not an actual quote, to be clear): "if you live in a world filled with powerful agents that don't fully share your values, those agents will have a convergent instrumental incentive to violently take over the world from you." However, this argument proves too much.
We already live in a world in which, if this statement were true, we would have observed far more violent takeover attempts than we have actually observed historically.
For example, I personally don't fully share values with almost any other human on Earth (both because of my indexical preferences and because of my divergent moral views), and yet the rest of the world has not yet violently disempowered me in any way that I can recognize.
what we want from progress in our airplanes is, first and foremost, safety.
I dispute that this is what I want from airplanes. First and foremost, what I want from an airplane is for it to take me from point A to point B at high speed. Other factors are important too, including safety, comfort, and reliability. But there are non-trivial tradeoffs among these factors: for example, if we could make planes 20% safer at the cost of flights taking twice as long, I would not personally want to take that trade.
You might think this is a trivial objection to your analogy, but I don't think it is. In general, humans have a variety of values and are not single-mindedly focused on safety at the cost of everything else. We value safety, but we also value capabilities and urgency, along with numerous other variables. As another analogy, if we had delayed the adoption of the COVID vaccines by a decade to perform more safety testing, that delay would have carried a substantial cost, even though it would have been made in the name of safety.
In my view, the main reason not to delay AI comes from a similar line of reasoning. By delaying AI, we are delaying all the technological progress and economic value that could be hastened by AI, and this cost is not small. If you think that accelerated technological progress could save your life, cure aging, and eliminate global poverty, then from the perspective of existing humans, delaying AI can start to sound like it mainly prolongs the ongoing catastrophes in the world, rather than preventing new ones.
It might be valuable to delay AI even at the price of letting billions of humans die of aging, prolonging the misery of poverty, and so on. Whether it's worth delaying depends, of course, on what we are getting in exchange for the delay. However, unless you think this price is negligible, or you're simply very skeptical that accelerated technological progress will have these effects, this is not an easy dilemma.
From the full report,
It is not merely enough that we specify an “aligned” objective for a powerful AI system, nor just that objective be internalized by the AI system, but that we do both of these on the first try. Otherwise, an AI engaging in misaligned behaviors would be shut down by humans. So to get ahead, the AI would first try to shut down humans.
I dispute the claim that we need to get alignment right on the first try or else we're doomed. However, this question depends critically on what is meant by "first try". Let's consider two possible interpretations of the idea that we only get "one try" to develop AI:
Interpretation 1: "At some point we will build a general AI system for the first time. If this system is misaligned, then all humans will die. Otherwise, we will not all die."
Interpretation 2: "The decision to build AI is, in a sense, irreversible. Once we have deployed AI systems widely, it is unlikely that we could roll them back, just like how we can't roll back the internet, or electricity."
I expect the first interpretation of this thesis will turn out to be incorrect, because the "first" general AI systems will likely be rather weak and unable to unilaterally disempower all of humanity. This seems evident to me because current AI systems are already fairly general (and increasingly so), and yet they are weak and still far from being able to disempower humanity.
These current systems also seem to be increasing in their capabilities somewhat incrementally, albeit at a rapid pace[1]. I think it is highly likely that we will have many attempts at aligning general AI systems before they become more powerful than the rest of humanity combined, either individually or collectively. This implies that we do not get only "one try" to align AI—in fact, we will likely have many tries, and these attempts will help us accumulate evidence about the difficulty of aligning the even more powerful systems that we build next.
To the extent that you are simply defining the "first try" as the last system developed before humans become disempowered, this claim seems confused. Building such a system is better viewed as a "last try" than a "first try" at AI, since it would not necessarily be the first general AI system that we develop. It also seems likely that the construction of such a system would be aided substantially by AI-guided R&D, making it unclear to what extent it was really "humanity's try" at AI.
Interpretation 2 appears similarly confused. It may be true that the decision to deploy AI on a wide scale is irreversible, if indeed these systems are highly valuable and generally intelligent, which would make it hard to "put the genie back in the bottle". However, AI does not seem unusual among technologies in this respect, as it is similarly nearly impossible to reverse the course of technological progress in almost any other domain.
More generally, it is simply a fundamental feature of all decision-making that actions are irreversible, in the sense that it is impossible to go back in time and make different decisions than the ones we had in fact made. As a general property of the world, rather than a narrow feature of AI development in particular, this fact in isolation does little to motivate any specific AI policy.
I do not think the existence of emergent capabilities implies that general AI systems are getting more capable in a discontinuous fashion, as emergent capabilities are generally quite narrow abilities, rather than reflecting the average competence level of AI systems. On broad measures of intelligence, such as MMLU, AI systems appear to be developing more incrementally. Moreover, many apparently emergent capabilities are merely artifacts of the way we measure them, and therefore do not reflect underlying discontinuities in latent abilities.
Speaking generally, it is true that humans are frequently hesitant to change the status quo, and economic shocks can be quite scary to people. This provides one reason to think that people will try to stop explosive growth, and slow down the rate of change.
On the other hand, it's important to recognize the individual incentives involved here. On an individual, personal level, explosive growth is equivalent to a dramatic rise in real income over a short period of time. Suppose you were given the choice of increasing your current income several-fold over the next few years: for example, if your real income is currently $100,000/year, you would see it increase to $300,000/year within two years. Would you push back against this change? Would this rise in your personal income be too fast for your tastes? Would you try to slow it down?
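To make concrete just how far this hypothetical departs from historical norms, here is a quick illustrative calculation (assuming, purely for the sake of the example, a constant annual growth rate over the two years):

$$(1+g)^2 = \frac{\$300{,}000}{\$100{,}000} = 3 \quad\Longrightarrow\quad g = \sqrt{3} - 1 \approx 73\% \text{ per year},$$

which is far above the roughly 1–2% annual real income growth typical of modern developed economies.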
Even if explosive growth is dramatic and scary on a collective and abstract level, it is not clearly bad on an individual level. Indeed, it seems quite clear to me that most people would be perfectly happy to see their incomes rise dramatically, even at a rate that far exceeded historical norms, unless they recognized a substantial and grave risk that would accompany this rise in their personal income.
If we assume that people collectively follow what is in each of their individual interests, then we should conclude that incentives are pretty strongly in favor of explosive growth (at least when done with low risk), despite the fact that this change would be dramatic and large.