(Cross-posted from my website. Podcast version here, or search "Joe Carlsmith Audio" on your podcast app.
This essay is part of a series I'm calling "Otherness and control in
the age of AGI." I'm hoping that the individual essays can be read
fairly well on their own, but see
here
for a summary of the essays that have been released thus far.)
In my last
essay,
I discussed a certain kind of momentum, in some of the philosophical
vibes underlying the AI risk discourse,[1] towards deeming more and
more agents – including: human agents – "misaligned" in the sense of:
not-to-be-trusted to optimize the universe hard according to their
values-on-reflection. We can debate exactly how much mistrust to have in
different cases, here, but I think the sense in which AI risk issues can
extend to humans, too, can remind us of the sense in which AI risk is
substantially (though, not entirely) a generalization and
intensification of the sort of "balance of power between agents with
different values" problem we already deal with in the context of the
human world. And I think it may point us towards guidance from our
existing ethical and political traditions, in navigating this problem,
that we might otherwise neglect.
In this essay, I try to gesture at a part of these traditions that I see
as particularly important: namely, the part that advises us to be "nicer
than Clippy" – not just in what we do with spare matter and energy, but
in how we relate to agents-with-different-values more generally. Let me
say more about what I mean.
Utilitarian vices
As many have noted, Yudkowsky's paperclip maximizer looks a lot like
total utilitarian. In particular, its sole aim is to "tile the universe"
with a specific sort of hyper-optimized pattern. Yes, in principle, the
alignment worry applies to goals that don't fit this schema (for
example: "cure cancer" or "do god-knows-whatever kludge of weird
gradient-descent-implanted proxy stuff"). But somehow, especially in
Yudkowskian discussions of AI risk, the misaligned AIs often end up
looking pretty utilitarian-y, and a universe tiled with something – and
in particular, "tiny-molecular-blahs" – often ends seeming like a
notably common sort of superintelligent Utopia.
What's more, while Yudkowsky doesn't think human values are utilitarian,
he thinks of us (or at least, himself) as sufficiently galaxy-eating
that it's easy to round off his "battle of the utility functions"
narrative into something more like a "battle of the preferred-patterns" – that is, a battle over who gets to turn the galaxies into their
favored sort of stuff. The AIs want to tile the universe with
paperclips; the humans, in Yudkowsky's world, want to tile it with
"Fun."
(Tiny-molecular-Fun?)

ChatGPT imagines "tiny molecular fun."
But actually, the problem Yudkowsky talks about most – AIs killing
everyone – isn't actually a paperclips vs. Fun problem. It's not a
matter of your favorite uses for spare matter and energy. Rather, it's
something else.
Thus, consider utilitarianism. A version of human values, right? Well,
one can debate. But regardless, put utilitarianism side-by-side with
paperclipping, and you might notice: utilitarianism is omnicidal, too – at least in theory, and given enough power. Utilitarianism does not love
you, nor does it hate you, but you're made of atoms that it can use for
something else. In particular: hedonium (that is: optimally-efficient
pleasure, often imagined as running on some optimally-efficient
computational substrate).
But notice: did it matter what sort of onium? Pick your favorite optimal
blah-blah. Call it Fun instead if you'd like (though personally, I find
the word "Fun" an off-putting and under-selling summary of
Utopia).
Still, on a generalized utilitarian vibe, that blah-blah is going to be
a way more optimal use of atoms, energy, etc than all those squishy
inefficient human bodies. They never told you in philosophy class? It's
not just organ-harvesting and fat-man-pushing. The utilitarians have
paperclipper problems, too.[2]
Oh, maybe you heard this about the negative utilitarians. "Doesn't
your philosophy want to kill everyone?" But the negative utilitarians
protest: "so does the classical version!" And straw-Yudkowsky, at least,
is not surprised. In straw-Yudkowsky's universe, killing everyone is,
like, the first thing that (almost) any strong-enough rational agent
does. After all, "everyone" is in the way of that agent's yang.
But are foomed-up humans actually this omnicidal? I hope not. And
real-Yudkowsky, at least, doesn't think so. There's a bit in his
interview with Lex
Fridman,
where Yudkowsky tries to get Lex to imagine being trapped in a computer
run by extremely slow-moving aliens who want their society to be very
different from how Lex wants it to be (in particular: the aliens have
some sort of equivalent of factory farming). Yudkowsky acknowledges that
Lex is presumably "nice," and so would not, himself, actually just
slaughter all of these aliens in the process of escaping. And
eventually, Lex agrees.
What is this thing, "nice"? Not, apparently, the same thing as
"preferring the right tiny-molecular-pattern." Existing creatures are
unlikely to be in this pattern by default, so if that's the sum total of
your ethics, you're on the omnicide train with Clippy and Bentham.
Rather, it seems, niceness is something else: something where, when you
wake up in an alien civilization, you don't just kill everyone first
thing, even though you're strong enough to get away with it. And this
even-though (gasp) their utility functions are different from yours.
What gives?
"Something very contingent and specific to humans, or at least to
evolved creatures, and which won't occur in AIs by default in any way
we'd like" answers
Yudkowsky.
And maybe so.[3] But I'm interested, here, not in whether AIs will be
nice-like-us, but rather, in understanding what our niceness consists
in, and what it might imply about the sorts of otherness and control
issues I've been talking about in this series.
In particular: a key feature of niceness, in my view, is some sort of
direct responsiveness to the preferences of the agents you're
interacting with. That is, "nice" values give the values of others
some sort of intrinsic weight. The aliens don't want to be killed, and
this, in itself, is a pro tanto reason not to kill them. In this
sense, niceness allows some aspect of yin into its agency. It is
influenced by others; it receives others; it allows itself to
channel – or at least, to respect and make space for – the yang of
others.
The extreme version of this is preference
utilitarianism,
which tries to make of itself, solely, a conduit of everyone else. And
it might seem, prima facie, an attractive view. In particular: to
someone who doesn't like the idea of imposing their own arbitrary,
contingent will upon the world, an ideal that instead enacts some sort
of "universal compromise will" (i.e., the combination of everyone's
preferences) can seem to regain the kind of objective and
other-centered footing that anti-realism about ethics threatens to deny.
But as I've written about
previously,
I think the appeal of a pure preference utilitarianism fades on closer
scrutiny.[4] In particular: I think it founders on possible
people,
on
paperclippers,
and in particular, on
sadists.
But rejecting a pure preference utilitarianism does not mean embracing a
stance that refuses to ever give the preferences of others intrinsic
weight. [5] And my sense is that sometimes the AI safety discourse goes
too far in this respect. It learns, from paperclippers, the strange and
unappealing places that the preferences of arbitrary others can lead.
Indeed, Yudkowsky takes explicit steps to break his audience's
temptation towards sympathy with Clippy's preferences (this is the
point of the abstract notion of
"paperclips"), and to
place Clippy's agency firmly in the role of "adversary" (see, e.g., the
"true prisoner's
dilemma").
And against such a backdrop, it's easy (though: not endorsed by
Yudkowsky) for the idea that preferences like Clippy's deserve any
intrinsic weight to fall out of the picture. After all: Clippy doesn't
give our preferences any weight. And aren't we and Clippy ultimately
alike, modulo our favored blah-blah-onium?
No. In addition to liking happier onium than Clippy, we are nicer than
Clippy to agents-with-different-values. Or: we should be. Indeed, I
think we should strive to be the sort of agents that aliens would not
fear the way Yudkowsky fears paperclippers, if the aliens discovered
they were on the verge of creating us. This doesn't mean we should just
adopt the alien preferences as our own – and especially not if the
stuff they like is actively evil rather than merely meaningless (more
below). But it does mean, for example, not killing them. But also:
actively helping them (on their own terms) in cheap ways, treating them
with respect and dignity, not enslaving them or oppressing them, and
more.[6]

Alien alignment researcher thinking about p(doom)
That is: human values themselves have stuff to say about how we should
treat agents-with-different-values – including, non-humans. Indeed, a
huge portion of our ethics and politics ends up dealing with this in one
form or another. AI otherness will be new, yes – but we have deep,
richly textured, and at-least-somewhat battle-tested traditions to draw
on in orienting towards it. Too often, utilitarian vibes forget about
these traditions ("isn't it all just an empirical question about
what-causes-the-utils?"). And too often, fear that the
agents-with-different-values might hurt us makes us forget, too (which,
I re-emphasize, isn't to say that agents-with-different-values won't
hurt us – cf all this stuff about bears and Nazis and
the-brutality-of-nature etc in the
previous
essays).
But faced with a new class of others/fellow-creatures/potential-threats,
we should be drawing on every source of wisdom we can.
Boundaries
Let me give an example of ways in which bringing to mind some of the
less utilitarian dimensions of human ethics can make a difference to how
we orient towards AI systems with values different from our own.
In "Does AI risk 'other' the
AIs?,"
I mentioned two worries the AI alignment discourse has about
paperclippers:
-
That they'll kill everyone (and relatedly: violate people's basic rights, steal people's stuff, and violently overthrow the government).
-
That they'll gain power in a way that results in their values (rather than human values) steering the trajectory of earth-originating civilization, thereby leading to a future of ~zero value.
These two worries are often lumped together under the more unified
concern that the AIs will have the "wrong values." After all, if they
had the right values, presumably they would do neither of these
things.
But the two worries are importantly distinct.[7] For one thing, as has
been
oft-noted,
different human ethical views might disagree about their respective
importance. But beyond this, these two worries interact very differently
with our existing ethical and political norms governing how agents with
different values should relate to one another.
In particular: as a civilization, we have extremely deep and robust
norms prohibiting agents from doing worry-number-1-style behavior: i.e.,
killing other people, stealing other people's stuff, and trying to
overthrow the government (though of course, there are exceptions and
complexities). That is, worry-number-1 casts the AIs in a role that
triggers very directly our sense that we are dealing with aggressors
who are violating important boundaries -- boundaries that lie at the
core of human cooperative arrangements – and whose behavior therefore
warrants unusually strong forms of defensive response. For example: if
someone is breaking into your home with nano-bots trying to kill you,
you are morally permitted – on the basis of self-defense – to do
things that would otherwise be impermissible (even to save your own
life) in other contexts: for example, killing them (where this is
necessary and proportionate).[8] Similarly: you are justified in doing
things to people who are invading your country that you aren't
justified in doing if they aren't invading your country, and so forth.
The misaligned AIs, according to worry-number-1, are enemies of this
deep and familiar sort.[9]

"Hitler watching German soldiers march into Poland in September 1939."
An example of a worry-number-1-style boundary violation. (Image source
here.)
But what of worry-number-2? Here, hmm: if we take worry-number-1 full
off the table, I think it becomes quite a bit less clear what standard
(western, liberal, broadly democratic) ethical and political norms have
to say about worry-number-2 on its own. To see this, consider the
following thought experiment (caveat: I'm really, really not saying
that misaligned AIs will be like this).
Imagine a liberal society very much like our own, except with the
addition of one extra human cultural group: namely, the
humans-who-like-paperclips. The humans-who-like-paperclips are a sect of
humans that arose at some point in the sixties and has been growing ever
since. They are meticulously law-abiding, kind, and cooperative, but
they have one weird quirk: the main thing they all want to do with their
personal resources is to make paperclips. Passing by a house owned by a
human-who-likes-paperclips, you'll often see large, neatly-sorted stacks
of paperclip boxes in their backyards, and through the windows of their
garages, and sometimes in the living rooms. The richer
humans-who-like-paperclips own whole warehouses. The paperclip industry
is booming.

Yeah sometimes he just stands there looking at them...
Now, let's start by noticing that in this context, it's not at all
clear that "the humans-who-like-paperclips have different values from
us" qualifies as a problem, at least by the lights of basic western,
liberal norms (here I mean liberalism in the political-philosophy sense
roughly at stake in this Wikipedia
page, rather
than in the "liberals vs. republicans" sense). What the
humans-who-like-paperclips do with their private resources, and in the
privacy of their homes/backyards, is their own business, conditional on
its compatibility with certain basic norms around harm, consent, and so
forth. After all: Alicia down the street spends her free time and money
listening to noise music; Jim sits around watching trashy TV in a
drunken haze; Felipe has sex with other men; Maria collects stamps; and
Jason is Mormon. Are the humans-who-like-paperclips importantly
different? What happened to liberal tolerance?
Now, of course, utilitarianism-in-theory was never, erm, actually very
tolerant. Utilitarianism is actually kinda pissed about all these
hobbies. For example: did you notice the way they aren't hedonium?
Seriously tragic. And even setting aside the not-hedonium problem (it
applies to all-the-things), I checked Jim's pleasure levels for the
trashy-TV, and they're way lower than if he got into Mozart; Mary's
stamp-collecting is actually a bit obsessive and out-of-balance; and
Mormonism seems too confident about optimal amount of
coffee.
Oh noes! Can we optimize these backyards somehow? And Yudkowsky's
paradigm misaligned AIs are thinking along the same lines – and they've
got the nano-bots to make it happen.
I sometimes think about this sort of vibe via the concept of "meddling
preferences." That is: roughly, we imagine dividing up the world into
regions ("spaces," "spheres") that are understood as properly owned or
controlled by different agents/combinations of agents. Literal property
is a paradigm example, but these sorts of boundaries and accompanying
divisions-of-responsibility occur at all sorts of levels – in the
context of bodily autonomy, in the context of who has the right to make
what sort of social and ethical demands of others, and so forth (see
also, in more interpersonal contexts, skills involved in "having
boundaries," "maintaining your own sovereignty," etc).
Some norms/preferences concern making sure that these boundaries
function in the right way – that transactions are appropriately
consensual, that property isn't getting stolen, that someone's autonomy
is being given the right sort of space and respect. A lot of deontology,
and related talk about rights, is about this sort of thing (though not
all). And a lot of liberalism is about using boundaries of this kind of
help agents with different values live in peace and mutual benefit.
Meddling preferences, by contrast, concern what someone else does within
the space that is properly "theirs" – space that liberal ethics would
often designate as "private," or as "their own business." And being
pissed about people using their legally-owned and ethically-gained
resources to make paperclips looks a lot like this. So, too, being
pissed about noise-musicians, stamp-collectors, gay people, Mormons,
etc. Traditionally, a liberal asks, of the humans-who-like-paperclips:
are they violating any laws? Are they directly hurting anyone? Are they
[insert complicated-and-contested set of further criteria]? If not:
let them be, and may they do the same towards "us."

Humans-who-like-stamps, at a convention. (Image source
here.)
Many
"axiologies"
(that is, ways of evaluating the "goodness" of the world) are meddling
in a way that creates tension with this sort of liberal vibe. After all:
axiologies concern the goodness of the entire world. Which means: all
the "regions." In this sense, axiology is no respecter of boundaries. Of
course, you could have an axiology that prefers worlds precisely
insofar as they obey some set of boundary-related norms, and which has
no preferences about what-happens-in-back-yards, but one finds this
rarely in practice. To the contrary, many axiologies are concerned, for
example, with the welfare of the agents involved (the average welfare,
the total welfare, etc), or the beauty/friendship/complexity/fun etc
occurring in the different regions. And if you give people liberal
freedoms in their own spheres, sometimes they make those spheres
less-than-optimally welfare-y/beautiful/complex/fun etc. Thus that
classic tension between goodness and freedom (cf. "top down" vs.
"bottom
up";
and see also Nozick's critique of "end-state" and "patterned"
principles of
justice).
The "utility functions" that Yudkowskian rational agents pursue need not
be axiologies in a traditional sense. But somehow, they often end up
pretty axiology-vibed.[10] No wonder, then, that Clippy is no respecter
of boundaries, either. Indeed, in many respects, Yudkowsky's AI
nightmare is precisely the nightmare of all-boundaries-eroded. The
nano-bots eat through every wall, and soon, everywhere, a single pattern
prevails. After all: what makes a boundary bind? In Yudkowsky's world
(is he wrong?), only two things: hard power, and ethics. But the AIs
will get all the hard power, and have none of the ethics. So no walls
will stand in their way.
But I claim that humans often have the ethics bit.[11] Or at least,
human liberals, on their current self-interpretation. Of course, this
isn't to say that liberals are OK with anything happening inside
"walled" zones that might be intuitively understood as "private." For
example: it's a contested question what aspects of a child's life should
be under the control of a parent, but clearly, you aren't allowed to
abuse or torture your own children (or anyone else), even in your own
living room with the blinds drawn. And similarly, at a larger scale: the
borders between nation states are a paradigm example of a certain kind
of "boundary," but we believe, nevertheless, that certain sorts of
human-rights-abuses inside a sovereign nation warrant infringing this
boundary and righting the relevant wrong.
Often, though, these sorts of boundary infringements are justified
precisely insofar as they are necessary to prevent some other boundary
violation (e.g., child abuse, genocide) taking place within the first
boundary. Indeed, Yudkowsky often turns to this sort of thing when he
tries to prompt humans to behave in a manner analogous to a
paperclipping AI. Thus, in "Three Worlds
Collide,"
he specifically has humans encounter (and then: decide to intervene on
violently) an alien species that eats their own conscious, suffering
children – rather than, e.g., a species that just spends its resources
making paperclips. And in trying to induce Lex to try to take over an
alien world he wakes up in ("don't think of it as 'world domination',"
Yudkowsky says with a
grin, "think of it as
'world optimization'"), Yudkowsky specifically appeals to the idea that
the alien civilization involves a lot harm and suffering – via war,
or via some equivalent of factory farming – that Lex could alleviate,
rather than to the idea that the aliens use their resources (and still
less: their atoms) on boring/meaningless/sub-optimal things.
And to be clear: I agree that preventing harm, suffering, genocide, and
so forth can justify infringing otherwise-important boundaries. (Indeed,
I think that as it becomes possible to create suffering and harm in
digital minds using personal computers, we're going to have to grapple
with new tensions in this respect. Your backyard is yours, yes: but just
as you can't abuse your children there, neither can you abuse digital
minds.) But I also want to be clear that what's going on with the part
of human values that says "no torturing people even in your own
backyard" is much more specific, and much more compatible with
"niceness" in other contexts, than what's going on with an arbitrary
rational optimizer stealing your atoms to make its favored form of
blah-blah-onium.
For example: if Lex were to wake up in a civilization of peaceful
paperclippers, whose civilization involves no suffering (but also, let's
say, very little happiness), but who spend all of their resources on
paperclips, it seems very plausible to me that the right thing for Lex
to do is to mostly leave them alone, rather than to engage in some
project of world-domination/optimization (maybe Lex escapes to some
other planet, but he doesn't take over the alien government and turn
their paperclip factories into Fun-onium factories instead). And this
even though Lex likes fun a lot more than paperclips.
Yudkowsky, to his credit, is attuned to this aspect of human ethics (the
humans in Three Worlds Collide, for example, look for ways to respect
and preserve baby-eater culture while still saving the babies) – but
his rhetoric can easily leave it in the background. For example, in
trying to induce Lex to world-dominate/optimize, Yudkowsky reminds
him: "the point is:
they want the world to be one way, you want the world to be a different
way." But for a liberal: that's not good enough. All the time, my
preferences conflict with the preferences of others. All the time,
according to me, they could be using their private resources more
optimally. Does this mean I dominate/optimize their backyards as soon as
I'm powerful enough to get away with it? Not, I claim, if I am nice.
Of course, an even-remotely-sophisticated ethics of "boundaries"
requires engaging with a ton of extremely gnarly and ambiguous stuff.
When, exactly, does something become "someone's"? Do wild animals, for
example, have rights to their "territory"? See all of the philosophy
of property
for just a start on the problems. And aspirations to be "nice" to
agents-with-different-values clearly need ways of balancing the
preferences of different agents of this kind – e.g., maybe you don't
steal Clippy's resources to make fun-onium; but can you tax the rich
paperclippers to give resources to the multitudes of poor
staple-maximizers?[12] Indeed, remind me your story about the ethics of
taxation in general?
I'm not saying we have a settled ethic here, and still less, that its
rational structure is sufficiently natural and privileged that tons of
agents will converge on it. Rather, my claim is that we have some
ethic here – an ethic that behaves towards "agents with different
values" in a manner importantly different from (and "nicer" than)
paperclipping, utilitarianism, and a whole class of related forms of
consequentialism; and in particular, an ethic that doesn't view the mere
presence of (law-abiding, cooperative) people-who-like-paperclips as a
major problem.
And such an ethic seems well-suited, too, to handling the possibility – discussed in the previous essay – that different humans might end up
with pretty different values-on-reflection as well. Liberalism does not
ask that agents sharing a civilization be "aligned" with each other in
the sense at stake in "optimizing for the same utility function."
Rather, it asks something more minimal, and more compatible with
disagreement and diversity – namely, that these agents respect certain
sorts of boundaries; that they agree to transact on certain sorts of
cooperative and mutually-beneficial terms; that they give each other
certain kinds of space, freedom, and dignity. Or as a crude and
distorting summary: that they be a certain kind of nice. Obviously, not
all agents are up for this – and if they try to mess it up, then
liberalism will, indeed, need hard power to defend itself. But if we
seek a vision of a future that avoids Yudkowsky's nightmare, I think the
sort of pluralism and tolerance at the core of liberalism will often be
more a promising guide than "getting the utility function that steers
the future right."
What if the humans-who-like-paperclips get a bunch of power, though?
Let's keep going, though, with the thought experiment about the
humans-who-like-paperclips, until it hits on worry-number-2 more
directly. In particular: thus far the humans-who-like-paperclips are
just one human group among others. But what happens if we imagine them
becoming the dominant human group – albeit, via means entirely
compatible with respect for the boundaries of others, and with
conformity to liberal ethics and laws.
Thus, let's say that the humans-who-like-paperclips are quite a bit
smarter, more productive, and better coordinated than basically everyone
else. As a result of their labors in the economy and their upstanding
citizenship, humans in general are richer, happier, stronger, and
healthier relative to a world without them. But for closely related
reasons, and without violating any legal or ethical norms (all the
economic transactions they engage in are consensual, fully-informed, and
mutually beneficial), they are gradually accumulating more and more
power. Their population is growing unusually fast; they own a larger and
larger share of capital; and they exert more and more influence over
politics and public opinion – albeit, in entirely above-board ways
(much more above board, indeed, than many of the other groups vying for
influence). Analysts are projecting that in a few decades,
humans-who-like-paperclips will be the most powerful human group, for
most measures of power – more powerful, indeed, than all the other
groups combined. And they're predicting that for various reasons to do
with the pace of technological
development,
this dominance will grant the humans-who-like-paperclips enormous
influence over the trajectory of humanity's future.
Now, it's natural to wonder whether, once the humans-who-like-paperclips
achieve sufficient dominance, all this niceness and cooperativeness and
good-citizenship and respect-for-the-law stuff might fall by the
wayside, and whether they might start looking more hungrily at your
babies and your atoms. But suppose that somehow, you know that this
won't happen. Rather, the humans-who-like-paperclips will continue to
meticulously respect legal and ethical norms (or at least, the sort of
minimal, boundary-related ethical norms I gestured at above). No one
will get nano-bot-ed; the humans-who-like-paperclips won't sneak any
suffering or slavery into their paperclip piles; and the
humans-who-like-other-stuff (e.g. "Fun") will be able to happily pursue
this other stuff from within secure backyards that are extremely ample
by today's standards. But most of the resources of the future will go
towards paperclips regardless.[13]
How bad is this outcome? Different ethical views will disagree, and a
less-crude analysis would obviously include factors other than
"conformity to very basic liberal norms" and "what happens with the
galaxies." Crudely, my own view is that the galaxy thing is actually a
huge
deal,
and that even with basic liberal norms secure, turning ~all reachable
resources into literal paperclips would be a catastrophic waste of
potential. [14] But I also want to acknowledge that this is a very
different sort of big deal than someone, or some group, killing
everyone else and taking their stuff (and note that distant galaxies are
not, in any meaningful sense, "ours," despite transhumanist talk about
"our cosmic
endowment"). In
particular: the pure galaxies thing implicates different, and more
fraught, ethical questions about otherness and control.
Thus: once we specify that basic liberal norms will be respected
regardless, further disputes-over-the-galaxies look much more like a
certain kind of raw competition for resources. It's much less akin to a
country defending itself from an invader, and much more akin to one
country racing another country to settle and control some piece of
currently-uninhabited territory.[15] The dispute is less about
upholding the basic conditions of cooperation and
peace-among-differences, and more about whose hobbies get-done-more; who
gets the bigger backyard. Does it all come down to land use?
Well, even it did: land use is actually a very big deal.[16] And to be
clear: I don't like paperclips any more than you do. I much prefer stuff
like joy and understanding and beauty and love. But I also want to be
clear about what sort of ground I am standing on, according to my own
values, when I fight for these things in different ways in different
contexts. And according to my own values: it is one thing to defend your
boundaries and your civilization's basic norms against aggressors and
defectors. It is another to compete with someone who
prefers-different-stuff, even while those norms are secure. And it is a
third, yet, to become an aggressor/defector yourself, in pursuit of the
stuff-you-prefer. But to talk, only, about "having different values" – and especially, to assume that the main thing re: values is your favored
use of unclaimed energy/matter, your preferred blah-blah-onium – obscures these distinctions.
In particular: the defending-boundaries thing is where liberalism goes
most readily to identify the forms of "otherness" that are not OK:
namely, otherness done Nazi-style; otherness that actually, really, is
trying to kill you and eat your babies. But the otherness at stake in
"cooperative and nice, but still has a different
favorite-use-of-resources" is quite different. It's the sort of
otherness that liberalism wants to tolerate, respect, include, and even
celebrate. Cf noise music, Mormonism, and that greatest test of
tolerance: sub-optimally-efficient pleasure. Such tolerance/respect/etc
is compatible with certain kinds of competition, yes. But not
fighting-the-Nazis style. Not, for example, with the same sort of moral
righteousness; and relatedly, not with the same sorts of justifications
for violence and coercion.
Indeed, importantly not, if you want peace and diversity both. After
all, the wider the set of differences-in-values you allow to justify
violence and coercion, the more you are asking either for
violence/coercion, or for everyone-having-the-same-values. Or perhaps
most likely: violence/coercion in the service of
everyone-having-the-same-values. Cf cleansing, purging. Like how the
paperclipper does it. But we can do better.
An aside on AI sentience
I want to pause here to address an objection: namely, "Joe, all this
talk about tolerance and respect etc – for example, re: the
humans-who-like-paperclips – is assuming that the Others being
tolerated/respected/etc are sentient. But the
AIs-with-different-values – even: the cooperative, nice,
liberal-norm-abiding ones – might not even be sentient! Rather, they
might be mere empty machines. Should you still tolerate/respect/etc
them, then?"
My sense is that I'm unusually open to "yes," here.[17] I'm not going to try to defend this openness in depth here,
but in brief: while I take consciousness very seriously,[18] and
definitely care a lot about something-in-the-vicinity-of-consciousness,
I don't feel very confident that our current concepts of "sentience" and
"consciousness" are going to withstand enough scrutiny to handle the
moral weight that some people currently want to put on them;[19] I
think focus on consciousness does poorly on golden-rule-like tests when
applied to civilizations with different conceptions of the precise sorts
of functional mental architectures that matter (e.g., aliens that would
look at us and say "these agents aren't schmonscious, because their
introspection doesn't have blah-precise-functional-set-up" – see e.g.
this
story
for an intuition pump); and I think some of the more cooperation-focused
origins and functions of niceness/liberalism/boundaries (including:
functions I discuss below re: liberalism and real-politik, where
sentience more clearly doesn't matter[20]) don't point towards
consciousness as a key desideratum (and note that I'm here specifically
talking about the bits of ethics that are cooperation-flavored, rather
than the bits associated with what you personally do in your
backyard).[21] Plus, more generally, I think this is all sufficiently
confusing territory that we should err on the side of caution and
inclusivity in allocating our moral concern, rather than saying e.g.
"whatever, this cognitively-sophisticated-agent-with-preferences isn't
conscious – by which I mean, um, that we-know-not-what-thing, that
least-understood-thing – so it's fine to torture it, deprive it of
basic rights, etc."
Of course, if you stop using sentience as a necessary condition for
being worthy-of-tolerance/respect etc, then you need to say additional
stuff about where you do draw the sorts of lines I discussed a few
essays
ago:
e.g., "OK to eat apples but not babies," "furbies and thermostats don't
get the vote," "you can own a laptop but not a slave,"[22] and so
on.[23] And indeed, gnarly stuff. My current best guess here would be
to hand-wave about agenty-ness and cognitive sophistication and
who-would've-been-a-good-target-for-cooperation-in-other-circumstances – but obviously, one needs to say quite a bit more.
For the purposes of understanding the ethical underpinnings of the AI
risk discourse, though, I don't think that we need to resolve questions
about whether non-sentient AIs-with-different-values are worthy of
tolerance/respect. Why? Because the core bits of the Yudkowskian
narrative I've been discussing apply even if all the
AIs-with-different-values are sentient. The classic paperclipper-doom
story, for example, does not require that the paperclipper be
insentient: it still kills all the humans, it still turns the galaxies
into paperclips, and that's enough.[24] And Yudkowsky himself would
find the possibility of conscious AIs, at least, obvious. Where this
includes, presumably, conscious paperclippers. (In reality, my sense is
that Yudkowsky thinks consciousness unusually scarce – for example,
he's skeptical that pigs are
conscious.
But this view isn't important to his story.) So for now, in talking
about tolerating/respecting AIs with-different-values, I'll just assume
they're sentient, and see what follows.
Indeed: did you think it matters a lot, to the Yudkowsky narrative,
whether the AI was sentient? If so, then I suspect you are thinking of
this narrative as a less familiar story than it truly is. Ultimately, AI
risk is not about humans vs. AIs (in that case, it really would be
species-ism/bio-chauvinism), or sentience vs. insentience (the AIs might
well be sentient). Rather, it's about something more ancient and basic:
namely, agents with different values competing for power. So I encourage
you: run the story with conscious humans-with-different-values in the
place of the AIs-with-different-values – humans to whom you are more
immediately inclined to ascribe moral status, rights, citizenship,
tolerance-worthiness, and so forth. You want to make sure that you get
the differences-in-values different enough, sure (though: "maximize
paperclips" is an unfortunate cartoon; thinking about where RLHF + foom
leads seems a better guide). And as I said earlier: people with souls
can still be enemy soldiers. But if you're finding that words like
"human" or "sentient" are making the agents-with-different-values seem
substantially less like enemies, then you're not yet fully keyed to the
particular sort of conflict that Yudkowsky has in mind.
Giving AIs-with-different-values a stake in civilization
Let me give another example of a place where I worry that a naïve
Yudkowskian discourse can too-easily neglect the virtues of niceness and
liberalism: namely, the sort of influence we imagine intentionally
giving to AIs-with-different-values that we end up sharing the world
with.
Thus, consider Yudkowsky's "proposed thing-to-do with an
extremely advanced AGI, if you're extremely confident of your ability
to align it on complicated targets": namely, use it to implement
humanity's "coherent extrapolated
volition" ("CEV"). This means,
basically: have the AI do what currently-existing humans would want it
to do if they were "idealized" (see more
here),
to the extent those idealized humans would want the same things.
We see, in Yudkowsky's discussion of CEV, some of his effort to
implement a less power-grabby ethic than a simple interpretation of his
philosophy might imply. That is: Yudkowsky (at least in
2004) is
explicitly imagining a team of AGI programmers who are in the position
to take over the world and have their particular (idealized) values rule
the future (let's set aside questions about the degree of resemblance
this scenario is likely to have to the actual dynamics surrounding AGI
development, and treat it, centrally, as a thought experiment). And one
might've thought, given the apparent convergence of
oh-so-many-rational-agents on the advisability of taking over the world,
that Yudkowsky's programmers would do the same.[25] But he suggests
that they should not.
Part of this, says Yudkowsky, is about not ending up like ancient greeks
who impose values on the future they wouldn't actually endorse if they
understood better. But that only gets you, in Yudkowsky's ontology, to
the programmers making sure to extrapolate their own volitions. It
doesn't get you to including the rest of humanity in the process.
What gets you to giving that wider circle a say? Yudkowsky mentions
various values – "fairness," "not being a jerk," trying to act as you
would wish other agents would act in your place,
cooperation/real-politik, not acting like you are uniquely appointed to
determine humanity's destiny, and others. I won't interrogate these
various considerations in detail here (though see footnote for a bit
more discussion).[26] Rather, my point is about how far the pluralism
they motivate should extend.
In particular: Yudkowsky's "extrapolation base" – that is, the set of
agents his process grants direct influence over the future – stops at
humanity. But it seems plausible to me that whatever considerations
motivate empowering all of humanity, in a thought experiment like this,
should motivate empowering certain kinds of AIs-with-different-values as
well, at least if we are already sharing the world with such AIs by the
time the relevant sort of power is being thought-experimentally
allocated. For example, in this thought experiment: if at the time the
programmers are making this sort of decision, there are lots of
moral-patienty AIs with human-level-or-higher intelligence running
around, who happen to have very different values from humans, I think
they should plausibly be included in the "extrapolation" base too. After
all, why wouldn't they be? "Because they're not humans" is actually
species-ism. But absent such species-ism, the most salient answer is
"because their values are different from ours, so giving them influence
will make the future worse by our lights." But that answer could
easily motivate not-empowering many humans as well – and the logic, in
the limit, might well prompt the programmers to empower only themselves.
Now, the details here about what it means to empower moral-patienty
AIs-with-different-values in the right way get gnarly fast (see e.g.
Bostrom and Shulman
(2022) for a
flavor). Indeed, questions about how to handle the empowerment of such
AIs are one of the few places I've seen Yudkowsky, in his
words,
"give up and flee screaming into the night." See, also, one of his characters' exclamation in the face of a sentient iPhone that's been stalking him,
and which begs not to be wiped: "I don't know what the fuck else I'm
supposed to do! Someone tell me what the fuck else I'm supposed to do
here!" At least as of 2008 (has he written on this since?[27]),
Yudkowsky's central advice, in the face of the moral dilemma posed by
creating AI moral patients with different values, seems to be: don't do
it, at least until you're much readier than we are. And indeed: yes.
Just like how: don't create AGI at all until you're much readier than
we are. But unfortunately, in both cases: I worry that we're going to
need a better plan.

Dese ne wipe...
I won't try to outline such a plan here. Rather, I mostly want to point
at the general fact that, insofar as we are in fact aiming to build a
world that succeeds at whatever "liberalism" and "boundaries" and
"niceness" are trying to do, this world should probably be inclusive,
tolerant, and pluralistic with respect to AIs-with-different-values (or
at least, moral patient-y ones) as well as humans-with-different-values – at least absent some clear and not-just-species-ist story about why
AIs-with-different-values should be excluded. And note, importantly,
that this doesn't mean tolerating arbitrarily horrible value systems
doing whatever they want, or arbitrarily alien value systems trampling
on other people's backyards. This is part of why I think it's worth
being clear – indeed, clearer than I've been thus far – about the
sorts of values differences liberalism/boundaries/niceness gets fussed
about.[28] Peaceful, cooperative AIs that want to make paperclips in
their backyards – that's one thing. Paperclippers who want to murder
everyone; sadists who want to use their backyards as torture chambers;
people who demand that they be able to own sentient, suffering slaves – that's, well, a different thing. Yes, drawing the lines requires work.
And also: it probably requires drawing on specific human (or at least,
not-fully-universal) values for guidance. I'm not saying that
liberalism/niceness/boundaries is a fully "neutral arbiter" that isn't
"taking a stand." Nor am I saying that we know what stand it, or the
best version of it, takes. Rather, my point is that this stand probably
does not treat "those AIs-we-share-the-world-with have different values
from us" as enough, in itself, to justify excluding them from influence
over the society we share.
The power of niceness, community, and civilization
So far, I've been making the case for this sort of inclusivity centrally
on ethical grounds. But liberalism/niceness/boundaries clearly have
practical benefits as well. Nice people, for example, are nicer to
interact with. Free and tolerant societies are more attractive to live
in, work in, immigrate to. Secure boundaries save resources otherwise
wasted on conflict. And so on. There's a reason so many European
scientists – including German scientists – ended up working on the
Manhattan
project,
rather than with the Nazis; and it seems closely related to differences
in "niceness."
Indeed, these benefits are enough, at times, to soften the atheism of
certain rationalists. For example: Scott Alexander.[29] As I mentioned
in a previous essay: Alexander, in writing about
liberalism/niceness/boundaries (e.g.
here
and
here),
attributes to it a kind of mysterious power. "Somehow Elua is still
here. No one knows exactly how. And the gods who oppose Him tend to find
Themselves meeting with a surprising number of unfortunate accidents."
Liberalism/niceness/boundaries is not, for Alexander, just another
utility function. Still less is it actively weak. Rather, it is a
"terrifying unspeakable elder God." "Elua is the god of flowers and free
love and he is terrifying. If you oppose him, there will not be enough
left of you to bury, and it will not matter because there will not be
enough left of your city to bury you in."

A bit like this?
Here, Alexander's vibe is un-Yudkowskian in a number of ways. First,
Alexander seems to want to trust, at least partly, in something
mysterious – namely, the ongoing power of
liberalism/niceness/boundaries, which Alexander admits he does not fully
understand. Indeed, I think that various more consequentialist-y
stories
about the justification for deontological-y norms and virtues – including the ones at stake in liberalism/niceness/boundaries – have
some of this flavor as well. That is: consequentialists often argue that
you should abide by deontological norms, or be blah sort of virtuous,
even when it seems like doing so will make things worse, because
somehow, actually, doing so will make things better (for example:
because at the level of choosing a policy, or adjusting for biases, or
dealing with the constraints of a bounded mind, deontology/virtue does
better than consequentialist calculation). Deontology/virtue, on this
story, is its own form of power-to-achieve-your-goals – but a form that
remains at least somewhat cognitively inaccessible while it is being
put-into-practice (otherwise, it could be more fully subsumed within a
direct consequentialist calculation). So trust in deontology/virtue, in
the hard cases, requires trusting in something not-fully-calculated.
(Though of course, there are tons of ways to trust-wrongly, here,
too.)[30]
But beyond his willingness to trust-in-something-mysterious, Alexander's
attribution of power to Elua is also in tension with certain kinds of
orthogonality between ethics and optimization power. That is, to the
extent that Elua represents a set of values, Elua, in a Yudkowskian
ontology, is orthogonal to intelligence at least – and thus, to a key
source of power. "Paperclips," after all, are neither elder Gods nor
younger Gods, neither unspeakable nor speakable. They are, rather, just
another direction that power can try to drive an indifferent universe.
Why would niceness be any different?
Well, we can think of reasons. Plausibly, for example, the indifferent
universe is steered more easily in some directions vs. others. Indeed,
the social/evolutionary histories of niceness/boundaries/liberalism are
themselves testaments to the ways in which the indifferent universe
favors Elua under certain conditions – favoritism that plays a key role
in explaining why we ended up valuing Elua-stuff intrinsically, to the
extent we do. In this sense, our values are not fully orthogonal to the
"universe's values." True, we are not simple might-makes-right-ists, who
love, only, whatever is in fact most powerful. But our hearts have, in
fact, been shaped by power – so we should not be all that surprised if
the stuff we love is also powerful.
Will power of this kind persist into a post-AGI future – and in
particular, in a way that should motivate extending various sorts of
tolerance and inclusivity towards AIs-with-different-values on pragmatic
rather than purely ethical grounds? My sense is that Yudkowskian-ism
often imagines that it won't. In particular: the practical benefits of
liberalism/niceness/boundaries often have to do with the ways in which
they allow agents with different values, but broadly comparable levels
of power, to cooperate and to live together in harmony rather than to
engage in conflict. But as I discussed above: Yudkowsky is typically
imagining a post-AGI world in which AIs-with-different-values and humans
do not have broadly comparable levels of power. Rather, either
AIs-with-different-values have all the power, or (somehow, due to a
miracle) humans do. So finding a modus vivendi can seem less
practically necessary.
Again, I'm not going to delve into these dynamics in any detail, but I'm
skeptical that we should be writing off the purely practical benefits of
extending various forms of niceness/liberalism/boundaries to
AIs-with-different-values, especially from our current epistemic
position. In particular: I think there may well be crucial stages along
the path to a post-AGI future in which AIs-with-different-values and
humans do indeed have sufficiently comparable levels of power, at least
in expectation, that the practical virtues of
niceness/liberalism/boundaries may well have a positive role to play – including: a role that helps us avoid having to put our trust in any
foomed-up concentration of power, whether human or artificial. I am
especially interested, here, in visions of a post-AGI distribution of
power that would give various AIs-with-different-values more of an
incentive, ex ante, to work with humans to realize the vision in
question, as a part of a broadly fair and legitimate project, rather
than as part of an effort, on humanity's part, to use (potentially
misaligned and unwilling) AI labor to empower human values in
particular. But fleshing this out is a task for another time.
Is niceness enough?
My main aim, in this essay, has been to point at the distinction between
a paradigmatically paperclip-y way of being, and some broad and hazily
defined set of alternatives that I've grouped under the label
"liberalism/niceness/boundaries" (and obviously, there are tons of other
options as well). Too often, I think, a simplistic interpretation of the
alignment discourse imagines that humans and paperclippers are both
paperclippy at heart – but just, with a different favored sort of
stuff. I think this picture neglects core aspects of human ethics that
are, themselves, about navigating precisely the sorts of
differences-in-values that the possibility of AIs-with-different-values
forces us to grapple with. I think that attention to these aspects of
human ethics can help us be better than the paperclippers we fear – not
just in what we do with spare resources, but in how we relate to the
distribution of power amongst a plurality of value systems more broadly.
And I think it may have practical benefits as well, in navigating
possible conflicts both between different humans, and between humans and
AIs.
That said: depending on how exactly we interpret
liberalism/niceness/boundaries, it's also possible to imagine futures
compatible with various versions (and especially, minimal versions – e.g., property rights are respected, laws don't get broken, laws are
passed democratically, etc), but which are nevertheless bleak and even
horrifying in other respects – for example, because love and joy and
beauty and even consciousness have vanished entirely from the
world.[31] In this sense, and depending on the details, the bits of
ethics I've been gesturing at here aren't necessarily enough, on their
own, for even a minimally good future (let alone a great one). In
particular: absent help from an indifferent universe, in order to have
substantive amounts of love/joy/beauty in the future, you need agents
who care about these things having enough power to keep them around to
the relevant degree – and different conceptions of
liberalism/niceness/boundaries may not guarantee this. So even beyond
the yin of being nice/liberal/boundary-respecting towards agents who
don't like love/joy/beauty, some kind of active yang, in the
direction of love/joy/beauty etc, is necessary, too.[32] In the next
essay, I'll return to questions about this sort of yang – and in
particular, questions about whether it involves attempting to exert
inappropriate levels of control.
Thanks for the post, Joe. Relatedly, readers may want to check Brian Tomasik's posts on cooperation and peace.