AI alignment shouldn't be conflated with AI moral achievement

Matthew_Barnett

6 min read · Dec 30, 2023

Comments 15

Wei Dai

To be sure, ensuring AI development proceeds ethically is a valuable aim, but I claim this goal is *not* the same thing as “AI alignment”, in the sense of getting AIs to try to do what people want.

There was at least one early definition of "AI alignment" to mean something much broader:

The "alignment problem for advanced agents" or "AI alignment" is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

I've argued that we should keep using this broader definition, in part for historical reasons, and in part so that AI labs (and others, such as EAs) can more easily keep in mind that their ethical obligations/opportunities go beyond making sure that AI does what people want. But it seems that I've lost that argument, so it's good to periodically remind people to think more broadly about their obligations/opportunities. (You don't say this explicitly, but I'm guessing it's part of your aim in writing this post?)

(Recently I've been using "AI safety" and "AI x-safety" interchangeably when I want to refer to the "overarching" project of making the AI transition go well, but I'm open to being convinced that we should come up with another term for this.)

That said, I think I'm less worried than you about "selfishness" in particular and more worried about moral/philosophical/strategic errors in general. The way most people form their morality is scary to me, and personally I would push humanity to be more philosophically competent/inclined before pushing it to be less selfish.

Matthew_Barnett

There was at least one early definition of "AI alignment" to mean something much broader:

I agree. I have two main things to say about this point:

1. My thesis is mainly empirical. I think, as a matter of verifiable fact, that if people solve the technical problems of AI alignment, they will use AIs to maximize their own economic consumption, rather than pursue broad utilitarian goals like "maximize the amount of pleasure in the universe". My thesis is independent of whatever we choose to call "AI alignment".
2. Separately, I think the semantic battle seems to be trending against those on "your side". The major AI labs seem to use the word "aligned" to mean something closer to "the AI does what users want (and also respects moral norms, and doesn't output harmful content etc.)" rather than "the AI produces positive outcomes in the world morally, even if this isn't what the user wants". Personally, the word "alignment" also just seems to conjure an image of the AI trying to do what you want, rather than fighting you if you decide to do something bad or selfish.

That said, I think I'm less worried than you about "selfishness" in particular and more worried about moral/philosophical/strategic errors in general.

There is a lot I could say about this topic, but I'll just say a few brief things here. In general I think the degree to which moral reasoning determines the course of human history is frequently exaggerated. I think mundane economic forces are simply much more impactful. Indeed, I'd argue that much of what we consider human morality is simply a byproduct of social coordination mechanisms that we use to get along with each other, rather than the result of deep philosophical reflection.

At the very least, mundane economic forces seem to have been more impactful historically compared to philosophical reasoning. I probably expect the future of society to resemble the past more strongly than you do?

Wei Dai

I think, as a matter of verifiable fact, that if people solve the technical problems of AI alignment, they will use AIs to maximize their own economic consumption, rather than pursue broad utilitarian goals like “maximize the amount of pleasure in the universe”.

If you extrapolate this out to after technological maturity, say 1 million years from now, what does selfish "economic consumption" look like? I tend to think that people's selfish desires will be fairly easily satiated once everyone is much much richer and the more "scalable" "moral" values would dominate resource consumption at that point, but it might just be my imagination failing me.

I think mundane economic forces are simply much more impactful.

Why do "mundane economic forces" cause resources to be consumed towards selfish ends? I think economic forces select for agents who want to accumulate resources and are good at it, but will probably leave quite a bit of freedom in how those resources are ultimately used once the current cosmic/technological gold rush is over. It's also possible that our future civilization uses up much of the cosmic endowment through wasteful competition, leaving little or nothing to consume in the end. Is that your main concern?

(By "wasteful competition" I mean things like military conflict, costly signaling, races of various kinds that accumulate a lot of unnecessary risks/costs. It seems possible that you categorize these under "selfishness" whereas I see them more as "strategic errors".)

Matthew_Barnett

Why do "mundane economic forces" cause resources to be consumed towards selfish ends?

Because most economic agents are essentially selfish. I think this is currently true, as a matter of empirical fact. People spend the vast majority of their income on themselves, their family, and friends, rather than using their resources to pursue utilitarian/altruistic ideals.

I think the behavioral preferences of actual economic consumers, who are not mostly interested in changing their preferences via philosophical reflection, will more strongly shape the future than other types of preferences. Right now that means human consumers determine what is produced in our economy. In the future, AIs themselves could become economic consumers, but in this post I'm mainly talking about humans as consumers.

I tend to think that people's selfish desires will be fairly easily satiated once everyone is much much richer and the more "scalable" "moral" values would dominate resource consumption at that point, but it might just be my imagination failing me.

I think it's currently very unclear whether selfish preferences can be meaningfully "satiated". Current humans are much richer than their ancestors, and yet I don't think it's obvious that we are more altruistic than our ancestors, at least when measured by things like the fraction of our income spent on charity. (But this is a complicated debate, and I don't mean to say that it's settled.)

It's also possible that our future civilization uses up much of the cosmic endowment through wasteful competition, leaving little or nothing to consume in the end. Is that your main concern?

This seems unlikely to me, but it's possible. I don't think it's my main concern. My guess is that we still likely fundamentally disagree on something like "how much will the future resemble the past?".

On this particular question, I'd point out that historically, competition hasn't resulted in the destruction of nearly all resources, leaving little to nothing to consume in the end. In fact, insofar as it's reasonable to talk about "competition" as a single thing, competition in the past may have increased total consumption on net, rather than decreased it, by spurring innovation to create more efficient ways of creating economic value.

Steven Byrnes

(Recently I've been using "AI safety" and "AI x-safety" interchangeably when I want to refer to the "overarching" project of making the AI transition go well, but I'm open to being convinced that we should come up with another term for this.)

I’ve been using the term “Safe And Beneficial AGI” (or more casually, “awesome post-AGI utopia”) as the overarching “go well” project, and “AGI safety” as the part where we try to make AGIs that don’t accidentally [i.e. accidentally from the human supervisors’ / programmers’ perspective] kill everyone, and (following common usage according to OP) “Alignment” for “The AGI is trying to do things that the AGI designer had intended for it to be trying to do”.

(I didn’t make up the term “Safe and Beneficial AGI”. I think I got it from Future of Life Institute. Maybe they in turn got it from somewhere else, I dunno.)

(See also: my post Safety ≠ alignment (but they’re close!))

See also a thing I wrote here:

Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.
I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.

Pablo

Great post. A few months ago I wrote a private comment that makes a very similar point but frames it somewhat differently; I share it below in case it is of any interest.

Victoria Krakovna usefully defines the outer and inner alignment problems in terms of different “levels of specification”: the outer alignment problem is the problem of aligning the ideal specification (the goals of the designer) with the design specification (the goal implemented in the system), while the inner alignment problem is the problem of aligning this design specification with the revealed specification (the goal the system actually pursues). I think this model could be extended to define a third subcomponent of the alignment problem, next to the inner and outer alignment problems. This would be the problem of moving from what we may call the normative specification (the goals that ought to be pursued) to the ideal specification (though it would be clearer to call the latter “human specification”).
This “third alignment problem” is rarely formulated explicitly, in part because “AI alignment” is ambiguously defined to mean either “getting AI systems to do what we want them to do” or “getting AI systems to do what they ought to do”. But it seems important to distinguish between normative and human specifications, not only because (arguably) “humanity” may fail to pursue the goals it should, but also because the team of humans that succeeds in building the first AGI may not represent the goals of “humanity”. So this should be relevant both to people (like classical and negative utilitarians) with values that deviate from humanity’s in ways that could matter a lot, and to “commonsense moralists” who think we should promote human values but are concerned that AI designers may not pursue these values (because these people may not be representative members of the population, because of self-interest, or because of other reasons).
It’s unclear to me how important this third alignment problem is relative to the inner or outer alignment problems. But it seems important to be aware that it is a separate problem so that one can think about it explicitly and estimate its relative importance.

Matthew_Barnett

I replied to your comment in a new post here.

Pablo

Thank you for the ping; I’ll take a look shortly.

[anonymous]

Great post! I've written a paper along similar lines for the SERI Conference in April 2023 here, titled "AI Alignment Is Not Enough to Make the Future Go Well." Here is the abstract:

AI alignment is commonly explained as aligning advanced AI systems with human values. Especially when combined with the idea that AI systems aim to optimize their world based on their goals, this has led to the belief that solving the problem of AI alignment will pave the way for an excellent future. However, this common definition of AI alignment is somewhat idealistic and misleading, as the majority of alignment research for cutting-edge systems is focused on aligning AI with task preferences (training AIs to solve user-provided tasks in a helpful manner), as well as reducing the risk that the AI would have the goal of causing catastrophe.
We can conceptualize three different targets of alignment: alignment to task preferences, human values, or idealized values.
Extrapolating from the deployment of advanced systems such as GPT-4 and from studying economic incentives, we can expect AIs aligned with task preferences to be the dominant form of aligned AIs by default.
Aligning AI to task preferences will not by itself solve major problems for the long-term future. Among other problems, these include moral progress, existential security, wild animal suffering, the well-being of digital minds, risks of catastrophic conflict, and optimizing for ideal values. Additional efforts are necessary to motivate society to have the capacity and will to solve these problems.

I don't necessarily think of humans as maximizing economic consumption, but I argue that power-seeking entities (e.g., some corporations or hegemonic governments using AIs) will have predominant influence, and these will not have altruistic goals to optimize for impartial value, by default.

[anonymous]

This post is a great example of why the term “AI alignment” has proven a drag on AI x-risk safety. The concern is and has always been that AI would dominate humanity like humans dominate animals. All of the talk about aligning AI to “human values” leads to pedantic posts like this one arguing about what “human values” are and how likely AIs are to pursue them.

Matthew_Barnett

Is there a particular part of my post that you disagree with? Or do you think the post is misleading? If so, how?

I think there are a lot of ways AI could go wrong, and "AIs dominating humans like how humans dominate animals" does not exhaust the scope of potential issues.

Roman Leventov

How about coordination and multi-scale planning (optimising both for short term and long term) failures? They both have economic value (i.e., economic value is lost when these failures happen), and they are both at least in part due to the selfish, short-term, impulsive motives/desires/"values" of humans.

E.g., I think people would like to buy an AI that manipulates them into following their exercise plan through some tricks, and likewise they would like to "buy" (build) collectively an AI that restricts their selfishness for the median benefit and the benefit of their own children and grandchildren.

Caspar Oesterheld

Nice post! I generally agree and I believe this is important.

I have one question about this. I'll distinguish between two different empirical claims. My sense is that you argue for one of them and I'd be curious whether you'd also agree with the other. Intuitively, it seems like there are lots of different but related alignment problems: "how can we make AI that does what Alice wants it to do?", "how can we make AI that does what the US wants it to do?", "how can we make AI follow some set of moral norms?", "how can we make AI build stuff in factories for us, without it wanting to escape and take over the world?", "how can we make AI that helps us morally reflect (without manipulating us in ways we don't want)?", "how can we make a consequentialist AI that doesn't do any of the crazy things that consequentialism implies in theory?". You (and I and everyone else in this corner of the Internet) would like the future to solve the more EA-relevant alignment questions and implement the solutions, e.g., help society morally reflect, reduce suffering, etc. Now here are two claims about how the future might fail to do this:

1. Even if all alignment-style problems were solved, humans would not implement the solutions to the EA-y alignment questions. E.g., if there were a big alignment library that just contained the answers to all these alignment problems, then individuals would grab "from pauper to quadrillionaire and beyond with ChatGPT-n", not "how to do the most you can do better with ChatGPT-n", and so on. (And additionally one has to hold that people's preferences for the not-so-ethical books/AIs will not just go away in the distant future. And I suppose for any of this to be relevant, you'd also need to believe that you have some sort of long-term influence on which books people get from the library.)
2. Modern-day research under the "alignment" (or "safety") umbrella is mostly aimed at solving the not-so-EA-y alignment questions, and does not put much effort toward the more specifically-EA-relevant questions. In terms of the alignment library analogy, there'll be lots of books in the aisle on how to get your AI to build widgets without taking over the world, and not so many books in the aisle on how to use AI to do moral reflection and the like. (And again one has to hold that this has some kind of long-term effect, despite the fact that all of these problems can probably be solved _eventually_. E.g., you might think that for the future to go in a good direction, we need AI to help with moral reflection immediately once we get to human-level AI, because of some kinds of lock-in.)

My sense is that you argue mostly for 1. Do you also worry about 2? (I worry about both, but I mostly think about 2, because 1 seems much less tractable, especially for me as a technical person.)

Joseph_Chu

I would just like to point out that this consideration of there being two different kinds of AI alignment, one more parochial and one more global, is not entirely new. The Brookings Institution put out a paper about this in 2022.

Leo

This was a nice post. I haven't thought about these selfishness concerns before, but I did think about possible dangers arising from aligned servant AI used as a tool to improve military capabilities in general. A pretty damn risky scenario in my view and one that will hugely benefit whoever gets there first.
