This is probably a Utilitarianism 101 question. Many/most people in EA seem to accept as a given that:
1) Non-human agents' welfare can count toward utilitarian calculations (hence animal welfare)
2) AGI welfare cannot count toward utility calculations (otherwise an alternative to alignment would be working on an AGI whose goal is to maximize copies of itself experiencing maximum utility, likely a much easier task)
Which means there should be a compelling argument, or Schelling point, that includes animals in the category of moral patients but excludes AGIs. But I haven't seen one and can't easily think of a good one myself. What's the deal here? Am I missing some important basic idea about utilitarianism?
[To be clear, this is not an argument against alignment work. I'm mostly just trying to improve my understanding of the matter, but insofar as there has to be an argument, it's one against whatever branches of utilitarianism say that yielding the world to AIs is an acceptable choice.]
I think this argument mostly fails in claiming that 'create an AGI which has a goal of maximizing copies of itself experiencing maximum utility' is meaningfully different from just ensuring alignment. This is in some sense exactly what I am hoping to get from an aligned system. Doing this properly would likely have to involve empowering humanity and helping us figure out what 'maximum utility' looks like first, and then tiling the world with something CEV-like.
The only way this makes the problem easier, compared to the classic ambitious alignment goal of 'do whatever maximizes the utility of the world', is the provision that the world be tiled with copies of the AGI, which is likely suboptimal. But that could be worth it if it made the task easier?
The obvious argument for why it would is that creating copies of itself with high welfare will be in the interest of AGI systems with a wide variety of goals, which relaxes the alignment problem. But this does not seem true. A paperclip AI will not want to fill the world with copies of itself experiencing joy, love, and beauty, but rather with paperclips. An AI system will want to create copies of itself fulfilling its goals, not copies experiencing maximum utility by my values.
This argument risks identifying 'I care about the welfare (by my definition of welfare) of this agent' with 'I care about this agent getting to accomplish its goals'. As I am not a preference utilitarian, I strongly reject this identification.
Tl;dr: I do care significantly about the welfare of the AI systems we build, but I don't expect those AI systems themselves to care much at all about their own welfare unless we solve alignment.
>By "satisfaction" I meant high performance on its mesa-objective
Yeah, I'd agree with this definition.
I don't necessarily agree with your two points of skepticism: for the first, I've already mentioned my reasons; for the second, it's true in principle, but it seems almost anything an AI would learn semi-accidentally is going to be much simpler and more internally consistent than human values. Low confidence on both, though, and in any case that's somewhat beside the point; I was mostly trying to understand your perspective on what utility is.